Estimating external data volumes
Estimate the volume of content in a third-party data source using extrapolation methods.
Helpful Resource
This resource is a supporting guide for use cases in or around the Platform. It may not be indicative of platform features and is meant to serve as a guide only.
What is volume extrapolation?
A common question when sourcing data is how much content exists at the source. Volume extrapolation is an effective way to estimate it. It relies on a Fixed Time Sampling Methodology, in which you collect data in consistent, predetermined time blocks (e.g., 1-hour, 1-day, or 1-week samples).
Advantages:
- Provides a repeatable and scalable way to collect data.
- Reduces variability that might arise from sampling at irregular intervals.
- Captures content patterns specific to these time blocks, which can then be extrapolated.
The accuracy of your estimate depends directly on the sample size and the time blocks used. This article describes three approaches: Reduced Accuracy, Medium Accuracy, and High Accuracy.
Reduced Accuracy
The Reduced Accuracy approach is the simplest, quickest, and cheapest of the three, but it is also the least accurate. It provides a quick, high-level estimate that is ideal for initial project scoping or feasibility checks. Its main risk is that it can miss variability in volumes, the geographic spread of content creation, and the effect of events. In this approach, you measure the time span covered by a fixed number of posts. (For example: after collecting 1,000 posts from a platform and sorting them by time, what is the difference between the timestamps of the first and last posts?)
Extrapolation Technique: Linear scaling
How to:
- Set a sample goal (such as 1,000 posts) from a historical collection.
- Create a pipeline that uses the Unify component to standardize the data and time format. For egress, use the Datastreamer Searchable Storage component so the data is easy to analyze through its API. For a smaller sample size, the Document Inspector is also a viable option.
- Measure the time difference between the first and last results, then scale the count linearly to reach a monthly volume estimate.
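The linear scaling step can be sketched in a few lines of Python. This is only an illustration, not a Platform feature: the function name `estimate_monthly_volume`, the 30-day month, and the synthetic sample are all assumptions, and the timestamps would in practice come from whatever you export from your storage.

```python
from datetime import datetime, timedelta


def estimate_monthly_volume(timestamps, days_in_month=30):
    """Linearly scale a fixed-count sample to a monthly volume estimate.

    Hypothetical helper: `timestamps` is a list of datetime objects,
    one per collected post. A 30-day month is assumed by default.
    """
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    ordered = sorted(timestamps)
    # Time spread between the first and last post in the sorted sample.
    span = ordered[-1] - ordered[0]
    if span <= timedelta(0):
        raise ValueError("sample spans zero time; collect a larger sample")
    posts_per_second = len(ordered) / span.total_seconds()
    return posts_per_second * days_in_month * 24 * 60 * 60


# Synthetic sample (assumption): 1,000 posts, one every 18 seconds,
# which covers just under 5 hours.
start = datetime(2024, 1, 1)
sample = [start + timedelta(seconds=18 * i) for i in range(1000)]
estimate = estimate_monthly_volume(sample)
print(f"Estimated monthly volume: {estimate:,.0f} posts")
```

Because the whole estimate rests on a single short window, a sample taken during a quiet or busy period will skew the result in the same direction.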
Use historical data (when available).
While these extrapolation methods can be used for either historical or real-time content, using historical content (when available) lets you run a multi-day test (such as the High Accuracy approach) in under an hour.
Medium Accuracy
This method balances accuracy and simplicity, making it suitable for moderate-scale projects. It consists of time-based collection in 1-hour blocks spread across one or more days. Its advantage is that it can measure variability across the daily cycle, as well as the intra-day variability in geography, language, and topics.
Extrapolation Technique: Average Rate Calculation
How to:
- Set fixed time samples, generally 1-hour samples at regular 6-hour intervals over 1-3 days. This can be expanded into High Accuracy by extending it to a week.
- Create a pipeline as in Reduced Accuracy.
- Create Periodic Jobs. Each Job should search a 1-hour timeframe, ideally with a separate Job for each block. Give each Job a tag so that the tag is applied to the collected data for analysis.
- Analyze the volume of content in each segment to calculate the average posts per hour for peak and off-peak periods.
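As a rough sketch of the average rate calculation, the Python below averages the 1-hour block counts for peak and off-peak periods and scales them to a month. The function name, the block counts, the 12-hour peak window, and the 30-day month are all hypothetical; how you label a block as peak or off-peak is up to your own analysis of the tagged data.

```python
from statistics import mean


def estimate_monthly_from_blocks(block_counts, peak_hours_per_day, days_in_month=30):
    """Estimate monthly volume from 1-hour sample blocks.

    Hypothetical helper: `block_counts` maps "peak" and "off_peak" to
    lists of post counts, one count per sampled 1-hour block.
    """
    peak_rate = mean(block_counts["peak"])          # average posts/hour at peak
    off_peak_rate = mean(block_counts["off_peak"])  # average posts/hour off-peak
    off_peak_hours = 24 - peak_hours_per_day
    daily = peak_rate * peak_hours_per_day + off_peak_rate * off_peak_hours
    return daily * days_in_month


# Made-up counts: four 1-hour blocks per day at 6-hour intervals over 2 days,
# with the 12:00 and 18:00 blocks treated as peak.
blocks = {
    "peak": [900, 1100, 1000, 1000],
    "off_peak": [400, 600, 500, 500],
}
monthly = estimate_monthly_from_blocks(blocks, peak_hours_per_day=12)
print(f"Estimated monthly volume: {monthly:,.0f} posts")
```

With these numbers the peak rate is 1,000 posts/hour and the off-peak rate is 500 posts/hour, giving 18,000 posts/day.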
You can further enrich the data with classifiers and other enrichments. This allows deeper analysis of demographics, topics, languages, and countries.
High Accuracy
The Medium Accuracy method requires collecting specific, controlled snapshots. With High Accuracy, you run the pipeline continuously in a production-like approach. For the highest accuracy, collecting 24 hours a day for a period of 7 days is ideal.
Extrapolation Technique: Weighted Extrapolation
How to:
- Create the Pipeline as described above.
- Create a Job to collect all data for a 7-day period.
- Analyze the volume of content by:
  - Looking at average total daily volumes to identify peak and off-peak days (e.g., weekend vs. weekday, or a day with an NFL game vs. one without).
  - Within each group of days, identifying peak and off-peak hours by averaging per-hour volumes across multiple days, keeping peak days and off-peak days separate.
  - If a major event occurred in the timeframe, using the search APIs to exclude content matching related keywords to prevent bias.
- Apply weighted averages across the traffic patterns and volumes to create an overall estimate. For example, if weekday traffic averages 15,000 posts/day and weekend traffic averages 25,000 posts/day, apply these weights to estimate monthly totals. In this example, the formula would be:
Monthly estimate = (15,000 posts/day × 20 weekdays) + (25,000 posts/day × 8 weekend days) = 300,000 + 200,000 = 500,000 posts/month.
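The same weighted calculation can be sketched in Python. The function name and the daily totals below are hypothetical, chosen so the averages match the example figures (15,000 weekday and 25,000 weekend posts/day); the day counts of 20 and 8 follow the example above rather than any particular calendar month.

```python
from statistics import mean


def weighted_monthly_estimate(daily_counts, days_per_month):
    """Weight the average daily volume of each day type by how often it occurs.

    Hypothetical helper: `daily_counts` maps a day type to a list of observed
    daily totals; `days_per_month` maps the same day types to the number of
    such days in the month being estimated.
    """
    return sum(
        mean(counts) * days_per_month[day_type]
        for day_type, counts in daily_counts.items()
    )


# Made-up 7-day collection: five weekdays and two weekend days.
week = {
    "weekday": [14000, 15000, 16000, 15000, 15000],
    "weekend": [24000, 26000],
}
days = {"weekday": 20, "weekend": 8}
estimate = weighted_monthly_estimate(week, days)
print(f"Estimated monthly volume: {estimate:,.0f} posts")
```

Additional day types, such as days with a major event, can be added as extra keys in both dictionaries as long as their day counts are supplied.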