Best Practices for Data Collection Jobs
This guide outlines best practices to help you get the most from your Jobs, manage costs, and maintain multi-source data.
Best Practices for Creating Collection Jobs
Datastreamer Jobs are the foundation of your external data pipeline. Whether you're collecting social content, monitoring websites, or aggregating entity activity across platforms, well-structured Jobs help you collect more useful data while keeping system costs and complexity under control.
Tips to Ensure Maximum Coverage in Your Jobs
To maximize the amount and quality of data returned, consider the following recommendations:
- Keep keyword lists short. Jobs with more than 8 keywords may return less data from certain sources, due to the nature of how those sources operate. Consider splitting broad queries into multiple Jobs for better completeness.
- Use smaller date ranges. When collecting historical data, break long ranges into daily Jobs rather than submitting a single large request. For example, create one Job for each day in a month instead of a single Job spanning the entire month. This helps because some sources default to delivering by relevance rather than by time over large timeframes (see the sketch after this list).
- Use source-aware query formatting. Boolean logic and query behavior can vary by source, and a query that works on one platform may behave differently on another. If you are not familiar with the intricacies, use the Job Builder within Portal or the Job Creation Agent within the MCP server to ensure your query is valid for the selected data source.
- Review new source performance. If it is your first time using a data source, check the Job's status via API or within Portal to see whether any errors have appeared. The Datastreamer team continues to add detailed validation to help you understand any errors.
- Use Periodic Jobs. Periodic Jobs self-manage their frequency and adapt to rate limits, outages, or timeouts, so they can ensure a constant flow of content where applicable. Using your own scheduling is completely acceptable, but it does not have the same level of integration with the platform.
- Use offsets in Periodic Jobs to refresh post metrics. If your destination supports updates (like Searchable Storage), configure Periodic Jobs with a slight offset (e.g., collecting data from 6 hours ago). This allows your Jobs to revisit posts and refresh their metrics, such as likes, shares, or views, without needing to manually re-collect individual posts.
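If you create Jobs through the API, the sketch below illustrates the daily-range approach described above. It is a minimal Python example using the requests library, not the exact Datastreamer schema: `query_from`, `query_to`, and `max_documents` are the parameters referenced in this guide, while the endpoint URL and the `query` field name are placeholders, so check the API reference for the actual payload shape.

```python
from datetime import date, timedelta

import requests

# Hypothetical endpoint and payload shape -- only query_from, query_to, and
# max_documents are parameters named in this guide; the rest are placeholders.
API_URL = "https://api.example-datastreamer.com/jobs"
API_KEY = "YOUR_API_KEY"


def daily_job_payloads(query: str, start: date, end: date) -> list[dict]:
    """Build one Job payload per day instead of a single large request."""
    payloads = []
    day = start
    while day <= end:
        payloads.append({
            "query": query,                                     # placeholder field name
            "query_from": day.isoformat(),                      # documented parameter
            "query_to": (day + timedelta(days=1)).isoformat(),  # documented parameter
            "max_documents": 1000,                              # start low, raise after review
        })
        day += timedelta(days=1)
    return payloads


# One Job per day for January instead of a single Job spanning the whole month.
for payload in daily_job_payloads('"electric vehicles"', date(2024, 1, 1), date(2024, 1, 31)):
    requests.post(API_URL, json=payload, headers={"Authorization": f"Bearer {API_KEY}"})
```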
Best Practices to Manage Data Source Costs
Some connectors and collection types may consume more resources than others. Use the following strategies to manage usage efficiently:
- Set a reasonable `max_documents`. Start with a lower document limit (e.g., 1,000) to avoid over-fetching, and increase the limit if needed after reviewing initial results. Document limits are enforced by the Jobs system, which tells the source to stop once it detects that `max_documents` will be exceeded in that run.
- Be intentional with periodic Jobs. Frequent polling (e.g., every 5–15 minutes) may be necessary for some use cases but can lead to higher usage. Choose scheduling intervals based on how timely the data needs to be.
- Use precise time windows. Set `query_from` and `query_to` values that match your intended collection window. These are automated within Periodic Jobs, so use those when applicable (see the sketch after this list).
- Use labels and tags to track costs. Applying labels to Jobs tags each piece of content with that metadata. Consider using tags to mark content by project, customer, or campaign. Because these labels are applied to every document collected, it is easy to trace and report where data and usage originated.
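To make these cost levers concrete, here is a minimal payload sketch in Python. It assumes an illustrative payload shape: `query_from`, `query_to`, `max_documents`, and labels are discussed in this guide, while the remaining field names and label values are placeholders.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical payload sketch: only query_from, query_to, max_documents, and
# labels are discussed in this guide; other field names are placeholders.
OFFSET = timedelta(hours=6)   # revisit recent posts to refresh metrics (see coverage tips above)
WINDOW = timedelta(hours=1)   # keep the collection window tight to control usage

now = datetime.now(timezone.utc)
job_payload = {
    "query": 'acme OR "acme corp"',
    "query_from": (now - OFFSET - WINDOW).isoformat(),    # documented parameter
    "query_to": (now - OFFSET).isoformat(),               # documented parameter
    "max_documents": 1000,                                # start low; raise after reviewing results
    "labels": ["project:acme", "campaign:q3-launch"],     # applied to every collected document
}
print(job_payload["query_from"], "->", job_payload["query_to"])
```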
Maintaining Data from Many Sources and Jobs
As the number of Jobs grows, using structure and naming conventions can make your data easier to manage.
- Apply tags and labels to all Jobs. Labels are applied to each document collected, helping you filter, search, and group data downstream. Tags can be used to describe Job intent, source, client, or workflow.
- Use consistent naming. Choose naming conventions like `source_topic_frequency` or `client_campaign_date`. This makes Jobs easier to identify and audit (see the sketch after this list).
- Use filters in the Portal. The Portal allows you to search and filter Jobs by label or tag, helping you locate or manage groups of Jobs quickly. The filter is Lucene-based, so you can build complex filters and export the results for deeper analysis.
- Use Unify. Unify converts all data source schemas to a common format, so running Unify after your data sources will drastically reduce integration effort.
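As a sketch of the naming convention above, a small helper can keep Job names and labels consistent across teams. The `job_name` helper and the `name`/`labels` payload fields shown here are illustrative assumptions, not part of the documented API.

```python
# A small helper for the source_topic_frequency convention described above.
# The "name" and "labels" payload fields are assumptions used for illustration.
def job_name(source: str, topic: str, frequency: str) -> str:
    """Build a predictable Job name, e.g. 'reddit_electric-vehicles_daily'."""
    return "_".join(part.lower().replace(" ", "-") for part in (source, topic, frequency))


payload = {
    "name": job_name("Reddit", "Electric Vehicles", "daily"),
    "labels": ["client:acme", "workflow:brand-monitoring"],  # filterable in the Portal
}
print(payload["name"])  # reddit_electric-vehicles_daily
```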
Following these practices will help ensure your Jobs collect the most complete data possible, reduce the risk of failures, and keep your usage aligned with your goals and cost expectations.