Data Volume Units
A unifying model for multiple different data consumption metrics.
What is a Data Volume Unit (DVU)
The biggest challenge in comparing and estimating consumption costs from unstructured and semi-structured sources lies in the varied methods of estimation. Many of Datastreamer's customers use multiple components in their pipelines, and those components traditionally have widely differing pricing models.
Datastreamer works with its integrated partners to align each partner's pricing methodology with a common Data Volume Unit (DVU) methodology, so that customers can assess, estimate, and use a unified pricing and measurement approach.
Rule of Thumb: 1 DVU of any supported metric is estimated to align to the size, complexity, and effort to process 100 Twitter posts.
Data Volume Units in Billing
For integrated billing, Data Volume Units (DVUs) are counted using the metric of each component, then converted directly and rounded up to the next whole DVU. Estimation, billing, and pricing tables are all built on DVUs.
For example: if 200 kilobytes equal 1 DVU for a component and 850 kilobytes were processed in a pipeline (850 / 200 = 4.25), 5 DVUs would appear in billing.
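The arithmetic in that example can be sketched in a few lines of Python. This is only an illustration: the function name `billed_dvus` and its parameters are hypothetical, and the round-up behaviour is an assumption inferred from the 850 kilobyte example rather than a documented billing rule.

```python
import math


def billed_dvus(measured: float, units_per_dvu: float) -> int:
    """Convert a measured amount of a component's metric into whole DVUs.

    Assumption: any partial DVU is rounded up to the next whole DVU,
    as the 850 KB / 200 KB example suggests.
    """
    if units_per_dvu <= 0:
        raise ValueError("units_per_dvu must be positive")
    if measured < 0:
        raise ValueError("measured must not be negative")
    return math.ceil(measured / units_per_dvu)


# 850 KB processed by a component where 200 KB = 1 DVU
print(billed_dvus(850, 200))  # 5
```

The same function applies to any supported metric, as long as the measured amount and the per-DVU amount are expressed in the same unit.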
Supported Metrics
Datastreamer supports the following metrics in conversion to DVUs.
Metric | Description | Supported |
---|---|---|
Tokens (General) | Commonly used in AI products, Tokens are a measurement of elements in text data. | Yes |
Input Tokens | Some components and AI products separately measure Tokens used in the input of data. | Yes |
Output Tokens | Some components and AI products separately measure Tokens used in the output of data. | Yes |
Documents per Second | Count of documents being processed per second. Present in some firehose data sources. | Yes |
Bytes | Measurement of the size of a document. | Yes |
Compute Time (milliseconds, hours) | Measurements of the computational resources used to process the information. Common with analysis and some NLP products. | Yes |
Document Count (results, post count) | Count of documents. Most common in data sources that are not performing ad-hoc data collection. | Yes |
Requests | Count of requests. Most common in data sources that are performing ad-hoc data collection. | Yes |
Mentions, Credits | Some providers use a separate credit or mention system. This is custom per provider and source. | Yes |
CSV Rows | Rows of a CSV document. | Yes |
Field Count | Count of the number of fields in returned data. Often used in conjunction with other metrics. | Yes |
Characters | Count of the characters used. | Yes |
Words | Count of the words used. | Yes |
PDF Pages | Count of the pages of a PDF document, as extracted from the metadata of the document. In the case of alternate page sizes, a US Letter sizing is treated as default. | Yes |
Common Conversions
This is an illustrative table of basic conversions. Many sources will have more specific details.
Rule of Thumb: 1 DVU of any supported metric is estimated to align to the size, complexity, and effort to process 100 Twitter posts.
Metric | Measurement of Metric | DVU Count | Details |
---|---|---|---|
Characters | 28,000 | 1 | Based on English language. |
Words | 6,000 | 1 | |
CSV Rows | 100 | 1 | |
Bytes | 100KB | 1 | |
Tokens | 6,000 | 1 | |
Documents (Social) | 100 | 1 | Based on short-form social content. |
PDF Page | 1 | 1 | |
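As a rough sketch, the common conversions above can be held in a lookup table and used for quick estimates. The dictionary keys and the `estimate_dvus` function are hypothetical names chosen for illustration, the figures are copied from the illustrative table, and rounding up to a whole DVU is an assumption. Real sources may use more specific rates.

```python
import math

# Amount of each metric that corresponds to 1 DVU (illustrative table values).
# Keys are hypothetical labels, not official metric identifiers.
COMMON_UNITS_PER_DVU = {
    "characters": 28_000,      # based on English language
    "words": 6_000,
    "csv_rows": 100,
    "kilobytes": 100,
    "tokens": 6_000,
    "social_documents": 100,   # short-form social content
    "pdf_pages": 1,
}


def estimate_dvus(metric: str, amount: float) -> int:
    """Estimate whole DVUs for an amount of a metric (assumes rounding up)."""
    if metric not in COMMON_UNITS_PER_DVU:
        raise KeyError(f"no common conversion for metric: {metric}")
    return math.ceil(amount / COMMON_UNITS_PER_DVU[metric])


print(estimate_dvus("words", 15_000))   # 3
print(estimate_dvus("pdf_pages", 12))   # 12
```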
Component Specific Conversions
Datastreamer supports many third-party components and integrations. This table provides an approximate conversion rate from each third party's pricing metric to 1 Data Volume Unit (DVU).
Please note that this table is approximate.
Component | Amount of 3rd party metric = 1 DVU | 3rd Party Pricing Metric |
---|---|---|
WebSightLine Instagram | 100 | Documents |
WebSightLine Threads | 100 | Documents |
WebSightLine File Fetcher | 200 | Kilobytes |
Data365 Data Sources | 120 | Mentions |
PrivateAI PII Redaction | 28,000 | Characters |
AI Classifiers (OpenAI) | 7,000 Input tokens, 2,000 Output Tokens | Tokens (Various) |
AI Classifiers (Gemini) | 7,000 Input tokens, 2,000 Output Tokens | Tokens (Various) |
CleanDNS Data Sources | 100 | Documents |
Dark Owl Search APIs | 5 | Documents |
Socialgist Data Sources | 100 | Documents |
Average Custom Data Ingestion | 200 | Kilobytes |
Location Inference Classifier | 100 | Documents |
PDF Data Integration | 1 | Page |
Datastreamer NLP Classifiers (Bundle) | 100 | Documents |
PDF Table Conversion | 1 | Table |
Bright Data Specialty Source (Bundle) | 20 | Records |
Bright Data High Result Sources (Bundle) | 150 | Records |
Vetric High Request Sources (Bundle) | 5 | Requests |
Vetric Low Request Sources (Bundle) | 10 | Requests |
Vetric Detailed Request Sources (Bundle) | 1 | Requests |
Twingly VK | 165 | Documents |
Twingly Blogs | 70 | Documents |
Opoint News | 60 | Documents |
Vital4 Adverse Media | 50 | Documents |
Vital4 Politically Exposed Persons | 670 | Documents |
Vital4 Watchlist | 35 | Documents |
Vital4 Crime | 530 | Documents |
Cohere Sentiment | x | Tokens |
ChatGPT Prompt Application | x | Tokens |
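To illustrate how the component rates might be combined, the sketch below estimates DVUs for a small pipeline using three single-metric rows from the table. The names `COMPONENT_UNITS_PER_DVU` and `pipeline_dvus` are hypothetical, and two assumptions are made that the table does not confirm: each component is rounded up to a whole DVU separately, and a pipeline's total is the sum of its components' DVUs.

```python
import math

# (amount of third-party metric per 1 DVU, metric name), copied from the
# approximate table above. Only a few single-metric components are shown.
COMPONENT_UNITS_PER_DVU = {
    "WebSightLine Instagram": (100, "documents"),
    "PrivateAI PII Redaction": (28_000, "characters"),
    "Opoint News": (60, "documents"),
}


def pipeline_dvus(usage: dict[str, float]) -> dict[str, int]:
    """Estimate DVUs per component (assumes per-component rounding up)."""
    result = {}
    for component, amount in usage.items():
        units_per_dvu, _metric = COMPONENT_UNITS_PER_DVU[component]
        result[component] = math.ceil(amount / units_per_dvu)
    return result


# Hypothetical usage: 1,000 Instagram documents, 500,000 characters redacted.
usage = {"WebSightLine Instagram": 1_000, "PrivateAI PII Redaction": 500_000}
per_component = pipeline_dvus(usage)
print(per_component)                # {'WebSightLine Instagram': 10, 'PrivateAI PII Redaction': 18}
print(sum(per_component.values()))  # 28
```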