Data Streams Overview
This section covers everything involved in building and running a Data Stream: sources, transformation, enrichment, and destinations.
If you are new to the platform, start with What is a Data Stream? for an introduction to the core concepts.
In This Section
- Sources: Configure data inputs for your Data Stream
- Transformation and Enrichment: Shape data and apply AI/NLP operations
- Destinations: Deliver processed data to warehouses, storage, or endpoints
Components of a Data Stream
Sources
Sources define where data comes from. Datastreamer connects automatically to major social media platforms, news sources, and web content. You configure what data you want (keywords, accounts, date ranges, and so on), and the platform handles provider selection and retrieval.
Supported sources include Facebook, Instagram, Twitter/X, TikTok, Reddit, YouTube, Threads, and Bluesky, among others.
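To make the idea of "configuring what data you want" concrete, here is a minimal sketch of a source query expressed as a Python dataclass. The class name `SourceQuery` and all of its fields are hypothetical, chosen for illustration; they are not Datastreamer's actual configuration schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class SourceQuery:
    """Hypothetical description of what to collect (not the platform's real schema)."""
    source: str                                   # e.g. "reddit" or "bluesky"
    keywords: list[str] = field(default_factory=list)
    accounts: list[str] = field(default_factory=list)
    start: Optional[date] = None                  # inclusive start of the date range
    end: Optional[date] = None                    # inclusive end of the date range

    def validate(self) -> None:
        # A query needs at least one selector to be meaningful.
        if not (self.keywords or self.accounts):
            raise ValueError("provide at least one keyword or account")
        # Reject inverted date ranges early.
        if self.start and self.end and self.start > self.end:
            raise ValueError("start date must not be after end date")


query = SourceQuery(
    source="reddit",
    keywords=["electric vehicles"],
    start=date(2024, 1, 1),
    end=date(2024, 1, 31),
)
query.validate()  # raises ValueError if the query is malformed
```

Note that the sketch describes only what to collect. Which provider serves the request is left out, mirroring the division of responsibility described above.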
Transformation
Raw data from sources arrives in varied formats. Transformation converts it into a consistent structure using Datastreamer's unified schema, making it compatible with downstream enrichments and destinations.
The Unify Transformer handles this automatically for supported sources. Custom transformation is also available via the JSON Schema Transformer.
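The following sketch illustrates the general idea of unifying differently shaped payloads into one structure. It is not the Unify Transformer itself: the source names (`forum`, `video`), the raw field names, and the target fields are all made up for the example.

```python
# Hypothetical per-source field maps: unified field -> field name in the raw payload.
FIELD_MAPS = {
    "forum": {"author": "username", "text": "body", "published_at": "created"},
    "video": {"author": "channel", "text": "description", "published_at": "uploaded"},
}


def unify(raw: dict, source: str) -> dict:
    """Map a source-specific payload onto one shared shape."""
    mapping = FIELD_MAPS[source]  # raises KeyError for a source with no mapping
    doc = {target: raw.get(original) for target, original in mapping.items()}
    doc["source"] = source        # record where the document came from
    return doc


forum_doc = unify({"username": "ann", "body": "hello", "created": "2024-01-01"}, "forum")
video_doc = unify({"channel": "bob", "description": "intro", "uploaded": "2024-01-02"}, "video")

# Both documents now expose the same keys, whatever the original layout was.
same_shape = forum_doc.keys() == video_doc.keys()
```

Because every document leaves this step with the same keys, later stages can be written once against the shared shape rather than once per source.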
Enrichment
Enrichments are optional AI and NLP operations applied to data as it flows through the pipeline. They add structured metadata to each document without replacing the source content. Enrichments are billed at per-component DVU rates, in addition to your base Data Stream usage.
Available enrichments include sentiment analysis, entity recognition, categorization, language detection, location inference, and more.
For details, see Operations and Enrichments Overview.
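As an illustration of "adding metadata without replacing the source content", the sketch below attaches a sentiment label under a separate key and leaves the original text untouched. The `enrichments` key and the word-list scorer are assumptions made for the example; a real enrichment would use a trained model and the platform's own output fields.

```python
import copy

# Deliberately tiny word lists; a stand-in for a real sentiment model.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "poor", "hate"}


def toy_sentiment(text: str) -> str:
    """Naive scorer: count positive words minus negative words."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def enrich(doc: dict) -> dict:
    """Return a copy of doc with an added 'enrichments' block; 'text' is unchanged."""
    enriched = copy.deepcopy(doc)
    enriched.setdefault("enrichments", {})["sentiment"] = toy_sentiment(doc["text"])
    return enriched


doc = {"text": "I love this great phone"}
result = enrich(doc)
```

Here `result` carries the new label alongside the original `text`, and `doc` itself is not modified.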
Pipeline Logic
The pipeline defines the path data takes from sources to destinations. It supports routing, filtering, deduplication, batching, and branching. Complex pipelines are built from the same components as a simple linear one.
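The operations named above compose naturally as small, chainable steps. The sketch below shows filtering, deduplication, batching, and branching with plain Python generators. The function names and document fields (`id`, `lang`) are hypothetical and say nothing about how the platform implements these components.

```python
from itertools import islice


def keep(docs, predicate):
    """Filtering: pass through only documents matching the predicate."""
    return (d for d in docs if predicate(d))


def deduplicate(docs, key="id"):
    """Deduplication: drop documents whose key has already been seen."""
    seen = set()
    for d in docs:
        if d[key] not in seen:
            seen.add(d[key])
            yield d


def batch(docs, size):
    """Batching: group documents into lists of at most `size` items."""
    it = iter(docs)
    while chunk := list(islice(it, size)):
        yield chunk


def branch(docs, selector):
    """Branching: split documents into named groups chosen by `selector`."""
    branches = {}
    for d in docs:
        branches.setdefault(selector(d), []).append(d)
    return branches


docs = [
    {"id": 1, "lang": "en"},
    {"id": 1, "lang": "en"},  # duplicate of the first document
    {"id": 2, "lang": "fr"},
    {"id": 3, "lang": "en"},
    {"id": 4, "lang": "en"},
]

# A linear pipeline: filter, then deduplicate, then batch.
batches = list(batch(deduplicate(keep(docs, lambda d: d["lang"] == "en")), 2))

# A branching step built from the same kind of component.
by_language = branch(deduplicate(docs), lambda d: d["lang"])
```

The linear chain and the branching step reuse the same building blocks, which is the point the paragraph above makes about complex logic.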
Destinations
Destinations are where processed data is delivered. Supported destinations include cloud data warehouses (BigQuery, Snowflake, Databricks), cloud storage (S3, Azure Blob, GCS), streaming endpoints (Pub/Sub, Firehose, Webhook), and Datastreamer's own Searchable Storage.
Jobs
Jobs are how a Data Stream collects data from sources. Each Job runs a query, retrieves content, and passes it into the pipeline. Jobs handle scheduling, retries, volume limits, and source failover automatically.
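To show how retries, volume limits, and failover can fit together, here is a minimal sketch of a job runner. Everything in it is an assumption for illustration: `run_job`, the provider callables, and the use of `ConnectionError` to signal failure are not part of any Datastreamer API, and scheduling is left out entirely.

```python
def run_job(query, providers, max_docs, max_retries=2):
    """Try each provider in order, retrying failures, and cap the result size."""
    for fetch in providers:                   # failover: move on to the next provider
        for _attempt in range(max_retries + 1):
            try:
                docs = fetch(query)
            except ConnectionError:
                continue                      # retry the same provider
            return docs[:max_docs]            # volume limit: keep at most max_docs
    raise RuntimeError("all providers failed")


def flaky_provider(query):
    """Hypothetical provider that is always unavailable."""
    raise ConnectionError("provider unavailable")


def backup_provider(query):
    """Hypothetical provider that returns ten placeholder results."""
    return [f"{query} result {i}" for i in range(10)]


results = run_job("solar panels", [flaky_provider, backup_provider], max_docs=3)
```

In this run the first provider fails on every attempt, so the job falls over to the backup provider and keeps only the first three of its ten results.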
Pricing
Data Streams are priced in DVUs (Data Volume Units). DVU usage accumulates across all Jobs and components in a stream.
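The sketch below shows one way usage could accumulate across Jobs and components. It assumes, purely for illustration, that DVUs are charged per document and that component rates add to a base rate. The rates and document counts are invented numbers, not actual pricing.

```python
def stream_dvus(job_doc_counts, base_rate, component_rates):
    """Total DVUs for a stream under the per-document assumption stated above."""
    total_docs = sum(job_doc_counts)                       # accumulate across all Jobs
    per_doc = base_rate + sum(component_rates.values())    # base usage plus each component
    return total_docs * per_doc


# Invented example: two Jobs and two enrichment components.
usage = stream_dvus(
    job_doc_counts=[600, 400],
    base_rate=1.0,
    component_rates={"sentiment": 0.5, "entities": 0.25},
)
# 1000 documents * (1.0 + 0.5 + 0.25) DVUs each = 1750.0 DVUs
```

Consult your plan's published rates for real figures; the structure of the calculation, not the numbers, is what the example is meant to convey.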
Need More Control?
If you need to connect to a specific data provider directly, use your own API credentials, or access provider-specific features, see Direct Integrations.
