Datastreamer Architecture Overview
This page gives a high-level overview of Datastreamer's architecture.
Datastreamer is a fully managed real-time data ingestion and delivery platform designed to create and manage data pipelines. These pipelines stream, filter, normalize, and deliver data from diverse sources into cloud storage, databases, or APIs.
It focuses on low-latency delivery, schema consistency, dynamic scalability, and minimal operational burden.
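For orientation, the sketch below shows what a pipeline definition could look like. The field names, connector, and destination values are assumptions made for illustration and do not reflect the actual Datastreamer configuration format.

```python
# A minimal, purely illustrative pipeline definition: a source, a chain of
# operations, and a destination. The structure and field names here are
# assumptions for the sake of the sketch, not Datastreamer's real config format.
pipeline = {
    "name": "news-to-s3",
    "source": {"connector": "news_api", "query": "electric vehicles"},
    "operations": [
        {"type": "filter", "field": "language", "equals": "en"},
        {"type": "deduplicate", "key": "url"},
    ],
    "destination": {"type": "aws_s3", "bucket": "my-data-bucket"},
}
```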
Key Elements of Datastreamer's Architecture
Element | Purpose |
---|---|
Dynamic Pipelines | The core element of the Datastreamer platform. Pipelines are user-defined workflows that dictate the flow of data from source to destination. Once defined, they are self-managing, auto-scaling, and highly efficient. |
Pipeline Orchestrator | The Orchestrator dynamically manages the resource allocation, service scaling, workload management, and error recovery across every pipeline and each in-pipeline layer to maintain real-time flow under varying loads. |
Portal | The UI control panel for configuring pipelines, managing sources, destinations, filtering rules, and monitoring system health. |
API Library | A set of APIs for managing data ingestion, flow, and delivery within deployed pipelines (see the sketch after this table). |
Streaming Infrastructure | The real-time backbone that routes events between pipeline components, from source to target, with low latency, high throughput, and reliability guarantees. |
Security Layer | Enforced through authentication/authorization at every entry/exit point, full data encryption (TLS, at-rest), and compliance controls built into every layer. |
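As a rough illustration of the API Library, the snippet below pushes a document into a deployed pipeline and checks its status. The base URL, endpoints, and response fields are hypothetical placeholders, not the documented Datastreamer API.

```python
import requests

BASE = "https://api.example.com/v1"          # placeholder base URL, not the real endpoint
HEADERS = {"Authorization": "Bearer <API_KEY>"}

# Push a document into a deployed pipeline's ingress (hypothetical endpoint).
doc = {"id": "doc-123", "content": "EV sales rose sharply last quarter.", "lang": "en"}
resp = requests.post(f"{BASE}/pipelines/news-to-s3/ingest", json=doc, headers=HEADERS)
resp.raise_for_status()

# Check pipeline health and throughput (hypothetical endpoint and fields).
status = requests.get(f"{BASE}/pipelines/news-to-s3/status", headers=HEADERS).json()
print(status.get("state"), status.get("events_per_second"))
```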
Layers of Pipeline's Architecture
Datastreamer’s Pipelines are modular and horizontally scalable, with clear separation across major system layers:
Layer | Purpose |
---|---|
Source Connectors (Ingress) | Handles connections to external data providers. Each source connector runs in an independent service environment, ensuring isolation and individual failure recovery. |
Job Management Layer | Manages the lifecycle of source-side jobs, including event detection, scheduling extraction runs, and state tracking for incremental ingestion. |
Transformation Layer | Raw source data is streamed directly into the Transformation Layer, which structures unstructured and semi-structured formats into structured JSON. Data is normalized into a user-defined schema or the Datastreamer Default Schema, and changes in upstream fields are versioned and managed to ensure downstream schema stability. |
Operations Layer | Post-transformation operations applied to data in the pipeline, such as filtering, deduplication, enrichment (e.g., language detection, entity extraction, AI models), and transformation to destination schemas. Designed to be horizontally scalable and modular across pipelines (a combined transformation and operations sketch follows this table). |
Delivery and Destination Layer (Egress) | Final processed data is streamed to user-defined destinations (e.g., AWS S3, BigQuery, Azure Blob Storage, webhooks) with optimization for each target system's delivery characteristics. |
Pipeline Orchestrator | The Pipeline Orchestrator dynamically manages resource allocation, service scaling, workload management, and error recovery across pipeline layers to maintain real-time flow under varying loads. |
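The following sketch illustrates the Transformation and Operations layers in miniature: a raw source record is normalized onto a flat schema, then filtered and deduplicated. The schema fields and rules are illustrative assumptions, not the Datastreamer Default Schema.

```python
from typing import Optional

SEEN_IDS = set()   # stands in for a shared dedup store, not an in-memory set in practice

def normalize(raw: dict) -> dict:
    """Map a raw, semi-structured source record onto a flat target schema.
    The field names below stand in for a user-defined or default schema."""
    return {
        "id": raw.get("guid") or raw.get("id"),
        "title": (raw.get("headline") or "").strip(),
        "body": raw.get("text", ""),
        "language": raw.get("lang", "unknown"),
        "published_at": raw.get("pub_date"),
    }

def apply_operations(doc: dict) -> Optional[dict]:
    """Post-transformation operations: filtering and deduplication.
    Enrichment (language detection, entity extraction, AI models) would slot in here too."""
    if doc["language"] != "en":          # filtering rule
        return None
    if doc["id"] in SEEN_IDS:            # deduplication
        return None
    SEEN_IDS.add(doc["id"])
    return doc

raw_event = {"guid": "abc-1", "headline": " EV sales up ", "text": "...", "lang": "en"}
print(apply_operations(normalize(raw_event)))
```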
Flow through Key Layers
The general flow through the key layers is visualized below. The Pipeline Orchestrator is not pictured, for simplicity, as it interacts with each layer.
```mermaid
flowchart TD
    A[Data Source] --- B[Ingress Connector]
    B --> C[Transformation Layer]
    C --> D[Operations Layer]
    D --> E[Delivery Egress]
    G[Job Management]
    G -.-> B
    B -.-> G
    B --> A
```
Pipeline Scalability
Aspect | Strategy |
---|---|
Horizontal Scaling | All critical services (connectors, processors, delivery engines) are stateless and scale horizontally, meaning instances can be added/removed dynamically based on load by the Orchestrator. |
Elastic Queues | Ingestion queues automatically expand under load, buffering surges to protect downstream systems without data loss. |
Auto-Sharding | Large pipeline consumption or high-volume sources are automatically divided into multiple shards/partitions to balance workload and avoid bottlenecks. |
Partitioned Workloads | Large volume pipelines automatically shard streams across processors and delivery workers for parallelism. |
Backpressure and Flow Control | Controlled at every layer to prevent overloads. Connectors apply flow control to upstream services if downstream systems slow down. |
Failure Recovery | Retries with exponential backoff (see the sketch after this table), dead-letter queues for irrecoverable errors, and checkpointing for safe resumption. |
Multi-region Support | (Where configured) services can be deployed across multiple regions/zones for higher availability and lower latency. |
Real-time Metrics | Continuous collection of event counts, processing lag, delivery statuses, and error rates for proactive monitoring. |
External Alerting | Alerting connectors are available, with configurable external alerts on pipeline performance or exceeded volume thresholds. |
Delivery Optimization | Dynamic delivery batch sizing and adaptive retry policies help maintain throughput without overwhelming destinations. |
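As a sketch of the retry behavior described under Failure Recovery, the helper below retries a delivery call with exponential backoff and jitter before handing the event off to a dead-letter queue. The function and its parameters are illustrative, not part of the platform's API.

```python
import random
import time

def deliver_with_retry(send, event, max_attempts=5, base_delay=0.5):
    """Retry a delivery call with exponential backoff and jitter.
    After max_attempts, the caller would route the event to a dead-letter queue."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send(event)
        except Exception:
            if attempt == max_attempts:
                raise                                   # caller routes to dead-letter queue
            delay = base_delay * (2 ** (attempt - 1))   # 0.5s, 1s, 2s, 4s, ...
            time.sleep(delay + random.uniform(0, 0.1))  # jitter avoids retry storms
```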
Pipeline Resiliency and Fault Tolerance
Aspect | Strategy |
---|---|
Scaling Tolerance | Queues are persisted and replicated across multiple nodes during scaling or upgrading events. |
Processing Fault Tolerance | Idempotent design ensures that retries do not cause duplicates or corruption (see the sketch after this table). The streaming engine also guarantees at-least-once or exactly-once delivery semantics, depending on the pipeline configuration. |
Destination Resilience | Delivery retries with exponential backoff; fallback to dead-letter if persistent failures occur. |
Components | Individual services within pipeline layers (e.g., Connectors, Processors, Delivery Agents) operate independently, so each can restart or recover without impacting queued data or other active services. |
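The sketch below shows the idempotency idea behind Processing Fault Tolerance: each event carries a key, and re-deliveries under at-least-once semantics are detected and skipped, so retries never create duplicates downstream. The handler and the in-memory store are illustrative assumptions.

```python
processed_keys = set()   # stands in for a durable deduplication store

def write_to_destination(event: dict) -> None:
    """Hypothetical destination write (e.g., object store, database, webhook)."""
    print("delivered", event["id"])

def handle(event: dict) -> None:
    """Idempotent handler: a re-delivered event is recognized by its key and skipped."""
    key = event["id"]                # idempotency key carried with the event
    if key in processed_keys:
        return                       # already applied; the retry is a safe no-op
    write_to_destination(event)
    processed_keys.add(key)

handle({"id": "evt-1", "payload": "..."})
handle({"id": "evt-1", "payload": "..."})   # duplicate delivery does nothing
```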
In Summary
Datastreamer is a serverless, self-healing pipeline platform: users define data workflows, and the platform continuously and automatically maintains them with minimal manual involvement.
- Scalable: Designed for growth — Orchestrator handles the scaling of ingestion workers, processors, or delivery agents.
- Resilient: Survives individual service failures automatically.
- Secure: Encryption, auth, and access control by default.
- Efficient: Supports very high event throughput with minimal latency.