Datastreamer Architecture Overview

This page provides a high-level overview of Datastreamer's architecture.

Datastreamer is a fully managed real-time data ingestion and delivery platform designed to create and manage data pipelines. These pipelines stream, filter, normalize, and deliver data from diverse sources into cloud storage, databases, or APIs.

It focuses on low-latency delivery, schema consistency, dynamic scalability, and minimal operational burden.
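
To make the pipeline concept concrete, the sketch below shows what a pipeline definition could look like: a source, a set of operations, and a destination. The component names, operation names, and configuration format here are illustrative assumptions, not Datastreamer's actual configuration syntax.

    # Hypothetical pipeline definition: the component names, operation names,
    # and configuration format below are illustrative assumptions, not
    # Datastreamer's actual configuration syntax.
    pipeline_definition = {
        "name": "news-to-s3",
        "source": {
            "connector": "example_news_api",   # an ingress connector
            "query": "electric vehicles",
        },
        "operations": [                        # applied in the Operations Layer
            {"type": "filter", "field": "language", "equals": "en"},
            {"type": "deduplicate", "key": "document_id"},
            {"type": "enrich", "model": "entity_extraction"},
        ],
        "destination": {                       # egress target
            "type": "aws_s3",
            "bucket": "example-bucket",
            "format": "json",
        },
    }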

Key Elements of Datastreamer's Architecture

Dynamic Pipelines: The key element of the Datastreamer platform. Pipelines are user-defined workflows that dictate the flow of data from source to destination, and they are self-managing, auto-scaling, and highly efficient.

Pipeline Orchestrator: The Orchestrator dynamically manages resource allocation, service scaling, workload management, and error recovery across every pipeline and each in-pipeline layer to maintain real-time flow under varying loads.

Portal: The UI control panel for configuring pipelines, managing sources, destinations, and filtering rules, and monitoring system health.

API Library: A set of APIs for managing data ingestion, flow, and delivery within deployed pipelines (a hypothetical usage sketch follows this table).

Streaming Infrastructure: The real-time backbone that routes events between pipeline components, from source to target, with low latency, high throughput, and reliability guarantees.

Security Layer: Enforced through authentication and authorization at every entry and exit point, full data encryption (TLS in transit and encryption at rest), and compliance controls built into every layer.
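
As a rough illustration of the API Library mentioned above, the sketch below pushes a small batch of documents to a pipeline's ingestion endpoint over HTTPS. The base URL, path, header, and payload shape are placeholders, not the documented Datastreamer API; refer to the API reference for the real endpoints.

    import requests

    # Hypothetical ingestion call: the base URL, path, header, and payload
    # shape are placeholders, not the documented Datastreamer API.
    API_KEY = "YOUR_API_KEY"
    PIPELINE_ID = "pipeline-123"

    documents = [
        {"id": "doc-1", "content": "First example document."},
        {"id": "doc-2", "content": "Second example document."},
    ]

    response = requests.post(
        f"https://api.example.com/pipelines/{PIPELINE_ID}/ingest",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"documents": documents},
        timeout=30,
    )
    response.raise_for_status()  # surface delivery errors to the caller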

Layers of the Pipeline Architecture

Datastreamer’s Pipelines are modular and horizontally scalable, with clear separation across major system layers:

Source Connectors (Ingress)

Handles connections to external data providers. Each source connector runs in an independent service environment, ensuring isolation and individual failure recovery.

Job Management Layer

The Job Management Layer manages the lifecycle of source-side jobs, including event detection, scheduling extraction runs, and state tracking for incremental ingestion.
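
The sketch below illustrates the incremental state tracking described above: each extraction run reads a per-source cursor, fetches only newer records, and checkpoints the latest cursor for the next run. The file-based state store and field names are simplified assumptions, not the platform's internal implementation.

    import json
    from pathlib import Path

    STATE_FILE = Path("job_state.json")  # simplified stand-in for durable job state

    def load_cursor(source_id: str) -> str:
        """Return the last checkpointed cursor for a source, or an empty string."""
        if STATE_FILE.exists():
            return json.loads(STATE_FILE.read_text()).get(source_id, "")
        return ""

    def save_cursor(source_id: str, cursor: str) -> None:
        """Persist the newest cursor so the next run resumes incrementally."""
        state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
        state[source_id] = cursor
        STATE_FILE.write_text(json.dumps(state))

    def run_extraction(source_id: str, fetch_since):
        """One scheduled extraction run: fetch only records newer than the cursor."""
        cursor = load_cursor(source_id)
        records = fetch_since(cursor)  # connector-specific fetch, assumed to return
                                       # records carrying an "updated_at" field
        if records:
            save_cursor(source_id, max(r["updated_at"] for r in records))
        return records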

Transformation Layer

Raw source data is streamed directly into the Transformation Layer, which structures unstructured and semi-structured formats into structured JSON. The data is then normalized into a user-defined schema or the Datastreamer Default Schema, and changes in upstream fields are versioned and managed to ensure downstream schema stability.
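
The sketch below illustrates the kind of normalization this layer performs: mapping raw, source-specific records onto a single target shape. The target field names are assumptions for illustration only, not the actual Datastreamer Default Schema.

    from datetime import datetime, timezone

    def normalize(raw: dict) -> dict:
        """Map a raw, source-specific record onto one common target shape.

        The target field names are illustrative only, not the actual
        Datastreamer Default Schema.
        """
        return {
            "document_id": raw.get("id") or raw.get("guid"),
            "title": (raw.get("title") or "").strip(),
            "body": raw.get("content") or raw.get("text") or "",
            "language": raw.get("lang", "unknown"),
            "published_at": raw.get("published")
                or datetime.now(timezone.utc).isoformat(),
            "source": raw.get("source_name", "unknown"),
        }

    # Two sources with different field names normalize to the same shape.
    print(normalize({"id": "1", "title": "Hello", "content": "Body", "lang": "en"}))
    print(normalize({"guid": "2", "text": "Other body", "source_name": "blog"}))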

Operations Layer

Applies post-transformation operations to data in the pipeline, such as filtering, deduplication, enrichment (e.g., language detection, entity extraction, AI models), and transformation to destination schemas. Operations are designed to be horizontally scalable and modular across pipelines.
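
As a simplified sketch of how such operations can be chained after transformation, the example below streams normalized records through a language filter, key-based deduplication, and a trivial enrichment step. The operation set and enrichment logic are illustrative assumptions.

    def filter_english(records):
        """Drop records whose language is not English."""
        return (r for r in records if r.get("language") == "en")

    def deduplicate(records, key="document_id"):
        """Pass each record through once, keyed on a stable identifier."""
        seen = set()
        for r in records:
            if r[key] not in seen:
                seen.add(r[key])
                yield r

    def enrich(records):
        """Attach a trivial derived field; real enrichment might call AI models."""
        for r in records:
            r["word_count"] = len(r.get("body", "").split())
            yield r

    def apply_operations(records):
        """Chain the operations lazily so records stream through the pipeline."""
        return enrich(deduplicate(filter_english(records)))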

Delivery and Destination Layer (Egress)

Final processed data is streamed to user-defined destinations (e.g., AWS S3, BigQuery, Azure Blob Storage, webhooks), with delivery optimized for each target system's characteristics.

Pipeline Orchestrator

The Pipeline Orchestrator dynamically manages resource allocation, service scaling, workload management, and error recovery across pipeline layers to maintain real-time flow under varying loads.

Flow through Key Layers

The general flow through the key layers can be visualized as shown below. Note that the Pipeline Orchestrator is not pictured for simplicity, as it interacts with each layer.

flowchart TD 
 A[Data Source] --- B[Ingress Connector] 
 B --> C[Transformation Layer] 
 C --> D[Operations Layer]
 D --> E[Delivery Egress]
 G[Job Management]
 G -.-> B
 B -.-> G
 B --> A

Pipeline Scalability

Horizontal Scaling: All critical services (connectors, processors, delivery engines) are stateless and scale horizontally; the Orchestrator adds or removes instances dynamically based on load.

Elastic Queues: Ingestion queues automatically expand under load, buffering surges to protect downstream systems without data loss.

Auto-Sharding: High-volume sources and large pipeline consumption are automatically divided into multiple shards/partitions to balance workload and avoid bottlenecks.

Partitioned Workloads: Large-volume pipelines automatically shard streams across processors and delivery workers for parallelism.

Backpressure and Flow Control: Controlled at every layer to prevent overloads; connectors apply flow control to upstream services if downstream systems slow down (see the sketch after this table).

Failure Recovery: Retries with exponential backoff, dead-letter queues for irrecoverable errors, and checkpointing for safe resumption.

Multi-region Support: Where configured, services can be deployed across multiple regions/zones for higher availability and lower latency.

Real-time Metrics: Continuous collection of event counts, processing lag, delivery statuses, and error rates for proactive monitoring.

External Alerting: Available connectors and configurable external alerting for pipeline performance or exceeded volume thresholds.

Delivery Optimization: Dynamic delivery batch sizing and adaptive retry policies maintain throughput without overwhelming destinations.
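
The elastic queue, backpressure, and flow-control behavior above can be pictured as a bounded buffer: when the downstream consumer slows, the buffer fills and the producer blocks instead of overrunning it. This is a minimal single-process sketch of the idea, not the platform's streaming infrastructure.

    import queue
    import threading
    import time

    buffer = queue.Queue(maxsize=100)  # bounded buffer: a full queue applies backpressure

    def producer():
        """Ingest events; put() blocks when the buffer is full, matching delivery speed."""
        for i in range(1_000):
            buffer.put({"event_id": i})

    def consumer():
        """Deliver events slowly to simulate a constrained destination."""
        while True:
            buffer.get()
            time.sleep(0.01)
            buffer.task_done()

    threading.Thread(target=consumer, daemon=True).start()
    producer()
    buffer.join()  # wait until every buffered event has been delivered
    print("all events delivered")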

Pipeline Resiliency and Fault Tolerance

Scaling Tolerance: Queues are persisted and replicated across multiple nodes during scaling or upgrade events.

Processing Fault Tolerance: Idempotent design ensures that retries do not cause duplicates or corruption. The streaming engine also guarantees at-least-once or exactly-once delivery semantics, depending on the pipeline configuration (see the sketch after this table).

Destination Resilience: Delivery retries with exponential backoff; fallback to a dead-letter queue if persistent failures occur.

Components: Individual services within pipeline layers operate independently, allowing individual services (e.g., Connectors, Processors, Delivery Agents) to restart or recover without impacting queued data or other active services.
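
A minimal sketch of the idempotency, retry, and dead-letter behavior described above, assuming at-least-once delivery: duplicates are skipped by event ID, transient failures are retried with exponential backoff, and events that keep failing are parked in a dead-letter list instead of blocking the stream.

    import time

    processed_ids = set()  # stand-in for durable idempotency state
    dead_letter = []       # events that exhausted their retries

    def handle(event, deliver, max_attempts=5):
        """Process one event idempotently, with retries and a dead-letter fallback."""
        if event["event_id"] in processed_ids:
            return  # duplicate from at-least-once delivery; skip it
        for attempt in range(max_attempts):
            try:
                deliver(event)
                processed_ids.add(event["event_id"])
                return
            except Exception:
                time.sleep(min(2 ** attempt, 30))  # exponential backoff, capped
        dead_letter.append(event)  # irrecoverable: park it for inspection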

In Summary

Datastreamer is a serverless, self-healing pipeline platform: users define data workflows, and the platform continuously and automatically maintains them with minimal manual involvement.

  • Scalable: Designed for growth; the Orchestrator handles the scaling of ingestion workers, processors, and delivery agents.
  • Resilient: Survives individual service failures automatically.
  • Secure: Encryption, auth, and access control by default.
  • Efficient: Supports very high event throughput with minimal latency.