Pipeline Core Concepts

About Dynamic Pipelines

Unlike the pipelines many companies build internally, Datastreamer allows you to create truly Dynamic Pipelines. A Datastreamer pipeline is dynamic because each component you use within it is interchangeable, self-sustaining, and configurable.

Every second, over 100,000 enrichments are performed on over 50,000 documents in real time through customers’ Dynamic Pipelines.

Benefits of Dynamic Pipelines

Dynamic Pipelines mean that you can:

  • Build pipelines with multiple parts and complex routing in minutes. (So you can focus on your product, not your plumbing).
  • Add new enrichments or data sources without any integration effort. (So you can ship faster).
  • Gain full transparency and insight into your pipeline’s operations. (So you can sleep better).
  • Avoid data leakage and unnecessary data noise with unlimited pipelines instead of jamming everything into one technical solution. (So you can plan 4-5 years ahead).

Anatomy of a Dynamic Pipeline

Each pipeline has 4 primary parts. These are supported by the underlying platform and its comprehensive routing capabilities.
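
As a purely illustrative sketch, a pipeline can be pictured as an ordered set of these parts. The stage names and structure below are assumptions for illustration only, not Datastreamer’s actual configuration format.

    # Hypothetical sketch of a Dynamic Pipeline's shape. The stage names and
    # structure are illustrative only, not Datastreamer's configuration format.
    pipeline = {
        "ingress":     {"type": "partner_source", "source": "example-news-provider"},
        "unify":       {"target_schema": "my_unified_document_v1"},
        "enrichments": ["language_detection", "sentiment", "pii_redaction"],
        "egress":      {"type": "database", "destination": "analytics-warehouse"},
    }

    for stage, config in pipeline.items():
        print(f"{stage}: {config}")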

“Ingestion” or “Ingress”

This first part is how data gets into the pipeline. Datastreamer pipelines can connect to the sources you already work with (existing providers, internal data stores, etc.). Beyond those, the Datastreamer partner catalog is a great way to discover new data sources.
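
As a hedged illustration of the idea, the sketch below pushes a small batch of JSON documents to a pipeline’s ingress over HTTP. The endpoint URL and payload shape are hypothetical placeholders, not Datastreamer’s actual ingestion API.

    # Hypothetical example of pushing documents into a pipeline's ingress over
    # HTTP. The endpoint URL and payload shape are assumptions for illustration.
    import requests

    documents = [
        {"id": "doc-1", "source": "internal-crawler", "content": "<html>...</html>"},
        {"id": "doc-2", "source": "partner-feed", "content": "Raw article text."},
    ]

    response = requests.post(
        "https://example.invalid/pipelines/my-pipeline/ingress",  # placeholder URL
        json={"documents": documents},
        timeout=10,
    )
    response.raise_for_status()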

“Unify”

After data has been ingested, it can be in any format, structure, or size. Unify is a combination of capabilities that assess the existing state of the data and convert it to the form it needs to be in. (For example: a PDF may require OCR or table extraction, social content may require transformation, raw HTML may require boilerplate removal, etc.) Unify breaks the content down into JSON and matches it to the schema that you define. A special trick of Unify is its AI components, which can construct and fill in missing metadata.
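
A minimal sketch of the idea behind Unify, assuming a hypothetical target schema and field names (the real pipeline performs this step for you, including OCR, transformation, and boilerplate removal):

    # Minimal sketch of the Unify concept: heterogeneous inputs are mapped onto
    # one JSON schema that you define. Schema and field names are hypothetical.
    def unify(raw: dict) -> dict:
        """Map a raw document from any source onto a single target schema."""
        return {
            "id": raw.get("id") or raw.get("guid"),
            "title": raw.get("title") or raw.get("headline") or "",
            "body": raw.get("content") or raw.get("text") or "",
            "published_at": raw.get("published_at") or raw.get("pub_date"),
            "source": raw.get("source", "unknown"),
        }

    social_post = {"guid": "p-42", "text": "Great launch today!", "source": "social"}
    print(unify(social_post))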

“Enrichments”

With many enrichments pre-integrated into Datastreamer’s pipelines, you can perform real-time preprocessing of the data. Enrichments can include NLP, AI, or even data cleanup processes, ranging from sentiment detection to translation, ChatGPT prompt execution, PII redaction, and more. Our platform catalog has the full set of options.
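
To make the concept concrete, here is an illustrative-only sketch of what an enrichment step does: take a unified document and attach new metadata. The naive keyword-based sentiment logic below is a stand-in for the real NLP/AI enrichments in the platform catalog.

    # Illustrative-only sketch of an enrichment step: attach new metadata to a
    # unified document. The keyword "sentiment" logic is a toy stand-in.
    def enrich_with_sentiment(document: dict) -> dict:
        body = document.get("body", "").lower()
        positive = sum(body.count(w) for w in ("great", "good", "excellent"))
        negative = sum(body.count(w) for w in ("bad", "poor", "terrible"))
        document.setdefault("enrichments", {})["sentiment"] = (
            "positive" if positive > negative
            else "negative" if negative > positive
            else "neutral"
        )
        return document

    print(enrich_with_sentiment({"id": "doc-1", "body": "Great product, good support."}))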

“Egress”

Once data has been transformed, understood, and enriched, it is ready to leave the pipeline. Many destinations are available, from direct database connections to Datastreamer’s own storage offering, or even another pipeline for further processing and routing.
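
As a rough sketch of the concept, the example below delivers enriched documents to a destination; writing JSON Lines to a local file is a stand-in for a real database connection or Datastreamer storage destination.

    # Hypothetical sketch of an egress step: deliver enriched documents to a
    # downstream destination. A local JSON Lines file stands in for a real
    # database or storage destination.
    import json

    def egress(documents: list[dict], path: str = "enriched_documents.jsonl") -> None:
        with open(path, "a", encoding="utf-8") as destination:
            for document in documents:
                destination.write(json.dumps(document) + "\n")

    egress([{"id": "doc-1", "enrichments": {"sentiment": "positive"}}])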

Data Routing

In addition to the 4 main parts above, Datastreamer pipelines can add complex routing to send data down different paths (and even different pipelines). Routing can be based on document metadata, document properties, Lucene queries that look within the document itself, and more. This routing can take place between any components.
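
A simplified sketch of what a routing decision looks like, assuming a hypothetical rule format based on document metadata (real pipelines also support Lucene queries against the document itself):

    # Simplified sketch of routing: send a document down one path or another
    # based on its metadata. The branch names and rule format are assumptions.
    def route(document: dict) -> str:
        if document.get("language") != "en":
            return "translation-branch"      # enrich with translation first
        if "pii" in document.get("flags", []):
            return "redaction-branch"        # strip PII before egress
        return "default-branch"

    print(route({"id": "doc-1", "language": "fr"}))                    # translation-branch
    print(route({"id": "doc-2", "language": "en", "flags": ["pii"]}))  # redaction-branch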