πŸ“˜ Pipeline Anatomy

This overview covers the core concepts behind Pipeline Orchestration, including Dynamic Pipelines, Components, and more.

What is the anatomy of a Dynamic Pipeline?

Generally, you can think of a data pipeline as having these four key stages:

  1. Ingress: How data enters the pipeline.
  2. Transformation: How the data is structured for usage.
  3. Operations: Ways that data is then filtered, enriched, or augmented.
  4. Egress: How the data is delivered out of the pipeline, either directly into a product or into a database.
flowchart LR
    Ingress["fa:fa-spinner Ingress"] --- Transformation["fa:fa-spinner Transformation"]
    Transformation --- Operations["fa:fa-spinner Operations"]
    Operations --- Egress["fa:fa-spinner Egress"]
    Egress --> YourProduct["fa:fa-check Your Product"]
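
To make the four stages concrete, here is a minimal sketch of how they might be expressed as a declarative pipeline definition. The structure and component names below are illustrative assumptions, not the actual Datastreamer configuration format:

```python
# Hypothetical pipeline definition mapping components to each stage.
# Names and structure are illustrative only.
pipeline_definition = {
    "name": "example-pipeline",
    "ingress": [{"component": "http_push"}],        # how data enters
    "transformation": [{"component": "unify"}],     # how data is structured
    "operations": [
        {"component": "language_detection"},        # enrich
        {"component": "deduplication"},             # filter
    ],
    "egress": [{"component": "database_writer"}],   # how data is delivered
}
```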

Adding Data to a Pipeline (β€œIngestion” or β€œIngress”)

Ingestion is how data gets into the pipeline. Datastreamer pipelines can connect to the sources you already work with (existing providers, internal stores of data, etc.) in either a "pull" or "push" manner. This means that a Dynamic Pipeline can connect to essentially any data source, including other Dynamic Pipelines!

You can either push data to a Pipeline directly, or use "Jobs" to connect to a source and pull data, with the Job system handling the collection.
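
As a minimal sketch of the "push" approach, the snippet below sends a single document to a pipeline's ingress over HTTP. The endpoint URL, auth header, and payload shape are assumptions for illustration, not the actual Datastreamer API:

```python
import requests

# Push a single document into a pipeline's ingress.
# The URL, auth header, and payload fields are hypothetical.
PIPELINE_INGRESS_URL = "https://api.example.com/pipelines/my-pipeline/ingest"

document = {
    "source": "internal-crawler",
    "content": "Raw text collected from an internal data store.",
}

response = requests.post(
    PIPELINE_INGRESS_URL,
    json=document,
    headers={"Authorization": "Bearer <your-api-key>"},
    timeout=30,
)
response.raise_for_status()
print("Ingested:", response.status_code)
```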

Data Structuring (Transformation, "Unify")

After data has been ingested, it can be in any format, structure, or size. Different transformation tools can convert formats, adapt data to schemas, apply AI structuring capabilities, and more.

A special Component within the Datastreamer platform is "Unify": a combination of capabilities that works to understand the existing state of the data and convert it into the form it needs to be in. (For example: a PDF may require OCR or table extraction, social content may require transforming, raw HTML may require boilerplate removal, etc.) Unify breaks the content down into JSON that matches the schema you define. A special trick of Unify is its AI components, which can construct and fill in missing metadata.
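
As an illustration of the kind of normalization Unify performs, consider a raw social post being converted into a JSON document that matches a schema you define. The field names and values here are hypothetical:

```python
# A raw social post as it might arrive at the pipeline.
raw_social_post = {
    "usr": "jdoe",
    "txt": "Loving the new release!",
    "ts": 1714070400,  # Unix timestamp
}

# The same content after unification into a target schema.
unified_document = {
    "author": "jdoe",
    "body": "Loving the new release!",
    "published_at": "2024-04-25T18:40:00Z",  # timestamp converted to ISO 8601 (UTC)
    "language": "en",                        # example of AI-filled missing metadata
}
```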

Data Operations (β€œEnrichments”, "Routing")

Many users of a Dynamic Pipeline perform multiple enrichments and operations on the data in the Pipeline. These can range from routing Components, NLP, AI, and deduplication to many other components. While most Pipelines tend to utilize Components from the Datastreamer Catalog, you can also connect your own Operations.

Dynamic Pipelines have the ability to add complex routing to send data down different paths (and even different pipelines) based on a number of criteria. Routing can be based on document metadata, document properties, Lucene queries that look within the document itself, and more. This routing can take place between any components.
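
A sketch of what such routing rules could look like is shown below; the rule format, field names, and destination names are assumptions for illustration only:

```python
# Hypothetical routing rules: each rule pairs a condition (metadata check
# or Lucene-style query against the document) with the next destination.
routing_rules = [
    {
        "match": 'language:en AND content:"product launch"',  # Lucene-style query
        "send_to": "sentiment-analysis",
    },
    {
        "match": "metadata.source:internal-crawler",
        "send_to": "internal-archive-pipeline",
    },
    {
        "match": "*:*",  # fallback route for everything else
        "send_to": "default-egress",
    },
]
```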

Delivering Data (β€œEgress”)

Within your Dynamic Pipelines, data has been transformed, understood, and enriched; it is now ready to leave the pipeline. Many options are available, from direct database connections to Datastreamer's own storage offering, or even delivery into another pipeline for further processing and routing.
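
For illustration, egress destinations might be described along these lines; the destination types and fields are assumptions, not actual connector configuration:

```python
# Hypothetical egress destinations for documents leaving the pipeline.
egress_destinations = [
    {"type": "database", "connection": "postgresql://db.example.com/analytics"},
    {"type": "datastreamer_storage", "collection": "searchable-archive"},
    {"type": "pipeline", "target": "downstream-enrichment-pipeline"},
]
```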


What’s Next

Now that you know the parts of a Pipeline, let's create our first Pipeline.