πŸ“˜ Pipeline Anatomy

This overview covers the core concepts behind Pipeline Orchestration, including Dynamic Pipelines, Components, and more.

What is the anatomy of a Dynamic Pipeline?

Generally, you can think of a data pipeline as having these four key stages:

  1. Ingress: How data enters the pipeline.
  2. Transformation: How the data is structured for usage.
  3. Operations: Ways that data is then filtered, enriched, or augmented.
  4. Egress: How the data is delivered out of the pipeline, either directly into a product or into a database.
flowchart LR
    Ingress["fa:fa-spinner Ingress"] --- Transformation["fa:fa-spinner Transformation"]
    Transformation --- Operations["fa:fa-spinner Operations"]
    Operations --- Egress["fa:fa-spinner Egress"]
    Egress --> YourProduct["fa:fa-check Your Product"]
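
To make the four stages concrete, here is a minimal sketch of how they might be expressed as a declarative pipeline definition. The structure and component names below are illustrative assumptions, not the actual Datastreamer configuration format:

```python
# Hypothetical pipeline definition mapping components to each stage.
# Names and structure are illustrative only.
pipeline_definition = {
    "name": "example-pipeline",
    "ingress": [{"component": "http_push"}],        # how data enters
    "transformation": [{"component": "unify"}],     # how data is structured
    "operations": [
        {"component": "language_detection"},        # enrich
        {"component": "deduplication"},             # filter
    ],
    "egress": [{"component": "database_writer"}],   # how data is delivered
}
```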

Adding Data to a Pipeline (β€œIngestion” or β€œIngress”)

Ingestion is how data gets into the pipeline. Datastreamer pipelines can connect to the sources you already work with (existing providers, internal stores of data, etc.) in either a "pull" or "push" manner. This means that a Dynamic Pipeline can connect to essentially any data source, including other Dynamic Pipelines!

You can either push data to a Pipeline directly, or use "Jobs" to connect to a source and pull data, with the Job system handling the collection.
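
As a minimal sketch of the "push" approach, the snippet below sends a single document to a pipeline's ingress over HTTP. The endpoint URL, auth header, and payload shape are assumptions for illustration, not the actual Datastreamer API:

```python
import requests

# Push a single document into a pipeline's ingress.
# The URL, auth header, and payload fields are hypothetical.
PIPELINE_INGRESS_URL = "https://api.example.com/pipelines/my-pipeline/ingest"

document = {
    "source": "internal-crawler",
    "content": "Raw text collected from an internal data store.",
}

response = requests.post(
    PIPELINE_INGRESS_URL,
    json=document,
    headers={"Authorization": "Bearer <your-api-key>"},
    timeout=30,
)
response.raise_for_status()
print("Ingested:", response.status_code)
```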

Data Structuring (Transformation, "Unify")

After data has been ingested, it can be in any format, structure, or size. Different transformation tools can convert formats, adapt data to schemas, apply AI structuring capabilities, and more.

A special Component within the Datastreamer platform is "Unify": a combination of capabilities that works to understand the existing state of the data and convert it into the form it needs to be in. (For example: a PDF may require OCR or table extraction, social content may require transforming, raw HTML may require boilerplate removal, etc.) Unify breaks the content down into JSON that matches the schema you define. A special trick of Unify is its AI components, which can construct and fill in missing metadata.
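
As an illustration of the kind of normalization Unify performs, consider a raw social post being converted into a JSON document that matches a schema you define. The field names and values here are hypothetical:

```python
# A raw social post as it might arrive at the pipeline.
raw_social_post = {
    "usr": "jdoe",
    "txt": "Loving the new release!",
    "ts": 1714070400,  # Unix timestamp
}

# The same content after unification into a target schema.
unified_document = {
    "author": "jdoe",
    "body": "Loving the new release!",
    "published_at": "2024-04-25T18:40:00Z",  # timestamp converted to ISO 8601 (UTC)
    "language": "en",                        # example of AI-filled missing metadata
}
```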

Data Operations (β€œEnrichments”, "Routing")

Many users of a Dynamic Pipeline perform multiple enrichments and operations on the data in the Pipeline. These can range from routing Components, NLP, AI, and deduplication to many other components. While most Pipelines tend to utilize Components from the Datastreamer Catalog, you can also connect your own Operations.

Dynamic Pipelines have the ability to add complex routing to send data down different paths (and even different pipelines) based on a number of criteria. Routing can be based on document metadata, document properties, Lucene queries that look within the document itself, and more. This routing can take place between any components.
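
A sketch of what such routing rules could look like is shown below; the rule format, field names, and destination names are assumptions for illustration only:

```python
# Hypothetical routing rules: each rule pairs a condition (metadata check
# or Lucene-style query against the document) with the next destination.
routing_rules = [
    {
        "match": 'language:en AND content:"product launch"',  # Lucene-style query
        "send_to": "sentiment-analysis",
    },
    {
        "match": "metadata.source:internal-crawler",
        "send_to": "internal-archive-pipeline",
    },
    {
        "match": "*:*",  # fallback route for everything else
        "send_to": "default-egress",
    },
]
```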

Delivering Data (β€œEgress”)

Within your Dynamic Pipelines, data has been transformed, understood, and enriched; it is now ready to leave the pipeline. Many options are available, from direct database connections to Datastreamer's own storage offering, or even delivery into another pipeline for further processing and routing.
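
For illustration, egress destinations might be described along these lines; the destination types and fields are assumptions, not actual connector configuration:

```python
# Hypothetical egress destinations for documents leaving the pipeline.
egress_destinations = [
    {"type": "database", "connection": "postgresql://db.example.com/analytics"},
    {"type": "datastreamer_storage", "collection": "searchable-archive"},
    {"type": "pipeline", "target": "downstream-enrichment-pipeline"},
]
```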


What’s Next

Now that you know the parts of a Pipeline, let's create our first Pipeline.