Platform Core Concepts
This overview describes the core concepts behind Pipeline Orchestration, including Dynamic Pipelines, Components, and more.
What is a Pipeline Orchestration Platform?
Datastreamer's Pipeline Orchestration Platform is the underlying system that powers the biggest products in threat intelligence, trend prediction, social listening, unstructured data systems, and more!
A Pipeline Orchestration Platform is a solution designed to manage, coordinate, and streamline the flow of data across multiple sources, processing stages, and destinations. In simpler terms, it’s a system that helps companies control the entire journey of their data, from collection to final output, allowing for flexible integration and transformation at each stage.
Datastreamer's Pipeline Orchestration Platform has two main elements:
- Components: Each source, stage, and destination within the Datastreamer platform is a self-contained building block called a "Component". Similar to LEGO bricks, Components allow you to assemble the ideal flow and stages for the journey of data through the platform.
- Dynamic Pipelines: Data pipelines refer to the overall flow of data, but Datastreamer's pipelines are different! Unlike the pipelines created internally by many companies, Datastreamer allows you to create truly Dynamic Pipelines.
In summary: A pipeline within Datastreamer is a Dynamic Pipeline as each Component you use within the pipeline is interchangeable, self-sustaining, and configurable.
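To make the idea of interchangeable, configurable Components concrete, here is a minimal sketch in Python. It is not Datastreamer's API: the `Component` and `Pipeline` classes, the `swap` method, and the three example stages are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]


@dataclass
class Component:
    """Hypothetical model of a self-contained, configurable pipeline stage."""
    name: str
    run: Callable[[Document, Dict[str, Any]], Document]
    config: Dict[str, Any] = field(default_factory=dict)

    def __call__(self, doc: Document) -> Document:
        return self.run(doc, self.config)


@dataclass
class Pipeline:
    """Hypothetical pipeline: an ordered list of Components."""
    components: List[Component]

    def process(self, doc: Document) -> Document:
        # Pass the document through each Component in order.
        for component in self.components:
            doc = component(doc)
        return doc

    def swap(self, name: str, replacement: Component) -> "Pipeline":
        # Return a new pipeline with the named Component exchanged.
        return Pipeline([replacement if c.name == name else c
                         for c in self.components])


def lowercase(doc: Document, config: Dict[str, Any]) -> Document:
    return {**doc, "text": doc["text"].lower()}


def truncate(doc: Document, config: Dict[str, Any]) -> Document:
    return {**doc, "text": doc["text"][: config["max_chars"]]}


def tag(doc: Document, config: Dict[str, Any]) -> Document:
    return {**doc, "tag": config["tag"]}


pipeline = Pipeline([
    Component("normalize", lowercase),
    Component("label", tag, {"tag": "social"}),
])
result = pipeline.process({"text": "Hello World"})

# Exchange one stage without touching the rest of the pipeline.
shorter = pipeline.swap("normalize", Component("normalize", truncate, {"max_chars": 5}))
result_swapped = shorter.process({"text": "Hello World"})
```

Because every stage shares the same document-in, document-out shape, any stage can be replaced or reconfigured without changes to its neighbors, which is the property the summary above describes.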
Fun Fact:
Datastreamer's Dynamic Pipelines allow Components to be interchanged and place no limit on the size of a pipeline. As a result, there are over 2.281 x 10^49 (about 22.8 quindecillion) unique pipeline layouts possible.
What benefits do I get from Dynamic Pipelines?
Dynamic Pipelines mean that you can:
- Build pipelines with multiple parts and complex routing in minutes. (So you can focus on your product, not your plumbing).
- Add new enrichments or data sources without any integration effort. (So you can ship faster).
- Get full transparency and insight into your data pipelines' operations. (So you can sleep better).
- Avoid data leakage and unnecessary data noise by using unlimited pipelines instead of jamming everything into one technical solution. (So you can look 4-5 years ahead).
- Process data in a dedicated and scalable environment.
What is the anatomy of a Dynamic Pipeline?
Adding Data to a Pipeline (“Ingestion” or “Ingress”)
Ingestion is how data gets into the pipeline. Datastreamer pipelines can connect to the sources you already work with (existing providers, internal stores of data, etc.), in either a "pull" or "push" manner. This means that a Dynamic Pipeline can connect to essentially any data source, including other Dynamic Pipelines!
You can either push data to a Pipeline yourself, or use "Jobs", which connect to the source and handle collecting the data for you.
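The difference between the push and pull models can be sketched as follows. This is an illustration only, not Datastreamer's ingestion API: `PipelineInbox`, `Job`, and `fetch_from_provider` are hypothetical names, and the fetch function returns canned data in place of a real provider call.

```python
from typing import Any, Callable, Dict, Iterable, List

Document = Dict[str, Any]


class PipelineInbox:
    """Hypothetical stand-in for a pipeline's ingestion point."""

    def __init__(self) -> None:
        self.documents: List[Document] = []

    def push(self, docs: Iterable[Document]) -> int:
        # Push model: the caller delivers documents to the pipeline.
        batch = list(docs)
        self.documents.extend(batch)
        return len(batch)


class Job:
    """Pull model: a job connects to a source and collects its documents."""

    def __init__(self, fetch: Callable[[], Iterable[Document]],
                 inbox: PipelineInbox) -> None:
        self.fetch = fetch
        self.inbox = inbox

    def run(self) -> int:
        # The job, not the caller, is responsible for fetching the data.
        return self.inbox.push(self.fetch())


def fetch_from_provider() -> List[Document]:
    # Placeholder for a call to an existing provider or internal data store.
    return [{"id": "b1", "text": "pulled"}, {"id": "b2", "text": "pulled too"}]


inbox = PipelineInbox()
pushed = inbox.push([{"id": "a1", "text": "pushed"}])   # push: you send data in
pulled = Job(fetch_from_provider, inbox).run()          # pull: a Job collects it
```

In both cases the documents end up at the same ingestion point; the only difference is which side initiates the transfer.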
Data Structuring (Transformation, "Unify")
After data has been ingested, it can be in any format, structure, or size. Different transformation tools can convert the format, adapt to schemas, utilize AI structuring capabilities, and more.
A special Component within the Datastreamer platform is "Unify". It combines capabilities that work out the existing state of the data and convert it to the state it is required to be in. (For example: a PDF may require OCR or table extraction, social content may require transforming, raw HTML may require boilerplate removal, etc.). Unify breaks the content down into JSON that matches the schema you define. A special trick of Unify is its AI components, which can construct and fill in missing metadata.
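As a rough illustration of mapping arbitrary input onto a schema you define, the sketch below converts a raw record to JSON and fills a missing field. It does not show how Unify works internally: `SCHEMA`, `FILLERS`, `unify`, and `guess_language` are hypothetical names, and `guess_language` is a trivial placeholder standing in for an AI step.

```python
import json
from typing import Any, Callable, Dict, List

# Hypothetical target schema: output field -> candidate source keys, in priority order.
SCHEMA: Dict[str, List[str]] = {
    "content": ["body", "text", "message"],
    "author": ["user", "author_name"],
    "language": ["lang"],
}


def guess_language(raw: Dict[str, Any]) -> str:
    # Placeholder for an AI component that infers missing metadata.
    return "unknown"


# Fields that may be filled in when the source does not provide them.
FILLERS: Dict[str, Callable[[Dict[str, Any]], Any]] = {"language": guess_language}


def unify(raw: Dict[str, Any]) -> str:
    """Map a raw record onto SCHEMA and return it as a JSON string."""
    out: Dict[str, Any] = {}
    for field_name, candidates in SCHEMA.items():
        # Take the first candidate key present in the raw record.
        value = next((raw[key] for key in candidates if key in raw), None)
        if value is None and field_name in FILLERS:
            value = FILLERS[field_name](raw)
        out[field_name] = value
    return json.dumps(out, sort_keys=True)


unified = unify({"message": "Launch day!", "user": "ana"})
```

Here the input uses `message` and `user`, yet the output always carries the schema's own field names, with `language` filled in by the placeholder because the source omitted it.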
Data Operations (“Enrichments”, "Routing")
Many users of Dynamic Pipelines apply multiple enrichments and operations to the data in a Pipeline. These can include routing Components, NLP, AI, deduplication, and many other Components. While most Pipelines use Components from the Datastreamer Catalog, you can also connect your own Operations.
Dynamic Pipelines can add complex routing to send data down different paths (and even to different pipelines) based on a number of criteria. Routing can be based on document metadata, document properties, Lucene queries that look within the document itself, and more. This routing can take place between any Components.
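Rule-based routing of this kind can be sketched as an ordered list of conditions, each paired with a destination path. Everything here is an assumption made for illustration: the rule format, the path names, and the first-match-wins behavior are hypothetical, and a plain substring check stands in for a real Lucene query.

```python
from typing import Any, Callable, Dict, List, Tuple

Document = Dict[str, Any]
Rule = Tuple[Callable[[Document], bool], str]

# Hypothetical rules, evaluated in order; the first match decides the path.
RULES: List[Rule] = [
    # Route on document metadata.
    (lambda d: d.get("metadata", {}).get("source") == "forum", "forum_path"),
    # Route on a document property.
    (lambda d: d.get("language") == "fr", "translation_path"),
    # Route on document content (substring check standing in for a Lucene query).
    (lambda d: "ransomware" in str(d.get("text", "")).lower(), "threat_path"),
]


def route(doc: Document, rules: List[Rule] = RULES,
          default: str = "default_path") -> str:
    """Return the name of the path the document should be sent down."""
    for predicate, path in rules:
        if predicate(doc):
            return path
    return default


forum_doc = {"metadata": {"source": "forum"}, "text": "New ransomware strain"}
french_doc = {"language": "fr", "text": "Bonjour"}
threat_doc = {"text": "Ransomware campaign observed"}
plain_doc = {"text": "Quarterly results"}

paths = [route(d) for d in (forum_doc, french_doc, threat_doc, plain_doc)]
```

Note that `forum_doc` also mentions ransomware but is sent to `forum_path`, because in this sketch rule order determines priority.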
Delivering Data (“Egress”)
Once data has been transformed, understood, and enriched within your Dynamic Pipelines, it is ready to leave the pipeline. Many options are available, from direct database connections to Datastreamer’s own storage offering, or even delivery into another pipeline for further processing and routing.
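As one example of the delivery step, the sketch below writes processed documents to a database. It uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in for a direct database connection; the `deliver_to_database` function and the `documents` table layout are hypothetical and not part of Datastreamer.

```python
import json
import sqlite3
from typing import Any, Dict, Iterable


def deliver_to_database(conn: sqlite3.Connection,
                        docs: Iterable[Dict[str, Any]]) -> int:
    """Store each document as a JSON body keyed by its id; return the count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents (id TEXT PRIMARY KEY, body TEXT)"
    )
    rows = [(str(d["id"]), json.dumps(d, sort_keys=True)) for d in docs]
    with conn:  # Commits on success, rolls back if an insert fails.
        conn.executemany(
            "INSERT OR REPLACE INTO documents (id, body) VALUES (?, ?)", rows
        )
    return len(rows)


conn = sqlite3.connect(":memory:")
delivered = deliver_to_database(conn, [
    {"id": "a1", "sentiment": "positive"},
    {"id": "a2", "sentiment": "neutral"},
])
stored = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
```

Using `INSERT OR REPLACE` keyed on the document id means that redelivering the same document overwrites the earlier row instead of duplicating it.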
The Platform Glossary is a great way to review the terms used here.