Cloud Storage Ingress

Configuring data ingestion for a cloud storage service

Component Configuration

You can import data from Azure Blob Storage containers, Amazon S3 buckets, or Google Cloud Storage. All providers share the common configuration described below.

Json Codec

This is one of two values:

  • application/json-lines : Single-line JSON documents separated by line feeds
  • application/json : A traditional JSON array wrapped in []
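The difference between the two codec values can be sketched in a few lines of Python. This is only an illustration of the two file layouts, not Datastreamer's own parser; the function name decode_documents and the sample payloads are made up for this example.

```python
import json

def decode_documents(payload: str, codec: str) -> list:
    """Return the JSON documents in payload for the given codec value."""
    if codec == "application/json-lines":
        # One JSON document per line; blank lines are skipped.
        return [json.loads(line) for line in payload.split("\n") if line.strip()]
    if codec == "application/json":
        # A single JSON array wrapping all documents.
        documents = json.loads(payload)
        if not isinstance(documents, list):
            raise ValueError("expected a JSON array")
        return documents
    raise ValueError(f"unsupported codec: {codec}")

# Hypothetical sample payloads holding the same two documents.
json_lines = '{"id": 1}\n{"id": 2}\n'
json_array = '[{"id": 1}, {"id": 2}]'
```

Both sample payloads decode to the same list of two documents; only the file layout differs.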

Processing Container / Processing Folder

Folder to store files temporarily while processing when using a Monitor workflow.

Ingested Container / Ingested Folder

Folder to store files after they have been ingested when using a Monitor workflow.

Credentials

The credentials you enter depend on the cloud provider. For example, Google Cloud Storage is configured by choosing a secret that contains the Google service account. We can provide this account for you to grant permissions to, or you can create it yourself.

All providers have a Use Proxy option. Select this option if your security policy requires IP addresses to be whitelisted before they are given access. Ask Datastreamer support for the IP address of our proxy so that you can allow it.

Data Ingestion

Once a pipeline is deployed, you can add jobs that perform data ingestion.

Data Type

This is one of two options:

  • JSON Documents : Your files contain JSON documents that will be extracted and processed individually
  • Files : Everything else (e.g. PDFs). These files will be processed individually

Ingestion Type

This is the workflow that will be used when pulling files from cloud storage. The options are:

Monitor Folder

Datastreamer can periodically scan your folder and automatically ingest any files placed there. This will move files between the processing and ingested locations so that files are not processed twice.

Permissions required: LIST, READ and WRITE on the buckets/containers used.

Monitor Folder Daily

Similar to Monitor Folder, except that the folder is scanned once per day, at midnight.

Folder

Scan a folder once.

Permissions required: LIST and READ on the buckets/containers used. If using the processing/complete workflow, WRITE is also required.

Files

List specific files to ingest.

Permissions required: READ on the buckets/containers used.
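The permission requirements listed for each ingestion type can be summarised as a small lookup. This is a sketch for checking your own setup, not part of the product; the function name required_permissions is hypothetical, and two points are assumptions rather than statements from this page: that Monitor Folder Daily needs the same permissions as Monitor Folder, and that Files also needs WRITE when the processing/complete workflow is used.

```python
def required_permissions(ingestion_type: str, use_processing_locations: bool = False) -> set:
    """Return the bucket/container permissions needed for an ingestion type."""
    if ingestion_type in ("Monitor Folder", "Monitor Folder Daily"):
        # Assumption: the daily variant needs the same permissions as Monitor Folder.
        return {"LIST", "READ", "WRITE"}
    if ingestion_type == "Folder":
        permissions = {"LIST", "READ"}
    elif ingestion_type == "Files":
        permissions = {"READ"}
    else:
        raise ValueError(f"unknown ingestion type: {ingestion_type}")
    if use_processing_locations:
        # Moving files requires WRITE. The page states this for Folder only;
        # applying it to Files as well is an assumption.
        permissions.add("WRITE")
    return permissions
```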

Workflow

Two options are available when the Files or Folder Ingestion Type is used.

Leave files where they are

Files are just read and left as they are.

Use processing/complete locations

Files are moved to the configured Processing location while being ingested, and then to the configured Ingested location once ingestion is complete.
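The move sequence can be pictured with a local-filesystem sketch. It uses ordinary directories in place of buckets or containers, and the names ingest_with_locations and handle are hypothetical; it illustrates the order of the moves and is not how Datastreamer implements them.

```python
import shutil
import tempfile
from pathlib import Path

def ingest_with_locations(source: Path, processing: Path, ingested: Path, handle) -> list:
    """Move each file source -> processing -> ingested, calling handle in between."""
    processing.mkdir(parents=True, exist_ok=True)
    ingested.mkdir(parents=True, exist_ok=True)
    done = []
    # Snapshot the listing first so that moving files does not disturb iteration.
    for path in sorted(p for p in source.iterdir() if p.is_file()):
        in_progress = processing / path.name
        # Step 1: claim the file by moving it to the Processing location.
        shutil.move(str(path), str(in_progress))
        # Step 2: ingest the content (handle stands in for the real pipeline).
        handle(in_progress.read_bytes())
        # Step 3: park the file in the Ingested location so it is not picked up again.
        shutil.move(str(in_progress), str(ingested / path.name))
        done.append(path.name)
    return done
```

Because a file leaves the source folder before it is read, a later scan of the same folder does not see it again.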

Bucket / Container

The bucket or container where your files are stored.

Folder

An optional sub-folder or prefix.

Include Sub-Folders

If set, any folder scan recurses into all sub-folders. Otherwise, only the configured folder (or the 'root' folder if none is configured) is scanned.
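Cloud storage has no real folders, only object keys that share a prefix, so the effect of this setting can be sketched as a filter over keys. The function select_keys and the sample keys below are hypothetical, and the sketch assumes "/" is the separator; the actual scan may differ in detail.

```python
def select_keys(keys, folder="", include_subfolders=False):
    """Return the keys a folder scan would consider, given the sub-folder setting."""
    prefix = folder.strip("/")
    prefix = prefix + "/" if prefix else ""  # an empty prefix means the 'root' folder
    selected = []
    for key in keys:
        if not key.startswith(prefix):
            continue
        remainder = key[len(prefix):]
        if not remainder or remainder.endswith("/"):
            continue  # skip folder placeholder objects
        # Without recursion, keep only keys directly inside the folder.
        if include_subfolders or "/" not in remainder:
            selected.append(key)
    return selected

# Hypothetical listing of a bucket.
keys = ["a.json", "in/", "in/b.json", "in/2024/c.json"]
```

With this listing, scanning the folder "in" without sub-folders selects only in/b.json, while enabling sub-folders also selects in/2024/c.json.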

File Count

This applies to the Folder and Monitor Folder Daily ingestion types and is the maximum number of files to ingest when the job is run.
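As a rough sketch, the limit behaves like a cap on the scan result. The function cap_files is hypothetical, and which files are kept when the folder holds more than the limit is not specified here; the sketch simply assumes the first ones in listing order.

```python
from itertools import islice

def cap_files(keys, file_count=None):
    """Return at most file_count keys; None means no limit."""
    if file_count is None:
        return list(keys)
    # Assumption: the first file_count keys in listing order are kept.
    return list(islice(keys, file_count))
```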