Cloud Storage Ingress
Configuring data ingestion for a cloud storage service
Component Configuration
You can import data from Azure Blob Storage containers, Amazon S3 buckets or Google Cloud Storage. All providers share the common configuration below.
Json Codec
This is one of two values:
application/json-lines
Line feed separated single line JSON documents
application/json
A traditional JSON array wrapped in []
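To make the difference between the two codec values concrete, here is a minimal sketch of how a file's text could be split into individual documents under each one. The function name `iter_documents` is hypothetical and is not part of Datastreamer; it only illustrates the two layouts described above.

```python
import json
from typing import Any, Iterator


def iter_documents(payload: str, codec: str) -> Iterator[Any]:
    """Yield individual JSON documents from a file's text (illustrative helper)."""
    if codec == "application/json-lines":
        # One JSON document per line; blank lines are skipped.
        for line in payload.split("\n"):
            if line.strip():
                yield json.loads(line)
    elif codec == "application/json":
        # A single top-level JSON array whose elements are the documents.
        data = json.loads(payload)
        if not isinstance(data, list):
            raise ValueError("expected a top-level JSON array")
        yield from data
    else:
        raise ValueError(f"unsupported codec: {codec}")
```

Both layouts yield the same documents; only the framing of the file differs.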
Processing Container / Processing Folder
Folder to store files temporarily while processing when using a Monitor workflow.
Ingested Container / Ingested Folder
Folder to store files after they have been ingested when using a Monitor workflow.
Credentials
The specific credentials entered depend on the cloud provider. For example, Google Cloud Storage is configured by choosing a secret containing the Google service account (we can provide this for you to grant permissions to, or you can create it yourself).
All providers have a Use Proxy option. If your security policy requires IP addresses to be whitelisted before access is granted, select this option. Ask Datastreamer support for the IP address of our proxy so that you can allow it.
Data Ingestion
Once a pipeline is deployed, you can add jobs that perform data ingestion.
Data Type
This is one of two options:
JSON Documents
: Your files contain JSON documents that will be extracted and processed individually.
Files
: Everything else (e.g. PDFs). These files will be processed individually.
Ingestion Type
This is the workflow that will be used when pulling files from cloud storage. The options are:
Monitor Folder
Datastreamer can periodically scan your folder and automatically ingest any files placed there. This will move files between the processing and ingested locations so that files are not processed twice.
Permissions required: LIST, READ and WRITE permission on the buckets/containers used
Monitor Folder Daily
Similar to Monitor Folder, except the folder scan runs once per day at midnight.
Folder
Scan a folder once.
Permissions required: LIST and READ on the buckets/containers used. If using the processing/complete workflow, WRITE is also required.
Files
List specific files to ingest.
Permissions required: READ on the buckets/containers used.
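The permission requirements listed for each ingestion type can be summarised in a small lookup, which may be handy when checking a bucket policy before creating a job. This is only a sketch: `required_permissions` is a hypothetical helper, not a Datastreamer API. It assumes that Monitor Folder Daily needs the same permissions as Monitor Folder, and that Files needs WRITE when the processing/complete workflow is used; neither point is stated explicitly on this page.

```python
from typing import FrozenSet


def required_permissions(ingestion_type: str,
                         use_processing_locations: bool = False) -> FrozenSet[str]:
    """Return the bucket/container permissions a job would need (illustrative)."""
    if ingestion_type in ("Monitor Folder", "Monitor Folder Daily"):
        # Monitor workflows always move files, so WRITE is needed.
        # Assumption: the daily variant matches Monitor Folder.
        return frozenset({"LIST", "READ", "WRITE"})
    if ingestion_type == "Folder":
        perms = {"LIST", "READ"}
    elif ingestion_type == "Files":
        perms = {"READ"}
    else:
        raise ValueError(f"unknown ingestion type: {ingestion_type}")
    if use_processing_locations:
        # Moving files between locations requires WRITE.
        # Assumption for Files: the page only states this for Folder.
        perms.add("WRITE")
    return frozenset(perms)
```

For example, `required_permissions("Folder", use_processing_locations=True)` returns LIST, READ and WRITE, while `required_permissions("Files")` returns READ only.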
Workflow
This has two options, available when the Files or Folder Ingestion Type is used:
Leave files where they are
Files are just read and left as they are.
Use processing/complete locations
Files are moved to the configured Processing location while being ingested and then to the configured Ingested location once ingestion is completed.
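The processing/complete workflow can be pictured as two moves around the ingestion step. The sketch below uses local directories as a stand-in for cloud storage locations; `ingest_with_locations` and its `ingest` callback are hypothetical names, and the real service may order or implement these steps differently.

```python
import shutil
from pathlib import Path
from typing import Callable


def ingest_with_locations(source: Path, processing: Path, ingested: Path,
                          ingest: Callable[[Path], None]) -> Path:
    """Move a file through processing and ingested locations (illustrative)."""
    processing.mkdir(parents=True, exist_ok=True)
    ingested.mkdir(parents=True, exist_ok=True)
    # Step 1: move the file into the processing location so that a later
    # scan of the source folder does not pick it up a second time.
    in_flight = Path(shutil.move(str(source), str(processing / source.name)))
    # Step 2: ingest the file from the processing location.
    ingest(in_flight)
    # Step 3: move the file to the ingested location once ingestion is done.
    return Path(shutil.move(str(in_flight), str(ingested / source.name)))
```

Because the file leaves the source folder before ingestion starts, a repeated scan sees only files that have not been handled yet, which is why this workflow needs WRITE permission.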
Bucket / Container
The bucket or container where your files are stored.
Folder
An optional sub-folder or prefix.
Include Sub-Folders
If set then any folder scan will recurse into all sub-folders, otherwise only the specific configured folder or 'root' folder will be scanned.
File Count
This applies to the Folder and Monitor Folder Daily ingestion types and is the maximum number of files to ingest when the job is run.
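The Folder, Include Sub-Folders and File Count settings interact when a scan selects files. The sketch below shows one way that selection could work over a flat list of object keys. `select_keys` is a hypothetical helper, and the alphabetical ordering is an assumption, since this page does not say in which order files are picked when File Count limits the scan.

```python
from typing import Iterable, List, Optional


def select_keys(keys: Iterable[str], folder: str = "",
                include_sub_folders: bool = False,
                file_count: Optional[int] = None) -> List[str]:
    """Pick the object keys a folder scan would ingest (illustrative)."""
    prefix = folder.strip("/")
    prefix = prefix + "/" if prefix else ""  # empty prefix means the root
    selected: List[str] = []
    for key in sorted(keys):  # assumption: alphabetical order
        if file_count is not None and len(selected) >= file_count:
            break  # File Count reached
        if not key.startswith(prefix):
            continue
        remainder = key[len(prefix):]
        if not remainder or remainder.endswith("/"):
            continue  # skip folder placeholder objects
        if not include_sub_folders and "/" in remainder:
            continue  # key lives in a sub-folder
        selected.append(key)
    return selected
```

With keys `in/b.json`, `in/c.json` and `in/sub/d.json`, scanning folder `in` without sub-folders selects the first two; enabling Include Sub-Folders adds the third, and a File Count of 2 caps the result at two files again.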