Creating Jobs (Portal, API)

Learn how to create Jobs for data collection.

📘

Pipeline Required

If you do not yet have a Pipeline created, you should create a Pipeline first. The Pipeline ID and Step ID will be required for adding Jobs to the correct Pipeline and Step.

If you wish to manage Jobs via API, you will need to set up the Pipeline initially in the Portal UI. After setting up the Pipeline, you can create new Jobs through the API.

Creating a Job within Portal

Within Portal, navigate to your Pipeline. If the Pipeline is running, you will need to select "Edit".

Then, you can select the Component that you wish to create Jobs for, which will open a configuration sidebar. Below the setup options, you will have the ability to create a new Job.

Example for Socialgist Blogs Ingress Component.

Jobs created will be visible below under "Current Jobs" and also on the Pipeline Jobs page, which is linked within that screen.

Using APIs to Create a Job

Follow the Portal steps above to reach the Component configuration screen. While you will not need this screen to create further Jobs, it is necessary for obtaining the Step ID and Pipeline ID, which form part of the request URL.

API calls to create Jobs interact with this Pipeline Component, so both the correct Pipeline ID and Step ID are required.

📘

Quick Tip

While creating an initial Job in the UI is not required, doing so allows the generated code to already contain much of the information you will need.

Selecting the "Code" button will provide a Snippet that you can use to generate additional Jobs through API. Selecting "Add" will manually add a Job.

Understanding the Job Creation API Request

Here is a sample Job Creation API Request.

curl --location 'https://api.platform.datastreamer.io/api/pipelines/3f8be588/components/d5wykwcclfi/jobs?ready=true' \
      --header 'apikey: <your-api-key>' \
      --header 'Content-Type: application/json' \
      --data \
        '{
          "job_name": "bf5912e9-5e09-45c8-9a61-3ea75fbc7248",
          "data_source": "socialgist_blogs",
          "query": {
            "query": "cats AND dogs",
            "country": "FR",
            "language": "fr"
          },
          "job_type": "periodic",
          "schedule": "0 0 0/6 1/1 * ? *",
          "label": "Test Label"
        }'

URL (Required)

'https://api.platform.datastreamer.io/api/pipelines/3f8be588/components/d5wykwcclfi/jobs?ready=true'

The URL is the same for every new Job for that Step; it contains the Pipeline ID as well as the Component (Step) ID.
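
In general form, the URL follows this pattern, where the placeholders (named here only for illustration) are replaced with your own Pipeline ID and Step ID:

'https://api.platform.datastreamer.io/api/pipelines/{pipeline-id}/components/{step-id}/jobs?ready=true'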

Job Name (Required)

"job_name": "bf5912e9-5e09-45c8-9a61-3ea75fbc7248",

Job Name is used to identify this Job in analytics and job management screens. It does not have specific format requirements and can be used to organize and track Jobs. Jobs created in the UI will have a randomly generated Job Name.
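
For example, a descriptive name can make a Job easier to locate later; the value below is purely illustrative:

          "job_name": "socialgist-blogs-fr-cats-and-dogs",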

Data Source (Required)

"data_source": "socialgist_blogs",

Some Components may allow multiple sources; however, most will only operate with a single data source. This field is required.

Query (Required)

          "query": {
            "query": "cats AND dogs",
            "country": "FR",
            "language": "fr"

The Query is required and contains the criteria for data collection. The available additional filters are optional and can vary by source. Common filters include language, country, content type, etc. It is recommended to review the available additional filters in the Component's settings.
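
As a minimal example, a query containing only the required search criteria, with all optional filters omitted, could look like this:

          "query": {
            "query": "cats AND dogs"
          },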

Job Type (Required)

          "job_type": "periodic",
          "schedule": "0 0 0/6 1/1 * ? *",

Job type specifies whether the Job is "onetime" or "periodic". If you do not specify a job type, the Job will run a single time.

Periodic: This Job will run on the specified schedule (in cron expression format) and collect new data between the times it has run. For example, the sample schedule above runs every six hours. If the specified schedule is more frequent than the source supports, the lowest available interval will be used.

One Time: This Job will run once and collect data from a specified time frame. If no time frame is specified, it will default to the source default. Time frames are determined by "From" and "To" dates in this format: "2024-06-01T00:00:00Z"

          "query_from": "2024-05-01T00:00:00Z",
          "query_to": "2024-06-01T00:00:00Z"

Labels and Tags

          "label": "Test Label"

As data is received from a Job, the platform can append a Label to allow for additional sorting and analytics. It is a single string. Tags can also be used, which are an array of key-value pairs, but they are not yet supported by all routing Components.

Tags for Egress

Tags can be used to direct the results of a Job to a specific folder when using the Azure Blob Storage Egress, Google Cloud Storage Egress, and Amazon S3 Storage Egress Components. These Egress Components can be configured to use the Tag "value" as the folder name.


          "tags": [
            {
              "name": "default-folder",
              "value": "test"
            }
          ]

Max Documents (Only Available in the API)

Max documents is available for Jobs and acts as a soft cap on the documents ingested. It operates by measuring the number of documents ingested and ceasing additional collection within that Job once the maximum has been reached. Only integer values are supported.

          "max_documents": 10000