Extraction API

An overview of what the Extraction API offers within Datastreamer.

Background & Prerequisites

Before running an extraction with the Extraction API, we suggest testing and optimizing your query with the Search API.

  • Search API: Run the query over a shorter timeframe to confirm that it resolves and returns the desired response.
  • Count API: Extraction is billed by the volume of extracted content. Use the Count API to see the total number of results before you extract.

🚧

Available Use

Extraction is only available for Stream Integrated partner sources or your own data sources that use Datastreamer for storage.

Endpoint Details

| Endpoint | Type | Description | Status |
|---|---|---|---|
| /extraction/query | POST | Retrieves all results matching the provided parameters as a compressed file. | Live |
| /extraction/jobs | GET | Lists all jobs that are in progress, failed, or completed in the last 30 days. | Live |
| /extraction/jobs/{jobId}/progress | GET | Shows the progress of a specific job. | Live |
| /extraction/jobs/{jobId}/cancel | POST | Cancels the extraction job. | Live |
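
As a rough illustration of how the four endpoints fit together, the sketch below builds (but does not send) one request per endpoint using Python's standard library. The `BASE_URL` value is a placeholder, not the real host, and this section does not describe authentication, so any headers are left to the caller. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: placeholder host. Replace with the base URL for your account.
BASE_URL = "https://api.example.com"


def _build(path, method, body=None, headers=None):
    """Build a urllib Request without sending it."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    # Authentication is not covered in this section; pass it in via `headers`.
    all_headers = dict(headers or {})
    if data is not None:
        all_headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=all_headers, method=method
    )


def start_extraction(query_body, headers=None):
    # POST /extraction/query
    return _build("/extraction/query", "POST", query_body, headers)


def list_jobs(headers=None):
    # GET /extraction/jobs
    return _build("/extraction/jobs", "GET", headers=headers)


def job_progress(job_id, headers=None):
    # GET /extraction/jobs/{jobId}/progress
    safe_id = urllib.parse.quote(str(job_id), safe="")
    return _build(f"/extraction/jobs/{safe_id}/progress", "GET", headers=headers)


def cancel_job(job_id, headers=None):
    # POST /extraction/jobs/{jobId}/cancel
    safe_id = urllib.parse.quote(str(job_id), safe="")
    return _build(f"/extraction/jobs/{safe_id}/cancel", "POST", headers=headers)
```

Each helper returns a `urllib.request.Request`; passing it to `urllib.request.urlopen` would send it once the real host and credentials are filled in.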

Request Object Details

A sample request body looks like this:

{
  "query": {
    "query": "New York",
    "data_sources": [
         "wsl_twitter"
    ]
  }
}

The data_sources and query fields in the JSON request body filter the results to your requirements. The query field uses Apache Lucene-based syntax, and its format and content are the same as in the Search API.

📘

Ignored Fields

The "from" and "size" fields are handled automatically, so they are ignored if present in the request.
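
Because the API ignores these fields, removing them is optional. If you reuse a Search API request body and want the extraction request to show only what takes effect, a small helper along these lines may help. This is a minimal sketch: the helper name is hypothetical, and it assumes the fields could appear either at the top level or inside the "query" object, which this section does not specify.

```python
import copy

# Fields the Extraction API handles automatically.
IGNORED_FIELDS = ("from", "size")


def prepare_extraction_body(search_body):
    """Return a copy of a Search-style body without the ignored paging fields."""
    body = copy.deepcopy(search_body)  # leave the caller's dict untouched
    # Assumption: check both the top level and the nested "query" object.
    for container in (body, body.get("query")):
        if isinstance(container, dict):
            for field in IGNORED_FIELDS:
                container.pop(field, None)
    return body
```

For example, a body containing `"from": 0` and `"size": 100` alongside the sample query above comes back with only `query` and `data_sources` left inside the "query" object.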

Response Details

Based on the query and the data_sources provided in the request body, the Extraction API returns all matching documents available at those sources.

Unlike other APIs, the Extraction API places the results in compressed files on accessible storage. You can access the results in zipped JSON format under the job ID given when you launch the query.

🚧

Bucket Accessibility

If you do not have a Cloud Storage Bucket established, conducting an extraction job will create one. We suggest reaching out to Datastreamer to set up a Cloud Storage Bucket before conducting any extraction jobs.

Possible job statuses

  • Unknown - default state - you should not see this
  • InProgress - job is currently being executed
  • Complete - job has completed successfully
  • Failed - job has failed after retry
  • NotStarted - job has yet to be executed
  • Abandoned - job has been cancelled
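
The statuses above can be modelled as an enum with a simple polling loop, sketched below. Treating Complete, Failed, and Abandoned as the states where polling stops is an interpretation of the list, not something the API documents explicitly. The `fetch_status` callable is hypothetical: it stands in for whatever code calls the progress endpoint and returns the status string, since the shape of that response is not described here.

```python
import time
from enum import Enum


class JobStatus(Enum):
    UNKNOWN = "Unknown"
    IN_PROGRESS = "InProgress"
    COMPLETE = "Complete"
    FAILED = "Failed"
    NOT_STARTED = "NotStarted"
    ABANDONED = "Abandoned"


# Assumption: these three states are final, so polling can stop.
TERMINAL_STATUSES = frozenset(
    {JobStatus.COMPLETE, JobStatus.FAILED, JobStatus.ABANDONED}
)


def wait_for_job(fetch_status, poll_seconds=30.0, max_polls=120, sleep=time.sleep):
    """Poll until the job reaches a terminal status or max_polls is exhausted.

    fetch_status: caller-supplied function returning the status string.
    """
    for attempt in range(max_polls):
        status = JobStatus(fetch_status())  # raises ValueError on unknown strings
        if status in TERMINAL_STATUSES:
            return status
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    raise TimeoutError(f"job still not finished after {max_polls} polls")
```

The polling interval and retry limit are arbitrary defaults; tune them to the expected size of the extraction.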