Job Failure Handling

This page describes how to handle Job failures whose root cause is transient, such as a service outage.

How to handle failures with transient root causes

When a Job fails, the Datastreamer platform offers two modes for handling work item failures.

  • "Restart" restarts the job work item from the beginning.
  • "Recover" resumes the job work item from the point where it failed.

📘

What is a Job?

A Job is a collection of work items that are executed asynchronously, so each work item can succeed or fail independently.

When (and how) to "Recover" your failed Job

A "Recover" continues a failed work item from the point where the failure occurred.

📘

Tip

Using this mode when the document ingestion has distinct collection and ingestion phases (e.g., Bright Data) can help prevent re-collecting data that was already collected.


When (and how) to "Restart" your failed Job

A "Restart" starts the work item from the beginning, collecting and processing all documents, including any that were previously collected.

It is best to use "Restart" if your Jobs may be broken down into many parts and continuing from the point of failure could cause a loss of data due to changing parameters.


How to Restart or Recover from the Jobs Page

Retry a job in the portal UI by either selecting the Retry option from the menu, or selecting the failed work item and clicking Retry.



How to Restart or Recover using the REST API

The following request can be used to retry failed work items. The type query parameter can be either restart or recover.

curl --request PUT \
  --url 'https://api.platform.datastreamer.io/api/pipelines/work-items/retry?type=restart' \
  --header 'Accept: application/json' \
  --header 'Content-Type: application/json' \
  --header 'apikey: ***' \
  --data '{
    "work_item_ids": [
        "<work-item-id1>",
        "<work-item-id2>",
        "<work-item-id3>",
        ...
    ]
}'
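
When the list of failed work items comes from a script rather than being typed by hand, it can help to wrap the request above in a few small shell functions. The following is only a minimal sketch: the function names and the DATASTREAMER_API_KEY environment variable are assumptions made for illustration and are not part of the Datastreamer platform. It also assumes work item IDs contain no quotes or backslashes, since no JSON escaping is performed.

```shell
#!/bin/sh
# Hypothetical helpers; none of these names come from the Datastreamer API.

# Print the retry URL for a mode, rejecting anything other than restart/recover.
retry_url() {
  case "$1" in
    restart|recover)
      printf '%s\n' "https://api.platform.datastreamer.io/api/pipelines/work-items/retry?type=$1"
      ;;
    *)
      echo "type must be 'restart' or 'recover'" >&2
      return 1
      ;;
  esac
}

# Build the JSON request body from the work item IDs given as arguments.
# Assumes IDs need no JSON escaping (no quotes or backslashes).
build_retry_payload() {
  ids=""
  for id in "$@"; do
    ids="${ids:+$ids,}\"$id\""
  done
  printf '{"work_item_ids":[%s]}\n' "$ids"
}

# Send the retry request: first argument is the mode, the rest are IDs.
# DATASTREAMER_API_KEY is an assumed variable name holding your API key.
retry_work_items() {
  mode="$1"; shift
  url="$(retry_url "$mode")" || return 1
  curl --silent --show-error --request PUT \
    --url "$url" \
    --header 'Accept: application/json' \
    --header 'Content-Type: application/json' \
    --header "apikey: ${DATASTREAMER_API_KEY}" \
    --data "$(build_retry_payload "$@")"
}
```

With these definitions loaded, a recover request for two work items would look like: retry_work_items recover "<work-item-id1>" "<work-item-id2>". Validating the mode before calling curl avoids sending a request with a misspelled type value.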

A successful request returns the number of work items for which a retry was started.

{
	"total": 1
}
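
Because the response only reports a count, a script may want to compare that count with the number of IDs it submitted. The sketch below is one way to do so, under stated assumptions: the function names are hypothetical, the response is assumed to have the simple shape shown above, and a lower total is assumed to mean that some retries were not started. It uses sed instead of a JSON parser to avoid extra dependencies, so it is not suitable for more complex responses.

```shell
#!/bin/sh
# Hypothetical helper: read a retry response on stdin and print its "total" value.
# Relies on the simple response shape shown above rather than a real JSON parser.
retry_total() {
  sed -n 's/.*"total"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Compare the number of retries started with the number of IDs submitted.
# Arguments: <response-json> <submitted-count>
check_retry_response() {
  started="$(printf '%s' "$1" | retry_total)"
  if [ -z "$started" ]; then
    echo "could not read total from response" >&2
    return 2
  fi
  if [ "$started" -ne "$2" ]; then
    echo "retry started for $started of $2 work items" >&2
    return 1
  fi
  echo "retry started for all $2 work items"
}
```

For example, if the variable response holds the body returned by the retry request and three IDs were submitted, check_retry_response "$response" 3 exits with a non-zero status when the reported total is not 3.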