Jobs retries and checkpoints best practices

Individual job tasks or even entire job executions can fail for a variety of reasons. This page describes best practices for handling these failures, centered on task restarts and job checkpointing.

Plan for job task restarts

Make your jobs idempotent, so that a task restart does not result in corrupt or duplicate output. That is, write repeatable logic that has the same behavior for a given set of inputs no matter how many times it is repeated or when it is executed.

Write your output to a different location than the input data, leaving input data intact. This way, if the job runs again, the job can repeat the process from the beginning and get the same result.

Avoid duplicating output data by reusing the same unique identifier for each piece of output, or by checking whether the output already exists before writing it. Duplicate data represents collection-level data corruption.
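
As a minimal sketch of this idea with Cloud Storage: derive the object name from the input so every retry of the same task targets the same object, and use a generation precondition so the object is created only once. The bucket and naming scheme here are illustrative assumptions, not part of the original example.

    from google.api_core.exceptions import PreconditionFailed
    from google.cloud import storage

    def write_output_once(bucket_name: str, input_id: str, data: str) -> None:
        """Write output exactly once, keyed by a deterministic identifier."""
        client = storage.Client()
        # Derive the object name from the input so every retry of the same
        # task targets the same object instead of creating a duplicate.
        blob = client.bucket(bucket_name).blob(f"output/{input_id}.txt")
        try:
            # if_generation_match=0 tells Cloud Storage to create the object
            # only if it does not already exist; a retry after a successful
            # write fails the precondition instead of duplicating data.
            blob.upload_from_string(data, if_generation_match=0)
        except PreconditionFailed:
            # Output was already written by an earlier attempt: nothing to do.
            pass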

Use checkpointing

Where possible, checkpoint your jobs so that if a task restarts after a failure, it can pick up where it left off instead of restarting work from the beginning. Doing this speeds up your jobs and minimizes unnecessary costs.

Periodically write partial results and an indication of progress to a persistent storage location such as Cloud Storage or a database. When your task starts, look for those partial results; if any are found, resume processing where the previous attempt left off.
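
A sketch of this load-and-save pattern with Cloud Storage follows; the checkpoint object name and the JSON payload format are assumptions for illustration.

    import json
    from google.cloud import storage

    CHECKPOINT = "checkpoints/progress.json"  # hypothetical object name

    def load_checkpoint(bucket_name: str) -> dict:
        """Return the saved progress, or an empty starting state."""
        blob = storage.Client().bucket(bucket_name).blob(CHECKPOINT)
        if blob.exists():
            return json.loads(blob.download_as_text())
        return {"next_item": 0, "partial_result": None}

    def save_checkpoint(bucket_name: str, state: dict) -> None:
        """Persist progress so a restarted task can resume from here."""
        blob = storage.Client().bucket(bucket_name).blob(CHECKPOINT)
        blob.upload_from_string(json.dumps(state))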

If your job does not lend itself to checkpointing, consider breaking it into smaller chunks and running a larger number of tasks.

Checkpointing example 1: calculating Pi

If you have a job that executes a recursive algorithm, such as calculating Pi to many decimal places, and runs with parallelism set to 1, follow these steps (a code sketch follows the list):

  • Write your progress to a pi-progress.txt Cloud Storage object every 10 minutes, or at whatever interval your tolerance for lost work allows.
  • When a task starts, check for the pi-progress.txt object and load its value as a starting point. Use that value as the initial input to your function.
  • Write your final result to Cloud Storage as an object named pi-complete.txt to avoid duplication through parallel or repeated execution, or as pi-complete-DATE.txt to differentiate results by completion date.
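
Here is a minimal sketch of that flow. The page does not prescribe an algorithm, so a simple Leibniz series stands in for the real computation; the bucket name and the JSON checkpoint format are also illustrative assumptions.

    import json
    import time
    from google.cloud import storage

    BUCKET = "my-pi-job-bucket"  # hypothetical bucket
    CHECKPOINT_SECS = 600        # write progress every 10 minutes

    def run(total_terms: int = 10**9) -> None:
        bucket = storage.Client().bucket(BUCKET)
        progress = bucket.blob("pi-progress.txt")

        # Resume from the last checkpoint if a previous attempt wrote one.
        if progress.exists():
            state = json.loads(progress.download_as_text())
        else:
            state = {"k": 0, "partial_sum": 0.0}

        k, partial_sum = state["k"], state["partial_sum"]
        last_save = time.monotonic()

        while k < total_terms:
            # Leibniz series: pi/4 = 1 - 1/3 + 1/5 - 1/7 + ...
            partial_sum += (-1.0) ** k / (2 * k + 1)
            k += 1
            if time.monotonic() - last_save >= CHECKPOINT_SECS:
                progress.upload_from_string(
                    json.dumps({"k": k, "partial_sum": partial_sum}))
                last_save = time.monotonic()

        # A single, fixed-name completion object means a repeated run
        # overwrites the result rather than duplicating it.
        bucket.blob("pi-complete.txt").upload_from_string(str(4 * partial_sum))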

Checkpointing example 2: processing 10,000 records from Cloud SQL

If you have a job processing 10,000 records in a relational database such as Cloud SQL:

  • Retrieve the records to be processed with a SQL query such as SELECT * FROM example_table LIMIT 10000.
  • Write out updated records in batches of 100 so that significant processing work is not lost if the task is interrupted.
  • When records are written, note which ones have been processed. You might add a boolean processed column to the table that is set to 1 only once processing is confirmed.
  • When a task starts, the query used to retrieve items for processing should add the condition processed = 0.
  • In addition to supporting clean retries, this technique lets you break the work into smaller tasks, for example by modifying your query to select 100 records at a time: LIMIT 100 OFFSET $CLOUD_RUN_TASK_INDEX*100, and running 100 tasks to process all 10,000 records. CLOUD_RUN_TASK_INDEX is a built-in environment variable available inside containers running Cloud Run jobs.

Putting all these pieces together, the final query might look like this: SELECT * FROM example_table WHERE processed = 0 LIMIT 100 OFFSET $CLOUD_RUN_TASK_INDEX*100
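
A sketch of this pattern in Python follows. The connection details are placeholders, and the choice of driver (PyMySQL, assuming a MySQL-flavored Cloud SQL instance) and the id primary-key column are assumptions for illustration.

    import os
    import pymysql  # assumed driver for a MySQL-flavored Cloud SQL instance

    BATCH_SIZE = 100
    TASK_INDEX = int(os.environ.get("CLOUD_RUN_TASK_INDEX", 0))

    def process(record: dict) -> None:
        ...  # your per-record work goes here

    def main() -> None:
        conn = pymysql.connect(
            host="203.0.113.5",  # placeholder connection details
            user="app", password=os.environ["DB_PASSWORD"], database="mydb",
            cursorclass=pymysql.cursors.DictCursor)
        with conn.cursor() as cur:
            # Each task claims its own slice of the unprocessed records.
            cur.execute(
                "SELECT * FROM example_table WHERE processed = 0 "
                "LIMIT %s OFFSET %s", (BATCH_SIZE, TASK_INDEX * BATCH_SIZE))
            records = cur.fetchall()

        done_ids = []
        for record in records:
            process(record)
            done_ids.append(record["id"])  # assumes an id primary key

        if done_ids:
            with conn.cursor() as cur:
                # Mark the batch processed only after the work is confirmed,
                # so an interrupted task leaves its records eligible for retry.
                placeholders = ", ".join(["%s"] * len(done_ids))
                cur.execute(
                    f"UPDATE example_table SET processed = 1 "
                    f"WHERE id IN ({placeholders})", done_ids)
            conn.commit()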

What's next