Bulk upload with the Cloud Storage ingest pipeline

This document describes how to perform a bulk upload, which triggers the Cloud Storage ingest pipeline behind the scenes.

Preprocessing options

Bulk upload currently provides three preprocessing options:

  1. Bulk upload without preprocessing: This triggers the runPipeline API with GcsIngestPipeline and ingests the documents without processing them with Document AI processors (see the sketch after this list).

  2. Extract entities with Document AI processors: This triggers the runPipeline API with GcsIngestWithDocAiProcessorsPipeline. The pipeline first calls the given Document AI processor, and then ingests the documents together with the processed results.

  3. Classify document types and extract entities for each type: This also triggers the runPipeline API with GcsIngestWithDocAiProcessorsPipeline, which first calls a classifier. For each document type, you can then specify a corresponding schema and processor; the documents are ingested together with the extraction results and assigned the specified schema.
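
If you want to trigger the first option programmatically rather than through the UI, the following is a minimal sketch using the Python client library. It assumes the v1 google-cloud-contentwarehouse client; the project number, location, and bucket path are placeholder values.

```python
from google.cloud import contentwarehouse_v1

def run_gcs_ingest(project_number: str, location: str, gcs_prefix: str):
    """Triggers the runPipeline API with a plain GcsIngestPipeline (option 1)."""
    client = contentwarehouse_v1.PipelineServiceClient()
    request = contentwarehouse_v1.RunPipelineRequest(
        name=f"projects/{project_number}/locations/{location}",
        gcs_ingest_pipeline=contentwarehouse_v1.GcsIngestPipeline(
            # All documents under this prefix are ingested as-is.
            input_path=gcs_prefix,  # e.g. "gs://my-bucket/contracts/"
        ),
    )
    # runPipeline is asynchronous; it returns a long-running operation.
    return client.run_pipeline(request=request)
```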

Each of the preprocessing types corresponds to an option in the bulk upload view in the UI.

Example: Trigger bulk upload with an OCR processor

This example illustrates the second preprocessing option.

Create an OCR processor and get processor ID

If you have already created an OCR processor, find it in the processor list, open the processor's details page, and copy the processor ID.

If you have not created one, follow these steps:

  1. At the top of the processor list, click the Processor Gallery.

  2. Find the Document OCR processor in the gallery, and at the bottom of its card, click Create Processor.

  3. Enter a processor display name.

  4. Click Create. When you're redirected to the Processor Details page, find the processor ID.

    This ID is what you copy into the input fields in the bulk upload view. To create the processor programmatically instead, see the sketch below.
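
If you prefer to create the processor programmatically, the following is a minimal sketch using the Document AI Python client; the project ID, location, and display name are placeholder values.

```python
from google.cloud import documentai_v1

def create_ocr_processor(project_id: str, location: str, display_name: str):
    """Creates a Document OCR processor and prints its resource name."""
    client = documentai_v1.DocumentProcessorServiceClient()
    parent = client.common_location_path(project_id, location)
    processor = client.create_processor(
        parent=parent,
        processor=documentai_v1.Processor(
            type_="OCR_PROCESSOR",    # the Document OCR processor type
            display_name=display_name,
        ),
    )
    # The trailing segment of processor.name is the processor ID to copy
    # into the bulk upload view, e.g. ".../processors/<processor-id>".
    print(processor.name)
    return processor
```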

Trigger bulk upload

  1. Open the bulk upload view.

    Next to Add New, click Bulk Upload.

  2. Find the correct processor.

    1. Select the second preprocessing option.

    2. Choose a schema and specify a processor and Cloud Storage bucket path for saving the extraction results in JSON format.

  3. Find the processor ID through the link in the description text.

  4. Trigger upload:

    1. Using the processor ID copied in the previous step, fill in the input fields. The source file path can be a bucket, or a folder or subfolder within a bucket.

    2. When the input fields are valid, click Upload at the top right to trigger the bulk upload. For a programmatic equivalent, see the sketch after these steps.
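
The UI steps above map to a single API call. The following is a minimal sketch of the second option, assuming the v1 google-cloud-contentwarehouse client; every resource name and path below is a placeholder.

```python
from google.cloud import contentwarehouse_v1

def run_ingest_with_ocr(project_number: str, location: str, gcs_prefix: str,
                        processor_name: str, schema_name: str,
                        results_folder: str):
    """Triggers runPipeline with GcsIngestWithDocAiProcessorsPipeline (option 2)."""
    client = contentwarehouse_v1.PipelineServiceClient()
    pipeline = contentwarehouse_v1.GcsIngestWithDocAiProcessorsPipeline(
        input_path=gcs_prefix,  # bucket, folder, or subfolder
        extract_processor_infos=[
            contentwarehouse_v1.ProcessorInfo(
                processor_name=processor_name,  # full OCR processor resource name
                schema_name=schema_name,        # schema to attach to the documents
            )
        ],
        # Bucket path where the extraction results are saved in JSON format.
        processor_results_folder_path=results_folder,
    )
    request = contentwarehouse_v1.RunPipelineRequest(
        name=f"projects/{project_number}/locations/{location}",
        gcs_ingest_with_doc_ai_processors_pipeline=pipeline,
    )
    return client.run_pipeline(request=request)
```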

Check progress on the status page

After the bulk upload is triggered, you are redirected to the status tracking page.

The first table shows pending and processed documents. After a document is ingested, it no longer appears in the first table. Documents that failed to upload appear in the second table. On the right, the statistics show the number of ingested, failed, and pending documents.
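
Because runPipeline returns a long-running operation, you can also poll for completion programmatically instead of watching the status page. A minimal sketch, reusing the run_ingest_with_ocr helper from the earlier sketch with placeholder values:

```python
operation = run_ingest_with_ocr(
    project_number="123456789012",  # placeholder values throughout
    location="us",
    gcs_prefix="gs://my-bucket/docs/",
    processor_name="projects/123456789012/locations/us/processors/abc123",
    schema_name="projects/123456789012/locations/us/documentSchemas/xyz789",
    results_folder="gs://my-bucket/ocr-results/",
)

# result() blocks until the pipeline finishes and raises if it failed.
response = operation.result(timeout=3600)
print("Pipeline finished:", response)
```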

After the job is complete, the status page shows 100% complete with no pending documents.

Examine the uploaded documents

  1. Find the newly ingested documents by going back to the search view. Click the Document AI Warehouse logo or Search on the top navigation bar.

  2. Open any of the newly ingested documents by clicking the document name. In the document viewer, you can open the AI View.

  3. Go to the Text block tab. The OCR results are stored in the document; the sketch after these steps shows how to read the same text through the API.

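You can also verify the stored OCR text through the API. A minimal sketch, assuming the v1 google-cloud-contentwarehouse client, a placeholder document resource name, and a placeholder user ID for the access-control metadata that Document AI Warehouse expects:

```python
from google.cloud import contentwarehouse_v1

def print_document_text(document_name: str):
    """Fetches an ingested document and prints its OCR text."""
    client = contentwarehouse_v1.DocumentServiceClient()
    document = client.get_document(
        request=contentwarehouse_v1.GetDocumentRequest(
            name=document_name,  # e.g. ".../locations/us/documents/<doc-id>"
            # Document AI Warehouse checks access against this user.
            request_metadata=contentwarehouse_v1.RequestMetadata(
                user_info=contentwarehouse_v1.UserInfo(
                    id="user:someone@example.com",
                ),
            ),
        )
    )
    # cloud_ai_document holds the full Document AI result, including the
    # text shown on the Text block tab.
    print(document.cloud_ai_document.text)
```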

Next step

Update existing documents with the extract with Document AI pipeline.