Refresh structured and unstructured data

This page describes how to refresh the structured and unstructured data in an existing data store.

To refresh data for website apps, see Refresh your web page.

Refresh structured data

You can refresh the data in a structured data store as long as you use a schema that is the same as, or backward compatible with, the schema in the data store. For example, adding only new fields to an existing schema is backward compatible.
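Backward compatibility here means that every field the old schema defined is still present in the new one. A minimal sketch of that check (the field names are hypothetical, for illustration only):

```python
# Hypothetical original schema: every document has "id" and "title".
old_schema_fields = {"id", "title"}

# Backward-compatible update: only a NEW field ("category") is added;
# no existing field is removed or renamed.
new_schema_fields = old_schema_fields | {"category"}

def is_backward_compatible(old_fields, new_fields):
    # The change is backward compatible if every old field
    # still exists in the new schema.
    return old_fields <= new_fields

print(is_backward_compatible(old_schema_fields, new_schema_fields))  # True
print(is_backward_compatible(old_schema_fields, {"id"}))             # False
```

Removing or renaming a field would fail this check, and documents ingested under the old schema would no longer match the new one.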

You can refresh structured data in the Google Cloud console or using the API.

Console

To use the Google Cloud console to refresh structured data from a branch of a data store, follow these steps:

  1. In the Google Cloud console, go to the Agent Builder page.


  2. In the navigation menu, click Data stores.

  3. In the Name column, click the data store that you want to edit.

  4. On the Documents tab, click Import data.

  5. To refresh from Cloud Storage:

    1. In the Select a data source pane, select Cloud Storage.
    2. In the Import data from Cloud Storage pane, click Browse, select the bucket that contains your refreshed data, and then click Select. Alternatively, enter the bucket location directly in the gs:// field.
    3. Under Data Import Options, select an import option.
    4. Click Import.
  6. To refresh from BigQuery:

    1. In the Select a data source pane, select BigQuery.
    2. In the Import data from BigQuery pane, click Browse, select a table that contains your refreshed data, and then click Select. Alternatively, enter the table location directly in the BigQuery path field.
    3. Under Data Import Options, select an import option.
    4. Click Import.

REST

Use the documents.import method to refresh your data, specifying the appropriate reconciliationMode value.

To refresh structured data from BigQuery using the command line, follow these steps:

  1. Find your data store ID. If you already have your data store ID, skip to the next step.

    1. In the Google Cloud console, go to the Agent Builder page and in the navigation menu, click Data stores.


    2. Click the name of your data store.

    3. On the Data page for your data store, get the data store ID.

  2. Import your structured data from BigQuery.

    curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/branches/0/documents:import" \
    -d '{
      "bigquerySource": {
        "projectId": "PROJECT_ID",
        "datasetId": "DATASET_ID",
        "tableId": "TABLE_ID",
        "dataSchema": "DATA_SCHEMA"
      },
      "reconciliationMode": "RECONCILIATION_MODE",
      "autoGenerateIds": AUTO_GENERATE_IDS,
      "idField": "ID_FIELD",
      "errorConfig": {
        "gcsPrefix": "ERROR_DIRECTORY"
      }
    }'
    
    • PROJECT_ID: The ID of your project.
    • DATA_STORE_ID: The ID of your data store.
    • DATASET_ID: The name of your BigQuery dataset.
    • TABLE_ID: The name of your BigQuery table.
    • DATA_SCHEMA: Optional. Values are document and custom. The default is document.
      • If you specify document, the BigQuery table that you use must conform to the following default BigQuery schema. You can define the ID of each document yourself, while wrapping all the data in the jsonData string.
      • If you specify custom, any BigQuery table schema is accepted, and Vertex AI Agent Builder automatically generates the IDs for each document that is imported.
    • ERROR_DIRECTORY: Optional. A Cloud Storage directory for error information about the import—for example, gs://<your-gcs-bucket>/directory/import_errors. Google recommends leaving this field empty to let Vertex AI Agent Builder automatically create a temporary directory.
    • RECONCILIATION_MODE: Optional. Values are FULL and INCREMENTAL. Default is INCREMENTAL. Specifying INCREMENTAL causes an incremental refresh of data from BigQuery to your data store. This does an upsert operation, which adds new documents and replaces existing documents with updated documents with the same ID. Specifying FULL causes a full rebase of the documents in your data store. In other words, new and updated documents are added to your data store, and documents that are not in BigQuery are removed from your data store. The FULL mode is helpful if you want to automatically delete documents that you no longer need.
    • AUTO_GENERATE_IDS: Optional. Specifies whether to automatically generate document IDs. If set to true, document IDs are generated based on a hash of the payload. Note that generated document IDs might not remain consistent over multiple imports. If you auto-generate IDs over multiple imports, Google highly recommends setting reconciliationMode to FULL to maintain consistent document IDs.

      Specify autoGenerateIds only when bigquerySource.dataSchema is set to custom. Otherwise an INVALID_ARGUMENT error is returned. If you don't specify autoGenerateIds or set it to false, you must specify idField. Otherwise the documents fail to import.

    • ID_FIELD: Optional. Specifies which fields are the document IDs. For BigQuery source files, idField indicates the name of the column in the BigQuery table that contains the document IDs.

      Specify idField only when: (1) bigquerySource.dataSchema is set to custom, and (2) autoGenerateIds is set to false or is unspecified. Otherwise an INVALID_ARGUMENT error is returned.

      Note that the values in the BigQuery column must be of string type, must be between 1 and 63 characters, and must conform to RFC 1034. Otherwise, the documents fail to import.
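The constraints above can be collected into a small sketch that validates a request body before you send it. The helper name build_import_body is illustrative, not part of the API; it only encodes the rules stated in this section.

```python
def build_import_body(project_id, dataset_id, table_id,
                      data_schema="document",
                      reconciliation_mode="INCREMENTAL",
                      auto_generate_ids=None, id_field=None):
    """Build a documents:import request body, enforcing the documented rules."""
    # autoGenerateIds is only valid when dataSchema is "custom".
    if auto_generate_ids is not None and data_schema != "custom":
        raise ValueError("autoGenerateIds requires dataSchema 'custom' (INVALID_ARGUMENT)")
    # idField is only valid with dataSchema "custom" and autoGenerateIds false/unset.
    if id_field is not None and (data_schema != "custom" or auto_generate_ids):
        raise ValueError("idField requires dataSchema 'custom' and autoGenerateIds false or unset")
    # With a custom schema, documents need IDs from one source or the other.
    if data_schema == "custom" and not auto_generate_ids and id_field is None:
        raise ValueError("with dataSchema 'custom', set autoGenerateIds or idField")

    body = {
        "bigquerySource": {
            "projectId": project_id,
            "datasetId": dataset_id,
            "tableId": table_id,
            "dataSchema": data_schema,
        },
        "reconciliationMode": reconciliation_mode,
    }
    if auto_generate_ids is not None:
        body["autoGenerateIds"] = auto_generate_ids
    if id_field is not None:
        body["idField"] = id_field
    return body
```

For example, `build_import_body("my-project", "my_dataset", "my_table", data_schema="custom", id_field="doc_id")` passes validation, while passing autoGenerateIds with the default document schema raises the same kind of error the API would reject with.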

    Here is the default BigQuery schema. Your BigQuery table must conform to this schema when you set dataSchema to document.

    [
     {
       "name": "id",
       "mode": "REQUIRED",
       "type": "STRING",
       "fields": []
     },
     {
       "name": "jsonData",
       "mode": "NULLABLE",
       "type": "STRING",
       "fields": []
     }
    ]
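Under the document schema, you supply the document ID yourself and wrap all other data in the jsonData string. A minimal sketch of shaping a record into such a row (the record contents are made up for illustration):

```python
import json

def to_default_schema_row(doc_id, data):
    # Wrap all document data in the jsonData string, matching the default
    # BigQuery schema: id (STRING, REQUIRED) and jsonData (STRING, NULLABLE).
    return {"id": doc_id, "jsonData": json.dumps(data)}

row = to_default_schema_row("doc-001", {"title": "Example", "category": "demo"})
print(row["id"])  # doc-001
```

Each such row becomes one document in the data store, keyed by the id column.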
    

Refresh unstructured data

You can refresh unstructured data in the Google Cloud console or using the API.

Console

To use the Google Cloud console to refresh unstructured data from a branch of a data store, follow these steps:

  1. In the Google Cloud console, go to the Agent Builder page.


  2. In the navigation menu, click Data stores.

  3. In the Name column, click the data store that you want to edit.

  4. On the Documents tab, click Import data.

  5. To ingest from a Cloud Storage bucket (with or without metadata):

    1. In the Select a data source pane, select Cloud Storage.
    2. In the Import data from Cloud Storage pane, click Browse, select the bucket that contains your refreshed data, and then click Select. Alternatively, enter the bucket location directly in the gs:// field.
    3. Under Data Import Options, select an import option.
    4. Click Import.
  6. To ingest from BigQuery:

    1. In the Select a data source pane, select BigQuery.
    2. In the Import data from BigQuery pane, click Browse, select a table that contains your refreshed data, and then click Select. Alternatively, enter the table location directly in the BigQuery path field.
    3. Under Data Import Options, select an import option.
    4. Click Import.

REST

To refresh unstructured data using the API, re-import it using the documents.import method, specifying the appropriate reconciliationMode value. For more information about importing unstructured data, see Unstructured data.
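As a sketch, the request body parallels the structured case but reads from Cloud Storage. The bucket path below is a placeholder, and the field names assume the v1beta documents.import request shape with a gcsSource:

```python
import json

# Sketch of a documents:import request body for unstructured data held in
# Cloud Storage. "content" treats each matched file as an unstructured
# document, rather than as structured records.
body = {
    "gcsSource": {
        "inputUris": ["gs://BUCKET_NAME/directory/*"],
        "dataSchema": "content",
    },
    # INCREMENTAL upserts new and changed documents; FULL also removes
    # documents that are no longer present in the source.
    "reconciliationMode": "INCREMENTAL",
}
print(json.dumps(body, indent=2))
```

Send this body to the same documents:import endpoint shown in the structured-data example, substituting your project and data store IDs.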