Generate data profile insights with scans

This document explains how to use BigQuery and Dataplex together to better understand your data. BigQuery uses Dataplex to analyze the statistical characteristics of your data, such as average values, unique values, and maximum values. Dataplex also uses this information to recommend rules for data quality checks.

For more information about data profiling, see About data profiling.

Before you begin

  • For creating and modifying scans in your project: Enable the Dataplex API.

    Enable the Dataplex API

  • For cross-project scans: Create a Dataplex service identifier with the gcloud beta services identity create command. If a Dataplex service identifier does not exist, this command returns a new one. If a service identifier already exists, the command returns the existing one. This command might prompt you to install the gcloud CLI beta commands component.

    gcloud beta services identity create
    --service=dataplex.googleapis.com
    

Required roles

Ask your administrator to grant the following roles to the appropriate account principals based on the use cases they need access to. For more information about granting roles, see Manage access.

BigQuery roles

  • BigQuery Data Viewer on a table to create a scan on that table without publishing the results.
  • BigQuery Data Editor on a table to create a scan on that table with publishing.
  • If the BigQuery table and the data profile scan are in different projects, then you need to give the associated principal or Dataplex service account the read permission bigquery.tables.getData (or the role BigQuery Data Viewer) on the corresponding BigQuery table. To get the service identity for a service account, see Before you begin.
  • If you are scanning a BigQuery external table from Cloud Storage, then assign the Cloud Storage role (roles/storage.objectViewer) to the Dataplex service account.

Dataplex roles

  • Dataplex DataScan Administrator on the project level - to create scans.
  • Dataplex DataScan Editor on a scan - to edit any properties of a scan (except permissions), execute the scan, and delete the scan.
  • Dataplex DataScan DataViewer on a scan to view results of a scan.

These roles contain the permissions needed for the prior use cases. To see the exact permissions that are required, expand the Required permissions section.

Required permissions

The following permissions are required to use various aspects of data profile scans:

  • To change the configuration of a data scan: dataplex.datascans.update - on the datascan resource
  • To change the policy of a data scan: dataplex.datascans.setIamPolicy - on the datascan resource
  • To create a data scan on a BigQuery table: bigquery.tables.getData - the table to scan
  • To create data scans in a project: dataplex.datascans.create - on the project
  • To delete a data scan: dataplex.datascans.delete - on the datascan resource
  • To export the data scan results to a BigQuery dataset: bigquery.datasets.get, bigquery.tables.create, bigquery.tables.get, bigquery.tables.update, bigquery.tables.updateData - the destination dataset
  • To publish the results of a data scan to a table: bigquery.tables.update - the destination table
  • To run a data scan: dataplex.datascans.run - on the datascan resource
  • To scan an external table from Cloud Storage: storage.buckets.get, storage.objects.get - the bucket containing the tables to scan
  • To view the results of a data scan: dataplex.datascans.getData - on the data scan resource
  • To view the results of a data scan: dataplex.datascans.get - on the datascan resource
  • To view the results of a data scan: dataplex.datascans.list - on the datascan resource

You might also be able to get these permissions with custom roles or other predefined roles.

Create a data profile scan

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, click a table for the data profile scan.

  3. Click the Data profile tab.

  4. Click Data profile scan > Create new scan.

  5. Optional: Edit the following values:

    • Display name: the mutable name of the resource in the console.
    • ID: a unique identifier for the scan. It cannot be changed after the scan is created.
    • Description: a description of the scan.
    • Region: defines the region that the data scan is processed in.
    • Scope: the range of data available for scanning. Select either Incremental or Entire data. If you choose Incremental, we recommend including a DATE or TIMESTAMP column that increases linearly. This column can be used to identify new records. For tables partitioned on columns of type DATE or TIMESTAMP, it's recommended to use the partition column as the timestamp field.
    • Filters: filters to apply to the data before the scan runs. You can select Filter rows, Filter columns, or both.
      • To filter rows, select the Filter rows checkbox and enter a valid SQL expression in the input text field. The expression must be in BigQuery standard SQL syntax and can be used in a WHERE clause.
      • To filter columns, select the Filter columns checkbox and fill in the Include columns field, the Exclude columns field, or both.
    • Sampling size: the percentage of data that you want to sample. For incremental data scans, only the latest increment is sampled.
    • Publish results to the BigQuery and Dataplex Catalog UI: this option makes the latest results of the data profiling scan available in the BigQuery UI, under the Data profile tab for the source table. If a scan is running and set to publish, this option might be unavailable.
    • Schedule: either On demand (default) or Repeat. If you select Repeat, specify the frequency of the scheduled scan with Daily, Weekly, Monthly, or Custom. Custom uses the cron time format to specify the schedule. For example, a scan set to run on the second Tuesday of the month at 1:00 AM would look like this: 0 1 8-14 * 2.
  6. Optional: To advance the pane to show additional optional settings, click Continue and edit the following values:

    • Export scan results to BigQuery table: select a BigQuery dataset and a table to save the results of the profile scan. If a dataset is defined but no table is defined, Dataplex creates a table for you. Tables created in this manner might incur storage costs.
    • Labels: add a label to the scan.
  7. Click one of the following buttons based on your needs:

    • To save the scan settings, click Create.
    • To save and run the scan, click Run.

Manage data profile scan permissions

To change the access permissions of existing profile scans, do the following:

  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, select a table for the data profile scan.

  3. Click the Data profile tab.

  4. Click Data profile scan > Manage scan permissions. This opens Dataplex in a new tab.

  5. Click the Permissions tab.

    • To grant access to a principal, click Grant access and grant Dataplex DataScan DataViewer to the associated principal.
    • To remove access from a principal, click Remove access and remove Dataplex DataScan DataViewer from the associated principal.

Edit an existing data profile scan

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, select a table for the data profile scan.

  3. Click Data profile scan > Edit scan configuration.

This opens the data profile scan settings, which can be modified and saved for future scans.

View data profile scan results

There are multiple ways to view data profile scan results. Select the option that best suits your needs.

View published results

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, select a table for the data profile scan.

  3. Click the Data profile tab.

The latest published results are shown in this view.

View historical scan results

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, select a table for the data profile scan.

  3. Click the Data profile tab.

  4. Click Data profile scan > View historical results.

View all data profile scans on a table

To open Dataplex with a scan history for a specific table, do the following:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, select a table for the data profile scan.

  3. Click Data profile scan > View all scans.