Export data as Protobuf columns

This document describes how you can export BigQuery data as Protocol Buffers (Protobuf) columns by using BigQuery user-defined functions (UDFs).

When to use Protobuf columns

BigQuery offers a number of built-in functions to format selected data. One option is to merge multiple column values into a single Protobuf value, which has the following benefits:

  • Object type safety.
  • Improved compression, data transfer time, and cost as compared with JSON.
  • Flexibility, as most programming languages have libraries to handle Protobuf.
  • Less overhead when reading from multiple columns and building a single object.

While other column types can also provide type safety, using Protobuf columns provides a fully typed object, which can reduce the amount of work that needs to be done on the application layer or on another part of the pipeline.

However, there are limitations to exporting BigQuery data as Protobuf columns:

  • Protobuf columns are not well indexed or filtered. Searching by the content of the Protobuf columns can be less effective.
  • Sorting data in Protobuf format can be difficult.

If these limitations apply to your export workflow, you might consider other methods of exporting BigQuery data:

  • Use Dataflow to export BigQuery data in either Avro or CSV file formats.
  • Use scheduled queries to sort your exported BigQuery data by date or time, and to schedule exports on a recurring basis.

Required roles

To get the permissions that you need to export BigQuery data as Protobuf columns, ask your administrator to grant you the following IAM roles on your project:

For more information about granting roles, see Manage access.

You might also be able to get the required permissions through custom roles or other predefined roles.

Create a UDF

The following procedure shows how you can create a user-defined function that converts a BigQuery STRUCT data type into a Protobuf column:

  1. In a command line, clone the bigquery-utils.git repository:

    $ git clone https://github.com/GoogleCloudPlatform/bigquery-utils.git
    
  2. Navigate into the Protobuf export folder

    $ cd bigquery-utils/tools/protobuf_export
    
  3. Add your proto files to the ./protos folder.

  4. Install the necessary packages from the GitHub repository:

    $ npm install
    
  5. Bundle the package by using webpack:

    $ npx webpack --config webpack.config.js --stats-error-details
    
  6. Locate the pbwrapper.js file in your dist/ folder and copy the file to a Cloud Storage bucket.

  7. Create a UDF that builds a Protobuf column from your existing BigQuery columns:

    CREATE FUNCTION
      DATASET_ID.toMyProtoMessage(input STRUCT<field_1 TYPE1, field_2 TYPE2>)
      RETURNS BYTES
        LANGUAGE js OPTIONS ( library=["gs://BUCKET_NAME/pbwrapper.js"]
    ) AS r"""
    let message = pbwrapper.setup("PROTO_PACKAGE.PROTO_MESSAGE")
    return pbwrapper.parse(message, input)
      """;
    

    Replace the following:

    • DATASET_ID: the ID of the dataset where you store your function
    • BUCKET_NAME: the name of your Cloud Storage bucket
    • PROTO_PACKAGE: the package name for your proto file
    • PROTO_MESSAGE: the message type for your proto file

For more information about using packages in the proto language, see Packages.

Format columns as Protobuf values

  • After you create the UDF, run the function:

    SELECT
      DATASET_ID.toMyProtoMessage(STRUCT(COLUMN_TYPE1, COLUMN_TYPE2)) AS protoResult
    FROM
      `DATASET_NAME`
    LIMIT
      100;
    

    Replace the following:

    • DATASET_ID: the ID of the dataset where you store your function
    • DATASET_NAME: the name your dataset—for example, dataset_name.table_name
    • COLUMN_TYPE1: the name of a column. Columns can include any supported scalar value type or non-scalar type, including ARRAY and STRUCT
    • COLUMN_TYPE2: the name of a column. Columns can include any supported scalar value type or non-scalar type, including ARRAY and STRUCT

Work with Protobuf values

With your BigQuery data exported in the Protobuf format, you can now work with your data as a fully typed object or struct.

The following code samples provide several examples of ways that you can process or work with your exported data:

Go

// package Main queries Google BigQuery.
package main

import (
	"context"
	"fmt"
	"io"
	"log"
	"os"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/iterator"
	"google.golang.org/Protobuf/proto"

	pb "path/to/proto/file_proto"
)

const (
	projectID = "your-project-id"
)

// Row contains returned row data from bigquery.
type Row struct {
	RowKey string `bigquery:"RowKey"`
	Proto  []byte `bigquery:"ProtoResult"`
}

func main() {
	ctx := context.Background()

	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		log.Fatalf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	rows, err := query(ctx, client)
	if err != nil {
		log.Fatal(err)
	}
	if err := printResults(os.Stdout, rows); err != nil {
		log.Fatal(err)
	}
}

// query returns a row iterator suitable for reading query results.
func query(ctx context.Context, client *bigquery.Client) (*bigquery.RowIterator, error) {

	query := client.Query(
		`SELECT 
  concat(word, ":", corpus) as RowKey, 
  <dataset-id>.toMyProtoMessage(
    STRUCT(
      word, 
      CAST(word_count AS BIGNUMERIC)
    )
  ) AS ProtoResult 
FROM 
  ` + "` bigquery - public - data.samples.shakespeare `" + ` 
LIMIT 
  100;
`)
	return query.Read(ctx)
}

// printResults prints results from a query.
func printResults(w io.Writer, iter *bigquery.RowIterator) error {
	for {
		var row Row
		err := iter.Next(&row)
		if err == iterator.Done {
			return nil
		}
		if err != nil {
			return fmt.Errorf("error iterating through results: %w", err)
		}
		message := &pb.TestMessage{}
		if err = proto.Unmarshal(row.Proto, message); err != nil {
			return err
		}
		fmt.Fprintf(w, "rowKey: %s, message: %v\n", row.RowKey, message)
	}
}

Java

package proto;

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobId;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;
import path.to.proto.TestMessage;
import java.util.UUID;

/** Queries Google BigQuery */
public final class Main {
  public static void main(String[] args) throws Exception {
    String projectId = "your-project-id";
    BigQuery bigquery = BigQueryOptions.newBuilder().setProjectId(projectId).build().getService();

    QueryJobConfiguration queryConfig =
        QueryJobConfiguration.newBuilder(
                " SELECT "
                    + "concat(word , \":\",corpus) as RowKey,"
                    + "<dataset-id>.toMyProtoMessage(STRUCT(word, "
                    + "CAST(word_count AS BIGNUMERIC))) AS ProtoResult "
                    + "FROM "
                    + "`bigquery-public-data.samples.shakespeare` "
                    + "ORDER BY word_count DESC "
                    + "LIMIT 20")
            .setUseLegacySql(false)
            .build();

    // Create a job ID so that we can safely retry.
    JobId jobId = JobId.of(UUID.randomUUID().toString());
    Job queryJob = bigquery.create(JobInfo.newBuilder(queryConfig).setJobId(jobId).build());

    // Wait for the query to complete.
    queryJob = queryJob.waitFor();

    // Check for errors
    if (queryJob == null) {
      throw new RuntimeException("Job no longer exists");
    } else if (queryJob.getStatus().getError() != null) {
      // You can also look at queryJob.getStatus().getExecutionErrors() for all
      // errors, not just the latest one.
      throw new RuntimeException(queryJob.getStatus().getError().toString());
    }

    // Get the results.
    TableResult result = queryJob.getQueryResults();

    // Print all pages of the results.
    for (FieldValueList row : result.iterateAll()) {
      String key = row.get("RowKey").getStringValue();
      byte[] message = row.get("ProtoResult").getBytesValue();
      TestMessage testMessage = TestMessage.parseFrom(message);
      System.out.printf("rowKey: %s, message: %s\n", key, testMessage);
    }
  }
}

Python

"""Queries Google BigQuery."""

from google.cloud import bigquery
from path.to.proto import awesome_pb2


def main():
  project_id = "your-project-id"
  client = bigquery.Client(project=project_id)
  query_job = client.query(query="""
               SELECT
			concat(word , ":",corpus) as RowKey,
			<dataset-id>.toMyProtoMessage(
			    STRUCT(
			      word, 
			      CAST(word_count AS BIGNUMERIC)
			    )
			  ) AS ProtoResult 
		FROM
				  `bigquery-public-data.samples.shakespeare`
		ORDER BY word_count DESC
		LIMIT 20
    """)
  rows = query_job.result()
  for row in rows:
    message = awesome_pb2.TestMessage()
    message.ParseFromString(row.get("ProtoResult"))
    print(
        "rowKey: {}, message: {}".format(row.get("RowKey"), message)
    )