BigQuery Surrogate Keys

Mark Scannell
Google Cloud - Community
5 min read · Jul 24, 2019

Surrogate keys are a common concept in traditional data warehouses, but are less common in big data systems. Generating a sequentially increasing number is difficult to parallelise, and alternatives, such as a HASH or a CONCAT of the keys, can provide a similar unique identifier.

I will explore a specific scenario in BigQuery with the following schema:

  • Three keys (key1, key2, key3) and the need to create a surrogate key (id).
  • Incoming data, plus existing data that already has pre-allocated surrogate keys.

NOTE: You need to create surrogatekey as a dataset in your project for the SQL to run.
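
One way to create the dataset is with a DDL statement; a minimal sketch, assuming the project's default location is acceptable:

-- Create the dataset if it does not already exist
CREATE SCHEMA IF NOT EXISTS `surrogatekey`;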

Creating Test Data

Incoming Data

Create a table surrogatekey.incoming_data with the keys (key1, key2, key3) and some data columns. Its distribution of keys will partially match the existing surrogate keys.

CREATE OR REPLACE TABLE `surrogatekey.incoming_data` AS
SELECT
-- Actual keys; need a synthetic key
CAST(k1 AS STRING) AS key1,
CAST(k2 AS STRING) AS key2,
CAST(k3 AS STRING) AS key3,
-- Data
CAST(FLOOR(1000 * RAND()) AS STRING) AS col1,
CAST(FLOOR(1000 * RAND()) AS STRING) AS col2,
'x' AS col3,
'y' AS col4
FROM
UNNEST(GENERATE_ARRAY(5, 20)) k1
CROSS JOIN UNNEST(GENERATE_ARRAY(5, 20)) k2
CROSS JOIN UNNEST(GENERATE_ARRAY(5, 20)) k3;
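
As a quick sanity check, each key ranges over 16 values (5 to 20), so the table should contain 16 * 16 * 16 = 4,096 rows:

-- Expect 4096 rows (16^3 key combinations)
SELECT COUNT(*) AS row_count FROM `surrogatekey.incoming_data`;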

Existing Data — Incrementing

Create an existing data table that has a pre-allocated (incrementing) surrogate key. Its keys will partially match the incoming data.

CREATE OR REPLACE TABLE `surrogatekey.existing_data` AS
SELECT
-- Existing surrogate
k3+(k2*10)+(k1*100) AS id,
-- Actual keys; need a synthetic key
CAST(k1 AS STRING) AS key1,
CAST(k2 AS STRING) AS key2,
CAST(k3 AS STRING) AS key3,
-- Data
CAST(FLOOR(1000 * RAND()) AS STRING) AS col1,
CAST(FLOOR(1000 * RAND()) AS STRING) AS col2,
'x' AS col3,
'y' AS col4
FROM
UNNEST(GENERATE_ARRAY(1, 10)) k1
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 10)) k2
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 10)) k3;
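
The existing keys run from 1 to 10 and the incoming keys from 5 to 20, so the two tables overlap on the range 5 to 10: 6 * 6 * 6 = 216 key combinations. A quick join confirms this:

-- Expect 216 overlapping key combinations
SELECT COUNT(*) AS overlapping_rows
FROM `surrogatekey.incoming_data`
JOIN `surrogatekey.existing_data`
USING (key1, key2, key3);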

Existing Data — HASH

Create an existing data table surrogatekey.existing_data_hash, but, this time, create a HASH. This takes the keys in a STRUCT(key1, key2, key3), uses TO_JSON_STRING to turn it into a single string, and applies FARM_FINGERPRINT. The result is an INT64 value with an extremely low chance of collision. Note that the struct fields are named to match the key columns, so that the field names in the JSON string (and therefore the hash) stay consistent when the same expression is applied to the incoming data later.

CREATE OR REPLACE TABLE `surrogatekey.existing_data_hash` AS
SELECT
-- Existing surrogate is a HASH
FARM_FINGERPRINT(TO_JSON_STRING(STRUCT(
CAST(k1 AS STRING) AS key1,
CAST(k2 AS STRING) AS key2,
CAST(k3 AS STRING) AS key3
))) AS id,
-- Actual keys
CAST(k1 AS STRING) AS key1,
CAST(k2 AS STRING) AS key2,
CAST(k3 AS STRING) AS key3,
-- Data
CAST(FLOOR(1000 * RAND()) AS STRING) AS col1,
CAST(FLOOR(1000 * RAND()) AS STRING) AS col2,
'x' AS col3,
'y' AS col4
FROM
UNNEST(GENERATE_ARRAY(1, 10)) k1
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 10)) k2
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 10)) k3;
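
It is worth confirming that the hash introduced no collisions on this data. If the two counts below differ, FARM_FINGERPRINT has collided (vanishingly unlikely at this scale):

-- Both counts should be 1000 if there are no hash collisions
SELECT COUNT(*) AS total_rows, COUNT(DISTINCT id) AS distinct_ids
FROM `surrogatekey.existing_data_hash`;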

Update and Insert Data — Incrementing

Processing the Incoming Data

First, the incoming data is processed into surrogatekey.staging_data. Each row either picks up its pre-existing surrogate key from the join, or has a new (incrementing) surrogate key allocated.

To allocate a new key, we add ROW_NUMBER() over the incoming data to the MAX of the existing keys. The window is ordered by id so that NULLs sort first, which assigns the lowest row numbers to the rows that need new keys.

CREATE OR REPLACE TABLE `surrogatekey.staging_data` AS
SELECT
e.id IS NULL AS IsNew,
IFNULL(e.id,
maxId + ROW_NUMBER() OVER (
ORDER BY
e.id
)
) AS id,
i.*
FROM
`surrogatekey.incoming_data` i
LEFT JOIN `surrogatekey.existing_data` e
USING (key1, key2, key3)
CROSS JOIN (
SELECT
IFNULL(MAX(id), 0) AS maxId
FROM
`surrogatekey.existing_data`
) m;
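
Given the 216-row overlap, the staging table should flag 4,096 - 216 = 3,880 rows as new:

-- Expect 3880 new rows and 216 rows with pre-existing keys
SELECT COUNTIF(IsNew) AS new_rows, COUNTIF(NOT IsNew) AS matched_rows
FROM `surrogatekey.staging_data`;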

Processing the Incoming Data — Large Data

In the previous step, ROW_NUMBER() was computed across all of the incoming_data. This isn't parallelised and, if it takes up too much memory, the query may fail with a Resources Exceeded error. We can amend the query to compute ROW_NUMBER() only for the rows that need a new surrogate key.

CREATE OR REPLACE TABLE `surrogatekey.staging_data` AS
WITH
-- Join with existing surrogatekey
JoinedData AS (
SELECT
e.id,
i.*
FROM
`surrogatekey.incoming_data` i
LEFT JOIN `surrogatekey.existing_data` e
USING (key1, key2, key3)
),
-- Replace NULL using ROW_NUMBER and previous max
NewKeyedData AS (
SELECT
m.maxId + ROW_NUMBER() OVER () AS id,
j.* EXCEPT (id)
FROM
JoinedData j
CROSS JOIN (
SELECT
IFNULL(MAX(id), 0) AS maxId
FROM
`surrogatekey.existing_data`
) m
WHERE
j.id IS NULL
)
-- Union together existing and new surrogate keys
SELECT
*
FROM
JoinedData
WHERE
id IS NOT NULL
UNION ALL
SELECT
*
FROM
NewKeyedData;

Processing the Incoming Data — Larger Data

In the large data step, ROW_NUMBER() was computed only for the incoming_data rows that needed new surrogate keys, but all of the columns were carried through the window function. This time just the key columns are processed, then re-joined to produce the final data.

CREATE OR REPLACE TABLE `surrogatekey.staging_data` AS
WITH
JoinedData AS (
SELECT
e.id,
i.*
FROM
`surrogatekey.incoming_data` i
LEFT JOIN `surrogatekey.existing_data` e
USING (key1, key2, key3)
),
NewKeyedData AS (
SELECT
m.maxId + ROW_NUMBER() OVER () AS id,
j.key1,
j.key2,
j.key3
FROM
JoinedData j
CROSS JOIN (
SELECT
IFNULL(MAX(id), 0) AS maxId
FROM
`surrogatekey.existing_data`
) m
WHERE
j.id IS NULL
)
SELECT
IFNULL(j.id, n.id) AS id,
j.* EXCEPT (id)
FROM
JoinedData j
LEFT JOIN NewKeyedData n
USING (key1, key2, key3);
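
Whichever variant is used, the allocated ids should be unique. A query like the following should return no rows:

-- Any output here indicates duplicate surrogate keys
SELECT id, COUNT(*) AS cnt
FROM `surrogatekey.staging_data`
GROUP BY id
HAVING COUNT(*) > 1;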

Updating the Existing Data

Update the existing data by id. We don't need to filter on whether a row is new: newly allocated ids don't exist in the target table yet, so the join simply doesn't match them.

UPDATE `surrogatekey.existing_data` e
SET
col1 = s.col1,
col2 = s.col2,
col3 = s.col3,
col4 = s.col4
FROM
`surrogatekey.staging_data` s
WHERE
e.id = s.id;

Inserting the New Data

Insert the rows flagged as IsNew. (This relies on the IsNew column from the first staging query; the large-data variants don't carry the flag, so there you would instead filter on ids not already present in the existing table.)

INSERT INTO `surrogatekey.existing_data`
(id, key1, key2, key3, col1, col2, col3, col4)
SELECT
* EXCEPT (IsNew)
FROM
`surrogatekey.staging_data`
WHERE
IsNew;
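
After the update and insert, the existing table should have grown from 1,000 rows to 1,000 + 3,880 = 4,880:

-- Expect 4880 rows after applying the incoming data
SELECT COUNT(*) AS row_count FROM `surrogatekey.existing_data`;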

Merge Data — Incrementing

Alternatively, a single MERGE statement generates a new surrogate key as needed and updates or inserts into the existing table. This pulls the previous SQL together without the complexity of tracking whether a row is new.

MERGE INTO `surrogatekey.existing_data` AS e
USING (
SELECT
IFNULL(e.id,
maxId + ROW_NUMBER() OVER (
ORDER BY
e.id
)
) AS id,
i.*
FROM
`surrogatekey.incoming_data` i
LEFT JOIN `surrogatekey.existing_data` e
USING (key1, key2, key3)
CROSS JOIN (
SELECT
IFNULL(MAX(id), 0) AS maxId
FROM
`surrogatekey.existing_data`
) m
) s
ON e.id=s.id
WHEN MATCHED THEN
UPDATE SET
col1 = s.col1,
col2 = s.col2,
col3 = s.col3,
col4 = s.col4
WHEN NOT MATCHED THEN
INSERT ROW;

Merge Data — HASH

With a HASH it is even more straightforward: simply UPDATE or INSERT as appropriate from the incoming data, as the hash can be calculated independently for each row.

MERGE INTO `surrogatekey.existing_data_hash` AS e
USING (
SELECT
FARM_FINGERPRINT(TO_JSON_STRING(STRUCT(key1, key2, key3)))
AS id,
*
FROM
`surrogatekey.incoming_data` AS i
) AS i
ON (e.id=i.id)
WHEN MATCHED THEN
UPDATE SET
col1 = i.col1,
col2 = i.col2,
col3 = i.col3,
col4 = i.col4
WHEN NOT MATCHED THEN
INSERT ROW;
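
Because the hash is derived from the keys themselves, re-running this MERGE with the same incoming data only performs updates. The row count should settle at 1,000 + 3,880 = 4,880:

-- Expect 4880 rows; re-running the MERGE should leave this unchanged
SELECT COUNT(*) AS row_count FROM `surrogatekey.existing_data_hash`;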

Merge Data — CONCAT

A HASH can be very powerful and simple, but another alternative is to CONCAT all of the keys together into one large string. This has the benefit of guaranteeing uniqueness (as long as the delimiter cannot appear in the keys themselves), and the key is also explainable just by looking at it.

With id as a STRING rather than an INT64, it could be calculated as follows:

CONCAT(
CAST(key1 AS STRING), '-',
CAST(key2 AS STRING), '-',
CAST(key3 AS STRING)
) AS id,
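
The delimiter is what provides the uniqueness guarantee. Without one, distinct key combinations can collapse into the same string, as this small illustration (with hypothetical key values) shows:

-- Without a delimiter, ('1','23') and ('12','3') both become '123';
-- with the '-' delimiter they stay distinct ('1-23' vs '12-3')
SELECT
CONCAT('1', '23') = CONCAT('12', '3') AS collides_without_delimiter,
CONCAT('1', '-', '23') = CONCAT('12', '-', '3') AS collides_with_delimiter;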

Conclusion

Surrogate keys can be useful, but it is worth thinking about how they will be used and the complexity they introduce. While generation performance can be important, the query patterns that consume the keys are usually what matter most.

Looking into HASH alternatives, or, even better, not using surrogate keys at all, is worth considering.
