Auto PII detection at scale
Scanning a whole BigQuery table with Cloud Data Loss Prevention (DLP, the data loss prevention product on GCP) to detect possible, likely or very likely PII in Google Analytics (GA) data is simple. This post describes a solution to automate the discovery of Personally Identifiable Information (PII – sensitive data) at scale. Some familiarity with Google Cloud Platform (GCP) is assumed, but there are many excellent guides out there on how to set up projects and so on.
This solution is appropriate for Google Marketing Platform (GMP) users who have BigQuery (BQ) linked to GA.
The solution is repeatable for multiple properties and BQ instances.
Access to the Google Cloud Platform (GCP) project is required.
Enable the DLP API.
Billing may be incurred so a separate GCP project is advised.
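Assuming the gcloud CLI is installed and authenticated against the project, the DLP API can also be enabled from the command line rather than the console (the project ID below is a placeholder):

```shell
# Enable the DLP API for the project (replace the placeholder project ID)
gcloud services enable dlp.googleapis.com --project=my-dlp-project
```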
The Data Loss Prevention (DLP) tool scans one table at a time. Scanning multiple tables requires the creation of a single table. Scanning yesterday’s GA data in BQ requires a table to be created every day.
In BigQuery, create a new Scheduled Query:
```sql
SELECT *
FROM `[project name].[data set id].ga_sessions_*`
WHERE _TABLE_SUFFIX = REPLACE(CAST(DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) AS STRING), "-", "")
```
Schedule the query to execute at a time when you are confident the GA data will have landed – 09:00 every day is a reasonable starting point, but be certain the data is there – this will vary. It’d be nice to be able to trigger queries using Pub/Sub one day… maybe…
Define the destination for the query results as a new table in the same dataset called ga_yesterday.
Ensure the table write preference is set to Overwrite the table.
This will create the table that DLP will scan.
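The REPLACE(CAST(DATE_SUB(...))) expression in the scheduled query simply builds yesterday's date in the YYYYMMDD form that _TABLE_SUFFIX expects. A quick sanity check of the same logic in Python:

```python
from datetime import date, timedelta

def yesterday_suffix(today: date) -> str:
    """Yesterday's date as a YYYYMMDD string, matching ga_sessions_* table suffixes."""
    return (today - timedelta(days=1)).isoformat().replace("-", "")

# e.g. running on 2019-03-01, the scheduled query reads ga_sessions_20190228
print(yesterday_suffix(date(2019, 3, 1)))  # → 20190228
```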
Create a new job.
The following section only covers the settings that require attention. If other settings are not mentioned, leave the defaults as they are, or feel free to play and tune to your requirements.
Configure the job input data to use the table created by the scheduled query.
Configure the sampling to “No Sampling”.
Configure the infoTypes you want to detect – creating an infoType template will save time at scale.
Set the Confidence threshold to Likely. Possible is too loose.
Specify the action to Save to BigQuery.
Include the quote.
Create a DLP dataset in the GCP project.
Configure the table ID to include the property tracking ID, with hyphens replaced by underscores for ease of use – for example, ga_yesterday_UA_XXXXXX_Y. Do not copy this value exactly; use the appropriate tracking ID for the property.
This means each DLP output table in the DLP data set will be easily recognisable as belonging to a particular GA property.
Schedule the scan for every 24 hours. The schedule is taken from when the job is created. Currently the start time isn’t configurable. Set the job up at the time you want it to execute.
Save the job and execute it.
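When repeating this for many properties, defining the job in code rather than clicking through the console saves time. The sketch below builds a job-trigger configuration as a plain dictionary in the shape the DLP API expects – the project, dataset, table and infoType values are placeholders, and the final create_job_trigger call (via the google-cloud-dlp client library) is shown commented out since it needs credentials:

```python
def build_dlp_job_trigger(project_id: str, dataset_id: str) -> dict:
    """Sketch of a daily DLP job trigger that scans ga_yesterday and saves
    findings to BigQuery. Table and infoType names are illustrative."""
    return {
        "display_name": "ga-yesterday-pii-scan",
        "inspect_job": {
            "storage_config": {
                "big_query_options": {
                    # No sampling limits set, so the whole table is scanned
                    "table_reference": {
                        "project_id": project_id,
                        "dataset_id": dataset_id,
                        "table_id": "ga_yesterday",
                    }
                }
            },
            "inspect_config": {
                "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
                "min_likelihood": "LIKELY",  # Possible is too loose
                "include_quote": True,       # keep the matched text in the findings
            },
            "actions": [{
                "save_findings": {
                    "output_config": {
                        "table": {
                            "project_id": project_id,
                            "dataset_id": "dlp",
                            "table_id": "ga_yesterday_UA_XXXXXX_Y",
                        }
                    }
                }
            }],
        },
        "triggers": [{"schedule": {"recurrence_period_duration": {"seconds": 86400}}}],
        "status": "HEALTHY",
    }

trigger = build_dlp_job_trigger("my-dlp-project", "my_ga_dataset")
# To actually create it (requires google-cloud-dlp and credentials):
# from google.cloud import dlp_v2
# dlp_v2.DlpServiceClient().create_job_trigger(
#     request={"parent": "projects/my-dlp-project", "job_trigger": trigger})
```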
This will create an output table in the dataset. Configure the table expiry time to flush the data before the scheduled query creates the table for yesterday’s GA data and the DLP job runs. Use the pen icon on the table info to adjust the table expiry.
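Table expiry can also be set from the bq CLI rather than via the pen icon – for example, expiring the findings table 23 hours from now so it is flushed before the next day's run (the project, dataset and table names below are placeholders):

```shell
# Set the DLP findings table to expire 23 hours (82800 s) from now
bq update --expiration 82800 my-dlp-project:dlp.ga_yesterday_UA_XXXXXX_Y
```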
Data studio dashboard
Create a new report with a BigQuery custom query datasource. Use the following query to extract the DLP output from the appropriate DLP output table:
```sql
SELECT
  quote,
  table_counts.field_name AS gaField,
  table_counts.name AS matchType,
  count_total,
  likelihood
FROM (
  SELECT
    quote,
    locations.record_location.field_id.name AS field_name,
    info_type.name,
    COUNT(*) AS count_total,
    likelihood
  FROM
    `[project id].[data set].ga_yesterday_UA_XXXXXX_Y`,
    UNNEST(location.content_locations) AS locations
  WHERE likelihood IN ('LIKELY', 'VERY_LIKELY', 'POSSIBLE')
  GROUP BY quote, locations.record_location.field_id.name, info_type.name, likelihood
  HAVING count_total > 0
) AS table_counts
GROUP BY quote, likelihood, table_counts.field_name, table_counts.name, count_total
ORDER BY table_counts.field_name
```
Add a table to the report:
The gaField dimension is a custom field.
Add filters for the likelihood, matchType and gaField values.
Repeat for each property being scanned.
Test and tune this. There are a few moving parts so timing may need tweaking depending on data volumes.
Think about scanning intraday tables on an hourly basis: scan the output and alert in close to real time on PII ingestion. Perhaps automatically publish the empty GTM container version we all keep for emergencies? We all have the “big red panic button” version, don’t we? That’s a handy takeaway at least!