Scanning BigQuery Table with Google Cloud Data Loss Prevention (DLP)
We are in the information age; never in our history has storage been so cheap. As a result, we have developed a new habit of saving first and analyzing later.
But this new habit brings an important question: “How much sensitive information are we storing?”
To help answer this question, and many others related to information security, Google built Cloud Data Loss Prevention (DLP) (https://cloud.google.com/dlp/).
DLP has a lot of great features, and today I’ll talk about looking for sensitive data in a BigQuery table.
It’s easy, and you can find all the code here (http://bit.ly/2G0GG2W).
By the way, I assume that you know how to create a service account in GCP; if not, let me know and I’ll write an article about it.
Ok, let’s do it.
The code was split into 5 steps:
- STEP 1 Install packages
- STEP 2 Set properties
- STEP 3 Load packages and Imports
- STEP 4 Invoke DLP (The most important step)
- STEP 5 Analyze the results in BQ
Let’s dive into each one:
STEP 1 Install packages: Install all the libraries used in this demo:
!pip install google-cloud
!pip install google-api-python-client
!pip install oauth2client
!pip install google-cloud-bigquery
!pip install google-cloud-dlp
STEP 2 Set properties: In this step, you must change the values below to match your environment. In this example, I have a service account named jupytersa, and its key was saved in the file sa_bq.json.
SERVICE_ACCOUNT = 'jupytersa'
JSON_FILE_NAME = 'sa_bq.json'
GCP_PROJECT_ID = 'your_project'
DATASET_ID = 'your_dataset'
TABLE_ID = 'your_table'
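As an optional aside (my addition, not one of the original steps), you can also point the standard GOOGLE_APPLICATION_CREDENTIALS environment variable at the key file, so the clients can be created without passing the JSON file explicitly:
import os

# Optional alternative (not part of the original steps): with this set,
# bigquery.Client() and dlp_v2.DlpServiceClient() pick up the credentials
# automatically instead of reading the key file passed in each constructor.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = JSON_FILE_NAME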
STEP 3 Load packages and Imports
import json
from googleapiclient import discovery
from google.cloud import bigquery
from google.cloud import dlp_v2
from oauth2client.service_account import ServiceAccountCredentials
STEP 4 Invoke DLP: Here DLP is invoked, meaning a DLP job is created (asynchronously), following the configuration defined in inspect_job_data. The inspect_job_data dictionary describes the storage to scan, what kind of data to look for, and what to do afterwards.
Let me go deeper into inspect_job_data.
- storage_config (storage to scan): indicates the data to scan; the options are BigQuery, Datastore, and Cloud Storage. With BigQuery, I can define how many rows will be scanned by setting the rows_limit parameter.
- inspect_config (what type of data to scan for): in this example I'm using ALL_BASIC, which triggers all current infoTypes. You can find more information here.
- actions (what to do afterwards): in Cloud Data Loss Prevention (DLP), an action is something that occurs after a DLP job completes successfully. Cloud DLP supports two types of actions, and you can find more information here (see the small sketch after this list for an alternative configuration).
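Before the full configuration below, here is a small sketch of what a narrower setup could look like. This is my own illustration, not part of the original code: the specific infoTypes, the min_likelihood value, and the Pub/Sub topic name are assumptions, so adjust them to your case.
# Hedged sketch (not part of the original code): scan only for specific
# infoTypes above a minimum likelihood, and notify a Pub/Sub topic when
# the job finishes. The topic name below is a placeholder.
alternative_inspect_config = {
    'info_types': [
        {'name': 'EMAIL_ADDRESS'},
        {'name': 'PHONE_NUMBER'},
    ],
    'min_likelihood': 'POSSIBLE',
}
alternative_action = {
    'pub_sub': {
        'topic': 'projects/{}/topics/your_dlp_topic'.format(GCP_PROJECT_ID)
    }
}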
# Here the DLP client is created from the service account key file
client_dlp = dlp_v2.DlpServiceClient.from_service_account_json(JSON_FILE_NAME)

inspect_job_data = {
    'storage_config': {
        'big_query_options': {
            'table_reference': {
                'project_id': GCP_PROJECT_ID,
                'dataset_id': DATASET_ID,
                'table_id': TABLE_ID
            },
            'rows_limit': 10000,
            'sample_method': 'RANDOM_START',
        },
    },
    'inspect_config': {
        'info_types': [
            {'name': 'ALL_BASIC'},
        ],
    },
    'actions': [
        {
            'save_findings': {
                'output_config': {
                    'table': {
                        'project_id': GCP_PROJECT_ID,
                        'dataset_id': DATASET_ID,
                        'table_id': '{}_DLP'.format(TABLE_ID)
                    }
                }
            },
        },
    ]
}

# Create the (asynchronous) DLP inspection job
operation = client_dlp.create_dlp_job(parent=client_dlp.project_path(GCP_PROJECT_ID), inspect_job=inspect_job_data)
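Because the job runs asynchronously, it may not be finished by the time you query the findings table. Here is a minimal polling sketch, my own addition: it assumes the same pre-2.0 google-cloud-dlp client used above (enum and method paths may differ in newer library versions).
import time

# Minimal polling sketch (assumption: pre-2.0 google-cloud-dlp, as above).
# Wait until the asynchronous DLP job reaches a terminal state before
# moving on to STEP 5.
terminal_states = (
    dlp_v2.enums.DlpJob.JobState.DONE,
    dlp_v2.enums.DlpJob.JobState.CANCELED,
    dlp_v2.enums.DlpJob.JobState.FAILED,
)
while True:
    job = client_dlp.get_dlp_job(operation.name)
    if job.state in terminal_states:
        break
    time.sleep(10)  # check again in 10 seconds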
STEP 5 Analyze the results in BQ: Have fun. In this example, I’m counting the matches between infoTypes and columns where the likelihood is LIKELY or VERY_LIKELY. You can find more information about DLP likelihood here.
client_bq = bigquery.Client.from_service_account_json(JSON_FILE_NAME)

# Perform a query.
QUERY = (
'WITH result AS ('
'SELECT'
' c1.info_type.name,'
' c1.likelihood,'
' content_locations.record_location.record_key.big_query_key.table_reference as bq,'
' content_locations.record_location.field_id as column '
'FROM '
' `'+ GCP_PROJECT_ID +'.'+ DATASET_ID +'.'+ TABLE_ID +'_DLP` as c1 '
'CROSS JOIN UNNEST(c1.location.content_locations) AS content_locations '
'WHERE c1.likelihood in (\'LIKELY\',\'VERY_LIKELY\'))'
'SELECT r.name as info_type, r.likelihood, r.bq.project_id, r.bq.dataset_id,'
' r.bq.table_id, r.column.name, count(*) as count FROM result r GROUP By 1,2,3,4,5,6 '
'ORDER By COUNT DESC'
)
query_job = client_bq.query(QUERY) # API request
rows = query_job.result()
for row in rows:
    print('RULES: {} ({}) | COLUMN: {}.{}.{}:{} | count->{}'.format(
        row.info_type, row.likelihood, row.project_id, row.dataset_id, row.table_id, row.name, row.count))
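As an optional follow-up (not in the original post), the same result can be loaded into a pandas DataFrame, assuming pandas is installed:
# Optional (assumes pandas is installed): load the query result into a
# DataFrame instead of iterating over the rows.
df = client_bq.query(QUERY).to_dataframe()
print(df.head())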