Google DLP for Credit Card certification

Artem Nikulchenko
Google for Developers Europe
8 min read · Aug 19, 2022

This post is not part of any series and is not aiming to be a full product overview. This post shares my personal experience and thoughts about using Google DLP for one specific task. Also, using DLP this way was only possible due to our specific architecture and probably won’t be possible for other POS applications.

At Teamwork Commerce, we are building an enterprise-level Retail Management solution that is used globally. A big part of that solution is a POS application (which, in our case, runs on iPad/iPhone).

When building POS software, you have to deal with Credit Card integration. And when building POS software that is used globally, you must deal with it many times (at least once for each country you plan to support).

For those of you who are not connected to this area, it may feel strange how fragmented it is. It is very different from the eCommerce payment space, since you have to integrate with Credit Card hardware (and, as a result, certify this hardware first). While there are companies that try to cover multiple countries (usually the EU or US/Canada), no one has full global coverage. So you have to do a lot of integrations with a lot of different hardware and services. And while one might have hoped that those integrations are more or less similar, I can assure you that there is no limit to the imagination of the people who create them :)

There is only one area that is a bigger nightmare for POS developers than Credit Card integration, and it is fiscalization. But that is another story…

And if you are connected to those areas — I feel your pain…

Each time you code a Credit Card integration, you have to be careful and ensure that you don't save any sensitive information to the DB (CC number, track data, expiration date, etc.). Some (good) Credit Card devices won't even expose any of this data to the POS (which is a good but not common pattern), but others (not so good) require the POS to at least temporarily handle this data (and that is when you need to be careful).

Usually, before your POS goes to production (to stores) in each country, a certification process happens. Some of those processes are pretty intensive (e.g., PCI Compliance), while others do a little check and let you deal with issues later (which means that you, as the POS developer, have to do your own internal certification and checks).

At Teamwork Commerce, we go even further and allow external developers to build plug-ins for our POS (yes, you heard it right, we have plug-ins for an iOS app; that is another and rather interesting story, but this is primarily a GCP blog, and I won't get into iOS development here…).

So, how can you check that no sensitive data is stored by the POS (other than analyzing the code or doing spot checks)?

Normally, certification companies have special software that can scan DBs and disks to find sensitive information. And usually, you cannot get a hold of this software (I guess to prevent you from analyzing it and finding a way to fool it).

So, what can GCP offer for this kind of task? That is where we get to the Google Data Loss Prevention service.

Data Loss Prevention

According to the official GCP documentation, Data Loss Prevention (DLP) is a fully managed service designed to help you discover, classify, and protect your most sensitive data. It allows you to:

  • Take charge of your data on or off cloud
  • Gain visibility into sensitive data risk across your entire organization
  • Reduce data risk with obfuscation and de-identification methods like masking and tokenization
  • Seamlessly inspect and transform structured and unstructured data

We will focus on one specific feature from the GCP documentation: automated sensitive data discovery for your data warehouse.

It does (or at least is supposed to do) exactly what you would expect based on that description. You can set up a job (one-time or recurring) that scans your data warehouse (Google Cloud Storage, Google Cloud Datastore, BigQuery, or Hybrid) and gives you the results of that scan.
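To get a feel for the service before pointing it at a warehouse, you can ask DLP to inspect a plain string. Here is a minimal sketch using the google-cloud-dlp Python client; "my-project" is a placeholder, and the card number is the standard Visa test number:

```python
# Minimal sketch: inspect an in-memory string with the DLP API.
# Assumes the google-cloud-dlp client library and a project with DLP enabled.
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()

response = dlp.inspect_content(
    request={
        "parent": "projects/my-project/locations/global",
        "inspect_config": {
            "info_types": [{"name": "CREDIT_CARD_NUMBER"}],
            "include_quote": True,  # return the matched text itself
        },
        "item": {"value": "Order paid with card 4111111111111111"},
    }
)

for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood, finding.quote)
```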

Hybrid is actually a very interesting but more complicated option to configure. Since, as promised, this is not a full service overview, I won't go into the Hybrid option here and will just leave a link if you want to learn more about it.

Luckily for us, we sync all POS data to BigQuery for analytics purposes. So, while normally during certification you would be required to scan the POS DB itself (as the first “touch” point for CC data), for a self-check and for the purposes of this post we can scan the data in BigQuery and try to use DLP on it. So, let's give it a try…

Configuring a Data Loss Prevention Inspection

For our particular task, we will configure a one-time Inspection in DLP to scan BigQuery data and see if it can find Credit Card-related sensitive information. This is actually pretty straightforward…

1. First, you need to configure what you want to inspect
Source for DLP Inspection

One thing to note here is that, in the case of BigQuery, only one table is allowed per Inspection. If you need to scan several tables, you will need to configure several Inspections (one per table).
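For reference, this is roughly how the source configuration looks through the Python client (a sketch; the project, dataset, and table names are placeholders):

```python
# One BigQuery table per Inspection: storage_config points at exactly one table.
# "my-project", "pos_analytics" and "transactions" are placeholder names.
storage_config = {
    "big_query_options": {
        "table_reference": {
            "project_id": "my-project",
            "dataset_id": "pos_analytics",
            "table_id": "transactions",
        }
    }
}
```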

2. Then you can configure Sampling, Identifying fields and Columns to scan

In our case, we want to scan all of the data, so we select “No sampling”.

Next comes “Identifying fields”. From my experience, I would really want this field to be required, or at least to get a warning when it is left empty. If you leave it empty (as I have done several times by mistake), there is no way to tell which exact records contain sensitive data (you will know that you have it somewhere, and even in which column, but not the record). That meant that the results of those scans were useless, and I had to start over… (which is why I'm a little frustrated that this field is not required).
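In the API, this corresponds to the identifying_fields option of the BigQuery source. Continuing the sketch above ("transaction_id" is a hypothetical key column in our table):

```python
# Identifying fields are copied into every finding so that you can trace it
# back to the exact row. "transaction_id" is a hypothetical key column.
storage_config["big_query_options"]["identifying_fields"] = [
    {"name": "transaction_id"}
]
```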

3. The next step is to configure what we want to find by listing the InfoTypes that will be used.

There is a long list of different InfoTypes available, including the ability to create your own custom ones. For our task, we are primarily interested in CREDIT_CARD_NUMBER and CREDIT_CARD_TRACK_NUMBER.

One more parameter is the Confidence threshold. Obviously, the lower you set it, the more findings you will get (and the more of them will be false positives).
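Continuing the sketch, the InfoTypes and the confidence threshold live in the inspect_config:

```python
# What to look for, and how confident DLP must be before reporting a finding.
inspect_config = {
    "info_types": [
        {"name": "CREDIT_CARD_NUMBER"},
        {"name": "CREDIT_CARD_TRACK_NUMBER"},
    ],
    # Lower likelihood thresholds return more findings (and more false positives).
    "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
}
```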

4. Next, you need to select what to do with your findings. For me, the choice is obvious: save to BigQuery (if you can save something to BigQuery, why wouldn't you?! :) ).
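In the API this is an action attached to the job; the findings table below is a placeholder and will be created if it does not exist:

```python
# Save all findings to a BigQuery table ("dlp_results.cc_findings" is a placeholder).
actions = [
    {
        "save_findings": {
            "output_config": {
                "table": {
                    "project_id": "my-project",
                    "dataset_id": "dlp_results",
                    "table_id": "cc_findings",
                }
            }
        }
    }
]
```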

5. Finally, you need to set up a Schedule. As already mentioned, we will do a one-time scan for now.
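Putting the fragments above together (and reusing the client from the first sketch), a one-time scan is just a single DLP job; for a recurring scan you would create a job trigger with a schedule instead:

```python
# Submit the one-time inspection job built from the fragments above.
job = dlp.create_dlp_job(
    request={
        "parent": "projects/my-project/locations/global",
        "inspect_job": {
            "storage_config": storage_config,
            "inspect_config": inspect_config,
            "actions": actions,
        },
    }
)
print(f"Started job: {job.name}")
```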

6. When the Inspection is done, you can see summary results in the DLP console and investigate them in more detail using BigQuery.

The BigQuery table created as a result of the Inspection contains enough detail to find exactly what DLP considered sensitive data (unless you forgot to specify Identifying fields, as I did several times).
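As a sketch, a query like the one below pulls the findings back out, including the identifying values. The column paths follow the default findings schema, so verify them against your own table:

```python
# Query the findings table written by the save_findings action above.
# Column paths follow the default findings schema; verify against your table.
from google.cloud import bigquery

bq = bigquery.Client(project="my-project")
query = """
SELECT
  info_type.name AS info_type,
  likelihood,
  cl.record_location.field_id.name AS column_name,
  cl.record_location.record_key.id_values AS identifying_values
FROM `my-project.dlp_results.cc_findings`,
  UNNEST(location.content_locations) AS cl
"""
for row in bq.query(query):
    print(row.info_type, row.likelihood, row.column_name, row.identifying_values)
```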

Data Loss Prevention findings

Up until now, DLP looked like a perfect service — easy to use and solving very important problems.

Unfortunately, now we have arrived at the fly in the ointment…

Note that below I’m sharing my findings, and it is possible that I’ve done something wrong, and the issues below are more my issues than DLP issues.

  • CREDIT_CARD_NUMBER identification is pretty basic. I may be wrong, but it looks like it identifies as a Credit Card Number almost any number with 12 or more digits. It does pick up fully unmasked Credit Card numbers, but it also picks up a lot of false positives, and in my case it even identified 6666666666666 as a CC number.
  • CREDIT_CARD_NUMBER does not pick up partially masked Credit Card numbers. Why is this a problem? Even one masked digit was enough for a number to be skipped, yet Credit Card masking requirements actually differ from country to country and usually require at least 8 digits to be masked. (Of course, you can try to set up a custom InfoType for that; see the sketch after this list.)
  • CREDIT_CARD_TRACK_NUMBER simply has not worked for me at all. I put track2 data into some of the fields in the BigQuery table, and none of it was found. I don't know why that is, or whether something was wrong with my track2 data, but that is how it was for me.
  • STREET_ADDRESS gave me a lot of false positives. It looks like almost any number in my data was identified as a potential address. Of course, I could have played with the likelihood configuration for this particular one, but since I was focused on CC, I just disabled it.
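As for the masked numbers, here is the kind of custom InfoType I mean: a hypothetical regex-based detector for partially masked PANs such as 411111******1111. The pattern is a rough sketch, not a vetted detector:

```python
# Hypothetical custom InfoType for partially masked card numbers:
# 6 leading digits, 4 to 9 mask characters (* or X), 4 trailing digits.
# The regex is a rough sketch, not a production-grade detector.
inspect_config["custom_info_types"] = [
    {
        "info_type": {"name": "MASKED_CREDIT_CARD"},
        "regex": {"pattern": r"\b\d{6}[*Xx]{4,9}\d{4}\b"},
        "likelihood": dlp_v2.Likelihood.POSSIBLE,
    }
]
```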

In summary, I found Data Loss Prevention (and specifically the Inspection feature) very cool, easy to use, and useful on paper. However, at least for my specific problem, the InfoTypes definitely require more work to provide better identification.

In addition to that:

  • I wish it were possible to learn how the built-in InfoTypes work. For example, I suspect that the CREDIT_CARD_NUMBER InfoType is based only on a regex (and that is what caused all the issues). Also, I would like to know how CREDIT_CARD_TRACK_NUMBER works and why it missed the track2 data that I put into BigQuery.
  • I wish custom InfoTypes supported more options, including a custom script (JS or an API call). That would definitely raise performance concerns, but it would allow building a proper Credit Card Number InfoType that includes the Luhn algorithm, Major Industry Identifier checks, etc. (see the sketch below).
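To illustrate, here is the kind of check such a scriptable InfoType could run (a standalone sketch, not DLP functionality). Interestingly, the 6666666666666 example from above happens to pass the Luhn checksum, so a checksum alone would not have rejected it; prefix and length tables are needed too:

```python
# Sketch of the checks a richer (scriptable) InfoType could run.
def luhn_valid(number: str) -> bool:
    digits = [int(d) for d in number]
    # Double every second digit from the right; subtract 9 if the result exceeds 9.
    for i in range(len(digits) - 2, -1, -2):
        doubled = digits[i] * 2
        digits[i] = doubled - 9 if doubled > 9 else doubled
    return sum(digits) % 10 == 0

print(luhn_valid("4111111111111111"))  # True: the standard Visa test number
print(luhn_valid("6666666666666"))     # Also True: thirteen 6s pass the checksum,
                                       # so issuer-prefix/length checks are needed too
```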
