Find your “high risk files” according to GDPR using our DriveScanner

Jeanine Schoonemann, Cmotions
Apr 17, 2023

In every company it’s a struggle to keep only the documents we want and need in the future, to minimize the total number of files and, even more important, to make sure we’re not violating any GDPR legislation. That matters not only to our Privacy Officer, but to all of us of course.

GDPR, we love (to hate) it

We just have to comply with the rules, sounds simple enough! But what sounds simple might be more complicated in practice. The daily hustle and bustle occupies our minds and might make us forget to clean up after a project is done. And isn’t GDPR sometimes just inconvenient too… We just need to share that resume and we need to do it now! And yes, there might be some contact information in that Excel file, but you will remove that as soon as you do not need it anymore, right?

Right… Wrong! Well, at least sometimes. We are all human, which means our actions are not always in line with our intentions. That doesn’t mean we are wilfully violating GDPR legislation, but it doesn’t mean violations never happen either.

Your Privacy Officer may be aware of this and is trying to encourage all users to clean up: check your downloads folder, delete stored attachments, empty your trash, clean up the project folder at the end of the project. But that doesn’t mean an extra check wouldn’t be an excellent idea.

Excuse me, how many?!

At Cmotions, we knew we might be at risk, simply due to the sheer number of files on our filesystem, even though we only keep our own project files and don’t store customer data anywhere on our own filesystem. That’s why our Privacy Officer tried to come up with rules to eliminate GDPR-sensitive files as much as possible. To the other employees, it felt as if these rules were not doing what they were meant to do, and we, as data professionals, were convinced we should be able to do better. This is when we first came up with the idea to create a Python package to do these checks for us. The idea was to make this work a lot easier and to solve all the problems mentioned above. With just a few clicks you should be able to see a list of files you need to check for GDPR-sensitive information. Preferably, you should also be able to see which GDPR rule was violated and how.

With this in mind, we started building our Python package ‘DriveScanner’, and now we’re proud to be sharing our first version with you. It might not be perfect yet (it’s a work in progress), but what better way to improve than with the help of our community? Check out the code in our repository, or simply start using the package by installing it with pip: pip install drivescanner.

The birth of the DriveScanner

So how has this package helped us? First, it gave us insight into the number of files, and the different file types, we have saved in our filesystem. A shocking 223,976 files! Assuming it would take about 10 to 15 minutes to check each file, and knowing we only have one Privacy Officer, we now knew for sure that it would be impossible to check all these files manually. So we set up GDPR rules that are checked automatically for every file, and the output is a table containing the number of times a specific GDPR violation was found in a specific file. Currently the package scans for Dutch social security numbers, bank account information, email addresses, telephone numbers, addresses in general, credentials of any kind, and credit card or passport numbers. It will also check for credential tags like login information. Optionally the scan is also able to detect Named Entities in Dutch and other languages.
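To make the idea of “a table with the number of violations per file” a bit more concrete, here is a minimal sketch of pattern-based scanning using only the Python standard library. This is not the DriveScanner code or its API: the patterns, the function names and the in-memory demo files are simplified assumptions for illustration, and the sample values are fictional.

```python
import re
from collections import Counter

# Simplified, illustrative patterns (assumptions, not DriveScanner's own rules).
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
IBAN_NL = re.compile(r"\bNL\d{2}[A-Z]{4}\d{10}\b")
NINE_DIGITS = re.compile(r"\b\d{9}\b")


def passes_bsn_check(digits: str) -> bool:
    """Apply the '11-test' used for Dutch social security numbers (BSN)."""
    weights = (9, 8, 7, 6, 5, 4, 3, 2, -1)
    total = sum(int(d) * w for d, w in zip(digits, weights))
    return total % 11 == 0


def count_findings(text: str) -> Counter:
    """Count how often each kind of sensitive pattern occurs in a text."""
    counts = Counter()
    counts["email"] = len(EMAIL.findall(text))
    counts["iban_nl"] = len(IBAN_NL.findall(text))
    # Only nine-digit numbers that pass the 11-test are counted as a BSN.
    counts["bsn"] = sum(passes_bsn_check(m) for m in NINE_DIGITS.findall(text))
    return counts


def scan(files: dict[str, str]) -> dict[str, Counter]:
    """Return one row of counts per file name."""
    return {name: count_findings(text) for name, text in files.items()}


# Fictional demo content standing in for files read from disk.
DEMO_FILES = {
    "notes.txt": "Mail jan@example.nl about invoice 123456789.",
    "hr.txt": "BSN 111222333, IBAN NL91ABNA0417164300, piet@example.com",
}
RESULT = scan(DEMO_FILES)
```

A plain nine-digit number like an invoice number is not counted as a social security number here, because it fails the checksum. Checks like this help to keep the number of false positives down, although a real scanner needs far more robust rules than these.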

Based on the scan result, each file is given a score that reflects the severity of the violations found. With these scores our Privacy Officer was able to filter files on a specific violation or on an overall score.
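One simple way to turn counts into a score is a weighted sum. The sketch below assumes hypothetical severity weights and violation names. It is not how DriveScanner actually scores files, just an illustration of filtering and ranking on an overall score.

```python
# Hypothetical severity weights per violation type (assumed for illustration).
SEVERITY = {"bsn": 10, "iban_nl": 5, "email": 1}


def risk_score(counts: dict[str, int]) -> int:
    """Weighted sum of findings; unknown violation types get weight 0."""
    return sum(SEVERITY.get(kind, 0) * n for kind, n in counts.items())


def rank_files(findings: dict[str, dict[str, int]], min_score: int = 1):
    """Return (file, score) pairs at or above min_score, highest score first."""
    scored = [(name, risk_score(counts)) for name, counts in findings.items()]
    kept = [pair for pair in scored if pair[1] >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Fictional scan results: violation counts per file.
FINDINGS = {
    "newsletter.txt": {"email": 3},
    "hr.txt": {"bsn": 1, "email": 1},
    "readme.txt": {},
}
RANKED = rank_files(FINDINGS)
```

With weights like these, a single social security number outweighs a handful of email addresses, and files without any findings drop out of the list altogether.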

How we used our own DriveScanner

So, now what? Even when you know which files contain sensitive information, it can still cost a lot of time to find out where and what kind of violation was made. This is why we also added the type of violation to the output table. This way, our Privacy Officer knew not only which file to look at, but also which violation to look for. With only a few clicks, and some waiting time, we were able to scan 223,976 files for GDPR violations. Not only did this help us to clear sensitive information from some files, it also saved us a lot of time. For example, it identified that 90% of the files on our Drive don’t need any human assessment. For the 10% that did, we started with the Excel output, and this way we could disregard another 7% of the files, leaving us with 3% that needed to be opened and assessed. Still a substantial number of files, but a lot less than what we started with.
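To get a feeling for what these percentages mean in practice, here is a small back-of-the-envelope calculation. It only uses the numbers mentioned in this post (223,976 files, 10 to 15 minutes per file, and the 10% and 3% shares); the percentages are presumably rounded, so treat the outcome as a rough estimate.

```python
TOTAL_FILES = 223_976
MINUTES_PER_FILE = (10, 15)  # estimated manual checking time per file


def files_for_share(total: int, share_percent: float) -> int:
    """Approximate number of files in a given percentage of the total."""
    return round(total * share_percent / 100)


def review_hours(n_files: int, minutes_per_file: float) -> float:
    """Hours needed to manually check n_files at a fixed time per file."""
    return n_files * minutes_per_file / 60


for label, share in [("all files", 100), ("after the scan", 10), ("after the Excel check", 3)]:
    n = files_for_share(TOTAL_FILES, share)
    low, high = (review_hours(n, m) for m in MINUTES_PER_FILE)
    print(f"{label}: {n} files, roughly {low:.0f} to {high:.0f} hours")
```

Checking everything by hand would take somewhere between 37,000 and 56,000 hours. The remaining 3% comes down to roughly 6,700 files, which is still a lot of work but a very different order of magnitude.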

And you might wonder: were any of these files actually harmful? Luckily, not at all! We mostly found some points of improvement for the DriveScanner itself, although some of the flagged files were correctly flagged from the DriveScanner point of view, like:

  1. A file created by our DataSampler, containing fictional personal information like telephone numbers, addresses and email addresses;
  2. A project file with multiple external stakeholders, in which the name, telephone number and email address of all reviewers were stated.

What’s next? Will you be there with us?

So, it seems our assumptions were right. And why keep something so simple and powerful to ourselves? That is why we would like to share this with you. Have a look at our repo, pip install our package, see how it works and help us improve!

And yes, we are aware that our package definitely has a lot of points to improve on :)

Originally published at https://cmotions.nl on April 17, 2023.

Want to read more about the cool stuff we do at Cmotions and The Analytics Lab? Check out our blogs, projects and videos!
