What are regulations saying about Data Privacy

Fabiana Clemente
Nov 12, 2020 · 21 min read


Podcast series — When Machine Learning meets Data Privacy

Episode 2

Welcome to the second episode of the podcast series — When Machine Learning meets Data Privacy.

Today's episode is about privacy regulations. Data privacy is a trending topic: with the understanding that collecting consumer data is crucial for modern businesses came the awareness that data can tell much more about us than we would like. As a result, many regulations now enforce data privacy, such as GDPR, CCPA, and LGPD, and the list goes on. But, after all, what do they say about data privacy?

What is the definition of data privacy, and why should organizations comply with it?

To answer these questions, I have invited Cat Coode, a specialist in data privacy and regulations who helps companies keep themselves compliant. And, as usual, do not forget to subscribe to our amazing MLOps.Community Slack.

The interview

Fabiana Clemente: Welcome, Cat! Thank you for being here with us today on our podcast on data privacy for machine learning for the MLOps.Community. You come from an engineering and software background, if I’m correct, so I would like to ask you: how did you jump from a technical perspective to the regulations side and become an expert on data privacy?

Cat Coode: That’s great! Well, thank you first for having me, this is great. Like you said, I started on the software and engineering side; I actually have a degree in computer and electrical engineering. I used to work for BlackBerry, which was RIM at the time, designing devices and servers, and BlackBerry, as a handheld device, was always security first. Everything we did, from the architecture of how the applications work with the operating system, had a security-first focus.

I worked there for a long time, starting in development and moving up to senior management and architecture, and when I left, I wanted to help people understand data!

I wanted people to have a better understanding of what they were putting into the internet and into applications. Things really started off with Facebook, Twitter, and all of the social networks that we know.

But then, a few years ago, the regulations started to get bigger, and a lot of the corporate companies that I was working with on cyber education came back asking: what do you know about this GDPR? At one point in my life I had wanted to be a lawyer, although I never went down that route, but I started reading about the regulations and found them fascinating! So I did a certification in privacy law, and I realized, as we will talk about, that privacy regulations come from both sides: they come from a legal side, but they also come from a technology side, and people don’t always know how to implement the technology piece. So a lot of what I’ve been doing is helping companies understand how to apply the regulations and how to become compliant with them!

Fabiana: That’s amazing, and it’s very curious to see how your experience took you through this path. It’s very interesting how you jumped from a technical perspective towards something more related to regulations, but, as you mention, regulations have both sides: the technical perspective and the legal one. I dare to ask you, then: how can we define data privacy? Is that even possible?

Cat: That’s always a really hard question! I get it a lot in cybersecurity circles; people always ask what the difference is between security and privacy.

Security is about protecting your assets. Security is keeping people out of where they don’t belong. Privacy is keeping the information, and the individual, safe. As for the difference between security and privacy, I mean,

people always use the castle analogy. With security, you’ve got a moat and a big wall, and if someone gets in, you can probably kick them out again. But privacy is what the king is doing right now: if somebody finds out what the king is doing, once they know, they know.

So privacy, to me, is really critical, because once the damage is done, it’s done. There’s no way to put it back in the box. You can’t shove the toothpaste back in the tube! It’s really important that people prioritize privacy in addition to security because, if someone gets in only one time out of a million, a security risk analysis rates that as low risk, but again, once the data is out there, the harm has been done to the individual.

Fabiana: Knowing this, do you think that current regulations, the ones you know like GDPR and even CCPA, for example, cover everything they need to in order to ensure privacy?

Cat: I do think they’re a really good step in the right direction! For anyone listening who isn’t aware of all of this: we have things like NIST and ISO, which you hear about a lot in the security world, and they have been around for a long time, right? But they are, again, security focused. GDPR, which is the European Union’s, CCPA, which is California’s, and PIPEDA, which is what I have in Canada, all of these regulations are trying to put individuals first. It’s “let’s consider the user” instead of “let’s consider how to innovate”, because I find, especially coming from a tech community and a tech background, there’s this thing about innovation: we can innovate so quickly online these days, it’s so easy to create a new app in a day, that people aren’t stopping to think about the impact on the individual who’s using the app. So I feel like these regulations are doing that well, because they’re putting the light back on the individual!

So they’re saying before you design this, I know it’s cool and it’ll make you money but, what happens to your user and their data when they use this app?

Fabiana: In that sense, and now that you’ve mentioned it, at least from my perspective, and this might be a bit controversial: although these laws and regulations put those concerns and the individual first, what happens in a lot of cases is that companies, especially the smaller ones and the more disruptive ones, are not taking the regulations so seriously. It seems like, conceptually, these laws only apply to big organizations, or at least the market is acting that way. What’s your perspective?

Cat: Unfortunately, I tend to agree with you. We’ve got the big companies in the spotlight keeping themselves compliant in order to maintain European customers, and then you’ve got brand-new companies that are trying to implement a lot of this privacy-first thinking. But you’re right, I agree: there are all sorts of small and medium businesses, which are the majority of what we use in the world, and for them it costs money to implement the regulations, because often you have to go back and fix the things that are wrong, and they don’t have the time! One way I explain it: it’s as if all the buildings had crumbling foundations and needed to fix them, which is essentially what we’re doing by retrofitting privacy. The big office building that houses a thousand people has to fix its foundation, or it won’t be able to rent any of its units, because those companies are saying

Hey, you have to do this! But the smaller companies say: we don’t have the money to fix the foundation right now, we just have to run our business. So hopefully it’s fine, hopefully it’s not a problem!

This is the feedback I get when I go into a company and say: okay, here are the 12 things you need to do for compliance, and here is the order of priority in which to do them. And they say: yeah, we only have budget for one. So if it’s not a priority, it’s not a priority, at least right now.

Fabiana: Yes, that’s a good point! But another thing that I feel, and this might be from a consumer perspective, is that this new generation is getting more and more aware of, and concerned about, privacy and the use of their data. Do you feel like, in the end, it will have to be the consumers forcing companies to adopt privacy in order not to be canceled?

Cat: I think so! Again, we’ve become so complacent with a lot of these social networks that people are used to using. If you look at TikTok, it takes so much of your data, and we have a whole generation of people who just love it because it’s fun; they don’t want to know! It’s an out-of-sight, out-of-mind thing: I don’t want to know what it’s taking because it’s too much fun!

But I agree! I think as people become more aware of privacy and other options come up, they will have the option to say: you could use this product, which will protect your privacy. You see people picking Brave as a browser or Snow as a texting app. As soon as choices become present, I think people are going to move towards them, for sure!

Fabiana: Now a question more related to machine learning and to the companies that are investing in AI: how can these regulations be framed in the context of data collection? I guess that’s one of the major questions, or at least a lot of companies don’t think about it when, for example, setting up their data collection pipelines or even their data strategy.

Cat: That’s a lot to unpack.

I mean, one of the main principles behind all of this is data classification. The first step that all companies should be taking

is to list all of the data they’re collecting and then classify it into whether it is personal information or personally identifiable information. If it’s anything that you can use to identify an individual, whether as a single piece of information or a combination of information, that information has to be protected.

In some ways, technology is great, because you can use it to anonymize info. You could say: look, I’ve got 17 data points on an individual, and four of them are identifiable; if I strip those out, I still have 13 great data points that don’t identify anyone, and that information can then be fed into things like machine learning algorithms. But you have to pull the personal information out, because as soon as you start sticking that in there, again, it has to be protected: you have to ensure it’s being processed in a way the individuals have consented to. All of these regulations apply basically to identifiable information, and as soon as you can change that, everything becomes easier, because it’s not sensitive anymore and it’s not private!
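The workflow Cat describes, listing the data points you hold and stripping out the identifiable ones before the rest is fed to a model, can be sketched in a few lines of Python. The field names and the record below are invented for illustration, not taken from the interview:

```python
# Hypothetical set of fields classified as identifiable in step one
# (data classification); everything else is treated as safe to keep.
IDENTIFIABLE_FIELDS = {"name", "email", "phone", "postal_code"}

def strip_identifiers(record: dict) -> dict:
    """Return a copy of the record with identifiable fields removed."""
    return {k: v for k, v in record.items() if k not in IDENTIFIABLE_FIELDS}

record = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "phone": "555-0100",
    "postal_code": "K1A 0B1",
    "age": 34,
    "purchases_last_month": 7,
}

# Only the non-identifying data points remain for the ML pipeline.
safe = strip_identifiers(record)
print(safe)
```

As the conversation below makes clear, dropping the obviously identifiable fields is only the starting point: the remaining fields can still re-identify someone in combination.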

Fabiana: Yes, makes total sense! Another question I have is related to a study about sensitive information. The study says that, in the US, although you might hide, for example, the age, the name, and the gender of a person, you are still able to identify that person based on a few other pieces of information, for example the postal code combined with almost anything else about the person.

So let’s assume that we’ve removed all the PII, with the exception of the postal code. Is it possible for us to understand, in an automated way, whether we really are privacy compliant when we’ve only left the postal code but it is still possible to re-identify someone?

Cat:

The argument is that if you are able to re-identify them in any way, shape, or form, then it’s not anonymous!

I had that question once, when someone asked: what if I have 4 anonymous tables that, when I combine them, let me re-identify someone?

Then they’re not anonymous! By definition, once you’ve pulled that data out, there should be no way to reverse it to identify an individual. But you hit the nail on the head, which I love: people are always trying to automate this stuff, and there are amazing tools out there, don’t get me wrong, especially for big companies, that will pull personal information and try to classify it for you. But this postal code is a perfect example of how it’s situational. If you are dealing with a rural community where there’s literally one house at that postal code, then, like you just said, you’ve potentially identified that individual. Gender is usually not identifiable, but if you said “man or woman at this postal code”, and in that particular house there was only one man and only one woman, then you have identified them, and suddenly the postal code is not obscure enough to make an anonymous dataset! You really have to look at the context of your information. A great example I use often: there was a high school in British Columbia, in Canada, that had been using gender because it was unidentifiable, and then one of the children at that school came to identify as non-binary. So, where it would have said “here are the grade 12 chemistry marks: the average for girls, the average for boys, and the average for non-binary students”, all of a sudden that’s not an anonymous dataset anymore; you know exactly which individual it’s referring to!

So we have to consider the size of the dataset. There are a lot of things to look at, but the question is: is that information really anonymized?
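One way to make this point concrete is the idea behind k-anonymity (a technique Cat does not name here, but which formalizes her argument): a dataset is only anonymous if no combination of the remaining fields, the quasi-identifiers, singles one person out. This is a minimal sketch with invented records:

```python
from collections import Counter

def min_group_size(records, quasi_identifiers):
    """Smallest number of records sharing the same quasi-identifier values.

    A result of 1 means at least one person is uniquely identified by
    that combination of fields, so the dataset is not anonymous.
    """
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values())

records = [
    {"postal_code": "V0X 1W0", "gender": "M", "grade": 88},
    {"postal_code": "V0X 1W0", "gender": "F", "grade": 91},
    {"postal_code": "M5V 2T6", "gender": "M", "grade": 75},
    {"postal_code": "M5V 2T6", "gender": "M", "grade": 80},
]

# Postal code alone leaves groups of 2, but postal code combined with
# gender singles out the only man and the only woman at "V0X 1W0".
print(min_group_size(records, ["postal_code"]))            # 2
print(min_group_size(records, ["postal_code", "gender"]))  # 1
```

This also illustrates why the check is situational, as Cat says: the same field can be harmless in one dataset and identifying in another.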

Fabiana: So, in a sense, as you’ve mentioned, it’s not feasible to fully automate the identification of PII; it’s not a recipe. It’s not like a script that tells you that PII is these sorts of fields. There’s always contextual information that you can extract from an external database that can make your data not private anymore. Is that correct?

Cat: Yeah absolutely!

Fabiana: In a nutshell, let’s say that organizations are still able to collect data and be compliant, as long as they take all of this into consideration. Is that correct?

Cat: Yes. So first, you have to say: I’m only collecting the data I require; that’s the legal basis. This is the data I require to do what I do, not extra demographic data or extra other data. I’m only taking the data I actually need. Then you’re asking your user for consent to collect and process that data, so they’re aware the data is being used. And then, when you do use the data, you’re ensuring that, where possible, you anonymize it.

So, if you’re a doctor’s office and you have to collect health data in order to follow your patient, then yes, you need access to that personal information and it has to be identifiable. But if you’re a hospital and you want to run some kind of learning on symptoms, or you’re trying to run an algorithm in the background to better understand what’s happening with a particular symptom of COVID-19, then you have to find a way to anonymize the data, or the individual has to know that you’re using the data for that purpose.

Fabiana: Now that you mention COVID, and it’s a very good example: in Portugal, we had a situation where researchers were asking for access to COVID-related data from patients, and the question arose whether, as a matter of national security or urgency, access to this kind of sensitive data should automatically be approved, because it’s a matter more important than the privacy of an individual.

Cat: There’s a gray area! I know this has come up a lot: a gray area between what benefits the greater good and privacy. If you could stop a disease from spreading, if you knew someone had a disease and you could stop it from killing other people by knowing they had it, arguably it’s more important to know that they have it. I’m thinking not even of COVID, but way back in time, when people knew someone had the black plague. But with COVID: I had to get my laptop fixed, so I went to the mall to do that, and they took my temperature before I could even go in the store. It was unnerving, it was weird, because I was standing in the middle of the mall, and if I’d had a fever and they had turned me away, anyone in that mall would have had that information. From a privacy perspective I didn’t love that, but at the same time it’s for the greater good of the people. Right now, workplaces are installing infrared thermometers: in order to get into the office in the morning, you have to check your temperature. So yes, there is a line where you can say: look, I appreciate that this is private information, but it’s for the greater good; knowing that someone has tested positive is pretty crucial for the health and well-being of others. So there is definitely a line in there.

There are numerous lawyers who have written really good articles, so if you’re in a position with your company where you don’t know where this line is, or whether you should cross it, there are some really great outlines of when you can publicly disclose information in relation to this kind of thing and when you need to keep it private. For example, if an employee calls and says “I’ve got COVID, I’m not coming in”, do you tell everyone else in the office why that person is not there, or do you have to protect their privacy and not tell anyone?

Fabiana: Yes, makes sense! Because if the person had contact with others, for example, it might be good for others to know and take measures. Well, a few years have passed since these regulations came out; GDPR is from 2018. We could expect that some organizations, at least, would already have some maturity regarding data privacy and protection. Do you feel like we have achieved that maturity, or are we still on that path? And what are the most common errors when it comes to this subject?

Cat: I don’t think we’ve achieved that maturity. It’s interesting that when CCPA came out, which was in 2020, companies weren’t ready for it, even though most of those companies were serving European citizens and should already have been GDPR compliant anyway, and CCPA is very much like GDPR. I think there’s a scramble to check a lot of boxes. The biggest issue we see is, again, that these regulations are coming down from a legal perspective: under GDPR you’re supposed to assign a Data Protection Officer, and that DPO is supposed to oversee the implementation of the regulation.

But often the DPO is a lawyer, and you can get some highly technical lawyers for sure, but there are elements in GDPR where, especially in the bigger companies I’ve worked with, the bigger the company, the harder it is, because they’re siloed.

So the lawyer has to go to the development team and ask: hey, have you implemented Privacy by Design? Do you have encryption on the transfer of information? Have you anonymized datasets? And

they have to rely on the development team to just say yes or no, and then that DPO is checking off a box.

The issue is that, for me, everyone has a role to play in this. Product managers who are designing products need to understand these regulations, so that they’re not designing products that take sensitive data or more data than needed. The development teams, who are actually making the tech products, need to understand how best to protect that information and limit it, with encrypted or protected transfers and storage, and whether they are using real data in their tests. Customer-facing people need to understand it, because when a customer calls and says “Hey, I would like a copy of my data”, everyone needs to be able to answer that properly: are you a European citizen? Yes? Then that’s under GDPR, I will get you a copy. And finally, the CxO team needs to get this because, like I said before, teams are not being given the resources they need; they’re not being given the time or the budget to actually implement the privacy stuff properly.

The biggest problem I see is that one person (the DPO) is assigned responsibility for all of this!

They’re trying to basically herd cats, right? They’re trying to get all these different parts of the organization involved in doing their piece, and nobody really knows why or what they’re supposed to do. Smaller companies, I have seen, are more successful with this maturity model, because they’re not as siloed and they usually have fewer products, so it’s much easier for them to change a product; you can get all of the stakeholders in one room, explain why they need to do this, and they’re all on board. In a bigger company it is very, very difficult, and what I’m seeing is that we’re getting bigger fines from the data protection authorities; that might be what we need to push some of these bigger companies in the right direction. But it’s a hard, hard job for whoever is assigned the DPO role, because you have to get so many people involved to get this right!

Fabiana: Just like you said, this doesn’t sound like a one-person job, especially in big companies, where you have a lot of data spread across the organization. Without wanting to incur the risk of asking for a checkbox: besides all that you have exposed, are there any ways or tools that can be used to measure the degree of privacy in organizations, so they can become more aware of the level they are at?

Cat: I do feel like there are lots of tools out there. Unfortunately, there’s no one way of doing this. ISO released 27701, which is a privacy addendum to their cybersecurity standard, and it is very GDPR-like. Some companies have to be ISO compliant because that’s the standard they set for themselves. There isn’t a real checklist for it, and it’s long, a really long read; I don’t recommend it unless you want to fall asleep. But it might be good for companies that are aiming for ISO to say: okay, now there’s this privacy piece and I have a little more guidance, and then they can measure themselves that way.

There is no such thing right now as a certification for GDPR. So, if there’s a company out there telling you they will certify you as GDPR compliant, that’s a lie, because it doesn’t exist!

Whereas with ISO, at some point there will be auditing. You can’t audit it right now, you can just get the addendum to 27001, but once that auditing comes in, companies will have to say: okay, I need to measure myself. That might be the first standardized measurement.

I had to come up with my own checklist; I have a 14-page checklist I use with companies, because if you go online and type in “GDPR checklist”, you get ten things, ten checkboxes: Do you? Do you? Do you? And you can just say yes. But no, it’s not that simple. You can’t just say: Do you use Privacy by Design? Yes. Can you access users’ data? Yes. Okay, great, but can you access it under load? How are you authenticating the user when they ask for their data? There are so many underlying questions behind each checkbox.

That’s my concern: there is no real standardized way to measure it right now. Whoever your DPO is really needs to pull up GDPR

and actually go through the entire document. It’s not a simple process; it’s an important process, but it’s not a one-day check, check, check kind of thing.

Fabiana: It’s not like “I just woke up, I will be GDPR compliant, check all the boxes, and that’s it”. So it makes me wonder about something more related to the implementation of machine learning: a lot of companies are just collecting data for the sake of collecting it, because one day they might need it. That seems a bit contradictory to what GDPR, or any similar regulation, says: you should collect only the data that you need, not data you might need one day. But for the process of doing machine learning, it’s important for the data science team to have access to as much data as possible, let’s say. From that perspective, do you think machine learning can still co-exist with data privacy and all the existing regulations?

Cat: I totally understand the issue here. Again, if you are specifically asking for users’ data and they know how it’s being used: I mean, if you look at Facebook, the number of data points they have is insane, but their terms and conditions pretty much say that if you’re agreeing to use our free service, the payment for the service is that we are collecting all of this data about you. They’ve pretty much laid out everything they’re collecting. But, like you said, under GDPR many companies don’t have a legal basis, they don’t have a real product reason to collect a lot of the data they’re collecting, and that’s going to be the issue.

I think with machine learning, what people are going to have to do is find better ways to make the data less identifiable. So, again, it really depends on your product and your service.

Instead of postal code, can we use city? And instead of birthdate, can we use an age range? If there are ways you can generalize the data and still make the machine learning work, then that’s kind of the way to go. But it will always depend on the dataset: with smaller datasets with sensitive information, you have to be very careful not to make them re-identifiable.
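The two generalizations Cat suggests can be sketched like this. The lookup table, field names, and records are hypothetical, and the age calculation is deliberately coarse (year difference only), since a ten-year range absorbs the imprecision:

```python
from datetime import date

# Hypothetical mapping from the leading postal-code characters to a city.
POSTAL_TO_CITY = {"K1A": "Ottawa", "M5V": "Toronto"}

def generalize(record: dict, today: date) -> dict:
    """Replace postal code with city and birthdate with a ten-year age range."""
    out = dict(record)
    # postal code -> city (first three characters index the lookup here)
    out["city"] = POSTAL_TO_CITY.get(out.pop("postal_code")[:3], "unknown")
    # birthdate -> ten-year age range, e.g. 32 -> "30-39"
    age = today.year - out.pop("birthdate").year
    decade = age // 10 * 10
    out["age_range"] = f"{decade}-{decade + 9}"
    return out

r = {"postal_code": "M5V 2T6", "birthdate": date(1988, 6, 1), "score": 0.8}
print(generalize(r, today=date(2020, 11, 12)))
# {'score': 0.8, 'city': 'Toronto', 'age_range': '30-39'}
```

The model still gets a usable location and age signal, but neither field points to a single household or birthday, which is exactly the trade-off being described.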

Fabiana: When thinking about healthcare, for example, especially when dealing with clinical trial data provided to third parties, it’s the same as the example you gave about the school. This has been a very interesting conversation so far. In this podcast we have a section we call the “Open Mic”, where I usually ask our guests to share an experience as a professional deeply involved with machine learning, or, in your case, someone deeply experienced in technology and data privacy.

The open mic

Cat: There are lots of things. Privacy is the foundation; I know it’s a pain and I know it costs money. One of the good examples I have is from when I was working at BlackBerry years ago. I was on the original calendar code, and a lot of things in there were hard-coded, like dates and how we did recurring appointments. Because of that, we actually had Daylight Saving Time hard-coded, and, I don’t know if you remember, probably a good fourteen years ago now, Daylight Saving Time changed, and because everything was hard-coded, we had to go to 10 different old versions of the code and update them. I was the team lead managing that code, and I kept pleading with our managers and directors to fix the calendar code, because we just kept applying bandages to it; instead of taking apart a hill and rebuilding it, we kept turning it into a mountain. It was always: if that doesn’t work, let’s add this new thing to fix it. It was taking people forever to make changes to this code because it was a disaster, and we kept getting pushback from the company: we didn’t have time to take it apart and rebuild it, because we had to keep moving forward on the product. A few years later I ended up director of the team, so I had the authority to do it, and we took it apart and rebuilt it. It took more time and money to rebuild it years later than it would have at the beginning, but once we did it, it saved more time and money than the rebuild cost. So, again, I’m big on foundations; I always use the foundations of houses, because it’s the same thing: nobody cares about the foundation of your house. You will never invite someone over and say “ooh, come check out the foundation of my house, it’s really cool”. You will say “look at the new paint color, look at my new couch”.
It’s the same thing with products. Nobody ever says “look at how awesome my architecture is and how Privacy by Design is embedded in it”; nobody does that. What they will do is look at the cool features, how awesome the product is, and the cool things it does.

Spending the time, energy, and money to fix your foundation today will come back to you later. If you don’t have privacy at the base of your product’s foundation, it will be harder to maintain: if you’re collecting too much information, you have to protect more information; if you don’t encrypt it in transit, you are more likely to get breached and have a privacy incident. So the risk, and what it is going to cost you, will be higher than if you take the time and fix it right now.

I cannot stress enough that, even though it’s going to take time and money, it is worth fixing privacy today! Privacy by Design, which was actually created by Dr. Ann Cavoukian, who is Canadian and worked at the Ontario privacy commission, is part of the GDPR and also of the LGPD; it’s at the base of a lot of privacy regulations and consists of 7 principles. Essentially, if you follow those 7 principles, you’ll ensure your system has a privacy-first foundation and that you’re putting the individual first.


Fabiana Clemente

Passionate about data. Striving for the development of data privacy solutions while unlocking new data sources for data scientists at @YData