To deliver responsible AI systems, we need to think about issues that might arise during the development process. One potentially significant issue stems from the data collected to train the model.
Data collection can carry serious risks for the individuals who contribute their personal data. This is why laws and regulations on data collection and use are emerging. For example, the European Union’s General Data Protection Regulation (GDPR), adopted in 2016 and enforceable since May 2018, is meant to protect individuals’ privacy and prescribes how organizations must handle personal data. It’s predicted that most countries will eventually adopt GDPR-style rules or pass similar legislation of their own. Broadly, there are three key features of data protection: identifiability, data minimization, and notice and consent.
Identifiability refers to the requirement that data collectors reduce and safeguard the identifiable components of data as much as possible. This is done with de-identification techniques such as anonymization, which removes identifying information altogether, or pseudonymization, which replaces the subject’s identity with an alias while a separately retained key allows the mapping to be reversed if necessary. To safeguard data, keep three things in mind: a) data encryption, which encodes information so that it can only be accessed with permission; b) secure servers; and c) storage location. For instance, storage in the cloud offers many benefits, but in what nation and under what laws is the cloud operating?
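To make pseudonymization concrete, here is a minimal Python sketch of the idea described above: each identifier is swapped for a random alias, and a lookup table (the retained "key") allows the mapping to be reversed. The class name `Pseudonymizer`, the `subj_` prefix, and the sample records are all hypothetical, and a real system would keep the lookup table encrypted and stored separately from the pseudonymized data.

```python
import secrets


class Pseudonymizer:
    """Swap identifiers for random aliases and keep a table to reverse them.

    Illustrative only: the lookup table is the "key" and would need to be
    stored securely and separately from the pseudonymized records.
    """

    def __init__(self) -> None:
        self._alias_by_id: dict[str, str] = {}
        self._id_by_alias: dict[str, str] = {}

    def pseudonymize(self, identifier: str) -> str:
        # Reuse the same alias for a repeated identifier so records
        # belonging to one subject can still be linked together.
        if identifier not in self._alias_by_id:
            alias = "subj_" + secrets.token_hex(8)
            self._alias_by_id[identifier] = alias
            self._id_by_alias[alias] = identifier
        return self._alias_by_id[identifier]

    def reidentify(self, alias: str) -> str:
        # Reversal is only possible for whoever holds the lookup table.
        return self._id_by_alias[alias]


# Made-up sample data for illustration.
raw_records = [
    {"name": "Alice Example", "heart_rate": 62},
    {"name": "Bob Example", "heart_rate": 71},
    {"name": "Alice Example", "heart_rate": 65},
]

pseudonymizer = Pseudonymizer()
pseudonymized = [
    {"subject": pseudonymizer.pseudonymize(r["name"]), "heart_rate": r["heart_rate"]}
    for r in raw_records
]
```

The random alias reveals nothing about the original name, so the pseudonymized records can be shared more widely than the lookup table itself.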
It’s also important to distinguish between different classes of data, because the class determines how careful we need to be. Non-personal data, like aggregated statistics, generally falls outside data protection rules, provided individuals can’t be re-identified from it. With personal data, which might include someone’s name, location, IP address, and so forth, we need to be more careful. And with sensitive data, like genetic, biometric, and health data, you’ll need to exercise extreme caution.
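One way to put these classes to work is to tag each field in a data set and treat a record as strictly as its most sensitive field. The sketch below assumes a hand-maintained mapping; the field names and the `classify_record` helper are hypothetical, and the choice to treat unknown fields as personal data is a cautious default rather than a legal rule.

```python
from enum import IntEnum


class DataClass(IntEnum):
    """The three classes from the text, ordered by required caution."""
    NON_PERSONAL = 0
    PERSONAL = 1
    SENSITIVE = 2


# Hypothetical field-to-class mapping; a real one would come from a
# data inventory reviewed with legal or privacy staff.
FIELD_CLASSES = {
    "regional_average_age": DataClass.NON_PERSONAL,
    "name": DataClass.PERSONAL,
    "location": DataClass.PERSONAL,
    "ip_address": DataClass.PERSONAL,
    "genetic_profile": DataClass.SENSITIVE,
    "fingerprint": DataClass.SENSITIVE,
    "diagnosis": DataClass.SENSITIVE,
}


def classify_record(field_names: list[str]) -> DataClass:
    """Return the strictest class among the given fields.

    Unknown fields are assumed to be PERSONAL (a cautious default).
    """
    return max(
        (FIELD_CLASSES.get(name, DataClass.PERSONAL) for name in field_names),
        default=DataClass.NON_PERSONAL,
    )
```

For example, `classify_record(["name", "diagnosis"])` evaluates to `DataClass.SENSITIVE`, because a single sensitive field raises the level of care required for the whole record.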
Minimization is another feature of data protection. It aims to limit data collection and the duration of storage to only what’s required to fulfill a specific purpose. Companies used to want to retain data indefinitely. But as the influx of data has increased dramatically and the cost of storage has become an issue, data minimization has become more important, both practically and in terms of laws and regulations. Questions to ask are:
- How long will I need the data to achieve the purpose?
- Is there unnecessary data that can be deleted?
- How often will I review and delete what isn’t needed?
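The questions above can be turned into a simple, regularly scheduled retention check. The following Python sketch assumes each record carries a `purpose` and a `collected_at` timestamp; the purposes, retention periods, and function names are hypothetical, and real retention periods depend on your own legal and business requirements.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention periods per purpose (assumed values).
RETENTION_BY_PURPOSE = {
    "support_ticket": timedelta(days=365),
    "analytics_event": timedelta(days=90),
}


def is_expired(record: dict, now: datetime) -> bool:
    """True if the record has outlived the retention period for its purpose."""
    limit = RETENTION_BY_PURPOSE.get(record["purpose"])
    if limit is None:
        # No documented purpose means there is no stated reason to keep it.
        return True
    return now - record["collected_at"] > limit


def purge_expired(records: list[dict], now: datetime) -> list[dict]:
    """Return only the records that may still be kept."""
    return [r for r in records if not is_expired(r, now)]


# Made-up sample data with a fixed "now" so the result is reproducible.
review_time = datetime(2024, 6, 1, tzinfo=timezone.utc)
stored_records = [
    {"id": 1, "purpose": "analytics_event",
     "collected_at": datetime(2024, 5, 1, tzinfo=timezone.utc)},
    {"id": 2, "purpose": "analytics_event",
     "collected_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 3, "purpose": "undocumented",
     "collected_at": datetime(2024, 5, 30, tzinfo=timezone.utc)},
]
kept_records = purge_expired(stored_records, review_time)
```

Here only record 1 is kept: record 2 is older than the 90-day period assumed for analytics events, and record 3 has no documented purpose. Running a check like this on a schedule is one way to answer the "how often will I review" question.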
Related to data minimization is the right to be forgotten. Under the GDPR, data subjects have the right to request that their personal data be erased without undue delay. But this right can be overridden in some cases, for example when the data is needed to comply with other legal obligations or to achieve some purpose in the public interest. So what does this mean for data collectors? How will it change your workflow? You’ll want to provide subjects with clear information and practical ways to make an erasure request.
Notice and consent aims to tell subjects how their data will be used and to get their consent, so that they can choose whether or not to be involved. The decision has to be informed, meaning the subject has sufficient knowledge and comprehension to make a decision; this rules out lies, deceit, or partial disclosure. The decision must be voluntary, meaning the subject freely chooses to give consent; this rules out coercion or inappropriate pressure or influence. And the subject must be competent, having the decisional capacity required to offer consent; this rules out consent from children or from adults deemed mentally incompetent.
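The three conditions can be recorded alongside each consent so that incomplete consent is caught before any data is used. This is only a sketch: the `ConsentRecord` type and helper names are hypothetical, and whether each flag is actually true has to be established by a real notice-and-consent process, not by the code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsentRecord:
    """One subject's consent, with the three conditions from the text."""
    informed: bool   # subject received full, truthful disclosure
    voluntary: bool  # no coercion or inappropriate pressure
    competent: bool  # subject has the capacity to consent


def missing_conditions(consent: ConsentRecord) -> list[str]:
    """List the conditions that are not satisfied."""
    return [
        name
        for name in ("informed", "voluntary", "competent")
        if not getattr(consent, name)
    ]


def is_valid_consent(consent: ConsentRecord) -> bool:
    """Consent counts only if all three conditions hold."""
    return not missing_conditions(consent)
```

For instance, `missing_conditions(ConsentRecord(informed=True, voluntary=False, competent=True))` returns `["voluntary"]`, which tells you which part of the consent process needs to be repeated.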
Now we’ll turn to the related question of how the data will be collected. As you’re probably aware, good, representative data is crucial, so it’s important to anticipate the possible disadvantages of a chosen collection method. As Kate Crawford, principal researcher at Microsoft and professor at NYU Tandon School of Engineering, has said,
“We need to ask which people are excluded. Which places are less visible? What happens if you live in the shadow of big data sets?”
Problems with data collection are closely related to issues of bias. Finally, collecting the right data can be complicated, costly, and time-consuming. That’s why it’s especially inspiring to see movements in the ML community towards sharing open source, publicly available data sets. This makes working on ML easier than ever before, and the increased accessibility fosters innovation and healthy competition. That, in turn, is how we’ll end up with better products that improve the world we live in.