SAFE TRAINING PRINCIPLES IN AI | TOWARDS AI

How To Train Your AI Dragon (Safely, Legally And Without Bias)

How to source data correctly for AI algorithms and reduce bias

Charles Towers-Clark
Oct 18 · 6 min read
Image caption: Just like the dragons in DreamWorks’ 2010 film ‘How to Train Your Dragon’, AI systems are often … (DreamWorks Animation)

Untrained dragons can cause a lot of damage. Likewise, as AI systems spread further and have more influence over our lives, it’s getting far more important to make sure they’re properly trained. Bias can creep into the reasoning of AI very easily, either via datasets that are not diverse enough or through irrelevant data attached to viable data points, leading to flawed results and in some cases prejudiced or dangerous conclusions.

Despite regulations like GDPR to protect the privacy of our data, personal consumer data is increasingly being used by companies to improve services or to gain customer insight. Ironically, these regulations also make it more difficult for companies to gather enough data to train an AI system or to prove how their AI reaches its decisions (an impossible task for many deep learning systems).

Therefore, as AI develops and its abilities grow, collecting useful data without breaching data regulations will be crucial to ensure that AI can make the right decisions and that personal and sensitive data isn’t used in the wrong context.

Safely sourcing data

With so much data flowing through cyberspace, companies are employing ever more granular metrics to measure our behavior and improve their services. However, the General Data Protection Regulation (GDPR) only allows companies to collect a person’s personal information with that person’s explicit consent, or “if it is necessary for the purposes of legitimate interests pursued by the company,” says Sebastian Weyer, CEO of data anonymization company Statice. Because Article 6 of the GDPR (which outlines the requirements for compliant data processing) leaves the phrase ‘legitimate interests’ open to interpretation, companies “are safest when they obtain direct consent from data subjects,” according to Weyer.

However, due to an air of mistrust around companies’ usage of our data, Weyer makes the point that the majority of customers “will not consent to the use of their data for product tests and innovation,” which limits the amount of useful data available to train AI and improve products. This “lack of education around the importance of data in building personalized products and services” can stifle AI innovation, Weyer argues, and fears around data breaches and commercial misuse could, in fact, restrict the ability of AI to tackle pressing societal issues.

Companies building AI products and services also need to be transparent about their use of data for automation, which is not always as easy as it sounds.

Removing all identifying information from a dataset, known as data anonymization, is therefore incredibly important when collecting data, as it allows usable information to be gleaned from a dataset without violating data privacy regulations. Statice, for example, creates a synthetic dataset that has the same structural and statistical properties as the original but without any identifying information attached. Proper data anonymization not only helps meet GDPR requirements when collecting data but also helps to accurately train an AI system. “If the data is not correctly anonymized before being used to build machine learning models, the learned patterns could involve sensitive information,” says Weyer. This is because algorithms work by recognizing patterns in data: if there is extraneous data in the dataset, such as a person’s age, race, or address, then patterns could be drawn between those factors rather than the relevant data.
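Statice’s synthetic-data approach is proprietary, but the basic idea of stripping direct identifiers and coarsening sensitive fields before training can be sketched in a few lines. This is a minimal illustration only; the field names (`name`, `email`, `age`, `purchases`) and the choice of ten-year age bands are illustrative assumptions, not details from the article:

```python
# Minimal sketch: drop direct identifiers and coarsen a quasi-identifier
# (exact age) before a record is used for model training.
# All field names here are hypothetical examples.

DIRECT_IDENTIFIERS = {"name", "email", "address"}

def age_band(age):
    """Generalize an exact age into a 10-year band, e.g. 34 -> '30-39'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def anonymize(record):
    """Return a copy of the record with direct identifiers removed
    and the age field generalized into a band."""
    cleaned = {k: v for k, v in record.items()
               if k not in DIRECT_IDENTIFIERS}
    if "age" in cleaned:
        cleaned["age"] = age_band(cleaned["age"])
    return cleaned

record = {"name": "Jane Doe", "email": "jane@example.com",
          "age": 34, "purchases": 12}
print(anonymize(record))  # {'age': '30-39', 'purchases': 12}
```

Real anonymization goes much further than this (k-anonymity checks, differential privacy, or fully synthetic data as Statice produces), but even this simple step removes the most obvious identifying fields before any pattern-learning begins.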

Training data

Aside from properly anonymizing sensitive information, getting the right training data for a particular algorithm is vitally important; in fact, data can be seen as the most important part of an AI system. Datasets that are incomplete, have over- or under-represented elements, or contain too much irrelevant information can easily skew an AI system’s reasoning. This has been notably demonstrated in flawed criminal recidivism systems that suggested African-Americans were more likely to re-offend than their white counterparts, due to historically biased training data. But it isn’t easy to remove bias from a dataset, partly because of issues such as historical inequality or a lack of diversity in the data. “You will almost always start with an overrepresentation of some elements and underrepresentation of others,” says Leila Janah, founder and CEO of Samasource, and without proper testing and review, “data sets that are not inclusive and diverse can lead to issues with bias involving race, gender, and culture.”

Issues with bias are not just skin-deep, however, and for image recognition systems like those in self-driving cars, having diverse training data is a prime safety concern. “The data used to train an algorithm is a large component in ensuring it is able to appropriately identify a pedestrian from a stop sign and a stop sign from a tree,” says Janah. For example, a dataset that under-represents people with darker skin tones could lead a self-driving vehicle to be less likely to ‘see’ a pedestrian with darker skin crossing the road. While this may seem an extreme example, it is pertinent to think about the importance of representative datasets now, as AI is used in more and more mission-critical applications. Employing a diverse team to annotate training data (as Samasource does) helps to ensure that all relevant metrics are accounted for, that cultural bias does not inadvertently enter the system, and that datasets are representative of the general population.

Correctly annotated and properly anonymized training data is also just good practice when training AI.

The devil’s in the data

While an AI company’s performance is often put down to the complexity of its algorithm, the power to make or break an AI system lies with the data it is trained on. Mishandling sensitive data can not only lead to a PR nightmare but can also fundamentally flaw an algorithm’s reasoning by allowing it to draw patterns between irrelevant data points. Notwithstanding the legal requirement to properly anonymize data (at least in the EU), it is also good practice to ensure that identifying data is removed from a dataset before training an algorithm, so that bias does not creep in.

Our lives are becoming more automated, and the majority of people now interact with AI systems on an hourly basis whether we are aware of it or not. In this context, we must remain vigilant about protecting an individual’s right to data privacy, and ensure that discriminatory AI is not set loose upon the world due to biased training data. AI is getting more powerful every day, and proper data management and assessment will be the check and balance against the fateful consequences of poorly trained systems.


Originally published at https://www.forbes.com.

Towards AI

Towards AI is the world’s fastest-growing AI community for learning, programming, building and implementing AI.

Charles Towers-Clark

Written by

CEO of Pod Group (@PodGroup_IoT) Author of “The WEIRD CEO”. Advocate of Employee Self Responsibility, #FutureOfWork and #AI. Contributor: http://Forbes.com

