Excellent data quality is as critical to the success of your AI solution as software quality is to your mission-critical programs. Data skills are a must-have on your AI journey, and they are needed to develop ethical AI solutions.
In today’s discussion of Artificial Intelligence, a lot is being said and written about neural networks and the AI platforms that provide neural network capabilities. However, we believe a lot more discussion is needed about the data that fuels AI. It is the data that trains the artificial intelligence and gives it its purpose. The AI acts upon data, and the quality of the data defines the quality of the Artificial Intelligence. A lot more needs to be learned about how to work with data well, and more discussion and research are needed to find ways to define, collect, implement and control data streams to get positive and reliable results in AI. Only if you use correct data, and enough of it, can the Artificial Intelligence take valuable decisions with a positive impact for your users and for your business. Any bias in the learning data will leak into your AI system’s decisions, making them incorrect, inaccurate and inconsistent. Wrong data can easily lead to ethically wrong and dangerous AI. In this article we describe how to work with data in AI.
The first stop in any AI project should be to define the purpose of the project.
- Why is the project needed?
- What are you trying to achieve?
- What decisions do you want the AI to make?
- Why is AI the right choice for solving this problem and what are the alternatives?
The answers to these questions will define the impact of using AI on your project, allow you to measure the return on your investment (ROI) and ensure that you are solving a real problem of significance. This is necessary to justify the resources and effort required. Articulating the purpose will also define the data and the data quality that is needed to train and operate your AI. Our focus in this article is on using AI to take a predefined decision, so the decision taken defines the data needed. However, there are other types of AI projects, not covered in this article, where you use AI to understand your data, discover patterns in existing data or extract knowledge from your data.
With the definition of the purpose of your project, an important first step is to define the data that is needed and review the available data.
- How much data is available and what is the quality?
- What data is missing and how can data be generated or added to your project?
- How do you ensure that the data is not biased?
- Are there ethical considerations in your project?
Be sure to label your project and your data correctly. An AI that is trained to recognize cats in photos will not be able to recognize horses. If this is not labeled correctly, your AI can be misused or misinterpreted and deliver the wrong results. This is a very important factor in getting your AI to behave in the most beneficial and ethical way.
An example illustrates how to work with data
Let’s take a simple example of speech recognition to describe some basic concepts and to lay out an approach to working with data in AI. Here we develop a podcast app for the elderly. Our target audience is older people who have a lot of time, whose eyesight makes reading difficult and who are not very technology savvy. We envision people controlling the app by voice through a smartphone or a smart speaker, using a defined set of commands. We want them to enjoy hours of entertainment without having to fiddle with buttons and controls. Understanding voice is a good area for AI, as the data can be very unstructured, depending on the words used, the accents and the clarity of voice, to name just a few factors. The goal of our AI is to understand what elderly users say, interpret the meaning and convert it into commands that trigger a function in the app.
The first step is to define what data is needed. We come up with multiple ways of saying each command, as humans have various ways of saying things to achieve the same goal. The better and the more thorough we prepare this, the better the experience will be for our users. When the app is in use, the way to record the user input (and thus create the data) will be through the microphone of our selected hardware device (sensor).
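As a minimal sketch of what such a definition could look like, the snippet below writes down a command set with several phrasings per command. The commands, the phrasings and the helper names are all hypothetical and only serve as an illustration.

```python
# Hypothetical command set for the podcast app: each command lists
# several ways a user might phrase it. All names are illustrative.
COMMAND_PHRASINGS = {
    "start": ["start", "play", "begin", "go on"],
    "stop": ["stop", "pause", "hold on", "that is enough"],
    "next": ["next", "skip", "next episode"],
}


def normalize(text: str) -> str:
    """Lowercase the text, drop punctuation and collapse whitespace."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


# Reverse index from a normalized phrasing to its command.
PHRASE_TO_COMMAND = {
    normalize(phrase): command
    for command, phrases in COMMAND_PHRASINGS.items()
    for phrase in phrases
}


def lookup_command(transcript: str):
    """Return the command for an exactly known phrasing, else None."""
    return PHRASE_TO_COMMAND.get(normalize(transcript))


matched = lookup_command("  Hold on! ")
unmatched = lookup_command("sing a song")
```

An exact lookup like this is only a way to specify the data we want to collect. Handling the variations that are not on the list is what the trained AI is for.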
Now we need to find a way to understand the speech data. Programming this in a traditional way is too complex and error-prone, so we will deploy AI to understand what the user is saying. The microphone will collect the voice data from the users and provide the data stream that feeds our AI. The AI will use the voice data to decide what the user wants to do and then execute the corresponding command in our app. To do so, we need to train our AI to understand the voice data, especially that of the elderly and their particular way of speaking.
To train the AI, we need to acquire training data and we need to label it (a recording of a person saying “start” is labeled “start”). We need a lot of people saying “start” in their own ways, accents and tones of voice. The better our training data and the more we train the AI, the better our AI will perform. How can we obtain training data for our specific use case? There are multiple ways to obtain labeled training data, and our first job is to think about the most effective way of obtaining it for our project. Let us look at some possibilities.
- We can record the voice samples ourselves. To do so, we can visit elderly people and ask them to speak our defined commands in various variations. For each recording, we label the command, matching the voice data to the label. This can be quite time consuming and the results are limited to the type of people we record.
- We can search for YouTube videos in which elderly people speak, find the relevant sections, cut them out and label them accordingly.
- We could look for audiobooks read by older people. Here we might get access to both written and audio files. This will allow us to write a script searching for the right section and then we can identify and label those sections accordingly.
- We can also look for data brokers that can provide us with the defined labeled data.
- We can ask data service providers to create this data for us.
- Many countries with low labor costs now create data services that can be used for classification and AI.
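Whichever source we choose, the result is a set of recordings that each carry a label. The sketch below shows one possible way to keep that pairing in a simple CSV manifest. The field names, file paths and sample values are assumptions made for illustration, not a prescribed format.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class LabeledSample:
    audio_file: str   # path to the recording (hypothetical)
    label: str        # the command that is spoken
    source: str       # how the recording was obtained
    speaker_age: int
    accent: str


FIELD_NAMES = [f.name for f in fields(LabeledSample)]


def write_manifest(samples, handle):
    """Write one CSV row per labeled recording."""
    writer = csv.DictWriter(handle, fieldnames=FIELD_NAMES)
    writer.writeheader()
    for sample in samples:
        writer.writerow(asdict(sample))


def read_manifest(handle):
    """Read the manifest back into LabeledSample objects."""
    return [
        LabeledSample(
            row["audio_file"], row["label"], row["source"],
            int(row["speaker_age"]), row["accent"],
        )
        for row in csv.DictReader(handle)
    ]


# Illustrative entries; the values are made up.
samples = [
    LabeledSample("rec/0001.wav", "start", "own recording", 78, "northern"),
    LabeledSample("rec/0002.wav", "stop", "data provider", 83, "southern"),
]
buffer = io.StringIO()
write_manifest(samples, buffer)
buffer.seek(0)
loaded = read_manifest(buffer)
```

Keeping the source and some speaker attributes next to each label makes the later quality and bias checks much easier.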
Once you have collected this data it is time to prepare it for the AI to learn from it. Before we do so, we need to ensure our data is sufficient and it has the right quality.
- Do we have numerous ways of saying things?
- Do we have different tones and genders of voices?
- Do we have variations of accents?
- Is the data clear of background noises?
All of this will have an impact on our AI quality. We analyze our data thoroughly, then make it consistent and convert it into the right format. We reduce and clean the data (ensuring that we only have the relevant sections and that background noises are filtered out), decompose the data (deciding whether we want words or sentences) and rescale the data (to ensure all recordings have the same volume). In preparing our training data, each of those steps is considered and executed with great care. With this data we can then train our AI.
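Two of these steps can be sketched in a few lines, assuming a recording is already available as a list of samples between -1.0 and 1.0. The threshold and the target peak are arbitrary illustrative values, and real projects would normally use an audio library rather than plain lists.

```python
def trim_silence(samples, threshold=0.02):
    """Drop leading and trailing samples quieter than the threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []  # the whole recording is below the threshold
    return list(samples[loud[0]:loud[-1] + 1])


def peak_normalize(samples, target_peak=0.9):
    """Scale samples so the loudest one has magnitude target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence or empty input: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


# A quiet, made-up recording with silence at both ends.
recording = [0.0, 0.01, 0.1, -0.2, 0.05, 0.0]
trimmed = trim_silence(recording)
normalized = peak_normalize(trimmed)
```

Peak normalization is only one simple way to bring recordings to a comparable volume. It is used here because it is easy to verify by hand.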
Once our AI is trained well, some of those data-managing functions (like rescaling) can be automated, as we are feeding the data into our AI in real time. But for training data we recommend doing it with great care, step by step. In many projects, the initial training of AI is quite a manual process.
Preparing the data is a lot of work and you should be very careful to get it right, since your AI will only learn from the learning data that you provide. Poor data quality will mean poor AI performance. If your data is biased, your AI will be biased and will take the wrong decisions. Besides quality, you also have to consider ethics. If you only include male voices in your training data, your AI might misunderstand women. If you only include some accents, other accents might not be understood correctly. This might not be a big problem for our podcast app, but it definitely is a big problem if life-critical decisions are based on AI, as in an emergency app.
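A simple first check for this kind of bias is to count how the training recordings are spread across speaker groups. The sketch below assumes each recording carries metadata such as gender and accent. The attribute values and the 20% minimum share are illustrative assumptions, not recommended thresholds.

```python
from collections import Counter


def coverage_report(samples, attribute, expected=(), min_share=0.2):
    """Count samples per value of `attribute` and flag thin groups.

    `expected` lists values that should appear even if no sample has them.
    """
    counts = Counter(sample[attribute] for sample in samples)
    for value in expected:
        counts.setdefault(value, 0)
    total = len(samples)
    report = {}
    for value, count in counts.items():
        share = count / total if total else 0.0
        report[value] = {
            "count": count,
            "share": share,
            "underrepresented": share < min_share,
        }
    return report


# Hypothetical metadata for ten recordings.
samples = (
    [{"gender": "male", "accent": "northern"}] * 7
    + [{"gender": "male", "accent": "southern"}] * 2
    + [{"gender": "female", "accent": "northern"}] * 1
)

gender_report = coverage_report(samples, "gender", expected=("female", "male"))
accent_report = coverage_report(
    samples, "accent", expected=("northern", "southern", "western")
)
```

Here the report would flag female voices and the missing accent as underrepresented. A count like this only reveals gaps in the attributes we thought to record, so it complements a careful review rather than replacing it.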
Training the neural network
Once the training data is prepared, it is time to train the neural network. For this we set up the neural network of our choice. There is a growing number of AI platforms from market players like IBM, Amazon, Google and Microsoft. Many already offer prepared services for speech data, and some services are pre-trained to a certain extent. Feeding labeled training data is simple and does not require a lot of time. Once this is done, we are ready to test the quality of our AI. We send unlabeled data and observe our AI taking decisions. It will typically give us a confidence level with each decision. Comparing the decisions to the real, known data gives us a measure of the correctness of our trained AI and an indication of whether it is behaving correctly or whether we need more and better data to train it. In our simple example, this is a process that requires a lot of human intervention, attention and effort. We have seen in a lot of AI projects that the time and effort for training the AI should never be underestimated.
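The comparison against known data can be sketched as follows, assuming we have collected the true label, the predicted command and the reported confidence for a set of held-out recordings. The result values and the 0.8 confidence threshold are made up for illustration, and no specific platform API is implied.

```python
def evaluate(results, confidence_threshold=0.8):
    """Summarize test results given as (true, predicted, confidence)."""
    total = len(results)
    if total == 0:
        raise ValueError("no test results to evaluate")
    correct = sum(1 for true, predicted, _ in results if true == predicted)
    # Decisions the app would act on without asking the user again.
    accepted = [(t, p) for t, p, c in results if c >= confidence_threshold]
    accepted_correct = sum(1 for t, p in accepted if t == p)
    return {
        "accuracy": correct / total,
        "accepted_share": len(accepted) / total,
        "accepted_accuracy": (
            accepted_correct / len(accepted) if accepted else None
        ),
    }


# Hypothetical outcomes for four held-out recordings.
results = [
    ("start", "start", 0.95),
    ("stop", "stop", 0.90),
    ("next", "stop", 0.55),
    ("start", "start", 0.60),
]
summary = evaluate(results)
```

Looking at the accuracy of the confident decisions separately helps to decide whether the app should act directly or ask the user to repeat the command.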
AI in operation and improving over time
Once the initial training of the AI is done, we can use the AI in our app and the users can start interacting with it. The quality of the AI will depend on the quantity and the quality of our training data. It often makes sense to set up the application in such a way that the AI improves over time. Note that AI only learns with feedback, so we have to build a feedback loop into our app for the AI to continue to learn from real usage data. This is another important step that is not always easy to achieve, so let us look at some options for our example. One way of getting feedback on the decisions of the AI is to ask the user to engage. In our podcast app, we could create a feedback button that users press if the app does not understand them correctly. This information could be routed to the data team, which then looks at the data and feeds it into the AI for training (possibly creating additional variations of the data for optimal training). If the users are really engaged, they could also do the training and label their intentions themselves. This is often done with engaged early users (lead users), who have a high motivation to improve an app or a service. Doing this with mainstream users is not recommended, as they simply expect the app to work and get tired if they are continuously asked to improve a core function of it. Alternatively, your data team could listen to the speech data of the users to evaluate where the AI needs to improve, label the data and use it to optimize the AI based on real usage data. This task can also be outsourced to a growing sector of data service companies.
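The feedback button described above could be backed by a small queue like the one sketched here: the app reports a misunderstood recording, the data team adds the correct label, and only labeled items are handed on for retraining. The class, method names and recording IDs are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FeedbackItem:
    recording_id: str
    predicted_command: str
    corrected_command: Optional[str] = None  # filled in by the data team


class FeedbackQueue:
    """Collects misunderstood recordings until someone labels them."""

    def __init__(self):
        self._items = {}

    def report_misunderstanding(self, recording_id, predicted_command):
        """Called when the user presses the feedback button."""
        self._items[recording_id] = FeedbackItem(recording_id, predicted_command)

    def pending(self):
        """Items that still need a human label."""
        return [i for i in self._items.values() if i.corrected_command is None]

    def label(self, recording_id, corrected_command):
        """Called by the data team after listening to the recording."""
        self._items[recording_id].corrected_command = corrected_command

    def training_examples(self):
        """Labeled (recording, command) pairs ready for retraining."""
        return [
            (i.recording_id, i.corrected_command)
            for i in self._items.values()
            if i.corrected_command is not None
        ]


queue = FeedbackQueue()
queue.report_misunderstanding("rec-001", "stop")
queue.report_misunderstanding("rec-002", "next")
queue.label("rec-001", "start")
examples = queue.training_examples()
```

Keeping the unlabeled items separate ensures that only data a person has checked flows back into training.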
This small example illustrates the importance of data for AI. While there are often various ways to obtain data, it is very important to get the right data, prepare it for the AI and ensure that we understand and minimize the bias in the data. Great care should be taken in preparing the data, and it often takes a large effort to get it right. Be aware of the ethical impacts if the data is biased. Once the AI is in operation, it is important to define how it can improve and learn over time. All of this requires a strong purpose for the AI. Otherwise the effort might outweigh the benefits that can be achieved and the project will fail.
How to prepare the data
Data preparation is essential for training and for real-time AI decision making, and it should be your most important task in any AI project. This task is specific to the use case and the company creating the use case. It should be done by you and your team and does not come with the AI platforms that are available today. This will differentiate your AI from others, independent of which AI platform you use. Today, there is a misconception that the neural network is the key to your AI success; we believe that the quality of the training data is the key. You can assume that the neural network is just a black box and that it will improve over time. The key to your AI success is the quality and the quantity of your data. This is your biggest responsibility and this is your biggest opportunity. It pays off to spend a lot of time and effort preparing your data. Good and unbiased data will lead to good and unbiased AI. It is worth repeating this for yourself and for your team. It is all about the data, data, data…
We recommend the following 7-step approach to answer the most important questions when preparing your data:
1. Articulate the problem
- What problem are you trying to solve?
- Why can’t it be solved with traditional means?
- What decisions do you want the AI to make?
- What is the benefit of solving this?
- How much effort can you put into this to achieve a positive ROI?
2. Define the data needed
- What data is needed to take those decisions reliably?
- What other data and factors have influence on this data?
- What could be important correlations in the data and with outside data?
- Where could biases exist or be created in the data?
3. Qualify your data and define the minimal prediction accuracy
- What data do you have available?
- Is historic data available and what is the quality of that data?
- What is the bias in the data that you have available?
- What is the minimal prediction accuracy that makes the decision taken by the AI valuable?
4. Source the missing data
- Can you create the missing data (for example by changing processes or behaviors, or by adding sensors)?
- Can the data be sourced elsewhere?
- Can you purchase missing data?
- Can you get someone else to create the missing data for you?
- What data will you be able to source in the future?
5. Format the data and make it consistent
- How do you get access to the data?
- How do you create a consistent format for the AI to read your data (the input format should be consistent across all data sets)?
- How do you create a data stream from all the data to feed into your AI?
6. Reduce, decompose and clean the data
- Which data or attributes are going to be most critical for the decision?
- What data could be noise or overrule the critical data and should be removed?
- What attributes are definitely not needed to take the decision (to be removed)?
- Which records are missing data or might be incorrect or incomplete?
- How can data be aggregated or additional data be added?
7. Rescale the data
- Is your data in different scales?
- Can different scales affect the decision or quality of your AI?
- How can you rescale or normalize your data (to feed an optimized scale into your AI)?
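Steps 5 to 7 can be illustrated with a small sketch on tabular records. It assumes two hypothetical data sources that name their fields differently. The field names and values are invented, and min-max scaling is just one common way to rescale.

```python
def to_common_format(record, field_map):
    """Step 5: rename source-specific fields to the shared names."""
    return {target: record.get(source) for source, target in field_map.items()}


def drop_incomplete(records):
    """Step 6: remove records that are missing a value."""
    return [r for r in records if all(v is not None for v in r.values())]


def min_max_rescale(records, field):
    """Step 7: rescale one numeric field to the range 0..1."""
    if not records:
        return []
    values = [r[field] for r in records]
    low, span = min(values), max(values) - min(values)
    return [
        {**r, field: 0.0 if span == 0 else (r[field] - low) / span}
        for r in records
    ]


# Two made-up sources describing the same kind of recording.
source_a = [{"len": 2.0, "label": "start"}, {"len": 4.0, "label": "stop"}]
source_b = [
    {"seconds": 3.0, "command": "next"},
    {"seconds": None, "command": "stop"},  # incomplete record
]

common = [
    to_common_format(r, {"len": "duration_s", "label": "label"})
    for r in source_a
] + [
    to_common_format(r, {"seconds": "duration_s", "command": "label"})
    for r in source_b
]
clean = drop_incomplete(common)
scaled = min_max_rescale(clean, "duration_s")
```

Each step is kept as a separate function so that the intermediate results can be inspected, which fits the recommendation to prepare training data step by step.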
As emphasized above, this is your chance to stand out. Your data is what differentiates you from others, so we recommend taking great care in going through this process. While it certainly is a lot of effort, it will define the quality of your AI, your data, your data management and your success.
Data is your biggest responsibility in any AI project, as it defines the decisions that the AI takes. Since data is used to train the AI, it will be the basis for each decision the AI makes. Therefore, getting the data right is the most important responsibility. In a well-run AI project, all stakeholders should be well aware of this and the project lead must make it a priority. While it is not the sole responsibility of a single role like the neural network developer, the data scientist, the data engineer or the project manager, everyone must be made aware of the importance of data for the success and impact of the project. We recommend appointing someone to take the role of data governance, and this role should take a long-term perspective on the project and the data. You should factor in the legal and the ethical aspects of your AI. Take great care to verify, trace and document the origin and the identity of the data used. When the AI makes a decision, it can only be traced back to the data that was used to train it. Any legal, ethical or even optimization effort might need to understand the origin and identity of the data.
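One lightweight way to document origin and identity is to store a fingerprint of each data file together with where it came from. The sketch below uses a SHA-256 hash for the identity. The record fields, such as the consent reference, and the sample values are assumptions chosen for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    sha256: str             # identity of the exact data content
    origin: str             # where the data came from
    collected_on: str       # ISO date of collection
    consent_reference: str  # pointer to the consent or license document


def fingerprint(data: bytes) -> str:
    """Return a content hash that identifies this exact data."""
    return hashlib.sha256(data).hexdigest()


def register(registry, data, origin, collected_on, consent_reference):
    """Store a provenance record keyed by the content hash."""
    record = ProvenanceRecord(
        fingerprint(data), origin, collected_on, consent_reference
    )
    registry[record.sha256] = record
    return record


def trace(registry, data):
    """Look up where a piece of data came from, or None if unknown."""
    return registry.get(fingerprint(data))


registry = {}
audio = b"placeholder bytes standing in for a recording"
record = register(registry, audio, "own recording", "2018-01-15", "form-17")
found = trace(registry, audio)
unknown = trace(registry, b"some other bytes")
```

Because the hash changes whenever the content changes, a record like this also shows whether a training file was modified after it was registered.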
In setting up your organization for the future, you will have to make data a key element of your strategy. Without a clear vision of data and how it can be used, it might be hard for your organization to follow you in implementing your vision of AI. You will also have to invest in data competence and break down the silos of IT and business to achieve a joint vision and effort. Data is where all parts of your organization will have to work together, and a lot of competence will be needed in order to execute.
Big data, analytics and AI
The rise of computing power, storage, connectedness and the deployment of sensors of various types have created an explosion in the availability of data, which often cannot be dealt with using traditional means and methods. So-called big data and big data analytics have started to tackle this challenge of working with a lot of data from various sources, adding outside data sources, sharing data, querying data, visualizing and storing data. A lot of analytics methods are beginning to focus on big data, and there are efforts to improve predictions derived from big data, often using AI. Since data is the fuel for AI, big data is an important development and an opportunity to watch and leverage. However, the tendency to add too much data in AI can cause the quality of the AI decision to suffer. So it is important to use the benefits of big data and analytics to prepare your data for AI and to ensure and measure its quality, but don’t get carried away by adding data or complexity to your AI projects. Most AI projects, which are mainly narrow artificial intelligence projects, do not require big data to provide their value. They just need good data quality and a large quantity of records.
Tools for data preparation
With the availability of data and data projects and the rise of AI, there is a growing number of tools that help you prepare your data for AI. It is best to search the web for current tools, their use cases and best practices for using them. A few links have been provided in the sources. Some of the tools are tightly integrated with AI platforms, so your first choice is to look for the tools provided by the platform you are using for AI. Examples are Google’s Cloud Dataprep and IBM’s Data Refinery. These tools offer management consoles to manage your data. They allow you to add various data sources, calculate and visualize the health of your data and add data to your records, amongst many other features. All tools require some specialized know-how, though a lot of training material is available for self-learning. Also look for a growing number of consultants that can help in preparing your data.
What can go wrong?
Unfortunately, a lot can go wrong with data in your AI project. We see the biggest failure in the lack of data availability or of the ability to source the right data for AI. Companies often fall into the trap of thinking they have all the data, but experience shows that data is often not available, not accessible, not storable, incomplete or biased. It requires a strong vision and leadership support to overcome this deficiency and get the right data for your AI.
Once a project is launched, there is still a lot that can go wrong. A lack of quality or correctness in decisions can mostly be traced back to a lack of effort in selecting and preparing the data and training the AI. Using the wrong sources, not understanding data dependencies, not cleaning the data, not having enough data to train and using biased data are just a few areas with a high impact on AI decision quality.
So please focus on one thing: get the right data right.
This article is written as part of the AI&U™ (Artificial Intelligence & YOU) series by Sharad Gandhi and Christian Ehl. Watch for future articles on how to understand, learn, deploy and leverage AI for you and your organization. Our book AI&U was published in 2017. We also offer customer workshops to help companies jump-start the transformation of their business with AI.
Contact us at www.ai-u.org
Sources
- Preparing Your Dataset for Machine Learning: 8 Basic Techniques That Make Your Data Better, https://www.altexsoft.com/blog/datascience/preparing-your-dataset-for-machine-learning-8-basic-techniques-that-make-your-data-better/
- IBM Watson Services, https://console.bluemix.net/developer/watson/services
- Developing a machine learning strategy, https://www.altexsoft.com/blog/datascience/machine-learning-strategy-7-steps/
- Big Data, Wikipedia, https://en.wikipedia.org/wiki/Big_data
- IBM Data Refinery, https://www.ibm.com/cloud/data-refinery
- Google Cloud Data Prep, https://cloud.google.com/dataprep/
- Top 38 data preparation tools and platforms, https://www.predictiveanalyticstoday.com/data-preparation-tools-and-platforms/
- Infographic, http://download.microsoft.com/download/0/5/A/05AE6B94-E688-403E-90A5-6035DBE9EEC5/machine-learning-basics-infographic-with-algorithm-examples.pdf