Why AI and machine learning demand high-quality information

This article is part of The data dimension: Robotics and automation, a content series developed by The Economist Intelligence Unit (EIU), sponsored by Veritas. In this series, we explore how information and data underpin these technological advances and the management challenges that businesses will face as they adopt these technologies.

Artificial intelligence (AI) has made great leaps in the last ten years, advancing from the university robotics labs and the pages of science fiction to the point where such data-driven technologies now mediate much of our everyday lives — whether controlling our transport systems or determining our Facebook feeds.

The biggest breakthroughs have been in machine learning. Rather than controlling computers by giving them linear instructions, this approach involves supplying them with huge amounts of data and letting them learn to interpret those data on their own, spotting patterns and making connections, much like a young mind encountering the world for the first time. The result is software that can yield great insight from data previously inscrutable to machines, and too massive for humans ever to process themselves. Machine-learning systems learn in the wild and improve continuously. Once business data have been converted into information that machine-learning systems can understand and explore, there’s unlimited scope for what these new minds can learn.

Machine learning has enormous potential in business. Banks, for instance, can use these approaches to find insights in transaction data that help them manage risk and prevent fraud. By observing many millions of transactions, such algorithms build up an understanding of the “baseline” of user behaviour, both for individual customers and across global trends. With an understanding of what “normal” looks like, anomalies are easy for an AI to spot. The large volumes of data at banks’ fingertips, which humans couldn’t possibly interpret themselves, become business insight that protects customers and saves banks money.
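The idea of learning a baseline and then flagging departures from it can be sketched in a few lines of Python. This is a minimal illustration only, not a description of how any bank's fraud system works: the function name, the three-standard-deviation threshold and the transaction amounts are all assumptions made up for the example.

```python
from statistics import mean, stdev


def find_anomalies(history, new_amounts, threshold=3.0):
    """Return the new amounts that sit far outside the historical baseline.

    An amount is flagged when it lies more than `threshold` sample standard
    deviations from the mean of `history` (an assumed, simplistic rule).
    """
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Every past amount was identical, so any different amount is unusual.
        return [a for a in new_amounts if a != baseline]
    return [a for a in new_amounts if abs(a - baseline) / spread > threshold]


# Hypothetical past transaction amounts for one customer.
history = [42.0, 38.5, 45.0, 40.0, 41.5, 39.0, 44.0, 43.0]
# Hypothetical new transactions to screen.
new_amounts = [41.0, 950.0, 37.5]

print(find_anomalies(history, new_amounts))  # prints [950.0]
```

A real system would model many more signals than the amount alone, but the principle is the same: the flag is only as trustworthy as the history used to define “normal”.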

Marketers can use machine-learning algorithms for “if you like that you might like this” tools that make recommendations based on customer purchase history. Online streaming service Netflix uses data about what individual customers watch and when, or at what point they give up on a television series, as useful information on what to show them next — but also what kind of programming it could commission.
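One simple way to build an “if you like that you might like this” tool is to count which items tend to appear in the same customer histories. The sketch below assumes that approach purely for illustration; it is not a description of the method Netflix or any retailer uses, and the titles and histories are invented.

```python
from collections import Counter


def recommend(histories, liked_item, top_n=2):
    """Rank items by how often they share a history with `liked_item`."""
    co_counts = Counter()
    for items in histories:
        if liked_item in items:
            # Count each other item once per history that contains liked_item.
            co_counts.update(i for i in set(items) if i != liked_item)
    return [item for item, _ in co_counts.most_common(top_n)]


# Hypothetical viewing histories, one list per customer.
histories = [
    ["drama_a", "drama_b", "thriller_c"],
    ["drama_a", "drama_b"],
    ["drama_a", "thriller_c", "drama_b"],
    ["comedy_d", "thriller_c"],
    ["drama_a", "comedy_d"],
]

print(recommend(histories, "drama_a"))  # prints ['drama_b', 'thriller_c']
```

Because the ranking is driven entirely by what appears in the histories, anything recorded against the wrong person feeds straight into the recommendations.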

Some of these data have been in businesses’ hands for decades, but companies haven’t been able to use them. Many will now be able to do so. In transport, for instance, the ability to track many millions of journeys through transport systems (whether subway systems, urban buses, road journeys or cycle routes, or better still, a combination of these) allows planners to manage network load, minimise maintenance disruption and build their network according to demand. By processing billions of journey combinations and interpreting patterns, urban transport operators are now in a position to reconfigure their networks far more efficiently.

Because machine learning allows businesses to extract value from a greater proportion of the data they collect, it will also test their ability to process and govern those data so that they can be converted into information that machine-learning systems can use. The quality of output from machine-learning systems reflects the quality of the information that is fed into them.

Take as an example one of the pioneering success stories of machine learning: Google Translate. The algorithms of the translation software work not by teaching computers grammar (which is how AI translation attempts proceeded for decades, without much success) but by processing hundreds of thousands of parallel, human-translated texts and comparing them. There aren’t that many high-quality, large bodies of direct parallel translations, however. In its early days much of Google Translate’s corpus consisted of documents — contracts, treaties — from the United Nations and, as a result, its translations had a distinct feel of legalese about them, more often translating the French “avocat” as lawyer than avocado.
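The effect of corpus composition can be shown with a deliberately simplified sketch. It picks whichever translation appears most often for a word in a set of aligned word pairs. Real translation systems are far more elaborate, and this is not how Google Translate is implemented; the two tiny corpora and their counts are invented for the example.

```python
from collections import Counter


def most_likely_translation(aligned_pairs, source_word):
    """Return the most frequent target word paired with `source_word`."""
    counts = Counter(
        target for source, target in aligned_pairs if source == source_word
    )
    if not counts:
        return None  # The word never appears in this corpus.
    return counts.most_common(1)[0][0]


# Hypothetical corpus dominated by legal documents.
legal_corpus = [("avocat", "lawyer")] * 9 + [("avocat", "avocado")]
# Hypothetical corpus dominated by recipes.
recipe_corpus = [("avocat", "avocado")] * 8 + [("avocat", "lawyer")] * 2

print(most_likely_translation(legal_corpus, "avocat"))   # prints lawyer
print(most_likely_translation(recipe_corpus, "avocat"))  # prints avocado
```

The function is identical in both calls; only the training material changes, and the answer changes with it.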

The lesson is that with machine learning the structure, provenance and quality of your data matter more than ever. Poor data will only produce nonsense — or worse, cause harm. Algorithms that analyse online behaviour or purchase history can easily be polluted by times when your children borrow your Netflix account, or you do your Christmas shopping for friends and family with different tastes on Amazon, rendering these recommendations useless. Recently, Microsoft’s attempt to engage with millennials with an artificially intelligent Twitter “chatbot” ended in a PR disaster after users barraged it with offensive remarks and the bot, in turn, became a racist, sexist monster.

It is estimated that about half of large enterprises are currently experimenting with this sort of data-driven AI. Interest in machine learning is surging not just because the capability has improved, but also because those advances coincide with a more data-savvy world, where businesses of all sorts are generating and capturing more data. But effective use of them relies on solid data management.

Take healthcare. IBM’s Watson is being employed to sort through and analyse massive amounts of medical data, to find new correlations and make predictions. This could make Watson the world’s best diagnostician. But to outperform human clinicians in diagnosis, its machine-learning algorithms need to be able to take into account heterogeneous sets of structured data (perhaps from wearable devices), unstructured data (doctors’ notes) and image data (scans). So far, machine-learning systems work best with a well-defined set of data and a single task, such as analysing radiological images, but they are poor at generalising to new contexts (in the way humans do easily). This is not just a modelling challenge — it is also about making data clean, simple and universal.

To use machine learning effectively, organisations must ensure that they have the information needed to identify relevant and reliable patterns. Many already have the raw data, but without the ability to govern the quality, structure and timeliness of these data their machine-learning systems will end up drawing the wrong conclusions. As machine learning grows in both sophistication and ubiquity, putting “garbage in” and getting “garbage out” will become increasingly dangerous.
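Governing quality, structure and timeliness can start with very plain checks applied before data reach a learning system. The sketch below is one possible shape for such a check, under stated assumptions: the field names, the 30-day freshness limit and the sample records are hypothetical and would differ in any real organisation.

```python
from datetime import date, timedelta

# Assumed schema for the example: fields every record must carry.
REQUIRED_FIELDS = ("customer_id", "amount", "recorded_on")


def quality_issues(record, today, max_age_days=30):
    """List the problems found in one record; an empty list means it passed."""
    issues = []
    # Structure: every required field must be present and not None.
    for field in REQUIRED_FIELDS:
        if record.get(field) is None:
            issues.append(f"missing {field}")
    # Timeliness: records older than the limit are treated as stale.
    recorded_on = record.get("recorded_on")
    if recorded_on is not None and today - recorded_on > timedelta(days=max_age_days):
        issues.append("stale record")
    # Quality: a negative amount is assumed to be invalid in this example.
    amount = record.get("amount")
    if amount is not None and amount < 0:
        issues.append("negative amount")
    return issues


today = date(2024, 1, 31)
good = {"customer_id": "c-1", "amount": 25.0, "recorded_on": date(2024, 1, 20)}
bad = {"customer_id": None, "amount": 10.0, "recorded_on": date(2023, 6, 1)}

print(quality_issues(good, today))  # prints []
print(quality_issues(bad, today))   # prints ['missing customer_id', 'stale record']
```

Records that fail such checks can be held back or repaired before they influence what a model learns.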