Machine learning demystified: the importance of data
Machine learning (ML) may sound daunting to anyone unfamiliar with it, and for some it conjures outlandish ideas about machines poised to enslave mankind. Fortunately, that isn’t what ML is: it is essentially a major advance in Information Technology (IT). For an organisation to benefit from ML, it first has to understand both the benefits and the limitations ML offers.
While the principles of ML are fairly simple and intuitive to grasp, applying them requires specific statistical and IT skills that few people currently possess. To understand the idea, think of a common and rather mundane language translation service such as Google Translate. This is the example that helped me realise the transformative potential of ML.
Put simply, language translation software was long built by programming dictionaries, grammatical rules and their numerous exceptions. This approach involves considerable effort.
From ‘rule-based’ to ‘data-driven’ processes
The new methodology stemmed from a simpler idea: don’t try to define rules and lexical tables from scratch; let the software discover them. How?
In three steps:
Millions of pages that have already been translated from one language to another are collected from international organisations. These include documentation available online from, for example, the UN or European institutions.
When a user submits text for translation, the software slices it into basic elements and then searches the collection for similar elements in the same language.
The most likely translation is then extracted from the bilingual corpus and suggested to the user.
Relevant statistical patterns found in the data therefore replace translation rules. Instead of having to be painstakingly programmed, they are simply “learned” by the software. This approach is highly cost-efficient, and the quality of the translation is often on par with that of the traditional approach.
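The three steps above can be sketched in a few lines of Python. This is only a toy illustration, not how Google Translate or any real service is implemented: the handful of English to French phrase pairs, and the names ALIGNED_PAIRS, build_phrase_table and translate, are all invented for the example, and the pairs are assumed to be already aligned phrase by phrase.

```python
from collections import Counter, defaultdict

# Step 1 (assumption): a tiny, hand-made stand-in for a bilingual corpus.
# A real corpus would hold millions of pairs mined from translated documents.
ALIGNED_PAIRS = [
    ("good morning", "bonjour"),
    ("good morning", "bonjour"),
    ("good morning", "bon matin"),
    ("thank you", "merci"),
    ("thank you", "merci"),
    ("the committee", "le comité"),
    ("the committee", "la commission"),
    ("the committee", "le comité"),
]


def build_phrase_table(pairs):
    """Count how often each source phrase was translated each way."""
    table = defaultdict(Counter)
    for source, target in pairs:
        table[source][target] += 1
    return table


def translate(text, table, max_len=3):
    """Slice the text into known phrases and pick the most frequent translation."""
    words = text.lower().split()
    output = []
    i = 0
    while i < len(words):
        # Step 2: look for the longest phrase starting at position i
        # that also appears in the corpus.
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in table:
                # Step 3: suggest the translation seen most often.
                best, _count = table[phrase].most_common(1)[0]
                output.append(best)
                i += n
                break
        else:
            # Nothing in the corpus matches: pass the word through unchanged.
            output.append(words[i])
            i += 1
    return " ".join(output)


phrase_table = build_phrase_table(ALIGNED_PAIRS)
print(translate("Good morning the committee", phrase_table))
```

Note that no dictionary entry or grammar rule was written by hand: changing the behaviour only requires adding more translated pairs. With the toy data above, “the committee” becomes “le comité” simply because that rendering appears twice and “la commission” only once.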
In areas less complex than translating human languages, the productivity gains are compounded by substantial quality improvements. Anyone who has worked on software knows how hard it is to anticipate all the problems that can arise once it has entered production.
The software’s functional rules are based on assumptions drawn from a limited number of observations. Reality often proves to be far more complex than expected, meaning the automation ends up suboptimal or the software requires expensive corrections.
Machine learning, on the other hand, learns from all the available data, regardless of the volume. The risk of a pattern or a use case being left out of the picture is therefore limited.
Posted on 7wData.be.