Datasets, the gold of Machine Learning

SAP Conversational AI · Chatbots Developers
Mar 2, 2016 · 5 min read

In this article I’m going to explore Machine Learning, but if you’re completely new to this, I recommend you take a look at @ageitgey’s “Machine Learning is fun!”.

In the connected era, the data we can collect from users is internet gold.
Companies and advertisers exchange pieces of our lives in the form of cookies, preferences, browsing habits and logs.

After the emergence of buzzwords like Big Data, Data Mining and Data Analytics over the last few years, we now have Machine Learning and Deep Learning.

Here are two quick definitions, so we’re sure we’re talking about the same things:

Big Data has been defined by McKinsey Global Institute as

Data sets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.

Machine Learning, as defined by Tom M. Mitchell is

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Google Trends for Big Data and Machine Learning

The relationship between Big Data and Machine Learning is simple: if you have data (be it a robot’s sensor outputs, users’ queries, photos, sounds…), you can learn to recognize patterns in it and thus predict or classify new examples of those patterns.

So, in the previous definition of Machine Learning, the experience E is in fact data.

Two types of data

Unlabeled and Labeled datasets

There are two types of data: unlabeled and labeled.

Unlabeled data (the kind we usually talk about with Big Data) is generally processed through Unsupervised Machine Learning (UML) to cluster it, or to find patterns that help us take further advantage of this kind of information. In other words, UML uses unlabeled data to tell us where to look, and what we can get out of it.
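To make this concrete, here is a minimal sketch of unsupervised learning: clustering unlabeled points with k-means. The library choice (scikit-learn), the toy data and the number of clusters are my own illustrative assumptions, not something prescribed by this article.

```python
# A minimal unsupervised learning sketch: cluster unlabeled 2-D points
# with k-means. The data and the number of clusters are illustrative only.
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: nobody told us what these points "mean".
points = np.array([
    [1.0, 1.1], [0.9, 1.3], [1.2, 0.8],   # one natural group
    [8.0, 8.2], [7.8, 8.5], [8.3, 7.9],   # another natural group
])

# Ask the algorithm to discover 2 clusters on its own.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.labels_)           # e.g. [0 0 0 1 1 1] -- groups found, not given
print(kmeans.cluster_centers_)  # where the algorithm thinks each group sits
```

The point is that the groups come out of the data itself: we never told the algorithm which point belongs where.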

On the contrary, labeled data is… well, labeled, meaning that we already know what we want from it and how to use it. In this case, Supervised Machine Learning (SML) will help us classify an unknown request we receive.
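As a counterpart to the sketch above, here is a minimal supervised classification sketch: a tiny labeled dataset of user requests trains a model that can then classify an unknown request. The intents, sentences and the scikit-learn pipeline are illustrative assumptions, not an actual Recast.AI implementation.

```python
# A minimal supervised learning sketch: each training sentence comes with a
# label (its intent), and the model learns to classify new, unknown requests.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Labeled data: we already know what we want from each example.
sentences = [
    "what's the weather like tomorrow",
    "will it rain in Paris this weekend",
    "book me a table for two tonight",
    "reserve a restaurant for Friday evening",
]
labels = ["weather", "weather", "booking", "booking"]

# Bag-of-words features + a naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(sentences, labels)

# Classify an unknown request.
print(model.predict(["is it going to be sunny on Sunday"]))  # -> ['weather']
```

Here the labels do the heavy lifting: a handful of tagged examples is enough to get a working (if naive) classifier.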

Apart from the type of dataset they require, Unsupervised and Supervised Machine Learning differ in the amount of information they need to perform well.
In fact, since UML does not have access to tagged data, it requires huge amounts of information to spot and analyze patterns.
SML, on the other hand, starts from data that is already labeled, which makes each example a trusted source of information, so it needs a lot less of it.

Let me pause here to properly illustrate my point. Imagine two people learning a new skill, like glass-blowing:
On one hand, the first one decides to learn on his own, with his own tools. He will spend a lot of time practicing, failing and retrying before he can craft a well-made glass.
On the other hand, the second one is helped by a master glass-blower, who teaches him all the proper ways to craft a glass.

It is quite obvious that learning from the master himself is faster than learning alone, because you need a lot more time to learn from your own mistakes.

An SML algorithm needs a trustworthy source of knowledge to build its model: the gold dataset. It is called a “gold dataset” because every entry in it should be valid, since these entries shape the representation of the data that the algorithm builds.

From unlabeled to labeled

You’ve understood it by now: the better the gold dataset, the better the result.

The main problem is that the process of transforming raw data into labeled datasets is really time- and resource-consuming, because you need to go through every line of your data and label it manually.
Thus, everyone keeps their own dataset secret, in order to keep an advantage over their competitors.

And it is a huge problem for developers who want to try and play with Machine Learning, because most of the datasets you will need are either expensive or simply inaccessible.

However, it is still possible to find special-purpose datasets coming from conferences and shared tasks such as CoNLL or MUC. There are also community efforts to create structured databases, such as DBpedia or Freebase.
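As a rough illustration of how accessible these community resources are, here is a small sketch that pulls a few records from DBpedia’s public SPARQL endpoint. The endpoint URL, the query and the result handling are assumptions about a third-party service, not part of this article.

```python
# A minimal sketch of fetching structured facts from a community-built
# database (DBpedia). Assumes the public endpoint https://dbpedia.org/sparql
# is reachable and that the dbo: / rdfs: prefixes are predefined there.
import requests

query = """
SELECT ?lang ?name WHERE {
  ?lang a dbo:ProgrammingLanguage ;
        rdfs:label ?name .
  FILTER (lang(?name) = "en")
}
LIMIT 5
"""

response = requests.get(
    "https://dbpedia.org/sparql",
    params={"query": query, "format": "application/sparql-results+json"},
    timeout=30,
)
response.raise_for_status()

# Print the English labels returned by the endpoint.
for row in response.json()["results"]["bindings"]:
    print(row["name"]["value"])
```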

Community is key

To us, the way to create the perfect gold dataset is to allow everyone to use it and to build it. By bringing together developers who speak different languages and come from different cultures, we can build a community willingly participating in the creation of tomorrow’s AI.

Wikipedia disrupted the sharing of knowledge by allowing people to add, update and consult knowledge for free. 42 (a French school teaching developers) decided to encourage its students to work in groups via a method called “peer-learning”, the complete opposite of the current educational model, where sharing is cheating.

That’s what we offer: let’s all cheat to create the best conversational AI!

Paul RENVOISE – Recast.AI

This post was originally published on our blog.

If you enjoyed this piece, you might also like: From Context to User Understanding

