Data essentials for effective Machine Learning

Mike Reiner
Applied Artificial Intelligence
8 min read · Jan 30, 2019

Amsterdam AI — Panel discussion about Data Access, Data Quality and Data Cleanup

Whichever industry we look at right now, we will almost certainly come across a story about how “data” is changing the face of the world. We have created more data in the past couple of years than in the entire previous history of humankind, and by 2020 every person in the world is predicted to be creating 7 MB of data every second. Researchers say the adoption of big data technologies is unlikely to slow anytime soon: IDC predicts that the big data and business analytics market will grow from $130.1 billion in 2016 to more than $203 billion in 2020. This article summarizes a panel discussion from Amsterdam AI in 2018 and draws on the panelists’ expert opinions to understand the essentials of big data for both corporates and start-ups working in this sphere.

Amsterdam AI Panel

Data is a collection of facts (numbers, words, measurements, observations, etc.) that have been translated into a form computers can process. Big data refers to data sets, structured or unstructured, so vast that traditional data processing software simply can’t manage them. The continuous development of the digital era shows the true importance of big data and how it can be used to tackle business problems you wouldn’t have been able to address before (read Oracle’s definition of Big Data if you want more detail).

At Amsterdam AI in 2018, we had the chance to listen to expert opinions on data access and data quality, with some tips on data clean-up in between.

1. Data Access

Data access can be defined simply as a user’s ability to retrieve data stored in a database or other repository, but in reality, getting the right data with the right access is much more complex. In machine learning, an algorithm lets an AI program, neural network, or other system learn on its own, and these algorithms require vast datasets to produce the desired results: finding trends, patterns, and predictions. Complex analytical tasks can therefore be solved faster with the help of artificial intelligence and machine learning. You might ask yourself, “Why hasn’t big data, AI, ML, etc. appeared sooner?” The answer is that there wasn’t the real-life, real-time data there is today, and the data sets were simply too small. With the advent of mobile and IoT technology, people produce more and more data (through geolocation, social apps, voice commands, etc.). It has become a no-brainer for firms to join the IoT race; however, it’s a marathon, not a sprint, where one technology complements the other.

When building algorithms, Geert Vos, CTO of Media Distillery, advises us to “take data that matters to your business and train on it. There is access to a lot of data, but it needs to be labeled and sorted, which is very time-consuming. So, start with synthetic data to make sure that your pipeline works and the algorithm is trained for when it meets real data.” The better the quality of the synthetic data and the closer it is to real data, the better the algorithm will perform once real data is fed in. Its main purpose is to be flexible and efficient enough to let a data scientist experiment with various classification, regression, and clustering algorithms. For start-ups especially, synthetic data can slow you down, as it is very difficult to create synthetic data that looks like real data.
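As a minimal sketch of that approach (the dataset shape, model, and metric below are illustrative assumptions, not anything the panel prescribed), a training pipeline can first be exercised end to end on generated data:

```python
# Minimal sketch: validate a training pipeline on synthetic data first.
# The feature counts, model, and metric below are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generate a synthetic dataset that roughly mimics the shape of the real data.
X, y = make_classification(n_samples=5_000, n_features=20, n_informative=8,
                           class_sep=0.8, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# The same pipeline should later run unchanged on real, labeled data.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("synthetic-data accuracy:", accuracy_score(y_test, model.predict(X_test)))
```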

When working with large quantities of data, volatility is present, so the data points need inspection and evaluation. Edwin Poot, Founder & Chief of EnergyWorx, states: “This process is necessary to scout mistakes and gaps within your data. EnergyWorx built an automated tool that solves this problem by computing KPIs that evaluate the data accurately.” According to Gartner research from 2018, organizations believe poor data quality is responsible for an average of $15 million per year in losses. “In order to have a machine learning model, firms need qualitative historical data, which they usually don’t have, because they didn’t pay attention to saving, labeling, and classifying their data sets. It is also important to build an infrastructure that lets you figure out the flaws in your data; once you have found them, you can either fix them or go to the source and tell them to fix it themselves,” says Edwin. Joergen Sandig, Co-Founder of Scoutely, says: “Always ask yourself: Who owns the data? Who is responsible for the data?” and adds: “The reason many data models don’t work is simply poor data quality.”
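A hedged sketch of what such automated quality KPIs might look like for interval meter data follows; the column names, expected frequency, and value range are assumptions, not EnergyWorx’s actual tool:

```python
# Minimal sketch of automated data-quality KPIs for interval (meter) data.
# Column names, expected frequency, and plausible value range are assumptions.
import pandas as pd

def quality_kpis(df: pd.DataFrame, freq: str = "15min",
                 valid_range: tuple = (0.0, 1e4)) -> dict:
    """Return simple KPIs that flag gaps, missing values, and outliers."""
    df = df.set_index("timestamp").sort_index()
    expected = pd.date_range(df.index.min(), df.index.max(), freq=freq)

    lo, hi = valid_range
    return {
        "missing_ratio": df["value"].isna().mean(),
        "gap_count": len(expected.difference(df.index)),   # missing timestamps
        "out_of_range": int(((df["value"] < lo) | (df["value"] > hi)).sum()),
        "duplicates": int(df.index.duplicated().sum()),
    }

readings = pd.DataFrame({
    "timestamp": pd.to_datetime(["2018-01-01 00:00", "2018-01-01 00:15",
                                 "2018-01-01 01:00"]),
    "value": [3.2, None, 2.9],
})
print(quality_kpis(readings))
```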

2. Data Quality

Data is biased, and it can get very expensive to label, but there are augmentation tricks to produce larger data sets (i.e. duplicating and altering your current data). To increase your data quality, build algorithms that train on your data set and then widen the training process from there.
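As a hedged illustration of that kind of augmentation on tabular data (the noise scale and number of copies are arbitrary assumptions):

```python
# Minimal sketch: augment a small tabular dataset by duplicating rows
# and adding light Gaussian noise. Noise scale and copy count are assumptions.
import numpy as np

def augment(X: np.ndarray, y: np.ndarray, copies: int = 3,
            noise: float = 0.01, seed: int = 0):
    rng = np.random.default_rng(seed)
    X_aug, y_aug = [X], [y]
    for _ in range(copies):
        jitter = rng.normal(0.0, noise * X.std(axis=0), size=X.shape)
        X_aug.append(X + jitter)   # altered duplicate keeps the same labels
        y_aug.append(y)
    return np.vstack(X_aug), np.concatenate(y_aug)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([0, 1])
X_big, y_big = augment(X, y)
print(X_big.shape, y_big.shape)   # (8, 2) (8,)
```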

Here is a hands-on strategy for improving your data quality:

1. Determine what you want from your data and pay attention to how you evaluate quality

2. Create, maintain and use a data dictionary

3. Take snapshots of your ‘static’ data to show changes or emergent data quality issues (see the sketch after this list)

4. Aim for objectivity

5. Consider removal of computed (or derived) data

6. Be vigilant about missing data

7. Perform regular reviews of your data to uncover anomalies

8. Take advantage of technology
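Referring back to point 3, here is a minimal sketch of how two snapshots of “static” data could be compared to surface silent changes; the file names and the id key column are hypothetical:

```python
# Minimal sketch for point 3: diff two snapshots of "static" reference data
# to surface silent changes. File names and the "id" key column are hypothetical.
import pandas as pd

old = pd.read_csv("snapshot_2018_10.csv").set_index("id")
new = pd.read_csv("snapshot_2018_11.csv").set_index("id")

added = new.index.difference(old.index)
removed = old.index.difference(new.index)

# Rows present in both snapshots but with changed values.
common = old.index.intersection(new.index)
changed = (old.loc[common] != new.loc[common]).any(axis=1)

print(f"added: {len(added)}, removed: {len(removed)}, changed: {int(changed.sum())}")
```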

Below is the data quality sophistication curve, which puts this in a more visual context.

(graph taken from here)

It can be challenging to find the right data for your training algorithm. Once you’ve analyzed your datasets, you need to define, and then refine, the problem within your data set. Digging into the root of your data helps you check its quality and build up from there. Joergen Sandig’s advice here is to “start simple; for example, linear regression or even sorting the data set on one variable can be key, as it educates you about your data. If you start complex, you can get lost and overlook many problems your data has. Every single outcome must be discussed. Before labeling data, you need to establish a common understanding of your data first, especially of how the data is interpreted and where it comes from.”
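As a hedged illustration of “start simple” (the dataset, column names, and choice of a single explanatory variable are assumptions):

```python
# Minimal sketch of "start simple": sort on one variable and fit a
# one-feature linear regression before reaching for anything complex.
# The file and column names are hypothetical.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("consumption.csv")          # hypothetical dataset

# Sorting on a single variable already tells you a lot about the data.
print(df.sort_values("outdoor_temp").head())

# One-variable baseline: consumption explained by outdoor temperature.
model = LinearRegression()
model.fit(df[["outdoor_temp"]], df["consumption"])
print("slope:", model.coef_[0], "intercept:", model.intercept_)
```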

[Example] An interesting case is a client from the U.K. with over £80m worth of assets (water meters under the ground) that it was struggling to locate. By evaluating the data provided by the client, EnergyWorx managed to find the firm’s lost assets simply by studying the water pressure indicators (one variable) in the pipes.

[TIP] Various problems can arise when clients present you with data sets labeled as “good data” after having thrown away the “bad data”. It is crucial to keep every type of data, even data you might consider negative. Often the negatives, or the “bad data” as you might call it, are the missing piece that lets you build an algorithm faster. If the negatives are not provided, you must generate them again, which is very time-consuming and, most importantly, expensive.

[Example] Your AI expertise is often directed first at fixing data, which usually takes up to 95% of your work; only the remaining 5% is spent building the actual AI algorithm. Joergen Sandig calls this Data Hell, whereas Edwin Poot calls it Data Exploring.

So should corporates or start-ups acquire more data? Is more…better? What should you pay attention to?

Geert Vos’ response: “If you know your type of data, then more is of course better. High-quality data in larger sets will give you better results. Many founders, for example, don’t really know what kind of data they need, which leads to Data Lakes and can create chaotic scenarios if not handled properly”.

Joergen Sandig says: “A Data Lake is, in my opinion, a lack of business intelligence. The mentality of collecting everything and figuring out later what to do with the data can be dangerous for your business. You should first define what you need, and then let the Data Lake fill according to that need. The right data is MINIMAL: only 1%-5% of all the data you have will eventually be useful to you”.

Edwin Poot takes a different view: “You need a Data Lake purely for data collection purposes, so having a large set of data is not necessarily bad, but it needs to be labeled and classified. This can eventually save you a lot of time in the future. A search engine for your data set can and should be built, which will help you export exactly the data that you need”.
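A minimal sketch of what such a search engine over a labeled data lake could start as; the dataset names, paths, and tags are hypothetical:

```python
# Minimal sketch: a tiny metadata catalog so labeled datasets in a data lake
# can be found by tag. Dataset names, paths, and tags are hypothetical.
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    name: str
    path: str
    tags: set = field(default_factory=set)

catalog = [
    DatasetRecord("smart_meter_2018", "s3://lake/raw/meters/2018/", {"energy", "15min", "labeled"}),
    DatasetRecord("weather_nl", "s3://lake/raw/weather/", {"weather", "hourly"}),
]

def search(catalog, required_tags):
    """Return datasets whose tags contain every required tag."""
    return [d for d in catalog if required_tags <= d.tags]

for hit in search(catalog, {"energy", "labeled"}):
    print(hit.name, hit.path)
```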

When it comes to buying external data, you should be careful, because a model is driven by specific data. Use as much metadata as possible (the set of data that describes and gives information about other data), but first make sure to discover exactly which type of data you need and validate how important that information is to you before buying external data.

Edwin gives us another tip regarding open data: “Open data, metadata, etc. are important, especially when information about the data is available. If you know how to correlate this data and build context, you can easily create value. This can be ideal if you’re trying to find business contracts that require bidding processes”.

When you have imperfectly labeled data, try to train the experts who handle your data; these specialists need continuous coaching on how to make your data more reliable. Create a model on top of your model to detect errors, then iterate on this to improve your imperfectly labeled data as much as possible. “The best tool for handling imperfectly labeled data is to create the tool yourself, as it will allow you to understand your datasets much more efficiently,” says Joergen.
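One hedged reading of “a model on top of your model” is to flag samples whose out-of-fold predictions disagree with their given labels; the model choice and the simulated label noise below are assumptions:

```python
# Minimal sketch: flag likely label errors by comparing each sample's given
# label with its out-of-fold prediction. Model choice is an assumption.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)
y_noisy = y.copy()
y_noisy[:20] = 1 - y_noisy[:20]              # simulate 20 mislabeled samples

# Out-of-fold predictions: each sample is predicted by a model that never saw it.
y_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y_noisy, cv=5)

suspects = np.where(y_pred != y_noisy)[0]
print(f"{len(suspects)} samples flagged for manual label review")
```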

All in all, it is highly important to focus on understanding the type of data you need and then work your way up, step by step, to make sure that your big data sets have a purpose when they are used in the future.

Words of wisdom from our panelists are the perfect conclusion for understanding the essentials of big data. Always keep these in mind:

  1. Data Hygiene
  2. Don’t start a project if you don’t know what you want to achieve
  3. Always experiment with real data

Thanks for reading. Below you can find info about Amsterdam AI’s panelists. If you would like to watch the whole 1-hour video, click here.

About the panel discussion and its members:

Panelist — Geert Vos, CTO of Media Distillery

Panelist — Edwin Poot, Founder & Chief of EnergyWorx

Panelist — Joergen Sandig, Co-Founder of Scoutely / Member of ScaleUp Nation

Host — Mike Reiner, Venture Partner at OpenOcean VC Fund / Co-Founder of City AI

Thanks to Maxim Matias, who helped draft this post for DataSeries.

About the author: Mike Reiner is a General Partner at Acrobator. Previously: VC @ OpenOcean, Co-founder of City AI, World Summit AI, Startup Wise Guys, CCC, Startup AddVenture.