The minimum viable data set

How to overcome the big data burden of machine learning

Tobias Bohnhoff
shipzero
8 min read · Mar 15, 2019


In the context of AI, it is often said that enormous amounts of data are required before you can work with it at all, that very complex models have to be programmed, and that the success of a project comes with many unpredictable factors and risks. As a general rule, however, this is wrong. This article is about giving you a perspective on how to handle situations of data scarcity and which options to consider in this context.

Of course, there are complex projects that place extreme demands on the amount of data needed to achieve useful results, but usually this has to do with poor planning or a deliberately high willingness to experiment. There are other ways. The goal in AI development and technology deployment, for companies of all sizes, should be to keep the necessary data as small as possible and to build the algorithms as efficiently as possible.

AI is currently still easier to access for large companies than for medium-sized companies. This will change very quickly.

There are several reasons for this statement:

1) Price liberalization of computational power
2) Modular deployment of AI micro-services
3) Using transfer learning via cloud-based AI models as MLaaS (Machine Learning as-a-Service)

Let’s start with a relatively obvious trend in the price efficiency of computing power. No one today needs to run or purchase high-performance server farms unless there are very specific reasons for doing so. Cloud vendors offer the latest processor capacity in flexible, usage-based pricing models to run any application on them. Entry-level prices for training neural networks start at a couple of cents per hour, and even high-performance chips such as Google’s TPUs or other ASIC components, which usually cost between 5 and 10 dollars per instance and hour, are falling in price just like every previous processor generation. Access is still hampered by a small financial barrier, but in the long run this will not exclude medium-sized companies from using the technology.

Google’s Compute Engine Pricing (as of 10/2018)

The second challenge is the code: the algorithmic models that make up the AI. Developing neural networks from scratch requires a high level of expertise from developers. Machine learning engineers and architects in particular are highly sought after by large companies and are therefore very difficult for medium-sized companies to recruit, and the acute shortage drives salary expectations up. However, modularized microservices, i.e. pre-built code modules that have already been developed for standard applications such as text recognition, image recognition and pattern recognition in different contexts, create a lot of potential for developers. The libraries around TensorFlow and other development frameworks are constantly growing, and additional service providers and open-source platforms will take part in liberalizing AI code. This mirrors the development of the Internet.

Today, individuals can create websites, blogs, and shops without writing a single line of code. Similar structures will also be established for AI.

At first, this will be pushed into the market by the big cloud providers; later, more specialized offerings from smaller service providers will emerge in the ecosystem.

The third challenge, in addition to processing capacity and skilled personnel, is data. Providing sufficient data is a major challenge. For an e-commerce retailer, for example, this means having a certain number of orders or actions on their website. Why are huge amounts of data needed? Obviously to be able to make valid statements from the analysis, but also because the models start from scratch. Similar to the human learning process, it takes a child years of data consumption and trial and error to cope confidently with a wide variety of situations later on. Artificial intelligence follows a similar path. As soon as robust algorithms are available that can, for example, recognize objects accurately and distinguish them from each other, there is no need to develop them anew for every task.

Transfer learning is the magic word here. Today, however, off-the-shelf transfer learning is more or less good enough to distinguish dogs from cats. Telling a faultless brake disc from a broken one in optical inspection, on the other hand, requires models specially trained for that application.
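To make the idea concrete, here is a minimal transfer-learning sketch in Python with Keras, assuming a hypothetical folder of labeled brake-disc images; the generic ImageNet features are reused unchanged and only a small new classification head is trained:

```python
import tensorflow as tf

# Reuse generic visual features learned on ImageNet; train only a new head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet"
)
base.trainable = False  # keep the pre-trained features frozen

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1, input_shape=(224, 224, 3)),
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # faultless vs. broken
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# "brake_discs/" with one subfolder per class is a placeholder for your own images.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "brake_discs/", label_mode="binary", image_size=(224, 224), batch_size=32
)
model.fit(train_ds, epochs=5)
```

Because only the small head is learned from your own data, far fewer images are typically needed than when training a comparable model from scratch.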

What kind of data do I need for using machine learning?

So how can I benefit from AI with a minimum of proprietary data? To answer this question, one should know the concept of different data sets for machine learning. There are three types of data sets for the training of algorithms:

Training data is used to train an algorithm. It is an initial set of data used to help a program learn and produce sophisticated results.

Validation data is a portion of the data used to assess how well the models fit, to adjust some models, and to select the best one.

Testing data is a portion of the data used to assess how well the final model might perform on additional data.

Every machine learning algorithm will only be as good as the underlying data that we are feeding it. Therefore, the importance of that data cannot be overestimated. Generally speaking, you’d have one data set and split it up into those three subsets:

Splitting your Data into Training, Validation and Testing Data

This 60/20/20 split is of course only a rule of thumb. It always varies depending on your use case, the number of variables and the size of your sample.
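If you work in Python, such a split is only a few lines with scikit-learn; the file name and the target column in this sketch are placeholders for your own data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("orders.csv")            # hypothetical data set
X = df.drop(columns=["target"])
y = df["target"]

# First split off 20% as the final testing data ...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
# ... then split the remaining 80% into 60% training and 20% validation data.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42  # 0.25 x 80% = 20%
)
print(len(X_train), len(X_val), len(X_test))            # roughly 60/20/20
```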

Training data basically consists of pairs of inputs and outputs. With that in mind, different types of algorithms need the data to be structured in different ways: in computer vision, the training set consists of a large number of images, while for sequential decision trees it would be alphanumerical data.

(Cross-) Validation data is used to ensure better accuracy and efficiency of the algorithm. It is well suited for tuning parameters and avoids ‘overfitting’ — which means training the algorithm too specifically on the training data.
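A sketch of what this looks like in scikit-learn, reusing the training data from the split above; the model and parameter grid are arbitrary examples for illustration:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Tune hyperparameters with 5-fold cross-validation on the training data only.
param_grid = {"n_estimators": [100, 300], "max_depth": [5, 10, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```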

Testing data evaluates the final model on how well it performs when confronted with previously unknown data input. It also allows you to compare different final models and decide which one to go with. Validation data cannot be used for this step, as it was part of the training process itself. As you can see, testing is crucial to obtain reliable results and to avoid misinterpretation caused by incomplete or biased training data sets.
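Continuing the sketch from above: the tuned model is compared against a second, hypothetical candidate on the validation data, and the selected model is then confronted exactly once with the previously unknown testing data:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

candidates = {
    "tuned random forest": search.best_estimator_,
    "logistic regression": LogisticRegression(max_iter=1000).fit(X_train, y_train),
}

# Compare the candidates on data they were not trained on ...
scores = {name: accuracy_score(y_val, m.predict(X_val)) for name, m in candidates.items()}
best_name = max(scores, key=scores.get)

# ... and touch the testing data only once, for the final performance estimate.
final_model = candidates[best_name]
print(best_name, accuracy_score(y_test, final_model.predict(X_test)))
```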

The importance of testing data

The concept of a minimum viable data set

Now the key question arises: how can I minimize my data input if I need to split all of my data into different sets to train my model, and if it is so important to have a valid, non-biased sample? There are different approaches for reducing the required amount of data from millions of data points to significantly less:

1) Data pooling
Join forces with other data vendors, e.g. business partners, suppliers or non-competing market participants. This requires a well-organized standardization process, ideally managed by an independent third party.

2) Data enrichment
Enrich your existing data set by using public data sets or by buying from dedicated data vendors. The prerequisite is that this makes your data set more meaningful without losing sight of the original question you want to answer.

3) Knowledge transfer
Use pre-trained models, or train your model on suitable but more generic sample data and refine it with smaller samples of your proprietary data. Cloud vendors and specialized service providers will grow this segment very quickly.

4) Iterative data generation
If you only have rather small data sets, this does not exclude you from starting to build machine learning models. Take the data you have, or can easily extract from your systems, to get a very rough first indication of whether your optimization idea works, and accept a high degree of uncertainty at first. Then build up your data resources over time by kicking off the corresponding business intelligence processes and adjust your (initially very simple) model iteratively. This way you can justify every financial decision regarding AI and BI projects and work in a focused way towards one goal, instead of making huge efforts in either of these fields without knowing exactly what to do with the end result. The sketch below illustrates this approach.
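As a rough illustration of the iterative approach, the following sketch trains a deliberately simple baseline on the small training set from the earlier split and uses a learning curve to judge whether collecting more data is likely to pay off; the model choice is an assumption for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Start with a deliberately simple baseline model on the data you already have.
baseline = LogisticRegression(max_iter=1000)
sizes, train_scores, val_scores = learning_curve(
    baseline, X_train, y_train, cv=5, train_sizes=np.linspace(0.2, 1.0, 5)
)

# If the cross-validated score is still climbing at the largest training size,
# collecting more data is probably worth the effort; if it has flattened,
# invest in better features or a different model instead.
print(sizes, val_scores.mean(axis=1))
```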

Conclusion

So, what does all of that mean in practice? Whether or not your company already works with AI, there will be situations where data becomes a scarce resource. Here are some aspects to consider in such a scenario:

1) AI strongly depends on high-quality data, which is even more important than large volumes

2) A lack of data does not necessarily kill ideas or projects; there are several ways to deal with data shortage

3) Transfer learning will be a massive adoption driver for AI in the coming years — also for medium-sized companies

4) Cloud-based Machine-Learning-as-a-Service offerings will provide a suitable infrastructure for rather generic enterprise functions

5) Specialized service providers will fill the gap for niche but high-value use cases such as optical quality control or specific supply chain optimization problems

6) Iterative development on both the data side and the model side creates higher uncertainty, but it is far better than not starting to develop your data infrastructure at all

The first step is defining which business problem you want to solve, or evaluating what strategic potential can be tapped with systems based on machine learning. And since the quality of your data is critical, this should be the starting point of the efforts on an operational level: getting your data as clean as possible, for example in the form of integrated and (near) real-time data warehouse systems. If you don’t have that yet, there are some free example data sets that give you an idea of what a nice and clean data basis can look like:

AWS Public Data sets
You’ll need an AWS (Amazon Web Services) account, but Amazon offers a free tier for new accounts that lets you explore the data without being charged.

Google Public Data sets
You’ll need a GCP (Google Cloud Platform) account here as well. The first TB of queries you make is free.

Kaggle
This amazing data science community regularly hosts machine learning competitions. You can get free data sets by entering competitions or from contributions by the community.

Data.gov
You can browse data sets by various US government agencies directly on the website, without signing up.

Data.world
You can find numerous data sets tagged with the relevant keywords while navigating a GitHub-like environment. With a free account, you can easily get started and use up to 100MB per project.

Quandl
Alongside traditional financial data, you can find a vast amount of ‘alternative’ data on the platform, meaning data tapped from different sources such as satellites or IoT devices. Pricing depends on the data set and your intended use.

After that — and most likely after choosing the tools to use out of the vast landscape of technology vendors — the process of data preparation can begin.
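A minimal sketch of what those first preparation steps could look like with pandas; the file and column names are placeholders, and the actual cleaning rules depend entirely on your own data:

```python
import pandas as pd

# Hypothetical raw export from an order system.
df = pd.read_csv("raw_orders.csv", parse_dates=["order_date"])

df = df.drop_duplicates()
df["order_value"] = pd.to_numeric(df["order_value"], errors="coerce")  # force numeric
df = df.dropna(subset=["customer_id", "order_value"])                  # drop unusable rows
df = df[df["order_value"] > 0]                                         # remove obvious errors

df.to_csv("orders_clean.csv", index=False)  # a clean basis for the modeling steps above
```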

For more insights on how to kick-off your AI projects check our other blogposts or our website: appanion.com


Tobias Bohnhoff
shipzero

Founder at appanion.com. Technology enthusiast and passionate about trends and innovation in artificial intelligence.