Insights from our Small Data roundtable v.2

Mike Reiner
Oct 22, 2020 · 7 min read

This week, DataSeries, an OpenOcean-led initiative, hosted another virtual roundtable about “Small Data”.

INSIGHTS GATHERED:

CHALLENGES:

Small data is also about what you don’t know and what’s in people’s heads
Typically, when people think about small data, they think about a lack of quality data and small data sets. A much more fundamental problem is discovering the right data to solve your particular problem. Companies often try to analyse big data sets to find answers when the really valuable data that needs to be uncovered is in our heads. Tacit knowledge is something humans cannot easily articulate, so the real question is how to capture this cognitive data. There is a clear push to design systems that capture it more effectively, and Gary Klein argues that collecting in-depth expert knowledge is crucial to understanding decision-making. To do this well, interviewers should be well trained and follow a psychological framework.

Qualitative small data uncovered via these interviews can significantly change the scope and, consequently, alter your machine learning approach.

A human feedback loop is crucial and depends heavily on the input/interface
When building AI systems, most resources go into the backend that processes all the data; surprisingly little effort is spent on the front end to gather the right human feedback to train the models, argues Wilson Chan. In many cases the human feedback loop is absolutely critical, especially in areas such as the medical sector, says Dirk Hofmann. This is not only about the data-input fields and their design, but also about the right psychological triggers to maximise conversion of the data input. In this context, it is also important that responses ideally arrive in real time.

ML techniques typically still don’t function well horizontally
Current ML techniques are a bit hit-or-miss: they can work well within a certain category whilst performing poorly horizontally, especially in unsupervised learning, argues Bradley Arsenault. It is also hard to understand why certain techniques work in some use cases and not in others. Clients who have massive amounts of raw data (preferably labeled) can harvest the benefits of unsupervised learning models, which do work well in that setting.
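As a concrete illustration of an unsupervised technique working within one well-defined category, here is a minimal k-means clustering sketch on synthetic, unlabelled data (an illustrative example, not code from the roundtable):

```python
import numpy as np

# A minimal unsupervised example: k-means discovers categories in raw,
# unlabelled data. The two synthetic "blobs" stand in for a domain where
# the technique works well within one category.

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),   # blob around (0, 0)
               rng.normal(5.0, 0.5, size=(100, 2))])  # blob around (5, 5)

centers = X[[0, 100]]                  # initialise from two data points
for _ in range(20):
    # Assign each point to its nearest centre, then move centres to the mean.
    labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
    centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])

print(np.sort(centers[:, 0]).round(1))  # one centre near 0, one near 5
```

No labels were used anywhere; the grouping emerges purely from the data's own structure, which is exactly what makes the approach attractive when massive raw data is available.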

It is critical to rebalance our machine learning approaches to be less power-hungry
The amount of power consumed for machine learning tasks is staggering. Until just a few years ago, we lacked computers powerful enough to run many algorithms. Repurposing GPUs provided the horsepower needed, but the exponential power requirements of machine learning are unsustainable. GPT-3 is a good example: an interesting approach, great for potential demos, but in its current form and at scale it has short-term potential only, partly due to its massive resource requirements. Advances will likely be needed in both algorithms and architectures to counter this trend. Furthermore, this approach does not scale toward sophisticated ‘intelligence’ either.

Big data is not only unsustainable, but also further amplifies the power of large corporations
Marc Schoenauer is not at all excited about getting 100x more data. It is not sustainable from a resource perspective, and it primarily benefits large corporations. Only very few companies can afford these massive datasets, and only a handful of models will be able to harvest them; a sort of data-formality will be born.

Deep learning is a black box, and it’s hard to make its decision-making process transparent
With a neural network of five layers, we are talking about millions of parameters; with logistic regression, perhaps hundreds. This makes it significantly harder to add transparency to neural nets and make them ‘explainable’.
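To make the parameter gap concrete, here is a back-of-the-envelope count; the layer widths are illustrative assumptions, not figures from the roundtable:

```python
# Rough parameter counts: a 5-layer fully connected network vs. logistic
# regression on the same 1,000-dimensional input.

def dense_params(layer_sizes):
    """Weights + biases for consecutive fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical 5-layer net: 1000 inputs, hidden widths 1024/512/256/128, 10 outputs
nn_params = dense_params([1000, 1024, 512, 256, 128, 10])

# Logistic regression on the same input: one weight per feature plus a bias
logreg_params = 1000 + 1

print(nn_params)      # ~1.7 million parameters
print(logreg_params)  # ~a thousand
```

Even this modest network carries three orders of magnitude more parameters than the regression, which is why attributing its decisions to individual inputs is so much harder.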

OPPORTUNITIES:

Explainable AI is easier to achieve with a combination of various techniques and will be increasingly a requirement for industries
Combining different machine learning techniques, essentially building a layer of simpler models on top of deep learning approaches, is one way to make AI more explainable at scale.
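A minimal sketch of this “simple model on top” idea: fit an interpretable surrogate to a black box’s predictions. The black box here is a stand-in nonlinear function rather than a real deep network, and the data is synthetic:

```python
import numpy as np

# Surrogate-model explainability: an ordinary least-squares model is fitted
# to the *predictions* of an opaque model, so its coefficients act as a
# global, human-readable feature attribution.

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))

def black_box(X):
    # Stand-in for an opaque deep model: feature 0 matters most,
    # feature 2 barely at all.
    return np.tanh(2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * X[:, 2] ** 2)

y_hat = black_box(X)

# Fit the simple surrogate to the black box's outputs (with a bias column).
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y_hat, rcond=None)

print(coef[:3].round(2))  # |feature 0| > |feature 1| > |feature 2|
```

The surrogate is deliberately too simple to match the black box exactly; the point is that its handful of coefficients can be read and audited, which millions of neural-network weights cannot.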

Transfer learning enables firms to work with much smaller data sets
Machine learning / deep learning algorithms have in the past been designed to work in isolation: each is trained to solve a specific task, and the models often have to be rebuilt from scratch when the scope changes. Transfer learning is the idea of overcoming this isolated-learning problem and using what was learned on one task to solve related ones.
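The mechanics can be sketched with plain linear models: several data-rich “source” tasks share a low-dimensional structure; we extract that structure once, freeze it, and fit only a tiny head on a 20-example target task. All data, dimensions, and task counts here are illustrative assumptions:

```python
import numpy as np

# Transfer learning in miniature: frozen features learned from big source
# tasks let a 20-example target task beat training from scratch.

rng = np.random.default_rng(1)
D, K = 100, 5                       # input dim, shared latent dim
basis = rng.normal(size=(D, K))     # structure shared across all tasks

def make_task(n, head):
    X = rng.normal(size=(n, D))
    y = X @ (basis @ head) + 0.01 * rng.normal(size=n)
    return X, y

# "Pretraining": fit 8 data-rich source tasks; the span of their learned
# weight vectors approximates span(basis) and becomes the frozen feature map.
W = np.column_stack([
    np.linalg.lstsq(*make_task(2000, rng.normal(size=K)), rcond=None)[0]
    for _ in range(8)
])
feat = np.linalg.qr(W)[0]           # D x 8 orthonormal frozen features

# Target task: only 20 labelled examples.
tgt_head = rng.normal(size=K)
X_tr, y_tr = make_task(20, tgt_head)
X_te, y_te = make_task(1000, tgt_head)

# From scratch: 100 free parameters on 20 examples is badly underdetermined.
w_scratch = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]

# Transfer: freeze `feat`, fit only the 8 head parameters.
head = np.linalg.lstsq(X_tr @ feat, y_tr, rcond=None)[0]

err = lambda pred: np.mean((pred - y_te) ** 2)
print(err(X_te @ w_scratch), err(X_te @ feat @ head))
```

The transferred model's test error is orders of magnitude lower: the frozen features carry the knowledge the tiny target data set could never supply on its own.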

One/N shot learning is an “extreme cousin” of transfer learning for even smaller data
One-shot learning is a variant of transfer learning that infers the required output from just one or a few training examples. Zero-shot learning is a more extreme version that relies on no labeled examples at all. This is quite amazing and becomes possible essentially by making smart adjustments during the training stage to leverage additional information about the unavailable data.

Danko Nikolic calls this the “extreme cousin” of transfer learning and argues that it’s absolutely crucial to put your own domain expertise into the model. Sometimes that’s not possible: some of your expertise is subconscious, and there is not enough data with which to teach an algorithm. The idea then is to let the machine develop the domain expertise itself from a big data set. This can act as an add-on that puts existing deep learning models on ‘steroids’, with access to the original data sets to improve learning.

In a nutshell, one-shot learning is related to deep learning but addresses the edge cases that deep learning tends to ignore. For increased adoption, it’s important to enable companies to implement it easily on top of already existing models.
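A minimal sketch of one-shot classification by nearest neighbour in an embedding space. The embedding here is a fixed random projection standing in for a network pretrained on related data; real systems learn it, e.g. with siamese or metric-learning objectives, and the class names are purely illustrative:

```python
import numpy as np

# One-shot classification: one labelled "support" example per class, and a
# query is assigned to the class whose support embedding it is closest to.

rng = np.random.default_rng(2)
D, E = 50, 16
embed_W = rng.normal(size=(D, E)) / np.sqrt(D)   # stand-in "pretrained" embedding

def embed(x):
    v = np.tanh(x @ embed_W)
    return v / np.linalg.norm(v)                 # unit-norm embedding

# Exactly one labelled example per class.
prototypes = {c: rng.normal(size=D) for c in ("cat", "dog", "bird")}
support = {c: embed(x) for c, x in prototypes.items()}

def classify(x):
    e = embed(x)
    return max(support, key=lambda c: e @ support[c])  # cosine similarity

# A noisy variant of the "dog" example should land on "dog".
query = prototypes["dog"] + 0.1 * rng.normal(size=D)
print(classify(query))
```

All the heavy lifting is in the embedding; once it maps related inputs close together, a single example per class is enough to classify, which is precisely why one-shot learning leans on transfer.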

Improving the human feedback loop: A ‘cold-start solution’?
There are tons of ways to improve the human feedback loop and, as discussed earlier, a lot of this relates to innovation on the interface side. One problem in this context is anomalies, which by definition are hard to find. Wilson Chan explains that Permutable allows users to draw on interfaces (characteristics, similarities, etc.). By taking these drawings and combining them with data augmentation, they can kickstart the labelling process; Permutable calls this a ‘cold-start solution’. Essentially, you get an estimated sample of similar classifications, and this is where the feedback loop commences.

We need better ‘cognitive’ approaches/frameworks to capture expert knowledge
Advancements in cognitive frameworks and tutorials for an AI system might offer the best route to more resource-efficient and better decision-making. This is also partly a front-end challenge in terms of how to capture the right data in the most effective way.

Related to this challenge is the evolution of expert systems. An expert system basically records the situation and surroundings in which a human made a decision, as well as the decision itself, and stores them in a database.

This allows us to capture data that gets us closer to capturing ‘knowledge’.
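A minimal sketch of what such a decision record might look like: the situation, the decision, and the expert’s own rationale are captured together so the tacit context survives. The field names and example values are illustrative assumptions, not a real schema:

```python
import datetime
from dataclasses import dataclass, field, asdict

# An expert-system style record: situation + decision + rationale, appended
# to a store (a plain list stands in for the database here).

@dataclass
class DecisionRecord:
    expert: str
    situation: dict      # observable context at decision time
    decision: str
    rationale: str       # the expert's own explanation, however partial
    timestamp: str = field(default_factory=lambda: datetime.datetime
                           .now(datetime.timezone.utc).isoformat())

records: list[DecisionRecord] = []   # stand-in for a database table

records.append(DecisionRecord(
    expert="dr_example",
    situation={"patient_age": 64, "troponin": "elevated", "ecg": "normal"},
    decision="admit_for_observation",
    rationale="ECG is normal, but the troponin trend worries me at this age.",
))

print(asdict(records[0])["decision"])
```

The rationale field is the important one: it is exactly the tacit, in-the-head knowledge discussed above, recorded at the moment of decision rather than reconstructed later.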

Hiring a dedicated senior researcher to read the latest ML research and test new models can make a significant difference
In this context, the skills of the person are crucial for success, combined with a deep understanding of the problem that needs to be solved (assuming the problem is complex enough to require this research). When asked which model approaches are particularly exciting, Bradley Arsenault mentioned anti-loss models and filling-in-the-blanks models. Essentially, you can reduce overfitting on small data sets by applying anti-loss models: you train a neural network to predict one variable as well as possible whilst being deliberately bad at predicting another.

For example, with gender discrimination: the model becomes really good at race identification but really bad at gender identification. You can then swap the variables and cancel out the overfitting.

The concept of filling-in-the-blanks models can be applied across an extremely large number of models and problems. For instance, you have data sets with 10 fields, but you feed the model only 9 of them and let it figure out the last one. Doing this across all the data sets on the same model (a correlation method) sets up the internals of your neural network in a useful way.
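The mechanics can be sketched with a linear model on synthetic data: hide one of 10 correlated features and learn to predict it from the other 9. Deep models use the same masked-prediction idea to pretrain their internals; a linear model just keeps the moving parts visible:

```python
import numpy as np

# "Filling in the blanks": predict a hidden feature from the visible ones,
# exploiting the correlations in the data. All data is synthetic.

rng = np.random.default_rng(3)
d = 10

def sample(n):
    # All 10 features share one latent factor, so they are correlated.
    latent = rng.normal(size=(n, 1))
    return latent @ np.ones((1, d)) + 0.3 * rng.normal(size=(n, d))

X = sample(2000)
blank = 9                                   # index of the hidden feature
known = [j for j in range(d) if j != blank]

# "Train": least squares from the 9 visible features to the blanked one.
w, *_ = np.linalg.lstsq(X[:, known], X[:, blank], rcond=None)

# Held-out data: filling the blank beats simply guessing the mean.
Xh = sample(500)
mse = np.mean((Xh[:, known] @ w - Xh[:, blank]) ** 2)
baseline = np.mean((Xh[:, blank] - Xh[:, blank].mean()) ** 2)
print(mse, baseline)
```

No labels were needed; the supervision signal comes from the data itself, which is what makes the trick applicable across so many models and problems.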

Eventually, Bradley hopes, we will have pre-trained core networks that can take any input, format it, and then have general-purpose networks trained on top of them for specific problems.

Genetic machine learning might be the route to ‘intelligence’
Our genome has rules that enable our development, but it also initiates the process of learning, argues Danko Nikolic. That’s why humans are capable of acquiring knowledge, learning, adapting, etc. If we want to reach a similar equivalent with technology, we need to create tech with capabilities similar to our genome’s. We need to go much deeper into the “learning to learn to learn” methodology.

Evolutionary algorithms as a highly flexible approach

An evolutionary algorithm (EA) uses mechanisms inspired by biological evolution, such as reproduction, mutation, recombination, and selection. To summarise: in an evolutionary algorithm, fitter members survive while unfit members die and do not contribute to further generations, much like in natural selection.
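The mechanisms above can be sketched in a few lines; the population size, rates, and toy fitness function are illustrative choices:

```python
import numpy as np

# A minimal (mu + lambda) evolutionary algorithm: selection, recombination
# (uniform crossover), and mutation, maximising a simple fitness function.

rng = np.random.default_rng(4)

def fitness(x):
    # Maximum 0 at x = (1, ..., 1); anything closer to it is "fitter".
    return -np.sum((x - 1.0) ** 2)

pop = rng.normal(size=(30, 5))            # 30 individuals, 5 genes each

for gen in range(200):
    scores = np.array([fitness(x) for x in pop])
    parents = pop[np.argsort(scores)[::-1][:10]]   # selection: fittest 10 survive
    kids = []
    for _ in range(20):
        a, b = parents[rng.integers(10, size=2)]
        mask = rng.random(5) < 0.5                 # recombination: uniform crossover
        child = np.where(mask, a, b)
        child = child + 0.1 * rng.normal(size=5)   # mutation
        kids.append(child)
    pop = np.vstack([parents, kids])               # survivors + offspring

best = max(pop, key=fitness)
print(best.round(2))          # close to (1, 1, 1, 1, 1)
```

Note that nothing here is specific to the fitness function: swap it for any black-box score, including one shaped by user preferences, and the same loop keeps working, which is the flexibility discussed next.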

The importance for users and companies comes primarily from flexibility: EAs can self-adapt the search for optimum solutions. For instance, EA approaches can take user preferences into account when ‘explaining’ the outcome of a certain model. The benefits of flexibility and adaptability make it easier for companies to get started. Hybrid models offer the most potential in this context, for instance combining deep learning with evolutionary computation; splitting off some features this way has already been achieved successfully.

The staggering amount of ‘edge cases’ will accelerate the importance of flexible small data approaches such as evolutionary algorithms.

The vast majority of data sets out there are small. There are so many edge cases in the world with limited data that we simply do not have a choice. Furthermore, machine learning approaches must not only work on small data sets but also become more flexible. According to Marc Schoenauer, evolutionary algorithms offer the best first foundation for figuring out the next best algorithm: an evolutionary algorithm might be the second-best algorithm for any solution.
