Making small data sets more useful

Given our focus on computer vision, we've also spent a lot of time looking at the deep learning ecosystem as a whole. Within AI, deep learning has received a lot of attention in recent months, and for good reason: its success across a wide variety of problems is impossible to ignore. Most of the promise of deep learning hinges on convolutional neural networks and unsupervised or semi-supervised feature learning, which require more data than typical machine learning algorithms built on hand-tuned feature extraction.

Deep learning is very well-suited to a specific class of problems. However, we’ve seen a lot of folks use deep learning in contexts where it may not be the best fit. We believe there’s also a large opportunity to create value for end users by applying new, less-supervised techniques to smaller data sets. These could be data sets generated by single individuals, like health sensor readings, email correspondence, or location data. Time series data is often a good example of such a smaller data set, because the rate of new data collection is constrained by time.

New academic work suggests there may be better ways to apply machine learning in highly data-constrained situations, and researchers are pursuing a few promising directions. One has been to use probabilistic methods to increase neural network accuracy on small data sets, such as this paper defining a technique using posterior probabilities. Another technique, called Bayesian program learning (BPL), was recently used to recognize and generate handwritten characters from a single example. The BPL system the researchers built was able to pass a Turing test of sorts: the handwriting samples it generated from one example could not be distinguished from human-generated samples. The full paper can be found here.
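To make the probabilistic intuition concrete, here is a minimal sketch (not the method from the paper above) of why Bayesian posteriors are attractive when data is scarce: a Beta prior over a Bernoulli success rate lets us report a calibrated estimate after only a handful of observations, rather than committing to the raw frequency. The function name and prior values are illustrative assumptions.

```python
def beta_binomial_posterior(successes, failures, alpha=1.0, beta=1.0):
    """Update a Beta(alpha, beta) prior on a Bernoulli success rate
    after observing `successes` and `failures` trials.

    Returns the posterior parameters and posterior mean.
    """
    post_alpha = alpha + successes
    post_beta = beta + failures
    mean = post_alpha / (post_alpha + post_beta)
    return post_alpha, post_beta, mean

# With only 3 observations, the posterior mean (3/5 = 0.6) is pulled
# toward the uniform prior instead of the raw 2/3 frequency estimate.
a, b, mean = beta_binomial_posterior(successes=2, failures=1)
print(round(mean, 3))  # 0.6
```

The key property for small data sets is that the prior acts as regularization: with few observations the estimate stays conservative, and as data accumulates the posterior converges to the empirical frequency.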

Research in this area is still very early, and the broader use cases for BPL remain unproven, but there are early indications that it could have more general applications. From the work done so far, it's clear there are vision-related use cases where BPL can learn from far less data than deep learning requires. BPL could also be used to generate additional examples (like the handwritten characters above) to feed into a CNN training process.
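The "generate more examples" idea is essentially data augmentation: expanding a small labeled set with label-preserving transforms before training. As a hedged sketch (simple geometric transforms on toy 2D arrays, not the BPL generative model itself), the expansion step might look like:

```python
def flip_horizontal(img):
    """Mirror a 2D image (list of rows) left-to-right."""
    return [row[::-1] for row in img]

def shift_right(img, fill=0):
    """Shift each row one pixel right, padding the left edge with `fill`."""
    return [[fill] + row[:-1] for row in img]

def augment(images):
    """Expand a small image set with label-preserving transforms.

    Each original image yields itself plus two transformed copies,
    tripling the effective training set for a downstream CNN.
    """
    out = []
    for img in images:
        out.append(img)
        out.append(flip_horizontal(img))
        out.append(shift_right(img))
    return out

tiny_set = [[[0, 1], [1, 0]]]  # a single 2x2 "image"
print(len(augment(tiny_set)))  # 3 examples from 1
```

A generative model like BPL would replace these hand-coded transforms with learned ones, producing new samples that vary in the ways real handwriting varies rather than by fixed flips and shifts.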

Value provided on top of a small data set also creates an opportunity to aggregate a larger data set across all users of a company's product, where the initial incentive for users to provide data might not otherwise exist. Mint and Personal Capital are good examples: each is useful to a single user, providing an aggregate view of personal finances, but their users have collectively built up a much larger multi-user financial data set that allows these products to offer more powerful recommendations. Small-data AI could be a new way to provide that initial user value, paving the way to aggregate features like benchmarking or more traditional deep learning on the larger future data set.

More data will almost always lead to better results, but we believe there are many potential use cases for intelligent systems where data is scarce. If you or anyone you know uses machine learning on smaller data sets, we would love to chat!