Learning with Small Data: Part 1

Juan Mancilla Caceres
aiincube-engineering
5 min read · Dec 1, 2020

In this series of posts, I want to address a topic that I believe is not covered as much as it should be: what to do when the amount of training data for your model is small. We live in the era of Google, Cloud Computing, and Deep Learning, all of which give the impression that data and computing power are readily available to anyone, for any application you can imagine. Nevertheless, a common problem for startups (especially innovative ones) is that their particular problem or application does not have large amounts of data available, whether because the data exists but is too expensive, exists but not in digitized form, or simply does not exist.

This is known as the Cold Start Problem, and it is quite common in real-world applications (beyond the classic blog examples built on common, overused datasets like MNIST). For example, we first faced this problem back in 2010 when Parknav was starting and we realized, to our surprise, that major US cities had no organized information on either parking restrictions or parking statistics. This means you will likely have to collect the data yourself or pay for it, which in turn means you may want to start extracting value from it as soon as possible (even when its volume does not call for the latest GPU technology and a deep neural network with thousands of layers).

For the next four posts, starting with this one, I will go through some of the techniques that we have found useful when dealing with small amounts of data. The topics will be:

1. Getting the Data: Sampling and Confidence Intervals

2. Choosing your Model: Generative vs. Discriminative Models

3. Improving your results: Using priors, MLE vs. MAP estimators

4. Evaluating your model: Cross-fold validation and One-vs-All

1. Getting the Data: Sampling and Confidence Intervals

If you are starting a project where data is not readily available in large quantities, chances are that you will need to gather it yourself. This means making two choices: how you will sample your data, and how much of it to sample.

Regarding the type of sampling, you need some understanding of the nature of your data. You want an educated guess about its distribution: whether it is expected to be normally distributed, uniformly distributed, and so on. With that distribution in mind, be careful not to collect a biased sample, but one that is as representative of the actual distribution as possible.

For example, imagine you want to model something that happens every day of the week, such as ordering take-out dinner. In this case, you want your sampled orders distributed equally across the days of the week so as not to overfit your model to a particular day. In contrast, if your model is trying to predict something that happens more often on weekends, say going to the movies, you want to sample movie attendance with more emphasis on weekends.
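The two sampling schemes above can be sketched with NumPy. The weekend weights below are purely illustrative assumptions, not real attendance data:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Uniform sampling: take-out orders happen every day, so sample each
# day with equal probability to avoid biasing the model toward one day.
uniform_sample = rng.choice(days, size=1000)

# Weighted sampling: movie attendance is assumed to skew toward the
# weekend, so weight the sample to mirror that assumed distribution.
weekend_weights = np.array([0.08, 0.08, 0.08, 0.10, 0.16, 0.26, 0.24])
weighted_sample = rng.choice(days, size=1000, p=weekend_weights)
```

Either way, the sampled frequencies should match the distribution you believe governs the phenomenon, not whatever happens to be easiest to collect.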

Regarding the amount of data, there are several techniques you may employ, but I suggest computing the confidence interval (CI) of the data under an assumed distribution, for different numbers of samples. That way, you will be able to guarantee to your clients the interval in which the actual value of your experiment lies, based on your estimate. It is always a good idea to compute this interval, but it is of utmost importance when dealing with small amounts of data, as you want to be certain that the results are within an acceptable margin of error.

As an example, imagine that you want to test whether a coin is fair or not (i.e., whether the probability of getting heads is 0.5). For this example, we would compute the confidence interval using the following expression (assuming a binomial distribution):

CI = p ± z · √( p(1 − p) / n )

Confidence Interval for a Binomial Distribution

Where p is the estimated probability of getting heads, z is a known coefficient that depends on the confidence level (1.96 for the commonly used 95% confidence level), and n is the number of samples. Assume we get the following results when throwing the coin 3, 5, 10, or 25 times:

╔═════════════╦═════════╦═════════════╗
║ #throws (n) ║ #heads  ║ Estimator p ║
╠═════════════╬═════════╬═════════════╣
║ 3           ║ 1       ║ 1/3 = 0.33  ║
║ 5           ║ 2       ║ 2/5 = 0.4   ║
║ 10          ║ 3       ║ 3/10 = 0.3  ║
║ 25          ║ 7       ║ 7/25 = 0.28 ║
╚═════════════╩═════════╩═════════════╝

Therefore, for the examples above we would have the following CIs:

╔═════════════╦══════════════════════════╗
║ #throws (n) ║ CI ║
╠═════════════╬══════════════════════════╣
║ 3 ║ 0.33 +/- 0.53 [0,0.86] ║
║ 5 ║ 0.4 +/- 0.43 [0,0.83] ║
║ 10 ║ 0.3 +/- 0.28 [0.02,0.58] ║
║ 25 ║ 0.28 +/- 0.17 [0.1,0.45] ║
╚═════════════╩══════════════════════════╝
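The CI table above can be reproduced with a few lines of Python using the formula as given (`binomial_ci` is a name I'm introducing here, not from the original post):

```python
import math

def binomial_ci(p_hat, n, z=1.96):
    """Confidence interval for a binomial proportion: p +/- z*sqrt(p(1-p)/n).
    The lower bound is clipped at 0, since a probability cannot be negative."""
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - margin), p_hat + margin

for n, heads in [(3, 1), (5, 2), (10, 3), (25, 7)]:
    p_hat = heads / n
    low, high = binomial_ci(p_hat, n)
    print(f"n={n:>2}: p={p_hat:.2f}, CI=[{low:.2f}, {high:.2f}]")
```

Note how the half-width shrinks roughly with the square root of n: quadrupling the samples only halves the interval.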

You can see that only after 25 samples can we say with 95% confidence that 0.5 is not within the confidence interval, and therefore conclude that the coin is not fair. So, if your goal is to determine whether this coin is fair, you should plan for at least 25 samples. Notice that this number depends on the actual parameter of the coin: this example simulated a coin with a 0.3 probability of getting heads. If the coin had a parameter of 0.45, we would need at least 150 throws to get a confidence interval that excludes 0.5. This is sometimes referred to as the effect size of an experiment: here, the effect is the difference between the expected parameter (0.5) and the actual one, and the number of samples required depends on the size of the effect we want to measure.
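You can turn this reasoning into a rough planning tool by searching for the smallest n whose interval excludes 0.5. This is an idealization: it assumes the observed proportion lands exactly on the true parameter, so real experiments (where the estimator fluctuates, as in the tables above) will typically need somewhat more throws. The function name is mine, not from the post:

```python
import math

def min_samples_to_exclude(p_hat, p_null=0.5, z=1.96):
    """Smallest n for which the CI around p_hat excludes p_null,
    assuming the observed proportion equals p_hat (a planning idealization)."""
    n = 1
    while z * math.sqrt(p_hat * (1 - p_hat) / n) >= abs(p_hat - p_null):
        n += 1
    return n

# A coin with p = 0.3 (large effect) needs far fewer throws than one
# with p = 0.45 (small effect) before the CI excludes fairness.
print(min_samples_to_exclude(0.3))
print(min_samples_to_exclude(0.45))
```

For p = 0.3 this gives a figure in the same ballpark as the 25 throws observed above; shrinking the effect toward zero blows the required sample size up quadratically.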

Conclusion

Long story short: if you expect to deal with small amounts of data, make sure you have an idea of the actual distribution and try to sample according to it. Also, based on your expectations (i.e., prior knowledge of the problem), you can estimate the confidence interval of your results and choose the number of samples accordingly.

Next time, we will discuss how to choose a model for your small sample of data, based on the strengths and assumptions of different model families.
