1. Scale drives machine learning progress
When we say “scale”, it generally refers to two different things:
1.1 Computational scale
Also known as the amount of computation available for use. Generally, the more compute is available, the more easily machine learning tasks can be solved. Only recently have people started training neural networks large enough to take advantage of all the data we have in the world.
There are also interesting counterexamples, such as Starship Technologies, which uses low-compute platforms to do things generally not solvable by current algorithms. This has the potential to be a strong R&D moat (sustainable competitive advantage) but may slow down development speed initially.
1.2 Data availability
With the information age, the capacity to generate data online and offline has grown exponentially. This is partially why data-hungry machine learning algorithms started out-performing more classical AI approaches. In particular, large neural networks have so far shown no limit to performance gains when given more and more training data.
In a typical commercial setting nowadays, the constraint for training data-hungry algorithms isn’t the amount of data available, it is the amount of labeled data available. There are generally two ways of obtaining labeled data: human annotation or automatic generation. Sometimes annotation can happen as part of a company’s normal processes, but in the far more common case, the data has to be meticulously hand-labeled by an additional team as a form of highly repetitive work.
2. Optimization of algorithms
In mathematical terms, optimization is a selection process where the best element is chosen (according to some criterion) from a set of alternatives. Finding that criterion remains difficult even for machine learning experts, because in the real world there are many considerations to weigh. Needing multi-objective optimization signals that some information is missing: when two objectives conflict, a tradeoff must be made somehow.
For the sake of simplicity, let’s consider two algorithms with the following properties for speech recognition:
Accuracy, in this case, can stand for word error rate, and delay for the amount of time the user needs to wait before the output can be shown.
Somehow the designers of the system need to know which of these algorithms is better and which version should be shipped to customers. There are three common ways of combining these metrics.
2.1 Optimize and satisfy
If the use case for our algorithm is real-time speech recognition, we might say that anything less than “real-time” is unacceptable. The longest acceptable delay might be 500ms, meaning that we can rule out Algorithm A.
Evaluation metric = Accuracy * (Delay < 0.5s)
The question naturally arises: does an algorithm exist with higher accuracy that still meets the delay constraint?
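This selection rule can be sketched in a few lines of Python. The candidate numbers below are hypothetical stand-ins (the article’s own table is not reproduced here); only the 500ms threshold and the zero-out behavior come from the text.

```python
# Pick the most accurate algorithm among those that satisfy the delay constraint.
# Candidate accuracy/delay values are hypothetical, not from the article's table.
MAX_DELAY = 0.5  # seconds; the satisficing threshold

candidates = {
    "Algorithm A": {"accuracy": 92.0, "delay": 0.8},
    "Algorithm B": {"accuracy": 88.0, "delay": 0.3},
}

# Evaluation metric = Accuracy * (Delay < 0.5s): a failed constraint zeroes the score.
def score(metrics):
    return metrics["accuracy"] * (metrics["delay"] < MAX_DELAY)

best = max(candidates, key=lambda name: score(candidates[name]))
print(best)  # Algorithm A's delay zeroes its score, so Algorithm B wins
```

Note that the more accurate candidate can lose outright here; a constraint violation is not penalized gradually but eliminates the candidate entirely.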
2.2 Single-number evaluation metric
Even though satisficing is an easier problem statement, it is often worth designing an evaluation metric that doesn’t impose hard limits. For example, if we had an 85% accuracy algorithm that took 600ms to compute, we would automatically disregard it under the previous evaluation metric.
In this case, I combined delay and accuracy as follows:
Evaluation metric = Accuracy - (Delay / 100)
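A minimal sketch of this single-number metric, assuming accuracy is measured in percent and delay in milliseconds (the units are my reading of the formula, chosen so the article’s 85% / 600ms example pays a meaningful penalty):

```python
# Single-number evaluation metric without a hard cutoff:
# Evaluation metric = Accuracy - (Delay / 100)
# Assumed units: accuracy in percent, delay in milliseconds.
def evaluation_metric(accuracy_pct, delay_ms):
    return accuracy_pct - delay_ms / 100

# The 85% / 600 ms algorithm from the text is no longer automatically
# disregarded; it simply pays a 6-point delay penalty.
print(evaluation_metric(85.0, 600))  # 79.0
print(evaluation_metric(88.0, 300))  # 85.0
```

Under this metric every candidate gets a comparable score, so a slightly-too-slow but much more accurate algorithm stays in the running.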
2.3 Person makes the call
This is usually the base case that gets used until the solution becomes important enough to require constant development by a team rather than a single person. Usually this person is the engineer solving the problem, the product owner, or the technical lead.
Move away from this base case if the problem you are solving meets one or more of these criteria:
- Multiple people involved
- Multiple approaches will be tried
- A single person is a bottleneck on moving forward
3. Start with data
Machine learning development oftentimes looks like regular software development: data scientists in a small team might spend a bigger part of their time building front-ends and back-ends and managing infrastructure than training algorithms. However, they often think very differently from software engineers. The first question a data scientist asks when confronted with a new problem is:
“Do you already have the data, or can we obtain it easily?”
There is a very good reason for that: after years of working with humans, I find it rare that they accurately and consistently know what the problem actually is without looking at the data. This is why data scientists always start solving a problem by looking at the data. However, data comes in many forms and levels of usefulness.
3.1 Specific examples and metrics-level data
A machine learning engineer trying to automatically flag abusive content on a social media platform will first try to understand the big picture: what are the relative weights of the different problems contained in this larger problem. After spending a little time exploring the data, they come up with a metrics-level understanding of it:
Before doing this analysis, the engineer probably received a problem statement that misjudged the relative importance of solving nudity and hate speech, but looking at the metrics-level description, it is apparent that 87% of the problem is solved by tackling only these two categories.
Looking closer, the engineer will want to solve these cases; for this, they will need to look at specific examples of the data. These examples generate ideas that allow building better algorithms, completing a loop from metrics to specific examples to solutions.
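A metrics-level breakdown like this is straightforward to compute from a labeled sample. The category counts below are invented purely so that the two biggest categories sum to the 87% figure quoted above; only “nudity” and “hate speech” come from the text.

```python
from collections import Counter

# Hypothetical labeled sample of flagged content. Counts are made up so that
# the top two categories cover 87%, matching the figure quoted in the text.
reports = (
    ["nudity"] * 520 + ["hate speech"] * 350 +
    ["spam"] * 80 + ["violence"] * 50
)

counts = Counter(reports)
total = sum(counts.values())

# Metrics-level view: share of the overall problem per category,
# largest first, with a running cumulative share.
cumulative = 0.0
for category, n in counts.most_common():
    share = n / total
    cumulative += share
    print(f"{category:12s} {share:6.1%}  cumulative {cumulative:6.1%}")
```

A table like this is usually enough to decide where the first iteration of effort should go; the specific examples behind the top rows then drive the actual algorithm ideas.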
3.2 Data is the specification
“Data is the specification” is what emerged when the concepts of “start with data”, “minimum viable product” and “iterative development” came together.
From all of our collected experiences, we have extracted three core principles:
1. Instead of coming up with a good general solution, it is better to focus on solving specific cases of the problem.
2. Instead of trying to solve all problematic cases right away, it is better to address a proportion of the easiest cases and repeat the process multiple times.
3. Instead of writing a good specification of a solution, it is better to curate a collection of good problem cases. This collection of problem cases then essentially becomes your specification.
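The three principles above can be sketched as code. Everything in this example is a toy illustration of my own choosing (the task, a trivial text normalization, is invented); the point is only the shape: a curated collection of (input, expected output) cases plays the role of the specification, and failures feed the next iteration.

```python
# "Data is the specification": a curated collection of problem cases
# (input, expected output) acts as the spec. The task here (toy string
# normalization) is a hypothetical stand-in.
spec_cases = [
    ("Hello ", "hello"),
    ("  WORLD", "world"),
    ("Mixed Case", "mixed case"),
]

def solve(text):
    # Current iteration of the solution; improved whenever a newly
    # curated case exposes a failure.
    return text.strip().lower()

# The spec is "all curated cases pass"; any failures become the next
# batch of cases to understand and fix.
failures = [(inp, exp) for inp, exp in spec_cases if solve(inp) != exp]
print(f"{len(spec_cases) - len(failures)}/{len(spec_cases)} cases pass")
```

Each development cycle then consists of adding newly found hard cases to `spec_cases`, understanding why they fail, and updating `solve` until they pass.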
This way of developing intelligent systems is so effective that everyone who has experienced it takes it as the new norm. Margus Niitsoo shared his experience:
After a year of working on my own “intelligent” project in a completely different domain, I can agree to this approach whole-heartedly.
One point to note though is that this is not “let’s find more hard cases and feed them to the black-box machine learning algorithm”, which, while tempting, usually falls short.
Instead, it is more like “let’s find hard cases for the current state, understand why they are hard, fix them, iterate”.
I especially like the note about how this approach finds bugs that are otherwise near-impossible to even notice, as the exact same thing happened multiple times to me too, and that is probably the main thing that finally made me a convert.
3.3 Statistical unit testing
In traditional software development, testing plays a critical role in producing stable software at scale. One of the more popular paradigms are unit tests:
Unit tests are typically automated tests written and run by software developers to ensure that a section of an application (known as the “unit”) meets its design and behaves as intended.
Applying a similarly low-level test suite to algorithms can be difficult, because results on specific cases may get shuffled around significantly without a large effect on final algorithm performance. To still set up tripwires around algorithm releases, we have started using statistical unit tests: the algorithm is run against a small set of salient cases (typically 100) and the results are compared against the latest benchmark. If the results get significantly worse, the developer fixes the underlying bugs or has to manually force the release.
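A minimal sketch of such a statistical unit test. The model, the case set, and the regression tolerance are all hypothetical placeholders; the shape to take away is "aggregate score on a fixed salient set, gated against the last benchmark" rather than per-case assertions.

```python
# Statistical unit test: run the algorithm on a small fixed set of salient
# cases and compare the aggregate score against the last benchmarked score.
# `algorithm`, the cases, and the tolerance are illustrative stand-ins.
BENCHMARK_ACCURACY = 0.90   # score of the previously released version
TOLERANCE = 0.05            # how much regression triggers a failure

def algorithm(x):
    return x % 2 == 0  # placeholder model: "is x even?"

salient_cases = [(x, x % 2 == 0) for x in range(100)]  # (input, label)

def statistical_unit_test():
    correct = sum(algorithm(x) == label for x, label in salient_cases)
    accuracy = correct / len(salient_cases)
    # Trip the release gate if results are significantly worse than benchmark.
    assert accuracy >= BENCHMARK_ACCURACY - TOLERANCE, (
        f"accuracy {accuracy:.2f} regressed below benchmark; "
        "fix the underlying bug or manually force the release"
    )
    return accuracy

print(f"accuracy on salient cases: {statistical_unit_test():.2f}")
```

Because only the aggregate score is asserted, individual cases are free to flip between releases without breaking the build, while a broad regression still stops the release.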
3.4 Statistical integration testing
Taking the unit testing framework further, an integration testing framework can be designed which mimics the full production flow as closely as possible.
This is especially useful for catching problems between algorithms when the precision-recall curve changes significantly, since downstream services have often made assumptions about the properties of results at specific precision-recall points.
As with statistical unit testing, we have a set of cases that get simulated as if they were production data. The final results are compared to known ground truth; if the baseline metrics get significantly worse, the automatic deploy procedure is stopped.
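The deploy gate described above might look roughly like this. The pipeline stages, simulated cases, and baseline threshold are hypothetical; the essential idea is that cases flow through the full production path and only the end-to-end metric is compared to the baseline.

```python
# Statistical integration test: push simulated cases through the full
# pipeline (as close to production as practical) and compare the aggregate
# metric to known ground truth before allowing a deploy. All stages and
# values here are hypothetical stand-ins.
def preprocess(text):
    return text.lower().strip()

def classify(text):
    return "abusive" if "bad" in text else "ok"   # placeholder model

def pipeline(text):
    return classify(preprocess(text))

simulated_cases = [   # (input, ground truth)
    ("  BAD comment ", "abusive"),
    ("nice comment", "ok"),
    ("really bad", "abusive"),
    ("hello", "ok"),
]

BASELINE_ACCURACY = 0.75  # metric of the currently deployed pipeline

def deploy_gate():
    correct = sum(pipeline(t) == truth for t, truth in simulated_cases)
    accuracy = correct / len(simulated_cases)
    # Stop the automatic deploy if the end-to-end metric regresses.
    return accuracy, accuracy >= BASELINE_ACCURACY

accuracy, deploy_allowed = deploy_gate()
print(f"accuracy={accuracy:.2f}, deploy allowed: {deploy_allowed}")
```

Running every stage, including preprocessing, is what distinguishes this from the unit-level test: it catches mismatched assumptions between components that each pass their own gates.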