Structuring Machine Learning Projects (Part 1)

Shikhar Ghimire
May 10 · 16 min read

Machine learning projects, like everything in life, need a concrete plan. Just feeding data into machine learning algorithms won’t give you optimal results. So here are a couple of blueprints that you can utilise.

Why strategise Machine Learning?

Let’s start with an example. Say you are working on a cat classifier and, after spending some time training it, your accuracy comes out at 90%. That is not good enough. Normally, the things we try next are as follows:

  • Collect more data
  • Collect more diverse training set
  • Train algorithm longer with gradient descent
  • Try Adam instead of gradient descent
  • Try bigger networks
  • Try Dropout
  • Add L2 regularisation
  • Change the Network architecture
  • Change the activation function

When trying to improve a deep learning system, we often have a lot of ideas that we can try. The problem is that if we choose poorly, we can spend 6 months going in the wrong direction, only to realise at the end that it wasn’t the right way.

Setting up the Goal

1). Use single number evaluation metric

Machine learning is a very empirical process. We often have an idea, code it up, run an experiment to see how it did, and then use the outcome to refine our ideas, going around this loop again and again to improve our algorithm.

Whenever you are prototyping a deep learning architecture, you will find that progress is much faster if you have a single real-number evaluation metric that quickly tells you whether the new things you tried (optimisation, regularisation, etc.) are improving performance or not.

Let’s take an example:

We have two classifiers:

  • Classifier A
  • Classifier B

Let’s assume that we first trained classifier A, didn’t like its results, and tuned the hyperparameters, which resulted in classifier B. One reasonable way to evaluate the performance of the classifiers is to look at their precision and recall.

  • Precision: of all the examples that the classifier recognised as cats, what percentage actually are cats?
  • Recall: of all the images that really are cats, what percentage were correctly recognised as cats by the classifier?

Using a single real-number evaluation metric makes it much easier to decide which classifier is better and which ideas or hyperparameter settings give the best results.

The problem with using precision and recall, however, is that if classifier A does better on recall and classifier B does better on precision, you are not sure which classifier is better. To overcome this, you can combine them into a single metric called the F1 score. In the machine learning literature, the F1 score is the standard way to combine precision and recall.

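Formula of the F1 score: F1 = 2 × (precision × recall) / (precision + recall), i.e. the harmonic mean of precision and recall.

Comparing F1 scores on the test set gives you a clear indication of whether to choose classifier A or classifier B.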
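To make the three quantities concrete, here is a minimal Python sketch that computes precision, recall and the F1 score from scratch. The function name `precision_recall_f1` and the tiny label lists are made up for illustration; they are not from any dataset discussed here.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels where 1 means 'cat'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cats recognised as cats
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-cats recognised as cats
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cats that were missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels (1 = cat, 0 = not cat), invented for this example.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Running the same function on the predictions of classifier A and classifier B gives one number per classifier, which is what makes the comparison quick.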

Let’s take another example

Let’s say you are building a cat app for cat lovers in four major geographies, and your two classifiers achieve different errors on data from these four geographies.

But tracking four different error numbers makes it very difficult to decide whether algorithm A or algorithm B is better. And if you are testing a lot of different algorithms, the decision gets even harder.

To overcome this, we can add another column called ‘Average error’ and, for each algorithm, calculate its average error across all the geographies.

Courtesy of Andrew Ng

Once you have calculated the average error of each algorithm, pick the one with the lowest average error.
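As a rough sketch of that ‘Average error’ column in Python: the algorithm names and per-geography error rates below are invented purely for illustration, since the original table is not reproduced here.

```python
# Hypothetical error rates (%) for four geographies; all numbers are made up.
errors_by_algorithm = {
    "A": [3.0, 7.0, 5.0, 9.0],
    "B": [5.0, 6.0, 5.0, 10.0],
    "C": [2.0, 3.0, 4.0, 5.0],
}


def average_error(per_geography_errors):
    """Collapse several error rates into a single number by averaging them."""
    return sum(per_geography_errors) / len(per_geography_errors)


averages = {name: average_error(errs) for name, errs in errors_by_algorithm.items()}
best = min(averages, key=averages.get)  # lowest average error wins

for name, avg in averages.items():
    print(f"Algorithm {name}: average error {avg:.2f}%")
print("Pick algorithm", best)
```

A plain average treats every geography as equally important; if some markets matter more to you, a weighted average would be the natural variation.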

2). Satisficing and optimising metrics

It’s not always easy to combine all the things you care about into a single real-number evaluation metric. In those cases, it is useful to set up satisficing and optimising metrics.

Let’s take an example:

Let’s say you decided that you care about the classification accuracy of the cat classifier (assume accuracy here means the F1 score). Alongside accuracy, you also care about the running time (how long it takes to classify an image) in milliseconds. In that case, you can choose the classifier that maximises accuracy subject to its running time being less than or equal to 100 milliseconds.

  • In this case we would call accuracy the optimising metric and running time the satisficing metric.

In this example, classifier B is the best choice, with 92% accuracy and a running time of 95 ms.
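The selection rule can be sketched in a few lines of Python. Only classifier B’s numbers (92%, 95 ms) come from the text above; the values for A and C, and the function name `pick_classifier`, are assumptions added so the example has something to compare against.

```python
# Classifier B matches the article; A and C are illustrative assumptions.
classifiers = [
    {"name": "A", "accuracy": 0.90, "runtime_ms": 80},
    {"name": "B", "accuracy": 0.92, "runtime_ms": 95},
    {"name": "C", "accuracy": 0.95, "runtime_ms": 1500},
]


def pick_classifier(candidates, max_runtime_ms=100):
    """Maximise accuracy (optimising metric) among candidates that meet the
    running-time threshold (satisficing metric)."""
    feasible = [c for c in candidates if c["runtime_ms"] <= max_runtime_ms]
    if not feasible:
        return None  # nothing satisfies the constraint
    return max(feasible, key=lambda c: c["accuracy"])


chosen = pick_classifier(classifiers)
print("Chosen classifier:", chosen["name"])
```

Note that C is the most accurate but is rejected outright: a satisficing metric only has to be good enough, and beyond the threshold it does not matter how much better it gets.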

3). Train/dev/test distributions

The way you set up your training, development and test sets can have a huge impact on how rapidly you and your team can make progress building machine learning algorithms.

The usual workflow in machine learning is that you train different models on the training set, use the dev set to evaluate them and pick one, and keep iterating to improve dev set performance until you finally have a classifier that you are happy with, which you then evaluate on the test set.

Setting up dev and test sets

Let’s say you are building a cat classifier and you are operating in these regions.

  • US
  • UK
  • Other Europe
  • South America
  • India
  • China
  • Other Asian Countries
  • Australia

Let’s say your dev set comes from the first four regions and your test set comes from the last four. This would create a problem, because the dev and test sets come from different distributions and neither represents all the data that you want your model to classify. It is better for evaluation to have dev and test sets that both represent data from all the regions. For example, you can shuffle all the data before you split it into train, dev and test sets.

  • Choose a dev set and test set to reflect data you expect to get in the future and consider important to do well on.

4). Size of Dev and Test Sets

You might have heard of, or practised, the rule of thumb of splitting data into 70% training and 30% test data, or, if you had to set up train, dev and test sets, 60% train, 20% dev and 20% test. In the early era of machine learning these practices were pretty reasonable, especially when data sizes were smaller. In the current era, where data is abundant, we are used to working with much larger datasets.

With a much larger dataset, say 1 million examples, we might instead work with:

  • 98% training set, 1% dev set and 1% test set
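Here is a minimal sketch of such a split in plain Python, shuffling first so that the dev and test sets are drawn from the same distribution. The function name `split_indices` and its parameters are hypothetical, and it works on example indices rather than on the data itself.

```python
import random


def split_indices(n_examples, dev_fraction=0.01, test_fraction=0.01, seed=0):
    """Shuffle example indices, then cut them into train/dev/test parts."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible

    n_dev = round(n_examples * dev_fraction)
    n_test = round(n_examples * test_fraction)
    dev = indices[:n_dev]
    test = indices[n_dev:n_dev + n_test]
    train = indices[n_dev + n_test:]
    return train, dev, test


train_idx, dev_idx, test_idx = split_indices(1_000_000)
print(len(train_idx), len(dev_idx), len(test_idx))
```

With 1 million examples, 1% is still 10,000 examples each for dev and test, which is often plenty to compare models, and that is why the old 60/20/20 habit is no longer needed at this scale.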

Rule of thumb for test sets:

  • Set your test set to be big enough to give high confidence in the overall performance of your system.

The purpose of the test set is to evaluate how good your system is after you finish developing it. The guideline is to make the test set big enough to give high confidence in the overall performance of the system.

In some applications, however, you may not need a test set if the dev set covers examples from all the classes in the training set.

It is, however, not recommended to go without a test set alongside the dev set, as the test set checks that your model is working as intended before you ship the application.

When to change Dev/Test sets and metrics

Setting up dev and test sets and an evaluation metric is like placing a target for your ML team to aim at. But sometimes, halfway through a project, you might realise that the target was placed in the wrong spot. In that case, you should move your target.

Let’s take an example:

Let’s say you build a cat classifier to find lots of pictures of cats to show to your cat-loving users, and the metric you decided to use is classification error.

Classification Error

Algorithm A: 3% Error

Algorithm B : 5% Error

At first glance, algorithm A is doing much better. But if we look in detail, let’s say algorithm A is letting through a lot of pornographic images. In contrast, algorithm B misclassifies more images but doesn’t let pornographic images through. So from the user acceptance point of view, algorithm B is the much better algorithm.

So what happened here is that algorithm A does better on the evaluation metric (3% error) but is actually the worse algorithm compared to algorithm B. The evaluation metric plus the dev set prefer algorithm A, but the users prefer algorithm B because it does not let pornographic images through.

When this happens, and your evaluation metric no longer correctly rank-orders your preferences between algorithms (here it wrongly predicts that algorithm A is better than algorithm B), you need to change the evaluation metric, or perhaps your dev or test sets.
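One possible way to change the metric, sketched here as an assumption rather than something this article prescribes, is to weight mistakes on unacceptable images more heavily than ordinary mistakes. The function name `weighted_error`, the penalty of 10 and the five-image toy data are all hypothetical.

```python
def weighted_error(y_true, y_pred, is_unacceptable, penalty=10.0):
    """Misclassification rate in which mistakes on unacceptable images
    count `penalty` times as much as ordinary mistakes."""
    weights = [penalty if bad else 1.0 for bad in is_unacceptable]
    weighted_mistakes = sum(
        w for w, t, p in zip(weights, y_true, y_pred) if t != p
    )
    return weighted_mistakes / sum(weights)


# Toy data: 1 = cat, 0 = not cat; the last image is an unacceptable one.
y_true = [1, 1, 0, 0, 0]
is_unacceptable = [False, False, False, False, True]
pred_a = [1, 1, 0, 0, 1]  # one mistake, but on the unacceptable image
pred_b = [1, 0, 0, 1, 0]  # two ordinary mistakes

plain_a = weighted_error(y_true, pred_a, is_unacceptable, penalty=1.0)
plain_b = weighted_error(y_true, pred_b, is_unacceptable, penalty=1.0)
weighted_a = weighted_error(y_true, pred_a, is_unacceptable)
weighted_b = weighted_error(y_true, pred_b, is_unacceptable)
print(f"plain:    A={plain_a:.3f}  B={plain_b:.3f}")
print(f"weighted: A={weighted_a:.3f}  B={weighted_b:.3f}")
```

With the plain error A looks better, while with the weighted error B comes out ahead, which matches what the users actually prefer.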

Let’s look at one more example:

Let’s say that two cat classifiers got an error of:

Algorithm A : 3% error

Algorithm B : 5% error

when evaluated on dev and test sets of high-quality images that you downloaded from the internet.

But when you deploy the product, you find that algorithm B actually performs better, even though it did worse on the dev set. It turns out that you trained and evaluated on very high-quality images, but once deployed in a mobile app, users upload all sorts of images, such as blurry ones or cats with funny facial expressions. Your dev and test sets therefore aren’t set on the target your application is intended for.


  • If doing well on your metric + dev/test set does not correspond to doing well on your application, change your metric and/or dev/test set.

Comparing to human-level performance

Why human-level performance?

In the last few years, a lot of machine learning people have been talking about comparing machine learning systems to human-level performance. Why? There are two main reasons.

  • Because of advances in deep learning, machine learning algorithms now work well enough to be competitive with human-level performance.
  • The workflow of designing and building a machine learning system is much more efficient when you are trying to do something that humans can also do.

Let’s take an example:

Credit : ThinkingonData

As you work on a problem over time, progress tends to be relatively quick as you approach human-level performance. After a while, once the algorithm surpasses human-level performance, progress slows down. Over time, as you keep training the algorithm, maybe on bigger and bigger models and more and more data, the performance approaches but never surpasses some theoretical limit, which is called the Bayes optimal error.

Think of the Bayes optimal error as the best possible error: there is no way for any function mapping from X to Y to surpass a certain level of accuracy.

Progress is often quite fast until you surpass human-level performance, and then slows down. There are two reasons for this:

  • Humans are very good at pattern recognition, so for many tasks human-level performance is not that far from the Bayes optimal error, and there is little room left to improve once you surpass it.
  • As long as the model’s performance is below human-level performance, there are certain tools you can use to improve it that are harder to use once you surpass humans.

In case the algorithm is worse than human-level performance, you can:

  • Get labeled data from humans
  • Gain insight from manual error analysis (we will talk about this below): why did a person get this right?
  • Better analysis of bias/variance

Summary: Knowing how well humans can do on a task helps you understand how much you should try to reduce bias and how much to reduce variance.

Avoidable bias

We talked about how you want your learning algorithm to do well on the training set. But sometimes you don’t want it to do too well. Knowing what human-level performance is can tell you how well, but not too well, you want your algorithm to do on the training set.

Let’s take an example

Training error → 8%

Dev error → 10%

Human-level error → 1%

In this case, if the learning algorithm achieves 8% training error and 10% dev error, then you probably want to do better on the training set. The huge gap between how well your algorithm does on the training set and how well humans do shows that your algorithm isn’t even fitting the training set well. In this case, focus on reducing bias: try things like training a bigger neural network or running gradient descent longer, so that you do better on the training set.

Let’s take another example:

Training error → 8%

Dev error → 10%

Human-level error → 7.5%

In this case, even though the training error and dev error are the same as in the earlier example, you are doing just fine on the training set. Here you should focus on reducing the gap between training error and dev error, meaning you should focus on reducing variance. There isn’t much headroom between training error and human-level error, so we shouldn’t focus on closing that gap, as pushing the training error further down risks overfitting. It’s much better to focus on closing the gap between training error and dev error.
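The reasoning in the two examples above can be written as a tiny Python helper. It treats the gap between human-level error and training error as avoidable bias and the gap between training error and dev error as variance; the function name `diagnose` and the simple "focus on the bigger gap" rule are a simplification assumed for this sketch.

```python
def diagnose(train_error, dev_error, human_error):
    """Return (avoidable_bias, variance, focus); all errors are in percent."""
    avoidable_bias = train_error - human_error  # gap to human-level error
    variance = dev_error - train_error          # gap between train and dev
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus


first = diagnose(train_error=8.0, dev_error=10.0, human_error=1.0)
second = diagnose(train_error=8.0, dev_error=10.0, human_error=7.5)
print(first)   # (7.0, 2.0, 'bias')
print(second)  # (0.5, 2.0, 'variance')
```

The training and dev errors are identical in both calls; only the human-level estimate changes, and that alone flips the recommendation.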

Understanding human level performance

The term human-level performance is sometimes used casually in research papers and articles. But what is it exactly? How do we define it?

Let’s look at an example:

Medical image classification

Credit: Chicago Tribune

Let’s say that you want to look at a radiology image like this and make a diagnosis classification decision.

Typical human → 3% error

Typical doctor → 1% error

Experienced doctor → 0.7% error

Team of experienced doctors → 0.5% error

In this case, how should we define what ‘human level’ error is?

We know that human-level error is used as a proxy for Bayes error. Since a team of experienced doctors can achieve 0.5% error, Bayes error must be at most 0.5%, so in this example we should use the error of the team of experienced doctors as our estimate of Bayes error.

For the purpose of deployment, it is best if we can surpass the error of the team of experienced doctors. In some cases, though, if an algorithm can surpass the error of a typical single doctor, it is good enough to deploy.

The gap between human-level error and training error is called avoidable bias, and the gap between training error and dev error is called variance.

Improving model performance

Getting supervised learning to work well assumes that you can do two things well:

  • You can fit the training set pretty well
  • The training set performance generalizes pretty well to the dev/test set

Reducing (avoidable) bias (closing the gap between human-level error and training error)

  • Train a bigger model
  • Train longer / use better optimisation algorithms (RMSprop, Adam, etc.)
  • Neural network architecture / hyperparameter search

Reducing Variance (closing the gap between training error and dev error)

  • More data
  • Regularisation (L2, dropout, data augmentation)

Error Analysis

Carrying out error analysis

If you are trying to get a learning algorithm to do a task that humans can do, and your algorithm is not yet at human level, manually examining the mistakes the algorithm is making can give you insight into what to do next. This process is called error analysis.

Let’s say you are working on a cat classifier and you have achieved 90% accuracy, i.e. 10% error, on the dev set. Let’s say this is much worse than you were hoping for.

Credit : Pinterest

The cat classifier sometimes miscategorises dogs as cats. The question is: should you start a project focused on the dog problem? It could take several months of work to make fewer mistakes on dog pictures. Is that worth the effort?

Well, rather than spending a few months doing this, only to risk finding out at the end that it wasn’t that helpful, here’s an analysis procedure that can quickly tell you whether or not this is worth your effort.

Here’s what Andrew Ng suggests in this case:

  • Get 100 mislabeled dev set examples
  • Examine them manually (Count up how many are dogs)

Suppose it turns out that 5 of the 100 mislabeled dev set examples are pictures of dogs, i.e. 5% of the errors. If only 5% of errors are dog pictures, then the best you can reasonably hope for, even if you spend a lot of time on the dog problem, is that your error goes down from 10% to 9.5%.

Let’s take another example:

Suppose instead that out of 100 mislabeled dev set examples, you find that 50 are actually dog pictures. Now you can be much more optimistic about spending time on the dog problem. In this case, if you solve the dog problem, the error could go from 10% down to 5%, and you may decide that halving the error is worth the time.
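The arithmetic behind both cases is the same, and can be sketched as a small Python function. The name `best_case_error` is made up for this example; it assumes that fixing a category removes every error in that category and nothing else.

```python
def best_case_error(dev_error, n_in_category, n_examined):
    """Best-case dev error (in percent) if every mistake in one category
    were fixed, estimated from a manually examined sample of mistakes."""
    fraction_of_errors = n_in_category / n_examined
    return dev_error * (1 - fraction_of_errors)


few_dogs = best_case_error(dev_error=10.0, n_in_category=5, n_examined=100)
many_dogs = best_case_error(dev_error=10.0, n_in_category=50, n_examined=100)
print(f"5 dogs in 100 errors:  10% -> {few_dogs:.1f}%")
print(f"50 dogs in 100 errors: 10% -> {many_dogs:.1f}%")
```

This number is a ceiling on the improvement, not a prediction: a months-long dog project can at best get you to that error, so a small ceiling is a strong hint to look elsewhere.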

Build First System Quickly, Then Iterate

If you are working on a brand new machine learning application, Andrew Ng’s advice is to build your first system quickly and then iterate.

Let’s take the example of speech recognition:

If you are thinking of building a new speech recognition system, there are a lot of things you could consider to make it more robust: noisy backgrounds, car noise, accented speech, young children’s speech, stuttering, speakers far from the microphone, and many other factors. As with any machine learning project, there may be 50 different directions you could go in, each of which is reasonable and would make your system better. The challenge is deciding which of those directions to pursue first.

The recommendation, according to Andrew Ng, is to build your first system quickly and iterate. That means you quickly set up a dev/test set and metric, which decides where to place your target. Then build an initial system quickly: find a training set, train on it, and see how it does against the dev/test set and metric. Once you have built the initial system, you can use bias/variance analysis and error analysis to prioritise the next steps. If error analysis makes you realise that a lot of the errors come from, say, the speaker being very far from the microphone, that gives you a good reason to address this problem first. Also, look for academic research papers that tackle almost the same problem you are working on. That will give you a head start on which pitfalls to avoid.

Mismatched training and dev/test data

Training and testing on different distributions

Deep learning algorithms have a huge hunger for training data. This sometimes leads teams to grab whatever data they can find and feed it into the training set, just to get more training data, even if a lot of that data does not come from the same distribution as the dev and test data. So a lot of deep learning teams are training on data that comes from a different distribution than their dev and test sets. There are some best practices for dealing with the case where your training and test distributions differ.

Let’s take an example:

Let’s say you are building a mobile app where users upload pictures taken with their cellphones, and you want your app to recognise whether the uploaded pictures are cats or not. You can get two sources of data. One is the distribution of data you care about: pictures from the mobile app, which tend to be less professionally shot. The other is the web: you can crawl it and download very professionally taken pictures of cats.

Courtesy of Andrew Ng

Let’s say you don’t have a lot of users yet for your mobile app, so you have 10,000 pictures uploaded from the mobile app. By crawling the web you can download a huge number of cat pictures, so let’s say you have 200,000 pictures downloaded off the internet. What you really care about is whether your final system does well on the mobile app distribution of images, because in the end users will upload pictures like those, and you need your classifier to do well on them.

But now you have a bit of a dilemma: you have a relatively small dataset (10,000 examples) from the mobile app distribution and a much bigger dataset drawn from a different distribution (web pages). You don’t want to use just those 10,000 images for training, as that gives you a relatively small training set. Using the 200,000 web images seems helpful, but they are not from exactly the distribution that you want.

So what can be done?

Here are some options.

First option:

One thing you can do is put both of these datasets together. You now have 210,000 images (200,000 from web pages and 10,000 from mobile), which you can randomly shuffle into train, dev and test sets. Let’s also say, for the sake of argument, that you have decided your dev and test sets will be 2,500 examples each, leaving 205,000 training examples.

Setting up the data this way has an advantage and a disadvantage. The advantage is that your training, dev and test sets all come from the same distribution. The disadvantage is that, of the 2,500 examples in your dev set, most will come from the web page distribution rather than the mobile app distribution that you actually care about.

Remember that setting up the dev set tells your machine learning team where to aim. Your dev set is your team’s target. Therefore, make sure your dev set represents the data that you actually want your app to work on later.
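To see how lopsided the dev set becomes when both sources are pooled and shuffled, here is a quick back-of-the-envelope calculation in Python using the numbers above. It gives the expected composition of the dev set, not the result of an actual shuffle.

```python
n_web = 200_000    # images crawled from web pages
n_mobile = 10_000  # images uploaded from the mobile app
dev_size = 2_500

# After a uniform shuffle, each dev example is a mobile image with
# probability n_mobile / (n_web + n_mobile).
mobile_share = n_mobile / (n_web + n_mobile)
expected_mobile_in_dev = dev_size * mobile_share
expected_web_in_dev = dev_size - expected_mobile_in_dev

print(f"Mobile share of the pooled data: {mobile_share:.1%}")
print(f"Expected mobile images in dev set: {expected_mobile_in_dev:.0f}")
print(f"Expected web images in dev set: {expected_web_in_dev:.0f}")
```

On average only around 119 of the 2,500 dev images would come from the mobile app, so the team would mostly be optimising for web images.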

Second option:

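The second option is to make the dev and test sets consist only of images from the mobile app, and to put the web images, together with any remaining mobile app images, in the training set. The advantage of this way of splitting the data into train, dev and test sets is that you are now aiming the target where you want it to be. You are telling your team: our dev set contains data uploaded from the mobile app, that is the distribution of images we care about, so let’s build a machine learning system that does really well on mobile app images.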
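A minimal Python sketch of a split where the dev and test sets contain only mobile app images could look like this. The function name `split_mobile_dev_test` is hypothetical, the images are stood in for by simple `(source, id)` tuples, and sending the leftover mobile images to the training set is an assumption of this sketch.

```python
import random


def split_mobile_dev_test(web_items, mobile_items,
                          dev_size=2_500, test_size=2_500, seed=0):
    """Dev and test sets come only from mobile data; the training set gets
    all web data plus whatever mobile data is left over."""
    mobile = list(mobile_items)
    random.Random(seed).shuffle(mobile)  # seeded so the split is reproducible

    dev = mobile[:dev_size]
    test = mobile[dev_size:dev_size + test_size]
    train = list(web_items) + mobile[dev_size + test_size:]
    return train, dev, test


# Placeholder records standing in for real images.
web = [("web", i) for i in range(200_000)]
mobile = [("mobile", i) for i in range(10_000)]
train, dev, test = split_mobile_dev_test(web, mobile)
print(len(train), len(dev), len(test))
```

The price of this setup is that the training distribution now differs from the dev/test distribution, but the target the team aims at is the one the app will actually face.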

That’s it for now. For the next part, we will talk about how we can use transfer learning to leverage the power of an already trained Neural Network to train on a different task that it has never been trained on. See you then.

