Neural networks fudging the numbers

Why do we still need data scientists?

The whole purpose of supervised neural networks is to find the best parameters of a certain model that can successfully map some inputs to some outputs. Although neural networks can be impressive, to the point of seeming magical, there is really no magic behind the learning mechanism itself. It usually involves specifying a loss function that acts as a punishment metric and an optimizer that tries its best to reduce the losses. In a previous blog post, we explained all these concepts in detail: the general mechanics of neural networks and the details of each of their gears.

Let’s say we are trying to predict the quantity of rain falling outside by checking the reception power of a satellite signal. As we can expect, we usually have the strongest satellite signal on dry days, and the weakest signal on days of heavy rain. The goal of the neural network is to find the transfer function that maps the satellite signal to the rain quantity.

Rain rate vs satellite attenuation (source here)

Let’s say the first random initialization of the neural network created the model rain=2. This means that no matter what the satellite reception power is, we’re predicting that it’s raining 2 mm/h outside. How good is this model compared to, let’s say, the (rain=5) or (rain=30) models?

Naturally, by looking at the curve, we can say (rain=30) is the worst model, since in our dataset the rain never reached the 30 level, while rain=2 is roughly correct much of the time! More generally, how do we measure how bad a model is?

Loss functions are metrics that measure the performance of a neural network on a certain dataset. They grade how badly the current state of the neural network performed in the learning task.

The most famous loss function is the sum of squared errors. It sums up the squares of the errors (the differences between the predictions of the neural network and the actual outputs of the dataset). The larger the error is, the larger the losses are, and they grow quadratically: a difference of 10 units produces a loss of 100 units.
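To make the comparison between the constant models concrete, here is a minimal sketch of the sum of squared errors. The rain rates below are made-up numbers used only for illustration (they are not the dataset behind the curve), and the helper names are our own.

```python
def sum_of_squared_errors(predictions, actuals):
    """Sum of (prediction - actual)^2 over all samples."""
    return sum((p - a) ** 2 for p, a in zip(predictions, actuals))


def constant_model_loss(constant, actuals):
    """Loss of a model that predicts the same rain rate for every sample."""
    return sum_of_squared_errors([constant] * len(actuals), actuals)


# Hypothetical rain rates in mm/h (assumption: invented for this example).
actual_rain = [0.0, 0.0, 1.0, 2.0, 3.0, 2.0, 8.0]

for constant in (2, 5, 30):
    loss = constant_model_loss(constant, actual_rain)
    print(f"rain={constant}: loss={loss}")
```

On this toy data, rain=2 gets the smallest loss of the three and rain=30 by far the largest, which matches the intuition from the curve.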

If the neural network predicted all the outputs correctly, every squared error (prediction - actual)² is zero. There are no losses at all and we can replace the whole dataset by this neural network. Basically we have found a way to compress the dataset in a lossless way (that’s why it’s called a loss function!)

Ok, now our first model has a total loss of 1000. Can we find a better model with a loss of 800, 400, 100 or 0? (Be aware that a loss of 0 might not be ideal due to over-fitting problems!)

Gradient descent techniques such as SGD, RMSprop and Adagrad help the neural network navigate the loss landscape. Usually they give the neural network directions to follow that reduce the losses. Think of optimizers as financial advisors. They look at your expenses and tell you: here you can do better, here you can spend less, here you can save more, here you can cut your losses!

Optimizers do the job of reducing the losses by following the gradient of the loss function. Mathematically, this involves calculating the derivative of the loss function with respect to the model parameters, and moving in the direction of the most negative slope. Since the goal is to minimize the losses, following the most negative derivative is like following the steepest way down a hill. We can expect to reach the bottom the fastest!
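As a rough illustration of "following the slope", here is a sketch of plain full-batch gradient descent on the simplest possible model, a single constant rain=c. This is not SGD, RMSprop or Adagrad themselves, only the basic idea they build on. The data, the starting point, the learning rate and the number of steps are all assumptions chosen for the example.

```python
def sse_gradient(c, actuals):
    """Derivative of sum((c - a)^2) with respect to the constant c."""
    return sum(2 * (c - a) for a in actuals)


def gradient_descent(actuals, start=30.0, learning_rate=0.01, steps=200):
    """Repeatedly step against the gradient to reduce the loss."""
    c = start
    for _ in range(steps):
        c -= learning_rate * sse_gradient(c, actuals)
    return c


# Hypothetical rain rates in mm/h (assumption: invented for this example).
actual_rain = [0.0, 0.0, 1.0, 2.0, 3.0, 2.0, 8.0]

best_constant = gradient_descent(actual_rain)
mean_rain = sum(actual_rain) / len(actual_rain)
print(f"gradient descent found c={best_constant:.4f}, mean of data={mean_rain:.4f}")
```

Starting from the bad guess rain=30, the constant slides down to the mean of the data, which is the bottom of the hill for the sum of squared errors with a constant model.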

Ok well, so far machine learning looks easy. We just need to define a loss function and an optimizer, et voilà! What can go wrong? Why do we even need to hire data scientists?

The moment you choose to manage by a metric, you invite your managers to cheat and manipulate it.

Imagine a company rating its employees by the time they spend at the office. When an employee is aware of this, she can cheat in many ways. She can come to the office early to badge in, do nothing the whole day, leave late and win the employee of the month award! When scientists are rated by the number of papers they publish, they have a huge incentive to publish as many papers at as many conferences as possible, while repeating the same ideas. Not to mention the countless ways students cheat on exams just to get higher grades. “The Honest Truth About Dishonesty: How We Lie to Everyone — Especially Ourselves” is an excellent book on the topic of cheating.

Let’s play the number game!

How is this relevant to our neural network example? Suppose our dataset is skewed: 99% of the time it is not raining. A neural network can then lazily converge to output rain=0 all the time. By doing so, it is right on 99% of the samples and only pays a loss on the remaining 1%! It’s like a financial company that evades its taxes 99% of the time and only pays small fines the remaining 1% of the time. Here are some other funny examples of machine learning lazily converging to surprising solutions.
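The lazy solution is easy to reproduce on paper. The sketch below uses an invented dataset with 99 dry samples and a single rainy one (the value of 10 mm/h is an assumption) and scores the rain=0 model with the sum of squared errors.

```python
def sum_of_squared_errors(predictions, actuals):
    """Sum of (prediction - actual)^2 over all samples."""
    return sum((p - a) ** 2 for p, a in zip(predictions, actuals))


# Hypothetical skewed dataset: 99 dry samples and one sample with 10 mm/h of rain.
actual_rain = [0.0] * 99 + [10.0]

# The lazy model: always predict "no rain".
lazy_predictions = [0.0] * len(actual_rain)
lazy_loss = sum_of_squared_errors(lazy_predictions, actual_rain)

# Share of samples where the lazy model is exactly right.
fraction_correct = sum(
    p == a for p, a in zip(lazy_predictions, actual_rain)
) / len(actual_rain)

# Loss restricted to the 99 dry samples: the lazy model pays nothing there.
loss_on_dry_samples = sum_of_squared_errors(lazy_predictions[:99], actual_rain[:99])

# For comparison, a model that always predicts 2 mm/h.
constant_two_loss = sum_of_squared_errors([2.0] * len(actual_rain), actual_rain)

print(f"lazy loss={lazy_loss}, correct on {fraction_correct:.0%} of samples")
print(f"rain=2 loss={constant_two_loss}")
```

The whole loss of the lazy model comes from the single rainy sample, and it still beats the rain=2 model on this metric even though it never detects rain at all.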

We can argue that a metric can work better if the subjects being measured are not aware that they are being measured, or how. But the optimizer works exactly by following the gradient of the loss function: we can’t optimize if we don’t define a loss function! What’s the solution?

Only by deeply understanding this number game, the mechanics of deep learning and the mathematics behind it, can a data scientist escape the pitfalls of the optimizer playing it! It’s a detective job!

When things are going well, we don’t need a detective. But when a model fails to converge, or converges to a lazy solution, only someone with deep understanding and knowledge can solve the deep learning problem.

Here is a non-exhaustive list of things we can do in such cases:

  • Spend more time with the domain experts to better understand the problem. Can we extract better and more relevant features?
  • Check the dataset. Can we get more data? Can we reduce the bias? Can we reduce the number of features? Are the features independent? Can we filter, transform or pre-process the data to reduce the skewness or the bias?
  • Check whether the model is too simple or too complex. Try different neural network models and architectures. Trying to learn a quadratic or highly non-linear function with a simple feed-forward layer (y=ax+b) and a linear activation is clearly an under-fit!
  • Tweak the loss function itself. When a metric is not working, we can try another one! For instance, we can add a “99x” multiplier to the squared errors of the rare rainy samples in the sum of squares loss. This can discourage the optimizer from converging to the lazy solution y=0. We can customize the loss function to fit our needs (according to the problem)! We’re not bound to the existing standard stock of loss functions (sum of squares and log loss)!
  • Try different optimizers, different learning rates and different regularization rates.
  • Improve our own knowledge. Check for instance this interesting post: “deep misconceptions about deep learning”
  • As a last resort, we can try meta-learning. Meta-learning is the technique of exploring several neural network architectures, each with a different configuration. In this sense, it can be seen as the brute force of machine learning. The downside is that it is obviously very costly in terms of computational power!
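To illustrate the loss-tweaking idea from the list above, here is a sketch of a weighted sum of squared errors in which the errors on rainy samples count 99 times more than the errors on dry ones. The function names, the rule "a sample is rainy if its rain rate is above zero" and the toy dataset are assumptions made for this example, not a standard loss from any library.

```python
def weighted_sum_of_squared_errors(predictions, actuals, rain_weight=99.0):
    """Sum of squared errors where rainy samples are multiplied by rain_weight."""
    total = 0.0
    for p, a in zip(predictions, actuals):
        weight = rain_weight if a > 0 else 1.0
        total += weight * (p - a) ** 2
    return total


def best_constant(actuals, rain_weight):
    """Constant prediction minimizing the weighted loss: the weighted mean."""
    weights = [rain_weight if a > 0 else 1.0 for a in actuals]
    return sum(w * a for w, a in zip(weights, actuals)) / sum(weights)


# Hypothetical skewed dataset: 99 dry samples and one sample with 10 mm/h of rain.
actual_rain = [0.0] * 99 + [10.0]
lazy_predictions = [0.0] * len(actual_rain)

plain_loss = weighted_sum_of_squared_errors(lazy_predictions, actual_rain, rain_weight=1.0)
weighted_loss = weighted_sum_of_squared_errors(lazy_predictions, actual_rain)

print(f"lazy model: plain loss={plain_loss}, weighted loss={weighted_loss}")
print(f"best constant, plain loss: {best_constant(actual_rain, 1.0)}")
print(f"best constant, weighted loss: {best_constant(actual_rain, 99.0)}")
```

With the plain loss, the best constant model sits very close to the lazy answer of zero. With the 99x weight, ignoring the rainy sample becomes expensive and the best constant moves halfway towards it. A real model would use the satellite signal rather than a constant, but the pressure applied by the reweighted loss is the same.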