25 proverbs in AI language

Training of a thousand epochs begins with a single gradient descent

Shu Ishida
Published in Tech to Inspire · 7 min read · Dec 21, 2020


It is fascinating how so many proverbs have endured the test of time and are still used in literature and daily conversation. The beauty of proverbs is that so many people can relate to them. This can be observed both in the abundance of synonymous proverbs and in the number of proverbs that have spread across cultures and languages.

As a research student who spends time with neural networks, I thought it would be fun to rephrase some of these well-known proverbs using AI terminology, and see how well they preserve the meaning of the originals. The hope is that this will make neural network jargon more relatable and approachable. The meaning of each proverb is written beneath its translation, so whether or not you are knowledgeable in the field, you can try to guess the original proverbs from the descriptions.

All the below translations are my own work. Feel free to use these quotes in your daily life (at your own personal risk of scaring away your friends), but please reference/link this article if you want to use any of these in written form.

Disclaimer: some of the quotes below are not, technically speaking, always true, but the same goes for proverbs, so please take it easy. I’d be open to any suggestions for improvements :)

Photo by Jonathan J. Castellon on Unsplash

1. Training of a thousand epochs begins with a single gradient descent

Even the longest and most difficult ventures have a starting point; something which begins with one first step. — [see answer]

An epoch refers to one cycle through the full training dataset. Usually each epoch is further broken down into several mini-batches. A neural network is trained by applying gradient descent to its parameters for every mini-batch.
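
As a small illustration, here is a minimal sketch of such a training loop in plain NumPy (the one-parameter model and toy dataset are made up for this example): each epoch is one shuffled pass over the data, broken into mini-batches, with one gradient descent step per mini-batch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dataset: fit y = 2x with a single weight w.
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0]

w = 0.0            # the single trainable parameter
lr = 0.1           # learning rate
batch_size = 10

for epoch in range(50):                 # one epoch = one full pass over the data
    perm = rng.permutation(len(X))      # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]             # one mini-batch
        pred = w * X[idx, 0]
        grad = 2 * np.mean((pred - y[idx]) * X[idx, 0])  # d(MSE)/dw on the batch
        w -= lr * grad                  # a single gradient descent step

print(round(w, 3))  # converges close to the true weight, 2.0
```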

2. All that has a derivative of zero is not a global optimum

Not everything that looks precious or true (or optimal) turns out to be so. — [see answer]

The goal of machine learning is to find a set of parameters which optimises an objective function. At a global optimum (the best possible solution), the derivative of the objective function becomes zero, but that is also true for local minima, maxima and saddle points.
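
A one-dimensional toy function (my own example, not part of the proverb) makes this concrete: it has a derivative of zero at three points, but only two of them are global minima; the third is a local maximum.

```python
# f(x) = x^4 - 2x^2 has zero derivative at x = -1, 0, +1,
# but only x = ±1 are global minima; x = 0 is a local maximum.
def f(x):
    return x**4 - 2 * x**2

def df(x):                      # analytic derivative: 4x^3 - 4x
    return 4 * x**3 - 4 * x

for x in (-1.0, 0.0, 1.0):
    print(x, df(x), f(x))
# the derivative is 0 at all three points, yet f(0) = 0 > f(±1) = -1
```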

3. Set a stupid objective function, get a stupid prediction

If one asks a strange or nonsensical question, the listener will probably respond with a similarly strange or nonsensical answer. — [see answer]

The objective function defines the problem an AI is meant to solve; if it is badly chosen, the prediction that the AI makes will be meaningless.

4. Bad gradients propagate fast

Bad news circulates quickly because people often spread it everywhere. — [see answer]

Out-of-distribution training samples, often known as outliers, will most likely result in large losses and gradients. Regularisation techniques and dropout may be good measures to counter overfitting to these samples.
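
To see why outliers produce large gradients, consider a squared-error loss: the gradient with respect to the prediction grows linearly with the error, so a single outlier can dominate an update. The sketch below (with made-up numbers) also shows gradient clipping, another common counter-measure alongside the regularisation and dropout mentioned above.

```python
# Under a squared-error loss L = (pred - target)^2, the gradient
# w.r.t. the prediction is 2 * (pred - target): linear in the error.
def grad_mse(pred, target):
    return 2.0 * (pred - target)

g_inlier = grad_mse(1.1, 1.0)     # small error  -> small gradient
g_outlier = grad_mse(1.1, 50.0)   # outlier target -> very large gradient

# Gradient clipping rescales oversized gradients before the update.
def clip(g, max_norm=5.0):
    norm = abs(g)
    return g if norm <= max_norm else g * (max_norm / norm)

print(g_inlier, g_outlier, clip(g_outlier))
```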

5. Interpretability lies in the eyes of the researcher

Different people have different views on what is beautiful (or interpretable). — [see answer]

It is often not clear how a neural network is making a prediction just by inspecting its parameters and intermediate outputs. Many methods have been developed to make it more interpretable, but it remains an actively investigated topic in AI research.

6. Convergence of loss comes to those who wait

A patient person will be satisfied in due time; patience is a virtue. — [see answer]

Big neural networks can take a very long time to converge, but when they do, they often outperform smaller ones.

7. A watched plot never improves

A process appears to go more slowly if one waits for it rather than engaging in other activities. — [see answer]

Here, a plot means a graph that shows the loss over training.

8. Data leakage leads to overfitting

One does not profit by cheating. — [see answer]

In machine learning, it is common practice to split data into training, validation and test datasets, and only use the training dataset for training. One could cheat by leaking information from the validation and test datasets into the training dataset, but then the model will most likely fail in real-world settings because it has overfitted to the dataset.
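
A minimal sketch of such a split, with array indices standing in for real samples (the 70/15/15 ratio is just an example): the key property is that the three splits are disjoint, so nothing leaks into training.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(100)            # placeholder for 100 real samples

# Shuffle once, then carve out disjoint splits: no sample may appear
# in more than one split, otherwise information "leaks" into training.
perm = rng.permutation(len(data))
train_idx, val_idx, test_idx = perm[:70], perm[70:85], perm[85:]

train, val, test = data[train_idx], data[val_idx], data[test_idx]

# Sanity check: the splits share no samples.
assert not (set(train) & set(val)) and not (set(train) & set(test))
```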

9. Don’t delete your checkpoints

Don’t do something which forces you to continue with a particular course of action, making it impossible for you to return to an earlier situation. — [see answer]

Checkpoints are snapshots of the network parameters that are saved periodically during training, e.g. every couple of epochs.
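
A framework-agnostic sketch of the idea (real frameworks ship their own checkpoint utilities; here plain pickle and a dummy parameter dict stand in): save periodically, keep the old files, and you can always roll back.

```python
import pickle, tempfile, os

# A "checkpoint" is just the parameters (plus any training state)
# serialised to disk at regular intervals during training.
params = {"w": [0.1, 0.2], "epoch": 0}

ckpt_dir = tempfile.mkdtemp()

def save_checkpoint(params, epoch):
    path = os.path.join(ckpt_dir, f"ckpt_epoch{epoch}.pkl")
    with open(path, "wb") as f:
        pickle.dump(params, f)
    return path

for epoch in range(1, 7):
    params["epoch"] = epoch
    if epoch % 2 == 0:          # save every couple of epochs; don't delete old ones
        save_checkpoint(params, epoch)

# Restoring an earlier checkpoint lets you roll back instead of
# restarting training from scratch.
with open(os.path.join(ckpt_dir, "ckpt_epoch4.pkl"), "rb") as f:
    restored = pickle.load(f)
print(restored["epoch"])  # 4
```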

10. There’s no point crying over killed processes

To worry about unfortunate events which have already happened and which cannot be changed. — [see answer]

Always save your intermediate results to avoid having to start from scratch if your process dies mid-training:(

11. Don’t change model architectures in mid-training

To change one’s plan or approach when an effort is already underway or at another inopportune time. — [see answer]

Usually, once you define your model, you can’t change it mid-training without starting from scratch again. (You can, however, turn training on and off for sub-components of your model.)

12. Don’t judge a network by the number of parameters

One shouldn’t prejudge the worth or value of something by its outward appearance alone. — [see answer]

Unfortunately, size does matter for neural networks, but that’s not all. A cleverly designed architecture with shared parameters and additional constraints can go a long way, and also makes the network more robust and generalisable.

13. Don’t put all your weights on one feature

To make everything dependent on only one thing; to place all one’s resources in one place, account, etc. — [see answer]

Typically you would have many channels in a neural network layer, each channel computing some kind of feature. A robust network will not rely on one channel or feature alone, but will make its decisions based on a combination of different features. Dropout is one strategy to encourage such robustness.
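
A sketch of how dropout works at training time, assuming a drop probability of 0.5 (this is the common "inverted dropout" variant, which rescales the surviving features so the expected activation is unchanged):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.5):
    # Zero each feature independently with probability p, then scale
    # the survivors by 1/(1-p) so the expected value is preserved.
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

features = np.ones(10_000)
dropped = dropout(features, p=0.5)

# Roughly half the features are zeroed, but the mean stays near 1.
print(dropped.mean())
```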

14. Don’t filter out the signal together with noise

To discard, especially inadvertently, something valuable while in the process of removing or rejecting something unwanted. — [see answer]

15. Don’t try to segment before you can classify

You must master a basic skill before you are able to learn more complex things. — [see answer]

Image classification is simpler and a natural first step towards image segmentation, the problem of classifying every pixel in an image instead of just the image as a whole.

16. Reinitialise pre-trained embeddings

To spoil one’s plans or hope of success. — [see answer]

Pre-trained embeddings are learnt features that can help accelerate downstream tasks.

17. Returns are maximised by agents who self-play

You cannot depend solely on divine help, but must work yourself to get what you want. — [see answer]

A return is a discounted sum of future rewards. In reinforcement learning, the objective is to find a strategy to maximise the expected return. In the famous example of AlphaZero, an AI learnt to play Go, chess and shogi at a superhuman level, just by playing against itself.
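
The return can be computed with a simple backward recursion, G_t = r_t + γ·G_{t+1}; here is a sketch with a made-up reward sequence:

```python
# Return: the discounted sum of future rewards,
#   G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):  # accumulate from the end of the episode
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```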

18. It’s all neural network parameters to me

A way of saying that something is difficult to understand. — [see answer]

19. No model can be optimised for two objective functions

You cannot work for two different people, organisations, or purposes in good faith, because you will end up favouring one over the other. — [see answer]

An objective function defines the problem the AI must solve. If you give it two objectives that contradict each other, the AI can’t give an optimal solution to both simultaneously.

20. The gradient is always steeper on the other side

People always think they would be happier in a different set of circumstances. — [see answer]

Steep gradients are good because that means you have more scope to improve your parameters and reach a better solution (but not too steep).

21. Adversary and loss make a network wise

We gain wisdom faster in difficult times than in prosperous times. — [see answer]

Defining a loss as an objective to minimise is the typical way to train a neural network. A family of networks called Generative Adversarial Networks goes one step further and pits two networks against each other during training, a generator and a discriminator, each acting as the other’s adversary. A typical application is realistic image generation.

22. GPT-3 wasn’t trained in a day

It takes a lot of time to achieve something important. — [see answer]

GPT-3 is a massive state-of-the-art network that can perform a variety of language-related tasks.

23. Multi-head attention is better than single-head

It is better to have the power of two people’s minds to solve a problem or come up with an idea than just one person on their own. — [see answer]

Multi-head attention is a mechanism used in Transformers, a neural network architecture that has been shown to be highly effective at capturing the complexity of language. GPT-3 uses Transformers as its building block.
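
For the curious, here is a bare-bones NumPy sketch of multi-head self-attention. The learned projection matrices of a real Transformer are omitted for brevity, so this only illustrates the core mechanism: split the features into independent heads, attend within each head, then concatenate.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention.
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def multi_head_attention(x, n_heads):
    # Split the feature dimension into n_heads independent "heads",
    # attend in each head, then concatenate the results.
    seq, dim = x.shape
    heads = x.reshape(seq, n_heads, dim // n_heads).swapaxes(0, 1)
    out = attention(heads, heads, heads)       # self-attention per head
    return out.swapaxes(0, 1).reshape(seq, dim)

x = rng.normal(size=(5, 8))        # 5 tokens, 8 features
y = multi_head_attention(x, n_heads=2)
print(y.shape)  # (5, 8): same shape as the input
```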

24. Where there’s a gradient, there’s a loss

Every rumor has some foundation; when things appear suspicious, something is wrong. — [see answer]

25. Don’t put a fully-connected layer before a convolutional layer

Do not do things in the wrong order. — [see answer]

Convolutional layers typically appear in computer vision models. A fully-connected layer will usually come after a convolutional layer, not before (although nothing is impossible).

That’s it! I hope you’ve enjoyed it. Here are the original proverbs:

  1. A journey of a thousand miles begins with a single step
  2. All that glitters is not gold
  3. Ask a stupid question, get a stupid answer
  4. Bad news travels fast
  5. Beauty is in the eye of the beholder
  6. Good things come to those who wait
  7. A watched pot never boils
  8. Cheats never prosper
  9. Don’t burn your bridges behind you
  10. There’s no point crying over spilt milk
  11. Don’t change horses in midstream
  12. Don’t judge a book by its cover
  13. Don’t put all your eggs in one basket
  14. Don’t throw the baby out with the bathwater
  15. We must learn to walk before we can run
  16. Cook someone’s goose
  17. God helps those who help themselves
  18. It’s all Greek to me
  19. No man can serve two masters
  20. The grass is always greener on the other side
  21. Adversity and loss make a man wise
  22. Rome wasn’t built in a day
  23. Two heads are better than one
  24. Where there’s smoke, there’s fire
  25. Don’t put the cart before the horse

If you liked this article, please also have a look at this article, where I draw analogies between our lives and how AI overcomes difficulties.



Shu Ishida
Tech to Inspire

DPhil student at the University of Oxford, researching computer vision and deep learning. Enjoys programming, listening to podcasts, and watching musicals.