The Elegance of Deep Learning Lies in Its Empirics, Not in Its Lines of Code

Freedom Preetham · Published in Autonomous Agents · Jul 25, 2024

The core Transformer model comprises just 25 lines of code. However, the number of lines of code is not an indicator of the significance of deep learning models, nor is it a useful measure of their complexity and utility.

Anyone experienced in training deep learning models knows that a single layer typically takes only 2–3 lines of code. Within 15–30 lines you can write the entire core Transformer, and another 50 or so lines cover multi-head attention in a form that can be parallelized. Despite this, the brevity of the code tells you little about the model’s potential.

The trick in deep learning lies in running these lines of code through an iterator (for-loop) to create n-layer deep models (where n is a parameter, potentially up to 10,000). Even so, this is quite basic and not particularly meaningful in isolation.
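That iteration is literally a for-loop. Here is a minimal sketch, assuming PyTorch and its built-in nn.TransformerEncoderLayer as the repeated layer (names and dimensions are illustrative):

```python
import torch.nn as nn

class DeepStack(nn.Module):
    """The same few lines of layer code, repeated n_layers times."""
    def __init__(self, n_layers=1000, d_model=512, n_heads=8):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )

    def forward(self, x):
        # The entire "depth" of the model is this loop.
        for layer in self.layers:
            x = layer(x)
        return x
```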

Experienced coders have known this for decades. The same lines can be stacked 1,000 layers deep (simply by setting the depth parameter to 1,000), trained over virtually unlimited tokens in the trillions (depending on the parameter size of the network, which is determined by the layers and the available hardware), and parallelized across as many GPUs as you can get (thanks to multi-head attention).

Also, it is 2024. Software is heavily abstracted; one should not be surprised by fewer lines of code :)

Here is a powerful transformer I wrote to make the point clear.
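The exact listing is not the point; a minimal sketch of such a compact block, assuming PyTorch and its built-in nn.MultiheadAttention (dimensions are illustrative defaults), conveys the shape:

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """A textbook post-norm encoder block: self-attention plus feed-forward."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Multi-head self-attention with a residual connection and layer norm.
        a, _ = self.attn(x, x, x, attn_mask=mask, need_weights=False)
        x = self.norm1(x + self.drop(a))
        # Position-wise feed-forward with a residual connection and layer norm.
        return self.norm2(x + self.drop(self.ff(x)))
```

Add an embedding, a positional encoding, and an output head, and that is essentially the whole core model; everything else is scaffolding.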

While AI researchers are not surprised by this, the general audience often finds it fascinating to count the ‘lines of code’ of the core module instead of understanding how the architecture unfolds through iteration, and the scaffolding required to fully harness its potential. The core code is always small, whether it’s a VQ-VAE, a diffusion model, a GAN, a GNN, or anything else.

The Aspects of Engineering That Matter

The success of a deep learning model is measured by how effectively you can engineer the following aspects:

  1. Converge Errors Empirically through Hyperparameter Tuning: The black art of adjusting parameters to minimize errors is crucial for model performance.
  2. Build Scaffolding Around the Model: Techniques such as Reinforcement Learning from Human Feedback (RLHF) guide the model towards stronger inductive biases, enhancing learning efficiency and accuracy.
  3. Reduce the Cost per Million Tokens During Inference: Optimizing models to reduce computational costs and improve inference efficiency is essential for practical applications.
  4. Model Distillation for Faster Training in Subsequent Epochs: Pruning, quantization, and mathematical modeling are used to streamline the model, making future training faster and more efficient (see the sketch after this list).
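As a concrete illustration of item 4, here is a rough sketch of pruning and quantizing a trained network, assuming PyTorch’s torch.nn.utils.prune and torch.ao.quantization utilities (the toy model stands in for a real trained one):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A toy stand-in for a trained model.
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))

# Prune 30% of the smallest-magnitude weights in every Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the sparsity into the weights

# Dynamically quantize the Linear layers to int8 for cheaper inference.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear},
                                                   dtype=torch.qint8)
```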

Consider this: give the same lines of code, unlimited data, and unlimited hardware to two engineering groups, one well-versed in the dark art of model training and the other simply marveling at the hidden elegance and magic within the lines of code. Which group do you think would actually succeed in training the model to convergence?

The Intricacies of Deep Learning

Empirical conditioning is where most of the work in deep learning goes. Once all of this is done, the model is distilled back for efficient training and inference, and that distilled result, reduced to a small number of lines, is often what gets shared as open-source code.

The elegance lies in its empirical conditioning, not in the lines of code.

The nuance and elegance of numerical conditioning and stability engineering are lost if you merely admire the lines of code.
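Much of that conditioning work reduces to unglamorous, empirically tuned knobs. A rough sketch of a stabilized training step, assuming PyTorch, a single GPU, and a toy model and dataset as placeholders:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda()                      # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01,
                                           total_iters=1000)  # learning-rate warmup
scaler = torch.cuda.amp.GradScaler()                   # mixed-precision loss scaling
data = [(torch.randn(32, 128).cuda(), torch.randint(0, 10, (32,)).cuda())] * 100

for x, y in data:
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():                    # half-precision forward pass
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
    scaler.step(optimizer)
    scaler.update()
    warmup.step()
```

None of these knobs appear in the 25-line listing, yet they are what decide whether the loss actually converges.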

Deep learning is an empirical science. I hope this helps shift your perspective to what truly deserves admiration in deep learning.
