In-Depth Machine Learning for Teens: Training Faster And Better

Endothermic Dragon
12 min read · Aug 21, 2022


This article discusses various optimizations that you could’ve used in the last lab, and will be using in future labs. Specifically, you will be reading about how to speed up both the coding process and execution times with matrices, and how to make your model train faster.


Before we go further, though: it’s time to stop calling our inputs “parameters.” We fancy data scientists like to call them “features,” to sound more professional. Just kidding — the name actually has a real meaning behind it. Wikipedia defines a feature as “an individual measurable property or characteristic of a phenomenon,” which sums it up pretty well.

Note From Author

As of now, this article is part of a series with 5 others. It is recommended that you read them in order, but feel free to skip any if you already know the material. Keep in mind that material from the previous articles may (and most probably will) be referenced at multiple points in this article — especially gradient descent, the keystone of all machine learning.


Survey

If you wouldn’t mind, please fill out this short survey before reading this article! It is optionally anonymous and would help me a lot, especially when improving the quality of these articles.

Training Faster And Better Survey

Matrices

In the last article, you implemented a linear regression model on your own, with some assistance. However, you probably used a lot of loops and hand-wrote a lot of code that seemed annoying and excessive. Well, there’s a way around that — let me introduce you to… matrices!

A matrix is essentially a rectangular array of values, of any size. It’s an extension of a vector, in a way — instead of going just vertically, now you can also go horizontally.

You can use this concept to work with the data all at once, instead of having to break it up into rows and iterating through them.
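For example, here’s a tiny sketch using NumPy (just one common choice of library — whichever tool the lab uses, the idea is the same):

```python
import numpy as np

# Each row is one data point, each column is one feature.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
weights = np.array([0.5, -1.0])

# Loop version: compute one prediction at a time.
loop_predictions = [sum(value * weight for value, weight in zip(row, weights))
                    for row in X]

# Matrix version: compute all predictions at once with a single dot product.
matrix_predictions = X @ weights

print(loop_predictions)    # [-1.5, -2.5, -3.5]
print(matrix_predictions)  # [-1.5 -2.5 -3.5]
```

Beyond being shorter, the matrix version also runs much faster, since the heavy lifting happens in optimized low-level code instead of a Python loop.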

Matrices come with a few key concepts up their sleeves, which are detailed below.

Basics

An element aᵢ,ⱼ refers to the element in a matrix at the i-th row and the j-th column. The i and j are called indices (index in singular form).

When doing this or any future lab, keep in mind that Python starts indices at 0. Trust me, off-by-one errors are REALLY annoying.

When referring to the “dimensions” of a matrix, an x by y matrix means that the matrix has x rows and y columns.

Elementary Operations

You can add two matrices together if they have the same dimensions. This is done element by element: each pair of elements sharing the same indices is added together. Commutativity is preserved for this operation (as a reminder, this property means that the result is the same if the operands are swapped — for example, 2 + 3 equals 3 + 2 equals 5).

You can also multiply matrices by a constant, which is equivalent to multiplying all the elements of the matrix by that constant. As a consequence, you can have a negative of a matrix, which is equivalent to multiplying all the elements of a matrix by -1. Commutativity is also preserved for this operation.

You can also subtract matrices, which is equivalent to adding the first matrix to the negative of the second matrix.
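In NumPy terms (as a rough sketch), these elementary operations look like this:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[10, 20],
              [30, 40]])

print(A + B)   # addition, element by element: [[11 22] [33 44]]
print(3 * A)   # multiplying every element by a constant: [[3 6] [9 12]]
print(-A)      # the negative of a matrix: [[-1 -2] [-3 -4]]
print(B - A)   # subtraction, i.e. B + (-A): [[9 18] [27 36]]
```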

Identity Matrix

This is a matrix that’s filled with zeros, except for the top-left to bottom-right diagonal, which are filled with ones. An identity matrix is always square, meaning it has the same number of rows as columns.

A 3 by 3 identity matrix

The dot product (keep reading, introduced in a future subsection) between a square matrix and an identity matrix equals the original matrix. In a way, this is the “1” for matrices!
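A quick way to convince yourself of this (again sketched with NumPy):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
I = np.eye(3)  # a 3 by 3 identity matrix

# The dot product with the identity gives back the original matrix,
# no matter which side the identity is on.
print(np.allclose(A @ I, A))  # True
print(np.allclose(I @ A, A))  # True
```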

Transposition

Transposing a matrix is sort of like “flipping” it by a diagonal starting from the top left. Formally, all the elements in a matrix aᵢ,ⱼ correspond to an element bⱼ,ᵢ in the transposed matrix.

A transposition operation is denoted by a superscript “T”.
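For instance, in NumPy:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])  # a 2 by 3 matrix

print(A.T)  # its transpose is 3 by 2:
# [[1 4]
#  [2 5]
#  [3 6]]
```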

Element-Wise Multiplication

The name explains it all — you take two matrices of the same shape, and you multiply element-to-element. Formally, each element aᵢ,ⱼ in the first matrix and bᵢ,ⱼ in the second matrix has an element cᵢ,ⱼ = aᵢ,ⱼ ⋅ bᵢ,ⱼ corresponding to them in the resulting matrix.

The operation is denoted with a hollow circle.

Commutativity is preserved for this operation.
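In NumPy, the plain * operator on two same-shaped arrays does exactly this:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

print(A * B)                      # [[ 5 12] [21 32]]
print(np.allclose(A * B, B * A))  # True -- the order doesn't matter
```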

Dot Product

This one requires a little bit more visualization. Essentially, you take the first row of the matrix on the left, turn it vertical, and multiply it element-wise with the first column of the matrix on the right. Then, you add up that column to get a single number — one element of the result. After that, you multiply that same row with the next column and add it up again, and you keep going until you reach the end of the second matrix. Then, you repeat the whole process with the next row from the first matrix.

The operation is denoted with a full circle.

Something very important to note is that commutativity does NOT hold for dot products. This means that for two matrices A and B, AB does not generally equal BA. In fact, you might not even be able to perform one of the two dot products in the first place, because of how the dimensions come into play.

When performing the dot product on two matrices, one with dimensions a×b and the other with dimensions c×d, b and c must be equal for the dot product to be legal. The output matrix will then have dimensions a×d.

When multiplying matrices and no explicit symbol is given (or just the typical multiplication dot [⋅]), always assume that it’s a dot product unless stated otherwise.
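Here’s a small sketch showing both the dimension rule and the lack of commutativity (NumPy’s @ operator performs the dot product):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])  # 2 by 3
B = np.array([[1, 0],
              [0, 1],
              [1, 1]])     # 3 by 2

# The inner dimensions match (3 and 3), so A . B is legal and comes out 2 by 2.
print(A @ B)
# [[ 4  5]
#  [10 11]]

# B . A is also legal here (inner dimensions 2 and 2), but comes out 3 by 3 --
# a completely different result, so order clearly matters.
print(B @ A)
```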

Inverse

You cannot divide by a matrix. Instead, you have to multiply by its inverse.

To grasp this concept, think of regular multiplication and division. If you were to divide 6 by 2, you would get 3. Instead, you can multiply 6 with the inverse of 2, which is half, and still get 3. In this case, an inverse is defined where the number times its inverse is equal to 1.

With matrices, it’s very similar — but this time, the inverse is defined so that the dot product of a square matrix and its inverse equals an identity matrix of the same dimensions. The inverse of a matrix is not very straightforward to compute. Luckily, we won’t have to do it manually either way — we have a computer to do it for us!

As an example, take a matrix, compute its inverse, and find the dot product of the two — you get an identity matrix back.
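Here’s a quick check with NumPy (the matrix below is just a made-up example):

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

A_inverse = np.linalg.inv(A)  # let the computer work out the inverse

print(A @ A_inverse)                          # approximately the 2 by 2 identity
print(np.allclose(A @ A_inverse, np.eye(2)))  # True
print(np.allclose(A_inverse @ A, np.eye(2)))  # True -- same result either way
```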

When performing dot products with a matrix and its inverse, commutativity is preserved. Of course, it will result in an identity matrix either way, so it doesn’t really matter.

If you’re still curious about how to calculate inverses, check out this link from Math is Fun.

Left and Right Inverse

This extends the concept of inverses to non-square matrices, and it only applies when the term “left” or “right” is directly and specifically mentioned.

If you have two non-square matrices A and B, and you find the dot product AB and get an identity matrix, then A is a left inverse of B. You can remember this easily because A is on the left side of B, thus A is the left inverse of B. Conversely, B is the right inverse of A.

Needless to say, commutativity is NOT preserved when performing dot products with these inverses and the original matrix.
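If you’re curious what this looks like in code, NumPy’s pseudo-inverse happens to produce a left inverse for “tall” matrices whose columns are independent — a detail beyond what the lab needs, but a nice sanity check:

```python
import numpy as np

# A "tall" 3 by 2 matrix with independent columns.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

L = np.linalg.pinv(A)  # a 2 by 3 left inverse of A

print(np.allclose(L @ A, np.eye(2)))  # True: L is a left inverse of A
print(np.allclose(A @ L, np.eye(3)))  # False: L is NOT a right inverse of A
```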

We will work with matrices further in our lab, during which we will be performing linear regression on a different dataset, in a much more efficient manner.

Data Normalization/Standardization

Data normalization comes in many flavors. Essentially, it’s a way of processing the raw data inputs to remove outliers and skews (both within a feature and across features).

When training machine learning models, you should have data roughly resembling a normal distribution, or a bell curve.

Your feature data should have a shape roughly resembling a bell curve.

In addition, the distribution range for each feature should be about the same, as otherwise it typically leads to slow convergence. This is due to one axis’s gradient being significantly more “flat” than another’s, which leads to slow descent in that direction.

Before getting into popular methods of preprocessing data, I want to point out the importance of remembering which transformations were made to the original features. Any new data you want predictions for needs to go through the same transformations before being plugged into the trained model, and appropriate transformations also need to be made to the model’s output to obtain the actual prediction. Essentially, the type and format of the features need to stay consistent when working with the trained model.

Scaling

The name explains exactly what’s done to the data — you take the minimum and the maximum, and translate and linearly stretch it to cover another min-max range.
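A minimal sketch of scaling a feature into the range 0 to 1:

```python
import numpy as np

feature = np.array([3.0, 8.0, 10.0, 15.0, 20.0])

# Translate so the minimum lands on 0, then stretch so the maximum lands on 1.
scaled = (feature - feature.min()) / (feature.max() - feature.min())
print(scaled)  # [0.    0.294 0.412 0.706 1.   ] (rounded)
```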

Clipping

Clipping caps extreme values at a chosen limit (or removes those data points from the dataset entirely). This is usually done to reduce the effect of outliers. The easiest way to pick the limits is to visualize a feature as a bar graph, with the y-axis as the frequency of occurrence — formally known as a histogram. Then, simply clip off the areas with outliers.
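Here’s one rough way to do it, assuming we decide to cap everything outside the 5th–95th percentile range (the cutoffs themselves are a judgment call):

```python
import numpy as np

feature = np.array([1.0, 2.0, 2.2, 2.5, 3.0, 150.0])  # 150 is an obvious outlier

low, high = np.percentile(feature, [5, 95])

# Option 1: cap the extreme values at the chosen limits.
capped = np.clip(feature, low, high)
print(capped)

# Option 2: drop the outlier rows entirely.
kept = feature[(feature >= low) & (feature <= high)]
print(kept)
```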

Log Scaling

When visualizing the dataset, if the bar graph method detailed above shows a skewed dataset, then you should apply a log to even it out until it resembles a bell curve. Note that the pre-logging feature should be skewed towards zero, so make sure to scale and transform features beforehand as necessary. Another important thing to keep in mind is that, just before log scaling, all of your data points should be strictly positive, as the log of zero or a negative number is undefined (technically the log of a negative number can be defined via imaginary numbers, but we’ll keep it real for now).

Here’s what a log transformation can do to a dataset.
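A quick sketch of what that transformation looks like in code:

```python
import numpy as np

# A feature heavily skewed toward small values (think view counts or prices).
feature = np.array([1.0, 2.0, 3.0, 10.0, 100.0, 10000.0])

logged = np.log(feature)  # every value must be positive before this step
print(logged)  # [0.    0.693 1.099 2.303 4.605 9.21 ] (rounded)
```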

Z-Score

This is simply a standard to follow after you finish applying other transformations to your data, or even in general. Specifically, a z-score transformation rescales the feature distribution to a mean of 0 and a standard deviation of 1. Conceptually, this puts all the features on a similar “scale” and helps the algorithm relate them better.

We will be using Z-Score in our lab on all the features to normalize the data.
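Computing a z-score by hand is just two steps — subtract the mean, then divide by the standard deviation:

```python
import numpy as np

feature = np.array([50.0, 60.0, 70.0, 80.0, 90.0])

z_scores = (feature - feature.mean()) / feature.std()
print(z_scores)                         # [-1.414 -0.707  0.     0.707  1.414] (rounded)
print(z_scores.mean(), z_scores.std())  # ~0.0 and 1.0
```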

Other Operations

Sometimes, these operations by themselves are not enough. For these cases, you can use exponentiation or Box-Cox techniques. However, one thing to keep in mind is to keep all the values positive. This is because data points on either side of 0 might map to the same value when exponentiated. For example, 2⁴ maps to 16, and (-2)⁴ also maps to 16. Of course, with experience, you’ll be free to lift this restriction (as long as you’re aware of what you’re doing).
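If you ever need Box-Cox, SciPy has a ready-made version that also picks the exponent for you (a sketch — the feature values are made up):

```python
import numpy as np
from scipy import stats

feature = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 64.0])  # strictly positive values

# SciPy searches for the Box-Cox exponent (lambda) that best normalizes the data.
transformed, best_lambda = stats.boxcox(feature)
print(best_lambda)
print(transformed)
```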

In the end, you want your data to somewhat resemble a bell curve. It doesn’t have to be perfect, and you shouldn’t try to overdo it by compounding technique after technique. With some experience, you’ll be able to tell “not enough” from “close enough”.

For more information, such as specific formulas of how these work, feel free to check out this article from Google, or even just search the internet.

Boolean Features

Oftentimes, boolean features are generated from existing features to model a non-linear relationship. An example of this can be demonstrated in the previous lab. In the dataset, if we had the species of the fish as a string input, then we could convert those into boolean attributes where, say, if the fish was a Perch, then the “perch” feature would be 1, but the “bream” feature would be 0 (along with the other fish species features).
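With pandas, this conversion (often called one-hot encoding) is a one-liner — sketched here with a made-up species column:

```python
import pandas as pd

species = pd.Series(["Perch", "Bream", "Perch", "Pike"])

# Each species becomes its own 0/1 feature.
one_hot = pd.get_dummies(species, dtype=int)
print(one_hot)
#    Bream  Perch  Pike
# 0      0      1     0
# 1      1      0     0
# 2      0      1     0
# 3      0      0     1
```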

Bucketing/Binning

This builds upon the concept above, but this time with quantitative data instead of something like a string. Essentially, it does the same thing, except it uses ranges (something with a lower and higher limit) to generate those features. The two most common ways to “bucket” data are evenly spaced and by quantile. Evenly spaced creates bins that cover equal-sized ranges, whereas quantile creates bins that have about the same number of data points in each bin.

We will be using this concept in our lab to better deal with data that shouldn’t exactly have a linear relationship, but are still important. Bucketing essentially splits the nonlinear relationship into smaller segments that are each modeled linearly. For this reason, I like to think of bucketing as forming multiple models packaged to generalize in one.
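Here’s a rough sketch of both bucketing styles using pandas (the age feature is made up):

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 33, 48, 61, 75])

# Evenly spaced: each bucket covers an equal-sized range of ages.
even_buckets = pd.cut(ages, bins=4)

# Quantile: each bucket holds roughly the same number of data points.
quantile_buckets = pd.qcut(ages, q=4)

print(even_buckets.value_counts().sort_index())
print(quantile_buckets.value_counts().sort_index())

# The buckets can then be turned into boolean features, just like before.
print(pd.get_dummies(even_buckets, dtype=int))
```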

If you want to explore this topic further and in more detail, feel free to check out this article from Google.

Correlation Coefficients

The correlation coefficient measures the level of linear correlation between two variables (in our case, features). It ranges from -1 to 1, where -1 means a perfect negative linear correlation and 1 means a perfect positive linear correlation. For linear regression, we want values close to 1 or -1, as they indicate a strong linear correlation. We also want to take a closer look at features whose correlation with the output is close to 0, to see if there’s a better equation to fit the relationship (e.g. a quadratic).

We will be using this to judge and alter features and models in our lab, but we won’t have to calculate it ourselves, because we’ll have a library do it for us. Yes — we’re going to use a separate library! Since the formula isn’t super-duper complicated implementation-wise and it’s conceptually pretty simple, there isn’t much to gain from hand-writing it.
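For instance, NumPy can compute it in a single call (whichever library the lab ends up using, the interface will be similarly short):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # a feature
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # the output we compare it against

# np.corrcoef returns a 2 by 2 matrix; the off-diagonal entry is the
# correlation coefficient between x and y.
r = np.corrcoef(x, y)[0, 1]
print(r)  # very close to 1 -- a strong positive linear correlation
```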

Here’s the formula to calculate it:
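For two features x and y with n data points each, the standard (Pearson) correlation coefficient is

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

where x̄ and ȳ are the means of x and y.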

Once again, you won’t need this when doing the lab, but it’s important to know what the value represents so you can analyze and alter features as necessary.

Hands-On Lab

Unlike the last lab, this one doesn’t have any checks, and it’s more “freeform” implementation-wise. Much like bicycling, we have to take off the training wheels at some point 😉.

A quick side note: just because part of the code is written for you and ready to run doesn’t mean that you don’t have to look at the code. A big part of learning is seeing, then doing.

As we slowly progress, remember that consulting internet resources such as documentation or Stack Overflow is something that every programmer can and will do (even the proficient ones)! They’re like a guide showing you the way to the light, so don’t be afraid to take a look at them.

Parting Notes

Having an understanding of matrices and how to work with them is crucial not only for all the future labs that we will do, but also for grasping and working with more advanced machine learning concepts at the mathematical level. Of course, this doesn’t mean that you have to do all this math every time — that’s why there are libraries like PyTorch, scikit-learn, OpenCV, TensorFlow, and many more. But it’s important to gain an understanding of what’s going on under the hood before you try to attach wings to your car.

Also, another quick note about extracting workable features from your input data — while there are plenty of other ways and methods to preprocess data, the methods discussed above are the most widely used and generally serve their purpose well. However, if you still want to explore other ways to extract features and work with them efficiently, then the internet is your best friend!

Done reading? Ready to learn more? Check out my other articles in this series!
Logistic Regression
Neural Networks

Or, feel free to refer back to any previous articles in this series:
Gradient Descent
Linear Regression


Endothermic Dragon

My name is Eshaan Debnath, and I love computer science and mathematics!