Understanding the Role of Randomness in Machine Learning

Kush Ojha · Published in GDSC VIT Vellore · May 21, 2021

Imagine you are new to Machine Learning and you have just built your first ML model using Linear Regression. You check the R² score and get disappointed with the results. You decide to re-train the model, thinking it is impossible to get such a bad R² score, and see the score jump by as much as 5–10%. What just happened? If you are like me, you would keep re-running the code until you stumble upon the maximum R² score and then show off your shiny new model to your peers. What you just experienced is one of the many ways in which randomness affects how your models behave.

Let’s delve a little deeper and understand how we can use randomness to our advantage.

Randomness in Computers

Computers cannot generate truly random numbers, as it is fundamentally impossible to produce true randomness on a deterministic device. So how do we encounter randomness in our models, and how do we get random numbers by calling the random module? The answer lies in two concepts: noise and seeds.

Noise

Noise is any unwanted interference or corruption that distorts the original signal. To put it simply, it is the outcome of various electromagnetic and quantum processes that are beyond the scope of this blog!

Try taking a picture in a dark environment and you might notice grey blobs scattered across the image. That is nothing but noise! It is much more prevalent in low-light photos because the minuscule fluctuations are a considerable change relative to #000000 (black) but not to a brighter, higher-valued hex colour.

The darker the picture, the greater the impact of noise on it

Surprisingly, noise itself is not the direct culprit behind the variability we encounter.

Seed

Most random numbers generated by a computer come from a pseudorandom number generator, or PRNG for short. The random module in Python uses a PRNG algorithm to give you random numbers, but there is a reason they are called pseudorandom: the 'randomness' can be replicated exactly by feeding the same seed into the algorithm.

Watch this brilliant video by Khan Academy to understand how PRNGs work and why they fall short in comparison to real random numbers.

A seed, or random state, is the starting point from which the PRNG algorithm recursively generates pseudorandom numbers. Python's random module uses the Mersenne Twister PRNG algorithm as its core generator.
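Here is a quick sketch of this behaviour: reseeding Python's random module with the same value reproduces the exact same 'random' sequence.

import random

random.seed(42)                                  # fix the seed
first_run = [random.random() for _ in range(3)]

random.seed(42)                                  # reset to the same seed
second_run = [random.random() for _ in range(3)]

print(first_run == second_run)                   # True: the sequence repeats exactly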

This repeatability is not always safe, as it might endanger encryption or two-factor authentication. A safer alternative to a PRNG is a cryptographically secure pseudorandom number generator (CSPRNG). CSPRNGs are slower than PRNGs but provide numbers that are much harder to predict. A popular example is Python's built-in secrets module.
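As a small illustration, the secrets module draws its randomness from the operating system, so there is no seed to set or replicate.

import secrets

token = secrets.token_hex(16)     # a 32-character hexadecimal token, e.g. for a session ID
otp = secrets.randbelow(10**6)    # an integer in [0, 10**6), e.g. for a one-time code
print(token, otp)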

When the seed is not provided manually, the generator derives one from a source of noise such as the system time or operating-system entropy, which is why we see different values every time we run a random generator.

Why does it matter?

This variation can have some unintended consequences on your machine learning workflow. Many popular models are sensitive to randomness, either because they start from a 'random' initial state or because of the exact data they happen to be trained on. Gradient Descent, for example, starts at a random point in the parameter space and then iteratively works its way down towards a lower error. The starting point of the algorithm can play an important role in the outcome of the model's fit. Let us have a look at some examples.

Linear and Logistic Regression

Linear and Logistic Regression are often fit using Gradient Descent, which finds the values of the slopes and intercept that minimize the cost function. The algorithm starts at a random point and descends towards a local, or sometimes the global, minimum of the cost function. The catch is that the minimum it settles in depends on the starting point of the descent: it might fit the model at a local minimum whose value is considerably larger than the global minimum or other local minima.
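As a rough, self-contained illustration (a toy sketch, not the solver any particular library uses), the snippet below runs gradient descent on a simple one-feature regression and starts the slope and intercept at randomly chosen values.

import numpy as np

rng = np.random.default_rng()        # left unseeded: a different starting point every run

# Toy data: y is roughly 3x + 2 plus some noise
x = np.linspace(0, 10, 100)
y = 3 * x + 2 + rng.normal(scale=1.0, size=x.size)

slope, intercept = rng.normal(size=2)   # random starting point of the descent

learning_rate = 0.01
for _ in range(2000):
    error = (slope * x + intercept) - y
    slope -= learning_rate * 2 * np.mean(error * x)   # gradient of MSE w.r.t. slope
    intercept -= learning_rate * 2 * np.mean(error)   # gradient of MSE w.r.t. intercept

print(slope, intercept)                 # should end up close to 3 and 2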

We can avoid this dependence on the starting point by using a convex cost function, since a convex function has only one minimum, the global one; this is the idea behind convex optimization. It is also known that if the true value of the slope given by the global minimum is β₁, then there is roughly a 95% chance that β₁ lies in the interval [β̂₁ − 2·SE(β̂₁), β̂₁ + 2·SE(β̂₁)].

If you are wondering where the 95% figure comes from, the plausible values for β₁ follow an approximately Gaussian distribution with mean μ = β̂₁ and standard deviation σ = SE(β̂₁), where β̂₁ is the estimate of the parameter.
The standard error is estimated through the RSE (Residual Standard Error), which is computed from the RSS (Residual Sum of Squares) as RSE = √(RSS / (n − 2)). The values of RSS and SE will come in handy later!
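As a sketch of how these pieces fit together for a single-feature regression (variable names are my own, and x and y are assumed to be NumPy arrays), RSS, RSE and SE(β̂₁) can be computed by hand and turned into the 95% interval.

import numpy as np

def slope_confidence_interval(x, y):
    """Approximate 95% confidence interval for the slope of a simple linear regression."""
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()

    # Least-squares estimates of the slope and intercept
    beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0_hat = y_bar - beta1_hat * x_bar

    residuals = y - (beta0_hat + beta1_hat * x)
    rss = np.sum(residuals ** 2)                        # Residual Sum of Squares
    rse = np.sqrt(rss / (n - 2))                        # Residual Standard Error
    se_beta1 = rse / np.sqrt(np.sum((x - x_bar) ** 2))  # standard error of the slope

    return beta1_hat - 2 * se_beta1, beta1_hat + 2 * se_beta1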

Decision Trees

Decision Trees on their own rarely provide very accurate results and are prone to over-fitting if not pruned correctly. Ensemble techniques like bagging and random forests reduce this through a process called bootstrapping. Bootstrapping randomly draws observations from our data, with replacement, to create multiple samples, so some observations may occur more than once in a sample. By letting each tree grow fully, we make sure the trees have high variance but low bias. By then averaging their predictions (aggregation), the variance is reduced as well, giving us better results.
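A minimal sketch of the bootstrapping-and-aggregation idea (a hypothetical helper of my own; X, y and X_new are assumed to be NumPy arrays, and scikit-learn's BaggingRegressor and RandomForestRegressor implement the real thing):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X, y, X_new, n_trees=100, seed=None):
    """Average the predictions of trees grown on bootstrap samples of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    predictions = []
    for _ in range(n_trees):
        # Bootstrap sample: draw n rows with replacement, so some
        # observations appear more than once and some not at all
        idx = rng.integers(0, n, size=n)
        tree = DecisionTreeRegressor()        # grown fully: low bias, high variance
        tree.fit(X[idx], y[idx])
        predictions.append(tree.predict(X_new))
    return np.mean(predictions, axis=0)       # averaging (aggregation) reduces the variance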

SVM

SVMs and other Maximum Margin Classifiers like them can be highly susceptible to changes in their training data. Just by adding a single point to the data, we can end up with a completely different classification!

Have a look at this simulation by Stanford to understand how your SVM models get affected by the random state of the model. Feel free to tinker with the parameters and gain an understanding of them.
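If you prefer to experiment in code, here is a toy sketch on made-up 2D data that fits a linear SVM, adds one extra point near the boundary and refits, so that the two hyperplanes can be compared.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two toy classes in 2D
X = np.vstack([rng.normal(0, 1, size=(20, 2)), rng.normal(3, 1, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

base = SVC(kernel="linear").fit(X, y)

# Add a single point near the decision boundary and refit
X_extra = np.vstack([X, [[1.5, 1.5]]])
y_extra = np.append(y, 0)
shifted = SVC(kernel="linear").fit(X_extra, y_extra)

print(base.coef_, base.intercept_)        # the two separating hyperplanes
print(shifted.coef_, shifted.intercept_)  # can differ noticeably after one extra point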

How do we minimize its influence?

Now that we understand how randomness can influence our models, let us look at some ways to curb its influence and get more consistent results that better reflect our data cleaning and model fitting process. Let us set up the datasets first. We will be using the statsmodels module in Python to build our Linear Regression model and the mlbench package in R to load the Boston Housing dataset, so that we can easily get the needed statistics for the same data.

Setting up the data in python
Setting up the data in R
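A rough Python sketch of this setup (the file name and the 'medv' target column are assumptions and may vary depending on where you obtain the Boston Housing data):

import pandas as pd
import statsmodels.api as sm

boston = pd.read_csv("BostonHousing.csv")            # assumed local copy of the dataset
y = boston["medv"]                                   # median house value, the target
X = sm.add_constant(boston.drop(columns=["medv"]))   # features plus an intercept column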

Manually Setting the Seed

A common mistake beginners make is not setting the value of the seed manually while coding. Setting the seed manually makes sure that your results can be replicated by others, and that any increase or decrease in accuracy is the result of data cleaning and not of a change in the random state of the generator. One might be tempted to do some seed optimization, that is, to hunt for the seed value that gives the highest accuracy. This practice does not hold up in real-world applications, as it can cause over-fitting and wastes time and resources. It is best to manually pick a few values and stick with the one that gets you the best result of the bunch, instead of iterating to find the absolute best seed value. It can be done with a single line of code in Python and R respectively:

np.random.seed(4742069)  # Python
set.seed(4742069)        # R

Cross Validation

A great way to minimize the variance caused by the starting point of the algorithm is to cross-validate your data. In cross-validation, you split your data into n parts and run n experiments, using each part in turn as the test set while the rest is used to train the model.

The final model can be either the average or the best performing of all the experiments.
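A minimal sketch with scikit-learn, assuming the X and y prepared in the setup above; each of the five folds takes a turn as the test set.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# One R2 score per fold; the spread tells you how sensitive the fit is to the split
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores)
print(scores.mean())     # average performance across the folds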

Statistical Tests

1. Student's t-test:
This test is based on the value of the t-statistic, developed by William Sealy Gosset, which is given by

t = (β̂ − μ) / SE(β̂)

where β̂ is an estimator and μ is a known constant from the null hypothesis.

Gosset worked at the beer manufacturing company Guinness. Company policy at Guinness forbade its chemists from publishing their findings, so Gosset published his statistical work under the pseudonym "Student"; he developed this test while on study leave granted by Guinness.

The test assumes that the sample roughly follows a normal distribution and, when two samples are compared, that they have similar variance and sample size. The t-statistic is then compared against a table which gives P(>|t|), i.e. the probability of seeing a relationship at least this strong purely by chance (noise making it look like there is a relation) when no actual relation exists. In science, the value of P(>|t|) should generally be less than or equal to 5%. This test is usually used when the sample size is less than 30; the Z-test is preferred for samples with more than 30 observations.
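As a quick illustration with SciPy (the numbers are made up for the example), a one-sample t-test checks whether a small sample's mean differs from a hypothesized value.

from scipy import stats

sample = [2.1, 1.8, 2.4, 1.9, 2.2, 2.0, 2.3]       # a small sample (n < 30)
t_stat, p_value = stats.ttest_1samp(sample, popmean=2.0)

print(t_stat, p_value)    # reject the null hypothesis (mean = 2.0) only if p_value <= 0.05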

2. Z-test:
The Z-statistic is given by

Z = (x̄ − μ) / (σ / √n)

where x̄ is the sample mean, μ is the population mean, σ is the population standard deviation and n is the sample size.

Just like the t-test, we find P(>|Z|) from a table for the Z-statistic. Keep in mind that the Z-test is meant to be performed with population parameters. If the population parameters are missing, sample parameters can be used, but Student's t-test is recommended for sample data.

3. F-test:
When there are many features, we also need to know the probability of a feature showing a spurious correlation in order to get a better view of our model. To tackle this problem, we use a metric called the F-statistic, which is given by

F = ((TSS − RSS) / p) / (RSS / (n − p − 1))

where TSS stands for Total Sum of Squares, p is the number of features and n is the number of observations in the data.

To understand its significance, let us look at an example. Suppose we set up an experiment to find out whether a given coin is biased, and we declare a coin biased if it shows heads on all five tosses. The probability of falsely classifying a fair coin as biased is (1/2)⁵ ≈ 3%. If we repeat this same experiment on 100 fair coins, the probability of falsely classifying at least one of them as biased becomes 1 − (1 − 1/2⁵)¹⁰⁰ ≈ 95%, which means we are almost certain to wrongly classify some fair coin as biased. Looking at each feature's P-value in isolation runs into the same trap; the F-statistic guards against it by testing all the features jointly.
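Checking these numbers takes only a couple of lines:

p_single = (1 / 2) ** 5                      # one fair coin wrongly flagged as biased
p_at_least_one = 1 - (1 - p_single) ** 100   # at least one false positive among 100 coins

print(p_single)        # 0.03125, about 3%
print(p_at_least_one)  # about 0.958, roughly 95%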

Even if you don't follow the details of these statistical tests, the important takeaway is that their P-values must be below the threshold value, that is 0.05 or 5% for scientific purposes.

To run these tests, we just have to write the following lines of code. The statsmodels library makes it very simple, as the tests are displayed in the model summary itself.

Fitting the model and getting statistical tests on it in python
Fitting the model and getting statistical tests on it in R
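A rough Python equivalent of those lines, continuing from the data setup sketched earlier:

import statsmodels.api as sm

model = sm.OLS(y, X).fit()    # X already contains the constant column

# The summary lists the t-statistic and P-value for every coefficient,
# plus the F-statistic and its P-value for the model as a whole
print(model.summary())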

Here are the results! Below are the notes left by the statsmodels library.

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.51e+04. This might indicate that there are
strong multi-collinearity or other numerical problems.

As we can see, the t-statistic and P-value are given for each estimator, letting us check whether each feature is genuinely correlated with the target. Since the P-value for the F-statistic is also below the threshold, we can be fairly confident that no estimator has been falsely labelled as significant.

Conclusion

We learnt where the randomness in computers comes from and how it can be not so random after all. We saw examples of how the random state influences popular models and learnt how to minimize that influence.

I hope you had a fun time reading my blog. Uncertainty can be a boon or a bane, it is up to you to tame the beast and put it to good use.
