A Friendly Approach to AI

Zackary Nay
5 min read · Feb 10, 2024

Have you ever tried to learn or program your own machine learning algorithm, only to get overwhelmed by the amount of jargon and complex math? Well, do I have the series for you. In this series of essays, I will go over the very basics of machine learning. Each chapter is accompanied by two companion essays: one working through an example in Python, and one going deeper into the math behind the concepts. Let’s jump in!

Suppose we wanted to create an equation that identifies whether a movie review is positive or negative.

One of the most basic approaches is to assign a negative score to negative words, such as -2 to “bad” and -4 to “awful”, and a positive score to positive words, such as 5 to “amazing” and 2 to “awesome”. We then add up the scores of all the words in the review. If the total is greater than or equal to 0, the review is labeled as positive; if the total is less than 0, it is labeled as negative.
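This hand-scored approach can be sketched in a few lines of Python. The word list and function name below are illustrative choices, not part of any real sentiment lexicon:

```python
# Hand-crafted sentiment scorer: each word has a fixed weight, and a
# review is labeled positive when the summed score is >= 0.
WORD_SCORES = {"bad": -2, "awful": -4, "amazing": 5, "awesome": 2}

def classify(review: str) -> str:
    words = review.lower().split()
    # Words we never assigned a weight to contribute 0.
    score = sum(WORD_SCORES.get(w, 0) for w in words)
    return "positive" if score >= 0 else "negative"

print(classify("what an amazing film"))       # "amazing" -> 5 -> positive
print(classify("bad plot and awful acting"))  # -2 + -4 = -6 -> negative
```

Notice that every unscored word silently contributes nothing, which is exactly the weakness the next paragraphs discuss.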

Take, for example, a review containing one occurrence each of “amazing”, “bad”, and “awful”.

The review would receive a score of (5 * 1) + (-2 * 1) + (-4 * 1) = -1, and we would therefore classify it as a negative review.

However, there are a few issues with this approach. To name a few: how do we know what weight to put on a particular word so that we classify the most reviews correctly? Should “bad” have a weight of -2.1 instead? And it would be far too time-consuming to assign a weight to every word in the English language by hand.

This type of problem, known as a binary classification task, can be solved using machine learning.

Let’s formalize what we are trying to accomplish. First, we have weights (w) and inputs (x). In our example, the weights were the scores we placed on individual words, and the inputs were the numbers of occurrences of those words. We multiply each word’s occurrence count by its weight and sum the results to get a score z, giving us the equation:

z = (w1 * x1) + (w2 * x2) + … + (wm * xm)

Going back to our example, x1 would be the number of occurrences of the word “bad” and w1 would be -2.
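In code, this score is just a weighted sum over the occurrence counts. The weights and counts below reuse the numbers from the example review, with the feature order “bad”, “awful”, “amazing” as an assumed convention:

```python
# z = w1*x1 + w2*x2 + ... + wm*xm, where x_j counts occurrences of
# word j in the review and w_j is that word's weight.
weights = [-2.0, -4.0, 5.0]  # "bad", "awful", "amazing"
counts = [1, 1, 1]           # one occurrence of each word

z = sum(w * x for w, x in zip(weights, counts))
print(z)  # -1.0, so the review is classified as negative
```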

You may be thinking to yourself “Yeah that’s great, but how does a computer know what to set the weights as?”. Well, I am glad you asked.

Instead of setting the weights (w_i) to specified values, we just set them randomly. You can probably guess that this would lead to predictions that are not very accurate initially, and you would be right. However, we then change every weight to make the predictions more accurate. We continue the cycle of updating and predicting until we can’t improve our predictions in any meaningful way.

Formally, we change the weights by the following equation:

Δw_j = η * (y^i − ŷ^i) * x^i_j

Here, delta w_j is the change applied to the weight w_j, y^i is the true label of the i-th training example, y-hat^i is our predicted label for the i-th training example, x^i_j is the value of the j-th input (the occurrence count of word j) in the i-th training example, and eta is the learning rate. It is important to note that the superscript “i” is not an exponent, but rather a label.

To bring this back down to earth, in our example, delta w_j would be the change in w_j that we should implement. y¹ would be 1 if the first review in the training set was positive and -1 if it was negative, and y-hat¹ is the label we predicted for that review. Lastly, x¹_j is the number of occurrences of word j in the first review. Eta is the learning rate, which is usually between 0 and 1. For example, suppose our learning rate was 0.1, the word “bad” (our first word) occurred twice, and we predicted the first review to be positive when it was actually negative. We would get the following equation:

Δw_1 = 0.1 * (−1 − 1) * 2 = −0.4

Our equation indicates that the weight on “bad” is too high and that we should change w_1 by -0.4. If we were to continue this algorithm, we would update the weight of every word, then predict the next review and update the weights again based on the result. Eventually, depending on the linear separability of our data, we could end up with a pretty good equation for predicting movie reviews.

The type of algorithm discussed is called the perceptron algorithm. It is likely the most basic form of machine learning, and it is remarkably simple. Depending on our data, that simple update rule can yield an accurate equation for predicting binary classes. In the math section here, I will go through the adaptive linear network and batch gradient descent. In the Python section here, I go through a worked example of these algorithms.
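The whole predict-and-update loop described above can be sketched as follows. The tiny dataset is invented for illustration, and the weights start at zero so the run is reproducible (the essay describes initializing them randomly, which works the same way):

```python
# A minimal perceptron on word-count features, applying the update
# delta w_j = eta * (y_i - y_hat_i) * x_ij after every prediction.

def predict(weights, x):
    # Weighted sum of occurrence counts; ties at 0 count as positive.
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1 if z >= 0 else -1

def train(X, y, eta=0.1, epochs=10):
    # Zero initialization for reproducibility; random values work too.
    weights = [0.0] * len(X[0])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            update = eta * (y_i - predict(weights, x_i))
            for j, x_ij in enumerate(x_i):
                weights[j] += update * x_ij
    return weights

# Features: occurrence counts of ["bad", "awful", "amazing"];
# labels: -1 for negative reviews, 1 for positive ones.
X = [[2, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 2]]
y = [-1, 1, -1, 1]

w = train(X, y)
print([predict(w, x) for x in X])  # [-1, 1, -1, 1]: all four correct
```

Because this toy dataset is linearly separable, the loop settles on weights that classify every training review correctly after a couple of passes.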

Zackary Nay is a full-time software developer working within the construction industry, where he implements artificial intelligence and other software to expedite repetitive tasks. In his free time, he likes to read and go backpacking.
