Stories by Garrett Williams on Medium

A Summary of PageRank

Garrett Williams — Mon, 08 Nov 2021 03:30:58 GMT

The Algorithm that Launched Google into a Tech Giant

“Just Google It.” That phrase has become ingrained in our society. Whenever we need to search for something on the web, over 80% of people choose Google’s search engine to find what they are looking for. But why is Google’s search engine so popular? And how was it constructed? It all started with PageRank. Below, I will attempt to summarize PageRank and the findings from this research paper.

Background

In 1990, the very first search engine, Archie, was developed by a student at McGill University in Montreal, Canada. Over the next couple of years, more search engines were created including Infoseek (used by Netscape), Yahoo! Search, WebCrawler (bought by AOL), and AskJeeves. Though they were primitive, each search engine strove to improve on the last. In 1996, as students at Stanford, Sergey Brin and Larry Page developed a new algorithm for optimizing web page searches, called PageRank.

Sergey Brin and Larry Page developed PageRank while they were students at Stanford

What Is PageRank?

Even in the 90s, there were hundreds of millions of web pages ranging from extremely diverse topics like “What is Joe having for lunch today?” to journals on information retrieval. Brin and Page wanted to create “a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them.” And that’s what they did with PageRank. Essentially, PageRank is an algorithm to measure the relative importance of web pages by computing a ranking for every web page. It helps search engines and users quickly make sense of the vast environment that is the World Wide Web.

How Does It Work?

According to Google, “PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.” This theory can be boiled down to a simple formula below (where Σ means adding together).
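Written out, the simple formula can be sketched as follows, using the same symbols that are defined for the final formula later on: u is any web page, v is any web page that links to u, and N(v) is the number of links on web page v. The name B_u for the set of pages linking to u is notation assumed here for illustration.

```latex
PR(u) = \sum_{v \in B_u} \frac{PR(v)}{N(v)}
```

Each page v splits its own PageRank evenly across its N(v) outgoing links, and u collects one share from every page that links to it.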

Let’s look at a simple illustration to better understand how PageRank is calculated. Let’s imagine we only have 5 webpages, labeled Page_0, Page_1, Page_2, Page_3, and Page_4

Simple illustration of 5 web pages that are linked together

The PageRank for Page_0 is:

PR(Page_0) = PR(Page_4)/3

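This is because Page_4 is the only page that links to Page_0, but it also links to 2 other web pages (Page_1 and Page_2), so its PageRank is split three ways. Three pages link to Page_1 (Page_2 with 2 outgoing links, Page_3 with 1, and Page_4 with 3), so the PageRank for Page_1 is: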

PR(Page_1) = PR(Page_2)/2 + PR(Page_3)/1 + PR(Page_4)/3

Damping Factor and Final Formula

But what if there is a web page that doesn’t have any links pointing toward it? Or what if the random surfer, an imaginary user who keeps clicking links at random, gets stuck in a loop between 2 or more web pages (e.g. Page_5 only links to Page_6 and Page_6 only links to Page_5)? To handle these cases, Brin and Page added a damping factor to the final formula. It adds a probability that the random surfer stops following links and goes to a completely random web page instead. Therefore, every web page has a chance to be visited, no matter where the random surfer starts. The damping factor is generally set to 0.85, meaning the surfer keeps following links 85% of the time and jumps to a random page the other 15%. So, the final formula is:

Final Formula

where u is any web page, v is any webpage that links to u, N(v) is the number of links on web page v, and d is the damping factor. Notice how the random surfer will “jump” to a web page every so often.
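With those symbols, one common way to write the damped formula is sketched below. This form is an assumption chosen to match the symbols listed above; B_u, the set of pages linking to u, is notation introduced here for illustration. Another widely used variant divides the (1 - d) term by the total number of pages so that all the PageRank values sum to 1.

```latex
PR(u) = (1 - d) + d \sum_{v \in B_u} \frac{PR(v)}{N(v)}
```

With d = 0.85, every page receives a baseline of 0.15 from the random jumps, plus 85% of the PageRank passed along by the pages that link to it.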

It starts at web page d, following the links, but making a “jump” every so often

Implementation

Since we have our formula, the algorithm steps are as follows. In the beginning, set the PageRank of each web page equal to 1. For each iteration of the random surfer, update the PageRank of each web page. After enough iterations, the PageRank values for each page will converge. Below you can see an example of what the outcome would look like if there were only a handful of web pages. The values for this tiny network are expressed in percentages.

As you can see, web page B is the most important web page, having more links pointing to it. And even though web page E has more links pointing to it than web page C, web page C has a higher score because web page B, the most important web page, is linking to it.
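The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation Google uses: the `pagerank` function, the `links` dictionary, and the three-page network are all hypothetical, and the update assumes the (1 - d) + d * sum form of the damped formula with every page starting at 1.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iteratively compute PageRank.

    links maps each page to the list of pages it links to.
    Assumes every page has at least one outgoing link.
    """
    pages = list(links)
    pr = {page: 1.0 for page in pages}  # start every page at 1
    for _ in range(iterations):
        new_pr = {}
        for u in pages:
            # Sum PR(v) / N(v) over every page v that links to u.
            incoming = sum(pr[v] / len(links[v]) for v in pages if u in links[v])
            new_pr[u] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical three-page network: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)

# Express the scores as percentages of the total.
total = sum(ranks.values())
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {100 * score / total:.1f}%")
```

In this toy network, C ends up on top because it receives links from both other pages, while B ranks last because its only incoming link is half of A's score.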

Conclusion

In the end, Sergey Brin and Larry Page took on the audacious task of condensing every page on the World Wide Web, regardless of their content, into a single number. Two years after making this discovery, in 1998, Google was founded. At the time, Google was just a search engine that would use PageRank to order search results so that more important web pages are given preference. I think it’s safe to say it worked out pretty well. That same year, they published their findings as a research paper titled “The PageRank Citation Ranking: Bringing Order to the Web”. Today, Google still uses PageRank along with 200 other more complex algorithms. That’s a testament to how revolutionary their findings were.

References

http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf

https://towardsdatascience.com/pagerank-algorithm-fully-explained-dc794184b4af

https://en.wikipedia.org/wiki/PageRank

Stepwise Feature Selection for Statsmodels

Garrett Williams — Mon, 18 Oct 2021 03:18:53 GMT

A Tutorial for Writing a Helper Function

As Data Scientists, when we are modeling we need to ask “What are we modeling for, prediction or inference?” Are we trying to use the model to predict outcomes from new data? Or, are we trying to gain insight into how different features affect the target variable? If we are modeling for prediction, we typically use all available features because we are trying to accurately predict the outcome, no matter the cost. For inferential modeling (also known as statistical modeling), we want to interpret how certain features affect our target variable and what happens to our target when we change those features. This means we want to use statistics to determine which features have the greatest effect on our target.

Statsmodels

A great package in Python to use for inferential modeling is statsmodels. It allows us to explore data, make linear regression models, and perform statistical tests. You can find their website here and their API documentation here. Despite all of the benefits of using statsmodels, one major flaw is that it doesn’t have a method to help you decide what are the best features to include in your model. And when we say ‘best’, that can have many different meanings. It could mean features that result in the highest R-squared value, the lowest RMSE, or just the easiest to interpret to a non-technical audience. Even though it doesn’t have a built-in method for selecting features, it won’t be hard for us to code one.

Tutorial

For this tutorial, we’ll first code a solution for how to select an optimal first feature. Then, we’ll modify this solution into a function that can be called each step of our modeling process. This will save us time so we don’t have to retype code. Let’s import our necessary packages and read our dataset so we know what we are working with.

https://medium.com/media/26a36ec3776578d7c241ce45e72763d2/href

After preparing our data so that all our columns are numeric data types with no null values, we are left with these features.

Based on our understanding of the data, let’s have our target variable be price, the price of the homes sold in King County, Washington from 2014 through 2015. Our independent variables will be the other 31 features. It would be hard to interpret the price based on 31 different factors, so let’s just choose the top 3 features based on the r-squared value.

First Iteration

For our first linear regression model, we’ll start by figuring out which feature has the greatest correlation to the target. We want to iterate through every feature and calculate the r-squared value if that feature was used in our model to predict the sale price. So, we will use a for-loop to go through every column, use statsmodels to create a model for only that feature, and use numpy to calculate the r-squared value between the actual target and the predicted target. To visualize our results, we will put the information into a DataFrame and only display the best columns.

https://medium.com/media/59dd2d1110da29c9335dfd1f1daef374/href

Great! Now we know that our first predictor should be sqft_living. Unfortunately, the r-squared value is relatively low. This isn’t surprising because there are a lot of different factors that impact the sale price of a house. Let’s see what the r-squared value of our model would be if we had 2 features. We already know we want to keep sqft_living. Then let’s iterate through all the columns that aren’t sqft_living and calculate the r-squared value, just like we did in the code above. This time though, instead of copying and pasting the code, let’s create a function using the code above as guidance so that we only need to type a line or two every time we want to add a new feature to our model. The only difference between our 1st model iteration and our 2nd model iteration is that we know we want to include sqft_living, so we will need to pass in a list of current features. While we’re at it, let’s also pass in a list of features that we want to ignore because of multicollinearity or interpretability.

2nd Iteration using Helper Function

https://medium.com/media/8aace0850eda0653db0dec0332f4b9d3/href

All we need to do now is create a list of our selected features, a list of any features we want to ignore, and then call our next_possible_feature() function.

https://medium.com/media/e8773337abdb575d3e6935c04fb41d4b/href

Notice that our model’s r-squared value increased when we added a second feature. Before we can continue, whenever we are creating a Linear Regression model for inferential purposes, we have to check for multicollinearity. We will do this by computing the Variance Inflation Factor (also known as VIF scores). This checks that our predictors are not strongly linearly related to one another. We didn’t have to do this for our first iteration because we only had 1 feature. A score of less than 5 is generally considered acceptable.

https://medium.com/media/a6319cc8453fb92402dac3c68e117be6/href

3rd Iteration

Since we created our helper function next_possible_feature(), all we have to do is call it to look at our best options for our 3rd feature. This time we will add lat and long to our features_to_ignore list since distance was calculated using those 2 features (i.e. they’re multicollinear). I did this calculation in the data preparation step and didn’t include the code in this post because it’s not relevant. It’s only important to know how to ignore features if you want to.

https://medium.com/media/f476484ff50fbe22d5f0afb6b5938273/href

Now if we check the VIF scores, sqft_living and sqft_living15 are multicollinear. This makes sense since the square footage of a house tends to track the square footage of its 15 closest neighbors.

https://medium.com/media/8be2a27bb2871a95f9010d858e967204/href

If we were to continue for a 4th iteration, we would put sqft_living15 in the features_to_ignore list. We’ll move on and check the VIF scores if the next best feature, grade_above_average, was added to our model.

https://medium.com/media/b06bcab70353e904ee4b4de5f126d605/href

Final Model

Since we are only looking for the top 3 features, it’s time to display the statsmodels summary for our Linear Regression model.

https://medium.com/media/3ad3ac12e814f4634e2920b3bb8ea847/href

Other Tools

The next_possible_feature() function is a great option for when you’re trying to select features for a statsmodels model, but scikit-learn has a couple of methods that are already defined. Sklearn, as it’s also known, is great for machine learning when you are trying to create a model to predict as close to the actual target as possible. It has a feature_selection module that can be used to import different classes like SelectKBest(), which selects the best ‘k’ number of features to include. It also has SequentialFeatureSelector(), which is very similar to the function we wrote, selecting the best feature at each step.
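To give a feel for the scikit-learn route, here is a small sketch of forward selection with SequentialFeatureSelector. The synthetic data and column names are placeholders; only the scikit-learn classes themselves come from the library.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: only "a" and "b" actually drive the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
y = 3 * X["a"] + 2 * X["b"] + rng.normal(scale=0.1, size=200)

# Add one feature at a time until 2 are selected, scoring by cross-validation.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward"
)
selector.fit(X, y)

selected = list(X.columns[selector.get_support()])
print(selected)
```

Unlike our helper, the selector runs every step automatically and scores candidates with cross-validation, so there is no point at which you can step in and veto a feature.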

You can use Sklearn for inferential modeling, but it’s harder to interpret the statistics. One downside to using Sklearn for feature selection is that these classes don’t check for multicollinearity. If you want to use them, you have to check multicollinearity on all possible features before you run a feature_selection class. If you have a lot of columns in your DataFrame, this can be inefficient or hard to interpret. If you do decide to remove any multicollinear features before your feature selection, you could end up removing the stronger multicollinear feature.

One great benefit to creating your own function is you can fine-tune the code to work specifically for your project. You can adapt next_possible_feature() to work with a train/test split to better estimate your model’s performance on unseen data. Or you could add a multicollinearity check inside the function. Making your own function can sometimes be quicker than searching Google for a similar function.

Conclusion

To recap, we created next_possible_feature(), which iterates through all possible features in a dataset and displays the r-squared value if that feature was added to our statsmodels model. We had to create a helper function because statsmodels doesn’t have a built-in feature selection method. Our function can be called repeatedly, one step at a time, to help us introduce more features into our model. This can improve the accuracy of our model while maintaining interpretability.

What is TensorFlow?

Garrett Williams — Mon, 27 Sep 2021 03:28:42 GMT

Understanding the Basics behind Google’s Machine Learning Library

Before Covid-19, when international travel was popular, visiting a new country could be an exciting time. It’s not every day that you get to experience a new culture. I’ve had the privilege to visit Australia and I loved every minute of it. Luckily for me, I’m fluent in English and I didn’t have to learn a new language, but I can imagine how this can add some stress to a trip. Enter Google Translate. This application can be used to translate speech, text, or images in real-time, leaving you with less stress and more opportunities to enjoy your trip. This is done using TensorFlow. Before we go over TensorFlow and give an example, let’s start with the foundations of Machine Learning.

Machine Learning

Machine Learning is the process of computers finding patterns in large amounts of data to enable decision-making. This process is constantly repeating because computers can learn from experience and improve without any extra code. They do this through statistical modeling. Essentially, the machines receive the data and then use mathematical methods to approximate an outcome, often within seconds. This allows computer programs to learn from the past and make updated decisions. Let me give you a real-world example of machine learning to help illustrate its benefit to mankind.

Machine Learning Example

Doctors using IBM Watson Genomics to diagnose cancer patients

Many of us have had a loved one affected by cancer. With hundreds of thousands of new medical studies being published every year, doctors can’t stay on top of all the latest articles and trials. To add insult to injury, a majority of cancer care in the United States is done in community hospitals where they don’t have all the necessary resources. IBM developed a machine learning tool, Watson Genomics, to help Oncologists better serve patients. It can read millions of scientific articles and treatments to give the doctors a particular diagnosis for the patient in real-time. This results in many lives being saved because of the speed and precision of Machine Learning.

TensorFlow

Now that we have a general understanding of Machine Learning, let’s talk about TensorFlow. By definition, TensorFlow is an open-source software library used for machine learning. Written in 3 languages (Python, C++, and CUDA), it was developed by the Google Brain team for internal Google use and released as open source in 2015. TensorFlow is structured using tensors and nodes. The software inputs data as multi-dimensional arrays, also known as tensors. You can then construct a flow chart of mathematical operations, also known as nodes, that you want to perform on the inputted data, outputting one or more tensors. This flow of inputting and outputting tensors using nodes is where they came up with the name TensorFlow.

Basic illustration showing the flow of tensors (i.e. multidimensional arrays) being operated on by nodes (i.e. operators). Hence the name TensorFlow!

A more complex, real-world illustration that still uses the basic TensorFlow model to predict a type of Iris
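As a tiny illustration of tensors flowing through nodes, the sketch below pushes a 2x2 tensor through two operations. It assumes TensorFlow 2 with its default eager execution, and the values are arbitrary.

```python
import tensorflow as tf

# Input tensors: a 2x2 matrix of data and a 2x1 matrix of weights.
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
w = tf.constant([[1.0], [1.0]])

# First node: matrix multiplication, giving [[3.0], [7.0]].
y = tf.matmul(x, w)

# Second node: subtract 4 and apply ReLU, which clips negatives to 0.
z = tf.nn.relu(y - 4.0)

print(z.numpy())
```

Each operation takes tensors in and hands a new tensor to the next operation, which is the flow the name refers to.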

Reasons People Use TensorFlow

One reason why TensorFlow is popular is that it’s extremely fast and precise. It’s also very versatile. It can be run on a desktop or mobile device, and it can run on a GPU or a CPU. Another reason people tend to use TensorFlow is because of TensorBoard, a tool that allows you to visualize your Machine Learning model. The final reason why TensorFlow is popular is that it’s an open-source library developed and maintained by Google, a trusted tech giant. That means anyone can use it for machine learning at no cost, with a large company behind its maintenance.

TensorFlow Example

Many of the biggest companies in the world use TensorFlow to deliver a better product. For example, Airbnb uses TensorFlow to classify different images by room and present the most appealing ones at the top of the website (TensorFlow Airbnb Youtube video). This is done by first identifying the different objects in the picture, such as a porch, a bed, or a refrigerator. Once it has estimated the probability that each object in the picture was classified correctly, it uses prior knowledge to estimate the probability of correctly identifying the room given that the object was right. This flow continues and allows Airbnb to put the best photos first. In the gif below, we can look at a simple example of this process. For each picture, it identifies different features (e.g. eyes, ears, mouth). Using prior knowledge, it can predict whether the picture is a cat or a dog given the data it collected.

Alternatives to TensorFlow

TensorFlow isn’t a monopoly in the software libraries used for machine learning. It has a couple of main alternatives, each with some pros and cons.

PyTorch: Developed by Facebook as an open-source machine learning library. The most commonly cited difference between them is that TensorFlow was built around a static graph while PyTorch creates a dynamic graph. This means that in TensorFlow 1.x, you first need to define the entire graph of the model, then run it, while in PyTorch, you can define and manipulate your graph as you go. TensorFlow 2 narrowed this gap by running operations eagerly by default. Another difference is that TensorFlow has a steeper learning curve, but it has a bigger community behind it. TensorFlow also ships with TensorBoard, which enables you to visualize your Machine Learning model. PyTorch has no visualization tool of its own, although it can log to TensorBoard through torch.utils.tensorboard.

Scikit-Learn: Also known as sklearn, it’s more of a general-purpose machine learning library while TensorFlow has positioned itself as a deep learning library. Deep Learning is a subset of machine learning that tells the computer to use multiple layers to process the data and fill in the gaps. Sklearn is best used for small to medium-sized projects that require the users to manually process the data and choose the appropriate algorithm. Since it was only written in Python, it doesn’t have the versatility of languages like TensorFlow.

Keras: More of a wrapper library that needs to run on top of TensorFlow. It’s best for smaller datasets and more user-friendly, but not as powerful. It should be noted that this isn’t a straight comparison because Keras needs to be run on top of an open-source library like TensorFlow.

Tutorials

If you would like to learn TensorFlow, below are some websites to check out:

https://www.tensorflow.org/tutorials

https://www.udemy.com/course/complete-guide-to-tensorflow-for-deep-learning-with-python/

https://www.coursera.org/learn/introduction-tensorflow

https://www.udacity.com/course/intro-to-machine-learning-with-tensorflow-nanodegree--nd230

API Documentation

https://www.tensorflow.org/api_docs

From Madden to Analytics

Garrett Williams — Mon, 06 Sep 2021 21:36:02 GMT

My Journey to Learning Data Science

Ask a kid what they want to do when they grow up, and you’ll get answers ranging from firefighter to astronaut to scientist. For me, I wanted to be a professional football player. Despite playing football from 3rd to 12th grade, I was never good enough, nor did I have the physical attributes, to make it in college, let alone the NFL. Like me, many of those kids end up doing something else for a career once they grow up. But looking back at my youth, little did I know that my passion for football would directly correlate to my wanting to learn Data Science.

Growing up in Northern Virginia, outside of Washington, D.C., as the middle of 6 kids, I was always the quiet one. For example, when my 3 brothers would fight over the video game controller, I was fine to just sit in the background and watch. When I did get to play, my favorite game was Madden, but I didn’t play it the way most kids do. Instead of playing for hours as my hometown Washington Football Team, or trying to carve up defenses like Tom Brady, I spent the majority of my time analyzing. In Madden, every NFL player is given a rating for every aspect of the game. Speed, strength, agility, throwing accuracy, tackling, kicking accuracy, health, and awareness were just a few of the attributes. If you were a fast player, your speed would be close to 100, and if you were slow, it would be much lower. All of those different attributes would contribute to your overall rating.

Along with every player having a rating, you were also able to remove every player from their current team and enter them into a league-wide fantasy draft. So ideally, I could have Washington draft Tom Brady with the 1st pick overall, and then in the second round, I could draft who I thought was the 33rd best player. But you had to be careful because the player you wanted to draft in the next round might not be available. Knowing this, I wanted to get the most value for every pick. I’d write down in a spiral notebook what the most ideal round was to draft each position and of those positions who had the best value. For example, if I knew the guards wouldn’t be drafted until the 13th round, I’d make sure I draft the best one in the 12th round. Or if multiple players had similar ratings, I would draft the younger player because the only reason their overall rating was low was because of the lack of experience. Some kids would call this boring, but for me, I enjoyed it. Don’t get me wrong, I would end up playing the games with the team I drafted. However, I would spend much more time and be happier when I was analyzing the players.

Adulthood

As the years went by, my priorities changed. I stopped playing video games, but analyzing data and numbers still stayed with me. After finishing high school with the second-highest math SAT score in my class, I ended up going on and graduating from Virginia Tech with a Bachelor’s in Mathematics and a minor in Computer Science. Some students know the career they want, and then get the degree that will help them reach it. That wasn’t the case for me. I got a math degree because I was good at it, not because of a job I could land with it.

I really struggled when it came time to find a job. With no sense of purpose, I’d go to indeed.com, search for “math degree”, and see what came up near Washington. I’d apply to anything I thought I might qualify for. It was like throwing a handful of darts at a board, and nothing stuck. Whether it was because I didn’t have government clearance, I didn’t know the required programming language, or I didn’t have enough experience, I got rejected a lot. One blessing that did come out of the job search was noticing Data Analyst and Data Scientist came up a lot. But not being in a healthy state mentally or financially, I didn’t want to take the required classes that could teach me the skills to get my foot in the door. I needed to get a job quickly. I had student loans I needed to start paying off.

Eventually, I was hired by Sam’s Club as an associate. After spending too much time unemployed and thousands of dollars in debt from student loans, I finally had some sort of income. Even though it was not the job a college graduate would want, I was thankful because it was something. Yet, life at Sam’s wasn’t easy: long, stressful hours for not a lot of income, plus knowing this was not what I wanted to do for the rest of my life. But I didn’t know how to move in the right direction. I would later be promoted to department manager for a couple of years and make enough money to pay off my student loans. I’d still apply to jobs every once in a while, hoping to get my foot in the door of a math-oriented job, but I was always pessimistic.

Moving Forward

Thankfully, my family always loved me despite my situation and wanted the best for me. My mom would ask her friends if they would take a look at my resume. My older brother paid for some counseling sessions to help with my mental state. It was my younger brother who recommended that I do a coding bootcamp for data science at Flatiron School. “Why the data science program?” I asked my brother. “Because it can use your math and computer science background, and it’s a job in hot demand.”

There’s that career again; it keeps popping up. Data science. I doubt he knew that the majority of jobs I previously saw on indeed.com were data analysts and data scientists. Maybe this is the career path for me. Sure, it’s a lot of money for a school I never heard of, but how can I expect to land a job if I don’t change my resume? Or, as the saying often attributed to Albert Einstein goes, “Insanity is doing the same thing over and over and expecting different results.” And with that, I quit my job at Sam’s Club, took a leap of faith, invested in myself, and applied to Flatiron School’s 15-week Data Science program.

I just finished my first week in the program and I can’t help but think about how my journey to learning Data Science didn’t start a week ago, or a few months ago when I got accepted into Flatiron School. The spark began when I was just a football-loving kid spending hours playing Madden. Collecting data by writing down the players’ ratings in my spiral notebook. Seeing trends by knowing what rounds to draft each position. Analyzing the data by seeing which available young players would bring the most value. All to benefit the team by making it easier to win football games. Now I know these next couple of months will be long and hard, but I didn’t decide to learn Data Science because I thought it would be easy. I didn’t even decide to learn Data Science because it’s high-paying or in hot demand, though it helps. For me, I decided to learn Data Science so that I can get a job within the field that I’ve always had a passion for. You could say that’s like winning the Super Bowl for me.