
The Beginnings of a Search Engine

Viren Khandal
Oct 25, 2018 · 3 min read

In the summer of 2018, I enrolled in STATS 202, a Data Mining and Analysis course at Stanford University. Through this course, I learned the fundamentals of data science, including methods such as Decision Trees, Random Forests, and Neural Networks. Support Vector Machines (SVMs) in particular caught my eye because of their many tuning options, including kernel types and budgets.

At the end of the course, I had the opportunity to investigate data science further by building a model in the spirit of Google’s PageRank. Through this project, I learned more about the beauty of SVMs.

Figure 1: Sample Data Ready for SVM

Support Vector Machines (SVMs) are classifiers that use a hyperplane to split the feature space into two regions. Suppose you were given a set of points like those in Figure 1. There are clearly two distinct clusters, red and blue, in different parts of the chart. If the goal is to predict the color of a new point, a dividing line between the two clusters is very useful, and that is exactly what an SVM provides. The result of fitting an SVM to the data above is shown here:

[Image: the sample data with the SVM dividing line]
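
To make the idea of a dividing line concrete, here is a minimal sketch of a linear SVM trained by subgradient descent on the hinge loss. The points, the `train_linear_svm` and `predict` helpers, and the learning rate and epoch count are all made up for illustration. This is not the code or the data from the project, and a real library would use a more careful solver.

```python
# Two well-separated clusters (illustrative points, not the project's data).
red = [(1.0, 1.2), (1.5, 0.8), (0.8, 1.9), (2.0, 1.5)]
blue = [(4.0, 4.2), (4.5, 3.8), (3.8, 4.9), (5.0, 4.5)]
points = red + blue
labels = [-1] * len(red) + [1] * len(blue)  # -1 = red, +1 = blue


def train_linear_svm(points, labels, cost=1.0, lr=0.01, epochs=2000):
    """Minimize 0.5*||w||^2 + cost * sum(hinge losses) by subgradient descent."""
    w1, w2, b = 0.0, 0.0, 0.0
    for _ in range(epochs):
        # Subgradient of the regularization term 0.5*||w||^2.
        g1, g2, gb = w1, w2, 0.0
        for (x1, x2), y in zip(points, labels):
            # Only points inside the margin (or misclassified) add a penalty.
            if y * (w1 * x1 + w2 * x2 + b) < 1:
                g1 -= cost * y * x1
                g2 -= cost * y * x2
                gb -= cost * y
        w1 -= lr * g1
        w2 -= lr * g2
        b -= lr * gb
    return (w1, w2), b


def predict(w, b, point):
    """Return +1 (blue) or -1 (red) depending on the side of the line."""
    score = w[0] * point[0] + w[1] * point[1] + b
    return 1 if score >= 0 else -1


w, b = train_linear_svm(points, labels)
predictions = [predict(w, b, p) for p in points]
```

The learned line is the set of points where `w[0] * x1 + w[1] * x2 + b` equals zero; everything on one side is predicted red and everything on the other side blue.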

My project revolved around analyzing 80,047 data points, each with 12 features, including url_id, query_length, and 10 others. The process I followed to create an effective model is shown below.

[Image: the modeling process]

After obtaining the data, the next step was preprocessing, which involved understanding the data and its features. Preprocessing also involved cleaning the data and checking it for outliers. For the transformation phase of the process, I tried four different models: Decision Trees, Naive Bayes, Random Forests, and Support Vector Machines. The details of these models can be found in the PDF attached at the end of this article. To improve these models, I used two different styles of tuning for each, which are also described in the PDF. For each technique, I had to weigh its benefits against its drawbacks, since a lower MSE alone does not guarantee a better model.
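
The article does not say how outliers were identified, so the following is only a sketch of one common approach, the 1.5 × IQR rule. The `flag_outliers` helper and the `query_lengths` values are hypothetical and are not taken from the project's data.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up query lengths with one suspiciously large value.
query_lengths = [2, 3, 3, 4, 4, 5, 5, 6, 40]
outliers = flag_outliers(query_lengths)
```

Flagged values still deserve a manual look before being dropped, since an unusually long query may be legitimate data.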

Support Vector Machines, which turned out to be the best model for our data, are appealing for their various tuning methods, including changing the kernel and changing the budget. The image below shows the two types of kernels used in SVMs other than linear: polynomial and radial, respectively.

[Image: polynomial and radial kernels]
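
For readers who prefer formulas to pictures, here is a small sketch of the three kernels written as plain functions. The function names and the default values of `degree`, `coef0`, and `gamma` are illustrative choices, not the settings used in the project.

```python
import math


def dot(x, z):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x, z):
    return dot(x, z)


def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """(coef0 + x . z) ** degree: allows curved, polynomial boundaries."""
    return (coef0 + dot(x, z)) ** degree


def radial_kernel(x, z, gamma=0.5):
    """exp(-gamma * ||x - z||^2): similarity decays with distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)
```

A kernel acts as a similarity score between two points. The radial kernel, for example, returns 1 for identical points and shrinks toward 0 as they move apart, which is what lets the SVM draw closed, blob-shaped boundaries.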

Next, budget-based tuning revolves around adding a cost to points that fall on the wrong side of the margin around the dividing boundary. With a higher cost, the model works harder to fit every training point, at the price of a narrower margin; with a lower cost, the model tolerates more violations in exchange for a wider margin.
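
The trade-off can be seen on a toy one-dimensional example. The sketch below assumes the usual soft-margin objective, half the squared weight plus the cost times the total hinge loss, and compares two hand-picked boundaries. The data, the two candidate boundaries, and the helper names are made up for illustration.

```python
def svm_objective(w, b, xs, ys, cost):
    """0.5 * w^2 plus cost times the total hinge loss (1-D inputs)."""
    hinge = sum(max(0.0, 1.0 - y * (w * x + b)) for x, y in zip(xs, ys))
    return 0.5 * w * w + cost * hinge


# Made-up data: the point at x = 0.5 is labeled -1 but sits near the +1 group.
xs = [-3.0, -2.0, 0.5, 1.0, 2.0, 3.0]
ys = [-1, -1, -1, 1, 1, 1]

wide = (2.0 / 3.0, 1.0 / 3.0)  # wide margin, leaves x = 0.5 on the wrong side
narrow = (4.0, -3.0)           # narrow margin, separates every point


def preferred(cost):
    """Say which candidate boundary has the lower objective at this cost."""
    wide_obj = svm_objective(*wide, xs, ys, cost)
    narrow_obj = svm_objective(*narrow, xs, ys, cost)
    return "wide" if wide_obj < narrow_obj else "narrow"
```

With a cost of 1 the wide boundary scores better even though it misclassifies one point, while with a cost of 10 the narrow boundary that fits every point wins.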

Overall, the accuracy of the SVM model on the training set was approximately 68.8%, which corresponds to an error rate (reported as MSE) of 31.2%. Although this means that only about 2 in every 3 predictions are correct, an accuracy of around 69% is relatively high for such a model. A similar algorithm, but with hundreds more features, was built by the legendary Larry Page and Sergey Brin. Attached below is the detailed report with the findings from all of the models.

[Attachment: detailed report (PDF)]
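
As a side note on the accuracy and MSE figures quoted earlier: if the labels are coded as 0 and 1 (an assumption, since the article does not state the coding), the MSE is simply the fraction of wrong predictions, so accuracy and MSE add up to 1. The labels below are made up to show this.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error; on 0/1 labels each error contributes exactly 1."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Ten made-up labels and predictions, three of which disagree.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

acc = accuracy(y_true, y_pred)
err = mse(y_true, y_pred)
```

The same arithmetic links the reported 68.8% accuracy to the 31.2% MSE.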
