In the summer of 2018, I enrolled in a Data Mining and Analysis course, STATS 202, at Stanford University. Through this course, I learned the fundamentals of data science, including models like Random Forests, Decision Trees, and Neural Networks. In particular, Support Vector Machines (SVMs) caught my eye because of their many tuning options, including kernel types and budgets.
At the end of the course, I had the opportunity to investigate data science further by creating a model similar to Google's PageRank. Through this project, I learned more about the beauty of SVMs.
SVMs, or Support Vector Machines, are classifiers that find a hyperplane splitting the feature space into two sections. Suppose you were given a set of points that looked like the figure on the left. Clearly, there are two distinct clusters of points, red and blue, located in different regions of the chart. If a model were built to predict the color of a random point, a dividing line between the two clusters would be very beneficial. This is exactly what an SVM provides. The image after SVM was performed on the above data is shown here:
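The idea above can be reproduced in a few lines. This is a minimal sketch using scikit-learn (not the course's own code): two synthetic clusters stand in for the red and blue points, and a linear SVM finds the separating line.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated synthetic clusters, standing in for the red and blue points
rng = np.random.default_rng(0)
red = rng.normal(loc=[-2, -2], scale=0.5, size=(50, 2))
blue = rng.normal(loc=[2, 2], scale=0.5, size=(50, 2))
X = np.vstack([red, blue])
y = np.array([0] * 50 + [1] * 50)

# A linear-kernel SVM finds the separating hyperplane (a line in 2-D)
clf = SVC(kernel="linear").fit(X, y)
print(clf.score(X, y))  # clusters are cleanly separated, so accuracy is 1.0
```

The fitted `clf.coef_` and `clf.intercept_` describe the dividing line drawn in the figure.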
My project revolved around analyzing 80,047 data points, each with 12 features, including url_id, query_length, and 10 others. The process I took to create an effective model is shown below.
After obtaining the data, the next step was preprocessing, which involved understanding the data and its various features. Preprocessing also involved refining the data and checking it for outliers. For the modeling phase of the process, I decided to try 4 different models: Decision Trees, Naive Bayes, Random Forests, and Support Vector Machines. The details of these models can be seen in the PDF attached at the end of this article. To enhance these models, I used 2 different styles of tuning for each, which can also be seen in the PDF. For each of these techniques, I had to weigh the benefits and drawbacks, since a decrease in error alone does not define success.
Support Vector Machines, which turned out to be the best model for our data, are beautiful for their various tuning methods, including changing kernels and changing the budget. The image below shows the two types of kernels used in SVM (other than linear), polynomial and radial, respectively.
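The effect of the kernel choice is easy to demonstrate on data a straight line cannot separate. This is a toy sketch (not the project's data): concentric circles from `make_circles`, classified with linear, polynomial, and radial (RBF) kernels.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: no straight line can separate the two classes
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# degree and coef0 only affect the polynomial kernel; a degree-2
# polynomial can carve out the circular boundary, as can the RBF kernel
for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, degree=2, coef0=1).fit(X, y)
    print(kernel, round(clf.score(X, y), 3))
```

The linear kernel hovers near chance here, while the polynomial and radial kernels fit the circular boundary, which is exactly why kernel choice matters.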
Next, budget-based tuning revolves around assigning a cost to points that violate the margin. With a higher cost, the model fits the training data more closely, but it will not be as flexible on new data as a model with a lower cost.
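Cost tuning is typically done with a grid search over candidate values. A sketch of that process, again on synthetic data rather than the project's dataset (note that scikit-learn's `C` is the penalty on margin violations, so a large `C` corresponds to a small budget in the textbook formulation):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic noisy dataset; flip_y adds label noise so cost actually matters
X, y = make_classification(n_samples=500, n_features=12, flip_y=0.1,
                           random_state=0)

# Large C -> tight fit to training data; small C -> wider, more tolerant margin
grid = GridSearchCV(SVC(kernel="rbf"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```

Cross-validation inside the grid search picks the cost that generalizes best rather than the one that merely minimizes training error.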
Overall, the accuracy we got on the training set with the SVM model was approximately 68.8%, which corresponds to a 31.2% misclassification error. Although this means only about 2 in every 3 predictions are correct, an accuracy of around 69% is relatively high for a model of this kind. A similar algorithm, but with hundreds more features, was constructed by the legendary Larry Page and Sergey Brin. Attached below is the detailed report with findings from all data models.