IPL Decision Tree

I had just been placed as a data scientist at Innovaccer in December 2015 and was asked to take a course in Data Mining, which was slotted in my final semester. The class gave me an understanding of the basic algorithms of the field: we were taught Classification, Clustering and Association Rule Mining.

Our professor wanted us to understand the applications and asked each of us to choose a topic of interest and apply some algorithms to get a useful result. While my classmates chose famous datasets like Iris and NLM Digit Classifier, I decided to be innovative. The IPL auction had taken place just a couple of days earlier, and the lingering question it posed was: who is the right player to select? I decided to build a decision tree that would classify a player into a category based on his past statistics.

The first challenge was to accumulate past data for the players. Going by league strength, I decided to use statistics only from international matches, past IPL records, the Big Bash League and the domestic competition of England. After identifying the records to use, the next question was how to bring them onto my system. Fortunately, many websites, such as Cricinfo, Cricbuzz and the official IPL website, provided these records.

Using BeautifulSoup in Python, I scraped the webpages and fetched the records to my local system. The next step was to aggregate them into a single database for building and training a decision tree. Using the pandas library in Python, I manipulated the data frames to get the desired output.
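To make the scrape-then-aggregate step concrete, here is a minimal sketch. It is not the code I actually ran: the HTML fragment, the `stats` table id, the column names and the helper names `parse_stats_table` and `aggregate_by_player` are all made up for illustration, and a real page from any of the sites above would need its own selectors.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML standing in for a downloaded stats page (assumption:
# one row per player per competition; real pages are laid out differently).
SAMPLE_HTML = """
<table id="stats">
  <tr><th>Player</th><th>Competition</th><th>Runs</th><th>Fifties</th><th>Wickets</th></tr>
  <tr><td>Player A</td><td>IPL</td><td>450</td><td>4</td><td>2</td></tr>
  <tr><td>Player A</td><td>International</td><td>1200</td><td>11</td><td>5</td></tr>
  <tr><td>Player B</td><td>IPL</td><td>90</td><td>0</td><td>18</td></tr>
</table>
"""

NUMERIC_COLUMNS = ["Runs", "Fifties", "Wickets"]


def parse_stats_table(html):
    """Turn the (hypothetical) stats table into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table", id="stats").find_all("tr")
    header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = [
        [td.get_text(strip=True) for td in row.find_all("td")]
        for row in rows[1:]
    ]
    df = pd.DataFrame(records, columns=header)
    # Scraped cells are strings, so convert the stat columns to numbers.
    df[NUMERIC_COLUMNS] = df[NUMERIC_COLUMNS].apply(pd.to_numeric)
    return df


def aggregate_by_player(df):
    """Collapse per-competition rows into one row of totals per player."""
    return df.groupby("Player", as_index=False)[NUMERIC_COLUMNS].sum()


stats = parse_stats_table(SAMPLE_HTML)
totals = aggregate_by_player(stats)
print(totals)
```

Summing across competitions is only one way to aggregate; keeping separate columns per competition would preserve the difference in league strength.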

To build a classifier, we needed to identify the class of each player. We decided to determine a player's class based on his retention level at his respective franchise. The ratio was determined by

A ratio above 0.5 was treated as a good player, a ratio between 0.25 and 0.5 was classified as an average player, and anything below that was classified as a poor player. This was again accomplished with pandas data frames.
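The thresholds above translate into a small labelling function. The sketch below assumes the retention ratio has already been computed into a column I am calling `retention_ratio` (a hypothetical name), and it assumes that the boundary values 0.5 and 0.25 both fall into the average class, which the description above does not spell out.

```python
import pandas as pd


def label_player(ratio):
    """Map a retention ratio to a class using the 0.5 / 0.25 cut-offs."""
    if ratio > 0.5:
        return "good"
    if ratio >= 0.25:  # assumption: boundary values count as average
        return "average"
    return "poor"


# Made-up players and ratios, purely for illustration.
players = pd.DataFrame(
    {
        "player": ["Player A", "Player B", "Player C", "Player D"],
        "retention_ratio": [0.8, 0.5, 0.25, 0.1],
    }
)
players["label"] = players["retention_ratio"].apply(label_player)
print(players)
```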

Once the database was ready, the only thing left was to build a decision tree. We first split the data into two parts, training and testing. Since the training data had a lot of features, I decided to use the ReliefF algorithm to identify the 10 most important features. We then built a J48 decision tree, a zoomed version of which can be seen below.

The tree seemed quite logical, since the top node checked the number of 50s a player had scored. If it was greater than 14, the player was classified as a good player. If not, the bowling stats of the player were examined to identify his class. This is what most cricket enthusiasts would use to judge how good a player is. On running the tree on the held-out test data, we got an overall accuracy of 80%.
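For readers who want to try the same pipeline (split, pick the top 10 features, fit a tree, score it on held-out data), here is a rough sketch in scikit-learn. It is a stand-in rather than a reproduction: scikit-learn has neither ReliefF nor J48, so I swap in mutual-information ranking for the feature selection and an entropy-based `DecisionTreeClassifier` (CART, not C4.5) for the tree. The data is synthetic, generated by `make_classification`, so the accuracy it prints says nothing about the IPL result.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic 3-class data standing in for the player table (assumption:
# 300 players, 25 numeric features; classes play the role of good/average/poor).
X, y = make_classification(
    n_samples=300,
    n_features=25,
    n_informative=8,
    n_classes=3,
    random_state=0,
)

# Hold out a quarter of the rows as the hidden test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Stand-in for ReliefF: keep the 10 features with the highest mutual
# information with the class, ranked on the training split only.
selector = SelectKBest(
    score_func=partial(mutual_info_classif, random_state=0), k=10
)
X_train_top = selector.fit_transform(X_train, y_train)
X_test_top = selector.transform(X_test)

# Stand-in for J48: an entropy-based decision tree.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X_train_top, y_train)

accuracy = accuracy_score(y_test, tree.predict(X_test_top))
print(f"held-out accuracy: {accuracy:.2f}")

# Print the top of the tree, the text analogue of the zoomed view.
print(export_text(tree, max_depth=2))
```

Fitting the selector on the training split only, and merely applying it to the test split, keeps the test rows from influencing which features are chosen.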