How to implement an End-to-End Machine Learning Project using Flask: A step-by-step approach for IPL Score Prediction🏆

Atharva Hemant Patil
Published in Analytics Vidhya
7 min read · Oct 31, 2020

A beginner's guide to understanding, building, and deploying a machine learning application from scratch.

My WebApp for IPL Score Prediction

INTRODUCTION

In the midst of this pandemic, the IPL has kept us entertained and glued to our seats. Being an avid fan of both the tournament and the sport, I decided to try my hand at a dataset to predict the runs scored.

In this article I will go through all the bits and basics of the process, so that even a newcomer to the data science community can follow the project.

A Step by step approach:

  1. Data cleaning and formatting
  2. Exploratory Data Analysis
  3. Feature Engineering and Selection
  4. Compare Multiple Algorithms
  5. Perform Hyperparameter Tuning
  6. Evaluate the models
  7. Deploy the model

I have used Python for exploratory data analysis and the Flask framework to deploy the project on the Heroku platform. Before starting, make sure you have Flask installed on your system:

sudo apt-get install python3-flask
pip install flask

Important Note: Always create a new environment in your command prompt before starting a new project. Then install the necessary libraries in that environment.

Understanding the DATASET

The dataset consists of 15 columns:

  1. mid: The match id to uniquely identify each match.
  2. date: The date on which the match was held.
  3. venue: The name of the stadium.
  4. bat_team: The batting team name.
  5. bowl_team: The bowling team name.
  6. batsman: The name of the batsman.
  7. bowler: The name of the bowler.
  8. runs: The runs scored till now.
  9. wickets: The wickets taken till now.
  10. overs: The number of overs bowled.
  11. runs_last_5: The number of runs scored in last 5 overs.
  12. wickets_last_5: The number of wickets taken in last 5 overs.
  13. striker: The name of the batsman on the batting end.
  14. non-striker: The name of the batsman on the bowling end.
  15. total: The total number of runs scored in the match.

Start of Project

Before we begin, let us look at what we are building:

First we will import the necessary libraries.

Now read the dataset (CSV file) and display the top 5 records.
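A minimal sketch of this step (the file name ipl.csv is an assumption, and a tiny stand-in frame replaces the real CSV here so the snippet runs on its own):

```python
import pandas as pd

# In the actual project: df = pd.read_csv('ipl.csv')
# A few illustrative rows stand in for the real file below.
df = pd.DataFrame({
    'mid': [1, 1, 2],
    'date': ['2008-04-18', '2008-04-18', '2008-04-19'],
    'venue': ['M Chinnaswamy Stadium', 'M Chinnaswamy Stadium', 'Eden Gardens'],
    'bat_team': ['Kolkata Knight Riders', 'Kolkata Knight Riders', 'Chennai Super Kings'],
    'bowl_team': ['Royal Challengers Bangalore', 'Royal Challengers Bangalore', 'Kings XI Punjab'],
    'runs': [1, 3, 0],
    'wickets': [0, 0, 1],
    'overs': [0.1, 0.2, 0.1],
    'runs_last_5': [1, 3, 0],
    'wickets_last_5': [0, 0, 1],
    'total': [222, 222, 240],
})

# Display the top 5 records
print(df.head())
```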

This is how the output looks on Jupyter Notebook

Step 1: Data cleaning and formatting

In this step we will remove all the unwanted columns and drop any rows with missing values. Here I have shown how to remove the unwanted columns.
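Which columns count as "unwanted" is a judgment call; one plausible choice, sketched below, is to drop the match id and the per-player columns, since a score model cannot generalise from individual names (an empty frame with the dataset's 15 columns stands in for the loaded data):

```python
import pandas as pd

# Stand-in frame with the dataset's 15 columns
df = pd.DataFrame(columns=[
    'mid', 'date', 'venue', 'bat_team', 'bowl_team', 'batsman', 'bowler',
    'runs', 'wickets', 'overs', 'runs_last_5', 'wickets_last_5',
    'striker', 'non-striker', 'total'])

# Drop identifiers and per-player columns
unwanted = ['mid', 'batsman', 'bowler', 'striker', 'non-striker']
df = df.drop(columns=unwanted)
print(df.columns.tolist())
```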

Step 2: Exploratory Data Analysis

Here, we will explore the data and decide what data we want to keep for feature engineering.

Output of unique teams:

Output of unique venues:

Output of the number of times each stadium appears in the data, using the groupby() function:

This shows us the number of balls bowled in each stadium

Now that we know how many times each stadium has been used, we can choose only the top stadiums, which have the most data, for our model.
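The exploration above can be sketched in a few lines (a tiny stand-in frame is used so the snippet is self-contained; since each row in the dataset is one delivery, the group sizes correspond to balls bowled per stadium):

```python
import pandas as pd

# Stand-in for the loaded dataset
df = pd.DataFrame({
    'bat_team': ['Kolkata Knight Riders', 'Chennai Super Kings', 'Kolkata Knight Riders'],
    'venue': ['Eden Gardens', 'M Chinnaswamy Stadium', 'Eden Gardens'],
})

print(df['bat_team'].unique())   # distinct teams in the data
print(df['venue'].unique())      # distinct stadiums
# One row per delivery, so group sizes = balls bowled in each stadium
print(df.groupby('venue').size().sort_values(ascending=False))
```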

Step 3: Feature Engineering and Selection

In this step we will decide which teams and venues to keep for making our model. We have used only the current playing teams and their respective home grounds as of October 2020.
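A sketch of the team filter, assuming the eight franchises playing as of October 2020 (note that older rows in this dataset may use a team's former name, e.g. 'Delhi Daredevils' rather than 'Delhi Capitals'):

```python
import pandas as pd

# The eight franchises as of October 2020
consistent_teams = [
    'Chennai Super Kings', 'Delhi Capitals', 'Kings XI Punjab',
    'Kolkata Knight Riders', 'Mumbai Indians', 'Rajasthan Royals',
    'Royal Challengers Bangalore', 'Sunrisers Hyderabad']

# Stand-in rows: one with two current teams, one with defunct teams
df = pd.DataFrame({
    'bat_team': ['Kolkata Knight Riders', 'Kochi Tuskers Kerala'],
    'bowl_team': ['Mumbai Indians', 'Pune Warriors'],
})

# Keep only rows where both sides are current teams
df = df[df['bat_team'].isin(consistent_teams) &
        df['bowl_team'].isin(consistent_teams)]
print(len(df))
```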

Now we will preprocess the data: convert the categorical features using OneHotEncoding and convert the string date into a datetime object.

Here is a glimpse of how the data will look after using OneHotEncoding and rearranging the columns.
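A self-contained sketch of the preprocessing, using pandas' get_dummies for the one-hot encoding (the real project would encode the venue column the same way; a two-row stand-in frame is used here):

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2008-04-18', '2017-04-05'],
    'bat_team': ['Kolkata Knight Riders', 'Chennai Super Kings'],
    'bowl_team': ['Mumbai Indians', 'Royal Challengers Bangalore'],
    'overs': [5.2, 10.4],
})

# Convert the string date into a datetime object
df['date'] = pd.to_datetime(df['date'])

# One-hot encode the categorical columns (add 'venue' here in the real project)
encoded_df = pd.get_dummies(df, columns=['bat_team', 'bowl_team'])
print(encoded_df.columns.tolist())
```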

Steps 4 & 5: Compare Multiple Algorithms & Perform Hyperparameter Tuning

I have compared just two algorithms here: Lasso Regression and Random Forest Regression. Before applying a machine learning algorithm, we first have to divide our data into a train set and a test set.

Dividing into Train and Test Data:
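One standard way to split, sketched with stand-in arrays (an alternative for this dataset is a chronological split on the date column, which avoids training on matches played after the test matches):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.random((100, 5))          # stand-in feature matrix
y = rng.random(100) * 100 + 120   # stand-in innings totals

# 80/20 random split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```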

Now that we have datasets for training and testing, the first algorithm we will look at is Lasso Regression. We have used GridSearchCV for hyperparameter tuning.
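A sketch of the grid search over the Lasso regularisation strength (the alpha grid values and the scoring metric are assumptions; toy training data is generated so the snippet runs on its own):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Toy training data standing in for the encoded match features
rng = np.random.default_rng(0)
X_train = rng.random((80, 5))
y_train = X_train @ np.array([30, 20, 10, 5, 1]) + rng.normal(0, 2, 80)

params = {'alpha': [1e-3, 1e-2, 1e-1, 1, 5, 10, 20]}  # grid is an assumption
lasso_cv = GridSearchCV(Lasso(), params,
                        scoring='neg_mean_absolute_error', cv=5)
lasso_cv.fit(X_train, y_train)

# Best parameter and best score
print(lasso_cv.best_params_, lasso_cv.best_score_)
```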

Best Parameter and best score

Below is the code for Random Forest Regression. I have used RandomizedSearchCV for hyperparameter tuning.
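A sketch of a random_grid of candidate hyperparameters for the random search (the ranges below are illustrative assumptions, not the article's exact values):

```python
import numpy as np

# Candidate hyperparameter values for RandomizedSearchCV
random_grid = {
    'n_estimators': [int(x) for x in np.linspace(100, 1200, num=12)],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [int(x) for x in np.linspace(5, 30, num=6)],
    'min_samples_split': [2, 5, 10, 15, 100],
    'min_samples_leaf': [1, 2, 5, 10],
}
print(random_grid)
```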

Output of values of random_grid

Now we will find the best parameters and fit the model to make predictions. This code will require some time to compute.
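The fitting step can be sketched as follows. To keep this snippet fast and self-contained it uses toy data and a deliberately tiny grid; in the real project you would pass the full random_grid with a larger n_iter, which is why the search takes time:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Toy training data standing in for the encoded match features
rng = np.random.default_rng(0)
X_train = rng.random((60, 4))
y_train = X_train.sum(axis=1) * 50

# Tiny grid so the sketch runs quickly; use the full random_grid in practice
small_grid = {'n_estimators': [10, 20], 'max_depth': [3, 5]}
rf_random = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=small_grid,
    n_iter=4, cv=3, scoring='neg_mean_absolute_error', random_state=42)
rf_random.fit(X_train, y_train)

# The best parameters from hyperparameter tuning
print(rf_random.best_params_)
```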

The best parameters from hyperparameter tuning

Step 6: Evaluate the models

  1. Lasso Regression

Evaluating the Lasso Regression model using Distplot and Sklearn Metrics:
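A sketch of the evaluation (the article plots the residuals with seaborn's distplot; a plain matplotlib histogram is used below for the same picture, and a small fitted model stands in for the tuned one):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless-safe backend
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.linear_model import Lasso

# Toy data and a stand-in fitted model
rng = np.random.default_rng(1)
X = rng.random((100, 3))
y = X @ np.array([40, 25, 10]) + rng.normal(0, 3, 100)
model = Lasso(alpha=0.01).fit(X[:80], y[:80])
pred, y_test = model.predict(X[80:]), y[80:]

# Residual distribution: a tall spike at 0 means most predictions are close
plt.hist(y_test - pred, bins=20)
plt.savefig('residuals.png')

print('MAE :', metrics.mean_absolute_error(y_test, pred))
print('MSE :', metrics.mean_squared_error(y_test, pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, pred)))
```

The same residual plot and metrics are computed for the Random Forest model in the next step, so the two can be compared directly.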

In this plot we can observe that most of the residuals are 0 or close to 0. Therefore we can say that the Lasso Regression model works well.

2. Random Forest Model

Evaluating the Random Forest Regression model using Distplot and Sklearn Metrics:

In this plot we can observe that, compared to Lasso Regression, fewer of the residuals are 0 or close to 0, and the error values are higher. Therefore we can state that the Lasso Regression model performs better than Random Forest.

Saving model for deployment

By analyzing the above models we can conclude that Lasso Regression works better on our dataset as it had lower error metric values than Random Forest Regressor. Now we will save the model in a pickle file.
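Saving and reloading the model is a few lines with the pickle module (the file name below is an assumption; use whatever name your app.py expects, and a small fitted Lasso stands in for the tuned model):

```python
import pickle
from sklearn.linear_model import Lasso

# Stand-in for the tuned model from the previous steps
model = Lasso(alpha=0.1).fit([[0.0], [1.0], [2.0]], [0.0, 10.0, 20.0])

# Persist the model so the Flask app can load it at startup
with open('ipl-score-lasso-model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Reload to confirm the round trip works
with open('ipl-score-lasso-model.pkl', 'rb') as f:
    loaded = pickle.load(f)
print(loaded.predict([[1.5]]))
```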

Step 7: Deploy the model

For deployment we will use the Flask framework and the Heroku platform. I have made a Python file named ‘app.py’. This code serves the ‘index.html’ and ‘predict.html’ files, and the prediction form is submitted using the POST method.
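A minimal sketch of app.py follows. The form field names, template names, and pickle file name are assumptions, not the article's exact code, and the real app would also build the one-hot team/venue features from the form's team selections before predicting:

```python
import pickle

import numpy as np
from flask import Flask, render_template, request

app = Flask(__name__)

# In the real app, load the saved regressor once at startup:
# model = pickle.load(open('ipl-score-lasso-model.pkl', 'rb'))

@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    # Read the match-state fields posted from the form on index.html
    features = np.array([[
        float(request.form['runs']),
        float(request.form['wickets']),
        float(request.form['overs']),
        float(request.form['runs_last_5']),
        float(request.form['wickets_last_5']),
    ]])
    # prediction = int(model.predict(features)[0])
    prediction = 0  # placeholder so this sketch runs without the pickle file
    return render_template('result.html', prediction=prediction)

# To run locally with `python app.py`, add:
# if __name__ == '__main__':
#     app.run(debug=True)
```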

When using the Flask framework we need to make 2 folders: static and templates. The HTML files should be stored in the templates folder, while the images and CSS files should be stored in the static folder. Below is the code for the HTML file of our home page, index.html:

Further, we also created a prediction webpage for displaying the result. Below is the code for the HTML file of that webpage, result.html:

For styling and design we use a CSS file, stored in the static folder. Here is the code for that:

Now we are ready to run our model on our local machine. Open the command prompt and first change directory to the folder where the project is saved. Then run: python app.py

View of the Command Prompt

Now on your browser open http://127.0.0.1:5000/ and run the application. Below are the images of the User Interface.

Home Page
Prediction page

You will need a Procfile and a requirements file. I have provided those in the GitHub link at the end of this article. The Procfile will contain: web: gunicorn app:app

To create the requirements file, type this command in your virtual environment:

$ pip freeze > requirements.txt

We will deploy this project on heroku platform.

  1. Register on heroku.
  2. Upload project on GitHub.
  3. Log in to your Heroku Dashboard.
  4. Click on new/create new app.
  5. Give an app name, choose a region, and click on create.
  6. Then go onto the deploy section and connect your app to GitHub.
  7. Click deploy project.

Now you have successfully deployed your app and completed the implementation. You can view your project via the link Heroku provides.

Here is my link for my deployed project: https://ipl-batting-score-predict.herokuapp.com/

If the webapp shows an error page instead of loading, it is because the free dynos Heroku provides for the month have been used up. You can access the webpage on the 1st of the next month. Sorry for the inconvenience.

Here is the link for the code on GitHub: https://github.com/Atharva1604/IPL-Score-Prediction

Here is my LinkedIn, feel free to connect with me: https://www.linkedin.com/in/atharva-patil-a79a84176/

THANK YOU!!
