A Guided Approach to Using Machine Learning for Cricket Wicket Prediction

David Ardagh
Published in Auquan · Jun 10, 2019

In this article, I will be talking about data from our Cricket World Cup Challenge. To enter the competition and download the data set, please go to quant-quest.auquan.com/competitions/cricket-qq1

Overview

Introduction
Data science and sport are becoming more and more integrated and things have come a long way since Billy Beane’s famous 2002 season with the Oakland A’s baseball team (as popularised by the film Moneyball). People use data science to predict game outcomes, design bespoke training plans, assess player potential and much more. Today we’re going to take a look at just one aspect of this: Predicting whether a ball will be a wicket in a cricket match.

As noted above, we are going to be using a data set generously provided by Mustard Systems that contains information on 4,000 recent ODI and other high-level 50-over cricket matches. There are almost 30 factors that have been recorded, giving a total of almost 1 million data points.

Feel free to download the data set and follow along!

Plan of Attack

  • Step zero: We need to download and open the data, then look at some descriptive information about it. I’ve copied out the column headings and types below:
## Code for getting data descriptions
print(data.shape)
print(data.columns)
print(data.dtypes)

Column headings and data types — shape = (874860, 31)
  • Step one: As you can see we have multiple variables that contain categorical data in the form of strings. We are going to need to convert this data into a more useable format.
  • Step two: Next we will do some feature engineering to create some variables that represent the current game state. This might include things like momentum, form, runs in last over, wickets in last over etc. These should be more predictive than the original factors.
  • Step three: Use these new features and build a model to predict if a ball will be a wicket or not. (We’ll make a couple and see which is best).
  • Step four: Iterate over the probability threshold to determine the predicted class. (We’ll come back to this, but basically, because about 95% of balls are no-wickets we can create a model that is 95% accurate by always guessing no-wicket).
  • Step five: Bask in the glory of our creation.

Implementation

Step One: Preparing the data
To make our lives easier when we build the model, we are going to want all of our data to be numerical. This poses an issue for us, as lots of our data is in the form of strings.

There are several ways to circumvent this problem. The easiest is to simply encode each unique value with an integer equivalent. Like most things that are easy, this comes with significant drawbacks: it imposes an arbitrary ordering on categories that have no natural order, which can decrease the predictive power of our model. A more effective way would be to use a technique called ‘one-hot encoding’, where each unique value becomes its own column containing a binary flag for whether an entry takes that value or not.

For the sake of this article, we are just going to use simple encoding but if you’re interested you can learn more about one hot encoding here.
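If you do want to try the one-hot route, a minimal sketch (assuming the match data is already loaded into a DataFrame called data) could use pandas’ get_dummies:

import pandas as pd

# One-hot encode every string (object) column instead of label encoding it
categorical_cols = data.select_dtypes(include='object').columns
data_onehot = pd.get_dummies(data, columns=categorical_cols)
print(data_onehot.shape)  # expect many more columns than the original frame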

Listing the non-numerical variables:

listf = []
for c in data.columns:
    if data[c].dtype == object:
        print(c, data[c].dtype)
        listf.append(c)

Converting unique values to integers:

from sklearn import preprocessing

feature_dict = {}
for feature in data.columns:
    if data[feature].dtype == object:
        le = preprocessing.LabelEncoder()
        fs = data[feature].unique()
        le.fit(fs)
        data[feature] = le.transform(data[feature])
        feature_dict[feature] = le

We then need to get rid of any NaN values that have appeared. Also, we need to separate our target variable so we have a dependent variable we can try and predict.

Clean the data by removing NaN values:

data[data.isnull().any(axis=1)]
data = data.dropna()
del data['date']

Create the target variable (y):

y = data['Out']
del data['Out']

Step Two: Creating new variables
Imagine you are watching a cricket match with me and we decide to place a bet on how many runs will be scored in the next over. What information would we use to make this decision? Would we weigh it all equally?

We would want to consider factors like: who’s playing, who’s batting/bowling, where they are playing, what the pitch is like etc. This is the sort of information we’ve already got in our table.

However, we might also want to consider other factors such as: how many runs do the teams normally score/concede or how well are the players playing in this game/recent games. Similarly, we probably care more about recent head to head results than those that happened a long time ago. All this information is related to the data we’ve already got but requires us to process it and create new variables. This is what we’re going to do in this step.

In the interest of keeping this article short, I’m only going to create a couple of new variables, which are going to look at how the team has performed in the last couple of overs. Hopefully, this will give us a sense of momentum in the game and highlight if any team is under pressure.

Create variables that measure the number of runs and wickets in the last 6/12 balls:

data['run_last_6_balls'] = data['innings_runs_before_ball'].rolling(6).sum()
data['run_last_12_balls'] = data['innings_runs_before_ball'].rolling(12).sum()
data['wkt_last_6_balls'] = data['PreviousBallOut'].rolling(6).sum()
data['wkt_last_12_balls'] = data['PreviousBallOut'].rolling(12).sum()

Remember to set these to 0 at the start of each innings (using .loc avoids pandas’ chained-assignment warnings):

data.loc[data['innings_ball_number'] < 6, 'run_last_6_balls'] = 0
data.loc[data['innings_ball_number'] < 12, 'run_last_12_balls'] = data['run_last_6_balls']
data.loc[data['innings_ball_number'] < 6, 'wkt_last_6_balls'] = 0
data.loc[data['innings_ball_number'] < 12, 'wkt_last_12_balls'] = data['wkt_last_6_balls']
data.fillna(0, inplace=True)
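One thing to keep in mind: the rolling windows above run straight across innings and match boundaries, so the first few balls of a new innings can pick up counts from the previous one. A more careful alternative, sketched below, computes the rolling sums within each innings; the grouping column names are assumptions and may not match the actual data set.

# Hypothetical grouping columns -- replace with whatever identifies a single innings in the data
group_cols = ['match_id', 'innings_number']

# Rolling sums computed per innings, so counts never leak across matches
data['run_last_6_balls'] = (
    data.groupby(group_cols)['innings_runs_before_ball']
        .transform(lambda s: s.rolling(6, min_periods=1).sum())
)
data['wkt_last_6_balls'] = (
    data.groupby(group_cols)['PreviousBallOut']
        .transform(lambda s: s.rolling(6, min_periods=1).sum())
)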

Step Three: Creating some models
Now it’s time to start working some magic! Almost. We first have to split the data into training and test sets so we can assess our models. Let’s quickly do that now.

Creating training and test data sets.
Note: We're going to use stratified sampling here to make sure we maintain representative amounts of each class in each set.

import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, log_loss, make_scorer
from sklearn.metrics import roc_curve, precision_recall_curve, auc, recall_score, accuracy_score, precision_score
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold

X_train, X_test, y_train, y_test = train_test_split(data, y, stratify=y, random_state=10)

Show the class distributions for each group:

print('y_train class distribution')
print(y_train.value_counts(normalize=True))
print('y_test class distribution')
print(y_test.value_counts(normalize=True))

Now we are going to create some models. I did a bit of research before tackling this problem and found that certain classifiers work better than others. I’ve picked four as examples, but feel free to try your own. The ones we’re going to look at: Decision tree classifier, KNN classifier, Naive Bayes and a random forest classifier.

To test the accuracy of our models we are going to use something called a confusion matrix. This is a table that shows predicted outcomes vs actual outcomes as shown below.

Confusion matrix — TN = true negative, TP = true positive, FN = false negative, FP = false positive

We will use these values later to calculate performance metrics for our model. These could include: true/false positive rate, true/false negative rate, sensitivity, specificity, Cohen’s Kappa etc. We are going to be particularly interested in sensitivity and specificity, which we will use to create a ROC curve and complete an AUC analysis (see later).
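As a quick sketch of how sensitivity and specificity fall out of a binary confusion matrix (using the cm arrays we create for each model below; scikit-learn puts actual classes on the rows and predicted classes on the columns):

# Unpack the 2x2 confusion matrix: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm.ravel()
sensitivity = tp / (tp + fn)   # recall / true positive rate: wickets we caught
specificity = tn / (tn + fp)   # true negative rate: no-wicket balls we got right
print(sensitivity, specificity)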

Decision Tree Classifier
Decision tree classifiers are models that try to predict an outcome by building a decision tree. As a reminder, observations from the data create the branches and outcomes form the leaves of the tree. The model will then use this tree to try to predict wickets.

First train a DecisionTreeClassifier:

from sklearn.tree import DecisionTreeClassifier

dtree_model = DecisionTreeClassifier(max_depth=10).fit(X_train, y_train)
dtree_predictions = dtree_model.predict(X_test)

Calculate accuracy and log loss on X_test:

accuracy = dtree_model.score(X_test, y_test)
print(accuracy)
lg = log_loss(y_test, dtree_predictions)
print(lg)

Creating a confusion matrix:

cm = confusion_matrix(y_test, dtree_predictions)
cm

KNN Classifier
KNN stands for K-nearest neighbours. The model takes a ball, looks for the most similar other balls to it and predicts the outcome based on the outcomes of those similar balls. You can play around with the value of K to change the degree of fitting.

Training a KNN classifier:

from sklearn.neighbors import KNeighborsClassifier
# Alternative: neigh = RadiusNeighborsClassifier(radius=1.0).fit(X_train, y_train)

knn = KNeighborsClassifier(n_neighbors=10, weights='distance').fit(X_train, y_train)
knn_predictions = knn.predict(X_test)

Accuracy on X_test:

accuracy = knn.score(X_test, y_test)
# accuracy = neigh.score(X_test, y_test)
print(accuracy)
lg = log_loss(y_test, knn_predictions)
print(lg)

Creating a confusion matrix:

cm = confusion_matrix(y_test, knn_predictions)
# neigh_predictions = neigh.predict(X_test)
# cm = confusion_matrix(y_test, neigh_predictions)
cm

Naive Bayes
Naive Bayes models use Bayes’ theorem to create a model. The ‘naive’ refers to the assumption it makes about the relationships between the variables: that they are all independent of each other, given the class.

Training a Naive Bayes classifier:

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB().fit(X_train, y_train)
gnb_predictions = gnb.predict(X_test)

Accuracy on X_test:

accuracy = gnb.score(X_test, y_test)
print(accuracy)
lg = log_loss(y_test, gnb_predictions)
print(lg)

Creating a confusion matrix:

cm = confusion_matrix(y_test, gnb_predictions)
cm

Random Forest Classifier
Random forest classifiers build many decision trees from random subsets of the data and then combine their predictions into an overall model. This helps reduce overfitting and improves generalisability.

Training a Random Forest classifier:

from sklearn.ensemble import RandomForestClassifier

rforest_model = RandomForestClassifier(n_estimators=300, max_depth=8, random_state=0).fit(X_train, y_train)
rforest_predictions = rforest_model.predict(X_test)

Accuracy on X_test:

accuracy = rforest_model.score(X_test, y_test)
print(accuracy)
lg = log_loss(y_test, rforest_predictions)
print(lg)

Creating a confusion matrix:

cm = confusion_matrix(y_test, rforest_predictions)
cm

Identifying important features
The final part of this section is to test which features our models are weighting as important. This is especially important to assess any new features we’ve built ourselves. For example, it might be that our feature measuring the number of wickets in the last 12 balls isn’t actually predictive.

evaluate_clf = rforest_model

# Print the importance the model assigns to each feature
for i in range(len(X_test.columns)):
    print(X_test.columns[i], evaluate_clf.feature_importances_[i])

# Look at the predicted probabilities for the balls that actually were wickets
z = evaluate_clf.predict_proba(X_test)
for i in range(len(y_test)):
    if y_test.iloc[i] > 0:
        print(y_test.iloc[i])
        print(z[i])
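A slightly tidier way to scan these importances (just a convenience sketch using pandas) is to sort them:

# Rank the features the model found most useful
importances = pd.Series(evaluate_clf.feature_importances_, index=X_test.columns)
print(importances.sort_values(ascending=False).head(10))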

We can repeat this process to trial new features until we are happy that the ones we’ve created are effective. Remember, it is important to make sure your features make sense from a practical perspective! Don’t just randomly combine and manipulate features.

Step Four: Identifying the probability threshold

If you’ve been following along, you should have noticed that most of the models just predict most balls as not out (as this is by far the most common class). This makes comparing the models difficult. If we alter the certainty threshold at which the model classifies a ball as a wicket, we will get a better sense of which models are actually performing better.

y_scores = evaluate_clf.predict_proba(X_test)[:, 1]
p, r, thresholds = precision_recall_curve(y_test, y_scores)

In order to do this, we will compute precision-recall pairs for different probability thresholds.

  • Precision = true positives / all positive predictions (true positives + false positives)
  • Recall (the true positive rate) = true positives / all actual positives (true positives + false negatives)

These can intuitively be hard to separate, but you can think of it like this:

  • Precision is a measure of how often the model’s positive predictions are correct, i.e. what percentage of predicted wickets were actually wickets
  • Recall is a measure of how many of the actual positives the model successfully finds, i.e. what percentage of actual wickets we predicted as wickets
import matplotlib.pyplot as plt
plt.style.use("ggplot")

Adjust class predictions based on the threshold:

def adjusted_classes(y_scores, t):
    return [1 if y >= t else 0 for y in y_scores]

Create the precision-recall function (note that it uses the y_scores and y_test defined above):

def precision_recall_threshold(p, r, thresholds, t=0.5):
    # Generate new class predictions based on the adjusted_classes function
    # above and view the resulting confusion matrix
    y_pred_adj = adjusted_classes(y_scores, t)
    print(pd.DataFrame(confusion_matrix(y_test, y_pred_adj),
                       columns=['pred_neg', 'pred_pos'],
                       index=['neg', 'pos']))

    # Plot the curve of precision and recall scores for different thresholds
    plt.figure(figsize=(8, 8))
    plt.title("Precision and Recall curve ^ = current threshold")
    plt.step(r, p)  # color='b', alpha=0.2, where='post'
    plt.fill_between(r, p)  # step='post', alpha=0.2, color='b'
    plt.ylim([0.5, 1.01])
    plt.xlim([0.5, 1.01])
    plt.xlabel('Recall')
    plt.ylabel('Precision')

    # Plot the current threshold on the line
    close_default_clf = np.argmin(np.abs(thresholds - t))
    plt.plot(r[close_default_clf], p[close_default_clf], '^', c='k',
             markersize=15)
    return y_pred_adj
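If you want to try it out, here is a minimal usage sketch with the p, r, thresholds and y_scores computed above; the 0.3 threshold is just an arbitrary example value, not a recommendation:

# Re-classify the test set with a lower threshold and inspect the new confusion matrix
y_pred_lower = precision_recall_threshold(p, r, thresholds, t=0.3)
print('precision:', precision_score(y_test, y_pred_lower))
print('recall:', recall_score(y_test, y_pred_lower))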

Here are the graph functions we’re going to use to visualise our results:

Precision-recall curve (modified from Hands-On Machine Learning with Scikit-Learn and TensorFlow, p.89):

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.figure(figsize=(8, 8))
    plt.title("Precision and Recall Scores as a function of the decision threshold")
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.ylabel("Score")
    plt.xlabel("Decision Threshold")
    plt.legend(loc='best')

ROC curve (modified from Hands-On Machine Learning with Scikit-Learn and TensorFlow, p.91):

def plot_roc_curve(fpr, tpr, label=None):
    plt.figure(figsize=(8, 8))
    plt.title('ROC Curve')
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([-0.005, 1, 0, 1.005])
    plt.xticks(np.arange(0, 1, 0.05), rotation=90)
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate (Recall)")
    plt.legend(loc='best')

Now you can adjust the model’s threshold to optimise the precision-recall trade-off. The AUC (area under the curve) of the ROC graph is a good way to compare the models against each other, since it doesn’t depend on any single threshold; to pick a threshold, look for the point on the curve that gives the balance of true and false positive rates you are happy with.
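For example, here is a minimal sketch using the y_scores from above and the plotting helpers we just defined (the 'Random Forest' label is purely illustrative):

# ROC curve and AUC for the classifier we are evaluating
fpr, tpr, roc_thresholds = roc_curve(y_test, y_scores)
print('ROC AUC:', auc(fpr, tpr))
plot_roc_curve(fpr, tpr, label='Random Forest')

# Precision and recall as a function of the decision threshold
plot_precision_recall_vs_threshold(p, r, thresholds)
plt.show()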

Once you’ve done this you’ll have created a pretty decent first attempt at a model. You can then refine it by changing the parameters of your chosen classifier, e.g. by changing the number of features used in your decision tree, or the number of decisions, or the depth of the tree.
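One way to automate that parameter search is GridSearchCV, which we imported earlier. Here is a rough sketch for the decision tree; the parameter grid values are assumptions chosen to illustrate the idea, not tuned recommendations:

# Illustrative parameter grid -- adjust the values for your own experiments
param_grid = {
    'max_depth': [4, 6, 8, 10],
    'min_samples_leaf': [10, 50, 100],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid,
                    scoring='roc_auc',
                    cv=StratifiedKFold(n_splits=3))
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)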

The competition page also links to some more advanced techniques, but feel free to do your own research and surprise us!

Link to the competition is here: https://quant-quest.auquan.com/competitions/cricket-qq1
