I have been thinking about ways of apply machine learning to predictions related to cricket. Last two blog posts; Predicting outcome and Predicting score were not totally new since that work has also been done a few times before. I wanted to come up with something unique that has not been done before in the context of cricket. This blog post is about predicting the next wicket in a t20 match using machine learning. This is an interesting read, lets dive right in.
Predicting next wicket in a T20 match using machine learning
Machine learning can also be called pattern finding. Humans are not very good at finding patterns in many things. The purpose of this project was to see if a machine learning model can find patterns in ways cricket players get out in t20. I came up with some very interesting results.
The data set I use is of 400 International T20 matches excluding matches of teams like Oman, UAE, Afghanistan.
First of all, we have two values “Previous Balls” and “Next Balls” that corresponds to how many previous balls are we giving to our machine learning model and in how many Next Balls does our model have to tell us about whether a player will get out or not. I tried different values and found Previous Balls = 18(3 overs) and Next Balls=6 to be optimal. This means that our machine learning model is given the information about last 18 balls and is asked to predict whether any player will get out in the net 6 balls. This is a binary classification problem.
The features for this problem are pretty cool. Lets take a look at them.
- Score of each ball in last 18 balls (Values can be 0,1,2,3,4,6)
- Wicket in each of last 18 balls (Binary variable with length 18)
- Overs going right now
- Team Playing
- Opposition Team
- Home Ground or Not
- Chasing or Not
- Total wickets in the match till now
- Regions where each ball from last 18 balls went (8 regions in the pitch).
The reason for giving score and wicket for each balls is that we want to let the classifiers find patterns. Therefore, we are just providing them with raw data. Another reason is that neural networks are capable of learning features given the data and labels.
Some other features that I also tried were Ball Speed of each ball, Ball Variation of each ball. I removed these features since not each match had data for these features.
We start with the 18th ball in the first innings and for each iteration, we take 18 balls as our previous balls and 6 balls next to our 18 balls as next balls. We check if there is a wicket in our next balls and if there is, we make our label 1. Otherwise, it stays 0. We do the same for second innings but the chasing variable is set to 1 here. We do this for both innings and this is how we calculate features from a match. For each match, we have 0-20 positive labels (0-20 wickets in a match) and others remain 0.
I tried a large number of classifiers for this problem but I ended up using just Random Forests and Feed forward neural networks.
One thing to note is that I did not rely on classifier’s accuracy since that was coming out to be bad. But when I looked at the probabilities for each decision, there was a clear distinction between probabilities for 0 and 1 labels. If the probability of 0 was more than 90%, it was indeed 0 in the data but if it was less than 90%, then it was mostly 1 which means the player got out there.
Our classification was done by assigning 1 whenever the probability of assigning 0 by the model is less than 88% and 0 otherwise.
After feature extraction, we have over 50,000 rows of data. We split the data randomly in 70% training set and 30% test set. I wanted to see the results on a large data, so I kept the test data to 30%.
Accuracy: 82-85% (Remained in this range with both classifiers with and without a few features removed)
True Positives: 39%. This is our most important metric. These were the values that our model predicted as a wicket and there was indeed a wicket. We were able to correctly predict a wicket 411 times out of 1050 times.
True Negatives: 87% (These were the values where our model said that there was going to be a wicket and there was no wicket). We were able to correctly predict that no wicket would occur 14102 out of 16174
These results seem very abstract. Lets run the model and find the wickets in a few matches.
Results for a single match
I selected this match for our test match and ran the model after extracting features from this match’s data. This match was not included in the training data. The results we get are:
Match Link: http://www.espncricinfo.com/twenty20wc/engine/match/287875.html
True Positives: 50%. Out of 16 outs, our model predicted 8 correctly.
True Negatives: 89%
The balls where our model predicted an out in the next 6 balls are:
Banglandesh Innings: Over 2.1, 8.5, 9.6, 14.2, 14.5, 18.1
Pakistan Innings: Over 8.1, 14.3
The following graph sums up the results.
This was an exciting project and the results seem very convincing to me. I believe we can do a lot more stuff using such techniques.
Let me know about your opinions in the comments.