Can we predict a cricket match before it even starts? I have been thinking about it since I read a few books on Sports Analytics. Cricket is a game where the randomness seems to be maximum. I have been tinkering with different models but nothing comes up to be as good as the results in other sports fields including football, baseball etc. After a few days of struggle, I was finally able to build a model that can predict an ODI match of Pakistan with 75% accuracy before the match even starts. 75% is a great accuracy considering the randomness in the game and the fact that we are making our prediction before the match even starts.
Lets see how we can achieve this. For now, I am only sticking with Pakistan matches.
Predicting the outcome of Cricket Matches using Machine Learning
The dataset contains 184 ODI matches of Pakistan for the last 8 years.
This was the hard part. It took me a while to come up with features and I am still thinking about more features to integrate in the model. Our features are:
- Mean of average score of all batsman against the opposition team.
- Mean of average score of all batsman in the country where match is being played.
- Mean of average wickets of all bowlers against the opposition team.
- Win percentage of Pakistan against the team.
- Win percentage of Pakistan against the team in the same country where the match is being played.
These are the 5 features that I am using right now. Please note that I am not explicitly incorporating any information about the adversary at the moment. I would be doing that in a few days to see how it increases the accuracy. If you can think of a feature, feel free to comment it.
Now that we have our features, we compute them for every match and apply different machine learning models to see the results. The training was done on 80% matches and the last 20% (with respect to the date when the match was played) were used for testing.
I applied various machine learning algorithms and after some pruning, I finalized two of them that were giving good accuracy. One thing to note here is that I have only done this process for Pakistan’s team. Accuracies for different teams will certainly be different than Pakistan since different teams have different patterns of winning and losing.
Lets see the results.
- Naive Bayes – 75% Accuracy (75% Precision)
- Logistic Regression – 68% accuracy
The accuracies are great but we want to see how did the model predict these matches. Lets take a closer look at each match and its prediction. I have created a table of a few matches that were predicted by the model.
One of the most interesting findings is the one where Pakistan won the match despite our model telling that Australia will win the match. But an interesting thing to note here is that in the whole series, it was giving 60+ percent probabilities but as soon as Pakistan included Junaid Khan in their team, the probability of winning for team Australia decreased and Pakistan went on to win the match since Junaid Khan took two early wickets.
That’s it for this post.
If you have any comments, feel free to post them.