Develop a machine learning & data science backed solution that can accurately predict the winners of games in the NCAA March Madness tournament (for both the Men's and Women's tournaments).
As an avid basketball fan and an up-and-coming data scientist, I decided to tackle the 2023 Kaggle March Machine Learning Mania Challenge. To clarify, March Madness is an annual tournament in which NCAA college basketball teams compete against each other. Every year, millions of people try to build the perfect bracket: predicting the winner and loser of every single game. The probability of creating such a bracket, however, is about 1 in 120.2 billion.
I wanted to see whether my machine learning and data science expertise could give me an edge in predicting who wins each game. Now that we have covered the background of the project, let's take a look at the project goals.
Main Goal: Engineer a data science & machine learning backed solution to predict outcomes of games in the NCAA college basketball tournament (for both Men's and Women's).
SubGoals:
- Leverage my solution to submit brackets to various tournaments (ESPN, NCAA, etc).
- Be able to earn an award in the Kaggle competition (linked above) as it is the basis of this project.
In this section, I would like to cover what exactly we are building and how we will use it to achieve the main goal. Since this project is based on the Kaggle competition, the "what" is already solved for us: we are building a machine learning system that returns the probability that the team with the lower TeamID (assigned via the data provided) wins. For example, if Rutgers has a TeamID of 1311 and Purdue has a TeamID of 1312, we want our system to predict P(Rutgers Wins). Likewise, if Purdue has TeamID 1312 and Ohio State has TeamID 1310, we want to predict P(Ohio State Wins).
Hence, this is a supervised classification problem, with the data provided to us via the Kaggle Competition. Our model returns P(lower TeamID wins), and from this we can derive P(higher TeamID wins) = 1 - P(lower TeamID wins). We simply choose the outcome with the higher probability and look up the team name using the provided mappers. We can then use these predictions to build brackets and to submit to the Kaggle Challenge.
Note, the Kaggle Challenge looks at all possible matchups in each tournament since the challenge was released prior to the tournament brackets being made. Also note, the Kaggle Challenge provides us data for both Men's and Women's tournaments.
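To make the mapping from model output to a predicted winner concrete, here is a minimal sketch (the function and dictionary names are my own illustrations, not from the competition's starter code):

```python
def predict_winner(p_lower_wins, lower_id, higher_id, id_to_name):
    """Pick the winner given P(lower TeamID wins).

    P(higher TeamID wins) is simply 1 - P(lower TeamID wins),
    so we choose whichever side has the larger probability.
    """
    p_higher_wins = 1.0 - p_lower_wins
    winner_id = lower_id if p_lower_wins >= p_higher_wins else higher_id
    return id_to_name[winner_id]

# Example using the TeamIDs from the text above
id_to_name = {1311: "Rutgers", 1312: "Purdue"}
print(predict_winner(0.42, 1311, 1312, id_to_name))  # Purdue, since P(Rutgers wins) < 0.5
```

The same helper can be applied across every possible matchup to fill out a full bracket round by round.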
One final thing to note is how the model is evaluated. The model is evaluated using the Brier Score, the metric specified by the Kaggle Competition. I am also using Log-Loss to compare my model's results to last year's competition, and I track accuracy as an intuitive check on performance. Note, I do not select my model based on accuracy.
- Python
- Pandas: library for loading and manipulating tabular data (DataFrames).
- Numpy: library that provides the fundamental tools for scientific computing in Python.
- Matplotlib: library for creating visualizations in Python.
- Seaborn: statistical visualization library built on top of Matplotlib.
- Scikit-Learn: API for building machine learning models in Python.
- Catboost: library implementing gradient boosting on decision trees.
- Keras: a framework for building deep learning models for various problems. (Note, this project uses Keras with the Tensorflow backend.)
- Scikeras: a wrapper that lets us use Scikit-Learn functions with Keras deep learning models.
- Poetry: a Python dependency management system.
- Pickle: Python object serialization, used to save trained models.
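As a quick illustration of the Pickle piece, here is a sketch of saving and restoring a trained model (the filename and the stand-in Scikit-Learn model are my own; note that Keras models are typically saved with their own `model.save` method rather than Pickle):

```python
import pickle
from sklearn.linear_model import LogisticRegression

# Train a small stand-in model (the real project uses CatBoost/Keras models)
model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

# Serialize the fitted model to disk...
with open("march_madness_model.pkl", "wb") as f:
    pickle.dump(model, f)

# ...then restore it later and reuse it without retraining
with open("march_madness_model.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored.predict_proba([[2.5]]))
```

This lets the tournament-time prediction step run against a model trained well in advance.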
I monitored project results in two ways: (a) submitting the predictions to the Kaggle competition and (b) leveraging the model to create brackets for both the Men's and Women's tournaments. Overall, the model predicted games with 62.5% accuracy. What stands out, however, is the large disparity between the Men's and Women's games: the model predicted the Men's games with 55.56% accuracy and the Women's games with 69.84% accuracy. This gap was reflected in the brackets. My Men's bracket earned 290/1920 possible points in the ESPN bracket challenge; much of this can be attributed to the model's low accuracy on the Men's games, which caused several misses in the Round of 64 and threw off the entire bracket. The Women's bracket, on the other hand, earned 890/1920 possible points, which can be attributed to consistently accurate predictions throughout. Clearly some improvements are needed, but this is a good start.
- Predicting college basketball match outcomes using machine learning techniques: some results and lessons learned: a research paper published in 2013 that gave me ideas for features to create and models to try, as well as insight into what some key predictors might be.
- Build a proper preprocessing pipeline that can automate the preprocessing of the data (box scores to actual model ready data).
- Read more papers to engineer better features and elevate the model's performance.
- Build a UI/UX so that this can become an actual application with a frontend that a user can use.
- Build a smooth pipeline such that we can send in 2 team names and get a prediction.
- Understand and analyze the errors made by the model in the 2023 tournament.
- Perform more feature analysis to better engineer our feature set.