Predictive Modeling: Hong Kong Horse Racing

Predict the winning horse with supervised machine learning models, using data covering various features of each horse and race.

Data Source

Data comes from Kaggle user Graham Daley and contains two datasets: one with horse information and one with race information.

Features Explanation

won - whether the horse won (1) or not (0)
horse_age - current age of this horse at the time of the race
horse_rating - rating number assigned by HKJC to this horse at the time of the race
horse_gear - string representing the gear carried by the horse in the race. An explanation of the codes used may be found on the HKJC website.
declared_weight - declared weight of the horse and jockey, in lbs
actual_weight - actual weight carried by the horse, in lbs
draw - post position number of the horse in this race
win_odds - win odds for this horse at start of race
place_odds - place (finishing in 1st, 2nd or 3rd position) odds for this horse at start of race
surface - a number representing the type of race track surface: 1 = dirt, 0 = turf
distance - distance of the race, in metres
race_class - a number representing the class of the race
horse_country - country of origin of this horse
horse_type - sex of the horse, e.g. 'Gelding', 'Mare', 'Horse', 'Rig', 'Colt', 'Filly'
venue - a 2-character string, representing which of the 2 race courses this race took place at: ST = Shatin, HV = Happy Valley
config - race track configuration, mostly related to the position of the inside rail. For more details, see the HKJC website.
going - track condition. For more details, see the HKJC website.

Data Preprocessing

Labels
As the labels are extremely imbalanced, the resampling library imblearn is used: the under-sampling method RandomUnderSampler (RUS) and the over-sampling method Synthetic Minority Over-sampling Technique (SMOTE) are applied in separate model experiments, roughly as sketched below.
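A minimal sketch of this resampling step; the synthetic data here stands in for the real feature matrix and `won` label, and the variable names are illustrative rather than the notebook's own.

```python
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

# Stand-in for the prepared feature matrix and binary 'won' label
# (winners are a small minority of all runners).
X_train, y_train = make_classification(n_samples=10_000, n_features=10,
                                       weights=[0.93], random_state=42)

rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X_train, y_train)  # drop majority-class (lose) rows

sm = SMOTE(random_state=42)
X_sm, y_sm = sm.fit_resample(X_train, y_train)     # synthesize minority-class (win) rows
```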

Modeling

1. KNeighborsClassifier (kNN Classifier)

For this dataset, the goal is to minimize False Positives, i.e. predictions of a win when the horse actually loses. The evaluation metric is therefore the precision score of the positive class (1), the win label, and the optimal k-value is found with a simple for loop.
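A rough sketch of that search loop, assuming `X_train`, `y_train`, `X_test`, `y_test` are already prepared (names are illustrative, not the notebook's):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score

best_k, best_precision = None, 0.0
for k in range(1, 31):  # candidate k-values (illustrative range)
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    # Precision of class 1 = TP / (TP + FP): higher means fewer false "win" calls.
    p = precision_score(y_test, y_pred, pos_label=1, zero_division=0)
    if p > best_precision:
        best_k, best_precision = k, p

print(f"best k = {best_k}, precision(1) = {best_precision:.2f}")
```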

[Figures: knn_ori, knn_rus, knn_sm]

2. LightGBM

LightGBM builds a fast gradient boosting model; the decision threshold is then tuned to push the precision score of the positive class (1) as high as possible.
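A condensed sketch of that idea; the parameters and variable names are illustrative, not the repository's actual code.

```python
import numpy as np
import lightgbm as lgb
from sklearn.metrics import precision_score

# X_train, y_train, X_valid, y_valid: prepared splits (assumed).
model = lgb.train({"objective": "binary", "learning_rate": 0.1},
                  lgb.Dataset(X_train, label=y_train),
                  num_boost_round=200)

# Sweep the decision threshold and keep the one maximizing precision of class 1.
probs = model.predict(X_valid)
best_t, best_p = 0.5, 0.0
for t in np.arange(0.05, 0.95, 0.01):
    preds = (probs >= t).astype(int)
    p = precision_score(y_valid, preds, pos_label=1, zero_division=0)
    if p > best_p:
        best_t, best_p = t, p
```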

[Figures: top10_ori, top10_rus, top10_smote]
The top 10 most important features are shown; win odds and place odds rank highly for all of the models.
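The ranking can be read off a trained booster directly; a small sketch, assuming `model` is a trained LightGBM booster and `feature_names` lists the training columns:

```python
import pandas as pd

# model: trained LightGBM booster; feature_names: training column names (assumed).
importance = pd.Series(model.feature_importance(importance_type="split"),
                       index=feature_names)
print(importance.sort_values(ascending=False).head(10))
```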

Training Summary

Model             | Size    | Time (sec) | Precision (0) | Precision (1) | F1-score (0) | F1-score (1) | True Positive | False Positive
kNN_original_data | 49.1 MB | 3.23525    | 0.92          | 0.32          | 0.96         | 0.02         | 13            | 28
kNN_rus           | 8 MB    | 1.06418    | 0.95          | 0.15          | 0.81         | 0.24         | 752           | 4222
kNN_smote         | 90.6 MB | 6.65747    | 0.93          | 0.15          | 0.90         | 0.20         | 328           | 1808
lgb_original_data | 729 KB  | 1.32079    | 0.95          | 0.28          | 0.93         | 0.33         | 493           | 1287
lgb_rus           | 130 KB  | 0.19044    | 0.94          | 0.30          | 0.93         | 0.32         | 429           | 1020
lgb_smote         | 823 KB  | 1.98941    | 0.93          | 0.36          | 0.95         | 0.15         | 113           | 204
  • Because it processes the most data, the kNN model trained on over-sampled data took the longest to train, while the LightGBM model trained on under-sampled data took the shortest time.
  • kNN models performed relatively poorly, with low precision and F1-scores for the positive class (1).
  • Training aimed to minimize False Positives (predicted win, actual loss), but True Positives and False Positives appear to be correlated; as in gambling and investing, the chance to win comes with the risk of losing.
  • The saved LightGBM models are remarkably small, and training them is very fast.

Predictions

Using the data for one race from the dataset (excluded from model training), predict the winning horse.
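A sketch of how such a prediction could be run from a saved model; the file name and feature variable are illustrative, not the actual artifacts in this repository.

```python
import joblib

# Load a previously saved model (file name is illustrative).
knn_rus = joblib.load("knn_rus.pkl")

# X_race: feature rows for the 14 runners of the held-out race (assumed).
race_pred = knn_rus.predict(X_race)
print(race_pred)  # 1 = predicted winner, 0 = predicted non-winner
```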

1. KNeighborsClassifier (kNN Classifier)

Among the kNN models, only the one trained with under-sampled data predicts the winning horse; however, the prediction also contains one False Positive.

"Classification Report of kNN Classifier model trained with under-sampled data"

                 precision  recall   f1-score   support

         0.0       1.00      0.92      0.96        13
         1.0       0.50      1.00      0.67         1

    accuracy                           0.93        14
   macro avg       0.75      0.96      0.81        14
weighted avg       0.96      0.93      0.94        14

[Figure: knn_rus_cm]

2. LightGBM

With the threshold value set to the same one used in training, all LightGBM models achieved 100% accuracy on this race.
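In other words, the predicted probabilities for the held-out race are cut at the threshold chosen during training; a minimal sketch with illustrative names:

```python
# lgb_model: trained booster; X_race: held-out race features; best_t: threshold from training (all assumed).
probs = lgb_model.predict(X_race)
winner_pred = (probs >= best_t).astype(int)  # 1 = predicted winner
```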

"Classification Report of all LightGBM models"

                 precision  recall   f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       1.00      1.00      1.00         1

    accuracy                           1.00        14
   macro avg       1.00      1.00      1.00        14
weighted avg       1.00      1.00      1.00        14

[Figure: knn_lgb]

Things to be Improved

  • Correlation between features is ignored.
  • Feature scaling was not performed, even though the numeric features span very different ranges.
  • One-hot encoding was not applied to some categorical features such as draw and race_class; their raw numeric values were kept.
  • Hyperparameters of the kNN classifier models were not fine-tuned; GridSearchCV could be run to search for better results (see the sketch after this list).
  • The test set is small, so the highly accurate prediction results owe a lot to luck. More data could be used for the testing experiment.
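For the GridSearchCV point above, a possible search setup might look like this (the parameter grid and variable names are illustrative):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {
    "n_neighbors": list(range(1, 31)),
    "weights": ["uniform", "distance"],
    "p": [1, 2],  # Manhattan vs. Euclidean distance
}
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      scoring="precision",  # precision of the positive (win) class
                      cv=5, n_jobs=-1)
search.fit(X_train, y_train)  # X_train, y_train: resampled training data (assumed)
print(search.best_params_, search.best_score_)
```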

Detailed Presentation

Skills Acquired

  • Pandas: e.g. data cleaning and data imputation, one-hot encoding for categorical features
  • Scikit-learn: e.g. fine-tuning hyperparameters of the training models, setting metrics for model evaluation
  • Joblib: saving and loading trained machine learning models

Acknowledgements

LightGBM code is referenced from a Medium article by Pushkar Mandot, and the confusion matrix plot code is referenced from Stack Overflow user Joe Kington. Thank you, coders, for sharing your experience! =]
