tennis preds

The ATP Match Predictions Model predicts the outcomes (winner vs. loser) of tennis matches played by professional players on the ATP tour.

Service: modeling
Industry: data science
Year: 2024

About the project


The training data covers matches played from January 2018 to December 2022 and was sourced from Jeff Sackmann, who compiles comprehensive ATP match data. The dataset consists of 12,815 observations and 49 columns, covering match location, surface, points won, rankings, and player statistics.

The test set was the 2023 match data.
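
A split like this keeps every test match strictly later than the training matches. The snippet below is a minimal sketch of that split, assuming each match is a record with a `date` field; the field names and sample rows are illustrative and not taken from the project code.

```python
from datetime import date


def split_by_year(matches, train_years=range(2018, 2023), test_year=2023):
    """Split match records into train and test sets by calendar year."""
    train = [m for m in matches if m["date"].year in train_years]
    test = [m for m in matches if m["date"].year == test_year]
    return train, test


# Illustrative records only (hypothetical players and dates).
sample_matches = [
    {"date": date(2018, 1, 15), "winner": "A", "loser": "B"},
    {"date": date(2022, 12, 1), "winner": "B", "loser": "C"},
    {"date": date(2023, 3, 10), "winner": "C", "loser": "A"},
]
train_set, test_set = split_by_year(sample_matches)
```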

Key data processing and feature extraction steps included: 

  • Variables such as Elo ratings, serve statistics, break points, and match outcomes were created for each player.
  • Rolling averages were calculated for each player.
  • Random Forest was used to select the top 20 features for model training.
  • Standard cleaning steps such as removal of missing data, scaling, and outlier removal were performed.
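
As a rough illustration of the first two steps, the sketch below computes pre-match Elo ratings and a rolling serve average for each player. The K-factor, starting rating, window size, and column names (`w_first_serve_pct`, `l_first_serve_pct`) are assumptions chosen for the example, not values from the project.

```python
from collections import defaultdict, deque

K = 32            # assumed Elo update factor
START_ELO = 1500  # assumed rating for a player with no history
WINDOW = 5        # assumed rolling-average window, in matches


def expected_score(rating_a, rating_b):
    """Standard Elo win probability of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def mean_or_none(values):
    """Average of the values, or None when there is no history yet."""
    return sum(values) / len(values) if values else None


def build_features(matches):
    """Build pre-match features from chronologically ordered matches.

    Each feature row uses only information from earlier matches, so the
    outcome of the current match never leaks into its own features.
    """
    elo = defaultdict(lambda: START_ELO)
    recent_serve = defaultdict(lambda: deque(maxlen=WINDOW))
    rows = []
    for m in matches:
        w, l = m["winner"], m["loser"]
        # Record features before updating any state with this match.
        rows.append({
            "winner_elo": elo[w],
            "loser_elo": elo[l],
            "winner_serve_avg": mean_or_none(recent_serve[w]),
            "loser_serve_avg": mean_or_none(recent_serve[l]),
        })
        # Elo update: the winner gains what the loser gives up.
        delta = K * (1.0 - expected_score(elo[w], elo[l]))
        elo[w] += delta
        elo[l] -= delta
        recent_serve[w].append(m["w_first_serve_pct"])
        recent_serve[l].append(m["l_first_serve_pct"])
    return rows


# Two illustrative matches between hypothetical players A and B.
demo_rows = build_features([
    {"winner": "A", "loser": "B", "w_first_serve_pct": 0.6, "l_first_serve_pct": 0.5},
    {"winner": "A", "loser": "B", "w_first_serve_pct": 0.7, "l_first_serve_pct": 0.4},
])
```

Before training, the winner and loser columns would need to be reassigned to neutral player slots (for example, player 1 and player 2 in random order); otherwise the column position itself reveals the label.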

The following models were tested and tuned using the MLflow interface. MLflow is an open-source platform that streamlines the end-to-end machine learning lifecycle, supporting experiment tracking, model management, and deployment.

  • XGBoost
  • Support Vector Machine
  • Logistic Regression
  • Neural Network

Results

The best performing model was the XGBoost model with the following hyperparameters:

  • Eta = 0.3
  • Max_depth = 15
  • N_rounds = 20
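
For readers working in Python, these settings could be expressed for the `xgboost` package roughly as follows. This is a sketch under assumptions: the source does not say which XGBoost interface was used, `N_rounds` is assumed to correspond to `num_boost_round`, and the objective and evaluation metric are guesses suited to a binary win/lose target.

```python
# Hyperparameters reported above.
PARAMS = {
    "eta": 0.3,
    "max_depth": 15,
    "objective": "binary:logistic",  # assumption: binary win/lose target
    "eval_metric": "auc",            # assumption: not stated in the source
}
NUM_BOOST_ROUND = 20  # assumed equivalent of N_rounds


def train_booster(X_train, y_train):
    """Train a booster with the settings above (requires xgboost)."""
    # Imported lazily so the configuration can be inspected without xgboost.
    import xgboost as xgb

    dtrain = xgb.DMatrix(X_train, label=y_train)
    return xgb.train(PARAMS, dtrain, num_boost_round=NUM_BOOST_ROUND)
```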

This model had the following performance metrics: 

  • Test set F1 = 0.679
  • Test set AUC = 0.783
  • Test set accuracy = 0.707
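
To make the three metrics concrete, the sketch below computes each from scratch on a small made-up set of labels and predicted probabilities. The data and the 0.5 decision threshold are illustrative assumptions and do not reproduce the project's results.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted win probabilities.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy and F1 depend on the chosen threshold, while AUC depends only on how the scores rank the matches, which is why the three numbers can differ on the same model.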

The model was deployed as a simple HTML web app, which can be launched using the project code on GitHub.
