
πŸ€ March Machine Learning Mania 2026

Kaggle Competition - Predict NCAA Tournament outcomes using advanced ML ensemble methods, ELO ratings, Stacking, and Seed Override strategies.



Overview

This project tackles the March Machine Learning Mania 2026 Kaggle competition, which challenges participants to predict the probability of each possible matchup in the NCAA Men's and Women's Basketball Tournaments.

The pipeline combines feature engineering, custom ELO rating systems, LightGBM/XGBoost/CatBoost/LR ensemble stacking, Optuna hyperparameter optimization, Massey Ordinals, and Seed Override strategies to generate calibrated win probabilities for every potential game, evaluated using the Brier Score.


πŸ—‚οΈ Project Structure

kaggle_mania/
├── notebooks/                  # Notebooks for exploratory analysis (EDA)
│
├── output/
│   └── submission.csv          # Final Kaggle submission file
│
└── src/
    ├── catboost_info/          # CatBoost training logs & metadata
    ├── config.py               # Global configuration (paths, hyperparameters, seeds)
    ├── data_loading.py         # Raw data ingestion and preprocessing
    ├── dataset_builder.py      # Feature matrix construction for train/test
    ├── elo_rating.py           # Custom ELO with MOV, custom prior, and pre-tourney snapshot
    ├── ensemble.py             # Blending and Stacking ensemble logic
    ├── feature_engineering.py  # Advanced feature creation (stats, ELO PreTourney, seeds)
    ├── model_training.py       # Model training with Optuna hyperparameter tuning
    ├── monte_carlo.py          # Monte Carlo tournament bracket simulations
    ├── submission.py           # Formats submission CSV with Seed Override
    ├── validation.py           # Temporal CV splits and Brier Score evaluation
    └── main.py                 # Pipeline entrypoint: runs end-to-end

Methodology

1. Data Loading & Preprocessing

Raw NCAA historical data (Men's + Women's) is loaded, cleaned, and structured into season-level and game-level datasets. Massey Ordinals are filtered to the 15 most predictive ranking systems.

2. ELO Rating System

A custom ELO implementation with five key features:

  • Margin of Victory (MOV): a FiveThirtyEight-style multiplier, so a win by 20 points raises the rating more than a win by 2
  • Custom prior: teams don't start every season at the same rating; historically strong teams start higher, based on their win percentage across all historical data
  • ELO PreTourney: a snapshot of each team's ELO at its last regular-season game, before the tournament begins; a cleaner signal than end-of-season ELO
  • Home-court adjustment: ±100 ELO points for home/away games
  • Soft reset between seasons: 75% carry-forward + 25% regression to the custom prior
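As a sketch of how one rating update could look under these rules (the multiplier form and constants here are illustrative assumptions, not necessarily the exact values used in elo_rating.py):

```python
import math

def expected_score(elo_a, elo_b, home_adv=0.0):
    """Logistic expectation that team A beats team B."""
    return 1.0 / (1.0 + 10 ** (-((elo_a + home_adv) - elo_b) / 400.0))

def mov_multiplier(margin, elo_diff):
    """FiveThirtyEight-style margin-of-victory multiplier: grows with the
    margin, shrinks when the favorite (higher elo_diff) wins big."""
    return math.log(abs(margin) + 1.0) * (2.2 / (elo_diff * 0.001 + 2.2))

def update_elo(winner_elo, loser_elo, margin, k=20.0, home_adv=0.0):
    """Return the new (winner, loser) ratings after one game."""
    exp_win = expected_score(winner_elo, loser_elo, home_adv)
    delta = k * mov_multiplier(margin, winner_elo - loser_elo) * (1.0 - exp_win)
    return winner_elo + delta, loser_elo - delta

def soft_reset(elo, prior, carry=0.75):
    """Between seasons: 75% carry-forward, 25% regression to the custom prior."""
    return carry * elo + (1.0 - carry) * prior
```

With this form, a 20-point win moves the rating noticeably more than a 2-point win, and the soft reset pulls every team partway back toward its prior each off-season.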

3. Feature Engineering

73 features engineered per matchup, including:

  • Season averages (offensive/defensive efficiency, tempo, FG%, 3PT%, AST/TO)
  • Tournament seeding differentials
  • ELO delta, absolute ELO, and ELO PreTourney for each team
  • Recent form: rolling window over the last 14 games
  • Historical tournament wins (cumulative, with no leakage)
  • Strength of Schedule
  • Massey Ordinals (15 systems): POM, SAG, MOR, COL, DOL, WLK, ARG, BPI, RPI, KPI, RTH, DCI, REW, AP, USA
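Conceptually, each matchup row pairs two teams' per-team aggregates into A_*, B_*, and Diff_* columns. A minimal stdlib sketch (the feature names and numbers are illustrative, not taken from the pipeline):

```python
def build_matchup_features(team_a, team_b):
    """Combine two per-team feature dicts into one matchup row
    with A_*, B_*, and Diff_* columns (Diff = A minus B)."""
    row = {}
    for name in team_a:
        row[f"A_{name}"] = team_a[name]
        row[f"B_{name}"] = team_b[name]
        row[f"Diff_{name}"] = team_a[name] - team_b[name]
    return row

# Illustrative per-team aggregates for one hypothetical matchup
team_a = {"ELO": 1740.0, "Seed": 1, "Rolling_ScoreDiff": 9.5}
team_b = {"ELO": 1585.0, "Seed": 9, "Rolling_ScoreDiff": 3.1}
row = build_matchup_features(team_a, team_b)
# row["Diff_ELO"] == 155.0, row["Diff_Seed"] == -8
```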

4. Model Training: Ensemble of 4 Models

Four base models trained with temporal cross-validation (9 folds, one season per fold):

| Model | Role |
| --- | --- |
| LightGBM | Primary boosting model |
| XGBoost | Diversity in the boosting approach |
| CatBoost | Robust to noisy features |
| Logistic Regression | Strong linear baseline, consistently the best single model |

All GBMs use early stopping (50 rounds) to prevent overfitting.
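The temporal scheme above can be sketched as one held-out season per fold, training only on the seasons before it (the season range here is illustrative):

```python
def temporal_folds(seasons):
    """Yield (train_seasons, valid_season) pairs: each fold validates on
    one season and trains only on the seasons that precede it."""
    seasons = sorted(seasons)
    for i in range(1, len(seasons)):
        yield seasons[:i], seasons[i]

folds = list(temporal_folds(range(2016, 2026)))
# 9 folds: the first trains on [2016] and validates on 2017,
# the last trains on 2016-2024 and validates on 2025
```

Because validation seasons always come after their training seasons, no fold can leak future information backward.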

5. Optuna Hyperparameter Optimization

Bayesian optimization via Optuna runs 30 trials each for LightGBM and CatBoost, using the first temporal fold as the optimization target. Parameters tuned: learning_rate, max_depth, num_leaves, subsample, colsample_bytree, reg_alpha, reg_lambda.

6. Stacking Ensemble

Instead of a simple weighted average, a meta-model (Logistic Regression) is trained on the Out-of-Fold (OOF) predictions of the 4 base models. This learns the optimal combination weights automatically, consistently outperforming manual blending.

Meta-model learned weights (final submission):

lr:  3.71  <- dominant signal
lgb: 2.07
xgb: 0.78
cat: 0.29
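A minimal sketch of the stacking idea: fit a plain (unregularized) logistic regression by gradient descent on the OOF probabilities of the base models. This is an illustrative stdlib implementation, not the pipeline's actual meta-model code:

```python
import math

def train_meta_lr(oof_preds, labels, lr=0.5, epochs=2000):
    """Fit a logistic-regression meta-model on out-of-fold base-model
    predictions via batch gradient descent. Returns (weights, bias)."""
    n_models = len(oof_preds[0])
    w = [0.0] * n_models
    b = 0.0
    n = len(labels)
    for _ in range(epochs):
        gw = [0.0] * n_models
        gb = 0.0
        for x, y in zip(oof_preds, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # prediction minus label
            for j in range(n_models):
                gw[j] += err * x[j]
            gb += err
        w = [wi - lr * gi / n for wi, gi in zip(w, gw)]
        b -= lr * gb / n
    return w, b
```

The learned weights play the role of the lr/lgb/xgb/cat coefficients listed above: base models whose OOF predictions track the outcome best receive the largest weights.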

7. OOF Calibration

A calibrator is fitted on the stacked OOF predictions. Applied only if it improves the Brier Score, avoiding unnecessary distortion.
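The "apply only if it helps" gate reduces to comparing Brier Scores on the OOF set; the calibrator itself (e.g. isotonic or Platt scaling) is abstracted away in this sketch:

```python
def brier(preds, labels):
    """Mean squared error between predicted probabilities and outcomes."""
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)

def maybe_calibrate(raw, calibrated, labels):
    """Keep the calibrated predictions only if they lower the OOF Brier Score."""
    return calibrated if brier(calibrated, labels) < brier(raw, labels) else raw
```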

8. Seed Override

Conservative probability clipping for extreme seed matchups, protecting against catastrophic Brier Score penalties from upsets:

| Matchup | Favorite probability range |
| --- | --- |
| Seed 1 vs Seed 16 | 0.82 – 0.93 |
| Seed 2 vs Seed 15 | 0.78 – 0.93 |
| Seed 1 vs Seed 15 | 0.78 – 0.93 |

Seeds loaded directly from MNCAATourneySeeds.csv - 56 matchups protected in the final submission.
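The clipping logic for these protected matchups reduces to a lookup plus a clamp (a sketch of the idea; names and structure are illustrative):

```python
# (favorite seed, underdog seed) -> allowed probability range for the favorite
SEED_OVERRIDE = {
    (1, 16): (0.82, 0.93),
    (2, 15): (0.78, 0.93),
    (1, 15): (0.78, 0.93),
}

def apply_seed_override(pred_favorite, fav_seed, dog_seed):
    """Clip the favorite's win probability into the protected range if this
    seed pairing has an override; otherwise leave it unchanged."""
    lo, hi = SEED_OVERRIDE.get((fav_seed, dog_seed), (0.0, 1.0))
    return min(max(pred_favorite, lo), hi)
```

Capping the upper end at 0.93 is what limits the Brier penalty when a 16-seed or 15-seed pulls off an upset.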

9. Submission Generation

Final probabilities are formatted to the Kaggle spec: one row per possible matchup (ID, Pred) for both Men's and Women's tournaments, 132,133 rows total. Predictions are clipped to [0.05, 0.95].
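Row generation can be sketched as follows, assuming the competition's usual ID convention of `<Season>_<LowerTeamID>_<HigherTeamID>` with Pred being the probability that the lower-ID team wins (the `predict` callback here is a placeholder):

```python
from itertools import combinations

def submission_rows(season, team_ids, predict, clip=(0.05, 0.95)):
    """Yield one (ID, Pred) row per unordered team pair. The ID is
    '<Season>_<LowerTeamID>_<HigherTeamID>'; Pred is the clipped
    probability that the lower-ID team wins."""
    lo, hi = clip
    for a, b in combinations(sorted(team_ids), 2):
        p = min(max(predict(a, b), lo), hi)
        yield f"{season}_{a}_{b}", p

# Toy usage with a constant dummy predictor
rows = list(submission_rows(2026, [1104, 1101, 1112], lambda a, b: 0.99))
```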


Evaluation

| Metric | Description |
| --- | --- |
| Brier Score | Primary competition metric: mean squared error of the predicted probabilities |
| AUC-ROC | Discrimination ability across all thresholds |

Validation uses temporal splits, training on earlier seasons and validating on later ones, to simulate real prediction scenarios and prevent data leakage.


Model Evolution

| Version | OOF Brier Score | Key Change |
| --- | --- | --- |
| V1 Baseline | ~0.220 | 5 Massey systems, no ELO prior |
| V2 Massey Expanded | ~0.215 | 15 Massey systems |
| V3 Optuna | ~0.210 | Optuna on LGB + CatBoost |
| V4 ELO PreTourney | ~0.110 | ELO PreTourney feature |
| V5 Stacking | ~0.090 | Meta-model stacking |
| V6 ELO Prior | ~0.088 | Custom ELO prior |
| V7 Seed Override | ~0.088 | Seed Override + updated data |
| V8 Final | 0.07667 | Brier Score optimization + calibration |

Setup & Usage

Prerequisites

Python >= 3.10

Install dependencies

pip install -r requirements.txt

Run the full pipeline

python src/main.py

The pipeline will:

  1. Load and preprocess all data (Men's + Women's)
  2. Build custom ELO prior from historical win rates
  3. Compute ELO ratings and ELO PreTourney across all seasons
  4. Engineer 73 features per matchup
  5. Run Optuna optimization (30 trials each for LGB and CatBoost)
  6. Train 4 base models with temporal cross-validation (9 folds)
  7. Fit Stacking meta-model on OOF predictions
  8. Apply OOF calibration if it improves Brier Score
  9. Apply Seed Override for extreme matchups
  10. Output output/submission.csv

Configuration

All key parameters live in src/config.py:

SEED = 33
ELO_INITIAL = 1500
ELO_K = 20
ELO_HOME_ADVANTAGE = 100
ROLLING_WINDOW = 14
STAGE = 2      # 1 = development, 2 = final submission
GENDER = "M"    # M = Men's, W = Women's

Feature Importance (Top 10 - LightGBM)

| Rank | Feature | Description |
| --- | --- | --- |
| 1 | Diff_ELO | ELO rating differential |
| 2 | Diff_ELO_PreTourney | ELO differential at tournament entry |
| 3 | Diff_Seed | Seed number differential |
| 4 | Diff_HistTourneyWins | Historical tournament wins differential |
| 5 | A_HistTourneyWins | Team A historical tournament wins |
| 6 | B_Seed | Team B seed number |
| 7 | Diff_Rolling_ScoreDiff | Recent scoring margin differential |
| 8 | Diff_Massey_DCI | Massey DCI system differential |
| 9 | A_Seed | Team A seed number |
| 10 | Diff_Stl_mean | Steals differential |

Roadmap

  • ELO rating system with MOV (FiveThirtyEight style)
  • Custom ELO prior based on historical win rates
  • ELO PreTourney feature
  • Massey Ordinals, 15 systems
  • LGB + XGB + CatBoost + LR ensemble
  • Optuna hyperparameter optimization
  • Stacking with meta-model
  • OOF calibration
  • Seed Override for extreme matchups
  • Brier Score optimization (competition metric)
  • Men's + Women's combined submission

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.


Author

Davis Denner
Data Scientist · Kaggle Enthusiast
GitHub · LinkedIn


License

This project is licensed under the MIT License.


In data we trust, in brackets we fight, March Madness 2026

About

This project tackles the March Machine Learning Mania 2026 Kaggle competition, which challenges participants to predict the probability of each possible matchup in the NCAA Men's and Women's Basketball Tournaments.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors