End-to-end ML system component: feature store → training → registry → inference bundle.
Reproducible ML training system for time-series demand forecasting.
Consumes point-in-time datasets from a feature store, enforces training contracts, trains and compares model candidates, evaluates on held-out splits, registers production models to MLflow, and produces inference bundles for downstream serving.
This repo covers the training and model selection lifecycle for bike demand forecasting. It does not cover feature engineering (P3), serving (P5), or monitoring.
Current capabilities:
- Parquet dataset loading from feature store exports
- Dataset validation (required columns, null checks, duplicate detection)
- Time-based train/val/test splitting
- Mean baseline and LightGBM model training
- RMSE and MAE evaluation on validation and test splits
- Config-driven candidate comparison with deterministic model selection
- MLflow experiment tracking with nested runs, param logging, and model artifacts
- Safe model registration (LightGBM only, baseline excluded)
- Reproducible inference bundle output
- Prefect flow orchestration
```
configs/training/               <- YAML configs per training mode
orchestration/flows/            <- Prefect flow definitions
scripts/                        <- Convenience run scripts
src/ml_training_orchestrator/
├── artifacts/    <- Inference bundle builder
├── cli/          <- CLI entrypoint (train, train-and-register, run-flow)
├── data/         <- Loader, validation, time-based splitting
├── evaluation/   <- Metrics (RMSE, MAE) and model selection
├── features/     <- Feature selection helpers
├── models/       <- Mean baseline, LightGBM trainer, factory
├── pipelines/    <- Candidate training, data prep, train-and-register
├── registry/     <- Safe MLflow model registration
└── tracking/     <- MLflow logging helper
```
Input is a Parquet dataset exported from the feature store.
Column names are config-driven:
- entity_key → entity identifier
- timestamp_key → event timestamp (used for time-based splitting)
- target → prediction target
All other columns are treated as features (minus any in features.exclude).
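The "everything else is a feature" rule above can be sketched as a small pure-Python helper. This is an illustrative sketch, not the repo's actual implementation; the column names in the usage note are hypothetical.

```python
def select_feature_cols(all_cols, entity_key, timestamp_key, target,
                        exclude=()):
    """Derive the feature list: every column except the entity key,
    timestamp key, target, and anything in features.exclude,
    preserving dataset column order."""
    reserved = {entity_key, timestamp_key, target, *exclude}
    return [c for c in all_cols if c not in reserved]
```

For example, with columns `["station_id", "event_ts", "rides", "temp", "is_holiday"]` and `rides` as the target, the helper returns `["temp", "is_holiday"]` when keys and excludes are stripped.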
Validation enforces:
- required columns present
- no nulls in key columns
- no duplicates on [entity_key, timestamp_key]
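The three validation rules above can be sketched as a single pandas check that accumulates violations rather than failing fast. This is a minimal sketch assuming pandas; the actual validators live under `src/ml_training_orchestrator/data/` and may differ.

```python
import pandas as pd


def validate_dataset(df: pd.DataFrame, entity_key: str,
                     timestamp_key: str, target: str) -> list:
    """Return a list of contract violations (empty list == valid)."""
    errors = []
    required = [entity_key, timestamp_key, target]
    missing = [c for c in required if c not in df.columns]
    if missing:
        errors.append(f"missing required columns: {missing}")
        return errors  # remaining checks need the key columns
    for col in required:
        n = int(df[col].isna().sum())
        if n:
            errors.append(f"{n} null(s) in key column {col!r}")
    dups = int(df.duplicated(subset=[entity_key, timestamp_key]).sum())
    if dups:
        errors.append(f"{dups} duplicate row(s) on [{entity_key}, {timestamp_key}]")
    return errors
```

Returning all violations at once makes a failed training run easier to diagnose than raising on the first problem.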
Configs live in configs/training/:
- `baseline.yaml` → single mean baseline run
- `lightgbm.yaml` → single LightGBM run
- `train_and_register.yaml` → multi-candidate comparison with selection and registration
All configs share dataset and split definitions.
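The shared time-based split can be sketched as boundary cuts on the timestamp column. A minimal sketch assuming pandas; the boundary-date parameters are hypothetical and the real split config may use sizes or ratios instead.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class TimeSplit:
    train: pd.DataFrame
    val: pd.DataFrame
    test: pd.DataFrame


def split_by_time(df: pd.DataFrame, timestamp_key: str,
                  val_start: str, test_start: str) -> TimeSplit:
    """Split rows by timestamp: strictly before `val_start` -> train,
    [val_start, test_start) -> validation, the rest -> test."""
    ts = pd.to_datetime(df[timestamp_key])
    val_b, test_b = pd.Timestamp(val_start), pd.Timestamp(test_start)
    return TimeSplit(
        train=df[ts < val_b],
        val=df[(ts >= val_b) & (ts < test_b)],
        test=df[ts >= test_b],
    )
```

Cutting on time rather than sampling at random prevents future rows from leaking into training, which matters for demand forecasting.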
The comparison config adds:

- `candidates`
- `evaluation.primary_metric`
- `evaluation.direction`
- `registry`
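Putting those keys together, a comparison config might look roughly like the fragment below. This is an illustrative shape only; field names beyond those documented here are guesses, not the repo's actual schema.

```yaml
dataset:
  name: bike_demand_hourly     # required for bundle lineage
  version: "2024-06-01"        # required for bundle lineage
candidates:
  - name: mean_baseline
    model_type: baseline
  - name: lgbm_default
    model_type: lightgbm
evaluation:
  primary_metric: val_rmse
  direction: min
registry:
  enabled: true
  stage: Staging
```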
The example configs reference datasets exported by sibling repos. Before running:
- `baseline.yaml`, `lightgbm.yaml`, and `train_and_register.yaml` expect `../mobility-feature-store/` (P3) to exist with exported parquet datasets
- `train_and_register_p2.yaml` expects `../mobility-feature-pipeline/` with its own export schema
Install the package and run a single candidate:

```
pip install -e .
ml-train train --config configs/training/baseline.yaml
ml-train train --config configs/training/lightgbm.yaml
```

The comparison mode (`train-and-register`) trains all candidates (baseline + LightGBM), selects the winner by `val_rmse`, and registers the winner to MLflow if it is registry-compatible (LightGBM only).
Winner selection is deterministic:

- based on `evaluation.primary_metric` (e.g. `val_rmse`)
- optimized according to `evaluation.direction` (`min` or `max`)
- ties resolved by first candidate order in config
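The tie-break semantics above fall out naturally from Python's `min`/`max`, which return the first optimal element when iterating candidates in config order. A minimal sketch of the selection logic (not the repo's actual code):

```python
def select_winner(results, primary_metric, direction="min"):
    """Pick the winning candidate deterministically.

    `results` maps candidate name -> metrics dict, in config order
    (dicts preserve insertion order). Ties keep the earliest
    candidate because min()/max() return the first optimal item.
    """
    if direction not in ("min", "max"):
        raise ValueError(f"unknown direction: {direction!r}")
    pick = min if direction == "min" else max
    return pick(results, key=lambda name: results[name][primary_metric])
```

Because the winner depends only on the metric values and the config order, reruns over identical metrics always select the same candidate.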
To run the comparison:

```
ml-train train-and-register --config configs/training/train_and_register.yaml
```

The Prefect flow runs the same lifecycle as `train-and-register`, wrapped in Prefect tasks for observability and rerunability:

```
ml-train run-flow --config configs/training/train_and_register.yaml
```

All runs log to the configured experiment.
- Single run → one MLflow run
- Comparison → one parent run with nested child runs (one per candidate)
Parent run:
- params: selected_candidate, selected_model_type
- metrics: `selected_<primary_metric>`
- tags: registry_status, bundle_path, prefect_flow (if applicable)
Child runs:
- metrics: val/test metrics
- artifacts: feature_cols.json, candidate_config.json, training_config.json
- model artifact (LightGBM only)
Registration is gated:
- `registry.enabled` must be `true` in the config
- Only registry-compatible candidates (LightGBM) are eligible
- If the winner is the mean baseline, registration is skipped with reason logged
- Registered models are transitioned to the configured stage (e.g. `Staging`)
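The gate above reduces to a small decision function. This is a sketch with a hypothetical name and signature, not the registry module's actual API:

```python
def registration_decision(winner_model_type, registry_enabled,
                          compatible_types=("lightgbm",)):
    """Decide whether to register the winner.

    Returns (should_register, reason); the reason is what gets
    logged when registration is skipped."""
    if not registry_enabled:
        return False, "registry.enabled is false"
    if winner_model_type not in compatible_types:
        return False, f"model type {winner_model_type!r} is not registry-compatible"
    return True, "ok"
```

Returning the skip reason alongside the boolean keeps the "registration skipped with reason logged" behavior explicit instead of burying it in branch-local log calls.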
Inference bundles are written to `artifacts/<run_id>/`.
Contents:
- `model_uri.txt` → MLflow model URI (`runs:/<run_id>/model`)
- `feature_cols.json` → ordered feature list
- `config.json` → full training config
- `metrics.json` → validation and test metrics
- `metadata.json` → run_id, candidate_name, model_type, created_at, input_dataset_name, input_dataset_version, started_at, completed_at
Bundle is metadata + contract only. Model remains in MLflow.
Rebuilding the same run_id preserves the original created_at.
Dataset lineage fields (`input_dataset_name`, `input_dataset_version`) are required and sourced from the explicit `dataset.name` and `dataset.version` config fields. Bundle creation fails if either is missing, so downstream serving and observability always have real lineage.
- P3 (mobility-feature-store): produces the point-in-time parquet exports consumed by this repo
- P4 (this repo): trains, evaluates, selects, registers, and bundles models
- P5 (mobility-serving-layer): serving layer that consumes inference bundles and registered models for online prediction
