
ml-training-orchestrator


End-to-end ML system component: feature store → training → registry → inference bundle.

Reproducible ML training system for time-series demand forecasting.

Consumes point-in-time datasets from a feature store, enforces training contracts, trains and compares model candidates, evaluates on held-out splits, registers production models to MLflow, and produces inference bundles for downstream serving.

System overview

Scope

This repo covers the training and model selection lifecycle for bike demand forecasting. It does not cover feature engineering (P3), serving (P5), or monitoring.

Current capabilities:

  • Parquet dataset loading from feature store exports
  • Dataset validation (required columns, null checks, duplicate detection)
  • Time-based train/val/test splitting
  • Mean baseline and LightGBM model training
  • RMSE and MAE evaluation on validation and test splits
  • Config-driven candidate comparison with deterministic model selection
  • MLflow experiment tracking with nested runs, param logging, and model artifacts
  • Safe model registration (LightGBM only, baseline excluded)
  • Reproducible inference bundle output
  • Prefect flow orchestration
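The time-based splitting listed above can be sketched roughly as below. This is illustrative only; `time_split`, its signature, and the fraction defaults are assumptions, not this repo's actual API. The key property is chronological ordering so that later data never leaks into training.

```python
import pandas as pd

def time_split(df: pd.DataFrame, timestamp_key: str,
               val_frac: float = 0.15, test_frac: float = 0.15):
    """Split rows chronologically: oldest -> train, then val, newest -> test."""
    df = df.sort_values(timestamp_key).reset_index(drop=True)
    n = len(df)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    train = df.iloc[: n - n_val - n_test]
    val = df.iloc[n - n_val - n_test : n - n_test]
    test = df.iloc[n - n_test :]
    return train, val, test
```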

Architecture

configs/training/           <- YAML configs per training mode
orchestration/flows/        <- Prefect flow definitions
scripts/                    <- Convenience run scripts
src/ml_training_orchestrator/
├── artifacts/              <- Inference bundle builder
├── cli/                    <- CLI entrypoint (train, train-and-register, run-flow)
├── data/                   <- Loader, validation, time-based splitting
├── evaluation/             <- Metrics (RMSE, MAE) and model selection
├── features/               <- Feature selection helpers
├── models/                 <- Mean baseline, LightGBM trainer, factory
├── pipelines/              <- Candidate training, data prep, train-and-register
├── registry/               <- Safe MLflow model registration
└── tracking/               <- MLflow logging helper

Dataset contract

Input is a Parquet dataset exported from the feature store.

Column names are config-driven:

  • entity_key → entity identifier
  • timestamp_key → event timestamp (used for time-based splitting)
  • target → prediction target

All other columns are treated as features, except any listed in features.exclude.
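The feature-column rule above amounts to a small set difference. A minimal sketch (function name and signature are assumptions for illustration):

```python
def select_feature_cols(columns, entity_key, timestamp_key, target, exclude=()):
    """Everything that isn't a key column, the target, or excluded is a feature."""
    reserved = {entity_key, timestamp_key, target, *exclude}
    return [c for c in columns if c not in reserved]
```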

Validation enforces:

  • required columns present
  • no nulls in key columns
  • no duplicates on [entity_key, timestamp_key]
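The three checks above can be sketched with pandas as follows; `validate_dataset` and its error strings are hypothetical, not the repo's real validator:

```python
import pandas as pd

def validate_dataset(df, required, entity_key, timestamp_key):
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = []
    missing = [c for c in required if c not in df.columns]
    if missing:
        errors.append(f"missing columns: {missing}")
    else:
        # Only meaningful once the required columns are known to exist.
        for c in (entity_key, timestamp_key):
            if df[c].isna().any():
                errors.append(f"nulls in key column: {c}")
        if df.duplicated(subset=[entity_key, timestamp_key]).any():
            errors.append("duplicate (entity, timestamp) rows")
    return errors
```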

Configuration

Configs live in configs/training/:

  • baseline.yaml → single mean baseline run
  • lightgbm.yaml → single LightGBM run
  • train_and_register.yaml → multi-candidate comparison with selection and registration

All configs share dataset and split definitions.

The comparison config adds:

  • candidates
  • evaluation.primary_metric
  • evaluation.direction
  • registry
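Putting the shared and comparison-specific keys together, a comparison config might look roughly like this. The overall shape and key names follow the sections above; the concrete values (dataset name, candidate names, column names) are hypothetical:

```yaml
# Illustrative shape only; see configs/training/train_and_register.yaml for the real file.
dataset:
  name: bike_demand_pit        # hypothetical
  version: "v1"                # hypothetical
  entity_key: station_id       # hypothetical
  timestamp_key: event_ts      # hypothetical
  target: demand               # hypothetical
candidates:
  - name: mean_baseline        # hypothetical
  - name: lgbm_default         # hypothetical
evaluation:
  primary_metric: val_rmse
  direction: min
registry:
  enabled: true
```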

Usage

Prerequisites

The example configs reference datasets exported by sibling repos. Before running:

  • baseline.yaml, lightgbm.yaml, train_and_register.yaml expect ../mobility-feature-store/ (P3) to exist with exported Parquet datasets
  • train_and_register_p2.yaml expects ../mobility-feature-pipeline/ with its own export schema

Install

pip install -e .

Run mean baseline

ml-train train --config configs/training/baseline.yaml

Run LightGBM

ml-train train --config configs/training/lightgbm.yaml

Run candidate comparison with registration

Trains all candidates (baseline + LightGBM), selects the winner by val_rmse, and registers the winner to MLflow if it is registry-compatible (LightGBM only).

Winner selection is deterministic:

  • based on evaluation.primary_metric (e.g. val_rmse)
  • optimized according to evaluation.direction (min or max)
  • ties resolved by first candidate order in config

ml-train train-and-register --config configs/training/train_and_register.yaml
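The selection rules above map onto Python's stable `min`/`max`, which return the first of tied elements and therefore preserve candidate order for free. A sketch under assumed result and config shapes (not the repo's actual internals):

```python
def select_winner(results, primary_metric, direction="min"):
    """results: list of candidate result dicts, in config candidate order."""
    pick = min if direction == "min" else max
    # min/max keep the first of equal elements, so ties resolve to the
    # earlier candidate in the config, making selection deterministic.
    return pick(results, key=lambda r: r["metrics"][primary_metric])
```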

Run via Prefect flow

Same lifecycle as train-and-register, wrapped in Prefect tasks for observability and easy reruns.

ml-train run-flow --config configs/training/train_and_register.yaml

MLflow tracking

All runs log to the configured experiment.

  • Single run → one MLflow run
  • Comparison → one parent run with nested child runs (one per candidate)

Parent run:

  • params: selected_candidate, selected_model_type
  • metrics: selected_<primary_metric>
  • tags: registry_status, bundle_path, prefect_flow (if applicable)

Child runs:

  • metrics: val/test metrics
  • artifacts: feature_cols.json, candidate_config.json, training_config.json
  • model artifact (LightGBM only)

Model registration

Registration is gated:

  • registry.enabled must be true in the config
  • Only registry-compatible candidates (LightGBM) are eligible
  • If the winner is the mean baseline, registration is skipped with reason logged
  • Registered models are transitioned to the configured stage (e.g. Staging)
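The gating logic above reduces to a small decision function. A hedged sketch; the function, the type tags, and the reason strings are assumptions for illustration:

```python
REGISTRY_COMPATIBLE = {"lightgbm"}  # assumption: model-type tags used by the gate

def registration_decision(winner_type, registry_enabled):
    """Return (should_register, reason) for a winning candidate."""
    if not registry_enabled:
        return (False, "registry disabled in config")
    if winner_type not in REGISTRY_COMPATIBLE:
        return (False, f"{winner_type} is not registry-compatible")
    return (True, "register and transition to configured stage")
```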

Inference bundle

Written to artifacts/<run_id>/.

Contents:

  • model_uri.txt → MLflow model URI (runs:/<run_id>/model)
  • feature_cols.json → ordered feature list
  • config.json → full training config
  • metrics.json → validation and test metrics
  • metadata.json → run_id, candidate_name, model_type, created_at, input_dataset_name, input_dataset_version, started_at, completed_at

The bundle contains metadata and the serving contract only; the model artifact itself remains in MLflow.

Rebuilding the same run_id preserves the original created_at.

Dataset lineage fields (input_dataset_name, input_dataset_version) are required and sourced from explicit dataset.name and dataset.version config fields. Bundle creation fails if either is missing, ensuring downstream serving observability has real lineage.
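The bundle layout and the lineage check can be sketched with the standard library alone. `write_bundle` and its signature are assumptions; the file names and the fail-fast lineage rule follow the description above:

```python
import json
from pathlib import Path

def write_bundle(root, run_id, model_uri, feature_cols, config, metrics, metadata):
    """Write the metadata-only inference bundle; the model stays in MLflow."""
    # Fail fast if dataset lineage is missing, per the contract above.
    if not (metadata.get("input_dataset_name") and metadata.get("input_dataset_version")):
        raise ValueError("dataset lineage (input_dataset_name/input_dataset_version) is required")
    out = Path(root) / run_id
    out.mkdir(parents=True, exist_ok=True)
    (out / "model_uri.txt").write_text(model_uri)
    for fname, payload in [("feature_cols.json", feature_cols),
                           ("config.json", config),
                           ("metrics.json", metrics),
                           ("metadata.json", metadata)]:
        (out / fname).write_text(json.dumps(payload, indent=2))
    return out
```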

Project context

  • P3 (mobility-feature-store): produces the point-in-time parquet exports consumed by this repo
  • P4 (this repo): trains, evaluates, selects, registers, and bundles models
  • P5 (mobility-serving-layer): serving layer that consumes inference bundles and registered models for online prediction
