Purpose
This document provides foundational MLOps knowledge for developers new to machine learning operations.
- New to MLOps → Read this first
- Familiar with MLOps → Skip to architecture.md
- Need specific concept → Use the index below
MLOps (Machine Learning Operations) applies DevOps practices to machine learning systems:
| DevOps Concept | MLOps Equivalent |
|---|---|
| Code versioning | Model versioning |
| CI/CD pipelines | Training pipelines |
| Staging → Production | Candidate → Production |
| Monitoring | Model performance metrics |
| Rollback | Model rollback |
Model Registry
A versioned store of trained models:
- Each model has a name (e.g., `planner_model`)
- Each version is immutable
- Metadata tracks parameters, metrics, artefacts
In this system: MLflow Model Registry
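The two key properties above (named models, immutable versions) can be illustrated with a toy registry in plain Python. This is an illustrative sketch, not MLflow's actual client API; see mlflow-guide.md for the real thing.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)  # frozen: a version cannot be changed after creation
class ModelVersion:
    name: str
    version: int
    params: dict = field(default_factory=dict)   # metadata: hyperparameters
    metrics: dict = field(default_factory=dict)  # metadata: evaluation results

class ToyModelRegistry:
    """Each model name maps to an append-only list of immutable versions."""
    def __init__(self):
        self._models = {}

    def register(self, name, params, metrics):
        versions = self._models.setdefault(name, [])
        mv = ModelVersion(name, len(versions) + 1, params, metrics)
        versions.append(mv)  # versions only ever grow; old ones stay intact
        return mv

    def get(self, name, version):
        return self._models[name][version - 1]

registry = ToyModelRegistry()
v1 = registry.register("planner_model", {"lr": 3e-4}, {"loss": 0.42})
v2 = registry.register("planner_model", {"lr": 1e-4}, {"loss": 0.38})
```

Because versions are immutable, "which model was serving last Tuesday?" always has a definite answer: look up the version number.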
Experiment Tracking
Recording what happens during training:
- Parameters (learning rate, batch size)
- Metrics (loss, accuracy)
- Artefacts (model weights, configs)
In this system: MLflow Tracking
Feature Store
Centralised storage for ML features:
- Consistent feature computation
- Reuse across training and inference
In this system: PostgreSQL domain tables + event logs
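The point of "consistent feature computation" is that training and inference derive features through the same code path, so they cannot drift apart. A sketch with a hypothetical feature function (the event shape and feature names are not this system's actual schema):

```python
def session_features(events: list[dict]) -> dict:
    """Derive features from raw event rows (as loaded from event logs).
    Shared by the training pipeline AND the inference service."""
    clicks = sum(1 for e in events if e["type"] == "click")
    views = sum(1 for e in events if e["type"] == "view")
    return {
        "click_count": clicks,
        "click_through_rate": clicks / views if views else 0.0,
    }

events = [{"type": "view"}, {"type": "view"}, {"type": "click"}]
train_row = session_features(events)  # used when building the training set
serve_row = session_features(events)  # used at inference time
```

If the two call sites computed features independently, a subtle difference (say, counting double-clicks differently) would silently degrade production accuracy; sharing one function makes that class of bug impossible.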
Model Serving
Deploying models for inference:
- Low latency
- High availability
- Version management
In this system: Policy versions linked to MLflow models
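"Policy versions linked to MLflow models" can be pictured as a routing table: each policy version pins a specific model URI, and rollback is just re-activating an earlier version. A hypothetical sketch, not this system's actual service (see model-promotion.md for the real lifecycle):

```python
class PolicyRouter:
    """Maps policy versions to pinned model URIs; one version is active."""
    def __init__(self):
        self._versions = {}  # policy version -> model URI
        self._active = None

    def add_version(self, version: int, model_uri: str):
        self._versions[version] = model_uri

    def activate(self, version: int):
        self._active = version  # switching versions is instant: no retraining

    def resolve(self) -> str:
        return self._versions[self._active]

router = PolicyRouter()
router.add_version(1, "models:/planner_model/7")
router.add_version(2, "models:/planner_model/8")
router.activate(2)
current = router.resolve()      # serving the new model
router.activate(1)              # rollback = re-activate the previous version
rolled_back = router.resolve()  # back on the old model immediately
```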
A/B Testing
Comparing model variants:
- Split traffic between versions
- Measure performance differences
- Statistical significance
In this system: Experiment Assignment Service + variants
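Traffic splitting is typically done with deterministic hashing, so the same user always lands in the same variant across requests. A sketch of the idea (not the actual Experiment Assignment Service; names and the 50/50 split are illustrative):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, traffic_split: float = 0.5) -> str:
    """Deterministically bucket a user into 'candidate' or 'baseline'."""
    key = f"{experiment}:{user_id}".encode()
    # Hash to a stable pseudo-uniform value in [0, 1)
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 10_000 / 10_000
    return "candidate" if bucket < traffic_split else "baseline"

v1 = assign_variant("user-42", "planner-exp-1")
v2 = assign_variant("user-42", "planner-exp-1")  # identical: stable assignment
```

Stable assignment matters for the "statistical significance" bullet: if users flipped between variants per request, their outcomes would mix both treatments and the comparison would be meaningless.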
The ML Lifecycle
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Data │ ──►│ Train │ ──►│ Evaluate│ ──►│ Deploy │ │
│ │ Prep │ │ │ │ │ │ │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ ▲ │ │
│ │ │ │
│ └──────────────────────────────────────────────┘ │
│ Feedback Loop │
│ │
└─────────────────────────────────────────────────────────────────────┘
Data Preparation
- Collect raw data (events, logs)
- Transform into training format
- Version datasets

Training
- Load data and base model
- Fine-tune or train from scratch
- Log everything to MLflow

Evaluation
- Offline replay against historical data
- Compare with baseline
- Gate before deployment

Deployment
- Create policy version
- Create experiment variant
- Gradual rollout
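The "gate before deployment" step in the lifecycle above can be sketched as a simple comparison against the baseline. The metric name and threshold here are illustrative, not this system's actual gate (see offline-evaluation.md for that):

```python
def passes_gate(candidate_metrics: dict, baseline_metrics: dict,
                min_improvement: float = 0.0) -> bool:
    """Offline evaluation gate: allow deployment only if the candidate
    is at least as good as the current baseline (higher score = better)."""
    return candidate_metrics["score"] >= baseline_metrics["score"] + min_improvement

baseline = {"score": 0.71}   # current production model on the replay set
candidate = {"score": 0.74}  # newly trained model on the same replay set
should_deploy = passes_gate(candidate, baseline)
```

A gate like this is cheap because it runs against historical data; no user ever sees a model that fails it.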
Glossary
| Term | Definition |
|---|---|
| Artefact | Any file produced by training (model, config, tokenizer) |
| Baseline | The current production model to compare against |
| Candidate | A model that passed offline eval, ready for online testing |
| Epoch | One complete pass through training data |
| Fine-tuning | Adapting a pre-trained model to a specific task |
| Hyperparameter | Training configuration (learning rate, batch size) |
| Inference | Using a model to make predictions |
| LoRA | Low-Rank Adaptation - efficient fine-tuning method |
| QLoRA | Quantized LoRA - memory-efficient fine-tuning |
| Rollback | Reverting to a previous model version |
| Run | A single training execution with specific parameters |
Reproducibility
Every training run must be reproducible:
- Same data + same params = same model
- Version everything
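One concrete precondition for "same data + same params = same model" is seeding every source of randomness, so a rerun shuffles and samples identically. A minimal sketch using Python's stdlib `random` as a stand-in for the real training stack:

```python
import random

def make_batch_order(seed: int, n: int = 10) -> list[int]:
    """Shuffle example indices with an isolated, seeded RNG.
    Same seed + same data => same ordering on every rerun."""
    rng = random.Random(seed)  # local RNG: no hidden global state
    items = list(range(n))
    rng.shuffle(items)
    return items

run_a = make_batch_order(seed=1234)
run_b = make_batch_order(seed=1234)  # a rerun with the same recorded seed
```

In practice the seed is just another hyperparameter: log it to MLflow alongside the learning rate so any run can be replayed exactly.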
Observability
Know what's happening at all times:
- Training progress
- Model performance
- Production metrics
Safety
Protect users from bad models:
- Offline evaluation gates
- Gradual rollouts
- Quick rollback capability
Related Documents
| Document | Purpose |
|---|---|
| mlflow-guide.md | MLflow specifics |
| training-workflow.md | Training process |
| offline-evaluation.md | Evaluation gates |
| model-promotion.md | Deployment lifecycle |
After reading this document, you will understand:
- Basic MLOps concepts
- The ML lifecycle
- Key terminology
- Why MLOps practices matter
Next Step: architecture.md