⚗️ FrothIQ — Mineral Process Quality Forecasting on Databricks

🌐 English · Español

Production-grade ML platform for mineral flotation process quality prediction. Built local-first, deployable to Azure Databricks. Includes statistical process control (SPC) charts and what-if simulation for operators.

FrothIQ ingests sensor data from a mineral flotation plant (real industrial data from a Brazilian iron-ore concentration plant) and processes it through a Medallion architecture (Bronze → Silver → Gold) on Delta Lake. It trains predictive models for % Iron and % Silica in the concentrate output and serves them through a Streamlit dashboard with SPC charts and a what-if simulator. The pipeline is reproducible across environments: the same code runs on a laptop or on a Databricks cluster.

✅ Status — modeling track + serving layer complete (2026-05-02). All notebooks (EDA → features → LightGBM all-rows → LightGBM fresh-only → SPC → What-if), FastAPI inference service, Streamlit dashboard, and drift monitoring are live. Deployment to Databricks documented. 47/47 tests pass. See the Roadmap.

📊 Headline results on the Kaggle flotation dataset (737K rows, 6 months)

| Metric | All-rows model | Fresh-only model (notebook 02b) |
|---|---|---|
| Test RMSE on `% Iron Concentrate` | 1.216 | 0.786 (−35.4%) |
| Test RMSE on `% Silica Concentrate` | 1.152 | 0.823 (−28.5%) |
| Test R² on `% Iron Concentrate` | −0.171 | −0.216 |
| Train rows used | 515,677 | 42,654 |

The headline finding is not the RMSE but the structural temporal distribution shift detected between train (Mar–Jun 2017) and test (Jul–Sep 2017). Fresh-only training cuts RMSE by 28–35% by removing forward-fill noise from the supervision signal, but R² stays slightly negative because the test distribution moved.
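
Why RMSE can improve while R² stays negative is easier to see with numbers. The sketch below uses toy values (not project data) in which the test labels sit at a different level than the model's predictions, as happens after a regime shift. R² compares the model against a predictor that always outputs the test mean, so a constant offset alone is enough to push it below zero.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy numbers (illustrative only): the test labels hover around 64.0,
# while the model still predicts around a training-era level of 65.0.
y_test = [63.8, 64.0, 64.2, 63.9, 64.1]
y_pred = [65.0, 65.1, 64.9, 65.0, 65.0]

print(f"RMSE = {rmse(y_test, y_pred):.3f}")
print(f"R2   = {r2(y_test, y_pred):.1f}")
```

The error is about one unit, yet R² is strongly negative because the test labels barely vary around their own mean. A low-variance test window combined with a level shift produces exactly this pattern.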

SPC catches the shift clearly: Shewhart charts with Western Electric rules, CUSUM, and EWMA, all applied to the model residuals, detect the regime change residual by residual:

| SPC method on residuals | Signals fired | % of test rows |
|---|---|---|
| Shewhart rule 1 (±3σ) | 19 | 0.21% |
| Shewhart rule 2 (2 of 3 ±2σ) | 132 | 1.44% |
| Shewhart rule 3 (4 of 5 ±1σ) | 2,074 | 22.68% |
| Shewhart rule 4 (8 same side) | 8,816 | 96.40% |
| CUSUM (δ=1σ, h=4σ) | 8,567 | 93.68% |
| EWMA (λ=0.2, L=3) | 2,117 | 23.15% |
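
The four Shewhart rows above correspond to the classic Western Electric run rules. A minimal sketch of those rules on a standardized residual series follows. It assumes residuals have already been centered and scaled by a reference mean and sigma, and it uses a simplified counting convention (a point is flagged when the window ending at it satisfies the rule). The function name `western_electric` is hypothetical and is not taken from the repository's `spc` module.

```python
def western_electric(z):
    """Flag each point of a standardized series z (center 0, sigma 1)."""

    def run_rule(k, n, limit):
        # True at index i when at least k of the last n points (ending at i)
        # lie beyond `limit` on the same side of the center line.
        flags = []
        for i in range(len(z)):
            window = z[max(0, i - n + 1): i + 1]
            above = sum(v > limit for v in window)
            below = sum(v < -limit for v in window)
            flags.append(len(window) == n and max(above, below) >= k)
        return flags

    return {
        "rule1": run_rule(1, 1, 3.0),  # one point beyond 3 sigma
        "rule2": run_rule(2, 3, 2.0),  # 2 of 3 beyond 2 sigma, same side
        "rule3": run_rule(4, 5, 1.0),  # 4 of 5 beyond 1 sigma, same side
        "rule4": run_rule(8, 8, 0.0),  # 8 in a row on the same side
    }

# Toy series: a small but persistent positive bias, never extreme.
biased = [0.5] * 10
flags = western_electric(biased)
print({name: sum(f) for name, f in flags.items()})
```

A small persistent bias never trips rule 1 but trips rule 4 on every point once eight have accumulated, which matches the pattern in the table: few ±3σ excursions, near-total rule 4 coverage.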

The CUSUM Cl statistic ramps to ~1000 over thousands of rows — visual proof of sustained model bias as the plant operating regime drifts. This is exactly what production SPC is for: catching the moment a model starts being systematically wrong, before the lab QA confirms the quality drift.
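
To make the ramp concrete, here is a sketch of a tabular CUSUM and an EWMA chart on standardized residuals, using the parameter values quoted in the table (δ=1σ, h=4σ, λ=0.2, L=3). The reference value k = δ/2 and the absence of a reset after a signal are assumptions made for this illustration; the repository's implementation may differ.

```python
import math

def cusum(z, delta=1.0, h=4.0):
    """Tabular CUSUM on a standardized series; no reset after a signal."""
    k = delta / 2.0  # reference value (allowance), assumed to be delta / 2
    c_hi = c_lo = 0.0
    hi, lo, signals = [], [], []
    for v in z:
        c_hi = max(0.0, c_hi + v - k)  # accumulates upward drift
        c_lo = max(0.0, c_lo - v - k)  # accumulates downward drift
        hi.append(c_hi)
        lo.append(c_lo)
        signals.append(c_hi > h or c_lo > h)
    return hi, lo, signals

def ewma_chart(z, lam=0.2, L=3.0):
    """EWMA chart with time-varying control limits, starting from 0."""
    stat, signals, e = [], [], 0.0
    for i, v in enumerate(z, start=1):
        e = lam * v + (1.0 - lam) * e
        limit = L * math.sqrt(lam / (2.0 - lam) * (1.0 - (1.0 - lam) ** (2 * i)))
        stat.append(e)
        signals.append(abs(e) > limit)
    return stat, signals

# Toy series: the model over-predicts by 1.5 sigma on every row.
z = [-1.5] * 10
hi, lo, cusum_signals = cusum(z)
print(lo)                  # lower statistic grows by 1.0 per row
print(sum(cusum_signals))  # rows where a statistic exceeds h
```

Because the statistic is never reset, a sustained bias makes it grow roughly linearly with the number of rows, which is the kind of ramp described above.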

Honest findings

- The Kaggle flotation dataset has 91.73% forward-filled labels. Training on all rows treats forward-fills as ground truth and yields a model that beats the naive baseline by only 1.5–4%. Restricting to the 8.27% fresh lab readings is the methodologically correct path; it is documented in notebook 02b.
- Even with the fresh-only filter, R² remains slightly negative. The dominant problem is temporal distribution shift between the first 70% of the timeline (train) and the last 15% (test). Operating regime, feed source, or instrument calibration changed mid-dataset, which is a real, common, and underreported phenomenon in industrial ML.
- The feature importances are physically interpretable. For % Iron Concentrate the top driver is pct_iron_feed (more iron in, more iron out). For % Silica Concentrate the top drivers are starch flow and ore pulp density. Starch is the reagent that depresses iron oxides in reverse cationic flotation, and pulp density is a core operating variable of the circuit, so both plausibly control how much silica reports to the concentrate.
- The what-if simulator is robust to single-point overrides (Δ predicted ≈ 0 for any pH override). This is consistent with a model that has learned that one-instant excursions do not predict steady-state quality; only sustained changes (over a window of 30 minutes or more) move the prediction. This is a feature, not a bug: in production, the simulator would override a contiguous window of cycles, not a single timestep.
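
The last point can be illustrated without any model at all. The sketch below assumes the model sees pH through a 30-minute rolling mean (90 cycles at 20 seconds), which is an assumption based on the rolling features mentioned elsewhere in this README. The helper names are hypothetical. It compares how much that feature moves under a single-cycle override versus a full-window override.

```python
WINDOW = 90  # 30 minutes of 20-second cycles

def rolling_mean_last(values, window):
    """Mean of the last `window` values, i.e. the feature at the newest cycle."""
    tail = values[-window:]
    return sum(tail) / len(tail)

def override(values, start, length, new_value):
    """Return a copy with `length` cycles starting at `start` set to `new_value`."""
    patched = list(values)
    end = min(start + length, len(patched))
    for i in range(start, end):
        patched[i] = new_value
    return patched

# Toy history: one hour of pH readings held at 10.0.
ph = [10.0] * 180
baseline = rolling_mean_last(ph, WINDOW)

# Override only the newest cycle versus the whole last 30 minutes.
single = rolling_mean_last(override(ph, 179, 1, 10.5), WINDOW)
sustained = rolling_mean_last(override(ph, 90, 90, 10.5), WINDOW)

delta_single = single - baseline
delta_sustained = sustained - baseline
print(f"single-cycle override moves the feature by {delta_single:.4f}")
print(f"full-window override moves the feature by  {delta_sustained:.4f}")
```

A single overridden cycle is diluted by a factor of 90 before the model ever sees it, so a near-zero change in prediction is the expected outcome under this assumption.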

🎯 What this project proves

| Capability | Evidence |
|---|---|
| Mining domain | Real industrial data from a flotation plant (737K rows, 24 columns) |
| Big data on Spark | PySpark pipeline runs locally and on Databricks unchanged |
| Medallion architecture | Bronze (raw) → Silver (clean) → Gold (features) on Delta Lake |
| Production ML | LightGBM + LSTM (PyTorch) tracked in MLflow with model signatures |
| Statistical Process Control | Shewhart charts, CUSUM, Western Electric rules for operator alerts |
| What-if simulation | Operator dashboard: "what if I change pH from X to Y?" |
| MLOps rigor | Tests + CI, Docker, drift monitoring, model registry with aliases |
| Local-first → Cloud | Local runs are reproducible on Databricks; deployment guide included |

🏗️ Architecture

┌──────────────────────┐    ┌────────────────────┐    ┌─────────────────────┐
│  Kaggle CSV          │───▶│  Bronze Layer      │───▶│  Silver Layer       │
│  (737K × 24)         │    │  (Delta — append)  │    │  (clean, typed)     │
│  Real flotation data │    └────────────────────┘    └──────────┬──────────┘
└──────────────────────┘                                          │
                                                                  ▼
       ┌─────────────────────────────┐                ┌────────────────────┐
       │  Streamlit dashboard        │◀───────────────│  Gold Layer        │
       │  • SPC charts               │                │  (features+target) │
       │  • What-if simulator        │                └──────────┬─────────┘
       │  • Live predictions         │                           │
       └─────────────┬───────────────┘                           ▼
                     │                                ┌────────────────────┐
                     │ HTTP                           │  MLflow Tracking   │
                     ▼                                │  + Model Registry  │
       ┌─────────────────────────────┐                │  (alias @prod)     │
       │  FastAPI service            │◀───────────────└──────────┬─────────┘
       │  /predict_quality           │                           │
       │  /sim_whatif                │                           ▼
       └─────────────────────────────┘                ┌────────────────────┐
                                                      │  Evidently AI      │
                                                      │  (drift monitoring)│
                                                      └────────────────────┘

📊 Dataset

Quality Prediction in a Mining Process by Eduardo Magalhães Oliveira (Kaggle, ~700K downloads).

Real industrial data from a Brazilian iron-ore concentration plant. The dataset:

- 737,453 rows sampled at 20-second intervals over ~6 months
- 24 columns: a `date` timestamp, % Iron Feed, % Silica Feed, Starch Flow, Amina Flow, Ore Pulp Flow, Ore Pulp pH, Ore Pulp Density, 7 Flotation Column Air Flow columns, 7 Flotation Column Level columns, and the two targets
- Targets (lab measurements, hourly): % Iron Concentrate and % Silica Concentrate
- Goal: predict outputs from upstream sensor measurements so operators can adjust parameters before product goes off-spec
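
As a rough illustration of the Bronze → Silver cleaning step for this dataset, the sketch below parses a few raw rows into typed records. It assumes the raw CSV stores numbers with a comma as the decimal separator and timestamps in `YYYY-MM-DD HH:MM:SS` form. The sample values are invented, and `to_silver` is a hypothetical helper, not a function from `src/frothiq`.

```python
import csv
import io
from datetime import datetime

# Invented sample in the assumed raw layout (comma decimals, quoted fields).
RAW = '''date,% Iron Feed,Ore Pulp pH,% Silica Concentrate
2017-03-10 01:00:00,"55,2","10,0664","1,31"
2017-03-10 01:00:00,"55,2","10,0672","1,31"
'''

def to_silver(raw_text):
    """Parse raw CSV text into typed rows: datetime for date, float elsewhere."""
    rows = []
    for rec in csv.DictReader(io.StringIO(raw_text)):
        clean = {"date": datetime.strptime(rec["date"], "%Y-%m-%d %H:%M:%S")}
        for col, val in rec.items():
            if col != "date":
                # Convert "55,2" style decimals to floats.
                clean[col] = float(val.replace(",", "."))
        rows.append(clean)
    return rows

silver = to_silver(RAW)
print(silver[0])
```

In the actual pipeline this step would run in PySpark against Delta tables; the pure-Python version only shows the typing logic.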

📂 Project structure

.
├── src/frothiq/
│   ├── data/             # Kaggle ingestion, Bronze loaders
│   ├── features/         # Rolling stats, lag aggregations, frequency-domain
│   ├── models/
│   │   ├── baseline/     # LightGBM, XGBoost, scikit-learn baselines
│   │   ├── deep/         # LSTM (PyTorch) sequence models
│   │   ├── spc/          # Shewhart, CUSUM, Western Electric rules
│   │   └── whatif/       # What-if simulation (counterfactual predictions)
│   ├── serving/          # FastAPI app, Streamlit dashboard
│   └── utils/            # MLflow helpers, logging, configs
├── notebooks/            # 00_eda → 01_features → 02_baseline → 03_lstm → 04_spc → 05_whatif
├── data/                 # raw / interim / processed (gitignored)
├── configs/              # YAML configs per dataset / model
├── scripts/              # download_data.sh, train_*.py, deploy_databricks.sh
├── tests/                # pytest suite
├── .github/workflows/    # CI: ruff + pytest
└── docs/                 # ADRs, dataset cards, Databricks deployment guide
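
The `features/` entry above mentions rolling stats and lag aggregations. A minimal sketch of what such features might look like for one sensor channel follows. It uses only past and current values so no future information leaks into a row. `rolling_features` is a hypothetical name and the window length is arbitrary.

```python
from collections import deque
from statistics import fmean, pstdev

def rolling_features(values, window):
    """Trailing-window mean, std, and oldest-in-window lag for each row."""
    buf = deque(maxlen=window)  # keeps only the last `window` values
    feats = []
    for v in values:
        buf.append(v)
        feats.append({
            "mean": fmean(buf),
            "std": pstdev(buf),
            "lag": buf[0],  # value from (window - 1) steps ago once the buffer is full
        })
    return feats

feats = rolling_features([1.0, 2.0, 3.0, 4.0], window=2)
print(feats[-1])
```

For the 20-second data, a window of 90 rows would correspond to 30 minutes and 180 rows to one hour.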

🚀 Quickstart

# 1. Clone
git clone https://github.com/jsanchez-ds/frothiq.git
cd frothiq

# 2. Install (Python 3.11+ recommended)
python -m venv .venv
.venv\Scripts\activate          # Windows (use 'source .venv/bin/activate' on Mac/Linux)
pip install -e ".[dev]"

# 3. Get Kaggle API token (one-time)
#    https://www.kaggle.com/settings → API → Create New Token
#    Save kaggle.json to ~/.kaggle/kaggle.json (or %USERPROFILE%\.kaggle on Windows)

# 4. Download the dataset (~125 MB)
bash scripts/download_data.sh

# 5. Open the EDA notebook
jupyter lab notebooks/00_eda.ipynb

🧪 Modeling approach

| Model | Library | Target | Notes |
|---|---|---|---|
| LightGBM | `lightgbm` | `% Iron Concentrate`, `% Silica Concentrate` | Tabular baseline with rolling features |
| Quantile LightGBM | `lightgbm` | P10 / P50 / P90 of targets | Confidence intervals for SPC alerts |
| LSTM (PyTorch) | `torch` | Same | Sequence-aware; benchmark vs tabular |
| Western Electric SPC | `pyspc` (custom) | Sensor channels | Detection rules for operator alerts |
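
The quantile models in the table are trained against the pinball (quantile) loss rather than squared error. In LightGBM this is typically selected with `objective="quantile"` and an `alpha` equal to the target quantile. The loss itself is simple enough to write out, as in this sketch:

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss for quantile level q in (0, 1)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        diff = t - p
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)

# For the P90 model, predicting too low costs nine times more than too high.
under = pinball_loss([2.0], [1.0], q=0.9)
over = pinball_loss([1.0], [2.0], q=0.9)
print(under, over)
```

That asymmetry is what pushes the P90 model above most observations and the P10 model below them, giving the band used for the alert thresholds.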

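Validation: temporal split (train: first 70%, val: next 15%, test: last 15%) with no random shuffling. The hourly lab measurements are joined to the 20-second sensor data via temporal forward-fill, so each lab value is repeated across the sensor rows that follow it until the next reading arrives.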
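
A sketch of the split and of one way to flag fresh lab readings is shown below. The fresh-label criterion used here (the label differs from the previous row) is an assumption for illustration. It would miss two consecutive lab readings that happen to be identical, and the criterion actually used in notebook 02b may differ.

```python
def temporal_split(rows, train_pct=70, val_pct=15):
    """Split time-ordered rows into train / val / test without shuffling."""
    n = len(rows)
    i_train = n * train_pct // 100
    i_val = n * (train_pct + val_pct) // 100
    return rows[:i_train], rows[i_train:i_val], rows[i_val:]

def fresh_mask(labels):
    """True where the label changes from the previous row (assumed fresh reading)."""
    return [i == 0 or labels[i] != labels[i - 1] for i in range(len(labels))]

# Toy forward-filled label column: three lab readings after the first value.
silica = [1.31, 1.31, 1.31, 1.11, 1.11, 1.27]
mask = fresh_mask(silica)
print(mask)

train, val, test = temporal_split(list(range(20)))
print(len(train), len(val), len(test))
```

Integer percentages are used for the split points to avoid floating-point rounding at the boundaries.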

🗺️ Roadmap

📜 License

MIT — see LICENSE.

👤 Author

Jonathan Sánchez Pesantes — Industrial Engineer · Data Scientist 🔗 linkedin.com/in/jonasanchez · github.com/jsanchez-ds
