┌─────────────────────────────────────────────────────────────────┐
│                          DATA SOURCES                           │
│  ┌──────────────┐                                               │
│  │    Kaggle    │  →  Car Price Dataset (CSV)                   │
│  └──────────────┘                                               │
└─────────────────────────────────────────────────────────────────┘
                                 ↓
┌─────────────────────────────────────────────────────────────────┐
│                      DATABRICKS WORKSPACE                       │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                      DATA PROCESSING                      │  │
│  │  • Upload CSV to DBFS                                     │  │
│  │  • Feature Engineering (car_age, km_per_year, etc.)       │  │
│  │  • Data Validation & Cleaning                             │  │
│  │  • Train/Test Split                                       │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                ↓                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                MLFLOW EXPERIMENT TRACKING                 │  │
│  │  • Linear Regression → Track params, metrics              │  │
│  │  • Random Forest     → Track params, metrics              │  │
│  │  • XGBoost           → Track params, metrics              │  │
│  │  • LightGBM          → Track params, metrics              │  │
│  │  • Compare all models                                     │  │
│  │  • Log artifacts (plots, feature importance)              │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                ↓                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                   MLFLOW MODEL REGISTRY                   │  │
│  │  • Register best model                                    │  │
│  │  • Versions: 1, 2, 3, ...                                 │  │
│  │  • Stages: None → Staging → Production → Archived         │  │
│  │  • Model metadata & lineage                               │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                ↓                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │            DATABRICKS MODEL SERVING (Optional)            │  │
│  │  • REST API endpoint                                      │  │
│  │  • Auto-scaling                                           │  │
│  │  • A/B Testing capability                                 │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                                 ↓
┌─────────────────────────────────────────────────────────────────┐
│                        GITHUB REPOSITORY                        │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                        SOURCE CODE                        │  │
│  │  • Data pipeline (src/data/)                              │  │
│  │  • ML models     (src/models/)                            │  │
│  │  • Flask API     (src/inference/)                         │  │
│  │  • Tests         (tests/)                                 │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                ↓                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                  GITHUB ACTIONS (CI/CD)                   │  │
│  │                                                           │  │
│  │      ┌──────────────────┐       ┌──────────────────┐      │  │
│  │      │   CI PIPELINE    │       │   CD PIPELINE    │      │  │
│  │      │                  │       │                  │      │  │
│  │      │ • Lint code      │       │ • Build Docker   │      │  │
│  │      │ • Run tests      │       │ • Push to        │      │  │
│  │      │ • Check coverage │       │   registry       │      │  │
│  │      │ • Build Docker   │       │ • Deploy to      │      │  │
│  │      │                  │       │   Kubernetes     │      │  │
│  │      └──────────────────┘       └──────────────────┘      │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                                 ↓
┌─────────────────────────────────────────────────────────────────┐
│                    DOCKER CONTAINER REGISTRY                    │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Docker Images (tagged versions)                          │  │
│  │  • car-price-predictor:latest                             │  │
│  │  • car-price-predictor:v1.0                               │  │
│  │  • car-price-predictor:main-abc123                        │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                                 ↓
┌─────────────────────────────────────────────────────────────────┐
│                KUBERNETES CLUSTER (Cloud/Local)                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  NAMESPACE: mlops                                         │  │
│  │                                                           │  │
│  │  ┌─────────────────────────────────────────────────────┐  │  │
│  │  │          DEPLOYMENT (car-price-deployment)          │  │  │
│  │  │     ┌─────────┐     ┌─────────┐     ┌─────────┐     │  │  │
│  │  │     │  Pod 1  │     │  Pod 2  │     │  Pod 3  │     │  │  │
│  │  │     │  Flask  │     │  Flask  │     │  Flask  │     │  │  │
│  │  │     │   API   │     │   API   │     │   API   │     │  │  │
│  │  │     └─────────┘     └─────────┘     └─────────┘     │  │  │
│  │  │    Auto-scaling: 2-10 pods based on CPU/Memory      │  │  │
│  │  └─────────────────────────────────────────────────────┘  │  │
│  │                             │                             │  │
│  │  ┌─────────────────────────────────────────────────────┐  │  │
│  │  │             SERVICE (car-price-service)             │  │  │
│  │  │                 Type: LoadBalancer                  │  │  │
│  │  │                   Port: 80 → 5000                   │  │  │
│  │  └─────────────────────────────────────────────────────┘  │  │
│  │                             │                             │  │
│  │  ┌─────────────────────────────────────────────────────┐  │  │
│  │  │             INGRESS (car-price-ingress)             │  │  │
│  │  │                 HTTPS with TLS/SSL                  │  │  │
│  │  │        Domain: car-price-api.yourdomain.com         │  │  │
│  │  └─────────────────────────────────────────────────────┘  │  │
│  │                                                           │  │
│  │  ┌─────────────────────────────────────────────────────┐  │  │
│  │  │                 CONFIGMAP & SECRETS                 │  │  │
│  │  │  • Databricks credentials                           │  │  │
│  │  │  • MLflow configuration                             │  │  │
│  │  │  • Model names & versions                           │  │  │
│  │  └─────────────────────────────────────────────────────┘  │  │
│  │                                                           │  │
│  │  ┌─────────────────────────────────────────────────────┐  │  │
│  │  │           HORIZONTAL POD AUTOSCALER (HPA)           │  │  │
│  │  │  • Scale based on CPU (70% threshold)               │  │  │
│  │  │  • Scale based on Memory (80% threshold)            │  │  │
│  │  │  • Min: 2 pods, Max: 10 pods                        │  │  │
│  │  └─────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                                 ↓
┌─────────────────────────────────────────────────────────────────┐
│                       END USERS / CLIENTS                       │
│     ┌────────────┐      ┌────────────┐      ┌────────────┐      │
│     │   Web UI   │      │   Mobile   │      │    API     │      │
│     │ (Frontend) │      │    App     │      │  Clients   │      │
│     └────────────┘      └────────────┘      └────────────┘      │
│                                                                 │
│  API Endpoints:                                                 │
│  • GET  /health        → Health check                           │
│  • GET  /              → Service info                           │
│  • POST /predict       → Single prediction                      │
│  • POST /predict/batch → Batch predictions                      │
│  • GET  /model/info    → Model metadata                         │
│  • POST /model/reload  → Reload model                           │
└─────────────────────────────────────────────────────────────────┘
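
The endpoint table at the bottom of the diagram can be illustrated with a small, framework-free sketch. This is not the project's Flask application: it is a plain-Python route table, written so it runs without any third-party package. The handler names, the `km_driven` field, the `instances` key, the response keys, and the toy pricing rule are all assumptions made for illustration; the real service would call a model loaded from the MLflow Model Registry.

```python
from typing import Callable, Dict, Optional, Tuple

Response = Tuple[int, dict]  # (HTTP status code, JSON-serializable body)


def predict_price(features: dict) -> float:
    """Hypothetical stand-in for the registered model (toy rule, not trained)."""
    price = 20000.0 - 1000.0 * features["car_age"] - features["km_driven"] / 20.0
    return max(0.0, price)


def health(_body: dict) -> Response:
    # GET /health: liveness information only, no model call
    return 200, {"status": "healthy"}


def predict(body: dict) -> Response:
    # POST /predict: validate the required fields, then score one record
    missing = [key for key in ("car_age", "km_driven") if key not in body]
    if missing:
        return 400, {"error": f"missing fields: {missing}"}
    return 200, {"predicted_price": predict_price(body)}


def predict_batch(body: dict) -> Response:
    # POST /predict/batch: score each record, fail fast on the first bad one
    prices = []
    for item in body.get("instances", []):
        status, payload = predict(item)
        if status != 200:
            return status, payload
        prices.append(payload["predicted_price"])
    return 200, {"predicted_prices": prices}


# Route table keyed by (method, path); only three of the six endpoints are sketched
ROUTES: Dict[Tuple[str, str], Callable[[dict], Response]] = {
    ("GET", "/health"): health,
    ("POST", "/predict"): predict,
    ("POST", "/predict/batch"): predict_batch,
}


def dispatch(method: str, path: str, body: Optional[dict] = None) -> Response:
    """Look up the handler for a request and return (status, body)."""
    handler = ROUTES.get((method, path))
    if handler is None:
        return 404, {"error": "not found"}
    return handler(body or {})
```

In a Flask implementation each handler would presumably become a view function registered on the same path, with the status code and dictionary returned as a JSON response.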

1. DATA INGESTION
   Kaggle CSV → Databricks DBFS
2. PREPROCESSING
   Raw data → Cleaned data → Feature engineering
3. MODEL TRAINING
   Multiple algorithms → MLflow tracking → Best model selection
4. MODEL REGISTRATION
   Best model → MLflow Registry → Version control
5. MODEL STAGING
   Registered model → Staging environment → Testing
6. MODEL PRODUCTION
   Staging → Production → Load balancing
7. CONTAINERIZATION
   Python code + Model → Docker image → Container registry
8. DEPLOYMENT
   Docker image → Kubernetes pods → Auto-scaling
9. SERVING
   HTTP requests → Load balancer → Pods → Predictions
10. MONITORING
    Logs + Metrics → Dashboards → Alerts
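
As an illustration of the preprocessing step (step 2), the two engineered features named in the diagram, `car_age` and `km_per_year`, could be derived roughly as follows. This is a minimal sketch in plain Python rather than the project's pandas pipeline; the input column names `year` and `km_driven` are assumptions about the Kaggle CSV, not confirmed by this document.

```python
from datetime import date
from typing import Optional


def engineer_features(record: dict, current_year: Optional[int] = None) -> dict:
    """Return a copy of `record` with car_age and km_per_year added.

    `year` (manufacturing year) and `km_driven` are assumed column names.
    """
    reference_year = current_year if current_year is not None else date.today().year
    car_age = reference_year - record["year"]
    # Treat cars built in the reference year as one year old for the ratio,
    # which avoids a division by zero.
    km_per_year = record["km_driven"] / max(car_age, 1)
    return {**record, "car_age": car_age, "km_per_year": km_per_year}


row = engineer_features({"year": 2015, "km_driven": 80000}, current_year=2023)
print(row["car_age"], row["km_per_year"])
```

Passing `current_year` explicitly keeps the transformation reproducible between training and serving, since the same record otherwise ages as the calendar advances.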

┌──────────────────────────────────────────────────────────┐
│                    DEVELOPER WORKFLOW                    │
└──────────────────────────────────────────────────────────┘
                             ↓
                        Code changes
                             ↓
                Git commit & push to GitHub
                             ↓
┌──────────────────────────────────────────────────────────┐
│               CONTINUOUS INTEGRATION (CI)                │
│                                                          │
│  1. Checkout code                                        │
│  2. Set up Python                                        │
│  3. Install dependencies                                 │
│  4. Lint code (flake8, black)                            │
│  5. Run unit tests                                       │
│  6. Generate coverage report                             │
│  7. Build Docker image                                   │
│  8. Test Docker image                                    │
│                                                          │
│  If all pass ✓ → Continue to CD                          │
│  If any fail ✗ → Stop and notify                         │
└──────────────────────────────────────────────────────────┘
                             ↓
┌──────────────────────────────────────────────────────────┐
│                CONTINUOUS DEPLOYMENT (CD)                │
│                                                          │
│  1. Build production Docker image                        │
│  2. Tag image (version, git hash)                        │
│  3. Push to container registry                           │
│  4. Update Kubernetes manifests                          │
│  5. Apply to cluster                                     │
│  6. Wait for rollout                                     │
│  7. Verify deployment                                    │
│  8. Run smoke tests                                      │
│                                                          │
│  If success ✓ → Service updated                          │
│  If fail ✗ → Automatic rollback                          │
└──────────────────────────────────────────────────────────┘
                             ↓
                   Production environment
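
The CD step "Tag image (version, git hash)" can be sketched as a small helper that produces tags in the same shape as the registry examples (`latest`, `v1.0`, `main-abc123`). The function name, the seven-character short hash, and the rule that only `main` receives `latest` are assumptions for illustration; the actual workflow may tag differently.

```python
from typing import List, Optional


def image_tags(repo: str, branch: str, commit_sha: str,
               version: Optional[str] = None) -> List[str]:
    """Build the list of Docker tags for one CD run (hypothetical helper)."""
    # Docker tags may not contain "/", so branch names such as "feature/x"
    # are flattened before use.
    safe_branch = branch.replace("/", "-")
    short_sha = commit_sha[:7]

    tags = [f"{repo}:{safe_branch}-{short_sha}"]  # always traceable to a commit
    if version:
        tags.append(f"{repo}:v{version}")          # release tag, e.g. v1.0
    if branch == "main":
        tags.append(f"{repo}:latest")              # assumed: main tracks latest
    return tags


for tag in image_tags("car-price-predictor", "main", "abc1234def", version="1.0"):
    print(tag)
```

Deploying the commit-specific tag rather than `latest` makes a rollback a matter of re-applying the previous tag.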

┌─────────────────────────────────────────────────────┐
│                   MODEL LIFECYCLE                   │
└─────────────────────────────────────────────────────┘
                           ↓
[1] DEVELOPMENT
    - Train multiple models
    - Compare in MLflow
    - Select best performer
                           ↓
[2] REGISTRATION
    - Register in MLflow Registry
    - Assign version number
    - Add metadata & tags
                           ↓
[3] STAGING
    - Transition to Staging stage
    - Run validation tests
    - A/B test with small traffic
    - Monitor performance
                           ↓
[4] PRODUCTION
    - Transition to Production
    - Serve 100% traffic
    - Continuous monitoring
    - Track drift & performance
                           ↓
[5] MONITORING & RETRAINING
    - Detect model drift
    - Performance degradation?
        Yes → Trigger retraining
        No  → Continue monitoring
                           ↓
[6] ARCHIVAL (when outdated)
    - Archive old version
    - Keep for audit trail
    - Maintain lineage
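
The lifecycle above amounts to a small state machine plus one retraining decision. The sketch below encodes the stage order None → Staging → Production → Archived as an explicit policy and adds a simple degradation check. It does not call MLflow; the Staging → Archived shortcut, the RMSE metric, and the 10% tolerance are assumptions chosen for illustration.

```python
# Allowed stage transitions under this sketch's policy. The
# Staging -> Archived edge (rejecting a candidate) is an assumption.
ALLOWED_TRANSITIONS = {
    "None": {"Staging"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": set(),
}


def transition(current_stage: str, target_stage: str) -> str:
    """Return the new stage, or raise ValueError if the move is not allowed."""
    if target_stage not in ALLOWED_TRANSITIONS.get(current_stage, set()):
        raise ValueError(f"cannot move from {current_stage} to {target_stage}")
    return target_stage


def needs_retraining(baseline_rmse: float, current_rmse: float,
                     tolerance: float = 0.10) -> bool:
    """Flag degradation when the error grows beyond the tolerated margin.

    Both the metric (RMSE) and the 10% default tolerance are hypothetical.
    """
    return current_rmse > baseline_rmse * (1.0 + tolerance)


stage = transition("None", "Staging")
stage = transition(stage, "Production")
print(stage, needs_retraining(baseline_rmse=1000.0, current_rmse=1200.0))
```

Keeping the policy in one table makes it easy to review which promotions a pipeline is allowed to perform automatically.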

DATA LAYER
├── Source: Kaggle
├── Storage: Databricks DBFS / Delta Lake
└── Format: CSV → Parquet

PROCESSING LAYER
├── Language: Python 3.9
├── Libraries: pandas, numpy, scikit-learn
├── Feature Eng: Custom transformers
└── Validation: Custom validators

ML LAYER
├── Frameworks: scikit-learn, XGBoost, LightGBM
├── Experiment Tracking: MLflow
├── Model Registry: MLflow Model Registry
├── Versioning: Git + MLflow
└── Deployment: MLflow Models

API LAYER
├── Framework: Flask
├── Protocol: HTTP/REST
├── Format: JSON
├── Auth: (Can add JWT/OAuth)
└── Docs: (Can add Swagger)

CONTAINERIZATION LAYER
├── Container: Docker
├── Base Image: python:3.9-slim
├── Build: Multi-stage
└── Registry: GitHub Container Registry / Azure ACR

ORCHESTRATION LAYER
├── Platform: Kubernetes
├── Provider: Azure AKS / AWS EKS / GCP GKE
├── Auto-scaling: HPA
├── Load Balancing: K8s Service
└── Ingress: Nginx Ingress

CI/CD LAYER
├── Version Control: Git
├── Repository: GitHub
├── CI/CD: GitHub Actions
├── Testing: pytest
└── Quality: flake8, black, coverage

MONITORING LAYER (Optional)
├── Metrics: Prometheus
├── Visualization: Grafana
├── Logging: ELK Stack
└── Alerting: PagerDuty / Slack
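
To make the HPA entry in the orchestration layer concrete, the sketch below applies the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), to the settings used in this architecture (CPU target 70%, memory target 80%, 2 to 10 pods). It is a simplified model: the real controller also applies a tolerance band and stabilization windows, which are omitted here, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas: int, cpu_utilization: float,
                     memory_utilization: float,
                     cpu_target: float = 70.0, memory_target: float = 80.0,
                     min_replicas: int = 2, max_replicas: int = 10) -> int:
    """Approximate the HPA decision for two utilization metrics (percent)."""
    # One proposal per metric: ceil(current * observed / target)
    by_cpu = math.ceil(current_replicas * cpu_utilization / cpu_target)
    by_memory = math.ceil(current_replicas * memory_utilization / memory_target)

    # With several metrics, the largest proposal wins, then the result is
    # clamped to the configured minimum and maximum.
    proposed = max(by_cpu, by_memory)
    return max(min_replicas, min(max_replicas, proposed))


print(desired_replicas(current_replicas=3, cpu_utilization=90.0,
                       memory_utilization=40.0))
```

With 3 pods at 90% CPU and 40% memory, the CPU metric proposes 4 pods and memory proposes 2, so the sketch returns 4.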

This architecture provides:
- ✅ Scalability: Auto-scaling with Kubernetes
- ✅ Reliability: Health checks, auto-restart
- ✅ Maintainability: Clean code, tests, docs
- ✅ Observability: Logging, monitoring, metrics
- ✅ Security: Secrets management, HTTPS
- ✅ Automation: Complete CI/CD pipeline