🤖 Federated Digital Twin with Kubeflow

Stack: Kubeflow Pipelines, PyTorch, Flower

A personal project for learning Distributed Digital Twin Training using Federated Learning. This implementation acts as a simulation framework designed for local Kubernetes clusters (e.g., Kind, Minikube) to mock a distributed fleet of systems. While tested extensively on macOS (Colima/Kind), it is compatible with any local Kubernetes environment.


πŸ— Architecture: Digital Twins Meet Federated Learning

🤖 What are Digital Twins?

A Digital Twin is not just a replica; it is a living counterpart of a physical system that mirrors its behavior, state, and characteristics in real time. In practice, digital twins continuously sync with their physical counterparts through sensors and data feeds.

Physical System ⟷ Digital Twin (Real-time Sync)
     🤖      ⟷        💻

For this project: We use simulated environments (CartPole) as stand-ins for physical systems. While not connected to real hardware, they demonstrate the core FL+DT concepts by creating diverse physics variations.

The Challenge: Every physical system is unique (internal tolerances, environmental conditions, and hardware aging).

  • System A operates under specific stress conditions.
  • System B is a newer model with slightly different response times.
  • System C has unique operational wear and tear.

Traditional Approach: Train one model on one perfect simulation ❌
Our Approach: Create multiple digital twins, each with different physics ✅
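The "different physics" per twin can be produced by perturbing a nominal parameter set. Below is a minimal, self-contained sketch; the function name `make_twin_physics` and the parameter values are illustrative, not taken from the repo:

```python
import random

# Illustrative nominal CartPole-style physics (not the repo's actual values)
NOMINAL = {"gravity": 9.8, "pole_length": 0.5, "cart_mass": 1.0}

def make_twin_physics(twin_id: int, spread: float = 0.2) -> dict:
    """Perturb each nominal parameter by up to +/- `spread` (relative),
    seeded by twin_id so every twin gets a stable, unique environment."""
    rng = random.Random(twin_id)
    return {k: v * (1.0 + rng.uniform(-spread, spread)) for k, v in NOMINAL.items()}

# Three twins, three different physics configurations
fleet = {f"twin-{i}": make_twin_physics(i) for i in range(3)}
```

Seeding by twin ID keeps each twin's environment reproducible across rounds while still making the fleet diverse.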

🌐 Federated Architecture

Instead of collecting all data centrally, each twin learns locally and shares only its "intelligence" (model weights).

```mermaid
graph TD
    Server[("☁️ FL Server (Aggregator)")]

    subgraph "Local Workforce (Edge)"
        T1["🤖 Digital Twin A<br/>(Local View)"]
        T2["🤖 Digital Twin B<br/>(Local View)"]
        T3["🤖 Digital Twin C<br/>(Local View)"]
    end

    %% Step 1: Broadcast
    Server -- "1. Broadcast Global Model" --> T1
    Server -- "1. Broadcast Global Model" --> T2
    Server -- "1. Broadcast Global Model" --> T3

    %% Step 2: Local Training (Self-loops for clarity)
    T1 -- "2. Local Training (Data stays here!)" --> T1
    T2 -- "2. Local Training (Data stays here!)" --> T2
    T3 -- "2. Local Training (Data stays here!)" --> T3

    %% Step 3: Upload
    T1 -- "3. Upload Weights Only" --> Server
    T2 -- "3. Upload Weights Only" --> Server
    T3 -- "3. Upload Weights Only" --> Server

    style Server fill:#f5f5f5,stroke:#333,stroke-width:2px
```
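The aggregation step behind this diagram is FedAvg: a sample-weighted average of the workers' parameters. Here is a self-contained sketch over plain Python lists; the real implementation would average PyTorch `state_dict` tensors, and `fedavg` is a hypothetical name, not the repo's function:

```python
def fedavg(worker_weights, num_samples):
    """Federated averaging: each worker's parameter vector is weighted
    by the number of local samples it trained on."""
    total = sum(num_samples)
    n_params = len(worker_weights[0])
    return [
        sum(w[i] * n for w, n in zip(worker_weights, num_samples)) / total
        for i in range(n_params)
    ]

# Three workers, two parameters each, equal sample counts
avg = fedavg([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [10, 10, 10])
# avg == [3.0, 4.0]
```

With equal sample counts FedAvg reduces to a plain mean; unequal counts pull the global model toward the workers that saw more data.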

🌟 Why choose Federated Architecture?

| Benefit | How it works in this project |
| --- | --- |
| 🔐 Data Privacy | Raw training data (states and transitions) never leaves the local Digital Twin. |
| 🚀 Efficiency | We only transmit model parameters, avoiding the need to transfer full datasets. |
| 🌍 Diversity | The global model learns from the unique physical variations of every twin simultaneously. |
| 🛡️ Robustness | If one twin has corrupted data or is offline, the global model still benefits from the rest of the fleet. |
| ✨ Generalization | The resulting policy is more robust than any model trained on a single environmental variation. |

🔄 The FL Training Cycle

Each round of federated learning follows a structured synchronization loop:

```mermaid
sequenceDiagram
    participant S as FL Server
    participant W as Workers (Twins)
    participant E as Neutral Eval Twin

    Note over S, E: Round N Starts
    S->>W: 1. Broadcast Global Model Weights
    rect rgb(240, 240, 240)
        Note right of W: 2. Parallel Local Training
        W->>W: Learn from unique physics
    end
    W->>S: 3. Return Updated Local Weights
    S->>S: 4. FedAvg Aggregation
    S->>E: 5. Evaluate Global Generalization
    Note over S, E: Round N+1 Continues...
```
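The five-step sequence collapses into a compact orchestration loop. Everything below is a schematic under stated assumptions: `local_train` and `evaluate` are caller-supplied stubs, not functions from the repo, and parameters are plain lists rather than tensors:

```python
def run_federated_training(init_weights, twins, rounds, local_train, evaluate):
    """One FL experiment: broadcast -> parallel local training -> FedAvg -> eval.

    local_train(weights, twin) must return (updated_weights, num_samples);
    evaluate(weights) scores the global model on the neutral eval twin.
    """
    global_weights = init_weights
    history = []
    for _ in range(rounds):
        # Steps 1-2: broadcast current global model; each twin trains locally.
        results = [local_train(global_weights, twin) for twin in twins]
        # Steps 3-4: collect updated weights and sample counts, then FedAvg.
        weights = [w for w, _ in results]
        counts = [n for _, n in results]
        total = sum(counts)
        global_weights = [
            sum(w[i] * n for w, n in zip(weights, counts)) / total
            for i in range(len(global_weights))
        ]
        # Step 5: score the aggregated model on the neutral evaluation twin.
        history.append(evaluate(global_weights))
    return global_weights, history
```

In the actual project this loop is driven by the FL server and the Kubeflow Training Operator rather than a single Python process, but the data flow is the same.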

🎯 The Goal: Generalization Through Knowledge Sharing

Core Objective: Build a single global model that works well across all physical variations by sharing knowledge across the fleet.

How Knowledge Sharing Works:

  • Twin 1 learns optimal strategies for one set of operational conditions.
  • Twin 2 identifies patterns that work in a different environment.
  • Twin 3 discovers edge-case adjustments unique to its state.

When these insights are aggregated, the global model learns:

  • 💡 Robust strategies that work across all observed conditions.
  • 💡 Generalized policies that adapt to unseen variations.
  • 💡 Collective intelligence gathered from the entire fleet.

The Result: A model that performs better on new, unseen variations than any individual twin could achieve alone. This is the power of federated learning applied to digital twins: collective intelligence through privacy-preserving knowledge sharing!


🌍 Deployment Modes: Single-Cluster vs. Multi-Cluster (Karmada)

This project supports two primary deployment topologies to accurately simulate digital twin environments at different scales:

1. Single Cluster (Local K8s / Kind)

  • What it is: All federated workers and the central aggregator run within the same Kubernetes cluster (e.g., a single kind cluster) under identical networking conditions.
  • Real Use Case: Simulates multiple interconnected digital twins operating within the same physical location (e.g., multiple machines on the floor of a single smart factory, or a fleet of autonomous robots coordinating within one warehouse). It's also ideal for rapid prototyping and CI/CD before scaling to a Multi-Cluster deployment.
  • Setup Command: make single-cluster-setup
  • Run Command: ./run_pipeline.sh all_single_cluster

2. Multi-Cluster (Karmada Federation)

  • What it is: True distributed federation using Karmada to manage multiple distinct Kubernetes clusters. The aggregator runs on a "Host" cluster, while digital twin workers are scheduled across geographically simulated "Member" clusters.
  • Real Use Case: Mimics real-world production FL where digital twins are geographically dispersed across different regions or edge locations, each with its own isolated local Kubernetes cluster (e.g., connected autonomous vehicles computing locally in different geographic zones, or separate smart factories across the globe). It forces the system to handle cross-cluster networking, latency resilience, and robust multi-cluster scheduling.
  • Setup Command: make multi-cluster-setup
  • Run Command: ./run_pipeline.sh all_multi_cluster

📈 Visual vs. Functional Pipelines

This project implements two distinct pipeline strategies to explore different aspects of the ML lifecycle:

1. Functional Pipelines (The "Workhorse")

  • Files: fed_twin_single_cluster_pipeline.py, single_twin_single_cluster_pipeline.py
  • Implementation: Uses a single PyTorchJob Custom Resource from the Kubeflow Training Operator.
  • Why use it: This is the efficient way to run experiments. Instead of launching individual pods for every round, the entire fleet orchestration is delegated to the Training Operator. It handles distributed synchronization natively, making it much faster.
  • UI Representation: Shows as a single, clean "Training" node in the Kubeflow graph.

2. Visual Pipelines (The "Narrative")

  • Files: fed_twin_visual_single_cluster_pipeline.py, single_twin_visual_single_cluster_pipeline.py
  • Implementation: Creates individual KFP components for every training and evaluation step.
  • Why use it: Kubeflow's default representation can be opaque. These pipelines provide better observability by mapping each round and worker to a unique component, making it easy to track the flow of weights and parallel training in the Kubeflow UI.

Federated Learning DAG (fed_twin_visual)

This visualization shows multiple training pods running in parallel for each round, followed by a synchronization step where model weights are aggregated before proceeding to the next iteration.

Single Agent DAG (single_twin_visual)

In contrast, the single agent DAG shows a linear progression of training and evaluation rounds, where a single pod learns sequentially without the need for aggregation.

📊 Performance & Analytics

The project includes an automated analysis suite that generates insights after every experiment.

1. Federated vs. Single Agent Comparison

Concept: Compares the learning efficiency and final performance of the global federated model against a single isolated agent. This metric validates whether collaborative learning across diverse environments yields a more robust policy than learning in a single environment.

  • Analysis Script: src/analysis/compare_results.py
  • Generated Plot: plots/comparison_result.png

Performance Comparison

Key Findings:

  • Federated Learning (Green) achieves significantly higher rewards by leveraging knowledge from diverse physics.
  • Single Agent (Red) learns from only one environment, limiting its generalization capability.
  • FL demonstrates better generalization and more stable growth through collective learning.
  • Both models improve initially, but the single agent gets stuck at lower performance due to overfitting, whereas FL keeps climbing thanks to better generalization.

2. Worker Training Dynamics (Worker Diversity)

Concept: Measures the variance in training rewards across different twins. In a healthy FL system, we expect individual workers to have different learning curves as they adapt to their unique physical variations, while the global model aggregates these diverse insights.

  • Analysis Script: src/analysis/worker_diversity.py
  • Generated Plot: plots/worker_diversity.png
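Worker diversity can be summarized as the per-round spread (e.g., standard deviation) of rewards across twins. An illustrative stdlib computation with made-up numbers; the actual script's data layout may differ:

```python
from statistics import pstdev

# rewards[round][worker]: per-round reward for each twin (illustrative values)
rewards = [
    [50.0, 60.0, 40.0],    # early round: twins still specializing -> high spread
    [120.0, 125.0, 118.0], # later round: global model converging -> lower spread
]

# One diversity number per round: population std dev across the fleet
diversity = [pstdev(round_rewards) for round_rewards in rewards]
```

Non-zero diversity is expected and healthy here, since each twin adapts to its own physics; a collapse to zero early on would suggest the variations are not actually diverse.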

3. Generalization Gap

Concept: Measures the difference between a model's performance on its training environment vs. a neutral evaluation environment. A smaller gap indicates that the model has truly learned robust policies rather than just memorizing a specific condition. Federated learning typically minimizes this gap by forcing the model to solve for multiple physics variations simultaneously.

  • Analysis Script: src/analysis/generalization_gap.py
  • Generated Plot: plots/generalization_gap_{type}.png
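Concretely, the gap per round is just the difference between the two reward series. A minimal sketch (the function name and sample values are illustrative, not the script's actual code):

```python
def generalization_gap(train_rewards, eval_rewards):
    """Per-round gap between training-environment and neutral-eval rewards.
    A gap shrinking toward zero suggests the policy generalizes rather than
    memorizing one physics configuration."""
    return [t - e for t, e in zip(train_rewards, eval_rewards)]

gaps = generalization_gap([120.0, 150.0, 180.0], [90.0, 135.0, 172.0])
# gaps == [30.0, 15.0, 8.0]
```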

🧪 MLflow Tracking

The project uses MLflow for centralized experiment tracking and metric visualization.

πŸ› Unified Tracking Strategy

We implement a "Single Execution, Single Run" strategy to keep the experiment history clean:

  • Experiment by Type: Runs are grouped into experiments based on the pipeline type (e.g., Fed-Twin-FL, Fed-Twin-Single-Visual).
  • Unified Runs: Each pipeline execution creates exactly one MLflow run. All parallel workers and sequential rounds log to this unique run.
  • Prefix-Based Metrics: Metrics are prefixed with worker IDs (e.g., train-twin-1/reward, eval-twin-global/loss) to distinguish between different sources within the same timeline.
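The prefix convention amounts to a tiny naming helper; `metric_key` below is illustrative (not the repo's actual code), and each worker would pass the resulting name to `mlflow.log_metric` inside the single shared run:

```python
def metric_key(source: str, name: str) -> str:
    """Compose a prefixed metric name so many workers can share one
    MLflow run while their timelines remain distinguishable."""
    return f"{source}/{name}"

# Inside the shared run, a worker would log, e.g.:
#   mlflow.log_metric(metric_key("train-twin-1", "reward"), reward, step=round_idx)
print(metric_key("train-twin-1", "reward"))  # train-twin-1/reward
```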

🖥 Monitoring & Debugging

Both MLflow and Kubeflow Pipelines (KFP) provide specialized UIs for monitoring.

Note

Artifacts (model weights and metrics) are stored in the integrated MinIO bucket via S3-compatible API.


🚀 Getting Started

1. Prerequisites

  • Kubernetes Cluster: A local cluster like Kind or Minikube.
  • Container Runtime: Docker Desktop, Colima, or Podman.
  • Tools: kubectl, Python 3.10+, and uv.

2. Local Setup

You can set up the local development clusters and deploy the infrastructure using make:

```bash
# Setup Single-Cluster development environment
make single-cluster-setup

# OR setup Multi-Cluster federation environment
make multi-cluster-setup
```

3. Pipeline Execution

Run the pipelines using run_pipeline.sh:

```bash
./run_pipeline.sh all_single_cluster   # Run all single-cluster pipelines sequentially
./run_pipeline.sh all_multi_cluster    # Run all multi-cluster pipelines sequentially
```

4. Teardown

To destroy the local clusters, you can run:

```bash
make single-cluster-teardown   # Teardown Single-Cluster mode
# OR
make multi-cluster-teardown    # Teardown Multi-Cluster mode
```

📂 Repository Structure

  • /src/core: The core code of the project, including engine.py (physics simulation), client.py (RL training), and server.py (FL aggregation).
  • /src/pipelines: Definitions for Kubeflow Pipelines (KFP).
  • /src/analysis: Python scripts for generating professional plots and metrics analysis.
  • /metrics: Consolidated CSV results from every cluster run.
  • /plots: Generated visualizations showing project performance.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

