---
sidebar_position: 2
title: "Quickstart: Clone to First Training Job"
description: Deploy infrastructure and submit your first robotics training job in 8 steps
author: Microsoft Robotics-AI Team
ms.date: 2026-02-22
ms.topic: tutorial
keywords:
---
Deploy the full Azure NVIDIA Robotics stack and submit a training job in roughly 1.5-2 hours. This guide follows the full-public networking path with Access Keys authentication, which is the simplest route to a working deployment.
:::note
This guide expands on the Getting Started hub.
:::
## Prerequisites

| Requirement | Details |
|---|---|
| Azure subscription | Contributor + User Access Administrator roles |
| GPU quota | Standard_NC24ads_A100_v4 in target region |
| NVIDIA NGC account | Sign up at https://ngc.nvidia.com/ for API key |
| Development environment | Devcontainer (recommended) or local tools |
See Prerequisites for installation commands and version requirements.
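Before starting, you can quickly confirm the required CLIs are installed. This is a minimal sketch: the tool list is inferred from the commands in this guide, and the exact version requirements are on the Prerequisites page.

```shell
# Report which of the CLIs used in this guide are on PATH.
# The tool list is an assumption based on the commands below.
for tool in git az terraform kubectl helm; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "ok:      $tool"
  else
    echo "missing: $tool"
  fi
done
```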
## Step 1: Clone the repository

Clone the repository and initialize the development environment:

```bash
git clone https://github.com/microsoft/physical-ai-toolchain.git
cd physical-ai-toolchain
```

Use the devcontainer (recommended) or run local setup:

```bash
./setup-dev.sh
```

## Step 2: Authenticate with Azure

Authenticate with Azure and register the required resource providers.
```bash
source infrastructure/terraform/prerequisites/az-sub-init.sh
bash infrastructure/terraform/prerequisites/register-azure-providers.sh
```

Verify your subscription:

```bash
az account show --query "{name:name, id:id}" -o table
```

## Step 3: Configure Terraform variables

Create a Terraform variables file for the full-public deployment path. From the repository root:
```bash
cd infrastructure/terraform
cp terraform.tfvars.example terraform.tfvars
```

Edit `terraform.tfvars` with these values:
```hcl
project_name       = "robotics"
environment        = "dev"
location           = "eastus"
gpu_vm_size        = "Standard_NC24ads_A100_v4"
enable_azure_ml    = true
enable_osmo        = true
enable_vpn_gateway = false
enable_private_dns = false
```

:::tip
For private networking, set `enable_vpn_gateway = true` and `enable_private_dns = true`. See the Infrastructure Guide for details.
:::
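Before deploying, it can help to confirm the GPU SKU is offered in your region and that you have quota for it. This is a sketch using standard `az` commands; the region mirrors the example above, so adjust it to match your `terraform.tfvars`.

```shell
# List availability of the A100 SKU in the target region.
az vm list-skus --location eastus --size Standard_NC24ads_A100_v4 --output table

# Show current usage vs. limits; filter for the A100 family (the pattern is a
# case-insensitive guess at the family name shown in the usage table).
az vm list-usage --location eastus --output table | grep -i "a100"
```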
## Step 4: Deploy infrastructure

Initialize and apply the Terraform configuration. This step takes roughly 30-40 minutes.

```bash
terraform init
terraform plan -out=tfplan
terraform apply tfplan
```

Verify the deployment:

```bash
terraform output
```

Connect to the AKS cluster:
```bash
az aks get-credentials \
  --resource-group "$(terraform output -raw resource_group_name)" \
  --name "$(terraform output -raw aks_cluster_name)"
```

## Step 5: Deploy cluster services

Deploy the GPU Operator, KAI Scheduler, and the AzureML extension. From the repository root:
```bash
cd infrastructure/setup
bash 01-deploy-robotics-charts.sh
bash 02-deploy-azureml-extension.sh
```

Verify the GPU Operator pods:

```bash
kubectl get pods -n gpu-operator
```

## Step 6: Deploy OSMO

Deploy the OSMO control plane and backend using Access Keys authentication:
```bash
bash 03-deploy-osmo-control-plane.sh
bash 04-deploy-osmo-backend.sh --use-access-keys
```

Verify the OSMO pods:

```bash
kubectl get pods -n osmo-control-plane
```

## Step 7: Submit a training job

Navigate to the scripts directory and submit a training job. From the repository root:
```bash
cd scripts
bash submit-osmo-training.sh
```

The scripts auto-detect configuration from Terraform outputs; override values with CLI arguments or environment variables as needed. See the Scripts Reference for all submission options.
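As an illustration of the override behavior, a script following this pattern prefers an environment variable when set and falls back to a Terraform output otherwise. The variable and output names here are hypothetical; the Scripts Reference has the real ones.

```shell
# Hypothetical sketch of an env-var override with a Terraform-output fallback.
# CLUSTER_NAME is an illustrative name, not a documented variable.
CLUSTER_NAME="${CLUSTER_NAME:-$(terraform -chdir=../infrastructure/terraform \
  output -raw aks_cluster_name 2>/dev/null || echo "")}"
echo "target cluster: ${CLUSTER_NAME:-<not detected>}"
```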
Confirm the training job is running:

```bash
kubectl get pods -n osmo-control-plane --watch
```

Check training status through the OSMO web UI, or query the pod logs:

```bash
kubectl logs -n osmo-control-plane -l app=osmo-training --tail=50
```

## Step 8: Clean up

Destroy all infrastructure when finished to stop incurring costs. From the repository root:
```bash
cd infrastructure/terraform
terraform destroy
```

See Cost Considerations for detailed pricing.
## Next steps

| Resource | Description |
|---|---|
| MLflow Integration | Track experiments with MLflow |
| Deployment Guide | Full deployment reference and options |
| Contributing Guide | Development workflow and code standards |