Mosaic is an open-source always-on observability tool for GPU collective communication, providing near real-time visibility into performance and reliability issues in large-scale AI workloads. It treats collective communication as first-class OpenTelemetry data, enabling correlation with GPU, network, and system signals in a single view.
At GPU scale, where failures and inefficiencies are inevitable, Mosaic makes these issues visible early, shifting observability from postmortem analysis to continuous operations.
No offline tracing, no bespoke pipelines, and no invasive instrumentation.
To get Mosaic up and running in your environment, follow our Quick Start Guide.
For a deep dive into core concepts, architecture, and advanced configuration, visit Mosaic Documentation
