The New Stack · March 17, 2026

Uber's Michelangelo: From Monolith to Global ML Mesh on Kubernetes

This article details Uber's architectural evolution of its machine learning platform, Michelangelo, from a monolithic structure to a global, cloud-native mesh built on Kubernetes. It highlights the challenges of scaling ML infrastructure, abstracting away complexities for data scientists, and achieving multi-cloud batch orchestration, offering key insights into distributed system design for large-scale AI platforms.


The Challenge: Scaling Uber's ML Infrastructure

Initially, Uber faced significant scaling issues with its machine learning efforts. Data scientists were spending 80% of their time on infrastructure 'plumbing' rather than model development, and deploying a single model to production took months. The proprietary Michelangelo platform, while centralizing ML, eventually hit a scaling wall, necessitating a fundamental architectural shift from a monolithic legacy stack to a cloud-native Kubernetes foundation to handle 30 million predictions per second and beyond.

Architectural Shift: Kubernetes as a Distributed State Machine

Uber re-architected Michelangelo using Kubernetes not just for container orchestration, but as a robust distributed state machine to manage the complex, distributed state of thousands of ML models and projects. This involved addressing limitations of standard Kubernetes deployments, particularly with etcd's performance under high-cardinality data.
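"Kubernetes as a distributed state machine" boils down to the controller pattern: every resource carries a desired state (spec) and an observed state (status), and a reconcile loop repeatedly converges one toward the other. A minimal sketch of that loop, using a hypothetical in-memory `ModelDeployment` resource (the names are illustrative, not Uber's actual CRDs):

```python
from dataclasses import dataclass


@dataclass
class ModelDeployment:
    # Desired state (spec) vs. observed state (status), as in any K8s resource.
    name: str
    desired_replicas: int
    running_replicas: int = 0


def reconcile(resource: ModelDeployment) -> list[str]:
    """One pass of the control loop: emit the actions needed to converge."""
    actions = []
    diff = resource.desired_replicas - resource.running_replicas
    if diff > 0:
        actions += [f"start replica of {resource.name}"] * diff
    elif diff < 0:
        actions += [f"stop replica of {resource.name}"] * (-diff)
    # Pretend the actions succeeded; a real controller would re-observe the cluster.
    resource.running_replicas = resource.desired_replicas
    return actions


dep = ModelDeployment(name="eta-model", desired_replicas=3)
print(reconcile(dep))  # three "start replica" actions
print(reconcile(dep))  # [] -- already converged, loop is idempotent
```

The idempotence of the second call is the property that makes the pattern robust at scale: controllers can crash and rerun without corrupting state.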

Scaling the Control Plane: CRDs and Transparent Persistence

  • 100+ Custom Resource Definitions (CRDs): Uber defined an extensive ecosystem of CRDs to represent the entire ML lifecycle, allowing for a rich, domain-specific data model while leveraging standard Kubernetes API and controller-runtime.
  • Transparent Relational Persistence: To overcome etcd's scaling limitations for high-cardinality relational data, an abstraction layer was built. This layer synchronizes Kubernetes object metadata with a horizontally scalable MySQL backend, transparently to developers, allowing for complex joins and low-latency filtering.
  • High-Efficiency Reconciliation: Mapping CRDs to an optimized relational schema enabled efficient querying and management of ML project metadata.
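The persistence idea can be sketched with SQLite standing in for Uber's MySQL backend: custom-resource metadata is mirrored into a relational table, so filters and aggregations that would require a full scan against etcd's key-value model become cheap SQL. The resource shapes below are illustrative, not Uber's actual schema:

```python
import sqlite3

# Custom resources as they might arrive from the Kubernetes API (illustrative shape).
resources = [
    {"kind": "MLModel", "name": "eta-v3", "project": "maps", "gpus": 4},
    {"kind": "MLModel", "name": "fraud-v7", "project": "risk", "gpus": 8},
    {"kind": "TrainingJob", "name": "eta-v4-train", "project": "maps", "gpus": 16},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (kind TEXT, name TEXT, project TEXT, gpus INTEGER)")
conn.executemany(
    "INSERT INTO objects VALUES (:kind, :name, :project, :gpus)", resources
)

# A relational query (GPU usage per project) that a key-value store like etcd
# cannot answer without scanning and aggregating client-side.
rows = conn.execute(
    "SELECT project, SUM(gpus) FROM objects GROUP BY project ORDER BY project"
).fetchall()
print(rows)  # [('maps', 20), ('risk', 8)]
```

In the real system this mirroring happens transparently behind the Kubernetes API, so controllers keep using standard clients while reads are served relationally.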

Curing Stranded Compute: Unified Batch Federation

To address the problem of 'stranded compute' across dozens of regional clusters, where some clusters were overloaded while others were underutilized, Uber implemented a Unified Batch Federation Layer. This system utilizes a `PropagationPolicy` CRD. Engineers submit jobs to a 'Virtual Regional Cluster,' and a Federated Controller acts as a global traffic cop, dynamically assigning workloads to physical clusters with available GPU capacity, achieving a 99.9% scheduling success rate.
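The federation logic can be pictured as a policy that routes each job to whichever physical cluster has the most free GPU headroom. The cluster names and policy shape below are illustrative, not the schema of Uber's `PropagationPolicy` CRD:

```python
# Free GPU capacity reported by each physical cluster (illustrative numbers).
clusters = {
    "us-east-1": {"gpu_free": 0},
    "us-west-2": {"gpu_free": 64},
    "eu-west-1": {"gpu_free": 16},
}


def propagate(job_gpus: int) -> str:
    """Pick the physical cluster with the most free GPUs that fits the job."""
    eligible = {c: v for c, v in clusters.items() if v["gpu_free"] >= job_gpus}
    if not eligible:
        raise RuntimeError("no cluster can satisfy the request")
    target = max(eligible, key=lambda c: eligible[c]["gpu_free"])
    clusters[target]["gpu_free"] -= job_gpus
    return target


print(propagate(32))  # us-west-2 (most headroom; us-east-1 is full)
print(propagate(16))  # us-west-2 again (still the most headroom)
```

Routing on live capacity rather than on static cluster assignment is what drains the overloaded clusters and fills the underutilized ones.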

Bridging the Friction Gap with Uniflow

To simplify complex ML workflows, Uber developed Uniflow, a Python-native workflow service. Unlike general-purpose ETL engines, Uniflow prioritizes ML-specific needs, offering a Python-first experience, resource-aware scheduling that tracks specialized hardware, and 'write once, run anywhere' capabilities, ensuring consistency from local debugging to global cloud deployment.
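Uniflow's API is not public; the sketch below only illustrates the kind of Python-first, resource-aware interface the article describes, using a hypothetical `@step` decorator that attaches hardware requirements to plain functions (none of these names are Uniflow's real API):

```python
from functools import wraps

REGISTRY = []


def step(gpus: int = 0):
    """Hypothetical decorator: register a function as a workflow step
    together with its hardware requirements."""
    def decorator(fn):
        REGISTRY.append({"name": fn.__name__, "fn": fn, "gpus": gpus})

        @wraps(fn)
        def wrapper(*args, **kwargs):
            return fn(*args, **kwargs)

        return wrapper
    return decorator


@step(gpus=0)
def extract_features(rows):
    return [r * 2 for r in rows]


@step(gpus=8)
def train(features):
    return {"model": "sketch", "n": len(features)}


def run(data):
    # "Write once, run anywhere": the same steps execute locally here,
    # while a scheduler could place each step on matching hardware remotely.
    return train(extract_features(data))


print(run([1, 2, 3]))                      # {'model': 'sketch', 'n': 3}
print([s["gpus"] for s in REGISTRY])       # [0, 8]
```

Because the resource requirements live on the step rather than in a separate deployment config, local debugging and cluster execution share one definition of the workflow.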

💡

Domain-Specific Abstractions

A key takeaway from Uniflow's design is the importance of providing domain-specific abstractions. While Kubernetes provides a powerful foundation, simplifying developer experience for specific use cases (like ML workflows) often requires higher-level tools tailored to the domain.

The Universal Compute Mesh: Multi-Cloud Orchestration

Recognizing single-cloud dependency as a strategic risk, Uber evolved Michelangelo into a cloud-agnostic batch processing layer, or 'Universal Compute Mesh.' This layer spans multiple public cloud providers (GCP, OCI, AWS, Azure) and Uber's own data centers, all managed by a single control plane. The federation layer and `PropagationPolicy` CRD abstract away provider-specific networking and storage primitives, while a global data mesh ensures unified and secure data access across cloud boundaries, making them invisible to ML models.
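One way to picture "invisible cloud boundaries" is a catalog that resolves a logical dataset name to whichever provider-specific location currently holds it, so ML code never references a cloud directly. The dataset names, URI schemes, and locations below are illustrative assumptions, not Uber's actual data mesh:

```python
# Logical datasets mapped to provider-specific storage (illustrative locations).
CATALOG = {
    "trips/2026-03": {"provider": "gcp", "uri": "gs://ml-data/trips/2026-03"},
    "riders/2026-03": {"provider": "oci", "uri": "oci://ml-data/riders/2026-03"},
    "eta-features": {"provider": "onprem", "uri": "hdfs://dc1/ml/eta-features"},
}


def resolve(dataset: str) -> str:
    """Return a concrete URI; the caller never sees which cloud it lives in."""
    entry = CATALOG.get(dataset)
    if entry is None:
        raise KeyError(f"unknown dataset: {dataset}")
    return entry["uri"]


# An ML job asks for data by logical name only.
print(resolve("trips/2026-03"))  # gs://ml-data/trips/2026-03
```

Pairing this kind of logical catalog with the federation layer is what lets a workload move between providers without any change to the model code that reads the data.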

Uber · Michelangelo · Kubernetes · Machine Learning · MLOps · Multi-cloud · CRDs · Distributed Systems
