This article details Uber's architectural evolution of its machine learning platform, Michelangelo, from a monolithic structure to a global, cloud-native mesh built on Kubernetes. It highlights the challenges of scaling ML infrastructure, abstracting away complexities for data scientists, and achieving multi-cloud batch orchestration, offering key insights into distributed system design for large-scale AI platforms.
Initially, Uber faced significant scaling issues with its machine learning efforts. Data scientists were spending 80% of their time on infrastructure 'plumbing' rather than model development, and deploying a single model to production took months. The proprietary Michelangelo platform, while centralizing ML, eventually hit a scaling wall, necessitating a fundamental architectural shift from a monolithic legacy stack to a cloud-native Kubernetes foundation to handle 30 million predictions per second and beyond.
Uber re-architected Michelangelo using Kubernetes not just for container orchestration, but as a robust distributed state machine to manage the complex, distributed state of thousands of ML models and projects. This involved addressing limitations of standard Kubernetes deployments, particularly etcd's performance degradation under high-cardinality data.
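The "distributed state machine" framing boils down to the Kubernetes controller pattern: a reconcile loop repeatedly compares desired state (the spec) against observed state (the status) and nudges the system toward convergence. The sketch below illustrates that pattern in miniature; the `ModelDeployment` resource, its fields, and the action strings are all invented for illustration and are not Michelangelo's actual schema.

```python
# Minimal sketch of the controller/reconcile pattern behind
# "Kubernetes as a distributed state machine": compare desired state
# (spec) with observed state (status) and emit actions to converge.
from dataclasses import dataclass

@dataclass
class ModelDeployment:
    """Hypothetical custom resource for a served ML model."""
    name: str
    desired_replicas: int          # spec: what the user asked for
    observed_replicas: int = 0     # status: what the cluster reports

def reconcile(resource: ModelDeployment) -> list[str]:
    """One reconcile pass: emit the actions needed to converge."""
    actions = []
    diff = resource.desired_replicas - resource.observed_replicas
    if diff > 0:
        actions.append(f"scale-up {resource.name} by {diff}")
    elif diff < 0:
        actions.append(f"scale-down {resource.name} by {-diff}")
    # Pretend the actions took effect, so repeated passes converge.
    resource.observed_replicas = resource.desired_replicas
    return actions

# Usage: drive two resources toward their desired state.
fleet = [ModelDeployment("eta-model", 3), ModelDeployment("fraud-model", 5, 7)]
plan = [action for r in fleet for action in reconcile(r)]
```

At Uber's scale, every ML model and project becomes one of these reconciled objects, which is precisely what pushes etcd (where all that object state lives) into high-cardinality territory.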
To address the problem of 'stranded compute' across dozens of regional clusters, where some clusters were overloaded while others were underutilized, Uber implemented a Unified Batch Federation Layer. This system utilizes a `PropagationPolicy` CRD. Engineers submit jobs to a 'Virtual Regional Cluster,' and a Federated Controller acts as a global traffic cop, dynamically assigning workloads to physical clusters with available GPU capacity, achieving a 99.9% scheduling success rate.
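The article does not show Uber's exact CRD schema; the fragment below is a hedged sketch modeled on the Karmada-style `PropagationPolicy` pattern, with an invented API group and illustrative field values, to show the shape of such a policy: select workloads by label, then let the federated controller place them on member clusters with spare capacity.

```yaml
# Illustrative PropagationPolicy: route GPU training Jobs submitted to the
# virtual regional cluster onto member clusters with available headroom.
apiVersion: policy.example.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: gpu-training-jobs
spec:
  resourceSelectors:
    - apiVersion: batch/v1
      kind: Job
      labelSelector:
        matchLabels:
          workload-type: gpu-training
  placement:
    clusterAffinity:
      clusterNames: []        # empty: let the federated controller choose
    replicaScheduling:
      replicaSchedulingType: Divided   # spread replicas by capacity
```

The key design point is that the submitting engineer never names a physical cluster; placement is a policy decision owned by the control plane.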
To simplify complex ML workflows, Uber developed Uniflow, a Python-native workflow service. Unlike general-purpose ETL engines, Uniflow prioritizes ML-specific needs, offering a Python-first experience, resource-aware scheduling that tracks specialized hardware, and 'write once, run anywhere' capabilities, ensuring consistency from local debugging to global cloud deployment.
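To make the 'write once, run anywhere' idea concrete, here is a hypothetical sketch in the spirit of Uniflow; the decorator name, registry, and resource hints are assumptions, not Uniflow's real API. Each step declares its hardware needs at definition time, so the same code can execute in-process for local debugging or be handed to a resource-aware remote scheduler.

```python
# Hypothetical Uniflow-style workflow: steps are plain Python functions
# annotated with resource requirements that a scheduler can inspect.
steps = {}  # step name -> (function, resource hints)

def step(gpus=0, memory_gb=4):
    """Register a function as a workflow step with resource requirements."""
    def wrap(fn):
        steps[fn.__name__] = (fn, {"gpus": gpus, "memory_gb": memory_gb})
        return fn
    return wrap

@step(memory_gb=16)
def extract_features(rows):
    return [r * 2 for r in rows]

@step(gpus=4, memory_gb=64)
def train(features):
    return {"model": "v1", "n_examples": len(features)}

def run_local(data):
    """Local debug run: execute the pipeline in-process, in order."""
    return train(extract_features(data))

result = run_local([1, 2, 3])
```

Because resource hints live in the step registry rather than in deployment configs, a remote runner could schedule `train` onto GPU capacity while running `extract_features` on commodity nodes, without changing the workflow code.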
Domain-Specific Abstractions
A key takeaway from Uniflow's design is the importance of providing domain-specific abstractions. While Kubernetes provides a powerful foundation, simplifying developer experience for specific use cases (like ML workflows) often requires higher-level tools tailored to the domain.
Recognizing single-cloud dependency as a strategic risk, Uber evolved Michelangelo into a cloud-agnostic batch processing layer, or 'Universal Compute Mesh.' This layer spans multiple public cloud providers (GCP, OCI, AWS, Azure) and Uber's own data centers, all managed by a single control plane. The federation layer and `PropagationPolicy` CRD abstract away provider-specific networking and storage primitives, while a global data mesh ensures unified and secure data access across cloud boundaries, making them invisible to ML models.
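One way a data mesh can make cloud boundaries invisible is indirection through a catalog: jobs reference a logical dataset name, and a resolver maps it to the provider-specific location for the region the workload runs in. The sketch below is purely illustrative; the dataset names, regions, and URIs are invented, not Uber's actual catalog.

```python
# Illustrative data-mesh resolver: logical dataset names map to
# provider-specific physical URIs, so jobs never hard-code a cloud.
CATALOG = {
    ("trips/features", "gcp-us-central"): "gs://ml-mesh-us/trips/features",
    ("trips/features", "oci-phoenix"): "oci://ml-mesh-phx/trips/features",
    ("trips/features", "onprem-dca"): "hdfs://dca/warehouse/trips/features",
}

def resolve(dataset: str, region: str) -> str:
    """Return the physical URI for a logical dataset in a given region."""
    try:
        return CATALOG[(dataset, region)]
    except KeyError:
        raise LookupError(f"{dataset} is not replicated to {region}")

# Usage: the same logical name resolves differently per region.
uri = resolve("trips/features", "oci-phoenix")
```

Centralizing this mapping is also where unified access control can live: the resolver is a single choke point for auditing which workloads read which datasets across providers.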