This article details Delivery Hero's journey in deprecating Google Analytics and developing an internal, highly scalable, and cost-effective user tracking platform. It covers the architectural decisions, challenges faced regarding data quality and real-time processing, and the strategies employed for testing, rollout, and continuous optimization. The system achieved superior data quality and lower costs compared to its predecessor.
Read original on InfoQ Architecture
Delivery Hero decided to replace Google Analytics with an in-house solution due to several critical limitations and external factors. The primary drivers included the need for real-time data (GA provided data only once or twice a day), unlimited event types (GA had definable event limits), and GDPR compliance concerns regarding storing sensitive user data with a third party. Cost optimization was also a significant factor, with an initial goal to not exceed GA costs, which was later surpassed by achieving a 25% cost reduction.
The user tracking system began with a simplistic, highly scalable architecture comprising a central API and two Pub/Sub message processors (one for fallback). This initial design allowed them to handle significant load with zero issues. Over time, as requirements evolved, additional services were introduced around this core API to address concerns like reliability, data validation, and diverse SDK support for mobile and frontend clients. Data is streamed to real-time consumers via Pub/Sub and stored in BigQuery for other consumers.
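The core flow described above (a central API that validates an event and hands it to a message topic, with a second path as fallback) can be sketched as follows. This is a minimal illustration under stated assumptions, not Delivery Hero's implementation: `InMemoryTopic` is a stand-in for a real Pub/Sub topic, `validate` is a hypothetical schema check, and the rule "publish to the fallback only when the primary fails" is one plausible reading of "one for fallback".

```python
from dataclasses import dataclass, field
from typing import List


class PublishError(Exception):
    """Raised when a topic cannot accept a message."""


@dataclass
class InMemoryTopic:
    """Stand-in for a Pub/Sub topic (assumption: not the real client library)."""
    name: str
    healthy: bool = True
    messages: List[dict] = field(default_factory=list)

    def publish(self, event: dict) -> None:
        if not self.healthy:
            raise PublishError(f"topic {self.name} unavailable")
        self.messages.append(event)


def validate(event: dict) -> bool:
    # Hypothetical minimal check: an event needs a string name and a string user id.
    return isinstance(event.get("name"), str) and isinstance(event.get("user_id"), str)


def ingest(event: dict, primary: InMemoryTopic, fallback: InMemoryTopic) -> str:
    """Accept one tracking event and report which path handled it."""
    if not validate(event):
        return "rejected"
    try:
        primary.publish(event)
        return "primary"
    except PublishError:
        # Assumed behaviour: route to the fallback topic so the event is not lost.
        fallback.publish(event)
        return "fallback"


if __name__ == "__main__":
    primary = InMemoryTopic("tracking-primary")
    fallback = InMemoryTopic("tracking-fallback")
    print(ingest({"name": "order_placed", "user_id": "u1"}, primary, fallback))
    primary.healthy = False  # simulate an outage of the primary path
    print(ingest({"name": "order_placed", "user_id": "u2"}, primary, fallback))
```

Keeping the API this thin means it does little beyond validation and a publish call, which is consistent with the article's point that a simple core is easy to scale; downstream consumers (real-time subscribers, BigQuery loading) stay out of the request path.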
Architectural Principle: Start Simple, Iterate and Expand
Delivery Hero's approach highlights the value of starting with a minimal, scalable core and incrementally adding complexity and specialized services as problems arise and requirements solidify. This iterative development avoids over-engineering upfront and allows the architecture to naturally evolve to meet growing demands.
The team focused on improving data quality, tracked through the order match rate: 85% with GA versus more than 91% with the internal tool. The improvement came from fixing SDK data loss issues and building a more reliable ingestion infrastructure. Cost optimization was measured by cost per message, which ended up 25% lower than with GA. Scalability was addressed through load testing with real data, including simulated peak loads three times higher than typical, which enabled the system to withstand unexpected traffic surges without incidents or data loss. The simplified architecture inherently supported high scalability.
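The two metrics above can be made concrete with a short sketch. The definitions here are assumptions, since the article does not spell them out: order match rate is taken to be the share of backend orders that also appear in the tracking data, and cost per message is total platform cost divided by messages processed. All function names and figures are illustrative, not Delivery Hero's data.

```python
from typing import Iterable


def order_match_rate(backend_order_ids: Iterable[str],
                     tracked_order_ids: Iterable[str]) -> float:
    """Fraction of backend orders that have a matching tracked order (assumed definition)."""
    backend = set(backend_order_ids)
    if not backend:
        return 0.0
    return len(backend & set(tracked_order_ids)) / len(backend)


def cost_per_message(total_cost: float, message_count: int) -> float:
    """Total cost divided by the number of messages processed."""
    if message_count <= 0:
        raise ValueError("message_count must be positive")
    return total_cost / message_count


def relative_reduction(old: float, new: float) -> float:
    """Relative drop from old to new, e.g. 0.25 for a 25% reduction."""
    if old <= 0:
        raise ValueError("old must be positive")
    return (old - new) / old


if __name__ == "__main__":
    # Illustrative data: 100 backend orders, 91 of which were seen by tracking.
    backend = [f"order-{i}" for i in range(100)]
    tracked = [f"order-{i}" for i in range(91)]
    print(f"match rate: {order_match_rate(backend, tracked):.2%}")

    # Illustrative costs for the same message volume on two platforms.
    old = cost_per_message(1000.0, 1_000_000)
    new = cost_per_message(750.0, 1_000_000)
    print(f"cost reduction: {relative_reduction(old, new):.0%}")
```

Normalizing cost by message volume makes the comparison between platforms meaningful even when traffic grows between the two measurement periods.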