This article details the architectural evolution of a high-volume Treasure Hunt Engine for a gaming platform. It highlights how an initial microservices architecture struggled with consistency and latency under scale, leading to a critical re-evaluation of service boundaries and the adoption of a more modular, event-driven design with an event store.
Read original on Dev.to #architectureThe Treasure Hunt Engine, designed for high-volume event processing in a large-scale gaming environment, initially faced significant performance and consistency issues despite being built with microservices. The system experienced frequent `java.lang.OutOfMemoryError` and `org.apache.kafka.common.errors.TimeoutException` under load, indicating fundamental architectural flaws rather than mere scaling challenges. This scenario underscores the importance of well-defined service boundaries and communication patterns in distributed systems, especially when handling thousands of concurrent events.
Early attempts to resolve the issues involved vertical and horizontal scaling (adding nodes, optimizing Kafka) and introducing a Redis caching layer. These measures offered only temporary relief, with error rates remaining high (up to 30% during peak) and mean time to recovery exceeding two hours. This failure highlighted that the core problem was not insufficient resources but a tightly coupled architecture with unclear service boundaries, complex inter-service communication via REST and message queues, and difficult error handling.
Lesson Learned: Scaling Cannot Fix Bad Design
Simply throwing more hardware or adding a cache layer rarely solves fundamental design flaws. Poorly defined service boundaries and tight coupling can quickly negate the benefits of a microservices approach, leading to distributed monoliths.
The team refactored the engine into a more modular, event-driven architecture. Key decisions included: breaking down the system into smaller, independent services aligned with specific business capabilities; introducing an event store using Apache Kafka and Apache Cassandra to establish a single source of truth for all events; and implementing a new consistency model combining eventual consistency with transactional logging. This shift decoupled services, improving flexibility and scalability, and required adopting new technologies like Scala and Akka.
The Power of an Event Store
An event store using technologies like Kafka and Cassandra provides an immutable log of all system changes. This is crucial for achieving eventual consistency, enabling services to react to events independently, and offering robust auditability and replay capabilities in complex distributed systems.
The refactoring led to dramatic improvements: error rates dropped to less than 1%, MTTR reduced to under 30 minutes, and the system successfully handled over 10,000 concurrent events with over 50% latency reduction. Operational costs decreased by over 30%. Monitoring with Grafana and Prometheus became more effective, allowing proactive issue detection. These results underscore the tangible benefits of a well-executed architectural overhaul focused on clear boundaries and event-driven principles.