This article details Airbnb's journey to build a unified, internally managed knowledge graph infrastructure to scale its critical identity graph. It covers the architectural evolution from relational and third-party solutions, the challenges faced, the technical stack chosen (JanusGraph + DynamoDB), and key optimizations for performance and stability. The migration highlights the trade-offs of build vs. buy, emphasizing control over performance and operational overhead for large-scale graph workloads.
Read original on Airbnb Engineering
Airbnb's identity graph is a foundational component for Trust and Safety, mapping relationships between users to detect suspicious activities and identify linked accounts. It grew to 7 billion nodes and 11 billion edges, with 5 million new edges daily. This scale presented significant challenges: ensuring scalability for writes and complex, multi-hop queries, mitigating long-tail latency from high-fanout nodes, and maintaining system stability under heavy load.
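To make the high-fanout problem concrete, here is a minimal Python sketch of a hop-limited traversal that refuses to expand nodes with too many neighbours. It is an illustration only, not Airbnb's implementation: the function name linked_accounts, the max_hops and max_fanout parameters, and the toy adjacency data are all assumptions made for this example.

```python
from collections import deque


def linked_accounts(adjacency, start, max_hops=2, max_fanout=1000):
    """Breadth-first search from `start`, limited to `max_hops` hops.

    Nodes with more than `max_fanout` neighbours are still reported,
    but they are not expanded, so one "super node" (for example a
    shared public IP) cannot blow up the cost of a single query.
    Returns (hop distance per reached node, set of skipped nodes).
    """
    distance = {start: 0}
    skipped = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        depth = distance[node]
        if depth == max_hops:
            continue  # hop budget exhausted for this branch
        neighbours = adjacency.get(node, ())
        if len(neighbours) > max_fanout:
            skipped.add(node)  # hypothetical policy: record and move on
            continue
        for nxt in neighbours:
            if nxt not in distance:
                distance[nxt] = depth + 1
                queue.append(nxt)
    return distance, skipped


# Toy data (made up): u1 and u2 share a device; u1 also used a busy public IP.
adjacency = {
    "u1": ["d1", "ip_public"],
    "d1": ["u1", "u2"],
    "u2": ["d1"],
    "ip_public": ["u1"] + [f"x{i}" for i in range(50)],
}
distance, skipped = linked_accounts(adjacency, "u1", max_hops=2, max_fanout=10)
print(distance)
print(skipped)
```

In this toy run, u2 is found two hops away through the shared device, while the 50 accounts behind ip_public are never visited because that node exceeds the fanout cap. A production system would more likely apply such a limit inside the graph query itself; the cap here only shows why unbounded expansion drives long-tail latency.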
To overcome the limitations of fragmented graph solutions (relational 'graphs', offline graphs, DIY open-source, managed PaaS), Airbnb developed a paved-path, multi-tenant internal platform. The core technology stack chosen was JanusGraph (a distributed, open-source graph database) with DynamoDB as the storage backend and OpenSearch for indexing. This combination separates storage from graph processing, allowing Airbnb to leverage DynamoDB's scalability and reliability while maintaining control over the graph logic layer.
Why JanusGraph with DynamoDB?
JanusGraph's pluggable storage backend was crucial. It allowed Airbnb to decouple the graph processing logic from the underlying persistent storage. This decoupling enables rapid iteration on graph features without reimplementing distributed storage operations, and it leaves the storage layer free to evolve independently.
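The decoupling described above can be sketched as a graph layer that talks to storage only through a narrow interface. This is a simplified Python illustration, not JanusGraph's actual API: StorageBackend, InMemoryBackend, and GraphLayer are hypothetical names, and the row layout (one row per vertex, one column per edge) is only loosely modeled on a key-column-value store.

```python
from typing import Dict, List, Protocol


class StorageBackend(Protocol):
    """Hypothetical storage interface: rows of column -> value pairs."""

    def put(self, key: str, column: str, value: str) -> None: ...

    def get_row(self, key: str) -> Dict[str, str]: ...


class InMemoryBackend:
    """Stand-in backend; a DynamoDB-backed class would expose the same methods."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, str]] = {}

    def put(self, key: str, column: str, value: str) -> None:
        self._rows.setdefault(key, {})[column] = value

    def get_row(self, key: str) -> Dict[str, str]:
        return dict(self._rows.get(key, {}))


class GraphLayer:
    """Graph logic that knows nothing about how rows are persisted."""

    def __init__(self, store: StorageBackend) -> None:
        self._store = store

    def add_edge(self, src: str, label: str, dst: str) -> None:
        # One row per source vertex, one column per outgoing edge.
        self._store.put(src, f"{label}:{dst}", dst)

    def neighbours(self, src: str, label: str) -> List[str]:
        prefix = f"{label}:"
        row = self._store.get_row(src)
        return sorted(v for col, v in row.items() if col.startswith(prefix))


graph = GraphLayer(InMemoryBackend())
graph.add_edge("user:1", "uses_device", "device:9")
graph.add_edge("user:1", "uses_device", "device:3")
graph.add_edge("user:1", "paid_with", "card:7")
print(graph.neighbours("user:1", "uses_device"))
```

Because GraphLayer depends only on put and get_row, swapping the in-memory store for a different backend would not touch the graph logic, which is the property the pluggable design is meant to provide.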
The migration yielded significant improvements: superior query performance (especially P99 latency reduction), enhanced system stability (no more manual reboots, faster incident response), and robust scalability (10x write QPS compared to the previous vendor solution).