Dropbox Tech · January 28, 2026

Dropbox Dash's Context Engine: Index-Based Retrieval, Knowledge Graphs, and LLM-Powered Ranking for Enterprise Search

Dropbox Dash uses a context engine to unify and search enterprise content across many third-party applications. The system combines index-based retrieval, detailed content understanding, and knowledge graphs to enrich data and improve search relevance. Key architectural decisions include choosing pre-processing at ingestion over federated retrieval, using an LLM as a judge of relevance, and optimizing prompts with DSPy.


Architecting Dropbox Dash's Context Engine

Dropbox Dash addresses the challenge of fragmented enterprise content by building a centralized context engine. This engine ingests data from numerous third-party applications via custom connectors, normalizing and enriching it for unified search and AI-driven queries. The core architectural decision revolves around index-based retrieval, prioritizing pre-processing at ingestion time over on-the-fly federated retrieval for improved performance, data enrichment, and access to company-wide content.

Content Ingestion and Understanding

  • Connectors: Custom crawlers handle diverse third-party APIs, accounting for unique rate limits, API quirks, and ACL/permission systems.
  • Normalization: Files are converted into a standardized format like Markdown.
  • Extraction & Enrichment: Key information (titles, metadata, links, embeddings) is extracted. Multimodal understanding is crucial for complex content types like images (CLIP-based models), PDFs, audio (transcription), and video (scene understanding, embedding generation).
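The connector-to-index flow above can be sketched as a small normalization step. This is an illustrative sketch only: the `Document` schema, field names, and `to_markdown` stub are assumptions, not Dash's actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Normalized representation of a third-party item (hypothetical schema)."""
    source: str                 # e.g. "confluence", "gdrive"
    doc_id: str
    title: str
    body_markdown: str
    acl: list = field(default_factory=list)    # principals allowed to see this doc
    metadata: dict = field(default_factory=dict)

def to_markdown(content: str) -> str:
    # Placeholder: a real pipeline converts HTML, PDFs, images, audio, and
    # video into text/Markdown plus embeddings during enrichment.
    return content

def ingest(raw_item: dict, source: str) -> Document:
    """Map one raw connector payload into the shared format.

    Each real connector handles its own API shape, rate limits, pagination,
    and ACL model before producing this normalized record.
    """
    return Document(
        source=source,
        doc_id=raw_item["id"],
        title=raw_item.get("title", "(untitled)"),
        body_markdown=to_markdown(raw_item.get("content", "")),
        acl=raw_item.get("permissions", []),
        metadata={"url": raw_item.get("url")},
    )
```

Normalizing at ingestion time is what lets every downstream stage (extraction, embedding, ranking) operate on one format rather than per-app payloads.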

Knowledge Graph Construction

After initial content understanding, Dropbox models relationships between pieces of information using knowledge graphs. This cross-app intelligence is vital for providing richer context: for instance, a meeting invite can be connected to related documents, attendees, and project-management tasks. A significant insight is the creation of "knowledge bundles" (summaries of graphs that are then indexed) rather than relying solely on a traditional graph database, which avoids latency and query-pattern challenges at search time. These bundles are processed through the same index pipeline as other content, generating both lexical and semantic embeddings.
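A knowledge bundle can be thought of as flattening a small cross-app subgraph into a plain text document that the existing index pipeline already knows how to handle. The sketch below uses the meeting example from above; all field names and the output format are illustrative assumptions.

```python
def build_knowledge_bundle(meeting: dict, docs: list, attendees: list, tasks: list) -> str:
    """Flatten a meeting-centered subgraph into one indexable text 'bundle'.

    The bundle is then ingested like any other document (BM25 terms plus a
    dense embedding), so search-time queries never touch a graph database.
    """
    lines = [f"Meeting: {meeting['title']} on {meeting['date']}"]
    lines.append("Attendees: " + ", ".join(attendees))
    for d in docs:
        lines.append(f"Related document: {d['title']} ({d['url']})")
    for t in tasks:
        lines.append(f"Tracked task: {t['title']} [{t['status']}]")
    return "\n".join(lines)
```

Because the bundle reuses the ordinary index pipeline, relationship context becomes retrievable with the same lexical and semantic machinery as raw documents.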

💡

Index-Based vs. Federated Retrieval

When designing a unified search or context engine, a critical architectural decision is between federated (on-the-fly processing) and index-based (pre-processed at ingestion) retrieval. Index-based retrieval offers faster query times, enriched data, and broader access to content but requires significant upfront engineering effort and robust ingestion pipelines to manage freshness and cost. Federated retrieval is simpler to start but sacrifices performance, comprehensive content access, and sophisticated ranking due to reliance on external API performance and token limitations.

Retrieval and Ranking

  • Data Stores: A hybrid approach is used, combining a lexical index (BM25) with a vector store for dense embeddings.
  • Multi-pass Ranking: Retrieved results undergo multiple ranking passes, incorporating personalization and access control (ACLs).
  • LLM as a Judge: An LLM evaluates retrieval quality and synthesizes information for improved relevance, moving beyond human-click metrics common in traditional search.
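The hybrid lexical-plus-vector retrieval with an ACL-aware ranking pass can be sketched in a few lines. This is a toy illustration: the BM25 implementation is textbook-minimal, the score-fusion weight `alpha` and the boolean `acl_mask` are assumptions, and production systems would use real index and vector-store services.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Minimal BM25 over pre-tokenized docs (illustrative, not production)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                       # document frequency per term
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def hybrid_rank(lexical, semantic, acl_mask, alpha=0.5):
    """Fuse lexical and semantic scores, then drop docs the user cannot see."""
    fused = [alpha * l + (1 - alpha) * s for l, s in zip(lexical, semantic)]
    ranked = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return [i for i in ranked if acl_mask[i]]     # ACL pass after ranking
```

Later passes (personalization, the LLM-as-judge step) would reorder or filter this candidate list further rather than replace it.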

Optimizing LLM Usage and MCP

To mitigate LLM context-window limits and slow Model Context Protocol (MCP) tool calls, Dropbox implemented several optimizations. They introduced "super tools" that consolidate multiple retrieval tools into one, significantly reducing token usage; knowledge graphs also help by supplying concise, relevant context. Tool results are stored locally, outside the LLM context window, and a classifier routes complex queries to sub-agents with narrower toolsets. DSPy is used for prompt optimization to improve LLM effectiveness.
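Two of these ideas, the consolidated "super tool" and keeping tool results outside the context window, can be combined in one sketch. Everything here is hypothetical: the backend lambdas, the in-memory result store, and the handle format are stand-ins for whatever Dash actually uses.

```python
# Hypothetical per-source search backends; real ones would hit the index.
BACKENDS = {
    "files":    lambda q: [f"file match for {q!r}"],
    "calendar": lambda q: [f"event match for {q!r}"],
    "tasks":    lambda q: [f"task match for {q!r}"],
}

TOOL_RESULT_STORE = {}   # full payloads live here, outside the LLM context

def super_search(query, sources=("files", "calendar", "tasks")):
    """One consolidated 'super tool' replacing per-source retrieval tools.

    Exposing a single tool schema keeps the tool-definition token cost
    roughly constant as sources are added, and returning only a handle
    plus counts keeps large payloads out of the context window.
    """
    results = {src: BACKENDS[src](query) for src in sources}
    handle = f"result:{len(TOOL_RESULT_STORE)}"
    TOOL_RESULT_STORE[handle] = results              # stored locally...
    counts = {src: len(hits) for src, hits in results.items()}
    return {"handle": handle, "counts": counts}      # ...only a stub returned
```

A sub-agent (or a follow-up tool call) can later dereference the handle to fetch just the slice of results it needs.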

Tags: RAG · Knowledge Graph · LLM · Search Engine · Vector Database · BM25 · Content Ingestion · API Integration
