Cloudflare Blog·May 28, 2026

Building Cloudflare's Unified Data Lakehouse and AI Data Agent

Cloudflare tackled data sprawl by creating Town Lake, a unified data lakehouse built on Apache Trino and Iceberg on R2, providing a single SQL interface for diverse data sources. They also developed Skipper, an AI data agent for natural language querying, emphasizing governed access, PII detection, and Cloudflare's own platform services for infrastructure. This architecture addresses challenges like disparate data systems, sampling issues, and tribal knowledge, enabling comprehensive and secure data insights.


The Challenge: Data Sprawl at Cloudflare

Cloudflare processes over a billion events per second, producing enormous volumes of data spread across dozens of systems such as Postgres, ClickHouse, Kafka, and BigQuery. This created significant challenges for data access and analysis:

* Disparate Systems: Engineers needed to query multiple databases with different credentials, languages, and retention policies to get a complete picture.
* Sampled Data vs. Accuracy: Existing analytics pipelines downsampled data for dashboards, which was unsuitable for critical functions like billing or security investigations requiring full fidelity.
* External Dependencies: Reliance on external cloud vendors for internal reporting increased costs and introduced critical dependencies.
* Tribal Knowledge: Finding the right data required deep, often undocumented, knowledge of specific table locations, schemas, and join conditions, making self-service nearly impossible.

To overcome these issues, Cloudflare aimed to build a unified platform providing fresh, accurate, unsampled data where needed, fast downsampled data for exploration, and robust security and governance.

Town Lake: Cloudflare's Data Lakehouse Architecture

Town Lake is designed as a data lakehouse, combining the benefits of data lakes (cost-effective storage, schema flexibility) with data warehouses (structured querying, ACID transactions). It leverages Cloudflare's own platform components extensively.


Key Architectural Components

The platform integrates several critical services to achieve its goals:

* Query Engine: Apache Trino acts as the federated query engine, enabling single SQL queries to join data across Postgres, ClickHouse, and Iceberg tables on R2 without intermediate materialization.
* Data Catalog: R2 Data Catalog, powered by Apache Iceberg, manages cold and warm data on R2. Iceberg provides schema evolution, time travel, partition evolution, and efficient data compaction (e.g., per-minute data aging to hourly, then daily) to reduce storage costs while maintaining queryability.
* Metadata Catalog: DataHub stores comprehensive metadata for every table, column, owner, lineage, and glossary term, making data discoverable and understandable.
* Access Control: Lifeguard manages access rules in D1, dynamically pulling user/group memberships to generate JSON policies for Trino. Users without permission are denied early, before a query runs.
* PII Detection: Skimmer is a continuous PII scanner using Workers AI to classify column contents, feeding findings into DataHub and Lifeguard for review and policy enforcement.
* ELT Engine: Transformer, built on Workflows, manages SQL-based ELT jobs, compiling DAGs for execution on Trino with state managed by Durable Objects and definitions in R2.
* Ingestion: An orchestrator on Kubernetes extracts data from operational systems (Postgres, ClickHouse), transforms it to Parquet, and loads it into R2 as Iceberg tables, supporting full-replace or incremental-append models.
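To make the "compiling DAGs" step of an ELT engine more concrete, here is a minimal sketch of ordering SQL jobs so that each one runs after its dependencies. It is not Transformer's actual implementation: the job names, the `deps` and `sql` fields, the catalog and table names, and the `compile_execution_order` helper are all hypothetical, and the ordering uses Python's standard `graphlib` module rather than Workflows.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical job definitions: each job lists the jobs it depends on.
# Names and SQL are illustrative only, not taken from Cloudflare's system.
JOBS = {
    "stg_requests": {"deps": [], "sql": "SELECT * FROM clickhouse.logs.requests"},
    "stg_accounts": {"deps": [], "sql": "SELECT * FROM postgres.public.accounts"},
    "hourly_usage": {
        "deps": ["stg_requests", "stg_accounts"],
        "sql": "SELECT account_id, count(*) FROM stg_requests GROUP BY 1",
    },
    "daily_usage": {
        "deps": ["hourly_usage"],
        "sql": "SELECT account_id, sum(n) FROM hourly_usage GROUP BY 1",
    },
}


def compile_execution_order(jobs):
    """Return job names ordered so every dependency comes before its dependents."""
    graph = {name: set(spec["deps"]) for name, spec in jobs.items()}

    # Reject references to jobs that were never defined.
    unknown = {dep for deps in graph.values() for dep in deps} - graph.keys()
    if unknown:
        raise ValueError(f"unknown dependencies: {sorted(unknown)}")

    try:
        # TopologicalSorter expects a mapping of node -> predecessors.
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # The detected cycle is the second element of exc.args.
        raise ValueError(f"dependency cycle: {exc.args[1]}") from exc


if __name__ == "__main__":
    for job in compile_execution_order(JOBS):
        print(job)
```

A real engine would submit each job's SQL to Trino in this order and persist per-job state between steps; the sketch covers only the ordering and validation.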

Governance by Construction: Default-Closed Security

Unlike traditional approaches of "open by default, restrict by exception," Town Lake implements a "default-closed" security model. New tables are inaccessible until reviewed and approved by humans, based on Skimmer's PII detection. This process is largely automated, and users are guided through self-serve permission requests, improving both security and user experience. Sensitive columns are redacted by default in query results, with opt-in PII access requiring explicit permission and logging for auditing purposes.
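The default-closed model can be illustrated with a small sketch. It assumes a hypothetical in-memory `TablePolicy` record and `read_rows` helper; Lifeguard's real rules live in D1 and are enforced through Trino policies, so every name and field below is an assumption made for illustration only.

```python
from dataclasses import dataclass, field

REDACTED = "[REDACTED]"


@dataclass
class TablePolicy:
    """Hypothetical per-table policy; every default is the closed state."""
    approved: bool = False                           # new tables start inaccessible
    readers: set = field(default_factory=set)        # users allowed to query
    pii_columns: set = field(default_factory=set)    # columns flagged as PII
    pii_readers: set = field(default_factory=set)    # users with opt-in PII access


def read_rows(policies, table, user, rows, want_pii=False, audit_log=None):
    """Return rows for `user`, denying by default and redacting PII columns."""
    policy = policies.get(table)
    # Unknown table, unreviewed table, or user not granted: deny early.
    if policy is None or not policy.approved or user not in policy.readers:
        raise PermissionError(f"{user} has no access to {table}")

    # Raw PII needs both an explicit request and an explicit grant.
    if want_pii and user in policy.pii_readers:
        if audit_log is not None:
            audit_log.append((user, table, sorted(policy.pii_columns)))
        return [dict(row) for row in rows]

    # Everyone else sees sensitive columns redacted.
    return [
        {col: (REDACTED if col in policy.pii_columns else val) for col, val in row.items()}
        for row in rows
    ]
```

The point of the design is that forgetting to configure something fails safe: a table with no policy, or a policy left at its defaults, yields a denial instead of data.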

Skipper: The AI Data Agent

Skipper is Cloudflare's conversational AI agent built on top of Town Lake and Cloudflare's Workers AI platform. It allows users to query data using natural language, translating questions into SQL, executing them via Trino, and presenting results in tables or charts. Skipper maintains context for follow-up questions and can self-correct queries based on results. It leverages multiple layers of grounded context, including schema and usage metadata from DataHub, and human annotations, to prevent hallucinations and ensure accurate query generation.
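The generate, execute, and self-correct cycle described above might look roughly like the following sketch. It is not Skipper's implementation: `answer_question`, the two callables, and the stub model and executor are hypothetical stand-ins, and the loop only retries on query errors, whereas the source says Skipper also corrects itself based on results.

```python
def answer_question(question, context, generate_sql, run_query, max_attempts=3):
    """Generate SQL, run it, and feed any error back into the next attempt.

    `generate_sql(question, context, feedback)` stands in for a model call and
    `run_query(sql)` for a query-engine client; both are assumptions.
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        sql = generate_sql(question, context, feedback)
        try:
            rows = run_query(sql)
        except Exception as exc:
            # Keep the error so the next generation attempt can react to it.
            feedback = f"Query failed: {exc}"
            continue
        return {"sql": sql, "rows": rows, "attempts": attempt}
    raise RuntimeError(
        f"no working query after {max_attempts} attempts; last error: {feedback}"
    )


def fake_generate_sql(question, context, feedback):
    # Stub model: the first guess uses a wrong column name; once it receives
    # failure feedback, it switches to the correct column.
    if feedback is None:
        return "SELECT zone, SUM(reqs) FROM usage GROUP BY zone"
    return "SELECT zone, SUM(requests) FROM usage GROUP BY zone"


def fake_run_query(sql):
    # Stub executor: rejects the unknown column, otherwise returns fixed rows.
    if "reqs" in sql:
        raise ValueError("column 'reqs' cannot be resolved")
    return [("example.com", 42)]


if __name__ == "__main__":
    result = answer_question(
        "Which zones got the most requests?", {}, fake_generate_sql, fake_run_query
    )
    print(result)
```

In a grounded setup, `context` would carry the schema, usage metadata, and human annotations mentioned above, so the model starts from real table and column names instead of guesses. Capping the number of attempts keeps a persistently wrong query from looping forever.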

