Menu
Pinterest Engineering·March 6, 2026

Pinterest's Unified Context-Intent Embeddings for Scalable Text-to-SQL Analytics

This article details Pinterest's architectural evolution of its Text-to-SQL system into an Analytics Agent, designed to handle over 100,000 tables and 2,500 users. It focuses on using unified context-intent embeddings for semantic retrieval and leveraging structural/statistical patterns with governance-aware ranking to generate trustworthy and validated SQL queries from natural language, addressing the complexities of large-scale data warehouses.

Read original on Pinterest Engineering

Architecting a Scalable Text-to-SQL Analytics Agent

Pinterest tackled the challenge of building a reliable Text-to-SQL system for a massive data warehouse (over 100,000 tables) by moving beyond simple keyword matching. Their solution, an "Analytics Agent," integrates advanced AI and data governance principles to assist analysts in discovering tables, reusing queries, and generating validated SQL. This system represents a significant architectural shift from basic RAG-based approaches to a more sophisticated, intent-driven knowledge retrieval and query generation mechanism.

Foundation: Data Governance and AI-Ready Data

Before developing the AI agent, Pinterest invested heavily in data governance. They reduced their data warehouse footprint from 400K to 100K tables through a rigorous tiering program (Tier 1 for production-quality, Tier 2 for team-owned, Tier 3 for temporary/legacy). This process, documented in PinCat (based on DataHub), ensured tables had clear ownership, documentation, and quality standards, making the data warehouse manageable and suitable for AI-driven processes. This highlights that robust data governance is a prerequisite for effective AI/ML in large-scale data environments.

Two Pillars: Context-Intent Embeddings and Structural/Statistical Patterns

The core of Pinterest's Analytics Agent relies on two complementary dimensions to encode analytical knowledge from historical SQL queries:

  • Unified Context-Intent Embeddings: Transforms SQL queries into rich natural-language descriptions capturing the business question or "analytical intent." This involves a three-step pipeline: context injection (adding Pinterest-specific semantic info from PinCat), SQL-to-Text (using an LLM to generate summaries, analytical questions, and detailed breakdowns, stripping temporal specifics for generalizability), and Text-to-Embedding (converting descriptions into vector representations for semantic retrieval).
  • Structural & Statistical Patterns with Governance-Aware Ranking: Extracts hard facts from query history, including join patterns, common filters, aggregation logic, and subquery structures. It also aggregates statistical signals like table co-occurrence, query success rates, usage recency/volume, and author expertise. These are combined with governance metadata (table tiers, freshness) to provide a "governance-aware ranking," ensuring the system surfaces not just relevant but also trustworthy and validated solutions.
💡

System Design Takeaway

This architecture demonstrates how combining deep semantic understanding (intent embeddings) with pragmatic, empirically validated structural knowledge (query patterns) and strong data governance leads to a more reliable and scalable Text-to-SQL system. The separation of "what" (intent) from "how" (patterns) is a key design principle.

These two dimensions work together: intent embeddings enable flexible semantic search, while structural patterns provide concrete, validated SQL building blocks. When an analyst asks a question, the system uses intent to retrieve relevant historical queries and then leverages the associated structural patterns and governance signals to construct a highly reliable and performant SQL query.

Key Architectural Decisions

  • LLM for SQL-to-Text Translation: Instead of directly querying LLMs with schema, Pinterest uses LLMs to convert historical SQL into structured, generalizable natural language descriptions of intent and logic. This pre-processing step creates a reusable knowledge base.
  • Semantic Retrieval: Leveraging vector embeddings to match user questions with past analytical intents, overcoming limitations of keyword-based search that struggle with variations in phrasing or schema.
  • Governance-Aware Ranking: Integrating data governance metadata (e.g., table tiers, ownership, freshness) into the ranking algorithm for retrieved results, ensuring that the generated SQL is not just semantically relevant but also uses authoritative and reliable data sources. This is crucial for trustworthiness at scale.
  • Pattern Extraction: Automatically deriving validated join keys, filters, and aggregation logic from successful historical queries to guide future query generation, minimizing the risk of incorrect or inefficient SQL.
Text-to-SQLLLMRAGData GovernanceVector EmbeddingsSemantic SearchAnalyticsData Warehouse

Comments

Loading comments...