Pinterest Engineering · January 13, 2026

Pinterest's PinLanding: Building a Multimodal AI System for Shopping Collection Generation

Pinterest's PinLanding system tackles the challenge of generating precise, navigable shopping collections from billions of products at web scale. It leverages multimodal AI, including Vision-Language Models (VLMs) and LLMs, in a content-first pipeline that understands user intent and assigns attributes to products. The system relies on scalable batch inference with Ray and distributed computation with Spark for efficient collection construction.


PinLanding is a system designed to automatically generate shopping collections from Pinterest's catalog of billions of items. It addresses a common challenge in large-scale e-commerce and social discovery platforms: organizing products into structured, browsable collections. Rather than relying on user search history and manual curation, the system takes a content-first approach, using multimodal AI to generate collections directly from product content while still aligning with user search patterns.

Architecture Overview

The PinLanding pipeline is structured around four main components:

  1. Understanding user search patterns to characterize shopping intent and identify demand gaps.
  2. Building a refined shopping collection vocabulary using multimodal LLMs and LLM-as-judge for quality control.
  3. Constructing product feeds from the derived attributes.
  4. Continuously evaluating and evolving the system for AI-native search behavior and improved performance.

Multimodal Attribute Generation and Curation

To create rich product representations, each product (Pin) is treated as a multimodal tuple of image and metadata. An initial Vision-Language Model (VLM) generates candidate attributes. To combat sparsity and redundancy in raw VLM outputs, a curation pipeline is implemented. This involves statistical filtering by frequency, embedding-based clustering to merge similar attributes, and an LLM-as-judge step to filter and rank attributes for searchability and semantic coherence. This process converts a long tail of raw VLM outputs into a compact, high-quality attribute vocabulary.
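The frequency-filtering and clustering stages of this curation pipeline can be sketched as follows. This is a minimal illustration, not Pinterest's implementation: `toy_embed` is a hypothetical stand-in for a real text-embedding model, and the greedy merge is one simple way to collapse near-duplicate attributes.

```python
from collections import Counter

def curate_attributes(raw_attrs, embed, min_count=3, sim_threshold=0.9):
    """Collapse a long tail of raw VLM attribute strings into a compact vocabulary.

    raw_attrs: list of attribute strings emitted by the VLM (with repeats).
    embed: callable mapping a string to a unit-norm embedding vector
           (a stand-in for a real text-embedding model).
    """
    # 1) Statistical filtering: drop attributes seen fewer than min_count times.
    counts = Counter(raw_attrs)
    frequent = [a for a, c in counts.most_common() if c >= min_count]

    # 2) Embedding-based clustering: greedily merge near-duplicate attributes.
    #    The most frequent phrasing in each cluster becomes the canonical form.
    canonical = []  # (attribute, embedding) pairs kept as cluster centers
    for attr in frequent:
        v = embed(attr)
        if all(sum(x * y for x, y in zip(v, u)) < sim_threshold for _, u in canonical):
            canonical.append((attr, v))
    return [a for a, _ in canonical]

# Hypothetical toy embedding: bag-of-words over a tiny fixed vocabulary,
# so "red dress" and "dress, red" map to the same vector.
VOCAB = ["red", "dress", "blue", "shirt"]
def toy_embed(s):
    words = set(s.replace(",", "").split())
    v = [1.0 if w in words else 0.0 for w in VOCAB]
    n = sum(x * x for x in v) ** 0.5 or 1.0
    return [x / n for x in v]

raw = ["red dress"] * 5 + ["dress, red"] * 4 + ["blue shirt"] * 3 + ["odd artifact"]
print(curate_attributes(raw, toy_embed))  # ['red dress', 'blue shirt']
```

In the real pipeline an LLM-as-judge step would then filter and rank the surviving attributes for searchability; that stage is omitted here.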


Scaling Attribute Assignment with Dual-Encoder Models

Instead of running expensive VLM inference for every product, PinLanding employs a CLIP-style dual-encoder model. One encoder processes product image/text to generate a product embedding, and another processes attribute phrases to generate attribute embeddings in the same space. During inference, all products and attributes are embedded once, and attributes are assigned based on embedding similarity. This significantly reduces computational cost and improves attribute consistency across the catalog.
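Because both encoders project into the same space, attribute assignment reduces to a single similarity computation over precomputed embeddings. A minimal sketch, assuming L2-normalized embeddings so the dot product equals cosine similarity (the threshold and top-k values are illustrative, not Pinterest's):

```python
import numpy as np

def assign_attributes(product_embs, attr_embs, attr_names, threshold=0.6, top_k=3):
    """Assign curated attributes to products by cosine similarity in a shared
    embedding space, as produced by a CLIP-style dual encoder.

    product_embs: (n_products, d) array from the product encoder.
    attr_embs:    (n_attrs, d) array from the attribute encoder.
    Both are assumed L2-normalized, so the dot product is cosine similarity.
    """
    sims = product_embs @ attr_embs.T            # one matmul scores the whole batch
    assignments = []
    for row in sims:
        top = np.argsort(row)[::-1][:top_k]      # best-scoring attributes first
        assignments.append([attr_names[i] for i in top if row[i] >= threshold])
    return assignments

# Toy 2-D embeddings: one product strongly "red", weakly "dress"; one purely "dress".
products = np.array([[0.8, 0.6], [0.0, 1.0]])
attrs = np.array([[1.0, 0.0], [0.0, 1.0]])
print(assign_attributes(products, attrs, ["red", "dress"]))
# [['red', 'dress'], ['dress']]
```

The key cost saving is that each product and each attribute is encoded exactly once; no per-pair VLM call is needed, and the matmul scales to the full catalog.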

Scalable Batch Inference and Feed Construction

Operating at the scale of millions of Pins and candidate topics requires a robust distributed infrastructure. PinLanding uses Ray for scalable batch inference of attribute assignments. The inference pipeline is a streaming job with distinct stages: data loading/preprocessing (CPU cluster), ML inference (GPU pool for CLIP-based classifier), and feed construction. This allows for independent scaling of CPU-bound and GPU-bound tasks and ensures efficient resource utilization. Apache Spark is then used for distributed computation to construct product feeds based on attribute matching and relevance scoring. Candidate joins are optimized through attribute-based partitioning and pre-filters.
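The attribute-partitioned join at the heart of feed construction can be illustrated in plain Python. This is a single-machine sketch of the idea, not the Spark job itself: product/collection shapes, the `min_score` pre-filter, and the ranking rule are all illustrative assumptions.

```python
from collections import defaultdict

def build_feeds(products, collections, min_score=0.5):
    """Construct collection feeds by joining products to collections on shared
    attributes, mirroring an attribute-partitioned join with pre-filters.

    products:    list of (product_id, {attribute: relevance_score}) pairs.
    collections: list of (collection_name, required_attribute) pairs.
    """
    # Pre-partition products by attribute, dropping low-relevance matches,
    # so each collection scans only its partition instead of the full catalog.
    by_attr = defaultdict(list)
    for pid, attrs in products:
        for attr, score in attrs.items():
            if score >= min_score:  # relevance pre-filter before the join
                by_attr[attr].append((pid, score))

    # Join each collection to its partition and rank by relevance score.
    feeds = {}
    for name, attr in collections:
        feeds[name] = [pid for pid, _ in sorted(by_attr[attr], key=lambda t: -t[1])]
    return feeds

products = [
    ("p1", {"red dress": 0.9}),
    ("p2", {"red dress": 0.4, "blue shirt": 0.8}),
    ("p3", {"blue shirt": 0.7}),
]
collections = [("Red Dresses", "red dress"), ("Blue Shirts", "blue shirt")]
print(build_feeds(products, collections))
# {'Red Dresses': ['p1'], 'Blue Shirts': ['p2', 'p3']}
```

In the production system the same partition-then-join-then-rank shape runs as a distributed Spark computation, with Ray handling the upstream GPU inference that produces the attribute scores.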

Tags: multimodal AI, LLMs, VLM, batch inference, Ray, Spark, e-commerce, shopping collections, information retrieval, scalable systems
