Pinterest's PinLanding system tackles the challenge of generating precise, navigable shopping collections from billions of products at web scale. It leverages multimodal AI, including Vision-Language Models (VLMs) and LLMs, to create a content-first pipeline that understands user intent and assigns attributes to products. The system emphasizes scalable batch inference using Ray and distributed computation with Spark for efficient collection construction.
PinLanding is a system designed to automatically generate shopping collections from Pinterest's catalog of billions of items. It addresses a common challenge in large-scale e-commerce and social discovery platforms: organizing products into structured, browsable collections. The system shifts from traditional user search history and manual curation to a content-first approach, leveraging multimodal AI to generate collections directly from product content while still aligning with user search patterns.
The PinLanding pipeline is structured around four main components:

- Attribute generation and curation with a Vision-Language Model (VLM)
- Scalable attribute assignment with a CLIP-style dual-encoder model
- Batch inference with Ray
- Distributed feed construction with Spark
To create rich product representations, each product (Pin) is treated as a multimodal tuple of image and metadata. An initial Vision-Language Model (VLM) generates candidate attributes. To combat sparsity and redundancy in raw VLM outputs, a curation pipeline is implemented. This involves statistical filtering by frequency, embedding-based clustering to merge similar attributes, and an LLM-as-judge step to filter and rank attributes for searchability and semantic coherence. This process converts a long tail of raw VLM outputs into a compact, high-quality attribute vocabulary.
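The curation steps above can be sketched in miniature. The sketch below implements the frequency filter and the near-duplicate merge; the embedding-based clustering is stood in for by a synonym map, and the LLM-as-judge ranking step is omitted. All function names, thresholds, and data are illustrative, not Pinterest's actual implementation.

```python
from collections import Counter

def curate_vocabulary(raw_attributes, min_freq=3, synonyms=None):
    """Reduce a long tail of raw VLM attribute strings to a compact vocabulary.

    Implements two of the curation steps: statistical filtering by
    frequency, then merging near-duplicate attributes. In production the
    merge would use embedding-based clustering; a synonym map stands in
    for it here. Names and thresholds are illustrative assumptions.
    """
    # Step 1: drop rare attributes (combats sparsity in raw VLM output).
    counts = Counter(a.strip().lower() for a in raw_attributes)
    frequent = {a: c for a, c in counts.items() if c >= min_freq}

    # Step 2: fold near-duplicates into a canonical attribute phrase.
    synonyms = synonyms or {}
    merged = Counter()
    for attr, c in frequent.items():
        merged[synonyms.get(attr, attr)] += c

    # Step 3 (not shown): an LLM-as-judge would rank `merged`
    # for searchability and semantic coherence.
    return [a for a, _ in merged.most_common()]
```

For example, with `min_freq=3`, a one-off attribute is dropped and "bohemian dress" can be folded into "boho dress" before ranking by combined frequency.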
Scaling Attribute Assignment with Dual-Encoder Models
Instead of running expensive VLM inference for every product, PinLanding employs a CLIP-style dual-encoder model. One encoder processes product image/text to generate a product embedding, and another processes attribute phrases to generate attribute embeddings in the same space. During inference, all products and attributes are embedded once, and attributes are assigned based on embedding similarity. This significantly reduces computational cost and improves attribute consistency across the catalog.
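The inference step amounts to a single matrix multiply once both sides are embedded. A minimal sketch, assuming L2-normalized embeddings (so the dot product is cosine similarity); the threshold and top-k cutoff are illustrative assumptions, not values from the source:

```python
import numpy as np

def assign_attributes(product_embs, attr_embs, attr_names, threshold=0.3, top_k=5):
    """Assign attributes to products by embedding similarity.

    product_embs: (P, D) product embeddings from the image/text encoder.
    attr_embs:    (A, D) attribute-phrase embeddings from the attribute encoder.
    Both are assumed L2-normalized, so the dot product is cosine similarity.
    `threshold` and `top_k` are hypothetical cutoffs for illustration.
    """
    sims = product_embs @ attr_embs.T              # (P, A) similarity matrix
    assignments = []
    for row in sims:
        idx = np.argsort(-row)[:top_k]             # best-scoring attributes first
        assignments.append([attr_names[i] for i in idx if row[i] >= threshold])
    return assignments
```

Because every product and every attribute is embedded exactly once, the per-product cost is a vector dot product rather than a VLM forward pass, which is what makes catalog-wide assignment tractable.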
Operating at the scale of millions of Pins and candidate topics requires a robust distributed infrastructure. PinLanding uses Ray for scalable batch inference of attribute assignments. The inference pipeline is a streaming job with distinct stages: data loading/preprocessing (CPU cluster), ML inference (GPU pool for CLIP-based classifier), and feed construction. This allows for independent scaling of CPU-bound and GPU-bound tasks and ensures efficient resource utilization. Apache Spark is then used for distributed computation to construct product feeds based on attribute matching and relevance scoring. Candidate joins are optimized through attribute-based partitioning and pre-filters.
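The three stages above (data loading/preprocessing, ML inference, feed construction) can be sketched as independent functions. In the real system each stage runs on its own resource pool (CPU cluster, GPU pool, Spark cluster); the single-process stand-in below only illustrates the data flow, with all record shapes and names being illustrative assumptions:

```python
def preprocess(records):
    """CPU stage stand-in: decode and normalize raw product records."""
    for r in records:
        yield {"id": r["id"], "text": r["title"].lower()}

def infer(batch, vocabulary):
    """GPU stage stand-in: in production this is the CLIP-based classifier
    scoring each product against the attribute vocabulary; here we fake it
    with substring matching."""
    return [{"id": rec["id"],
             "attrs": [a for a in vocabulary if a in rec["text"]]}
            for rec in batch]

def build_feeds(scored):
    """Feed-construction stage: group products by matched attribute --
    the join Spark performs at scale with attribute-based partitioning
    and pre-filters."""
    feeds = {}
    for rec in scored:
        for a in rec["attrs"]:
            feeds.setdefault(a, []).append(rec["id"])
    return feeds
```

Separating the stages this way is what lets CPU-bound preprocessing and GPU-bound inference scale independently: a slow GPU pool can be widened without over-provisioning the CPU cluster, and vice versa.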