InfoQ Architecture · June 8, 2026

Pinterest's Data-Driven URL Deduplication System: MIQPS

This article details Pinterest's Minimal Important Query Param Set (MIQPS) system, which deduplicates URLs at scale within the company's content ingestion pipeline. MIQPS compares content fingerprints to decide which URL query parameters determine page identity and which are noise, reducing the cost of processing duplicate content across millions of diverse domains. The design also illustrates the trade-off between expensive offline analysis for rule generation and cheap runtime application of those rules in large-scale distributed systems.



At the scale of Pinterest, ingesting content from millions of domains presents a significant challenge: URL deduplication. Many URLs, especially from merchant and publisher websites, can point to the same underlying content despite having varying query parameters (e.g., tracking IDs, session tokens). Processing each variant as unique leads to substantial, unnecessary costs for fetching, rendering, and indexing. Traditional rule-based approaches, relying on static allowlists or denylists, prove inadequate for the 'long tail' of diverse and evolving URL structures.

The MIQPS Approach to URL Normalization

Pinterest developed the Minimal Important Query Param Set (MIQPS), a data-driven system to address this. Instead of predefined rules, MIQPS infers the importance of query parameters by observing their impact on page content. This is achieved by:

Corpus Collection: Gathering a large set of URLs from ingestion pipelines.
Parameter Grouping: Grouping URLs based on query parameter patterns.
Content Fingerprinting: Rendering pages and generating a fingerprint of each page's content, so that two pages can be compared by comparing their fingerprints.
Importance Inference: Evaluating whether removing a parameter significantly changes the content fingerprint. If it does, the parameter is deemed 'important' and retained; otherwise, it is considered 'noise' and removed during normalization.
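
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Pinterest's implementation: the `render` callable, the `MISMATCH_THRESHOLD` value, the grouping key, and the use of SHA-256 over whitespace-normalized text as a fingerprint are all hypothetical choices made for the example. The source does not say how fingerprints are computed or which threshold is used.

```python
import hashlib
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

MISMATCH_THRESHOLD = 0.2  # assumed value; the source gives no number


def fingerprint(content: str) -> str:
    """Hash whitespace-normalized content (a stand-in for a real content fingerprint)."""
    normalized = " ".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_param(url: str, name: str) -> str:
    """Return the URL with every occurrence of query parameter `name` removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def group_by_pattern(urls: Iterable[str]) -> Dict[Tuple[str, Tuple[str, ...]], List[str]]:
    """Group URLs by (domain, sorted set of parameter names). The key is an assumption."""
    groups: Dict[Tuple[str, Tuple[str, ...]], List[str]] = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        names = tuple(sorted({k for k, _ in parse_qsl(parts.query, keep_blank_values=True)}))
        groups[(parts.netloc, names)].append(url)
    return dict(groups)


def infer_important_params(
    urls: List[str],
    render: Callable[[str], str],
    threshold: float = MISMATCH_THRESHOLD,
) -> Set[str]:
    """A parameter is 'important' if removing it changes the fingerprint often enough."""
    important: Set[str] = set()
    if not urls:
        return important
    names = {k for k, _ in parse_qsl(urlsplit(urls[0]).query, keep_blank_values=True)}
    for name in names:
        mismatches = sum(
            fingerprint(render(url)) != fingerprint(render(drop_param(url, name)))
            for url in urls
        )
        if mismatches / len(urls) > threshold:
            important.add(name)
    return important


def fake_render(url: str) -> str:
    """Toy renderer: page content depends only on `id`, never on `utm_source`."""
    query = dict(parse_qsl(urlsplit(url).query))
    return f"product page {query.get('id', 'home')}"


corpus = [
    "https://shop.example/item?id=1&utm_source=a",
    "https://shop.example/item?id=2&utm_source=b",
]
for (domain, pattern), group in group_by_pattern(corpus).items():
    print(domain, pattern, infer_important_params(group, fake_render))
```

With the toy renderer, removing `id` changes every page while removing `utm_source` changes none, so only `id` is kept. A real pipeline would replace `fake_render` with an actual page renderer and would need a fingerprint that tolerates incidental differences such as timestamps or rotating ads.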


Why not use Canonical Tags?

The article highlights that canonical tags, often used for SEO to indicate preferred URLs, are frequently missing, inconsistent, or polluted with tracking parameters. This makes them unreliable for automated, large-scale deduplication across a vast and varied external web.

System Architecture and Trade-offs

The MIQPS architecture separates offline analysis from runtime processing. Expensive operations such as content rendering and parameter evaluation are performed offline, and their output is an 'importance map' stored in a configuration service. At runtime, URL processing systems simply apply these precomputed rules. The design rests on the observation that URL structures evolve slowly, which makes offline computation a practical way to balance freshness, cost, and operational complexity in large-scale ingestion systems. The system also includes anomaly detection to prevent important parameters from being erroneously downgraded to noise, and it applies early-exit logic: evaluation stops once the mismatch rate exceeds a threshold within a limited number of tests.
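
The runtime half and the early-exit idea can be pictured with a short Python sketch. Everything named here is an assumption for illustration: the source says only that the importance map lives in a configuration service, so the in-memory `IMPORTANCE_MAP` dictionary keyed by domain, the `normalize_url` function, the choice to leave unknown domains untouched, and the sorting of retained parameters are all hypothetical. Likewise, `is_important_early_exit` is one plausible reading of "stop once the mismatch rate exceeds a threshold", measured against a fixed test budget.

```python
from itertools import islice
from typing import Dict, Iterable, Set
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical precomputed output of the offline job: domain -> important parameters.
IMPORTANCE_MAP: Dict[str, Set[str]] = {
    "shop.example": {"id", "variant"},
}


def normalize_url(url: str, importance_map: Dict[str, Set[str]] = IMPORTANCE_MAP) -> str:
    """Keep only parameters marked important for the URL's domain."""
    parts = urlsplit(url)
    important = importance_map.get(parts.netloc)
    if important is None:
        # Assumption: with no rules for a domain, leave the URL unchanged.
        return url
    kept = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k in important
    )
    # Sorting (an assumption) makes parameter order irrelevant to the result.
    return urlunsplit(parts._replace(query=urlencode(kept)))


def is_important_early_exit(
    mismatch_outcomes: Iterable[bool], budget: int, threshold: float
) -> bool:
    """Consume at most `budget` test outcomes, stopping as soon as the
    mismatches seen already guarantee a rate above `threshold` over the budget."""
    needed = threshold * budget
    mismatches = 0
    for outcome in islice(mismatch_outcomes, budget):
        mismatches += outcome
        if mismatches > needed:
            return True  # remaining tests cannot change the verdict
    return False


print(normalize_url("https://shop.example/item?utm_source=x&id=7&variant=red"))
print(normalize_url("https://shop.example/item?variant=red&id=7&sid=abc"))
```

Both example URLs normalize to the same string, which is what lets downstream systems treat them as one page. Because `is_important_early_exit` accepts a lazy iterable, each outcome can be backed by a page render that is only performed if the loop actually reaches it.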



Architecture Design

Design this yourself

Design a large-scale content ingestion pipeline for a platform like Pinterest or a search engine, focusing on an efficient and scalable URL deduplication component. This component should leverage a data-driven approach like MIQPS, utilizing content fingerprinting and an offline analysis workflow to identify and normalize important query parameters. Detail the architecture of both the offline rule generation process and the online runtime application, including considerations for data consistency, fault tolerance, and efficiency across millions of diverse domains.


Other design angles

· Design a web crawler's URL processing module that incorporates adaptive URL normalization, allowing for dynamic learning of query parameter importance.
· Design a system to manage and serve an 'importance map' for URL query parameters, ensuring low-latency lookups and robust anomaly detection for updates.
· Architect a generic content fingerprinting service that can be integrated into various ingestion pipelines for detecting duplicate or near-duplicate web content.
