This article details Pinterest's Minimal Important Query Param Set (MIQPS) system, designed to deduplicate URLs at scale within their content ingestion pipeline. MIQPS leverages a data-driven approach by generating content fingerprints to determine which URL query parameters are essential for page identity versus noise, significantly reducing processing overhead for duplicate content across millions of diverse domains. The system highlights the trade-offs between offline analysis for rule generation and efficient runtime application in large-scale distributed systems.
Read original on InfoQ Architecture
At the scale of Pinterest, ingesting content from millions of domains presents a significant challenge: URL deduplication. Many URLs, especially from merchant and publisher websites, can point to the same underlying content despite having varying query parameters (e.g., tracking IDs, session tokens). Processing each variant as unique leads to substantial, unnecessary costs for fetching, rendering, and indexing. Traditional rule-based approaches, relying on static allowlists or denylists, prove inadequate for the 'long tail' of diverse and evolving URL structures.
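Pinterest developed the Minimal Important Query Param Set (MIQPS), a data-driven system to address this. Instead of predefined rules, MIQPS infers the importance of query parameters by observing their impact on page content. It does so by generating content fingerprints for pages and using them to determine which query parameters are essential to page identity and which are noise.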
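The idea of inferring parameter importance from page content can be illustrated with a minimal Python sketch. It is not Pinterest's implementation: the helper names (fingerprint, drop_param, is_important), the SHA-256 hash standing in for the content fingerprint, the caller-supplied fetch function, and the 0.2 mismatch threshold are all assumptions made for illustration.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def fingerprint(content: str) -> str:
    """Hash page content. A stand-in for the real content fingerprint (assumption)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def drop_param(url: str, param: str) -> str:
    """Return the URL with every occurrence of `param` removed from the query string."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != param]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def is_important(param, sample_urls, fetch, mismatch_threshold=0.2):
    """Treat `param` as important if removing it changes the fingerprint too often.

    `fetch` is a hypothetical callable mapping a URL to its rendered content.
    """
    tested = mismatches = 0
    for url in sample_urls:
        query = dict(parse_qsl(urlsplit(url).query, keep_blank_values=True))
        if param not in query:
            continue  # this sample says nothing about the parameter
        tested += 1
        if fingerprint(fetch(url)) != fingerprint(fetch(drop_param(url, param))):
            mismatches += 1
    return tested > 0 and mismatches / tested > mismatch_threshold
```

With a fetch function whose output depends only on an id parameter, is_important would flag id as important and a tracking parameter such as utm_source as noise, because removing the latter never changes the fingerprint.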
Why not use Canonical Tags?
The article highlights that canonical tags, often used for SEO to indicate preferred URLs, are frequently missing, inconsistent, or polluted with tracking parameters. This makes them unreliable for automated, large-scale deduplication across a vast and varied external web.
The MIQPS architecture separates offline analysis from runtime processing. Expensive operations like content rendering and parameter evaluation are performed offline. The output is an 'importance map' stored in a configuration service. At runtime, URL processing systems apply these precomputed rules. This design leverages the observation that URL structures evolve slowly, making offline computation a practical trade-off for balancing freshness, cost, and operational complexity in large-scale ingestion systems. The system also includes anomaly detection to prevent important parameters from being erroneously downgraded, and it applies early exit logic to improve efficiency: evaluation stops once the mismatch rate exceeds a threshold within a limited number of tests.
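The runtime side of this split can be sketched in a few lines of Python. The shape of the importance map (a per-domain set of parameter names), the IMPORTANCE_MAP constant, the normalize function, and the choice to sort the retained parameters are assumptions for illustration; the source only states that precomputed rules are read from a configuration service and applied at runtime.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical importance map, as an offline job might publish it to a config service:
# domain -> set of query parameters that affect page identity.
IMPORTANCE_MAP = {
    "shop.example": {"id", "variant"},
}


def normalize(url: str, importance_map=IMPORTANCE_MAP) -> str:
    """Keep only important params, sorted so equivalent URLs produce the same key."""
    parts = urlsplit(url)
    important = importance_map.get(parts.hostname or "")
    if important is None:
        return url  # no rules for this domain: stay conservative and keep everything
    kept = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k in important
    )
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Because the expensive work has already happened offline, the runtime path is a dictionary lookup plus string manipulation. Two variants of the same product URL that differ only in tracking parameters or parameter order normalize to the same string and can be deduplicated before fetching.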