Pinterest Engineering·April 20, 2026

Smarter URL Normalization for Content Deduplication at Scale

Pinterest engineered the Minimal Important Query Param Set (MIQPS) algorithm to dynamically identify and strip irrelevant URL parameters so that content can be deduplicated at Pinterest's scale. The system reduces redundant processing by distinguishing parameters that affect content (e.g., a product ID) from those used purely for tracking, which improves efficiency and catalog quality. The solution combines content fingerprinting with a multi-layer normalization strategy that pairs static rules with dynamically learned ones.

Read original on Pinterest Engineering

Content deduplication is a critical challenge for platforms like Pinterest that ingest vast amounts of data from diverse sources. The core problem is recognizing that multiple URLs, often differing only by tracking parameters, can point to exactly the same content. Fetching and processing every URL variant separately wastes significant computational resources.

The MIQPS Algorithm: Dynamic Parameter Learning

The Minimal Important Query Param Set (MIQPS) algorithm addresses this by empirically determining which URL parameters are essential for content identity. It operates on the principle that if removing a parameter changes the page content, it's important; otherwise, it's noise. This analysis is performed independently for each domain, acknowledging that parameter semantics can vary widely across different merchant sites.

  1. Collect URL Corpus: Accumulate all observed URLs per domain.
  2. Group URLs by Query Parameter Pattern: Group URLs that share the same set of parameter names. This ensures parameters are evaluated in their specific context (e.g., 'ref' can be neutral on a product page but non-neutral on a comparison page).
  3. Test Each Parameter: For each parameter within a pattern, sample URLs and compare the content ID of the original URL against that of a modified URL with the parameter removed. If the content ID changes for more than a threshold T% of the sampled URLs, the parameter is classified as non-neutral.
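
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not Pinterest's implementation: the function names, the `content_id` callback, the default `threshold`, and the sampling strategy (simply taking the first few URLs of each group) are all assumptions made for the example. It also reads "above a threshold T%" as the fraction of sampled URLs whose content ID changes.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def pattern_of(url):
    """Return the sorted tuple of query parameter names found in a URL."""
    query = urlsplit(url).query
    return tuple(sorted({name for name, _ in parse_qsl(query, keep_blank_values=True)}))


def drop_param(url, name):
    """Return the URL with every occurrence of one query parameter removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def learn_important_params(urls, content_id, threshold=0.05, sample_size=50):
    """Map each parameter pattern to the parameters judged non-neutral.

    content_id is a caller-supplied (hypothetical) function that returns a
    fingerprint for a URL; threshold stands in for T as a fraction.
    """
    # Step 2: group URLs that share the same set of parameter names.
    groups = defaultdict(list)
    for url in urls:
        groups[pattern_of(url)].append(url)

    important = {}
    for pattern, members in groups.items():
        sample = members[:sample_size]  # assumed sampling: first N URLs
        important[pattern] = set()
        # Step 3: remove one parameter at a time and compare fingerprints.
        for name in pattern:
            changed = sum(
                1 for url in sample
                if content_id(url) != content_id(drop_param(url, name))
            )
            if changed / len(sample) > threshold:
                important[pattern].add(name)
    return important
```

In practice `content_id` would fetch and fingerprint each page, so a real job would cache fingerprints rather than recompute the original URL's ID for every parameter tested.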

Content ID as Ground Truth

Instead of relying on `<link rel="canonical">` tags, which are often unreliable, MIQPS uses a "content ID": a fingerprint derived from the page's rendered visual content. This keeps deduplication accurate regardless of the quality of a merchant's site metadata. For teams without a sophisticated rendering pipeline, alternatives such as DOM tree hashing or HTTP response body checksums can serve a similar purpose.
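
As an illustration of the simpler fallback mentioned above, the sketch below computes a checksum over an HTTP response body. It is not the rendered-content fingerprint that MIQPS uses; the function name `body_fingerprint` and the whitespace-collapsing step are assumptions chosen for the example.

```python
import hashlib
import re


def body_fingerprint(html: str) -> str:
    """Return a SHA-256 hex digest of an HTML body with whitespace collapsed."""
    # Collapse runs of whitespace so that formatting-only differences
    # do not produce different fingerprints.
    normalized = re.sub(r"\s+", " ", html).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

A checksum like this treats any byte-level difference as a content change, so pages that embed timestamps, session tokens, or rotating ads will probably need additional cleanup before hashing.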

System Architecture and Multi-Layer Normalization

The MIQPS system integrates into Pinterest's content processing pipeline. An offline job computes the MIQPS maps, which are then published to a configuration store. At runtime, the URL processor uses these maps, combined with static normalization rules (for well-known platforms), to normalize incoming URLs. This multi-layered strategy ensures broad coverage and efficiency.
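
A runtime normalizer that layers the two rule sources might look roughly like the following. Everything here is a hypothetical sketch: the `STATIC_TRACKING_PARAMS` set, the shape of `MIQPS_MAP`, and the choice to leave URLs with unknown patterns untouched are assumptions, not details taken from Pinterest's system. The sketch also assumes the offline job computed its patterns after applying the same static rules.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical static rule: parameters stripped on every domain.
STATIC_TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign"}

# Hypothetical learned map: domain -> parameter pattern -> parameters to keep.
MIQPS_MAP = {
    "shop.example": {("color", "id", "ref"): {"id", "color"}},
}


def normalize(url, miqps_map=MIQPS_MAP, static_drop=STATIC_TRACKING_PARAMS):
    """Apply static rules, then learned per-domain rules, to a URL."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)

    # Layer 1: static rules remove well-known tracking parameters.
    pairs = [(k, v) for k, v in pairs if k not in static_drop]

    # Layer 2: learned rules, looked up by domain and parameter pattern.
    pattern = tuple(sorted({k for k, _ in pairs}))
    keep = miqps_map.get(parts.hostname or "", {}).get(pattern)
    if keep is not None:
        pairs = [(k, v) for k, v in pairs if k in keep]

    # Sort so that parameter order alone does not create distinct URLs.
    return urlunsplit(parts._replace(query=urlencode(sorted(pairs))))
```

For example, `normalize("https://shop.example/p?ref=x&id=1&utm_source=a&color=red")` drops `utm_source` through the static layer and `ref` through the learned layer, leaving only `color` and `id`.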

