InfoQ Architecture · March 16, 2026

DoorDash's DashCLIP: Multimodal Semantic Search for E-commerce

DoorDash developed DashCLIP, a multimodal machine learning system, to enhance product discovery and ranking by aligning product images, text, and user queries in a unified embedding space. This system addresses the limitations of traditional search methods in diverse marketplaces by leveraging contrastive learning and a two-stage training pipeline. DashCLIP integrates these embeddings into a K-nearest neighbor search and downstream ranking models, significantly improving engagement metrics and serving as a foundational representation for various ML tasks.


DoorDash's marketplace, spanning groceries, retail, and pharmaceuticals, presents a significant challenge for traditional search and recommendation systems. These systems often struggle with the semantic relationships between product images, descriptions, and user intent, especially when structured metadata or historical engagement data is scarce. DashCLIP was developed to overcome these limitations by understanding and aligning information across different modalities.

DashCLIP Architecture Overview

DashCLIP is a multimodal machine learning system built on contrastive learning principles, similar to CLIP. Its core architecture consists of multiple encoders that generate vector embeddings for different data types:

  • Unimodal Encoders: Separate encoders process product images and text descriptions.
  • Multimodal Encoder: Integrates signals from both image and text to create a combined product representation.
  • Query Encoder: Maps user search queries into the same embedding space as products.
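The three-encoder layout above can be sketched in miniature. This is a toy illustration only, not DoorDash's implementation: the encoders here are stand-ins that hash their input into a small L2-normalized vector, so that all three modalities land in the same shared space. All names (`image_encoder`, `text_encoder`, etc.) and the dimension `DIM` are assumptions for the sketch.

```python
import hashlib
import math

DIM = 8  # toy embedding size; production systems use hundreds of dimensions


def _toy_vector(data: bytes) -> list[float]:
    """Deterministic pseudo-embedding: hash bytes into DIM floats, L2-normalized."""
    h = hashlib.sha256(data).digest()
    v = [b / 255.0 - 0.5 for b in h[:DIM]]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def image_encoder(image_bytes: bytes) -> list[float]:
    """Unimodal encoder for product images."""
    return _toy_vector(b"img:" + image_bytes)


def text_encoder(title: str) -> list[float]:
    """Unimodal encoder for product text."""
    return _toy_vector(b"txt:" + title.encode())


def multimodal_encoder(image_bytes: bytes, title: str) -> list[float]:
    """Combine image and text signals into one product representation
    (here: element-wise mean of the unimodal vectors, re-normalized)."""
    vi, vt = image_encoder(image_bytes), text_encoder(title)
    v = [(a + b) / 2 for a, b in zip(vi, vt)]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def query_encoder(query: str) -> list[float]:
    """Map a user search query into the same embedding space as products."""
    return _toy_vector(b"qry:" + query.encode())
```

Because every encoder emits a unit vector of the same dimension, a product and a query can be compared directly with a dot product, which is the property the real system relies on.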

Semantic Embedding Space

The goal is to place semantically related items (e.g., a query for "fresh produce" and an image of apples) close together in the shared embedding space, while pushing unrelated items apart. This allows for flexible matching even with incomplete textual descriptions or when visual attributes are key.
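"Close together" in the shared space is typically measured with cosine similarity. A minimal sketch, using hand-picked 2-D vectors purely for illustration: a query vector should score higher against a related product than an unrelated one.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


# Illustrative vectors (assumed, not real embeddings):
query_fresh_produce = [1.0, 0.0]
product_apples = [0.9, 0.1]   # semantically related -> nearby direction
product_shampoo = [0.0, 1.0]  # unrelated -> near-orthogonal direction
```

A well-trained embedding space makes `cosine_similarity(query_fresh_produce, product_apples)` large and the shampoo score small, which is exactly what lets matching succeed even when product text is incomplete.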

Two-Stage Training Pipeline

DoorDash employed a two-stage training approach to build DashCLIP:

  1. Continual pretraining: adapts pretrained vision-language models to the e-commerce domain using roughly 400,000 product image and title pairs from the DoorDash catalog. This stage focuses on learning robust multimodal product representations.
  2. Query-product alignment: aligns user queries with product embeddings using a Query Catalog Contrastive (QCC) loss. This stage leverages a large dataset (700,000 human-annotated pairs expanded to 32 million via GPT-based labeling) to bring relevant query-product pairs closer.
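The contrastive objective in both stages follows the standard CLIP-style pattern: within a training batch, each query's matching product is the positive and every other product is a negative. Below is a sketch of an InfoNCE-style loss over a batch similarity matrix, in the query-to-product direction only. This is a generic contrastive loss, not DoorDash's exact QCC formulation; the `temperature` value is an assumption.

```python
import math


def contrastive_loss(sims: list[list[float]], temperature: float = 0.07) -> float:
    """InfoNCE-style contrastive loss over a batch.

    sims[i][j] is the similarity between query i and product j; the
    diagonal sims[i][i] holds the positive (matching) pairs. The loss
    is the mean negative log-softmax of each positive against all
    products in the batch.
    """
    n = len(sims)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sims[i]]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # -log softmax at the positive
    return total / n
```

Minimizing this loss pushes diagonal (relevant) similarities up and off-diagonal (irrelevant) similarities down, which is the mechanism that pulls matching query-product pairs together in the shared space.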

Integration and Impact on Ranking Systems

Once trained, DashCLIP embeddings are integrated into DoorDash's existing ranking system. Query embeddings are used to retrieve candidate products via K-nearest neighbor (KNN) search. These candidates are then scored by downstream ranking models that consider additional signals like user behavior, context, and product popularity. This architecture enables more semantically relevant retrieval and ranking, improving engagement metrics in online A/B experiments. The embeddings also generalize to other tasks, such as aisle category prediction.
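The retrieval step described above can be sketched as a brute-force K-nearest-neighbor search by cosine similarity. This is an illustrative linear scan, assuming a simple `dict` index of product embeddings; a production system would use an approximate-nearest-neighbor index and feed the candidates into the downstream rankers rather than returning them directly.

```python
import math


def knn_retrieve(
    query_vec: list[float],
    product_index: dict[str, list[float]],
    k: int = 2,
) -> list[str]:
    """Return the ids of the k products most similar to the query embedding.

    product_index maps product_id -> embedding vector. Brute-force scan
    for clarity; real systems use an approximate-NN index at this scale.
    """

    def cos(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / ((na * nb) or 1.0)

    scored = sorted(
        product_index.items(),
        key=lambda item: cos(query_vec, item[1]),
        reverse=True,
    )
    return [pid for pid, _ in scored[:k]]
```

The returned candidate set is what the downstream ranking models would then re-score with behavioral and contextual signals.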

Tags: multimodal AI, semantic search, embedding, contrastive learning, e-commerce, ranking, machine learning architecture, vector database
