Meta Engineering·June 25, 2026

Designing Privacy-Aware Infrastructure with Hybrid AI-Human Asset Classification

This article from Meta Engineering details the architectural approach to building privacy-aware infrastructure, focusing on a hybrid AI-human system for data asset classification. It addresses the challenges of classifying diverse data types in AI-native environments to enforce privacy policies reliably. The core design principles involve rich context building, decoupling evaluation from optimization, and distilling stable AI inferences into deterministic rules for scalable and auditable enforcement.

Read original on Meta Engineering

The Challenge of Privacy-Aware Data Classification

In the AI-native era, privacy controls are paramount but face significant challenges. Systems need a precise understanding of data to enforce policies like retention, access, or anonymization. The complexity arises from data ambiguity (e.g., 'age' can mean a person's age or a cache TTL) and the sheer volume and velocity of new data modalities, derived features, and evolving policy interpretations introduced by AI products. Manual review cannot keep pace, necessitating an automated yet reliable approach.

Core Operational Concerns for Privacy-Aware Infrastructure (PAI)

  • Understand: Classifying data assets accurately and consistently.
  • Discover: Identifying relevant data flows for specific policy questions.
  • Enforce: Applying retention, access, purpose, and sharing constraints.
  • Demonstrate: Providing verifiable evidence for compliance.

Asset Classification as the Foundation

Asset classification is the load-bearing base of PAI. If data is misclassified, all downstream processes (discovery, enforcement, compliance) inherit those errors. Assets can range from database columns to nested payload fields, ML features, or embeddings, requiring classification to follow the data's meaning, not just its structure.

Meta's Hybrid AI-Human Classification Pattern

Meta employs a hybrid approach combining AI with human oversight for asset classification at scale. This system is designed to learn from ambiguous signals while ensuring production enforcement relies on low-latency, replayable, and auditable logic. The goal is not "LLMs everywhere," but strategic use of LLMs to interpret novelty and distill findings into deterministic rules.

Key Principles of the Design

  • Context beats prompts: Instead of optimizing prompts for raw, noisy data, focus on building rich "evidence briefs" with supporting/contradicting signals and provenance. This significantly improves model accuracy.
  • Decouple evaluation from optimization: LLM recommendations are not ground truth. An independent evaluation loop with human-reviewed labels and regression gates prevents the system from measuring drift instead of true progress.
  • Distill stable behavior into deterministic rules: LLMs handle ambiguity and cold start scenarios, but stable, validated patterns are converted into versioned, auditable rules. This progressively shrinks the LLM's role in routine production enforcement, making it deterministic and efficient.
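The second and third principles can be sketched together. The snippet below is a minimal illustration, not Meta's implementation: the pattern keys (such as `suffix:_email`), the labels, and the thresholds are all hypothetical. It first finds patterns on which the LLM's labels were stable, then promotes only those that independently collected human-reviewed labels confirm.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def distill_candidates(
    llm_decisions: List[Tuple[str, str]],
    min_support: int = 3,
    min_agreement: float = 1.0,
) -> Dict[str, str]:
    """Return pattern -> label for patterns where the LLM answered consistently.

    llm_decisions holds (pattern_key, label) pairs. Both thresholds are
    illustrative assumptions, not values from the article.
    """
    by_pattern: Dict[str, Counter] = defaultdict(Counter)
    for pattern, label in llm_decisions:
        by_pattern[pattern][label] += 1

    candidates: Dict[str, str] = {}
    for pattern, counts in by_pattern.items():
        label, top = counts.most_common(1)[0]
        total = sum(counts.values())
        if total >= min_support and top / total >= min_agreement:
            candidates[pattern] = label
    return candidates


def promote(candidates: Dict[str, str], human_labels: Dict[str, str]) -> Dict[str, str]:
    """Regression gate: keep a candidate only if a human-reviewed label agrees.

    Candidates with no human label, or with a conflicting one, are held back.
    """
    return {p: l for p, l in candidates.items() if human_labels.get(p) == l}


if __name__ == "__main__":
    decisions = (
        [("suffix:_email", "EMAIL")] * 3
        + [("name:age", "AGE"), ("name:age", "NON_PERSONAL"), ("name:age", "AGE")]
        + [("suffix:_ts", "TIMESTAMP")] * 3
    )
    human = {"suffix:_email": "EMAIL", "suffix:_ts": "NON_PERSONAL"}
    print(promote(distill_candidates(decisions), human))
```

In this sketch the ambiguous `name:age` pattern never becomes a candidate because the LLM disagreed with itself, and `suffix:_ts` is held back because the human label contradicts it. Keeping the gate separate from the distillation step is what stops the LLM from grading its own output.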

The Two-Lane Operating Pattern

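The system serves live requests through two lanes (steps 1 and 2), backed by an offline loop (steps 3 and 4) that feeds new rules back into the deterministic lane: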

  1. Deterministic Path: Most requests (around 85%) are resolved by deterministic rules in single-digit milliseconds, after context assembly.
  2. LLM Fallback Path: Approximately 15% of requests, typically novel or ambiguous cases, fall back to the LLM, which is slower (seconds) and budgeted separately.
  3. Offline Learning Loop: A nightly process samples served decisions, adjudicates them against human-reviewed truth, and re-evaluates. Stable patterns identified here are distilled into new rules.
  4. Rule Promotion: Distilled rules are promoted back into the live decision funnel, ensuring a continuous learning and refinement cycle.
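The live part of this funnel can be sketched as a rule lookup with a fallback. This is an illustrative sketch only: the `Rule` and `Decision` types, the rule IDs, the labels, and the `stub_llm` function are hypothetical stand-ins, and a real fallback would call a model with the assembled context.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Rule:
    rule_id: str
    version: int  # rules are versioned so decisions can be replayed and audited
    predicate: Callable[[Dict[str, str]], bool]
    label: str


@dataclass(frozen=True)
class Decision:
    label: str
    lane: str        # "deterministic" or "llm_fallback"
    decided_by: str  # rule id and version, or a marker for the model


def classify(
    context: Dict[str, str],
    rules: List[Rule],
    llm_fallback: Callable[[Dict[str, str]], str],
) -> Decision:
    """Try deterministic rules first; fall back to the LLM only when none match."""
    for rule in rules:
        if rule.predicate(context):
            return Decision(rule.label, "deterministic", f"{rule.rule_id}@v{rule.version}")
    return Decision(llm_fallback(context), "llm_fallback", "llm")


def stub_llm(context: Dict[str, str]) -> str:
    # Placeholder for a real (slow, separately budgeted) model call.
    return "NEEDS_REVIEW"


# Hypothetical rules; in the described system these would come from rule promotion.
RULES = [
    Rule("email-annotation", 1, lambda c: c.get("semantic_annotation") == "EMAIL", "EMAIL"),
    Rule("cache-ttl-age", 1,
         lambda c: c.get("name") == "age" and c.get("unit") == "seconds", "NON_PERSONAL"),
]

if __name__ == "__main__":
    print(classify({"semantic_annotation": "EMAIL"}, RULES, stub_llm))
    print(classify({"name": "opaque_blob"}, RULES, stub_llm))
```

Recording the lane and the rule version on every decision is one way to make served decisions replayable, which the nightly loop needs when it samples and re-adjudicates them.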

Example: Evidence Brief for 'user_payload.email_address'

An evidence brief for a field like `user_payload.email_address` might include:

  • Supporting signal: Lineage to a user-facing logging pipeline (high weight).
  • Supporting signal: Semantic annotation indicating EMAIL-like data (high weight).
  • Contradicting signal: Ownership metadata pointing to an infrastructure team, not user-facing product (low weight).
  • Suppressed signal: Existing privacy label (removed to prevent circular reasoning).
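One possible way to represent such a brief as a data structure is sketched below. The class names, field names, and numeric weights are assumptions made for illustration (the source only says "high" and "low" weight); the point is that suppressed signals stay recorded for audit but never reach the text shown to the model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Signal:
    kind: str      # "supporting", "contradicting", or "suppressed"
    source: str    # provenance of the signal
    detail: str
    weight: float  # hypothetical numeric stand-in for "high" / "low"


@dataclass
class EvidenceBrief:
    asset: str
    signals: List[Signal] = field(default_factory=list)

    def visible_signals(self) -> List[Signal]:
        """Signals the model may see; suppressed ones are filtered out."""
        return [s for s in self.signals if s.kind != "suppressed"]

    def net_support(self) -> float:
        """Supporting weight minus contradicting weight, ignoring suppressed signals."""
        total = 0.0
        for s in self.visible_signals():
            total += s.weight if s.kind == "supporting" else -s.weight
        return total

    def render(self) -> str:
        """Plain-text brief that could be handed to a model as context."""
        lines = [f"Asset: {self.asset}"]
        for s in self.visible_signals():
            lines.append(f"- [{s.kind}] {s.detail} (source: {s.source}, weight: {s.weight})")
        return "\n".join(lines)


def build_example_brief() -> EvidenceBrief:
    return EvidenceBrief(
        asset="user_payload.email_address",
        signals=[
            Signal("supporting", "lineage", "Flows from a user-facing logging pipeline", 0.9),
            Signal("supporting", "semantic_annotation", "Annotated as EMAIL-like data", 0.9),
            Signal("contradicting", "ownership", "Owned by an infrastructure team", 0.2),
            Signal("suppressed", "existing_label", "Existing privacy label", 0.0),
        ],
    )


if __name__ == "__main__":
    print(build_example_brief().render())
```

Filtering the existing privacy label out of `render()` mirrors the suppression described above: if the model could see the current label, it could simply echo it back, and the classification would confirm itself.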

privacy-aware infrastructure, data classification, LLMs, hybrid AI, deterministic rules, data governance, system design, data lineage
