This article from Meta Engineering details the architectural approach to building privacy-aware infrastructure, focusing on a hybrid AI-human system for data asset classification. It addresses the challenges of classifying diverse data types in AI-native environments to enforce privacy policies reliably. The core design principles involve rich context building, decoupling evaluation from optimization, and distilling stable AI inferences into deterministic rules for scalable and auditable enforcement.
In the AI-native era, privacy controls are paramount but face significant challenges. Systems need a precise understanding of data to enforce policies like retention, access, or anonymization. The complexity arises from data ambiguity (e.g., 'age' can mean a person's age or a cache TTL) and the sheer volume and velocity of new data modalities, derived features, and evolving policy interpretations introduced by AI products. Manual review cannot keep pace, necessitating an automated yet reliable approach.
Asset Classification as the Foundation
Asset classification is the load-bearing base of privacy-aware infrastructure (PAI). If data is misclassified, all downstream processes (discovery, enforcement, compliance) inherit those errors. Assets can range from database columns to nested payload fields, ML features, or embeddings, so classification must follow the data's meaning, not just its structure.
Meta employs a hybrid approach combining AI with human oversight for asset classification at scale. This system is designed to learn from ambiguous signals while ensuring production enforcement relies on low-latency, replayable, and auditable logic. The goal is not "LLMs everywhere," but strategic use of LLMs to interpret novelty and distill findings into deterministic rules.
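The idea of distilling stable AI inferences into deterministic rules can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Rule` type, the `distill_rules` and `classify` functions, the `suffix:` feature keys, the label names, and the support and agreement thresholds are hypothetical and are not taken from Meta's system. The sketch promotes a feature to a rule only when repeated inferences agree, and otherwise leaves the asset without a rule so that it can be escalated.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    feature: str  # hypothetical normalized signal, e.g. a field-name suffix
    label: str    # classification the rule emits


def distill_rules(inferences, min_support=3, min_agreement=0.95):
    """Promote a feature to a rule only when its inferred labels are stable.

    `inferences` is an iterable of (feature, label) pairs, standing in for
    the output of an AI-assisted classification pass (assumed shape).
    """
    by_feature = defaultdict(Counter)
    for feature, label in inferences:
        by_feature[feature][label] += 1

    rules = {}
    for feature, counts in by_feature.items():
        label, hits = counts.most_common(1)[0]
        total = sum(counts.values())
        # Require enough observations and near-unanimous agreement.
        if total >= min_support and hits / total >= min_agreement:
            rules[feature] = Rule(feature, label)
    return rules


def classify(feature, rules):
    """Deterministic lookup; None means no stable rule exists yet."""
    rule = rules.get(feature)
    return rule.label if rule else None


# Illustrative inferences: 'email_address' is stable, 'age' is ambiguous.
inferences = [
    ("suffix:email_address", "EMAIL"),
    ("suffix:email_address", "EMAIL"),
    ("suffix:email_address", "EMAIL"),
    ("suffix:age", "PERSON_AGE"),
    ("suffix:age", "CACHE_TTL"),
    ("suffix:age", "PERSON_AGE"),
]
rules = distill_rules(inferences)
email_label = classify("suffix:email_address", rules)
age_label = classify("suffix:age", rules)
```

In this sketch the ambiguous `age` feature never becomes a rule because its inferred labels disagree, which mirrors the 'age' ambiguity described earlier. The rule table itself is a plain lookup, so the same input always yields the same result and each decision can be replayed.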
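The system operates with a dual-lane architecture: one lane uses LLMs to interpret novel or ambiguous assets, and the other enforces policy in production through the deterministic rules distilled from those findings.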
Example: Evidence Brief for 'user_payload.email_address'
An evidence brief for a field like `user_payload.email_address` might include:
- Supporting signal: Lineage to a user-facing logging pipeline (high weight).
- Supporting signal: Semantic annotation indicating EMAIL-like data (high weight).
- Contradicting signal: Ownership metadata pointing to an infrastructure team, not user-facing product (low weight).
- Suppressed signal: Existing privacy label (removed to prevent circular reasoning).
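One way to picture such a brief as a data structure is sketched below. The `Signal` and `Stance` types, the numeric values chosen for "high" and "low" weight, and the `net_evidence` scoring function are all hypothetical. The source only describes weighted supporting, contradicting, and suppressed signals; it does not say that they are combined into a single score, so the weighted sum here is purely an illustrative assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Stance(Enum):
    SUPPORTING = "supporting"
    CONTRADICTING = "contradicting"
    SUPPRESSED = "suppressed"  # withheld from reasoning entirely


@dataclass(frozen=True)
class Signal:
    description: str
    stance: Stance
    weight: float = 0.0


# Hypothetical numeric stand-ins for "high" and "low" weight.
HIGH, LOW = 1.0, 0.25


def net_evidence(signals):
    """Sum supporting weights minus contradicting weights (assumed scheme).

    Suppressed signals are skipped, so an existing privacy label cannot
    feed back into its own classification.
    """
    score = 0.0
    for s in signals:
        if s.stance is Stance.SUPPORTING:
            score += s.weight
        elif s.stance is Stance.CONTRADICTING:
            score -= s.weight
    return score


brief = [
    Signal("Lineage to a user-facing logging pipeline", Stance.SUPPORTING, HIGH),
    Signal("Semantic annotation indicating EMAIL-like data", Stance.SUPPORTING, HIGH),
    Signal("Ownership metadata points to an infrastructure team", Stance.CONTRADICTING, LOW),
    Signal("Existing privacy label", Stance.SUPPRESSED),
]
score = net_evidence(brief)  # 1.0 + 1.0 - 0.25 = 1.75
visible = [s for s in brief if s.stance is not Stance.SUPPRESSED]
```

Keeping the suppressed signal in the brief while excluding it from scoring preserves a record of what was withheld and why, which may help when a classification is later audited.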