This article from Netflix details an agentic workflow for Observational Causal Inference (OCI) that leverages software agents (Actor, Critic) to automate repetitive tasks while augmenting human data scientists (Principals). The system emphasizes rigorous diagnostics like covariate balance and overlap checks, and provides mechanisms for human inspection of agent-generated artifacts. This architecture aims to improve the credibility and efficiency of causal analyses at scale.
Read original on Netflix Tech Blog
Netflix has developed an agentic workflow to automate and enhance Observational Causal Inference (OCI) tasks. This system is designed to reduce the toil associated with OCI, such as checking covariate balance and conducting sensitivity analyses, allowing human practitioners to focus on higher-level tasks like framing questions and scrutinizing assumptions. The workflow integrates AI agents within Netflix's existing OCI toolkit, emphasizing transparency and human oversight.
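The workflow operates on a "target trial emulation" philosophy, comparing each observational analysis to the ideal A/B test it tries to approximate. It defines three key personas to manage the OCI process: Principals, the human data scientists who frame the question and scrutinize assumptions; the Actor, a software agent that carries out the analysis steps; and the Critic, a software agent that checks the results against the design diagnostics.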
Design Diagnostics for Credibility
The system embeds critical design diagnostics to ensure fair comparisons and robust conclusions. These include checking covariate balance (standardized mean difference < 0.2), overlap of propensity scores (between 0.1 and 0.9), placebo outcome tests (no significant effect on pre-treatment variables), and sensitivity analyses to hidden confounders. Failures in these diagnostics trigger remediation playbooks for the agent.
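The balance and overlap checks above can be sketched in a few lines of Python. This is a minimal illustration, not Netflix's implementation: the function names, the pooled-standard-deviation form of the standardized mean difference, and the use of its absolute value against the 0.2 threshold are assumptions; only the thresholds themselves come from the text.

```python
import numpy as np

def standardized_mean_difference(x, treated):
    """SMD for one covariate: difference in group means over the pooled SD.

    Assumption: pooled SD is sqrt((var_treated + var_control) / 2).
    """
    x = np.asarray(x, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    x_t, x_c = x[treated], x[~treated]
    diff = x_t.mean() - x_c.mean()
    pooled_sd = np.sqrt((x_t.var(ddof=1) + x_c.var(ddof=1)) / 2.0)
    if pooled_sd == 0.0:
        # Constant covariate in both groups: balanced only if the means agree.
        return 0.0 if diff == 0.0 else float(np.copysign(np.inf, diff))
    return float(diff / pooled_sd)

def balance_ok(covariates, treated, threshold=0.2):
    """covariates: dict of name -> 1-D array. Returns (passed, smd_by_name)."""
    smds = {name: standardized_mean_difference(col, treated)
            for name, col in covariates.items()}
    passed = all(abs(v) < threshold for v in smds.values())
    return passed, smds

def overlap_fraction(propensity, low=0.1, high=0.9):
    """Share of units whose propensity score lies inside [low, high]."""
    p = np.asarray(propensity, dtype=float)
    return float(np.mean((p >= low) & (p <= high)))
```

A Critic-style check could call `balance_ok` on the adjusted sample and flag any covariate whose absolute SMD reaches the threshold, and could report `overlap_fraction` to show how much of the population falls inside the region of common support.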
To augment human evaluation in the absence of ground truth for OCI, agents publish artifacts such as plans, specifications, plots, and notebooks. These are version-controlled and uploaded to a file store, allowing principals to inspect and re-execute any step. This transparent approach, coupled with human oversight, forms a system of "process audits" that builds trust in the agent's analyses. The system also orchestrates follow-up analyses, such as sensitivity analyses with varied parameters or time-series generation across multiple data partitions, reducing manual effort and ensuring consistency.
A key challenge highlighted is early adopter bias, where the initial users of a new feature differ systematically from the rest of the population. The system addresses this by having the Critic agent detect issues like poor overlap. For example, to overcome poor overlap, the Actor agent can apply Crump-style trimming, which excludes units with extreme propensity scores (e.g., outside [0.1, 0.9]). This narrows the population for which the average treatment effect (ATE) is estimated, but significantly increases the credibility of the result by focusing on members for whom either treatment assignment is plausible.
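The trimming step can be illustrated with a short Python sketch. It assumes propensity scores have already been estimated, uses the fixed [0.1, 0.9] bounds quoted above, and re-estimates the ATE with a normalized inverse-propensity-weighted estimator. That estimator is chosen here purely for illustration; the source does not say which estimator the Actor uses, and all function names are hypothetical.

```python
import numpy as np

def trim_by_propensity(propensity, low=0.1, high=0.9):
    """Boolean mask keeping units with propensity scores inside [low, high]."""
    p = np.asarray(propensity, dtype=float)
    return (p >= low) & (p <= high)

def ipw_ate(outcome, treated, propensity):
    """Normalized (Hajek) inverse-propensity-weighted ATE estimate.

    Illustrative estimator only; not taken from the source article.
    """
    y = np.asarray(outcome, dtype=float)
    t = np.asarray(treated, dtype=float)
    p = np.asarray(propensity, dtype=float)
    w_t = t / p                  # weights for treated units
    w_c = (1.0 - t) / (1.0 - p)  # weights for control units
    if w_t.sum() == 0.0 or w_c.sum() == 0.0:
        raise ValueError("need at least one treated and one control unit")
    return float(np.sum(w_t * y) / w_t.sum() - np.sum(w_c * y) / w_c.sum())

def trimmed_ate(outcome, treated, propensity, low=0.1, high=0.9):
    """Trim on the propensity score, then estimate the ATE on the kept units.

    Returns (ate_estimate, fraction_of_units_retained).
    """
    y = np.asarray(outcome, dtype=float)
    t = np.asarray(treated, dtype=float)
    p = np.asarray(propensity, dtype=float)
    keep = trim_by_propensity(p, low, high)
    return ipw_ate(y[keep], t[keep], p[keep]), float(keep.mean())
```

Reporting the retained fraction alongside the estimate makes the trade-off explicit: the estimate applies only to the trimmed population, so a principal reviewing the artifact can see how much of the original sample it covers.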