Netflix Tech Blog · June 23, 2026

Netflix's AI Video Editing Models: Vera (Layered Diffusion) and VOID (Interaction Deletion)

This article from Netflix details two research explorations in AI-powered video editing: Vera, a layered video diffusion model for content-preserving edits, and VOID, a video inpainting model for physically plausible object and interaction deletion. It highlights architectural challenges in existing generative video models and how Netflix is tackling them through specialized model designs and novel training data generation pipelines.

Read original on Netflix Tech Blog

Netflix is developing advanced AI video editing tools to enhance content creation efficiency for promotional assets. The primary challenge with current generative video models is their tendency to regenerate entire video frames, leading to unintended alterations of original footage and physically implausible results when removing objects. To address these issues, Netflix introduces two research models: Vera and VOID.

Vera: Layered Video Diffusion for Content Preservation

Vera is a layered video diffusion framework designed for content-preserving video editing. Unlike models that regenerate every pixel, Vera generates only the necessary changes as a separate edit layer with an accompanying alpha matte, which are then composited over the original footage. Because the original pixels are reused wherever the matte is zero, regions outside the edit remain unchanged, preserving identities, performances, and other details from the source video.
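The compositing step described above can be illustrated with the standard alpha "over" operation. This is a minimal sketch, not Vera's actual code: it assumes grayscale frames stored as nested lists of floats in [0, 1], and the names Frame and composite are hypothetical.

```python
from typing import List

Frame = List[List[float]]  # grayscale frame: rows of pixel intensities in [0, 1]


def composite(original: Frame, edit: Frame, alpha: Frame) -> Frame:
    """Standard alpha compositing: alpha * edit + (1 - alpha) * original."""
    return [
        [a * e + (1.0 - a) * o for o, e, a in zip(orow, erow, arow)]
        for orow, erow, arow in zip(original, edit, alpha)
    ]


original = [[0.2, 0.4], [0.6, 0.8]]
edit = [[1.0, 1.0], [1.0, 1.0]]   # generated edit layer (illustrative values)
alpha = [[0.0, 1.0], [0.0, 0.5]]  # 0 keeps the source pixel, 1 takes the edit

result = composite(original, edit, alpha)

# Pixels with alpha == 0 are copied from the original without modification.
print(result[0][0] == original[0][0], result[1][0] == original[1][0])
```

Wherever the matte is exactly zero, the output pixel is the source pixel, which is the property that lets a layered approach leave untouched regions intact.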

Architectural Innovations in Vera

MoT (Mixture-of-Transformers) Design: Vera's three target outputs (edit layer, alpha matte, composite layer) follow substantially different distributions, so the model uses a Mixture-of-Transformers (MoT) design with three separate Diffusion Transformers (DiTs), one per output. Each DiT keeps its own QKV projections and FFN weights, which allows specialization, while joint self-attention across all three token streams enables cross-layer interaction.
Specialized Training Data: A key challenge was the lack of public layered video datasets. Netflix built its own dataset (486k frames at 832x480 resolution) organized into synthetic composites, realistic single-object videos, and realistic multi-object videos with effects. This proprietary data provides high-quality supervision for alpha matting, object addition, and background change tasks, improving composition quality and handling complex dynamic scenes.
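The MoT idea of per-output projections combined with joint self-attention can be sketched in a few lines. This is an illustrative toy under stated assumptions, not Netflix's implementation: it uses a single attention head, tiny random weights, and plain Python lists, and it omits the per-stream FFN, normalization, and diffusion timestep conditioning. All names (STREAMS, DIM, joint_self_attention, and so on) are hypothetical.

```python
import math
import random
from typing import Dict, List

Matrix = List[List[float]]
STREAMS = ("edit", "alpha", "composite")  # one token stream per target output
DIM = 4                                   # toy embedding size


def matmul(a: Matrix, b: Matrix) -> Matrix:
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row: List[float]) -> List[float]:
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def init_stream_weights(rng: random.Random) -> Dict[str, Dict[str, Matrix]]:
    """Each stream gets its own Q, K, and V projection matrices."""
    return {
        name: {p: random_matrix(DIM, DIM, rng) for p in ("q", "k", "v")}
        for name in STREAMS
    }


def joint_self_attention(
    tokens: Dict[str, Matrix], weights: Dict[str, Dict[str, Matrix]]
) -> Dict[str, Matrix]:
    """Project each stream separately, then attend over all tokens jointly."""
    q_all: Matrix = []
    k_all: Matrix = []
    v_all: Matrix = []
    spans = {}
    for name in STREAMS:
        start = len(q_all)
        q_all += matmul(tokens[name], weights[name]["q"])
        k_all += matmul(tokens[name], weights[name]["k"])
        v_all += matmul(tokens[name], weights[name]["v"])
        spans[name] = (start, len(q_all))
    scale = 1.0 / math.sqrt(DIM)
    k_t = [list(col) for col in zip(*k_all)]
    attn = [softmax([s * scale for s in row]) for row in matmul(q_all, k_t)]
    mixed = matmul(attn, v_all)
    # Split the jointly attended tokens back into their streams.
    return {name: mixed[a:b] for name, (a, b) in spans.items()}


rng = random.Random(0)
weights = init_stream_weights(rng)
tokens = {name: random_matrix(3, DIM, rng) for name in STREAMS}
out = joint_self_attention(tokens, weights)
print({name: (len(rows), len(rows[0])) for name, rows in out.items()})
```

Because keys and values from all three streams sit in one attention matrix, a change to the alpha tokens also changes the edit stream's output, even though no projection weights are shared between streams.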

VOID: Physically Plausible Video Object and Interaction Deletion

VOID tackles the problem of unnatural physics in object removal by performing physically plausible inpainting. It not only removes an object but also reconstructs the scene as if the object had never been there, correcting for downstream interactions such as collisions or objects that would now fall under gravity. This is crucial for maintaining physical continuity in complex scenes.

Two-Pass Inference Pipeline: VOID uses a VLM-based reasoning pipeline to analyze a scene and identify causally affected regions when an object is removed. This physical reasoning is encoded into a 'quadmask' which guides the diffusion model in a first pass to generate a physically plausible counterfactual video. A second pass, triggered if 'object morphing' is detected, re-runs inference with flow-warped noise to stabilize object shapes along new trajectories.
Simulation-Based Training Data: VOID's training data uses the Kubric simulation engine and the HUMOTO human motion capture dataset to generate synthetic counterfactual video pairs. Each scene is re-simulated with the target object removed, so the alternate outcome follows the simulator's physics rather than a guess, and a corresponding quadmask is generated to guide the model.
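The idea of a counterfactual pair with a matching mask can be shown with a deliberately crude toy that stands in for a physics engine. Nothing here uses Kubric or HUMOTO: the scene is a single column of cells with a ball resting on a shelf, and the ball drops one cell per frame when unsupported. The four mask labels are an assumption chosen for illustration, since the text above does not define the quadmask's values.

```python
from typing import List, Optional

HEIGHT = 6  # cells in the column; index 0 is the ground
KEEP, REMOVE, AFFECTED, OVERLAP = 0, 1, 2, 3  # hypothetical quadmask labels


def simulate(ball_row: int, shelf_row: Optional[int], steps: int) -> List[int]:
    """Return the ball's row per frame; it drops one cell per step unless supported."""
    rows = [ball_row]
    for _ in range(steps - 1):
        below = rows[-1] - 1
        supported = below < 0 or below == shelf_row
        rows.append(rows[-1] if supported else below)
    return rows


def quadmask(factual: List[int], counterfactual: List[int], shelf_row: int) -> List[List[int]]:
    """Label each cell per frame: removed object, causally affected, both, or keep."""
    masks = []
    for fb, cb in zip(factual, counterfactual):
        frame = [KEEP] * HEIGHT
        frame[shelf_row] = REMOVE
        if fb != cb:  # the ball's behaviour changes once the shelf is gone
            for row in (fb, cb):
                frame[row] = OVERLAP if frame[row] in (REMOVE, OVERLAP) else AFFECTED
        masks.append(frame)
    return masks


factual = simulate(ball_row=4, shelf_row=3, steps=6)            # shelf present
counterfactual = simulate(ball_row=4, shelf_row=None, steps=6)  # shelf removed
masks = quadmask(factual, counterfactual, shelf_row=3)

print(factual)         # [4, 4, 4, 4, 4, 4]
print(counterfactual)  # [4, 3, 2, 1, 0, 0]
print(masks[1])        # [0, 0, 0, 3, 2, 0]
```

The factual and counterfactual trajectories form the training pair, and the mask marks both the object to delete and the region whose contents change as a consequence of deleting it.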

System Design Implications

Designing robust AI-powered video editing systems requires not only advanced generative models but also careful attention to data generation pipelines, specialized model architectures (such as MoT for multi-output problems), and mechanisms that ensure content preservation and physical realism. Multi-stage inference, as in VOID's two-pass pipeline, reflects a modular approach in which each stage targets a specific failure mode of the generative model.
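A conditional second pass of this kind can be expressed as a small orchestration function. The sketch below is hypothetical: the callables generate, detect_morphing, and warp_noise are placeholders for the diffusion model, the morphing check, and the flow-based noise warping, and it assumes the warped noise is derived from the first-pass result, which the summary above does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Video = List[str]        # placeholder for real frame tensors
Noise = List[float]      # placeholder for a noise tensor
Quadmask = List[List[int]]


@dataclass
class RemovalResult:
    video: Video
    passes: int


def remove_object(
    video: Video,
    quadmask: Quadmask,
    generate: Callable[[Video, Quadmask, Optional[Noise]], Video],
    detect_morphing: Callable[[Video], bool],
    warp_noise: Callable[[Video], Noise],
) -> RemovalResult:
    """Run one pass; re-run with warped noise only if morphing is detected."""
    first = generate(video, quadmask, None)
    if not detect_morphing(first):
        return RemovalResult(first, passes=1)
    noise = warp_noise(first)  # assumption: flow is estimated from the first pass
    second = generate(video, quadmask, noise)
    return RemovalResult(second, passes=2)


# Stub components so the control flow can be exercised without any model.
def fake_generate(video: Video, quadmask: Quadmask, noise: Optional[Noise]) -> Video:
    tag = "stable" if noise is not None else "morphing"
    return [f"{frame}:{tag}" for frame in video]


def fake_detect(video: Video) -> bool:
    return any(frame.endswith("morphing") for frame in video)


def fake_warp(video: Video) -> Noise:
    return [0.0] * len(video)


result = remove_object(["f0", "f1"], [[0], [0]], fake_generate, fake_detect, fake_warp)
print(result.passes, result.video)
```

Keeping the detector and the noise warping as injected callables means each stage can be tested or replaced independently, and the more expensive second pass only runs when the first pass shows the failure it is meant to fix.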

AI video editing · generative AI · diffusion models · machine learning architecture · content creation · model training · data pipelines · Netflix

Architecture Design

Design this yourself

Design a scalable AI-powered video editing platform for a large media company like Netflix, incorporating advanced generative models like Vera for layered edits and VOID for physically plausible object removal. Detail the architecture for handling diverse editing tasks, managing complex data pipelines for model training (including synthetic data generation), and ensuring high quality and real-time processing for video assets.

Practice Interview

Focus: layered video diffusion model for content-preserving edits and physically plausible video object deletion

Other design angles

· Design the data generation and management pipeline specifically for training next-generation layered video diffusion models like Vera, focusing on scalability and dataset quality.
· Architect a real-time inference system for AI video editing models like VOID, considering the computational demands of multi-pass inference and integration with existing video processing workflows.
· Design a distributed system for managing and orchestrating various AI video editing tasks, including task queuing, resource allocation for GPUs, and monitoring model performance and output quality.