This article from Netflix details two research explorations in AI-powered video editing: Vera, a layered video diffusion model for content-preserving edits, and VOID, a video inpainting model for physically plausible object and interaction deletion. It highlights architectural challenges in existing generative video models and how Netflix is tackling them through specialized model designs and novel training data generation pipelines.
Netflix is developing advanced AI video editing tools to enhance content creation efficiency for promotional assets. The primary challenge with current generative video models is their tendency to regenerate entire video frames, leading to unintended alterations of original footage and physically implausible results when removing objects. To address these issues, Netflix introduces two research models: Vera and VOID.
Vera is a novel layered video diffusion framework designed for content-preserving video editing. Unlike models that regenerate all pixels, Vera generates only the necessary changes as a separate edit layer with an accompanying alpha matte, which are then composited with the original footage. This approach ensures that pixels outside the edited regions remain untouched, preserving identities, performances, and other details from the source video.
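The layered composition described above can be illustrated with a small sketch. It assumes the standard "over" alpha blend (result = alpha * edit + (1 - alpha) * original); the article only says the layers are composed with the original footage, so the exact operation Vera uses is an assumption here. The function name `composite_edit_layer`, the array shapes, and the toy data are hypothetical and chosen for illustration.

```python
import numpy as np


def composite_edit_layer(original, edit_layer, alpha):
    """Blend an edit layer over the original frames using an alpha matte.

    original, edit_layer: float arrays of shape (T, H, W, 3) with values in [0, 1].
    alpha: float array of shape (T, H, W, 1) in [0, 1]; 0 keeps the source pixel.
    Assumes a standard "over" blend, not necessarily Vera's exact operation.
    """
    alpha = np.clip(alpha, 0.0, 1.0)
    return alpha * edit_layer + (1.0 - alpha) * original


# Toy data: 2 frames of 4x4 RGB video (hypothetical stand-ins for real footage).
rng = np.random.default_rng(0)
original = rng.random((2, 4, 4, 3))
edit_layer = np.ones((2, 4, 4, 3))   # a plain white "edit"
alpha = np.zeros((2, 4, 4, 1))
alpha[:, 1:3, 1:3, :] = 1.0          # edit only the central 2x2 region

result = composite_edit_layer(original, edit_layer, alpha)

# Pixels where the matte is zero are bit-for-bit identical to the source.
outside = (alpha == 0.0).repeat(3, axis=-1)
unchanged = np.array_equal(result[outside], original[outside])
```

Because the blend multiplies the original by exactly 1 wherever the matte is 0, content preservation outside the edited region holds by construction instead of depending on the generative model reproducing those pixels.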
VOID tackles the problem of unnatural physics in object removal by performing physically plausible inpainting. It not only removes an object but also reconstructs the scene as if the object had never been there, correcting for the object's interactions with the rest of the scene, such as collisions or gravity. This is crucial for maintaining physical continuity in complex scenes.
System Design Implications
Designing robust AI-powered video editing systems requires more than advanced generative models. It also calls for careful design of data generation pipelines, specialized model architectures (like MoT for multi-output problems), and mechanisms that ensure content preservation and physical realism. Multi-stage inference, such as VOID's two-pass pipeline, reflects a modular approach to addressing specific failure modes in generative AI.