This article details an optimization technique called the Asynchronous Frame Generation Pipeline for generative AI video inference on Amazon EC2 G7e instances. It addresses the GPU stalls caused by synchronous data transfer and host-side processing during VAE decoding. Overlapping compute, data transfer, and post-processing improves GPU utilization and reduces latency.
Read original on AWS Architecture Blog

Synthesia, an enterprise AI video platform, uses Amazon EC2 G7e instances for generative AI video inference, specifically for models such as latent diffusion models that use variational autoencoders (VAEs). A common challenge in such GPU-memory-intensive workloads is poor GPU utilization caused by synchronous operations: GPU compute stalls while it waits for decoded video frames to be transferred to the host and processed there.
Latent diffusion models perform their diffusion process in a compressed latent space. The final step involves decoding this latent video into human-readable pixel frames using the VAE decoder. To manage resource intensity, videos are typically split into chunks (e.g., 4 consecutive pixel frames per chunk). In a traditional "Sequential Frame Generation Pipeline," after a chunk is processed on the GPU, its pixel frames are synchronously transferred to host (CPU) memory and written to storage. This synchronous copy and write operation prevents the GPU from immediately starting on the next chunk, leading to significant GPU stalls and reduced overall utilization.
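The cost of the sequential pipeline can be made concrete with a toy timing model. The sketch below assumes each chunk goes through three back-to-back steps (decode, copy, write). The function names and per-chunk timings are hypothetical and chosen only for illustration; they are not measurements from the original post.

```python
def sequential_gpu_utilization(decode_ms: float, copy_ms: float, write_ms: float) -> float:
    """Fraction of wall-clock time the GPU spends decoding when steps never overlap."""
    per_chunk_ms = decode_ms + copy_ms + write_ms
    return decode_ms / per_chunk_ms


def sequential_total_ms(num_chunks: int, decode_ms: float, copy_ms: float, write_ms: float) -> float:
    """Total time for all chunks when each chunk waits for the previous copy and write."""
    return num_chunks * (decode_ms + copy_ms + write_ms)


# Hypothetical per-chunk timings in milliseconds (illustrative only).
utilization = sequential_gpu_utilization(decode_ms=40.0, copy_ms=6.0, write_ms=4.0)
total_ms = sequential_total_ms(num_chunks=10, decode_ms=40.0, copy_ms=6.0, write_ms=4.0)
print(f"GPU busy fraction: {utilization:.0%}, total time: {total_ms:.0f} ms")
```

In this model every millisecond spent copying or writing is a millisecond in which the GPU does no decoding, so the only ways to raise utilization are to shrink those steps or to move them off the critical path.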
The proposed "Asynchronous Frame Generation Pipeline" aims to overcome this bottleneck by overlapping GPU computation, device-to-host (D2H) data transfer, and host-side post-processing. This is achieved through several key architectural components.
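One way such overlap could be arranged in PyTorch is sketched below. It is a minimal illustration, not the implementation from the original post: `decode_video_async`, `decoder`, and `postprocess` are hypothetical names, the decoder is a stand-in for a real VAE decoder, and the code falls back to plain sequential execution when no CUDA device is available. It uses a dedicated copy stream, pinned host buffers, and events so that the host only waits for the chunk it is about to post-process.

```python
import torch


def decode_video_async(latent_chunks, decoder, postprocess):
    """Decode chunks while copying and post-processing earlier chunks.

    `decoder` and `postprocess` are hypothetical callables standing in for the
    VAE decoder and the host-side step (for example, encoding or writing frames).
    """
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    copy_stream = torch.cuda.Stream() if use_cuda else None
    pending = []  # (host_frames, copy_done_event) for chunks still in flight
    outputs = []

    def finish(host_frames, copy_done):
        if copy_done is not None:
            copy_done.synchronize()  # wait only until this chunk's copy has landed
        return postprocess(host_frames)

    with torch.no_grad():
        for latent in latent_chunks:
            # On CUDA, kernel launches are asynchronous with respect to the host.
            frames = decoder(latent.to(device))
            if use_cuda:
                decoded = torch.cuda.Event()
                decoded.record()  # marks the end of decoding on the compute stream
                host_frames = torch.empty(
                    frames.shape, dtype=frames.dtype, device="cpu", pin_memory=True
                )
                with torch.cuda.stream(copy_stream):
                    copy_stream.wait_event(decoded)  # copy starts after decode ends
                    host_frames.copy_(frames, non_blocking=True)  # D2H into pinned memory
                    frames.record_stream(copy_stream)  # keep GPU memory alive for the copy
                    copy_done = torch.cuda.Event()
                    copy_done.record()  # recorded on copy_stream
                pending.append((host_frames, copy_done))
            else:
                pending.append((frames, None))  # CPU fallback: nothing to overlap
            # Post-process the previous chunk while the GPU works on the current one.
            if len(pending) > 1:
                outputs.append(finish(*pending.pop(0)))
        while pending:
            outputs.append(finish(*pending.pop(0)))
    return outputs


# Toy usage: three chunks of 4 frames each, with a trivial stand-in decoder.
chunks = [torch.full((4, 3, 8, 8), float(i)) for i in range(3)]
means = decode_video_async(
    chunks, decoder=lambda z: z * 2.0, postprocess=lambda f: f.mean().item()
)
print(means)
```

In this sketch the host still launches the next decode only after it finishes post-processing the previous chunk, so a slow `postprocess` could again starve the GPU. Moving post-processing to a worker thread fed by a queue would be one way to decouple the two further.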
Performance Impact
Adopting this asynchronous technique raised GPU kernel utilization from 82% to 99.9%, which reduced latency by 8.2% and increased throughput correspondingly in video decoding benchmarks on Amazon EC2 G7e instances.
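The size of the throughput gain can be estimated from the latency figure. The snippet below assumes throughput is simply the inverse of latency (one video decoded at a time, with a fixed amount of work per video). The helper name is hypothetical, and the resulting number is a derived estimate, not a figure reported in the original post.

```python
def throughput_gain(latency_reduction: float) -> float:
    """Relative throughput increase when latency drops by `latency_reduction`.

    Assumes throughput = 1 / latency, so new/old throughput = 1 / (1 - reduction).
    """
    return 1.0 / (1.0 - latency_reduction) - 1.0


gain = throughput_gain(0.082)  # 8.2% latency decrease, as stated above
print(f"Estimated throughput increase: {gain:.1%}")
```

Under that assumption, an 8.2% latency reduction corresponds to a throughput increase of roughly 8.9%.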
The implementation leverages PyTorch and the `torch.cuda.Stream` and `torch.cuda.Event` APIs to manage asynchronous execution and synchronization. The figure below illustrates the interaction between these components.