This article explores how Large Language Model (LLM) architectures have evolved to address the memory demands of the Key-Value (KV) cache, a critical component for conversational AI. It details architectural changes like Grouped-Query Attention (GQA), Multi-Head Latent Attention (MLA), and sliding windows, which significantly reduce the per-token memory footprint. The discussion extends to the system-level implications of KV cache management, including eviction strategies, prompt caching, and the architectural gap for medium-term memory that necessitates external systems like RAG and vector databases.
The Key-Value (KV) cache is fundamental to transformer-based LLMs, storing pre-computed key and value vectors for past tokens in a conversation. This mechanism transforms the computational complexity of generating new tokens from quadratic (reprocessing all previous tokens) to linear, drastically improving inference speed. However, this speed optimization comes with a significant cost: the KV cache consumes substantial GPU memory, directly impacting operational expenses and scalability. For instance, early models like GPT-2 required 300 KiB per token, meaning a 4,000-token conversation could consume 1.2 GB of GPU memory just for the cache.
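The per-token figure follows directly from the model's shape: each layer caches one key vector and one value vector per token. A minimal sketch of the arithmetic, assuming GPT-2 XL-scale dimensions (48 layers, hidden size 1,600) and fp16 storage:

```python
def kv_cache_bytes_per_token(n_layers: int, d_model: int, bytes_per_elem: int) -> int:
    # Each layer stores two vectors per token (one key, one value),
    # each of size d_model.
    return 2 * n_layers * d_model * bytes_per_elem

# Assumed GPT-2 XL-scale dims: 48 layers, d_model=1600, fp16 (2 bytes/element)
per_token = kv_cache_bytes_per_token(48, 1600, 2)
print(per_token // 1024, "KiB per token")          # → 300 KiB per token
print(round(per_token * 4000 / 1e9, 2), "GB for a 4,000-token conversation")
```

The 4,000-token total works out to roughly 1.2 GB, matching the figure above; the exact numbers depend on the model's actual dimensions and cache dtype.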
Over six years, LLM architectures have introduced several innovations to reduce KV cache size without compromising model quality:

- Grouped-Query Attention (GQA): several query heads share a single key/value head, shrinking the cache by the ratio of query heads to KV heads.
- Multi-Head Latent Attention (MLA): keys and values are compressed into a low-rank latent vector, so only the compact latent representation is cached.
- Sliding-window attention: each token attends only to a fixed window of recent tokens, capping cache growth regardless of conversation length.
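The GQA savings are easy to quantify: cache size scales with the number of KV heads, not query heads. A sketch using hypothetical 70B-class dimensions (80 layers, head dimension 128, fp16):

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    # Keys + values: two tensors of shape (n_kv_heads, head_dim) per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical 70B-class model: 80 layers, head_dim=128
mha = kv_bytes_per_token(80, 64, 128)  # full multi-head: 64 KV heads
gqa = kv_bytes_per_token(80, 8, 128)   # GQA: 8 KV heads shared by 64 query heads
print(mha // gqa)  # → 8: GQA shrinks the cache 8x here
```

The quality cost of sharing KV heads is small in practice, which is why GQA became the default in most recent open-weight models.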
Beyond internal architectural improvements, the lifecycle and limitations of the KV cache necessitate external system design considerations. KV caches are volatile, often evicted from GPU memory after short periods (e.g., 5-10 minutes). This leads to noticeable delays as conversations are "cold started" by rebuilding the cache, a cost reflected in API pricing (e.g., OpenAI and Anthropic offer discounts for cached prompts).
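The eviction behavior can be sketched as a time-to-live (TTL) policy. This is an illustrative toy, not any provider's actual implementation; real serving stacks tie eviction to GPU memory pressure with LRU or priority schemes:

```python
import time

class KVCacheStore:
    """Toy TTL-based KV cache store (illustrative only)."""

    def __init__(self, ttl_seconds: float = 300.0):  # ~5-minute window
        self.ttl = ttl_seconds
        self._entries = {}  # conversation_id -> (kv_cache, last_used)

    def get(self, conv_id, now=None):
        now = time.monotonic() if now is None else now
        entry = self._entries.get(conv_id)
        if entry is None or now - entry[1] > self.ttl:
            # Cache miss or expired: caller must rebuild the KV cache
            # from the full prompt (the "cold start" cost).
            self._entries.pop(conv_id, None)
            return None
        self._entries[conv_id] = (entry[0], now)  # refresh TTL on use
        return entry[0]

    def put(self, conv_id, kv_cache, now=None):
        now = time.monotonic() if now is None else now
        self._entries[conv_id] = (kv_cache, now)
```

Provider prompt-caching discounts map directly onto this model: a request that lands within the TTL reuses the stored cache and is billed less; one that arrives after expiry pays full price to recompute it.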
The Memory Void
The gap between the volatile KV cache (working memory) and the model's permanent trained weights (long-term knowledge) represents a significant architectural void for medium-term memory. This absence forces engineers to build heuristic scaffolding using external systems like Retrieval-Augmented Generation (RAG), vector databases for similarity search, and simple file systems for conversation logs to persist context across sessions.
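The scaffolding pattern is simple at its core: embed past conversation turns, then pull the most similar ones back into the prompt. A minimal sketch, using a toy bag-of-words "embedding" as a stand-in for a real embedding model and vector database:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: word counts. Real systems use a learned embedding
    # model and store vectors in a vector database.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, log: list[str], k: int = 1) -> list[str]:
    # Rank stored conversation turns by similarity to the query,
    # return the top-k to splice back into the model's context.
    q = embed(query)
    ranked = sorted(log, key=lambda turn: cosine(q, embed(turn)), reverse=True)
    return ranked[:k]

log = [
    "we agreed the deploy target is Kubernetes",
    "my favorite color is green",
    "the bug was in the retry logic",
]
print(retrieve("which platform do we deploy to", log))
# → ['we agreed the deploy target is Kubernetes']
```

Because retrieval happens outside the model, the retrieved text still has to be re-tokenized and re-cached on every cold start; it restores content, not the KV cache itself, which is exactly the architectural gap described above.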