ByteByteGo · June 16, 2026

Architectural Evolution of Open-Weight Large Language Models

This article examines how open-weight models have reshaped the AI landscape by enabling collaboration and shared innovation. It covers the architectural choices behind the current generation of LLMs, in particular the Mixture-of-Experts (MoE) transformer, along with the attention strategies and training approaches that distinguish them. Understanding these decisions is crucial for designing and deploying scalable AI systems.

Open-Weight vs. Closed-Weight Models

The distinction between open-weight and closed-weight models is fundamental in the AI landscape. A closed-weight model is typically accessed via an API, with the model's trained parameters remaining proprietary on the company's servers. Users cannot run the model on their own hardware or modify its weights directly. In contrast, an open-weight model makes its trained parameters publicly available, allowing anyone to download, run, and adapt the model. While the full training data and code usually remain private, the published weights and detailed technical reports enable collaborative innovation, similar to open-source software but focused on the model artifact rather than the full source code.

The Mixture-of-Experts (MoE) Architecture

The Mixture-of-Experts (MoE) transformer has become the de facto architectural skeleton for many frontier open-weight LLMs. Traditional (dense) models activate all of their parameters for every token they process, so compute cost grows in step with model size. MoE addresses this by replacing the single feed-forward layer in a transformer block with multiple 'expert' sub-networks plus a router that selects which experts handle each token. This design lets the model store a large amount of knowledge (total parameters) while activating only a small subset of experts per token (active parameters), which significantly reduces inference compute for large models. This is a critical architectural trade-off for scaling LLMs.
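The routing idea described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the helper names (`make_expert`, `moe_forward`), the tiny dimensions, and the choice to renormalise gate weights with a softmax over only the selected experts are all assumptions made for the example.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def make_expert(rng, d_model, d_hidden):
    """Hypothetical expert: a tiny two-layer ReLU feed-forward network."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]

    def expert(x):
        hidden = [max(0.0, h) for h in matvec(w_in, x)]
        return matvec(w_out, hidden)

    return expert


def moe_forward(x, router_weights, experts, top_k):
    """Route one token vector through its top_k experts.

    Returns the combined output and the indices of the chosen experts.
    """
    logits = matvec(router_weights, x)  # one router score per expert
    ranked = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Assumption: gate weights are a softmax over the chosen experts only.
    gates = softmax([logits[i] for i in chosen])
    output = [0.0] * len(x)
    for gate, idx in zip(gates, chosen):
        y = experts[idx](x)  # only the chosen experts are evaluated
        output = [o + gate * v for o, v in zip(output, y)]
    return output, chosen


rng = random.Random(0)
D_MODEL, D_HIDDEN, N_EXPERTS, TOP_K = 8, 16, 4, 2  # illustrative sizes
experts = [make_expert(rng, D_MODEL, D_HIDDEN) for _ in range(N_EXPERTS)]
router_weights = [[rng.gauss(0, 0.1) for _ in range(D_MODEL)] for _ in range(N_EXPERTS)]
token = [rng.gauss(0, 1) for _ in range(D_MODEL)]

output, chosen = moe_forward(token, router_weights, experts, TOP_K)
print("experts used:", chosen, "output size:", len(output))
```

All four experts exist in memory, but only two run for this token, which is the gap between total and active parameters in miniature.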

MoE Efficiency Insight

When evaluating the per-token compute cost and inference speed of an MoE model, the active parameter count is more relevant than the total parameter count. A trillion-parameter MoE model can cost about as much compute per token as a 200-billion-parameter model if their active parameter counts are similar. The total parameter count still determines how much memory is needed to hold the weights.
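To make the total-versus-active distinction concrete, the sketch below counts expert feed-forward parameters for a made-up configuration. The function name, the configuration values, and the assumption that each expert is a gated feed-forward block with three weight matrices are illustrative only and are not taken from any published model card; attention and embedding parameters are ignored.

```python
def moe_expert_param_counts(n_layers, d_model, d_expert, n_experts, top_k, n_shared=0):
    """Count expert feed-forward parameters only (attention and embeddings ignored).

    Assumption: each expert is a gated FFN with three d_model x d_expert matrices.
    """
    per_expert = 3 * d_model * d_expert
    total = n_layers * (n_experts + n_shared) * per_expert   # must fit in memory
    active = n_layers * (top_k + n_shared) * per_expert      # used per token
    return total, active


# Hypothetical configuration chosen only to land near the trillion-parameter scale.
total, active = moe_expert_param_counts(
    n_layers=60, d_model=7168, d_expert=2048, n_experts=384, top_k=8, n_shared=1
)
print(f"total expert params:  {total / 1e9:.1f}B")
print(f"active expert params: {active / 1e9:.1f}B")
print(f"fraction active per token: {active / total:.2%}")
```

Under these assumptions, per-token compute follows the `active` figure, while the hardware still has to hold every parameter counted in `total`.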

Attention Strategies for Memory Optimization

Managing the KV-cache, which stores the keys and values computed for previous tokens so they are not recomputed at every generation step, is a key challenge for long conversations in LLMs. The cache grows with context length, and three main strategies have emerged to reduce its memory or compute cost:

Grouped-Query Attention (GQA): Groups of query heads share a single set of cached keys and values. It is simple to implement and gives a good memory reduction. Used by models like Qwen3 and Llama 4.
Multi-Head Latent Attention (MLA): Compresses the cached keys and values into a smaller latent representation. It saves more memory than GQA but adds computational overhead for compression and decompression. Adopted by DeepSeek and Kimi K2.
Sparse Attention: Attends only to the most relevant previous tokens instead of every token, which reduces computational load and makes it highly effective for very long contexts. It requires careful design to avoid missing important information. Used by DeepSeek and Zhipu AI's GLM-5.
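A rough way to compare the memory side of these strategies is to count how many values must be cached per token per layer. The sketch below does this for standard multi-head attention (as a baseline), GQA, and an MLA-style latent cache. All dimensions are hypothetical, and the MLA formula (one latent vector plus an optional small positional component) is a simplifying assumption. Sparse attention is left out because the text above describes its benefit in terms of compute rather than cache size.

```python
def mha_elems(n_heads, head_dim):
    """Baseline: every head caches its own key and value vectors."""
    return 2 * n_heads * head_dim


def gqa_elems(n_kv_heads, head_dim):
    """GQA: only n_kv_heads key/value pairs are cached and shared across query heads."""
    return 2 * n_kv_heads * head_dim


def mla_elems(latent_dim, rope_dim=0):
    """MLA-style cache (assumption): one compressed latent plus an optional positional part."""
    return latent_dim + rope_dim


def kv_cache_bytes(seq_len, n_layers, elems_per_token_per_layer, bytes_per_elem=2):
    """Total cache size for one sequence; 2 bytes per element assumes 16-bit storage."""
    return seq_len * n_layers * elems_per_token_per_layer * bytes_per_elem


# Hypothetical model dimensions, for illustration only.
SEQ_LEN, N_LAYERS = 131_072, 60
N_HEADS, N_KV_HEADS, HEAD_DIM = 64, 8, 128
LATENT_DIM, ROPE_DIM = 512, 64

sizes = {
    "MHA": kv_cache_bytes(SEQ_LEN, N_LAYERS, mha_elems(N_HEADS, HEAD_DIM)),
    "GQA": kv_cache_bytes(SEQ_LEN, N_LAYERS, gqa_elems(N_KV_HEADS, HEAD_DIM)),
    "MLA": kv_cache_bytes(SEQ_LEN, N_LAYERS, mla_elems(LATENT_DIM, ROPE_DIM)),
}
for name, size in sizes.items():
    print(f"{name}: {size / 2**30:.1f} GiB")
```

With these made-up numbers, GQA shrinks the cache by the ratio of query heads to key/value heads, and the latent cache is smaller still, at the price of the extra compression and decompression work noted above.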

Expert Count, Sparsity, and Shared Experts

The number of experts in an MoE model (ranging from 16 to 384 in recent models) is a design decision that affects memory footprint and training effectiveness. More experts can improve learning at a fixed compute budget but increase the total memory required to hold the model. A second architectural debate concerns the 'shared expert', an expert that processes every token regardless of routing and so provides a baseline capability. Some models, such as DeepSeek V3 and Kimi K2, include one, while others, such as Qwen3, have dropped it, which suggests the consensus on its necessity is still evolving.
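The role of a shared expert can be shown with a toy example: its output is always included, and the outputs of the routed experts are added on top, weighted by their gates. The function name `moe_with_shared`, the toy experts, and the gate values are hypothetical and exist only to illustrate the combination step.

```python
def moe_with_shared(x, shared_expert, routed_experts, gates):
    """Combine an always-on shared expert with the routed experts chosen for this token.

    gates maps a routed expert's index to its gate weight; experts that were
    not selected by the router simply do not appear in the mapping.
    """
    out = shared_expert(x)  # runs for every token, regardless of routing
    for idx, gate in gates.items():
        y = routed_experts[idx](x)
        out = [o + gate * v for o, v in zip(out, y)]
    return out


# Toy experts: each just scales its input (stand-ins for real feed-forward networks).
def identity(x):
    return list(x)


def scale_by(factor):
    return lambda x: [factor * v for v in x]


routed = [scale_by(2.0), scale_by(3.0), scale_by(-1.0)]
gates = {0: 0.75, 2: 0.25}  # hypothetical router decision: experts 0 and 2

result = moe_with_shared([1.0, 2.0], identity, routed, gates)
print(result)
```

Dropping the shared expert corresponds to starting `out` from zeros, which leaves every token fully dependent on the router's choices.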

Training Approaches: Pre-training vs. Post-training

While pre-training (learning from trillions of tokens) provides the model's foundational knowledge, post-training is where significant differentiation now occurs. Key post-training techniques include:

Reinforcement Learning with Verifiable Rewards: The model's outputs are objectively checked for correctness (e.g., code compilation, math answers), and rewards guide further training. This was a breakthrough for DeepSeek R1.
Distillation: A larger 'teacher' model's outputs are used to train smaller 'student' models, enabling the creation of more efficient, compact models that retain high performance. Llama 4 uses co-distillation as part of its training regimen.
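Both techniques reduce to a simple signal that training can optimise. The sketch below shows a hypothetical exact-match reward for a math-style answer and a plain temperature-softened KL distillation loss between teacher and student logits. Function names, the normalisation rule, and the temperature are assumptions. This is generic logit distillation, not the co-distillation procedure used for Llama 4, and not DeepSeek R1's actual reward code.

```python
import math


def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Return 1.0 if the answer matches the reference after light normalisation, else 0.0."""

    def normalise(text: str) -> str:
        return text.strip().lower().replace(" ", "")

    return 1.0 if normalise(model_answer) == normalise(reference_answer) else 0.0


def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over softened next-token distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# The reward is binary and checkable without human judgment.
print(verifiable_reward(" 42 ", "42"), verifiable_reward("41", "42"))

# The loss shrinks as the student's distribution approaches the teacher's.
teacher = [2.0, 0.5, -1.0]
print(distillation_loss(teacher, [0.0, 0.0, 0.0]))
print(distillation_loss(teacher, [1.9, 0.6, -1.0]))
```

In practice the reward would feed a reinforcement learning update and the loss would be minimised by gradient descent on the student; both steps are outside the scope of this sketch.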
