This article explores how open-weight models have transformed the AI landscape by fostering collaboration and innovation. It delves into the architectural choices, particularly the Mixture-of-Experts (MoE) transformer, and various attention strategies and training approaches that define the current generation of LLMs. Understanding these architectural and training decisions is crucial for designing and deploying scalable AI systems.
Read original on ByteByteGo
The distinction between open-weight and closed-weight models is fundamental in the AI landscape. A closed-weight model is typically accessed via an API, with the model's trained parameters remaining proprietary on the company's servers. Users cannot run the model on their hardware or fine-tune it. In contrast, an open-weight model makes its trained parameters publicly available, allowing anyone to download, run, and adapt the model. While the full training data and code usually remain private, the published weights and detailed technical reports enable collaborative innovation, similar to open-source software but focused on the model artifact rather than the full source code.
The Mixture-of-Experts (MoE) transformer has become the de facto architectural skeleton for many frontier open-weight LLMs. Traditional (dense) models activate all of their parameters for every token they process, which makes large models expensive to run. MoE addresses this by replacing each feed-forward layer with several 'expert' sub-networks plus a router that decides which experts handle each token. This design lets the model store a large amount of knowledge (total parameters) while activating only a small subset of experts per token (active parameters), which significantly reduces inference cost. This is a critical architectural trade-off for scaling LLMs.
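The routing idea described above can be sketched in a few lines of plain Python. This is a toy illustration, not any particular model's implementation: the names `route_token`, `moe_layer`, and `router` are hypothetical, the experts are simple scaling functions, and top-k selection with softmax renormalization is assumed as the routing rule.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a short list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Renormalize over the chosen experts so the mixing weights sum to 1.
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_layer(token, experts, router, top_k=2):
    """Run only the selected experts and return the weighted sum of their outputs."""
    out = [0.0] * len(token)
    for idx, weight in route_token(router(token), top_k):
        expert_out = experts[idx](token)
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out

# Toy setup (assumption): 8 experts, expert i scales its input by i + 1.
NUM_EXPERTS = 8
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]

# Toy router (assumption): one random linear scoring row per expert.
rng = random.Random(0)
router_weights = [[rng.uniform(-1, 1) for _ in range(4)]
                  for _ in range(NUM_EXPERTS)]

def router(token):
    return [sum(w * v for w, v in zip(row, token)) for row in router_weights]

token = [0.5, -1.0, 0.25, 2.0]
selected = route_token(router(token), top_k=2)
output = moe_layer(token, experts, router, top_k=2)
print("experts used:", [idx for idx, _ in selected], "of", NUM_EXPERTS)
```

Only two of the eight experts run for this token, which is the sense in which active parameters are a small fraction of total parameters.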
MoE Efficiency Insight
When evaluating the operational cost and inference speed of MoE models, the active parameter count is more relevant than the total parameter count. A trillion-parameter MoE model can be as cost-effective as a 200-billion-parameter model if their active parameter counts are similar.
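A rough way to see this is the common rule of thumb that a forward pass costs about two floating-point operations per active parameter per token. The sketch below applies that approximation to two hypothetical models. The parameter counts are made up for illustration, and the estimate ignores attention cost over long contexts.

```python
def forward_flops_per_token(active_params):
    # Approximation: one multiply and one add per active weight.
    return 2 * active_params

# Hypothetical models, chosen only to illustrate the comparison.
models = {
    "sparse_moe": {"total": 1_000_000_000_000, "active": 32_000_000_000},
    "dense": {"total": 32_000_000_000, "active": 32_000_000_000},
}

flops = {name: forward_flops_per_token(m["active"]) for name, m in models.items()}

for name, m in models.items():
    share = m["active"] / m["total"]
    print(f"{name}: {flops[name] / 1e9:.0f} GFLOPs per token, "
          f"{share:.1%} of weights active")
```

Under these assumptions both models need the same compute per token even though one stores about thirty times more parameters.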
Managing the KV-cache, which stores the attention keys and values computed for earlier tokens so they do not have to be recomputed, is a key challenge for long conversations in LLMs. Three main strategies have emerged to optimize memory usage:
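To see why the cache becomes a bottleneck, its size can be estimated with simple arithmetic. The sketch below assumes a standard layout with one key and one value vector per token, per layer, per key/value head, stored in 16-bit precision. The function name and the configuration values are hypothetical and not taken from any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    # Factor 2: a key vector and a value vector are both cached.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical model shape, for illustration only.
config = dict(num_layers=32, num_kv_heads=32, head_dim=128)

short_ctx = kv_cache_bytes(seq_len=4_096, **config)
long_ctx = kv_cache_bytes(seq_len=131_072, **config)

print(f"4K context:   {short_ctx / 1024**3:.0f} GiB")
print(f"128K context: {long_ctx / 1024**3:.0f} GiB")
```

The cache grows linearly with context length, so a 32-fold longer conversation needs 32 times the memory for a single sequence under these assumptions.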
The number of experts in an MoE model (ranging from 16 to 384 in recent models) is a design decision that affects both memory footprint and training effectiveness. More experts can improve learning at a fixed compute budget, but they increase the total memory the model requires. Another architectural debate concerns whether to include a 'shared expert' that processes every token and provides a baseline capability. Some models, such as DeepSeek V3 and Kimi K2, include one, while others, such as Qwen3, have dropped it, which suggests there is not yet a consensus on its necessity.
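The memory side of this trade-off can be illustrated with a small parameter count for a single MoE layer. The sketch assumes each expert is a two-matrix feed-forward block with biases ignored, which is a simplification because many current models use a three-matrix gated variant. The function name, layer sizes, and `top_k` value are hypothetical.

```python
def expert_layer_params(d_model, d_ff, num_experts, top_k, shared_experts=0):
    # Assumption: each expert holds an up-projection and a down-projection.
    per_expert = 2 * d_model * d_ff
    # The router scores every expert for each token.
    router = d_model * num_experts
    total = (num_experts + shared_experts) * per_expert + router
    # Shared experts run for every token, so they always count as active.
    active = (top_k + shared_experts) * per_expert + router
    return total, active

# Hypothetical layer sizes, for illustration only.
D_MODEL, D_FF, TOP_K = 1024, 2048, 2

results = {}
for n in (16, 64, 384):
    results[n] = expert_layer_params(D_MODEL, D_FF, n, TOP_K)
    total, active = results[n]
    print(f"{n:>3} experts: total={total:,} active={active:,}")

with_shared = expert_layer_params(D_MODEL, D_FF, 64, TOP_K, shared_experts=1)
```

Total parameters scale with the number of experts while active parameters stay almost flat, growing only through the router. A shared expert adds the same fixed amount to both counts.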
While pre-training (learning from trillions of tokens) provides the model's foundational knowledge, post-training is where significant differentiation now occurs. Key post-training techniques include: