The New Stack·June 1, 2026

JetBrains Mellum2: A Specialized 12B-Parameter MoE Model for AI Agent Infrastructure

JetBrains has open-sourced Mellum2, a 12B-parameter Mixture-of-Experts (MoE) coding model optimized for infrastructure-layer AI agent tasks such as routing, retrieval pipelines, and sub-agent coordination. Designed for fast, efficient inference in production, Mellum2 offers an alternative to proprietary models: it can be deployed privately on-premises, which gives enterprises building their own AI infrastructure greater operational control.



Introduction to Mellum2's Architectural Philosophy

JetBrains introduces Mellum2 as a "focal model": a specialized, fast AI component built for specific, high-frequency tasks in software engineering environments, rather than a competitor to broad-spectrum "frontier models." This philosophy favors efficiency and specialization, which suits integration into larger AI agent systems where individual components handle dedicated workloads.
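
To illustrate where a focal model might sit in such a system, here is a minimal sketch of a dispatcher that sends high-frequency infrastructure tasks to a specialized backend and everything else to a general-purpose one. The task labels, the ModelBackend class, and the handle callables are all hypothetical stand-ins chosen for illustration; none of them come from JetBrains or from any Mellum2 API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical task labels, loosely based on the use cases the article lists.
FOCAL_TASKS = {"routing", "retrieval", "sub_agent_coordination"}


@dataclass
class ModelBackend:
    name: str
    handle: Callable[[str], str]  # stand-in for a real inference call


def make_dispatcher(focal: ModelBackend, general: ModelBackend):
    """Return a function that picks a backend based on the task type."""
    def dispatch(task_type: str, prompt: str) -> str:
        backend = focal if task_type in FOCAL_TASKS else general
        return backend.handle(prompt)
    return dispatch


if __name__ == "__main__":
    # Both backends are fakes that only tag the prompt with their name.
    focal = ModelBackend("focal", lambda p: f"[focal] {p}")
    general = ModelBackend("general", lambda p: f"[general] {p}")
    dispatch = make_dispatcher(focal, general)
    print(dispatch("routing", "pick a tool for this request"))
    print(dispatch("open_ended_reasoning", "compare two migration plans"))
```

In a real deployment the handle callables would wrap actual inference clients, and the split between focal and general tasks would be a design decision for the team building the agent system.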

Mixture-of-Experts (MoE) Architecture for Performance

Mellum2 uses a Mixture-of-Experts (MoE) architecture to achieve high performance at scale. The model has 12 billion total parameters but activates only 2.5 billion per token. Each input token is routed to a subset of the model's 64 specialized "experts" rather than passing through the entire network. This reduces computational cost and latency during inference while retaining the capacity of a larger model, and the lower per-token cost is what makes the design effective for production deployments under concurrent load.

Sparse Activation: Only a fraction of the model's parameters (2.5B out of 12B) are active for each token, which leads to faster inference.
Parallel Processing: Tokens are routed to specific experts, so work on different tokens can be spread across experts, enabling a degree of parallelization.
Cost Efficiency: In terms of inference cost, the model behaves more like a 2.5B model despite its larger total parameter count, which matters for high-volume request traffic.
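
The routing idea described above can be sketched in a few lines of plain Python. This is a generic top-k gating example, not Mellum2's actual router: the expert count of 64 comes from the article, but the value of TOP_K, the softmax gate, and the toy experts are assumptions made for illustration only.

```python
import math
import random

NUM_EXPERTS = 64  # expert count stated in the article
TOP_K = 2         # assumption: the article does not say how many experts fire per token


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and renormalize their gate weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]


def moe_layer(token, router_logits, experts, top_k=TOP_K):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    return sum(w * experts[i](token) for i, w in route_token(router_logits, top_k))


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy experts: each one scales its input by a different constant.
    experts = [lambda x, s=i + 1: x * s for i in range(NUM_EXPERTS)]
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

    print("selected experts:", route_token(logits))
    print("layer output:", moe_layer(1.0, logits, experts))
    # Active-parameter share using the article's figures: 2.5B of 12B.
    print(f"active share: {2.5 / 12:.1%}")
```

Because only the selected experts are evaluated, the per-token compute scales with TOP_K rather than with NUM_EXPERTS, which is the source of the inference savings the list above describes. With the article's figures, roughly one fifth of the parameters are active for any given token.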

Performance Benchmarks and Trade-offs

[Benchmark table. Columns: Model, Single Request (TPS), Concurrent Load (TPS), Active Parameters. Data rows not present in the source.]

Benchmarks show Mellum2 matching or exceeding competitors such as Qwen2.5-7B under concurrent load. It excels at code-specific tasks, scoring 78.4% on EvalPlus, but makes a deliberate trade-off: because its training focuses narrowly on code and developer documentation, it performs worse than general-purpose models on broader reasoning and knowledge evaluations. This reflects a common system design decision: optimizing for a specific use case often means sacrificing generality.

Deployment Flexibility and Operational Control


Self-Hosting AI Infrastructure

Mellum2 is released under the Apache 2.0 license with open weights, which is a critical consideration for enterprises. Unlike proprietary models that are served through external APIs, Mellum2 can be hosted and operated on-premises. This gives companies more control over data privacy, security, infrastructure, and future model development, which appeals to organizations with strict compliance requirements or a desire to reduce vendor lock-in for critical AI components. The trade-off is that the operational burden and responsibility shift to the deploying organization, in exchange for maximum flexibility.

AI agents, LLM, Mixture of Experts, MoE, inference, open-source AI, on-premises deployment, code generation