Dev.to #systemdesign · June 18, 2026

Optimizing LLM Workloads: The Two-Queue Discipline for Unified Memory Systems

This article details a system design principle, the "two-queue discipline," to prevent resource exhaustion and kernel panics when running diverse LLM workloads on machines with unified memory. It distinguishes between local-heavy and remote-API tasks, advocating for separate, controlled queues to manage their distinct resource saturation profiles, thereby improving system stability and performance.

The article addresses a common challenge in managing mixed workloads involving local Large Language Models (LLMs) and calls to remote LLM APIs, particularly on systems with unified memory architectures like Apple Silicon. The core issue is resource contention leading to system instability, including kernel panics, when local-heavy tasks (like model loading or downloads) and remote-API fleet tasks (like parallel API calls) are run concurrently without proper management.

Understanding the Two Task Classes

The author categorizes LLM-related tasks into two primary classes based on where they cause resource saturation:

Local-heavy tasks: These directly saturate local machine resources such as unified memory bandwidth, disk I/O, and CPU cores. Examples include `ollama pull` for model downloads, loading large models into LM Studio with `keep_alive`, or running extensive local test suites.
Remote-API fleet tasks: These primarily saturate network connections and external API rate limits, not local compute resources. Examples include parallel subagent dispatches making concurrent calls to cloud LLM providers, or batch jobs fanning out to external services.
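As a rough illustration of the split above, the sketch below sorts a command string into one of the two classes. The `TaskClass` enum, the `classify` helper, and the `REMOTE_API_PREFIXES` labels are all hypothetical names invented for this example; the article does not prescribe a classification mechanism. Unrecognized work is treated as local-heavy here, on the assumption that serial execution is the safer default.

```python
from enum import Enum


class TaskClass(Enum):
    LOCAL_HEAVY = "local-heavy"  # saturates unified memory, disk I/O, CPU
    REMOTE_API = "remote-api"    # saturates network connections and rate limits


# Hypothetical allowlist: only commands with these prefixes count as remote-API.
REMOTE_API_PREFIXES = ("subagent", "api-call")


def classify(command: str) -> TaskClass:
    """Return the queue a command belongs to."""
    normalized = command.strip().lower()
    if normalized.startswith(REMOTE_API_PREFIXES):
        return TaskClass.REMOTE_API
    # Anything unrecognized is assumed to be local-heavy (the conservative choice).
    return TaskClass.LOCAL_HEAVY
```

With this table, `classify("ollama pull llama3")` lands in the local-heavy class, while `classify("subagent summarize")` lands in the remote-API class.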

The Two-Queue Discipline Rule

Core Principle

Local-heavy tasks run serially, one at a time; remote-API fleet tasks run with bounded concurrency; never cross-mix the two. This prevents the distinct saturation failure modes of each task class from compounding and causing system collapse.

Local-heavy tasks: Serial only. Only one such task should be active at any given time. If an `ollama pull` is in progress, no other model loads or heavy operations should start concurrently.
Remote-API fleet tasks: Bounded concurrency. A suggested default is "≤5 concurrent tasks." These are network-bound, so local resources are less impacted, but the bound still matters: it keeps the fleet within network connection limits and cloud provider rate limits.
Never cross-mix: If a local-heavy task is running, the remote-API queue is paused. Conversely, if remote-API tasks are active, no local-heavy tasks are initiated. This strict separation is key to stability.
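The three rules above can be sketched as a small `asyncio` admission gate. This is a minimal sketch under stated assumptions, not the author's implementation: the `TwoQueueScheduler` class and its method names are hypothetical, and the default of 5 simply mirrors the suggested bound. A single `asyncio.Condition` guards two counters, so a local-heavy job is admitted only when nothing else is running, and a remote job is admitted only when no local-heavy job is active and fewer than `max_remote` remote jobs are in flight.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class TwoQueueScheduler:
    """Hypothetical gate: serial local-heavy, bounded remote, never mixed."""

    def __init__(self, max_remote: int = 5) -> None:
        self.max_remote = max_remote
        self.local_active = 0   # 0 or 1
        self.remote_active = 0  # 0..max_remote
        self._cond = asyncio.Condition()

    async def run_local(self, job: Callable[[], Awaitable[T]]) -> T:
        async with self._cond:
            # Wait until no local-heavy task and no remote task is running.
            await self._cond.wait_for(
                lambda: self.local_active == 0 and self.remote_active == 0
            )
            self.local_active = 1
        try:
            return await job()
        finally:
            async with self._cond:
                self.local_active = 0
                self._cond.notify_all()

    async def run_remote(self, job: Callable[[], Awaitable[T]]) -> T:
        async with self._cond:
            # Wait until no local-heavy task runs and a remote slot is free.
            await self._cond.wait_for(
                lambda: self.local_active == 0
                and self.remote_active < self.max_remote
            )
            self.remote_active += 1
        try:
            return await job()
        finally:
            async with self._cond:
                self.remote_active -= 1
                self._cond.notify_all()
```

A caller would wrap each unit of work, for example `await scheduler.run_remote(call_provider)`, where `call_provider` is any zero-argument coroutine function. One limitation of this sketch: a steady stream of remote jobs can starve a waiting local-heavy job, so a fuller version would likely stop admitting remote work once a local-heavy job is queued.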

Future Trajectories and System Design Implications

Auto-scheduling: Automating the "pre-flight gate" checks (load average, disk space, existing heavy tasks) into a robust scheduler that intelligently manages task queues based on real-time system metrics. This shifts from manual discipline to codified policy.
Cross-machine fleet coordination: Extending the two-queue logic to a multi-node architecture, where different machines specialize in local-heavy or remote-API tasks. This introduces challenges in distributed queue state management (e.g., using Redis for smaller setups, or a durable queue for larger fleets).
Predictive saturation modeling: Implementing logic to predict if a new task will exceed the practical working set of unified memory *before* dispatch, preventing errors proactively rather than reacting to failures. This requires understanding the dynamic memory footprint considering fragmentation, caches, and kernel overhead.
Observability: Building a dedicated monitoring surface that visualizes queue states, real-time load, and projected memory headroom. This provides immediate feedback on potential rule violations, enabling early intervention.
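To make the pre-flight gate and the predictive check more concrete, here is a hedged sketch of both. Every threshold in `GateLimits`, along with the function names `preflight_ok` and `fits_in_memory`, is an assumption chosen for illustration; the article names the signals (load average, disk space, existing heavy tasks, memory headroom) but gives no values. The fixed `memory_safety_margin` is a crude stand-in for the fragmentation, cache, and kernel overhead that a real model would need to estimate. `os.getloadavg` is available on Unix-like systems such as macOS and Linux, but not on Windows.

```python
import os
import shutil
from dataclasses import dataclass


@dataclass
class GateLimits:
    # All values below are hypothetical defaults, not taken from the article.
    max_load_per_core: float = 1.0      # 1-minute load average per CPU core
    min_free_disk_gb: float = 20.0      # free space required before a download
    memory_safety_margin: float = 0.25  # fraction of memory held in reserve


def preflight_ok(path: str, heavy_task_running: bool, limits: GateLimits) -> bool:
    """Gate a local-heavy task on existing heavy work, load, and disk space."""
    if heavy_task_running:
        return False  # serial rule: never start a second local-heavy task
    load_1m, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    free_gb = shutil.disk_usage(path).free / 1e9
    return (
        load_1m / cores <= limits.max_load_per_core
        and free_gb >= limits.min_free_disk_gb
    )


def fits_in_memory(
    estimated_gb: float, in_use_gb: float, total_gb: float, limits: GateLimits
) -> bool:
    """Admit a task only if its estimated footprint fits the usable working set."""
    usable_gb = total_gb * (1.0 - limits.memory_safety_margin)
    return in_use_gb + estimated_gb <= usable_gb
```

For example, on a 64 GB machine with 10 GB in use and a 25% margin, the usable working set is 48 GB, so a task estimated at 20 GB would be admitted and one estimated at 40 GB would be rejected before dispatch.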

LLM orchestration, resource management, unified memory, concurrency, queueing, kernel panic, system stability, architecture patterns

Architecture Design

Design this yourself

Design an orchestration layer for managing mixed AI workloads, including both local-heavy LLM operations (model loading, inference) and remote-API fleet calls. Implement a two-queue discipline with a robust scheduler that enforces serial execution for local tasks, bounded concurrency for remote tasks, and strict non-mixing. Include mechanisms for pre-flight resource checks and real-time observability of queue states and system health.

Practice Interview

Focus: task queuing and resource arbitration for mixed workloads

Other design angles

· Design a resource arbitration service for a multi-node cluster where different nodes specialize in local-heavy vs. remote-API LLM tasks, focusing on inter-node queue coordination and fault tolerance.
· Design an automated scheduler for a single-machine LLM development environment that dynamically prioritizes and sequences local and remote tasks to prevent resource contention and optimize throughput.
· Design a system for predictive saturation modeling to prevent OOM errors and kernel panics in unified memory systems running diverse AI workloads, detailing the metrics collected and the decision-making logic for task admission.