The article addresses a common challenge in managing mixed workloads involving local Large Language Models (LLMs) and calls to remote LLM APIs, particularly on systems with unified memory architectures like Apple Silicon. The core issue is resource contention leading to system instability, including kernel panics, when local-heavy tasks (like model loading or downloads) and remote-API fleet tasks (like parallel API calls) are run concurrently without proper management.
Understanding the Two Task Classes
The author categorizes LLM-related tasks into two primary classes based on where they cause resource saturation:
- Local-heavy tasks: These directly saturate local machine resources such as unified memory bandwidth, disk I/O, and CPU cores. Examples include `ollama pull` for model downloads, loading large models into LM Studio with `keep_alive`, or running extensive local test suites.
- Remote-API fleet tasks: These primarily saturate network connections and external API rate limits, not local compute resources. Examples include parallel subagent dispatches making concurrent calls to cloud LLM providers, or batch jobs fanning out to external services.
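As a minimal sketch of the classification above, each task could carry an explicit class tag so that a scheduler never has to guess where it saturates. The names `TaskClass`, `Task`, and `split_queues` are hypothetical and do not come from the article; the example tasks simply mirror the ones listed above.

```python
from dataclasses import dataclass
from enum import Enum


class TaskClass(Enum):
    LOCAL_HEAVY = "local-heavy"  # saturates unified memory, disk I/O, CPU
    REMOTE_API = "remote-api"    # saturates network and provider rate limits


@dataclass(frozen=True)
class Task:
    name: str
    task_class: TaskClass


def split_queues(tasks):
    """Partition tasks into (local_heavy, remote_api) lists, preserving order."""
    local = [t for t in tasks if t.task_class is TaskClass.LOCAL_HEAVY]
    remote = [t for t in tasks if t.task_class is TaskClass.REMOTE_API]
    return local, remote


# Illustrative tasks only; the names are labels, not commands that get executed.
EXAMPLES = [
    Task("ollama pull (model download)", TaskClass.LOCAL_HEAVY),
    Task("subagent call to cloud provider", TaskClass.REMOTE_API),
    Task("local test suite", TaskClass.LOCAL_HEAVY),
    Task("batch fan-out to external service", TaskClass.REMOTE_API),
]

if __name__ == "__main__":
    local, remote = split_queues(EXAMPLES)
    print([t.name for t in local])
    print([t.name for t in remote])
```

Tagging at submission time keeps the decision with whoever knows the task best, rather than inferring it later from command names.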
The Two-Queue Discipline Rule
ℹ️ Core Principle
Local-heavy tasks run serially, one at a time; remote-API fleet tasks run with bounded concurrency; never cross-mix the two. This prevents the distinct saturation failure modes of each task class from compounding and causing system collapse.
- Local-heavy tasks: Serial only. Only one such task should be active at any given time. If an `ollama pull` is in progress, no other model loads or heavy operations should start concurrently.
- Remote-API fleet tasks: Bounded concurrency. A suggested default is at most 5 concurrent tasks ("<=5"). These tasks are network-bound, so they put little pressure on local resources, but the cap must still respect network connection limits and cloud provider rate limits.
- Never cross-mix: If a local-heavy task is running, the remote-API queue is paused. Conversely, if remote-API tasks are active, no local-heavy tasks are initiated. This strict separation is key to stability.
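The three rules above can be sketched as a single gate built on `asyncio.Condition`. This is an illustration under stated assumptions, not the article's implementation: `TwoQueueGate`, `local_heavy`, `remote_api`, and `demo` are hypothetical names, the `asyncio.sleep` calls stand in for real work, and the limit of 5 follows the default suggested above.

```python
import asyncio
from contextlib import asynccontextmanager


class TwoQueueGate:
    """Serial local-heavy slot, bounded remote slots, never both at once."""

    def __init__(self, remote_limit: int = 5) -> None:
        self._cond = asyncio.Condition()
        self._remote_limit = remote_limit
        self.local_active = False
        self.remote_active = 0

    @asynccontextmanager
    async def local_heavy(self):
        async with self._cond:
            # Wait until no local-heavy task and no remote call is running.
            await self._cond.wait_for(
                lambda: not self.local_active and self.remote_active == 0
            )
            self.local_active = True
        try:
            yield
        finally:
            async with self._cond:
                self.local_active = False
                self._cond.notify_all()

    @asynccontextmanager
    async def remote_api(self):
        async with self._cond:
            # Wait until no local-heavy task runs and a remote slot is free.
            await self._cond.wait_for(
                lambda: not self.local_active
                and self.remote_active < self._remote_limit
            )
            self.remote_active += 1
        try:
            yield
        finally:
            async with self._cond:
                self.remote_active -= 1
                self._cond.notify_all()


async def demo() -> dict:
    gate = TwoQueueGate(remote_limit=5)
    stats = {"max_remote": 0, "mixed": False, "local_runs": 0}

    async def remote_call() -> None:
        async with gate.remote_api():
            stats["max_remote"] = max(stats["max_remote"], gate.remote_active)
            stats["mixed"] = stats["mixed"] or gate.local_active
            await asyncio.sleep(0.01)  # placeholder for a cloud API call

    async def local_job() -> None:
        async with gate.local_heavy():
            stats["mixed"] = stats["mixed"] or gate.remote_active > 0
            stats["local_runs"] += 1
            await asyncio.sleep(0.01)  # placeholder for a model load or pull

    await asyncio.gather(
        *(remote_call() for _ in range(12)),
        *(local_job() for _ in range(3)),
    )
    return stats


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

One design caveat: this sketch makes no fairness guarantee, so a steady stream of remote calls could keep a waiting local-heavy task from ever starting. A production scheduler would likely stop admitting new remote calls once a local-heavy task is queued.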
Future Trajectories and System Design Implications
- Auto-scheduling: Automating the "pre-flight gate" checks (load average, disk space, existing heavy tasks) into a robust scheduler that intelligently manages task queues based on real-time system metrics. This shifts from manual discipline to codified policy.
- Cross-machine fleet coordination: Extending the two-queue logic to a multi-node architecture, where different machines specialize in local-heavy or remote-API tasks. This introduces challenges in distributed queue state management (e.g., using Redis for smaller setups, or a durable queue for larger fleets).
- Predictive saturation modeling: Implementing logic to predict if a new task will exceed the practical working set of unified memory *before* dispatch, preventing errors proactively rather than reacting to failures. This requires understanding the dynamic memory footprint considering fragmentation, caches, and kernel overhead.
- Observability: Building a dedicated monitoring surface that visualizes queue states, real-time load, and projected memory headroom. This provides immediate feedback on potential rule violations, enabling early intervention.
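To make the auto-scheduling and predictive-saturation items concrete, here is a minimal sketch of a codified pre-flight gate. Everything in it is an assumption for illustration: the names `GateLimits`, `Snapshot`, `preflight`, and `read_snapshot` are hypothetical, the thresholds are placeholders rather than values from the article, and the memory figures are supplied by the caller because the Python standard library has no portable way to read them. `os.getloadavg` is available on Unix-like systems (including macOS) only.

```python
import os
import shutil
from dataclasses import dataclass


@dataclass(frozen=True)
class GateLimits:
    max_load_per_core: float = 0.7   # placeholder threshold
    min_free_disk_gb: float = 20.0   # placeholder threshold
    memory_reserve_gb: float = 8.0   # margin for OS, caches, fragmentation


@dataclass(frozen=True)
class Snapshot:
    load_1m: float
    cpu_count: int
    free_disk_gb: float
    heavy_task_running: bool
    total_mem_gb: float
    used_mem_gb: float


def preflight(snap: Snapshot, projected_task_gb: float,
              limits: GateLimits = GateLimits()) -> list:
    """Return the reasons to refuse dispatch; an empty list means go."""
    reasons = []
    if snap.heavy_task_running:
        reasons.append("another local-heavy task is already running")
    if snap.load_1m / snap.cpu_count > limits.max_load_per_core:
        reasons.append("load average too high")
    if snap.free_disk_gb < limits.min_free_disk_gb:
        reasons.append("not enough free disk space")
    # Predictive check: compare the task's estimated footprint with headroom.
    headroom = snap.total_mem_gb - snap.used_mem_gb - limits.memory_reserve_gb
    if projected_task_gb > headroom:
        reasons.append("projected memory footprint exceeds headroom")
    return reasons


def read_snapshot(path: str, heavy_task_running: bool,
                  total_mem_gb: float, used_mem_gb: float) -> Snapshot:
    """Collect the metrics the stdlib can provide; the rest are passed in."""
    load_1m, _, _ = os.getloadavg()  # Unix-like systems only
    return Snapshot(
        load_1m=load_1m,
        cpu_count=os.cpu_count() or 1,
        free_disk_gb=shutil.disk_usage(path).free / 1e9,
        heavy_task_running=heavy_task_running,
        total_mem_gb=total_mem_gb,
        used_mem_gb=used_mem_gb,
    )
```

Keeping `preflight` a pure function of a `Snapshot` means the policy can be unit-tested without touching the machine, and the list of reasons it returns is the kind of signal an observability surface could display directly.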