How do you balance MoE sparsity for the best LLM inference compared to training?
Natalia Ivanova
I've been thinking about the Mixture-of-Experts (MoE) architectures popping up in open-source LLMs. The article made good points about how sparsity helps with scaling and keeps inference costs down, especially for very large models. But what do we give up at training time? How do you even pick the right amount of sparsity?

Too much sparsity, and training becomes a nightmare: it's slow, routing can fail to converge, and some experts sit underutilized. Too little, and you lose the inference gains, basically ending up with a denser model plus extra routing overhead.

So are people converging on good heuristics or rules of thumb for setting the number of experts and the routing strategy so a model both trains stably and serves fast? And what are the real-world obstacles to getting that balance right as these models keep getting bigger?
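To make the question concrete, here's a minimal numpy sketch (my own, not from the article) of the two knobs being asked about: `num_experts` and `top_k`. It shows plain top-k gating plus a Switch-Transformer-style load-balancing auxiliary loss, which is one common trick for keeping experts from going underutilized. The function names and shapes are illustrative assumptions, not any particular library's API.

```python
# Illustrative sketch of top-k MoE routing: `top_k` controls sparsity
# (active experts per token), `num_experts` controls total capacity.
import numpy as np

def route(logits, top_k):
    """Pick the top_k experts per token and renormalize their gate weights.

    logits: (num_tokens, num_experts) router scores.
    Returns (topk_idx, gates), each shaped (num_tokens, top_k).
    """
    # Indices of the top_k highest-scoring experts for each token.
    topk_idx = np.argsort(logits, axis=1)[:, -top_k:]
    # Softmax over only the selected logits, so gates sum to 1 per token.
    sel = np.take_along_axis(logits, topk_idx, axis=1)
    gates = np.exp(sel - sel.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)
    return topk_idx, gates

def load_balance_loss(logits, topk_idx, num_experts):
    """Switch-style aux loss: num_experts * dot(fraction_routed, mean_prob).

    Minimized when tokens spread evenly across experts; adding it to the
    training loss discourages router collapse onto a few experts.
    """
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Fraction of routing slots assigned to each expert.
    counts = np.bincount(topk_idx.ravel(), minlength=num_experts)
    frac = counts / topk_idx.size
    mean_prob = probs.mean(axis=0)
    return num_experts * float(np.dot(frac, mean_prob))
```

The sparsity trade-off in the question shows up directly here: per-token compute scales with `top_k`, total parameters with `num_experts`, and the aux loss is the extra training machinery you pay for to keep all experts in use.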