Architecture

Mixture of Experts (MoE)

A neural network architecture that routes each input to specialized sub-networks (experts), letting a model grow much larger in total parameters while staying fast and cheap to run, because only a few experts are active at a time.

Mixture of Experts (MoE) is a neural network architecture where the model consists of multiple specialized sub-networks (called "experts") and a gating mechanism that routes each input to the most relevant experts. Only a subset of experts is active for any given input, making the model faster and cheaper to run despite having more total parameters.

How MoE works: a gating network scores the input and selects the most relevant experts, typically only a small number (often 2) out of 8 or more, for each token at each MoE layer. Only the selected experts process the input; their outputs are weighted by the gate's scores and combined, and the model produces its response.
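To make the routing concrete, here is a minimal sketch of a top-k MoE layer in PyTorch. The dimensions, the 8-expert / top-2 configuration, and the per-expert feed-forward structure are illustrative assumptions for this example, not the internals of any particular production model.

```python
# Minimal, illustrative top-k MoE layer (assumed configuration, not a real model's).
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EXPERTS = 8   # total experts available (assumption for the example)
TOP_K = 2         # experts activated per token (assumption for the example)

class MoELayer(nn.Module):
    def __init__(self, d_model: int = 512, d_hidden: int = 2048):
        super().__init__()
        # Each expert is an independent feed-forward sub-network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(NUM_EXPERTS)
        ])
        # The gating network scores every expert for each token.
        self.gate = nn.Linear(d_model, NUM_EXPERTS)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = self.gate(x)                         # (tokens, NUM_EXPERTS)
        top_w, top_idx = scores.topk(TOP_K, dim=-1)   # pick the TOP_K best experts per token
        top_w = F.softmax(top_w, dim=-1)              # normalize their weights

        out = torch.zeros_like(x)
        for k in range(TOP_K):
            idx = top_idx[:, k]                       # chosen expert per token at rank k
            for e in range(NUM_EXPERTS):
                mask = idx == e
                if mask.any():                        # only run experts that were selected
                    weight = top_w[:, k][mask].unsqueeze(-1)
                    out[mask] += weight * self.experts[e](x[mask])
        return out

tokens = torch.randn(4, 512)        # 4 tokens of dimension 512
print(MoELayer()(tokens).shape)     # torch.Size([4, 512])
```

In this sketch, every token still produces a full-size output, but only 2 of the 8 expert networks run for it, which is where the compute savings come from.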

Benefits of MoE: larger model capacity without proportionally larger compute costs, faster inference than a dense model of the same total size (only a fraction of the parameters are active per token), specialization (different experts tend to learn different patterns, domains, or skills), and scalability (adding experts is an easier way to grow capacity than widening or deepening a single dense network).

Notable MoE models include Mixtral 8x7B (Mistral AI), which routes each token to 2 of 8 experts in every MoE layer, so only about 13B of its roughly 47B total parameters are active per token, delivering quality comparable to GPT-3.5 at much lower cost. GPT-4 is also widely believed to use an MoE architecture.
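As a rough back-of-the-envelope sketch of that capacity-versus-compute trade-off, the calculation below uses Mixtral-like figures. The split between shared parameters (attention, embeddings) and per-expert parameters is an assumption chosen only so the totals land near the reported roughly 47B total / 13B active; it is not an exact breakdown of the real model.

```python
# Illustrative arithmetic: total vs. active parameters in a top-k MoE model.
def moe_params(shared_b: float, per_expert_b: float,
               num_experts: int, top_k: int) -> tuple[float, float]:
    """Return (total, active) parameter counts in billions."""
    total = shared_b + num_experts * per_expert_b   # every expert counts toward size
    active = shared_b + top_k * per_expert_b        # only top_k experts run per token
    return total, active

# Assumed split for illustration: ~1.6B shared + ~5.6B per expert, 8 experts, top-2.
total, active = moe_params(shared_b=1.6, per_expert_b=5.6,
                           num_experts=8, top_k=2)
print(f"total = {total:.1f}B, active per token = {active:.1f}B "
      f"({active / total:.0%} of the model does work for each token)")
# total = 46.4B, active per token = 12.8B (28% of the model does work for each token)
```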

For AI agent builders, MoE architecture means: better cost-performance trade-offs (more capable models at lower price points), faster response times (less active computation per request), and internal specialization (different experts tend to handle different kinds of input, such as code versus conversational text), although this routing happens automatically inside the model rather than being something the builder configures.
