
Mixture of Experts (MoE)

Definition

Mixture of Experts (MoE) is a neural network architecture in which only a subset of the model's parameters is activated for each input token. Instead of one monolithic feed-forward network (FFN), an MoE layer replaces the FFN with multiple "expert" FFNs plus a routing mechanism that selects which experts handle each token. Total capacity grows with the number of experts, while compute per token grows only with the number of experts activated.

The Core Idea

A 140B-parameter dense model activates all 140B parameters for every token.

A 140B-parameter MoE model that routes each token to 2 of 8 experts might activate only about 40B parameters per token: the two selected experts plus the shared attention and embedding weights.

```
Dense FFN:                  MoE FFN:

Input → [FFN] → Output      Input → [Router] → Expert 2 ┐
                                             → Expert 5 ┘→ Combine → Output
```

Same (or larger) capacity, much less compute per token.
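
The arithmetic behind this comparison can be sketched in a few lines. This is a minimal illustration, not a description of any specific model: the function name `moe_param_counts` and the split between shared weights and per-expert weights are assumptions chosen so that the total comes to 140B.

```python
def moe_param_counts(shared_b, per_expert_b, n_experts, k):
    """Return (total, active) parameter counts in billions.

    shared_b:     weights every token uses (attention, embeddings, router)
    per_expert_b: weights of one expert, summed over all MoE layers
    """
    total = shared_b + n_experts * per_expert_b
    active = shared_b + k * per_expert_b
    return total, active


# Hypothetical split: 8B shared weights plus 8 experts of 16.5B each.
total, active = moe_param_counts(shared_b=8.0, per_expert_b=16.5, n_experts=8, k=2)
print(f"total = {total:.0f}B, active per token = {active:.0f}B")
```

With 2 of 8 experts active, the active count cannot drop below a quarter of the expert weights, so how small it gets depends mainly on K/N and on how much of the model sits outside the experts.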

Architecture

Standard MoE Layer (replaces FFN in each Transformer block)

```
Token embedding
      ↓
[Router / Gating Network]
      ↓ selects top-K experts
[Expert 1] [Expert 2] ... [Expert N]
      (only K activate for this token)
      ↓ weighted combination
Output embedding
```

Router / Gating Mechanism

  • A small linear layer that maps token embedding → softmax over N experts
  • Top-K selection: typically K=2 (each token uses 2 experts)
  • The selected experts' outputs are weighted by the router's softmax scores

```
gates = softmax(W_router × token_embedding)
top_k_indices = argsort(gates)[-K:]
output = Σ gates[i] × Expert_i(token) for i in top_k_indices
```
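
The routing pseudocode above can be turned into a small runnable sketch. It uses plain Python lists rather than a tensor library, and the toy experts (each one just scales its input) and the `renormalize` flag are assumptions for illustration. Some implementations rescale the selected gate values so they sum to 1; the pseudocode does not, so the flag defaults to off.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def moe_forward(token, router_weights, experts, k=2, renormalize=False):
    """Route one token (a list of floats) through its top-k experts."""
    # One router logit per expert: dot product of a router row with the token.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    gates = softmax(logits)
    # Indices of the k largest gate values, largest first.
    top_k = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    norm = sum(gates[i] for i in top_k) if renormalize else 1.0
    output = [0.0] * len(token)
    for i in top_k:
        weight = gates[i] / norm
        expert_out = experts[i](token)  # only the selected experts are evaluated
        output = [o + weight * y for o, y in zip(output, expert_out)]
    return output, top_k


# Toy setup: 4 experts over 2-dimensional tokens; expert i scales its input.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
token = [2.0, 1.0]
out, chosen = moe_forward(token, router_weights, experts, k=2, renormalize=True)
print(chosen, out)
```

Experts that are not selected are never called, which is where the compute saving comes from.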

MoE Parameters

| Parameter | Typical Value | Effect |
|-----------|--------------|--------|
| N (total experts) | 8, 16, 64, 128 | Total model capacity |
| K (active experts per token) | 1, 2 (6–8 in fine-grained designs) | Compute per token |
| Expert size | Same as dense FFN | Individual expert capacity |
| Router type | Top-K softmax | Routing strategy |

Famous MoE Models

| Model | Experts | Active | Total Params | Active Params |
|-------|---------|--------|-------------|--------------|
| GPT-4 (estimated) | ~16 | ~2 | ~1.8T | ~220B |
| Mixtral 8×7B | 8 | 2 | ~46B | ~12.9B |
| Mixtral 8×22B | 8 | 2 | ~141B | ~39B |
| DeepSeek-V2 | 160 | 6 | 236B | 21B |
| DeepSeek-V3 | 256 | 8 | 671B | 37B |
| Grok-1 | 8 | 2 | 314B | ~85B |
| LLaMA-MoE (various) | Various | Various | Various | Various |
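
The Total and Active columns together hint at how a model's weights divide between shared layers and experts. The sketch below solves the two equations total = shared + N × expert and active = shared + K × expert for the Mixtral 8×7B row. It assumes equally sized experts and exactly K experts per token; the helper name `split_from_totals` is made up for this example, and the result is an estimate, not a published figure.

```python
def split_from_totals(total_b, active_b, n_experts, k):
    """Estimate (shared, per_expert) parameters in billions.

    Solves:  total  = shared + n_experts * per_expert
             active = shared + k * per_expert
    """
    per_expert = (total_b - active_b) / (n_experts - k)
    shared = active_b - k * per_expert
    return shared, per_expert


# Mixtral 8x7B row from the table: ~46B total, ~12.9B active, 2 of 8 experts.
shared, per_expert = split_from_totals(46.0, 12.9, n_experts=8, k=2)
print(f"shared ~ {shared:.1f}B, per expert ~ {per_expert:.1f}B")
```

This also suggests why "8×7B" comes to about 46B rather than 56B: the experts share the attention and embedding weights, so only the FFN portion is replicated eight times.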

Advantages

Compute Efficiency

  • A 46B MoE model (Mixtral 8×7B) runs at roughly the per-token compute cost of a ~13B dense model
  • Higher capacity-to-compute ratio than dense models
  • With 2 of 8 experts active per token, 75% of the expert FFN parameters go unused in each forward pass

Quality

  • MoE models typically outperform dense models with the same active parameter count
  • Specialists may emerge: different experts handle different types of knowledge

Throughput at Scale

  • For the same quality target, MoE requires fewer FLOPs per token
  • Better inference throughput when all experts fit in memory
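
The compute claims in this section can be approximated with a common rule of thumb: a forward pass costs roughly 2 FLOPs per active parameter per token (one multiply and one add), ignoring attention over the context. The sketch below applies that approximation to the Mixtral 8×7B figures; the function names are illustrative only.

```python
def forward_flops_per_token(active_params):
    """Rule-of-thumb forward cost: one multiply and one add per active weight."""
    return 2 * active_params


def unused_expert_fraction(n_experts, k):
    """Share of expert FFN weights a single token does not touch."""
    return 1 - k / n_experts


moe_flops = forward_flops_per_token(12.9e9)    # Mixtral 8x7B, active parameters
dense_flops = forward_flops_per_token(46e9)    # dense model with the same total size
ratio = dense_flops / moe_flops
idle = unused_expert_fraction(n_experts=8, k=2)
print(f"dense/MoE FLOPs per token: {ratio:.2f}x, idle expert weights: {idle:.0%}")
```

The estimate covers arithmetic only. Real throughput also depends on memory bandwidth and on how evenly tokens spread across experts.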

Disadvantages

Memory Requirements

All expert weights must be loaded into GPU memory, even if only 2 of 8 experts are used per token:

  • Mixtral 8×7B: ~46B params → ~92GB in fp16 → needs at least 2× A100 80GB for the weights alone
  • A 13B dense model would require only ~26GB
  • Memory ≠ compute: MoE models are compute-efficient but memory-heavy
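
The memory figures above follow from multiplying the parameter count by the bytes per parameter. The sketch below reproduces them for weights only; it leaves out activations and the KV cache, so real deployments need extra headroom. The dtype table and helper names are assumptions for this example.

```python
import math

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(n_params, dtype="fp16"):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def gpus_needed(memory_gb, gpu_gb=80):
    """Minimum number of GPUs whose combined memory holds the weights."""
    return math.ceil(memory_gb / gpu_gb)


moe_gb = weight_memory_gb(46e9)      # Mixtral 8x7B, all experts resident
dense_gb = weight_memory_gb(13e9)    # dense model with similar per-token compute
print(f"MoE: {moe_gb:.0f} GB on {gpus_needed(moe_gb)} GPUs; dense: {dense_gb:.0f} GB")
print(f"MoE at int4: {weight_memory_gb(46e9, 'int4'):.0f} GB")
```

The last line shows why quantization is often paired with MoE models: lower precision shrinks the memory footprint without changing which experts a token uses.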

Load Balancing Challenge

If most tokens route to the same 1–2 experts ("expert collapse"):

  • Those experts become overloaded
  • The others go unused (dead experts)
  • Fix: an auxiliary load-balancing loss added during training to encourage uniform routing
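
One widely used form of the auxiliary loss, popularized by the Switch Transformer, multiplies the fraction of tokens sent to each expert by the mean router probability for that expert. The text above does not say which formulation it has in mind, so the sketch below is one possible choice: the `alpha` coefficient and the top-1 dispatch counting are assumptions.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """Auxiliary loss: alpha * N * sum_i f_i * p_i.

    router_probs: one list of N router probabilities per token.
    f_i: fraction of tokens whose top-1 expert is i.
    p_i: mean router probability assigned to expert i.
    """
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    f = [0.0] * n_experts
    for probs in router_probs:
        winner = max(range(n_experts), key=lambda i: probs[i])
        f[winner] += 1.0 / n_tokens
    p = [sum(probs[i] for probs in router_probs) / n_tokens for i in range(n_experts)]
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, p))


# Balanced: each of 4 tokens prefers a different expert.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Collapsed: every token prefers expert 0.
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4

print(load_balancing_loss(balanced), load_balancing_loss(collapsed))
```

The loss is smallest when routing is uniform and grows as tokens concentrate on a few experts, so adding it to the training objective pushes the router toward spreading the load.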

Training Instability

MoE training is harder than dense training:

  • The router can learn degenerate patterns
  • Requires careful initialization and loss balancing
  • Adds communication overhead in distributed training

Communication Overhead

In distributed training/inference, different experts may be on different GPUs:

  • Each token must be sent to the GPU hosting its selected expert
  • This all-to-all communication becomes a significant overhead at scale
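
To make the dispatch step concrete, the sketch below groups token-to-expert assignments by the GPU that hosts each expert and counts how many must leave the source GPU. It is a bookkeeping illustration only: no real communication library is involved, and the expert placement (two experts per GPU) and the function name `plan_dispatch` are assumptions.

```python
from collections import defaultdict


def plan_dispatch(token_experts, expert_to_gpu, source_gpu):
    """Group (token_id, expert_id) pairs by destination GPU.

    token_experts: entry t lists the expert ids chosen for token t.
    Returns the per-GPU buckets and the number of pairs that must be
    sent to a GPU other than source_gpu.
    """
    buckets = defaultdict(list)
    for token_id, chosen in enumerate(token_experts):
        for expert_id in chosen:
            buckets[expert_to_gpu[expert_id]].append((token_id, expert_id))
    remote = sum(len(pairs) for gpu, pairs in buckets.items() if gpu != source_gpu)
    return dict(buckets), remote


# Assumed placement: 8 experts spread over 4 GPUs, two experts per GPU.
expert_to_gpu = {e: e // 2 for e in range(8)}
# Top-2 routing decisions for three tokens that currently live on GPU 0.
token_experts = [[0, 5], [1, 2], [6, 7]]

buckets, remote = plan_dispatch(token_experts, expert_to_gpu, source_gpu=0)
print(buckets, remote)
```

In this toy batch, four of the six expert calls land on other GPUs. Every GPU runs the same exchange at once, which is what makes the pattern all-to-all.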

Expert Specialization

Research suggests that MoE experts can develop specialization:

  • Some experts activate more for code, others for natural language
  • Some activate for specific languages
  • Specialization is emergent, not programmed

Fine-tuning MoE Models

Standard LoRA works well for MoE:

  • Apply LoRA adapters to attention layers
  • Optionally apply to expert FFN layers
  • Router weights typically frozen during fine-tuning
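
The selection rules above can be expressed as a small filter over parameter names. The naming scheme used here (`.attn.`, `.experts.`, `.router.`) is hypothetical and does not correspond to any particular library or checkpoint; real models use their own module names, which would need to be substituted.

```python
def select_lora_targets(param_names, include_experts=False):
    """Return the weight names that would receive LoRA adapters.

    Attention projections are always selected, expert FFN weights only
    when include_experts is True, and router weights never (they stay frozen).
    """
    targets = []
    for name in param_names:
        if ".router." in name:
            continue  # router stays frozen during fine-tuning
        if ".attn." in name:
            targets.append(name)
        elif ".experts." in name and include_experts:
            targets.append(name)
    return targets


# Hypothetical parameter names for one Transformer block.
names = [
    "layers.0.attn.q_proj.weight",
    "layers.0.attn.v_proj.weight",
    "layers.0.moe.router.weight",
    "layers.0.moe.experts.0.up_proj.weight",
    "layers.0.moe.experts.1.up_proj.weight",
]

print(select_lora_targets(names))
print(select_lora_targets(names, include_experts=True))
```

Adapting only the attention layers keeps the number of trainable parameters small. Including the experts multiplies the adapter count by the number of experts per layer.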

MoE vs. Dense: When to Choose

| Use Case | Prefer MoE | Prefer Dense |
|----------|------------|-------------|
| Quality per FLOP | MoE wins | — |
| Memory-constrained | — | Dense wins |
| Fast single-GPU inference | — | Dense wins |
| Large-scale serving | MoE wins | — |
| Fine-tuning ease | — | Slight edge to dense |

Related Concepts

  • Transformer, Parameters, Scaling Laws, Inference, Fine-Tuning, Quantization
