Mixture of Experts (MoE) — FDE@ProdAI Blog

Definition

Mixture of Experts (MoE) is a neural network architecture in which only a subset of the model's parameters is activated for each input token. Instead of one monolithic feed-forward network (FFN), MoE replaces the FFN layer with multiple "expert" FFNs plus a routing mechanism that selects which experts handle each token. This allows very large total capacity while per-token compute grows with the number of active experts rather than with the total parameter count.

The Core Idea

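A 140B-parameter dense model activates all 140B parameters for every token.

A 140B-parameter MoE model might activate only about 39B parameters per token (by routing to 2 of 8 experts), since each token always uses the shared attention weights plus a quarter of the expert weights.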

Dense FFN:
Input → [FFN] → Output

MoE FFN:
Input → [Router] → Expert 2 ┐
                 → Expert 5 ┘→ Combine → Output

Same (or larger) capacity, much less compute per token.
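The gap between total and active parameters can be worked out with simple arithmetic. The sketch below assumes a model splits into shared parameters (attention, embeddings, router) that every token uses and N equally sized experts; the function name and the sizes passed in are illustrative assumptions, rounded to land near Mixtral 8×7B's ~46B total and ~12.9B active.

```python
def moe_param_counts(shared_b, expert_b, n_experts, top_k):
    """Return (total, active) parameter counts in billions.

    shared_b: parameters every token uses (attention, embeddings, router)
    expert_b: parameters in ONE expert, summed over all MoE layers
    """
    total = shared_b + n_experts * expert_b
    active = shared_b + top_k * expert_b
    return total, active


# Assumed, rounded sizes for an 8-expert, top-2 model (not official figures)
total, active = moe_param_counts(shared_b=1.6, expert_b=5.6, n_experts=8, top_k=2)
print(f"total={total:.1f}B active={active:.1f}B share={active / total:.0%}")
```

Because the shared parameters are always active, the active share comes out somewhat above K/N. This is also why an "8×7B" model totals well under 56B: only the FFN is replicated, not the whole model.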

Architecture

Standard MoE Layer (replaces FFN in each Transformer block)

Token embedding
↓
[Router / Gating Network]
↓ selects top-K experts
[Expert 1] [Expert 2] ... [Expert N]
(only K activate for this token)
↓ weighted combination
Output embedding

Router / Gating Mechanism

A small linear layer that maps token embedding → softmax over N experts
Top-K selection: typically K=2 (each token uses 2 experts)
The selected experts' outputs are weighted by the router's softmax scores

gates = softmax(W_router × token_embedding)
top_k_indices = argsort(gates)[-K:]
output = Σ gates[i] × Expert_i(token) for i in top_k_indices
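The routing formula above can be sketched in plain Python for a single token. This is a minimal illustration, not a production implementation: the toy experts, the router weights, and the names `softmax` and `moe_forward` are all assumptions. It weights expert outputs by the raw softmax scores as in the formula; some implementations instead renormalize the K selected scores so they sum to 1.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def moe_forward(token, router_weights, experts, k=2):
    """Route one token vector through its top-k experts.

    router_weights: one weight row per expert (N rows, each len(token))
    experts: list of N callables mapping a vector to a vector
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    gates = softmax(logits)
    # Indices of the k largest gate values, best first
    top_k = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    output = [0.0] * len(token)
    for i in top_k:
        expert_out = experts[i](token)  # only the selected experts run
        output = [o + gates[i] * y for o, y in zip(output, expert_out)]
    return output, top_k


# Toy experts (hypothetical): expert i scales the input by (i + 1)
experts = [lambda v, s=i + 1: [s * x for x in v] for i in range(4)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, -1.0]]
out, chosen = moe_forward([1.0, 0.5], router_weights, experts, k=2)
print(chosen, out)
```

The two unselected experts are never called, which is where the compute saving comes from.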

MoE Parameters

| Parameter | Typical Value | Effect |
|-----------|--------------|--------|
| N (total experts) | 8, 16, 64, 128 | Total model capacity |
| K (active experts per token) | 1, 2 | Compute per token |
| Expert size | Same as dense FFN | Individual expert capacity |
| Router type | Top-K softmax | Routing strategy |

Famous MoE Models

| Model | Experts (N) | Active (K) | Total Params | Active Params |
|-------|---------|--------|-------------|--------------|
| GPT-4 (estimated) | ~16 | ~2 | ~1.8T | ~220B |
| Mixtral 8×7B | 8 | 2 | ~46B | ~12.9B |
| Mixtral 8×22B | 8 | 2 | ~141B | ~39B |
| DeepSeek-V2 | 160 | 6 | 236B | 21B |
| DeepSeek-V3 | 256 | 8 | 671B | 37B |
| Grok-1 | 8 | 2 | 314B | ~85B |

Advantages

Compute Efficiency

A 46B MoE model (Mixtral 8×7B) runs at the compute cost of a ~13B dense model
Higher capacity/compute ratio than dense models
With 8 experts and K=2, each token activates only 2 experts, so 75% of the expert FFN parameters go unused for that token

Quality

At the same active parameter count, MoE models typically outperform dense models because they draw on a much larger pool of total parameters
Specialists may emerge: different experts handle different types of knowledge

Throughput at Scale

For the same quality target, MoE requires fewer FLOPs per token
Better inference throughput when all experts fit in memory

Disadvantages

Memory Requirements

All expert weights must be loaded into GPU memory, even if only 2/8 are used:

Mixtral 8×7B: ~46B params → ~92GB in fp16 for the weights alone → requires at least 2× A100 80GB
A 13B dense model with similar per-token compute would require only ~26GB
Memory ≠ compute: MoE models are compute-efficient but memory-heavy
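The memory figures above follow from multiplying parameter count by bytes per parameter. The sketch below reproduces that arithmetic; it counts weights only (no KV cache or activations), treats 1 GB as 1e9 bytes, and the helper name and dtype table are assumptions made for illustration.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(params_billions, dtype="fp16"):
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * BYTES_PER_PARAM[dtype]


for name, params in [("MoE, 46B total", 46), ("dense, 13B", 13)]:
    print(f"{name}: {weight_memory_gb(params):.0f} GB in fp16")
```

Passing `dtype="int4"` shows how quantization shrinks the footprint, which is one common way to fit an MoE model on fewer GPUs.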

Load Balancing Challenge

If all tokens route to the same 1–2 experts ("expert collapse"):

Some experts become overloaded
Others become unused (dead experts)
Fix: auxiliary load balancing loss added during training to encourage uniform routing
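One common form of the auxiliary loss, in the style used by Switch Transformer, multiplies each expert's share of routed tokens by its mean router probability and sums over experts. The sketch below assumes top-1 routing and plain Python lists; in a real training loop the same quantity would be computed on tensors, scaled by a small coefficient, and added to the main loss. The function name and the example inputs are assumptions.

```python
def load_balancing_loss(gate_probs, assignments, n_experts):
    """Auxiliary loss: n_experts * sum_i(f_i * p_i).

    gate_probs: per-token softmax rows, each of length n_experts
    assignments: per-token index of the chosen (top-1) expert
    """
    n_tokens = len(assignments)
    loss = 0.0
    for i in range(n_experts):
        f_i = sum(1 for a in assignments if a == i) / n_tokens  # token share
        p_i = sum(row[i] for row in gate_probs) / n_tokens      # mean gate prob
        loss += f_i * p_i
    return n_experts * loss


# Perfectly uniform routing vs. every token collapsing onto expert 0
uniform = load_balancing_loss([[0.25] * 4] * 4, [0, 1, 2, 3], 4)
collapsed = load_balancing_loss([[0.97, 0.01, 0.01, 0.01]] * 4, [0, 0, 0, 0], 4)
print(uniform, collapsed)
```

The loss is at its minimum of 1 under uniform routing and approaches N as routing collapses onto one expert, so minimizing it pushes the router toward spreading tokens evenly.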

Training Instability

MoE training is harder than dense:

Router can learn degenerate patterns
Requires careful initialization and loss balancing
Communication overhead in distributed training

Communication Overhead

In distributed training/inference, different experts may be on different GPUs:

Token must be "sent" to the GPU hosting its selected expert
All-to-all communication overhead at scale
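To make the dispatch cost concrete, the sketch below counts how many token copies one GPU would have to send to each device for a small batch. The expert placement, the routes, and the function name are hypothetical; real systems perform this exchange with a collective all-to-all operation rather than a Python loop.

```python
from collections import Counter


def dispatch_counts(token_experts, expert_to_gpu, source_gpu):
    """Count token copies each GPU receives from source_gpu.

    token_experts: per-token list of selected expert indices
    expert_to_gpu: hypothetical placement, expert index -> GPU id
    Returns (per-GPU counts, number of copies that leave source_gpu).
    """
    sends = Counter()
    for experts in token_experts:
        for e in experts:
            sends[expert_to_gpu[e]] += 1
    remote = sum(n for gpu, n in sends.items() if gpu != source_gpu)
    return dict(sends), remote


# 8 experts spread over 4 GPUs, two experts per GPU (assumed layout)
placement = {e: e // 2 for e in range(8)}
routes = [[0, 5], [1, 4], [5, 6], [0, 1]]  # top-2 choices for 4 tokens on GPU 0
per_gpu, remote = dispatch_counts(routes, placement, source_gpu=0)
print(per_gpu, remote)
```

In this toy batch, half of the eight token copies must cross the interconnect, and every GPU in the group is doing the same thing at once.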

Expert Specialization

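Some research reports that MoE experts develop a degree of specialization, though how strong and how interpretable it is varies by model:

Some experts activate more for code, others for natural language
Some activate for specific languages
Where it appears, specialization is emergent, not programmed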

Fine-tuning MoE Models

Standard LoRA works well for MoE:

Apply LoRA adapters to attention layers
Optionally apply to expert FFN layers
Router weights typically frozen during fine-tuning
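The choices above amount to deciding which weight matrices receive adapters. The sketch below expresses that decision as a filter over parameter names. The names and substring patterns are hypothetical, since each checkpoint uses its own naming, and this is not the API of any particular LoRA library.

```python
def plan_finetuning(param_names, adapt_experts=False):
    """Split parameter names into LoRA targets and untouched weights.

    Matching is by substring on hypothetical names; real checkpoints
    use their own naming, so the patterns below are assumptions.
    """
    targets, untouched = [], []
    for name in param_names:
        is_attention = ".attn." in name
        is_expert = ".experts." in name
        if is_attention or (adapt_experts and is_expert):
            targets.append(name)
        else:
            untouched.append(name)  # router (and experts by default) get no adapter
    return targets, untouched


names = [
    "layers.0.attn.q_proj",
    "layers.0.attn.v_proj",
    "layers.0.moe.router",
    "layers.0.moe.experts.0.up_proj",
]
targets, untouched = plan_finetuning(names)
print(targets, untouched)
```

Calling `plan_finetuning(names, adapt_experts=True)` adds the expert FFN weights to the target list while still leaving the router without an adapter.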

MoE vs. Dense: When to Choose

| Use Case | Prefer MoE | Prefer Dense |
|----------|------------|-------------|
| Quality per FLOP | MoE wins | — |
| Memory-constrained | — | Dense wins |
| Fast single-GPU inference | — | Dense wins |
| Large-scale serving | MoE wins | — |
| Fine-tuning ease | — | Dense (slight edge) |

Related Concepts

Transformer, Parameters, Scaling Laws, Inference, Fine-Tuning, Quantization