Definition
Latency is the time elapsed between submitting a prompt to an LLM and receiving the output. It encompasses network transmission, server-side queuing, model computation (prefill + decode), and response delivery. Latency is a critical quality-of-service metric for LLM applications.
Latency Components
```
Total Latency = Network (client→server)
              + Queue wait time
              + Prefill time (process input tokens)
              + Decode time (per-token time × N output tokens)
              + Network (server→client)
```
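The breakdown above can be turned into a small budget calculator to see which component dominates a request. This is a minimal sketch: the `LatencyBudget` class, its field names, and the example numbers are illustrative assumptions, not measurements from any real service.

```python
from dataclasses import dataclass


@dataclass
class LatencyBudget:
    """Hypothetical per-request latency budget; all times in milliseconds."""
    network_up_ms: float        # client -> server
    queue_ms: float             # waiting for a free serving slot
    prefill_ms: float           # processing the input tokens
    decode_ms_per_token: float  # time to generate one output token
    output_tokens: int          # N in the formula above
    network_down_ms: float      # server -> client

    def components(self) -> dict[str, float]:
        """Return each term of the total-latency sum by name."""
        return {
            "network_up": self.network_up_ms,
            "queue": self.queue_ms,
            "prefill": self.prefill_ms,
            "decode": self.decode_ms_per_token * self.output_tokens,
            "network_down": self.network_down_ms,
        }

    def total_ms(self) -> float:
        """Sum of all components."""
        return sum(self.components().values())

    def shares(self) -> dict[str, float]:
        """Fraction of total latency spent in each component."""
        total = self.total_ms()
        return {name: ms / total for name, ms in self.components().items()}


# Made-up example: a 200-token answer at 20 ms per token.
budget = LatencyBudget(
    network_up_ms=40.0,
    queue_ms=100.0,
    prefill_ms=300.0,
    decode_ms_per_token=20.0,
    output_tokens=200,
    network_down_ms=40.0,
)
print(budget.total_ms())                              # 4480.0
print(max(budget.shares(), key=budget.shares().get))  # decode
```

With these assumed numbers, decode accounts for roughly 89% of the total, which is why long outputs tend to dominate end-to-end latency even when the first token arrives quickly.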
Key Latency Metrics
Time to First Token (TTFT)
- Time from sending the request to receiving the first output token
- Dominated by: network RTT + queue wait + prefill computation
- Critical for perceived responsiveness in chat UIs
- Streaming lets users see output while generation continues
Time per Output Token (TPOT) / Tokens per Second (TPS)
- How fast the model generates each subsequent token
- TPOT = milliseconds per token; TPS = tokens per second = 1000 / TPOT (in ms)
- Dominated by: memory bandwidth (reading model weights + KV cache at every decode step)
- Typical: 20–150 TPS for hosted APIs
End-to-End Latency
- Total time from request to complete response
- ≈ TTFT + (output_tokens × TPOT)
- Relevant for batch jobs and non-streaming use cases
Time Between Tokens (TBT)
- Gap between consecutive output tokens; indicates how consistent generation speed is
- High variance → choppy streaming experience
Typical Latency Ranges (2024)
| Model / Service | TTFT | TPS |
|----------------|------|-----|
| GPT-4o | ~0.5–1s | 40–80 |
| Claude 3.5 Sonnet | ~0.5–1.5s | 50–100 |
| GPT-3.5 Turbo | ~0.3–0.5s | 80–150 |
| Local LLaMA 3 8B (GPU) | ~0.1–0.3s | 50–200 |
| Local LLaMA 3 70B (GPU) | ~0.3–0.8s | 15–50 |
Factors Affecting Latency
Input Side
| Factor | Effect on Latency |
|--------|------------------|
| Prompt length (tokens) | ↑ Prefill time (roughly linear; attention cost grows faster at long lengths) |
| Context window usage | ↑ Prefill time |
| Prompt / KV cache miss (no caching) | ↑ TTFT significantly |
Model Side
| Factor | Effect on Latency |
|--------|------------------|
| Model size (parameters) | ↑ Both TTFT and TPOT |
| Number of layers | ↑ TPOT |
| Attention (KV) head count | ↑ KV cache size and memory bandwidth requirements |
| Lower-precision quantization (INT8/INT4) | ↓ TPOT significantly |
Output Side
| Factor | Effect on Latency |
|--------|------------------|
| Output length (tokens) | ↑ Total latency linearly |
| max_tokens setting | Does not change per-token speed; only caps where generation stops |
Infrastructure Side
| Factor | Effect on Latency |
|--------|------------------|
| GPU memory bandwidth | ↓ TPOT when higher |
| GPU count (tensor parallel) | ↓ TPOT |
| Server load / queue depth | ↑ TTFT when high |
| Geographic region | ↑ Network RTT when far |
| Prompt caching hit | ↓ TTFT significantly |
Latency vs. Throughput Trade-off
| Optimization | Latency Effect | Throughput Effect |
|-------------|---------------|-----------------|
| Small batch size | ↓ (good for latency) | ↓ (bad for throughput) |
| Large batch size | ↑ (bad for latency) | ↑ (good for throughput) |
| Speculative decoding | ↓ Latency | ≈ Same or slightly better |
| Quantization (INT8/INT4) | ↓ Latency | ↑ Throughput |
Latency Reduction Techniques
Speculative Decoding
- A small "draft" model generates K tokens quickly
- The large "target" model verifies all K tokens in one forward pass
- Draft tokens are accepted up to the first mismatch; from that point the target's own token is used and the rest of the draft is discarded
- Net effect: typically 2–3× faster decoding at the same output quality; the gain depends on how often draft tokens are accepted
Prompt Caching
- Cache the KV states for repeated system prompts / documents
- Cache hit → most prefill work for the cached prefix is skipped (up to ~90% lower prefill cost)
- Supported by: Claude (Anthropic), GPT-4o (OpenAI prompt caching)
Quantization
- INT8 or INT4 weights → smaller memory footprint → faster memory reads
- Usually a modest quality loss for a significant speed gain
Flash Attention
- Reorders (tiles) the attention computation to be I/O efficient; the result is exact, not an approximation
- Reduces reads and writes to GPU memory during attention
- 2–4× faster attention computation, most noticeable on long sequences
Streaming
- Does not make generation faster; it makes latency feel lower
- First tokens appear almost immediately; users read while the rest generates
Smaller Models
- Use the smallest model that meets quality requirements
- Example: GPT-3.5 Turbo vs. GPT-4: several times cheaper (~5× or more) and ~3× faster
Measuring Latency
- Tools: LLM observability platforms such as LangSmith, Helicone, and Braintrust
- Custom logging: record timestamps at request, first chunk, and last chunk
- Report p50 / p95 / p99 percentiles; averages hide tail latency
- Test under realistic load (latency degrades under high concurrency)
Related Terms
- Inference, Token, Context Window, Streaming, KV Cache, Throughput, Quantization, Prompt Caching
Latency SLAs (Service Level Agreements)
Commonly cited targets by use case (rules of thumb; actual SLAs vary by product):
| Use Case | TTFT Target | TPOT Target |
|----------|-------------|-------------|
| Interactive chat | < 500ms | < 30ms |
| Autocomplete / copilot | < 200ms | < 20ms |
| Async document processing | < 5s | Any |
| Background batch jobs | No TTFT target | Throughput-focused |
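Targets like those in the table above are usually checked against tail percentiles rather than averages. The sketch below derives TTFT and TPOT from three logged timestamps per request and compares a chosen percentile with the interactive-chat targets. `RequestTiming`, `percentile`, `meets_sla`, and the sample data are hypothetical names and values chosen for illustration; the nearest-rank percentile is one of several common definitions.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Timestamps (seconds) logged for one streamed request; illustrative record."""
    sent: float          # request submitted
    first_chunk: float   # first output token received
    last_chunk: float    # final output token received
    output_tokens: int

    @property
    def ttft_ms(self) -> float:
        return (self.first_chunk - self.sent) * 1000

    @property
    def tpot_ms(self) -> float:
        # Average time per token after the first one.
        later_tokens = max(self.output_tokens - 1, 1)
        return (self.last_chunk - self.first_chunk) * 1000 / later_tokens


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_sla(timings: list[RequestTiming], ttft_target_ms: float,
              tpot_target_ms: float, pct: float = 95.0) -> bool:
    """True if both TTFT and TPOT at the given percentile are under their targets."""
    ttft = percentile([t.ttft_ms for t in timings], pct)
    tpot = percentile([t.tpot_ms for t in timings], pct)
    return ttft < ttft_target_ms and tpot < tpot_target_ms


# Made-up sample: three fast requests and one slow outlier.
timings = [
    RequestTiming(sent=0.0, first_chunk=0.25, last_chunk=2.25, output_tokens=101),
    RequestTiming(sent=0.0, first_chunk=0.30, last_chunk=2.30, output_tokens=101),
    RequestTiming(sent=0.0, first_chunk=0.40, last_chunk=3.40, output_tokens=101),
    RequestTiming(sent=0.0, first_chunk=0.90, last_chunk=5.90, output_tokens=101),
]

# Interactive chat targets from the table: TTFT < 500 ms, TPOT < 30 ms.
print(meets_sla(timings, 500, 30, pct=50.0))  # True
print(meets_sla(timings, 500, 30, pct=95.0))  # False
```

In this sample the median request meets the chat targets while the p95 does not, because a single slow request sets the tail. A real deployment would need far more samples for p95 and p99 to be meaningful.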