Streaming (Token Streaming) — FDE@ProdAI Blog

Definition

Streaming is the practice of delivering LLM output tokens to the user incrementally as they are generated, rather than waiting for the complete response. It is the fundamental UX pattern behind all modern LLM chat interfaces — the characteristic "typing" appearance of AI responses.

Why Streaming Matters

Without streaming, a 500-token response at 50 TPS takes 10 seconds before the user sees anything. With streaming, the first token appears in ~0.5s and the user reads as generation continues.

Without streaming: [10 second wait] → entire response appears at once

With streaming: [0.5s] → first token → token → token → token → ...

Psychologically: streaming feels dramatically faster even though total generation time is identical.
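
The arithmetic behind those numbers can be sketched in a few lines. The function name and parameters below are illustrative rather than part of any API, and the 0.5 s time to first token is simply the figure quoted above.

```python
def wait_before_first_output(num_tokens, tokens_per_second, ttft_seconds, streaming):
    """Seconds the user waits before anything appears on screen."""
    if streaming:
        # With streaming, the user sees output as soon as the first token arrives.
        return ttft_seconds
    # Without streaming, the whole response must be generated first.
    return num_tokens / tokens_per_second


blocking_wait = wait_before_first_output(500, 50, 0.5, streaming=False)
streaming_wait = wait_before_first_output(500, 50, 0.5, streaming=True)
print(f"without streaming: {blocking_wait:.1f}s, with streaming: {streaming_wait:.1f}s")
```

The total generation time is the same in both cases; only the wait before the first visible output changes.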

How Streaming Works

Server-Sent Events (SSE)

The standard HTTP-based streaming protocol for LLM APIs:

HTTP Response: Content-Type: text/event-stream

data: {"delta": {"text": "The "}}

data: {"delta": {"text": "capital "}}

data: {"delta": {"text": "of "}}

data: {"delta": {"text": "France "}}

data: {"delta": {"text": "is "}}

data: {"delta": {"text": "Paris."}}

data: [DONE]
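
A minimal sketch of how a client might consume that stream, assuming the payload shape shown above (`{"delta": {"text": ...}}`) and a `[DONE]` sentinel. Real providers use different payload shapes, and a complete SSE parser would also handle `event:` and `id:` fields and multi-line `data:` values; `iter_sse_text` is a hypothetical helper name.

```python
import json


def iter_sse_text(lines):
    """Yield text deltas from an iterable of SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel used in the example above
        event = json.loads(payload)
        yield event["delta"]["text"]


sample_lines = [
    'data: {"delta": {"text": "The "}}',
    "",
    'data: {"delta": {"text": "capital "}}',
    "",
    'data: {"delta": {"text": "of "}}',
    "",
    'data: {"delta": {"text": "France "}}',
    "",
    'data: {"delta": {"text": "is "}}',
    "",
    'data: {"delta": {"text": "Paris."}}',
    "",
    "data: [DONE]",
]
sse_text = "".join(iter_sse_text(sample_lines))
print(sse_text)
```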

WebSockets

Bidirectional streaming for interactive applications (speech, real-time collaboration).

Streaming with Major APIs

OpenAI

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Count to 10"}],
    stream=True,
) as stream:
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
```

Anthropic Claude

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Count to 10"}],
) as stream:
    # text_stream yields only the text deltas, not the raw events
    for text in stream.text_stream:
        print(text, end="", flush=True)
```

AWS Bedrock (Converse Stream)

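```python
import boto3

bedrock = boto3.client("bedrock-runtime")  # uses your configured AWS region and credentials

response = bedrock.converse_stream(
    modelId="us.anthropic.claude-sonnet-4-6",
    messages=[{"role": "user", "content": [{"text": "Count to 10"}]}],
)

for event in response["stream"]:
    if "contentBlockDelta" in event:
        # Tool-use deltas have no "text" key, so fall back to an empty string.
        print(event["contentBlockDelta"]["delta"].get("text", ""), end="", flush=True)
```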

Streaming Events Beyond Text

Modern streaming APIs send structured events, not just text chunks. The event names below are the ones used by Anthropic's Messages API:

| Event Type | Description |
|------------|-------------|
| message_start | Metadata about the response (model, id) |
| content_block_start | Beginning of a content block (text, tool_use, thinking) |
| content_block_delta | Incremental content update |
| content_block_stop | Block completed |
| message_delta | Token counts, stop reason |
| message_stop | Complete response finished |
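
To illustrate how a client might consume these events, here is a small sketch that folds a sequence of events into a final result. The events are simplified dictionaries modeled on the event types in the table, not real SDK objects, and `summarize_stream` is a hypothetical helper name.

```python
def summarize_stream(events):
    """Fold structured stream events into the final text and metadata."""
    result = {"model": None, "text": "", "stop_reason": None, "done": False}
    for event in events:
        kind = event["type"]
        if kind == "message_start":
            result["model"] = event["message"]["model"]
        elif kind == "content_block_delta":
            if event["delta"]["type"] == "text_delta":
                result["text"] += event["delta"]["text"]
        elif kind == "message_delta":
            result["stop_reason"] = event["delta"].get("stop_reason")
        elif kind == "message_stop":
            result["done"] = True
        # content_block_start / content_block_stop only mark block boundaries here
    return result


sample_events = [
    {"type": "message_start", "message": {"id": "msg_example", "model": "example-model"}},
    {"type": "content_block_start", "index": 0, "content_block": {"type": "text"}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hello, "}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "world."}},
    {"type": "content_block_stop", "index": 0},
    {"type": "message_delta", "delta": {"stop_reason": "end_turn"}},
    {"type": "message_stop"},
]
summary = summarize_stream(sample_events)
print(summary)
```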

Streaming with Tool Use

When the model calls a tool during streaming:

1. Stream: text tokens up to the tool call
2. Stream: the tool call parameters (JSON, incrementally)
3. Pause: the stream ends with a tool-use stop reason and your application executes the tool
4. Resume: send the tool result in a follow-up request and stream the rest of the response
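
Because the tool arguments arrive as JSON fragments, they can only be parsed once the block is complete. The sketch below shows one way to accumulate them, using simplified event dictionaries modeled on Anthropic-style `input_json_delta` events. The `get_weather` tool and the `collect_tool_call` helper are made up for illustration.

```python
import json


def collect_tool_call(events):
    """Return (text_so_far, tool_name, parsed_arguments) from a stream of events."""
    text_parts, json_parts, tool_name = [], [], None
    for event in events:
        if event["type"] == "content_block_start":
            block = event["content_block"]
            if block["type"] == "tool_use":
                tool_name = block["name"]
        elif event["type"] == "content_block_delta":
            delta = event["delta"]
            if delta["type"] == "text_delta":
                text_parts.append(delta["text"])
            elif delta["type"] == "input_json_delta":
                # Fragments are not valid JSON on their own; buffer them.
                json_parts.append(delta["partial_json"])
    arguments = json.loads("".join(json_parts)) if json_parts else {}
    return "".join(text_parts), tool_name, arguments


tool_events = [
    {"type": "content_block_start", "content_block": {"type": "text"}},
    {"type": "content_block_delta", "delta": {"type": "text_delta", "text": "Let me check."}},
    {"type": "content_block_stop"},
    {"type": "content_block_start", "content_block": {"type": "tool_use", "name": "get_weather"}},
    {"type": "content_block_delta", "delta": {"type": "input_json_delta", "partial_json": '{"city": '}},
    {"type": "content_block_delta", "delta": {"type": "input_json_delta", "partial_json": '"Paris"}'}},
    {"type": "content_block_stop"},
]
tool_text, tool_name, tool_args = collect_tool_call(tool_events)
print(tool_text, tool_name, tool_args)
```

Once the arguments are parsed, the application runs the tool and sends the result back in the next request.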

Building a Streaming UI

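```javascript
// React example
import { useState } from "react";

// fetchStreamFromAPI is a placeholder for your own helper
// that returns an async iterable of chunks.
function useStreamedResponse() {
  const [response, setResponse] = useState("");

  const streamResponse = async (prompt) => {
    const stream = await fetchStreamFromAPI(prompt);
    for await (const chunk of stream) {
      setResponse(prev => prev + chunk.text); // append each token
    }
  };

  return { response, streamResponse };
}
```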

Streaming Performance Metrics

| Metric | What It Measures |
|--------|-----------------|
| Time to First Token (TTFT) | Latency before streaming starts |
| Tokens per second (TPS) | Streaming speed |
| Time Between Tokens (TBT) | Smoothness; high variance feels choppy |
| Time to Last Token (TTLT) | Total response time |
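
These metrics can be derived from the arrival timestamp of each token. The sketch below is one possible definition, assuming timestamps in seconds and treating TPS as the decode rate after the first token; `streaming_metrics` and its field names are illustrative.

```python
from statistics import mean, pstdev


def streaming_metrics(request_time, token_times):
    """Compute TTFT, TTLT, TPS and TBT statistics from token arrival times."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    # Gaps between consecutive tokens (empty if only one token arrived).
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    decode_time = token_times[-1] - token_times[0]
    return {
        "ttft": token_times[0] - request_time,
        "ttlt": token_times[-1] - request_time,
        "tps": len(gaps) / decode_time if decode_time > 0 else None,
        "tbt_mean": mean(gaps) if gaps else None,
        "tbt_stdev": pstdev(gaps) if gaps else None,  # higher means choppier output
    }


metrics = streaming_metrics(request_time=0.0, token_times=[0.5, 0.75, 1.0, 1.5])
print(metrics)
```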

Streaming in Production Systems

Backend Proxy Pattern

User ← SSE ← [Your Backend] ← SSE ← [LLM API]

Your backend can:

- Validate/filter output before streaming to user
- Apply guardrails mid-stream
- Log the complete response for observability
- Add rate limiting or auth
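
The proxy's relay loop can be sketched as a generator that sits between the upstream stream and the client. This is a toy illustration under stated assumptions: `proxy_stream`, the blocked-term check, and the in-memory log are all hypothetical, and a per-chunk check like this misses terms that span two chunks.

```python
def proxy_stream(upstream, blocked_terms, log):
    """Relay chunks to the client, stopping on blocked content and logging the result."""
    relayed = []
    try:
        for chunk in upstream:
            if any(term in chunk.lower() for term in blocked_terms):
                # Guardrail tripped mid-stream: send a marker and stop relaying.
                relayed.append("[redacted]")
                yield "[redacted]"
                break
            relayed.append(chunk)
            yield chunk
    finally:
        # Runs whether the stream finished, was cut off, or the client disconnected.
        log.append("".join(relayed))


response_log = []
upstream_chunks = iter(["Hello ", "world ", "secret ", "more"])
client_chunks = list(proxy_stream(upstream_chunks, ["secret"], response_log))
print(client_chunks, response_log)
```

Logging in the `finally` clause means the record is written even when the client goes away before the response completes.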

Mid-Stream Interruption

Users may want to stop generation early:

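- Frontend sends abort signal
- Backend cancels the upstream API request
- Closing the upstream request generally stops generation, so tokens that were never produced are not billed; tokens generated before the cancellation usually still are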
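
A minimal sketch of the backend side of that flow, using a `threading.Event` as the abort signal and a generator as a stand-in for the upstream request. `fake_upstream` and `relay_until_aborted` are hypothetical; with a real SDK, closing the stream or response object plays the role of `upstream.close()`.

```python
import threading


def fake_upstream(tokens, closed_flags):
    """Stand-in for an upstream API stream; records when it gets cancelled."""
    try:
        for token in tokens:
            yield token
    finally:
        closed_flags.append(True)  # a real client would close the HTTP request here


def relay_until_aborted(upstream, abort_event, on_token):
    """Forward tokens until the abort signal is set, then cancel upstream."""
    delivered = 0
    try:
        for token in upstream:
            if abort_event.is_set():
                break
            on_token(token)
            delivered += 1
    finally:
        upstream.close()  # stop consuming so the provider can stop generating
    return delivered


abort = threading.Event()
closed = []
shown = []


def show(token):
    shown.append(token)
    if len(shown) == 2:
        abort.set()  # simulate the user clicking "Stop" after two tokens


delivered_count = relay_until_aborted(fake_upstream(["a", "b", "c", "d"], closed), abort, show)
print(delivered_count, shown, closed)
```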

Streaming vs. Batch for Different Use Cases

| Use Case | Recommendation |
|----------|---------------|
| Interactive chat | Stream always |
| Document generation (user watching) | Stream |
| Background processing (no user) | Batch (simpler code) |
| Evaluation/testing pipelines | Batch |
| Voice synthesis (TTS) | Stream (chunk audio as text arrives) |

Streaming with Extended Thinking

Reasoning models stream differently:

1. Stream thinking tokens (may be hidden from the user but shown in dev tools)

2. Stream final response tokens

3. Thinking blocks complete before text blocks start

Claude's streaming API sends separate events for thinking vs. text content blocks.
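
A sketch of how a client might route the two kinds of content to different places, for example a collapsible panel for thinking and the chat bubble for the answer. The events are simplified dictionaries modeled on Anthropic-style `thinking_delta` and `text_delta` events, and `split_thinking_and_text` is a hypothetical helper.

```python
def split_thinking_and_text(events):
    """Separate streamed thinking content from the final answer text."""
    thinking_parts, answer_parts = [], []
    for event in events:
        if event["type"] != "content_block_delta":
            continue  # block boundaries and message metadata are ignored here
        delta = event["delta"]
        if delta["type"] == "thinking_delta":
            thinking_parts.append(delta["thinking"])
        elif delta["type"] == "text_delta":
            answer_parts.append(delta["text"])
    return "".join(thinking_parts), "".join(answer_parts)


reasoning_events = [
    {"type": "content_block_start", "content_block": {"type": "thinking"}},
    {"type": "content_block_delta", "delta": {"type": "thinking_delta", "thinking": "2 + 2 "}},
    {"type": "content_block_delta", "delta": {"type": "thinking_delta", "thinking": "is 4."}},
    {"type": "content_block_stop"},
    {"type": "content_block_start", "content_block": {"type": "text"}},
    {"type": "content_block_delta", "delta": {"type": "text_delta", "text": "The answer is 4."}},
    {"type": "content_block_stop"},
]
thinking_text, answer_text = split_thinking_and_text(reasoning_events)
print(thinking_text, "|", answer_text)
```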

Related Concepts

Inference, Latency, Token, API, Reasoning Models, Tool Use, Context Window
