Streaming (Token Streaming)

Definition

Streaming is the practice of delivering LLM output tokens to the user incrementally as they are generated, rather than waiting for the complete response. It is the fundamental UX pattern behind all modern LLM chat interfaces — the characteristic "typing" appearance of AI responses.

Why Streaming Matters

Without streaming, a 500-token response at 50 TPS takes 10 seconds before the user sees anything. With streaming, the first token appears in ~0.5s and the user reads as generation continues.

```
Without streaming: [10 second wait] → entire response appears at once
With streaming:    [0.5s] → first token → token → token → token → ...
```

Streaming feels dramatically faster to the user even though the total generation time is identical.
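
The arithmetic behind that comparison can be checked with a short sketch. The numbers are the ones used above (500 tokens, 50 TPS, roughly 0.5 s to the first token); the variable names are illustrative only.

```python
# Perceived wait with and without streaming, using the figures from the text.
response_tokens = 500      # length of the response
tokens_per_second = 50     # generation speed (TPS)
time_to_first_token = 0.5  # seconds until the first token arrives (approximate)

total_generation_time = response_tokens / tokens_per_second  # 10.0 seconds

# Without streaming the user sees nothing until generation finishes.
wait_without_streaming = total_generation_time
# With streaming the user starts reading after the first token.
wait_with_streaming = time_to_first_token

print(f"Without streaming: {wait_without_streaming:.1f}s before any text")
print(f"With streaming:    {wait_with_streaming:.1f}s before any text")
```

The total time is the same in both cases; only the time before the user sees something changes.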

How Streaming Works

Server-Sent Events (SSE)

The standard HTTP-based streaming protocol for LLM APIs:

```
HTTP Response: Content-Type: text/event-stream

data: {"delta": {"text": "The "}}

data: {"delta": {"text": "capital "}}

data: {"delta": {"text": "of "}}

data: {"delta": {"text": "France "}}

data: {"delta": {"text": "is "}}

data: {"delta": {"text": "Paris."}}

data: [DONE]
```
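
As a rough illustration of how a client might consume such a stream, here is a minimal sketch of an SSE line parser. It assumes the simplified payload shape shown above (one JSON object per `data:` line, with the text under `delta.text`, and a `[DONE]` sentinel); real providers use their own payload shapes, and the full SSE format also allows multi-line `data:` fields, which this sketch ignores. The function name `iter_sse_text` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_sse_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from SSE lines shaped like the example above."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel used in the example
        event = json.loads(payload)
        yield event["delta"]["text"]


# Hand-written sample mirroring the stream shown above.
sample_lines = [
    'data: {"delta": {"text": "The "}}', "",
    'data: {"delta": {"text": "capital "}}', "",
    'data: {"delta": {"text": "of "}}', "",
    'data: {"delta": {"text": "France "}}', "",
    'data: {"delta": {"text": "is "}}', "",
    'data: {"delta": {"text": "Paris."}}', "",
    "data: [DONE]", "",
]

full_text = "".join(iter_sse_text(sample_lines))
print(full_text)
```

In a real client the `lines` argument would come from the HTTP response body read line by line rather than from a list.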

WebSockets

WebSockets provide bidirectional streaming for interactive applications such as speech and real-time collaboration.

Streaming with Major APIs

OpenAI

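```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# stream=True returns an iterator of chunks instead of one complete response
with client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Count to 10"}],
    stream=True,
) as stream:
    for chunk in stream:
        # Some chunks carry no choices or no text (e.g. role or finish markers)
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
```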

Anthropic Claude

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Count to 10"}],
) as stream:
    # text_stream yields only the text fragments, hiding the raw events
    for text in stream.text_stream:
        print(text, end="", flush=True)
```

AWS Bedrock (Converse Stream)

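```python
import boto3

# Assumes AWS credentials and a default region are already configured
bedrock = boto3.client("bedrock-runtime")

response = bedrock.converse_stream(
    modelId="us.anthropic.claude-sonnet-4-6",
    messages=[{"role": "user", "content": [{"text": "Count to 10"}]}],
)

for event in response["stream"]:
    if "contentBlockDelta" in event:
        # Deltas can also carry tool-use input, so read "text" defensively
        text = event["contentBlockDelta"]["delta"].get("text", "")
        print(text, end="", flush=True)
```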

Streaming Events Beyond Text

Modern streaming APIs send structured events, not just text chunks. Anthropic's Messages API, for example, emits the following event types:

| Event Type | Description |
|------------|-------------|
| message_start | Metadata about the response (model, id) |
| content_block_start | Beginning of a content block (text, tool_use, thinking) |
| content_block_delta | Incremental content update |
| content_block_stop | Block completed |
| message_delta | Token counts, stop reason |
| message_stop | Complete response finished |
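
A consumer typically dispatches on the event type. The sketch below shows one way to do that over a list of hand-written, simplified event dictionaries modeled on the table above; the field layout is an assumption for illustration, and `consume_events` and `"example-model"` are hypothetical names, not part of any SDK.

```python
from typing import Any, Dict, Iterable


def consume_events(events: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Assemble text and metadata from a sequence of structured stream events."""
    text_parts = []
    summary: Dict[str, Any] = {"model": None, "stop_reason": None, "output_tokens": None}
    for event in events:
        kind = event["type"]
        if kind == "message_start":
            summary["model"] = event["message"]["model"]
        elif kind == "content_block_delta":
            delta = event["delta"]
            if delta.get("type") == "text_delta":
                text_parts.append(delta["text"])  # incremental text update
        elif kind == "message_delta":
            summary["stop_reason"] = event["delta"].get("stop_reason")
            summary["output_tokens"] = event.get("usage", {}).get("output_tokens")
        elif kind == "message_stop":
            break  # the response is complete
        # content_block_start / content_block_stop need no action for plain text
    summary["text"] = "".join(text_parts)
    return summary


# Simplified sample events (assumed shapes, for illustration only).
sample_events = [
    {"type": "message_start", "message": {"id": "msg_example", "model": "example-model"}},
    {"type": "content_block_start", "index": 0, "content_block": {"type": "text", "text": ""}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hello"}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": ", world"}},
    {"type": "content_block_stop", "index": 0},
    {"type": "message_delta", "delta": {"stop_reason": "end_turn"}, "usage": {"output_tokens": 3}},
    {"type": "message_stop"},
]

result = consume_events(sample_events)
print(result["text"], result["stop_reason"])
```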

Streaming with Tool Use

When the model calls a tool during streaming:

1. Stream: text tokens up to the tool call

2. Stream: the tool call parameters (JSON, incrementally)

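3. Pause: the streamed response ends with a tool-use stop reason, and the developer's code executes the tool

4. Resume: the tool result is sent back in a follow-up request, and the model streams the rest of the response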
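
Step 2 is the part that usually needs care: the tool arguments arrive as JSON fragments that are not valid JSON until the block is complete. The following sketch buffers those fragments and parses them once the block closes. The event shapes are simplified, hand-written stand-ins modeled on Anthropic-style events, and `collect_tool_calls`, `get_weather`, and the ids are hypothetical.

```python
import json
from typing import Any, Dict, Iterable, List


def collect_tool_calls(events: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Buffer streamed tool-call JSON per content block and parse it on block stop."""
    pending: Dict[int, Dict[str, Any]] = {}  # block index -> partially received call
    calls: List[Dict[str, Any]] = []
    for event in events:
        kind = event["type"]
        if kind == "content_block_start" and event["content_block"]["type"] == "tool_use":
            block = event["content_block"]
            pending[event["index"]] = {"id": block["id"], "name": block["name"], "parts": []}
        elif kind == "content_block_delta" and event["delta"].get("type") == "input_json_delta":
            pending[event["index"]]["parts"].append(event["delta"]["partial_json"])
        elif kind == "content_block_stop" and event["index"] in pending:
            call = pending.pop(event["index"])
            raw = "".join(call["parts"])
            # Only now is the accumulated string guaranteed to be complete JSON.
            calls.append({"id": call["id"], "name": call["name"],
                          "input": json.loads(raw) if raw else {}})
    return calls


# Simplified sample: the arguments arrive split mid-string across two deltas.
tool_events = [
    {"type": "content_block_start", "index": 1,
     "content_block": {"type": "tool_use", "id": "tool_example", "name": "get_weather"}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": '{"city": "Pa'}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": 'ris"}'}},
    {"type": "content_block_stop", "index": 1},
]

tool_calls = collect_tool_calls(tool_events)
print(tool_calls)
```

After this point the application would run the named tool with the parsed input and send the result back in the next request.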

Building a Streaming UI

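```javascript
// React example
import { useState } from "react";

function StreamingAnswer() {
  const [response, setResponse] = useState("");

  const streamResponse = async (prompt) => {
    // fetchStreamFromAPI is a placeholder for your own helper that returns
    // an async iterable of chunks shaped like { text: "..." }
    const stream = await fetchStreamFromAPI(prompt);
    for await (const chunk of stream) {
      setResponse((prev) => prev + chunk.text); // append each token
    }
  };

  return <button onClick={() => streamResponse("Count to 10")}>{response || "Ask"}</button>;
}
```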

Streaming Performance Metrics

| Metric | What It Measures |
|--------|------------------|
| Time to First Token (TTFT) | Latency before streaming starts |
| Tokens per second (TPS) | Streaming speed |
| Time Between Tokens (TBT) | Smoothness; high variance feels choppy |
| Time to Last Token (TTLT) | Total response time |
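
These metrics can all be derived from the request time and the arrival time of each token. The sketch below is one possible way to compute them; it assumes TPS is measured over the window after the first token, which is a definitional choice and not the only one in use. The function name and the sample timestamps are made up for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, List


def streaming_metrics(request_time: float, token_times: List[float]) -> Dict[str, float]:
    """Compute TTFT, TPS, TBT statistics, and TTLT from token arrival timestamps."""
    ttft = token_times[0] - request_time
    ttlt = token_times[-1] - request_time
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    decode_window = token_times[-1] - token_times[0]
    return {
        "ttft": ttft,
        "ttlt": ttlt,
        # Tokens after the first, divided by the time spent streaming them.
        "tps": (len(token_times) - 1) / decode_window if decode_window > 0 else 0.0,
        "tbt_mean": mean(gaps) if gaps else 0.0,
        # A large spread relative to the mean indicates choppy delivery.
        "tbt_stdev": pstdev(gaps) if gaps else 0.0,
    }


# Made-up timestamps in seconds: request at t=0, four tokens arriving afterwards.
metrics = streaming_metrics(0.0, [0.5, 0.75, 1.0, 1.5])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```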

Streaming in Production Systems

Backend Proxy Pattern

```
User ← SSE ← [Your Backend] ← SSE ← [LLM API]
```

Your backend can:

  • Validate/filter output before streaming to user
  • Apply guardrails mid-stream
  • Log the complete response for observability
  • Add rate limiting or auth
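
A minimal sketch of the proxy idea, written as a plain Python generator so it is independent of any web framework. It relays chunks, applies a mid-stream guardrail, and logs the complete response. The names `proxy_stream`, `is_allowed`, and `log` are hypothetical, and the guardrail here is a toy substring check standing in for whatever policy you actually apply.

```python
from typing import Callable, Iterable, Iterator, List

STOP_NOTICE = "[response stopped by guardrail]"


def proxy_stream(
    upstream: Iterable[str],
    is_allowed: Callable[[str], bool],
    log: List[str],
) -> Iterator[str]:
    """Relay upstream chunks to the user, filtering mid-stream and logging the result."""
    sent: List[str] = []
    try:
        for chunk in upstream:
            # Check the text so far plus the new chunk before forwarding it.
            if not is_allowed("".join(sent) + chunk):
                sent.append(STOP_NOTICE)
                yield STOP_NOTICE
                break
            sent.append(chunk)
            yield chunk
    finally:
        # Log what was actually delivered, even if the stream ended early.
        log.append("".join(sent))


# Toy run: the guardrail blocks any response containing "1234".
response_log: List[str] = []
delivered = list(
    proxy_stream(
        ["The ", "secret ", "is ", "1234"],
        is_allowed=lambda text: "1234" not in text,
        log=response_log,
    )
)
print(delivered)
```

Checking the accumulated text rather than each chunk in isolation matters because a blocked phrase can be split across chunk boundaries.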

Mid-Stream Interruption

Users may want to stop generation early:

  • Frontend sends abort signal
  • Backend cancels the upstream API request
  • Most API providers stop generating when the request is cancelled, which limits further billing; tokens already generated are typically still charged
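
The backend half of this flow can be sketched with a relay that watches an abort flag and closes the upstream iterator when it is set. This is an illustration using a `threading.Event` and a fake upstream generator; `relay_until_aborted` is a hypothetical name, and a real backend would close the provider's HTTP stream instead of a local generator.

```python
import threading
from typing import Iterator, List


def relay_until_aborted(upstream: Iterator[str], abort: threading.Event) -> Iterator[str]:
    """Forward chunks until the abort flag is set, then close the upstream stream."""
    try:
        for chunk in upstream:
            if abort.is_set():
                break  # drop this chunk and stop relaying
            yield chunk
    finally:
        # Closing the upstream is what cancels the underlying request.
        close = getattr(upstream, "close", None)
        if close is not None:
            close()


upstream_closed: List[bool] = []


def fake_upstream() -> Iterator[str]:
    """Stand-in for a provider stream; records when it gets closed."""
    try:
        for word in ["one ", "two ", "three ", "four "]:
            yield word
    finally:
        upstream_closed.append(True)


abort_signal = threading.Event()
received: List[str] = []
for chunk in relay_until_aborted(fake_upstream(), abort_signal):
    received.append(chunk)
    if len(received) == 2:
        abort_signal.set()  # simulates the frontend's abort arriving

print(received, upstream_closed)
```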

Streaming vs. Batch for Different Use Cases

| Use Case | Recommendation |
|----------|----------------|
| Interactive chat | Stream always |
| Document generation (user watching) | Stream |
| Background processing (no user) | Batch (simpler code) |
| Evaluation/testing pipelines | Batch |
| Voice synthesis (TTS) | Stream (chunk audio as text arrives) |
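
For the voice synthesis row, "chunk audio as text arrives" usually means buffering streamed text until a sentence boundary and handing each complete sentence to the TTS engine. Below is a rough sketch of that buffering step only; the boundary rule (punctuation followed by whitespace) is a simplifying assumption, and `sentences_from_stream` is a hypothetical helper.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace (a deliberately simple rule).
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Group streamed text fragments into complete sentences."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_BOUNDARY.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends


spoken_units = list(sentences_from_stream(["Hello wor", "ld. How ", "are you? I am", " fine."]))
print(spoken_units)
```

Each yielded sentence could then be sent to a TTS engine while the model is still generating the next one.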

Streaming with Extended Thinking

Reasoning models stream differently:

1. Stream thinking tokens (may be hidden from the user, shown in dev tools)

2. Stream final response tokens

3. Thinking blocks complete before text blocks start

Claude's streaming API sends separate events for thinking vs. text content blocks.
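
Because thinking and text arrive as different delta types, a client can route them to different destinations, for example a collapsed "reasoning" panel and the main answer. The sketch below shows that routing over simplified, hand-written events modeled on Claude-style deltas; the exact field layout is an assumption for illustration, and `split_thinking_and_text` is a hypothetical name.

```python
from typing import Any, Dict, Iterable, Tuple


def split_thinking_and_text(events: Iterable[Dict[str, Any]]) -> Tuple[str, str]:
    """Separate streamed thinking content from the final response text."""
    thinking_parts = []
    text_parts = []
    for event in events:
        if event["type"] != "content_block_delta":
            continue
        delta = event["delta"]
        if delta.get("type") == "thinking_delta":
            thinking_parts.append(delta["thinking"])  # e.g. show only in dev tools
        elif delta.get("type") == "text_delta":
            text_parts.append(delta["text"])  # shown to the user
    return "".join(thinking_parts), "".join(text_parts)


# Simplified sample: the thinking block (index 0) finishes before the text block (index 1).
thinking_events = [
    {"type": "content_block_start", "index": 0, "content_block": {"type": "thinking"}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "thinking_delta", "thinking": "2 + 2 "}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "thinking_delta", "thinking": "is 4."}},
    {"type": "content_block_stop", "index": 0},
    {"type": "content_block_start", "index": 1, "content_block": {"type": "text"}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "text_delta", "text": "The answer is 4."}},
    {"type": "content_block_stop", "index": 1},
]

thinking, answer = split_thinking_and_text(thinking_events)
print(answer)
```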

Related Concepts

Inference, Latency, Token, API, Reasoning Models, Tool Use, Context Window
