Multimodality

Definition

Multimodality in LLMs refers to a model's ability to process, understand, and generate content across multiple data modalities: not just text, but also images, audio, video, and structured data. A multimodal LLM (MLLM) can accept mixed inputs and reason across modalities.

Modalities in AI Models

| Modality | Description | Input Example | Output Example |
|----------|-------------|--------------|----------------|
| Text | Natural language | Prompts, documents | Responses, summaries |
| Image | Static visuals | Photos, diagrams, screenshots | Descriptions, captions |
| Audio | Sound/speech | Voice recordings | Transcriptions, speech |
| Video | Moving images | Recorded clips | Descriptions, timestamps |
| Code | Programming languages | Source files | Generated code |
| Structured data | Tables, JSON, CSV | Spreadsheets | Analysis, SQL queries |
| Documents | PDFs with layout | Business reports | Extraction, Q&A |

Input vs. Output Modalities

Not all models are symmetric. Many accept multimodal inputs but produce only text, while others generate images or audio from a text prompt:

| Capability | Examples |
|------------|---------|
| Text + Image → Text | GPT-4o, Claude 3.5 Sonnet, Gemini |
| Audio (+ Text) → Text | Whisper, GPT-4o Audio |
| Text → Image | DALL-E 3, Midjourney, Stable Diffusion |
| Text → Audio | ElevenLabs, OpenAI TTS |
| Text + Image + Video → Text | Gemini 1.5 Pro |
| Any → Any (native) | GPT-4o (approaching this) |

How Image Understanding Works in Transformers

Vision Encoder

Images are processed by a visual encoder (typically a Vision Transformer / ViT) before being fed to the language model:

1. Image → split into patches (e.g., 14×14 pixel patches)

2. Each patch → linear projection → patch embedding vector

3. Sequence of patch embeddings (plus position embeddings) → encoded by the vision encoder → passed to the LLM as visual tokens

4. LLM reasons over both text tokens and image patch tokens together
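
The patch-and-project steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular model's implementation: the 224×224 input size, the 64-dimensional embedding, the random projection matrix `w_proj`, and the helper name `patchify` are all assumptions chosen for readability. A real ViT learns the projection, adds position embeddings, and runs the result through its transformer layers.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (rows * cols, P * P * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3), dtype=np.float32)  # stand-in for a real image
patch_size, embed_dim = 14, 64                        # illustrative values

patches = patchify(image, patch_size)                 # shape (256, 588)

# Assumption: a random matrix stands in for the learned linear projection.
w_proj = rng.normal(0.0, 0.02, size=(patches.shape[1], embed_dim)).astype(np.float32)
patch_embeddings = patches @ w_proj                   # shape (256, 64)

print(patches.shape, patch_embeddings.shape)
```

With a 224×224 image and 14×14 patches, the image becomes 16 × 16 = 256 patch tokens, each a flattened vector of 14 × 14 × 3 = 588 values before projection.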

Connector / Projection Layer

A projection layer maps vision encoder outputs to the LLM's embedding dimension:

```
Image → [Vision Encoder] → visual features → [Projection MLP] → LLM embedding space → [LLM]
```
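
The connector in this pipeline can be sketched as a small MLP applied to each visual feature vector. The example below is a hedged illustration: the two-layer design with a GELU activation is one common choice rather than a universal one, and the dimensions, random weights, and the function name `project_visual_features` are assumptions made for the sketch.

```python
import numpy as np


def gelu(x: np.ndarray) -> np.ndarray:
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def project_visual_features(features, w1, b1, w2, b2):
    """Two-layer MLP: vision feature dim -> hidden dim -> LLM embedding dim."""
    hidden = gelu(features @ w1 + b1)
    return hidden @ w2 + b2


rng = np.random.default_rng(0)
num_patches, vision_dim, llm_dim = 256, 64, 128  # illustrative sizes only

# Stand-ins for vision encoder outputs and learned connector weights.
visual_features = rng.normal(size=(num_patches, vision_dim))
w1 = rng.normal(scale=0.02, size=(vision_dim, llm_dim))
b1 = np.zeros(llm_dim)
w2 = rng.normal(scale=0.02, size=(llm_dim, llm_dim))
b2 = np.zeros(llm_dim)

visual_tokens = project_visual_features(visual_features, w1, b1, w2, b2)

# Stand-in for the embedded text prompt (12 tokens).
text_tokens = rng.normal(size=(12, llm_dim))

# The LLM sees one sequence: visual tokens followed by text tokens.
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(llm_input.shape)
```

After projection, visual tokens and text tokens share the same dimensionality, so the LLM can attend over them as a single sequence. Because the connector is small, it is often the first component trained when pairing an existing vision encoder with an existing LLM.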

End-to-End Training

Modern multimodal models are trained end-to-end or fine-tuned on (image, text) paired datasets:

  • Image captioning datasets (COCO, CC12M)
  • Visual QA datasets (VQA, GQA, ScienceQA)
  • Document understanding datasets (DocVQA)
  • Interleaved image-text web data (MMC4)

Multimodal Tasks

| Task | Input | Output |
|------|-------|--------|
| Image captioning | Image | Text description |
| Visual QA | Image + Question | Text answer |
| OCR / Document Q&A | Document image | Extracted text/answers |
| Chart/diagram analysis | Chart image | Data interpretation |
| Code from screenshot | UI screenshot | HTML/CSS/code |
| Medical image analysis | X-ray/MRI | Clinical description |
| Video understanding | Video frames | Summary/events |
| Audio transcription | Audio file | Text transcript |

Leading Multimodal Models (2024–2025)

| Model | Modalities | Notes |
|-------|-----------|-------|
| GPT-4o | Text, Image, Audio | Native multimodal, real-time |
| Claude 3.5 Sonnet | Text, Image, PDF | Strong document understanding |
| Gemini 1.5 Pro | Text, Image, Audio, Video | 1M token context, video native |
| LLaVA / LLaVA-1.6 | Text, Image | Open-source vision model |
| Qwen2-VL | Text, Image, Video | Strong open-source option |
| Pixtral | Text, Image | Mistral's vision model |

Audio Multimodality

Speech-to-Text (ASR)

  • Whisper (OpenAI): industry-standard open-source ASR
  • Deepgram, AssemblyAI: API-based transcription

Text-to-Speech (TTS)

  • OpenAI TTS, ElevenLabs, Azure Cognitive Services

Native Audio LLMs

  • GPT-4o Audio: processes raw audio natively (not speech-to-text first)
  • Gemini: native audio understanding

Multimodal Challenges

| Challenge | Description |
|-----------|-------------|
| Hallucination on images | Model describes objects not present in image |
| Cultural/context bias | Images from underrepresented contexts misunderstood |
| Small text in images | OCR quality degrades at small font sizes |
| Complex charts | Mathematical/scientific charts require specialized training |
| Long video | Processing many frames within context limits |
| Audio with noise | Background noise degrades transcription quality |
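
The long-video challenge comes down to arithmetic: every sampled frame costs some number of visual tokens, and the total must fit in the context window. A minimal sketch of uniform frame sampling under a token budget follows. The function name `sample_frame_indices` and the example numbers (256 tokens per frame, an 8,192-token budget) are illustrative assumptions, not values from any specific model.

```python
def sample_frame_indices(total_frames: int, tokens_per_frame: int, token_budget: int) -> list[int]:
    """Pick evenly spaced frame indices so the sampled frames fit the token budget."""
    if total_frames <= 0 or tokens_per_frame <= 0:
        raise ValueError("total_frames and tokens_per_frame must be positive")

    # Largest number of frames whose visual tokens fit in the budget.
    max_frames = min(total_frames, token_budget // tokens_per_frame)
    if max_frames <= 0:
        return []
    if max_frames == 1:
        return [total_frames // 2]  # a single frame: take the middle one

    # Integer arithmetic spreads indices from the first frame to the last.
    last = total_frames - 1
    return [(i * last) // (max_frames - 1) for i in range(max_frames)]


# A 60-second clip at 30 fps has 1,800 frames.
indices = sample_frame_indices(total_frames=1800, tokens_per_frame=256, token_budget=8192)
print(len(indices), indices[0], indices[-1])
```

Uniform sampling is the simplest policy. It can miss short events between sampled frames, which is one reason long video remains difficult even with large context windows.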

Multimodal RAG

Extend RAG to handle images and mixed documents:

  • Index document pages as images + extracted text
  • Retrieve relevant page images
  • Feed retrieved images + query to multimodal LLM
  • Particularly powerful for PDFs with charts, tables, diagrams
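
The index, retrieve, and feed flow above can be sketched with the standard library alone. Everything here is a simplified assumption: `toy_embed` is a bag-of-words stand-in for a real text or multimodal embedding model, the page paths are hypothetical, and the final `request` dictionary only illustrates what would be handed to a multimodal LLM rather than matching any real API.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Page:
    image_path: str      # rendered page image (hypothetical path)
    extracted_text: str  # OCR or parser output for the same page


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(pages: list[Page], query: str, k: int = 2) -> list[Page]:
    """Return the k pages whose extracted text is most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(pages, key=lambda p: cosine(toy_embed(p.extracted_text), q), reverse=True)
    return ranked[:k]


# 1. Index: each page keeps both its image and its extracted text.
pages = [
    Page("report/page_01.png", "executive summary of annual results"),
    Page("report/page_07.png", "quarterly revenue chart by region"),
    Page("report/page_12.png", "appendix with legal notices"),
]

# 2. Retrieve the most relevant page for the query.
query = "revenue by region chart"
top = retrieve(pages, query, k=1)

# 3. Feed the retrieved page images plus the query to a multimodal LLM.
request = {"question": query, "images": [p.image_path for p in top]}
print(request)
```

Retrieval in this sketch runs over extracted text while the model receives the page image, so charts and tables reach the LLM with their layout intact.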

Practical Use Cases

| Industry | Use Case | Modalities |
|----------|---------|-----------|
| Healthcare | Medical report analysis | Image + Text |
| Finance | Chart interpretation from reports | Image + Text |
| Retail | Product image search + description | Image → Text |
| Legal | Contract image OCR + analysis | Image + Text |
| Education | Diagram explanation | Image + Text |
| Accessibility | Image-to-audio description | Image → Text → Audio |

Related Concepts

  • LLM, Embeddings, Vision Transformer, RAG, Inference, Token, Fine-Tuning
