Definition
Embeddings are dense numerical vectors that represent tokens (or sentences, documents, images) in a continuous high-dimensional space. They encode semantic meaning so that similar concepts are geometrically close to each other.
Why Embeddings?
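- Models operate on numbers, so raw text must first be converted into a numerical form
- One-hot encoding (for example, a 10,000-dim sparse vector for a 10,000-word vocabulary) is inefficient and captures no meaning: every pair of words is equally far apart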
- Embeddings are compact (256–4096 dims) and encode rich semantic relationships
- Semantic similarity: "king" and "queen" are close; "king" and "car" are far
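- Arithmetic: king - man + woman ≈ queen (the famous Word2Vec example)
- Contextual vs. static: whether a word always gets one fixed vector or a vector that depends on its surrounding context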
- Absolute (sinusoidal) positions: used in the original Transformer, fixed sine/cosine patterns
- Learned absolute positions: trained position vectors (BERT, GPT-2)
- Relative positions (RoPE): Rotary Position Embedding, encodes relative distance; used by LLaMA, Mistral, GPT-NeoX
- ALiBi: adds a linear bias to attention scores based on token distance; good for length generalization
- The full embedding space is also called latent space
- High-dimensional geometry encodes language structure
- Nearest-neighbor search in this space = semantic search
- Dimensionality reduction (t-SNE, UMAP) is used to visualize clusters in 2D or 3D
- Semantic search: embed query + documents → cosine similarity → retrieve relevant docs
- RAG (Retrieval-Augmented Generation): store doc embeddings in vector DB, retrieve at query time
- Clustering: group similar documents
- Classification: feed embedding into a classifier head
- Anomaly detection: find outliers in embedding space
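The semantic search and RAG retrieval steps above can be sketched in a few lines. This is a minimal illustration, not a production retriever: the 3-dim vectors, the document titles, and the `search` helper are all made up for the example, and a real system would obtain embeddings from a model and store them in a vector database.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dim document embeddings (real ones have hundreds of dims).
doc_embeddings = {
    "royal family history": [0.9, 0.1, 0.0],
    "used car buying guide": [0.0, 0.2, 0.9],
    "medieval monarchs": [0.8, 0.3, 0.1],
}

def search(query_embedding, k=2):
    """Return the k documents most similar to the query, best first."""
    scored = [(cosine(query_embedding, vec), doc)
              for doc, vec in doc_embeddings.items()]
    scored.sort(reverse=True)
    return [doc for _, doc in scored[:k]]

query = [1.0, 0.2, 0.0]  # made-up embedding for a query about kings
print(search(query))     # ['royal family history', 'medieval monarchs']
```

In a RAG pipeline, the returned documents would then be placed in the prompt of a generative model.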
Example embedding models:
- text-embedding-3-large (OpenAI): 3072 dims
- amazon.titan-embed-text-v2 (AWS Bedrock)
- all-MiniLM-L6-v2 (HuggingFace/SentenceTransformers): lightweight
- nomic-embed-text: open source, long context
- Related terms: Token, Tokenization, Latent Space, Attention, RAG, Vector Database
How Token Embeddings Work
1. Each token ID maps to a row in an embedding matrix (shape: vocab_size × embed_dim)
2. In every forward pass (training and inference alike), the model does a simple lookup: token_id → embedding vector
3. This embedding matrix is learned during pre-training
4. The same matrix is often used (transposed) at the output layer to predict the next token (weight tying)
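The four steps above can be sketched with plain Python lists. This is a toy illustration under stated assumptions: the sizes, the random "learned" matrix, the token IDs, and the helper names `embed` and `tied_logits` are hypothetical, and a real model would use a tensor library and a trained matrix.

```python
import random

random.seed(0)

VOCAB_SIZE, EMBED_DIM = 10, 4  # toy sizes, chosen only for illustration

# Stand-in for a learned embedding matrix: one row per token ID.
embedding_matrix = [
    [random.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Steps 1-2: the lookup is plain row indexing."""
    return [embedding_matrix[t] for t in token_ids]

def tied_logits(hidden_state):
    """Step 4: score every vocabulary row against a hidden state,
    which is equivalent to multiplying by the transposed matrix."""
    return [sum(h * w for h, w in zip(hidden_state, row))
            for row in embedding_matrix]

token_ids = [3, 7, 3]              # hypothetical output of a tokenizer
vectors = embed(token_ids)
logits = tied_logits(vectors[-1])  # last vector stands in for a final hidden state

print(len(vectors), len(vectors[0]), len(logits))  # 3 4 10
```

Note that both occurrences of token 3 map to the same row, which is why this layer alone yields static, context-free vectors.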
Dimensions (Typical Values by Model Size)
| Model Size | Embedding Dimension |
|------------|-------------------|
| Small (125M) | 768 |
| Medium (1.3B) | 2048 |
| Large (7B) | 4096 |
| XL (70B+) | 8192 |
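To get a feel for what these dimensions cost, the size of the embedding matrix (vocab_size × embed_dim) can be computed directly. The vocabulary size of 32,000 and the 2 bytes per parameter below are assumptions for illustration, not values from the table.

```python
def embedding_parameters(vocab_size, embed_dim):
    """Number of entries in a vocab_size x embed_dim embedding matrix."""
    return vocab_size * embed_dim

VOCAB_SIZE = 32_000  # assumed vocabulary size, not taken from the table

for label, dim in [("Small", 768), ("Medium", 2048), ("Large", 4096), ("XL", 8192)]:
    params = embedding_parameters(VOCAB_SIZE, dim)
    megabytes = params * 2 / 1e6  # 2 bytes per parameter, assuming 16-bit weights
    print(f"{label}: {params:,} parameters, about {megabytes:.0f} MB")
```

Under these assumptions the 4096-dim row works out to roughly 131 million embedding parameters.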
Properties of Good Embeddings
- Static (Word2Vec, GloVe): one fixed vector per word regardless of context
- Contextual (BERT, GPT): each occurrence gets a different vector based on surrounding tokens
Types of Embeddings
| Type | Description | Use Case |
|------|-------------|----------|
| Token embeddings | Per-token lookup vectors | Input to every transformer |
| Positional embeddings | Encode position in sequence | Combined with token embeddings |
| Sentence embeddings | Single vector for entire sentence | Semantic search, RAG |
| Image embeddings | Encode visual content | Multimodal models |
| Document embeddings | Encode entire documents | Long-doc retrieval |
Positional Embeddings
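Transformers have no inherent notion of order, because all tokens are processed in parallel. Positional embeddings supply the missing position information, either by adding a position vector to each token embedding or by adjusting attention scores.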
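The fixed sine/cosine scheme from the original Transformer is the simplest one to sketch. The version below assumes an even embedding dimension and the conventional base of 10000; the helper names `sinusoidal_position` and `add_positions` are hypothetical.

```python
import math

def sinusoidal_position(pos, embed_dim):
    """Fixed sine/cosine position vector (embed_dim assumed even)."""
    vec = []
    for i in range(0, embed_dim, 2):
        angle = pos / (10000 ** (i / embed_dim))
        vec.extend([math.sin(angle), math.cos(angle)])
    return vec

def add_positions(token_vectors):
    """Add each position's vector to the token embedding at that position."""
    dim = len(token_vectors[0])
    return [
        [t + p for t, p in zip(tok, sinusoidal_position(pos, dim))]
        for pos, tok in enumerate(token_vectors)
    ]

# Toy 4-dim token embeddings, all zeros so the position signal is visible.
tokens = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
print(add_positions(tokens)[0])  # position 0: [0.0, 1.0, 0.0, 1.0]
```

Because the two identical token vectors end up different after the addition, the model can tell the first position from the second.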
Embedding Space (Latent Space)
Practical Uses of Embeddings
Similarity Metrics
| Metric | Formula | Notes |
|--------|---------|-------|
| Cosine similarity | cos(θ) = A·B / (‖A‖ ‖B‖) | Most common, direction-based |
| Dot product | A·B | Fast, used in attention |
| Euclidean distance | √(Σ(aᵢ − bᵢ)²) | Magnitude-sensitive |
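The three metrics in the table can be written out directly. This is a minimal sketch with plain Python lists; the example vectors are made up, chosen so that `b` points in the same direction as `a` but is twice as long.

```python
import math

def dot(a, b):
    """Dot product A·B."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Vector length ‖A‖."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """A·B / (‖A‖ ‖B‖): depends only on direction."""
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the length

print(cosine_similarity(a, b))   # 1.0 up to floating-point rounding
print(dot(a, b))                 # 28.0
print(euclidean_distance(a, b))  # sqrt(14), about 3.742
```

The example shows why the choice of metric matters: cosine similarity treats the two vectors as identical, while the dot product and Euclidean distance both react to the difference in magnitude.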