Transformers & Attention Mechanisms
The architecture that revolutionized AI: how attention mechanisms and transformers changed everything.
The Transformer Revolution
The paper "Attention Is All You Need" (Vaswani et al., 2017) introduced the transformer architecture, which has become the foundation of nearly all state-of-the-art NLP models and is expanding into vision and other domains.
Transformers replaced RNNs by relying entirely on attention mechanisms, enabling parallel processing of sequences and learning long-range dependencies more effectively.
Self-Attention Mechanism
Self-attention allows each position in a sequence to attend to all positions, computing relevance scores between every pair of tokens. This is the core innovation that makes transformers powerful.
The Math Behind Self-Attention
- Create Q, K, V matrices: Query, Key, Value from input embeddings using learned weight matrices
- Compute attention scores: Score = QK^T / √d_k (scaled dot-product)
- Apply softmax: Attention_weights = softmax(Score)
- Weighted sum: Output = Attention_weights × V
The scaling factor √d_k prevents dot products from growing too large, which would push softmax into regions with small gradients.
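To make the four steps concrete, here is a minimal NumPy sketch of scaled dot-product attention for a single sequence; the function and variable names are ours, chosen for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise relevance, (seq_len, seq_len)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors

# Toy example: 4 tokens with d_model = d_k = d_v = 8
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                                  # input embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))  # learned in practice
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)  # (4, 8)
```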
Why Self-Attention Works
- Parallel computation: Unlike RNNs, all positions are processed simultaneously
- Long-range dependencies: Direct connections between any two positions
- Dynamic attention: Attention patterns learned from data
- Interpretability: Can visualize what each token attends to
Multi-Head Attention
Instead of a single attention function, transformers use multiple attention "heads" in parallel. Each head can learn a different type of relationship: syntactic, semantic, positional, and so on.
How It Works
- Run h parallel attention mechanisms with different learned projections
- Concatenate the h outputs
- Apply final linear projection
Typical values: h=8 or h=16 heads. Each head operates on d_k = d_model/h dimensions.
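Building on the scaled_dot_product_attention sketch above, the split-attend-concatenate pattern can be written roughly as follows (the shapes, loop-based head computation, and weight matrices are illustrative; a real implementation would also batch and mask):

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, h):
    """x: (seq_len, d_model); all weight matrices: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_k = d_model // h
    Q, K, V = x @ W_q, x @ W_k, x @ W_v

    def split_heads(t):
        # (seq_len, d_model) -> (h, seq_len, d_k)
        return t.reshape(seq_len, h, d_k).transpose(1, 0, 2)

    # Run h attention heads, each over its own d_k-dimensional slice
    heads = [scaled_dot_product_attention(q, k, v)
             for q, k, v in zip(split_heads(Q), split_heads(K), split_heads(V))]
    # Concatenate the h outputs back to (seq_len, d_model), then project
    return np.concatenate(heads, axis=-1) @ W_o
```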
Transformer Architecture
Encoder Stack
Processes the input sequence; a minimal code sketch follows the list below. Each layer has:
- Multi-head self-attention
- Add & Norm (residual + layer norm)
- Feed-forward network (two linear layers with a nonlinearity in between)
- Add & Norm
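A minimal PyTorch sketch of one encoder layer, assuming the base-model dimensions from the paper (d_model = 512, 8 heads, d_ff = 2048); the class name and defaults here are ours:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention and feed-forward, each with Add & Norm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Multi-head self-attention, then Add & Norm (residual connection)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Position-wise feed-forward, then Add & Norm
        return self.norm2(x + self.ff(x))

layer = EncoderLayer()
x = torch.randn(2, 10, 512)   # (batch, seq_len, d_model)
print(layer(x).shape)         # torch.Size([2, 10, 512])
```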
Decoder Stack
Generates the output sequence autoregressively; the causal mask used in the first sub-layer is sketched after the list below. Each layer has:
- Masked multi-head self-attention
- Add & Norm
- Cross-attention to encoder output
- Add & Norm
- Feed-forward network
- Add & Norm
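The ingredient that "masked" self-attention adds is a causal mask that hides future positions. A small sketch of such a mask (the helper name is ours; the boolean convention matches PyTorch's attn_mask, where True means the position may not be attended to):

```python
import torch

def causal_mask(seq_len):
    """Boolean mask that blocks attention to future positions (True = masked)."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

print(causal_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
# Passed as attn_mask to nn.MultiheadAttention, the masked scores are set to
# -inf before the softmax, so each position only attends to itself and the past.
```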
Positional Encoding
Since transformers have no recurrence, they need explicit position information. The original transformer uses sinusoidal encodings: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
Modern variants often use learned positional embeddings or relative position encodings.
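A NumPy sketch of the sinusoidal encoding table defined by the formulas above (the function name and the max_len/d_model values are illustrative); the resulting matrix is simply added to the token embeddings:

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]        # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices 0, 2, 4, ...
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)              # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)              # odd dimensions: cosine
    return pe

pe = sinusoidal_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16)
```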
Transformer Variants
BERT (Bidirectional Encoder)
Uses only the encoder stack. Trained with masked language modeling and next-sentence prediction. Excellent for understanding tasks (classification, QA).
GPT (Decoder-Only)
Uses only the decoder stack (without cross-attention). Trained autoregressively to predict the next token. Excellent for generation tasks.
Vision Transformer (ViT)
Applies transformers to images by splitting each image into fixed-size patches, flattening them, and treating the resulting patch embeddings as tokens.
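As a rough illustration of the patching step (the helper name and sizes are ours; ViT then applies a learned linear projection to each flattened patch before the transformer layers):

```python
import torch

def image_to_patches(img, patch_size=16):
    """Split a (C, H, W) image into flattened patches of shape (n_patches, C*P*P)."""
    c, h, w = img.shape
    patches = img.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    # patches: (C, H/P, W/P, P, P) -> (n_patches, C*P*P)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch_size * patch_size)

img = torch.randn(3, 224, 224)
print(image_to_patches(img).shape)  # torch.Size([196, 768]) -> 14*14 patch tokens
```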
DETR
End-to-end object detection with transformers, removing hand-crafted components such as anchor generation and non-maximum suppression
Computational Challenges
The O(n²) Problem
Self-attention computes pairwise interactions between all tokens, resulting in O(n²) time and memory complexity. This becomes prohibitive for long sequences (thousands of tokens).
Solutions:
- Sparse attention: Longformer, BigBird - attend to subset of tokens
- Linear attention: Performers, Linformer - approximate attention in linear time
- Flash Attention: IO-aware algorithms for faster computation
- Sliding window: Local attention with a limited context window (a mask sketch follows this list)
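As one concrete example of restricting attention, a sliding-window (banded) mask can be built as below; note that a dense mask by itself still costs O(n²), so efficient implementations exploit the band structure to skip the masked entries entirely (the helper name is ours):

```python
import torch

def sliding_window_mask(seq_len, window=2):
    """True = masked. Each position may only attend to neighbors within `window`."""
    pos = torch.arange(seq_len)
    return (pos[None, :] - pos[:, None]).abs() > window

print(sliding_window_mask(6, window=1))
# Only a diagonal band of width 3 is unmasked, so the number of attended
# pairs grows as O(n * window) rather than O(n^2).
```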
Key Takeaways
- Self-attention computes relevance between all token pairs
- Transformers enable parallel processing, unlike sequential RNNs
- Multi-head attention learns diverse relationships simultaneously
- Encoder-only (BERT), decoder-only (GPT), and encoder-decoder serve different purposes
- Quadratic complexity motivates efficient attention variants