Transformers & Attention Mechanisms
The architecture that revolutionized AI: how attention mechanisms and transformers changed everything.
The Transformer Revolution
The paper "Attention Is All You Need" (Vaswani et al., 2017) introduced the transformer architecture, which has become the foundation of nearly all state-of-the-art NLP models and is expanding into vision and other domains.
Transformers replaced RNNs by relying entirely on attention mechanisms, enabling parallel processing of sequences and learning long-range dependencies more effectively.
Self-Attention Mechanism
Self-attention allows each position in a sequence to attend to all positions, computing relevance scores between every pair of tokens. This is the core innovation that makes transformers powerful.
The Math Behind Self-Attention
- Create Q, K, V matrices: Query, Key, Value from input embeddings using learned weight matrices
- Compute attention scores: Score = QK^T / √d_k (scaled dot-product)
- Apply softmax: Attention_weights = softmax(Score)
- Weighted sum: Output = Attention_weights × V
The scaling factor √d_k prevents dot products from growing too large, which would push softmax into regions with small gradients.
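To make the four steps concrete, here is a minimal NumPy sketch of scaled dot-product attention for a single sequence; the function and variable names are ours, chosen for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise relevance, (seq_len, seq_len)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors

# Toy example: 4 tokens with d_model = d_k = d_v = 8
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                                  # input embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))  # learned in practice
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)  # (4, 8)
```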
Why Self-Attention Works
- Parallel computation: Unlike RNNs, all positions are processed simultaneously
- Long-range dependencies: Direct connections between any two positions
- Dynamic attention: Attention patterns learned from data
- Interpretability: Can visualize what each token attends to
Multi-Head Attention
Instead of a single attention function, transformers use multiple attention "heads" in parallel. Each head can learn a different type of relationship: syntactic, semantic, positional, and so on.
How It Works
- Run h parallel attention mechanisms with different learned projections
- Concatenate the h outputs
- Apply final linear projection
Typical values: h=8 or h=16 heads. Each head operates on d_k = d_model/h dimensions.
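Building on the scaled_dot_product_attention sketch above, the split-attend-concatenate pattern can be written roughly as follows (the shapes, loop-based head computation, and weight matrices are illustrative; a real implementation would also batch and mask):

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, h):
    """x: (seq_len, d_model); all weight matrices: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_k = d_model // h
    Q, K, V = x @ W_q, x @ W_k, x @ W_v

    def split_heads(t):
        # (seq_len, d_model) -> (h, seq_len, d_k)
        return t.reshape(seq_len, h, d_k).transpose(1, 0, 2)

    # Run h attention heads, each over its own d_k-dimensional slice
    heads = [scaled_dot_product_attention(q, k, v)
             for q, k, v in zip(split_heads(Q), split_heads(K), split_heads(V))]
    # Concatenate the h outputs back to (seq_len, d_model), then project
    return np.concatenate(heads, axis=-1) @ W_o
```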
Transformer Architecture
Encoder Stack
Processes the input sequence; a minimal code sketch follows the list below. Each layer has:
- Multi-head self-attention
- Add & Norm (residual + layer norm)
- Feed-forward network (two linear layers with a nonlinearity in between)
- Add & Norm
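A minimal PyTorch sketch of one encoder layer, assuming the base-model dimensions from the paper (d_model = 512, 8 heads, d_ff = 2048); the class name and defaults here are ours:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention and feed-forward, each with Add & Norm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Multi-head self-attention, then Add & Norm (residual connection)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Position-wise feed-forward, then Add & Norm
        return self.norm2(x + self.ff(x))

layer = EncoderLayer()
x = torch.randn(2, 10, 512)   # (batch, seq_len, d_model)
print(layer(x).shape)         # torch.Size([2, 10, 512])
```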
Decoder Stack
Generates the output sequence autoregressively; the causal mask used in the first sub-layer is sketched after the list below. Each layer has:
- Masked multi-head self-attention
- Add & Norm
- Cross-attention to encoder output
- Add & Norm
- Feed-forward network
- Add & Norm
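The ingredient that "masked" self-attention adds is a causal mask that hides future positions. A small sketch of such a mask (the helper name is ours; the boolean convention matches PyTorch's attn_mask, where True means the position may not be attended to):

```python
import torch

def causal_mask(seq_len):
    """Boolean mask that blocks attention to future positions (True = masked)."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

print(causal_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
# Passed as attn_mask to nn.MultiheadAttention, the masked scores are set to
# -inf before the softmax, so each position only attends to itself and the past.
```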
Positional Encoding
Since transformers have no recurrence, they need explicit position information. The original transformer uses sinusoidal encodings: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
Modern variants often use learned positional embeddings or relative position encodings.
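A NumPy sketch of the sinusoidal encoding table defined by the formulas above (the function name and the max_len/d_model values are illustrative); the resulting matrix is simply added to the token embeddings:

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]        # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices 0, 2, 4, ...
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)              # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)              # odd dimensions: cosine
    return pe

pe = sinusoidal_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16)
```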
Transformer Variants
BERT (Bidirectional Encoder)
Uses only the encoder stack. Trained with masked language modeling and next-sentence prediction. Excellent for understanding tasks (classification, QA).
GPT (Decoder-Only)
Uses only the decoder stack (without cross-attention). Trained autoregressively to predict the next token. Excellent for generation tasks.
Vision Transformer (ViT)
Applies transformers to images by splitting each image into fixed-size patches, flattening them, and treating the resulting patch embeddings as tokens.
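As a rough illustration of the patching step (the helper name and sizes are ours; ViT then applies a learned linear projection to each flattened patch before the transformer layers):

```python
import torch

def image_to_patches(img, patch_size=16):
    """Split a (C, H, W) image into flattened patches of shape (n_patches, C*P*P)."""
    c, h, w = img.shape
    patches = img.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    # patches: (C, H/P, W/P, P, P) -> (n_patches, C*P*P)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch_size * patch_size)

img = torch.randn(3, 224, 224)
print(image_to_patches(img).shape)  # torch.Size([196, 768]) -> 14*14 patch tokens
```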
DETR
End-to-end object detection with transformers, removing hand-crafted components such as anchor generation and non-maximum suppression
Computational Challenges
The O(n²) Problem
Self-attention computes pairwise interactions between all tokens, resulting in O(n²) time and memory complexity. This becomes prohibitive for long sequences (thousands of tokens).
Solutions:
- Sparse attention: Longformer, BigBird - attend to subset of tokens
- Linear attention: Performers, Linformer - approximate attention in linear time
- Flash Attention: IO-aware algorithms for faster computation
- Sliding window: Local attention with a limited context window (a mask sketch follows this list)
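As one concrete example of restricting attention, a sliding-window (banded) mask can be built as below; note that a dense mask by itself still costs O(n²), so efficient implementations exploit the band structure to skip the masked entries entirely (the helper name is ours):

```python
import torch

def sliding_window_mask(seq_len, window=2):
    """True = masked. Each position may only attend to neighbors within `window`."""
    pos = torch.arange(seq_len)
    return (pos[None, :] - pos[:, None]).abs() > window

print(sliding_window_mask(6, window=1))
# Only a diagonal band of width 3 is unmasked, so the number of attended
# pairs grows as O(n * window) rather than O(n^2).
```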
Key Takeaways
- Self-attention computes relevance between all token pairs
- Transformers enable parallel processing, unlike sequential RNNs
- Multi-head attention learns diverse relationships simultaneously
- Encoder-only (BERT), decoder-only (GPT), and encoder-decoder serve different purposes
- Quadratic complexity motivates efficient attention variants