Training & Optimization
How neural networks learn: backpropagation, gradient descent, and optimization techniques that make deep learning possible.
Backpropagation
Backpropagation is the algorithm that enables neural networks to learn. It efficiently computes gradients of the loss function with respect to all parameters using the chain rule of calculus. These gradients tell us how to adjust parameters to reduce the loss.
The Algorithm
- Forward pass: Compute predictions and loss
- Backward pass: Compute gradients starting from output layer
- Apply chain rule: ∂L/∂wᵢ = (∂L/∂y) × (∂y/∂z) × (∂z/∂wᵢ), where z is the neuron's pre-activation and y its output
- Update parameters: wᵢ = wᵢ - α × ∂L/∂wᵢ
The beauty of backpropagation is its efficiency: computing gradients for all parameters requires just one forward and one backward pass.
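A minimal sketch of these steps, traced by hand for a single linear neuron with a squared-error loss (the values and variable names are illustrative, not from any framework):

```python
# One linear neuron y = w*x + b with squared-error loss.
# Traces the chain rule dL/dw = (dL/dy) * (dy/dw) from the steps above.
x, target = 2.0, 1.0          # single training example
w, b, alpha = 0.5, 0.0, 0.1   # parameters and learning rate (illustrative values)

for step in range(5):
    # Forward pass: prediction and loss
    y = w * x + b
    loss = 0.5 * (y - target) ** 2

    # Backward pass: gradients via the chain rule
    dL_dy = y - target         # dL/dy for squared error
    dL_dw = dL_dy * x          # dy/dw = x
    dL_db = dL_dy * 1.0        # dy/db = 1

    # Parameter update: w = w - alpha * dL/dw
    w -= alpha * dL_dw
    b -= alpha * dL_db
    print(f"step {step}: loss={loss:.4f}, w={w:.4f}, b={b:.4f}")
```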
Gradient Descent Variants
Batch Gradient Descent
Computes the gradient over the entire dataset before each update. Exact but slow for large datasets.
θ = θ - α∇J(θ)
Stochastic Gradient Descent (SGD)
Updates parameters after each example. Fast but noisy.
θ = θ - α∇J(θ; xᵢ, yᵢ)
Mini-batch GD
Best of both worlds. Updates using small batches (32-256 examples). Industry standard.
SGD with Momentum
Accumulates a velocity term from past gradients, which smooths noisy updates, helps push through shallow local minima, and accelerates convergence.
v = βv + ∇J(θ); θ = θ - αv
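A minimal NumPy sketch of mini-batch SGD with a momentum term on synthetic linear-regression data (the batch size, α, and β values here are illustrative):

```python
import numpy as np

# Mini-batch SGD with momentum on synthetic linear-regression data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
v = np.zeros(3)                 # velocity: running accumulation of past gradients
alpha, beta, batch_size = 0.1, 0.9, 64

for epoch in range(20):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = X[batch], y[batch]
        grad = 2 * xb.T @ (xb @ w - yb) / len(batch)  # MSE gradient on the mini-batch
        v = beta * v + grad                            # accumulate velocity
        w -= alpha * v                                 # step along the velocity, not the raw gradient

print(w)  # should approach true_w = [1.5, -2.0, 0.5]
```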
Advanced Optimizers
Adam (Adaptive Moment Estimation)
The most popular optimizer in deep learning. Combines momentum and adaptive learning rates per parameter. Maintains running averages of gradients and their squares.
- Automatically adjusts learning rate for each parameter
- Works well with sparse gradients
- Default choice for most applications
- Typical hyperparameters: β₁=0.9, β₂=0.999, α=0.001
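A minimal NumPy sketch of the Adam update with these default hyperparameters, applied to a toy quadratic objective (the function names are my own, not a library API):

```python
import numpy as np

# One Adam step for a parameter vector theta, given its gradient.
def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # running average of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad**2       # running average of squared gradients
    m_hat = m / (1 - beta1**t)                  # bias correction for zero-initialised averages
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return theta, m, v

# Toy usage: minimise f(theta) = ||theta||^2
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 3001):
    grad = 2 * theta                            # gradient of ||theta||^2
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)  # approximately [0, 0]
```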
RMSprop
Adapts the learning rate per parameter using a moving average of squared gradients. Works well for RNNs.
AdaGrad
Adapts the learning rate based on the full history of squared gradients. Works well for sparse data, but the accumulated history only grows, so the effective learning rate can shrink too aggressively.
Learning Rate Scheduling
The learning rate is perhaps the most important hyperparameter. Scheduling helps balance fast initial learning with fine-tuning.
Step Decay
Reduce the learning rate by a fixed factor every N epochs. Simple and effective.
Cosine Annealing
Smoothly decreases the learning rate along a cosine curve. Popular in modern training.
Warm-up
Start with a small learning rate and gradually increase it over the first steps. Critical for stable transformer training.
OneCycleLR
Cycles the learning rate from low to high and back to low over a single run. Enables faster training.
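As a minimal sketch of combining warm-up with cosine annealing, here is a schedule written from scratch (the function name, step counts, and learning-rate values are illustrative; no particular library API is assumed):

```python
import math

# Warm-up followed by cosine annealing: ramp up linearly, then decay along a cosine.
def lr_at_step(step, total_steps, base_lr=3e-4, warmup_steps=500, min_lr=1e-6):
    if step < warmup_steps:
        # Linear warm-up from near zero up to base_lr
        return base_lr * (step + 1) / warmup_steps
    # Cosine annealing from base_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

total = 10_000
for step in (0, 250, 500, 5_000, 9_999):
    print(step, f"{lr_at_step(step, total):.2e}")
```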
Regularization Techniques
L2 Regularization (Weight Decay)
Adds a penalty term to the loss: L_total = L + λ∑wᵢ². Discourages large weights and favors simpler models.
Typical λ values: 1e-5 to 1e-4
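A minimal sketch of how the penalty enters the update: its gradient 2λw is added to the data gradient, shrinking the weights at every step (the toy loss and constants are illustrative):

```python
import numpy as np

# L2 penalty adds 2*lam*w to the gradient, pulling weights toward zero each step.
rng = np.random.default_rng(0)
w = rng.normal(size=5)
lam, alpha = 1e-4, 0.1

def data_grad(w):
    return 2 * (w - 1.0)                # toy data loss: ||w - 1||^2

for _ in range(1000):
    grad = data_grad(w) + 2 * lam * w   # gradient of Loss + lam * sum(w^2)
    w -= alpha * grad

print(w)  # settles slightly below 1.0 because of the penalty
```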
Dropout
Randomly zeroes out neurons during training so the network cannot rely on any single unit. Typical rate: 0.2-0.5
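A minimal sketch of the common "inverted dropout" formulation, written from scratch (function and argument names are illustrative):

```python
import numpy as np

# Inverted dropout: zero activations with probability p during training and
# rescale the survivors so the expected activation is unchanged at inference.
def dropout(activations, p=0.5, training=True, rng=None):
    if not training or p == 0.0:
        return activations                        # no-op at inference time
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) >= p     # keep each unit with probability 1 - p
    return activations * mask / (1.0 - p)         # rescale to preserve the expectation

h = np.ones((2, 8))
print(dropout(h, p=0.5, rng=np.random.default_rng(0)))  # ~half zeros, the rest 2.0
```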
Early Stopping
Stop training when the validation loss has not improved for a set number of epochs (the patience), and keep the best checkpoint.
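A minimal sketch of an early-stopping loop with a patience counter; `train_one_epoch` and `evaluate` are hypothetical placeholders for your own training and validation code:

```python
# Early stopping: keep the best state and stop once patience is exhausted.
def train_with_early_stopping(train_one_epoch, evaluate, max_epochs=100, patience=5):
    best_loss = float("inf")
    best_state = None
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        state = train_one_epoch()
        val_loss = evaluate(state)
        if val_loss < best_loss:
            best_loss, best_state = val_loss, state
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                    # validation loss stopped improving
    return best_state, best_loss

# Toy usage: a fake validation curve that bottoms out and then rises.
fake_losses = iter([1.0, 0.8, 0.7, 0.72, 0.74, 0.75, 0.76, 0.9, 1.0, 1.1])
state, loss = train_with_early_stopping(
    train_one_epoch=lambda: None,
    evaluate=lambda _: next(fake_losses),
    patience=3,
)
print(loss)  # 0.7 -- training stops three epochs after the best epoch
```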
Data Augmentation
Artificially expand the training set with label-preserving transformations such as flips, crops, and noise.
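A minimal sketch of two common image augmentations, a random horizontal flip and a random crop with padding (the function and parameters are illustrative, not from any augmentation library):

```python
import numpy as np

# Random horizontal flip plus a random crop from a zero-padded image.
def augment(image, rng):
    if rng.random() < 0.5:
        image = image[:, ::-1]                  # horizontal flip
    padded = np.pad(image, 2, mode="constant")  # pad 2 pixels on each side
    top = rng.integers(0, 5)
    left = rng.integers(0, 5)
    h, w = image.shape
    return padded[top:top + h, left:left + w]   # crop back to the original size

rng = np.random.default_rng(0)
image = np.arange(64, dtype=float).reshape(8, 8)
print(augment(image, rng).shape)  # (8, 8): same size, a different view of the data
```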
Hyperparameter Tuning
Key Hyperparameters
- Learning rate: Most important. Try: 1e-4, 3e-4, 1e-3
- Batch size: 32, 64, 128, 256 (constrained by memory)
- Number of layers/units: Start small, increase if underfitting
- Dropout rate: 0.2-0.5 for fully connected layers
- Weight decay: 1e-5 to 1e-4
Pro tip: Use random search or Bayesian optimization instead of grid search. Monitor validation metrics, not training loss.
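A minimal sketch of random search over the hyperparameters listed above; `evaluate` is a hypothetical stand-in for "train a model and return its validation score", and the placeholder objective is purely illustrative:

```python
import math
import random

# Sample one random configuration over sensible ranges for the key hyperparameters.
def sample_config(rng):
    return {
        "lr": 10 ** rng.uniform(-5, -2),              # log-uniform learning rate
        "batch_size": rng.choice([32, 64, 128, 256]),
        "dropout": rng.uniform(0.2, 0.5),
        "weight_decay": 10 ** rng.uniform(-5, -4),
    }

def evaluate(config):
    # Placeholder objective: prefers lr near 3e-4 and moderate dropout.
    return -abs(math.log10(config["lr"]) + 3.5) - abs(config["dropout"] - 0.3)

rng = random.Random(0)
trials = [(evaluate(cfg), cfg) for cfg in (sample_config(rng) for _ in range(20))]
best_score, best_cfg = max(trials, key=lambda t: t[0])
print(best_cfg)
```

Sampling the learning rate and weight decay on a log scale matters more than the number of trials: random search covers each axis densely even with a modest budget, which is why it usually beats grid search.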
Key Takeaways
- Backpropagation efficiently computes gradients using the chain rule
- Adam is the default optimizer; works well in most cases
- Learning rate scheduling improves convergence and final performance
- Regularization prevents overfitting; combine multiple techniques
- Hyperparameter tuning is essential; use systematic search methods