Training & Optimization
How neural networks learn: backpropagation, gradient descent, and optimization techniques that make deep learning possible.
Backpropagation
Backpropagation is the algorithm that enables neural networks to learn. It efficiently computes gradients of the loss function with respect to all parameters using the chain rule of calculus. These gradients tell us how to adjust parameters to reduce the loss.
The Algorithm
- Forward pass: Compute predictions and loss
- Backward pass: Compute gradients starting from output layer
- Apply chain rule: ∂L/∂wᵢ = (∂L/∂y) × (∂y/∂z) × (∂z/∂wᵢ), where z is the neuron's pre-activation and y its output
- Update parameters: wᵢ = wᵢ - α × ∂L/∂wᵢ
The beauty of backpropagation is its efficiency: computing gradients for all parameters requires just one forward and one backward pass.
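A minimal sketch of these steps, traced by hand for a single linear neuron with a squared-error loss (the values and variable names are illustrative, not from any framework):

```python
# One linear neuron y = w*x + b with squared-error loss.
# Traces the chain rule dL/dw = (dL/dy) * (dy/dw) from the steps above.
x, target = 2.0, 1.0          # single training example
w, b, alpha = 0.5, 0.0, 0.1   # parameters and learning rate (illustrative values)

for step in range(5):
    # Forward pass: prediction and loss
    y = w * x + b
    loss = 0.5 * (y - target) ** 2

    # Backward pass: gradients via the chain rule
    dL_dy = y - target         # dL/dy for squared error
    dL_dw = dL_dy * x          # dy/dw = x
    dL_db = dL_dy * 1.0        # dy/db = 1

    # Parameter update: w = w - alpha * dL/dw
    w -= alpha * dL_dw
    b -= alpha * dL_db
    print(f"step {step}: loss={loss:.4f}, w={w:.4f}, b={b:.4f}")
```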
Gradient Descent Variants
Batch Gradient Descent
Computes the gradient over the entire dataset before each update. Exact but slow for large datasets.
θ = θ - α∇J(θ)
Stochastic Gradient Descent (SGD)
Updates parameters after each example. Fast but noisy.
θ = θ - α∇J(θ; xᵢ, yᵢ)
Mini-batch GD
Best of both worlds. Updates using small batches (32-256 examples). Industry standard.
SGD with Momentum
Accumulates a velocity term from past gradients, which smooths noisy updates, helps push through shallow local minima, and accelerates convergence.
v = βv + ∇J(θ); θ = θ - αv
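A minimal NumPy sketch of mini-batch SGD with a momentum term on synthetic linear-regression data (the batch size, α, and β values here are illustrative):

```python
import numpy as np

# Mini-batch SGD with momentum on synthetic linear-regression data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
v = np.zeros(3)                 # velocity: running accumulation of past gradients
alpha, beta, batch_size = 0.1, 0.9, 64

for epoch in range(20):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = X[batch], y[batch]
        grad = 2 * xb.T @ (xb @ w - yb) / len(batch)  # MSE gradient on the mini-batch
        v = beta * v + grad                            # accumulate velocity
        w -= alpha * v                                 # step along the velocity, not the raw gradient

print(w)  # should approach true_w = [1.5, -2.0, 0.5]
```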
Advanced Optimizers
Adam (Adaptive Moment Estimation)
The most popular optimizer in deep learning. Combines momentum and adaptive learning rates per parameter. Maintains running averages of gradients and their squares.
- Automatically adjusts learning rate for each parameter
- Works well with sparse gradients
- Default choice for most applications
- Typical hyperparameters: β₁=0.9, β₂=0.999, α=0.001
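A minimal NumPy sketch of the Adam update with these default hyperparameters, applied to a toy quadratic objective (the function names are my own, not a library API):

```python
import numpy as np

# One Adam step for a parameter vector theta, given its gradient.
def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # running average of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad**2       # running average of squared gradients
    m_hat = m / (1 - beta1**t)                  # bias correction for zero-initialised averages
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return theta, m, v

# Toy usage: minimise f(theta) = ||theta||^2
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 3001):
    grad = 2 * theta                            # gradient of ||theta||^2
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)  # approximately [0, 0]
```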
RMSprop
Adapts the learning rate per parameter using a moving average of squared gradients. Works well for RNNs.
AdaGrad
Adapts the learning rate based on the full history of squared gradients. Works well for sparse data, but the accumulated history only grows, so the effective learning rate can shrink too aggressively.
Learning Rate Scheduling
The learning rate is perhaps the most important hyperparameter. Scheduling helps balance fast initial learning with fine-tuning.
Step Decay
Reduce the learning rate by a fixed factor every N epochs. Simple and effective.
Cosine Annealing
Smoothly decreases the learning rate along a cosine curve. Popular in modern training.
Warm-up
Start with a small learning rate and gradually increase it over the first steps. Critical for stable transformer training.
OneCycleLR
Cycles the learning rate from low to high and back to low over a single run. Enables faster training.
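As a minimal sketch of combining warm-up with cosine annealing, here is a schedule written from scratch (the function name, step counts, and learning-rate values are illustrative; no particular library API is assumed):

```python
import math

# Warm-up followed by cosine annealing: ramp up linearly, then decay along a cosine.
def lr_at_step(step, total_steps, base_lr=3e-4, warmup_steps=500, min_lr=1e-6):
    if step < warmup_steps:
        # Linear warm-up from near zero up to base_lr
        return base_lr * (step + 1) / warmup_steps
    # Cosine annealing from base_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

total = 10_000
for step in (0, 250, 500, 5_000, 9_999):
    print(step, f"{lr_at_step(step, total):.2e}")
```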
Regularization Techniques
L2 Regularization (Weight Decay)
Adds a penalty term to the loss: L_total = L + λ∑wᵢ². Discourages large weights and favors simpler models.
Typical λ values: 1e-5 to 1e-4
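A minimal sketch of how the penalty enters the update: its gradient 2λw is added to the data gradient, shrinking the weights at every step (the toy loss and constants are illustrative):

```python
import numpy as np

# L2 penalty adds 2*lam*w to the gradient, pulling weights toward zero each step.
rng = np.random.default_rng(0)
w = rng.normal(size=5)
lam, alpha = 1e-4, 0.1

def data_grad(w):
    return 2 * (w - 1.0)                # toy data loss: ||w - 1||^2

for _ in range(1000):
    grad = data_grad(w) + 2 * lam * w   # gradient of Loss + lam * sum(w^2)
    w -= alpha * grad

print(w)  # settles slightly below 1.0 because of the penalty
```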
Dropout
Randomly zeroes out neurons during training so the network cannot rely on any single unit. Typical rate: 0.2-0.5
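A minimal sketch of the common "inverted dropout" formulation, written from scratch (function and argument names are illustrative):

```python
import numpy as np

# Inverted dropout: zero activations with probability p during training and
# rescale the survivors so the expected activation is unchanged at inference.
def dropout(activations, p=0.5, training=True, rng=None):
    if not training or p == 0.0:
        return activations                        # no-op at inference time
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) >= p     # keep each unit with probability 1 - p
    return activations * mask / (1.0 - p)         # rescale to preserve the expectation

h = np.ones((2, 8))
print(dropout(h, p=0.5, rng=np.random.default_rng(0)))  # ~half zeros, the rest 2.0
```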
Early Stopping
Stop training when the validation loss has not improved for a set number of epochs (the patience), and keep the best checkpoint.
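A minimal sketch of an early-stopping loop with a patience counter; `train_one_epoch` and `evaluate` are hypothetical placeholders for your own training and validation code:

```python
# Early stopping: keep the best state and stop once patience is exhausted.
def train_with_early_stopping(train_one_epoch, evaluate, max_epochs=100, patience=5):
    best_loss = float("inf")
    best_state = None
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        state = train_one_epoch()
        val_loss = evaluate(state)
        if val_loss < best_loss:
            best_loss, best_state = val_loss, state
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                    # validation loss stopped improving
    return best_state, best_loss

# Toy usage: a fake validation curve that bottoms out and then rises.
fake_losses = iter([1.0, 0.8, 0.7, 0.72, 0.74, 0.75, 0.76, 0.9, 1.0, 1.1])
state, loss = train_with_early_stopping(
    train_one_epoch=lambda: None,
    evaluate=lambda _: next(fake_losses),
    patience=3,
)
print(loss)  # 0.7 -- training stops three epochs after the best epoch
```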
Data Augmentation
Artificially expand the training set with label-preserving transformations such as flips, crops, and noise.
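A minimal sketch of two common image augmentations, a random horizontal flip and a random crop with padding (the function and parameters are illustrative, not from any augmentation library):

```python
import numpy as np

# Random horizontal flip plus a random crop from a zero-padded image.
def augment(image, rng):
    if rng.random() < 0.5:
        image = image[:, ::-1]                  # horizontal flip
    padded = np.pad(image, 2, mode="constant")  # pad 2 pixels on each side
    top = rng.integers(0, 5)
    left = rng.integers(0, 5)
    h, w = image.shape
    return padded[top:top + h, left:left + w]   # crop back to the original size

rng = np.random.default_rng(0)
image = np.arange(64, dtype=float).reshape(8, 8)
print(augment(image, rng).shape)  # (8, 8): same size, a different view of the data
```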
Hyperparameter Tuning
Key Hyperparameters
- Learning rate: Most important. Try: 1e-4, 3e-4, 1e-3
- Batch size: 32, 64, 128, 256 (constrained by memory)
- Number of layers/units: Start small, increase if underfitting
- Dropout rate: 0.2-0.5 for fully connected layers
- Weight decay: 1e-5 to 1e-4
Pro tip: Use random search or Bayesian optimization instead of grid search. Monitor validation metrics, not training loss.
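A minimal sketch of random search over the hyperparameters listed above; `evaluate` is a hypothetical stand-in for "train a model and return its validation score", and the placeholder objective is purely illustrative:

```python
import math
import random

# Sample one random configuration over sensible ranges for the key hyperparameters.
def sample_config(rng):
    return {
        "lr": 10 ** rng.uniform(-5, -2),              # log-uniform learning rate
        "batch_size": rng.choice([32, 64, 128, 256]),
        "dropout": rng.uniform(0.2, 0.5),
        "weight_decay": 10 ** rng.uniform(-5, -4),
    }

def evaluate(config):
    # Placeholder objective: prefers lr near 3e-4 and moderate dropout.
    return -abs(math.log10(config["lr"]) + 3.5) - abs(config["dropout"] - 0.3)

rng = random.Random(0)
trials = [(evaluate(cfg), cfg) for cfg in (sample_config(rng) for _ in range(20))]
best_score, best_cfg = max(trials, key=lambda t: t[0])
print(best_cfg)
```

Sampling the learning rate and weight decay on a log scale matters more than the number of trials: random search covers each axis densely even with a modest budget, which is why it usually beats grid search.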
Key Takeaways
- Backpropagation efficiently computes gradients using the chain rule
- Adam is the default optimizer; works well in most cases
- Learning rate scheduling improves convergence and final performance
- Regularization prevents overfitting; combine multiple techniques
- Hyperparameter tuning is essential; use systematic search methods