Computer Vision
Teaching machines to see, understand, and interpret visual information from the world.
What is Computer Vision?
Computer vision enables machines to derive meaningful information from digital images, videos, and other visual inputs. It involves understanding scenes, detecting objects, recognizing faces, tracking motion, and reconstructing 3D environments.
Image Classification
The fundamental computer vision task: given an image, assign it to one of several predefined categories. Convolutional neural networks (CNNs) revolutionized this field, matching or exceeding estimated human accuracy on some benchmarks such as ImageNet.
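As a rough illustration of the final step of a classifier, the sketch below turns raw per-class scores into probabilities with a softmax and picks the most likely category. The scores, the labels, and the function names `softmax` and `classify` are made up for this example; in a real network the scores would come from the last layer of a trained model.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits, labels):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Hypothetical scores for a three-class problem (not from a real model)
labels = ["cat", "dog", "bird"]
label, confidence = classify([2.0, 0.5, -1.0], labels)
print(label, round(confidence, 3))
```

The highest score always wins the argmax; the softmax only matters when a calibrated confidence value is needed alongside the label.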
Key Architectures
- AlexNet (2012): 8 layers, 60M parameters. ImageNet breakthrough
- VGGNet (2014): 16-19 layers with 3x3 convolutions throughout
- ResNet (2015): Skip connections enabling 152+ layer networks
- EfficientNet (2019): Compound scaling of width, depth, and resolution
- Vision Transformers (2020): Apply transformer architecture to images
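The skip connection behind ResNet can be sketched in a few lines: a block outputs its input plus a learned transformation of that input, y = x + F(x). The version below works on plain Python lists and uses toy stand-ins for F (`zero_transform` and `halve` are hypothetical helpers); an actual ResNet block uses convolutions, normalization, and nonlinearities instead.

```python
def residual_block(x, transform):
    """Apply y = x + F(x): the input is added back to the transformed output."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("transform must preserve the vector length")
    return [xi + fi for xi, fi in zip(x, fx)]

def zero_transform(x):
    """Toy F(x) that contributes nothing, so the block acts as the identity."""
    return [0.0 for _ in x]

def halve(x):
    """Toy F(x) = 0.5 * x, standing in for learned layers."""
    return [0.5 * v for v in x]

x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_transform)

# Stack three blocks; each one computes x + 0.5 * x = 1.5 * x
stacked = x
for _ in range(3):
    stacked = residual_block(stacked, halve)
```

Because a block with F(x) = 0 simply passes its input through, extra layers can default to doing nothing, which is one common explanation for why very deep stacks remain trainable.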
Object Detection
More complex than classification: detect multiple objects in an image and localize them with bounding boxes. Critical for autonomous vehicles, robotics, and surveillance systems.
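Localization quality is commonly scored with intersection over union (IoU): the overlap between a predicted box and a reference box divided by the area they cover together. The sketch below assumes boxes are given as (x_min, y_min, x_max, y_max) tuples; that layout and the function name `iou` are choices made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if any
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp to zero when the boxes do not overlap
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    inter = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 2x2 boxes that share a 1x1 corner
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

An IoU of 1.0 means the boxes coincide and 0.0 means they do not overlap at all; detection benchmarks typically count a prediction as correct once its IoU with the ground truth passes a chosen threshold.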
R-CNN Family
Region-based CNNs. Two-stage detectors: first propose candidate regions, then classify each one.
- R-CNN (2014)
- Fast R-CNN (2015)
- Faster R-CNN (2015)
- Mask R-CNN (2017) - adds segmentation
YOLO (You Only Look Once)
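Single-stage detector: predicts bounding boxes and class scores in one pass over the image, which makes it fast enough for real-time detection. YOLOv5 through YOLOv8 are popular versions.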
Semantic Segmentation
Classify every pixel in an image. More granular than object detection - produces pixel-level understanding of scenes.
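One way to picture per-pixel classification: a segmentation network produces a score for every class at every pixel, and the predicted label map takes the highest-scoring class at each location. The sketch below assumes scores are stored as nested lists indexed `scores[class][row][col]`; the layout, the values, and the function name `segment` are illustrative only.

```python
def segment(scores):
    """Turn per-class score maps into a label map by taking the argmax per pixel."""
    num_classes = len(scores)
    height = len(scores[0])
    width = len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][y][x]) for x in range(width)]
        for y in range(height)
    ]

# Hypothetical scores for 2 classes on a 2x3 image
scores = [
    [[0.9, 0.2, 0.1], [0.8, 0.4, 0.3]],  # class 0 (e.g. background)
    [[0.1, 0.8, 0.9], [0.2, 0.6, 0.7]],  # class 1 (e.g. object)
]
mask = segment(scores)
```

The output has the same height and width as the input image, with one class index per pixel instead of one label for the whole image.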
Popular Architectures
- FCN (Fully Convolutional Network): First end-to-end segmentation network
- U-Net: Encoder-decoder with skip connections. Popular in medical imaging
- DeepLab: Uses atrous convolutions for larger receptive fields
- Mask R-CNN: Extends Faster R-CNN with a mask branch for instance segmentation, which separates individual objects rather than only labeling classes
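The atrous (dilated) convolution mentioned for DeepLab can be illustrated in one dimension: the kernel taps are spread apart by a dilation factor, so the same number of weights covers a wider stretch of the input. This is a minimal pure-Python sketch with hypothetical function names, not DeepLab's implementation, which applies the idea to 2D feature maps.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """1D convolution (no padding, no kernel flip) with gaps between kernel taps."""
    span = (len(kernel) - 1) * dilation + 1  # input positions covered per output
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(k * signal[start + i * dilation] for i, k in enumerate(kernel)))
    return out

def receptive_field(kernel_size, dilation):
    """Number of input positions one output value depends on."""
    return (kernel_size - 1) * dilation + 1

signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 1, 1]

dense = dilated_conv1d(signal, kernel, dilation=1)   # taps at offsets 0, 1, 2
atrous = dilated_conv1d(signal, kernel, dilation=2)  # taps at offsets 0, 2, 4
```

With three weights in both cases, the dilated version spans five input positions instead of three, which is how the receptive field grows without adding parameters.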
Advanced Topics
Facial Recognition
FaceNet, ArcFace - learn embeddings in which images of the same person map to similar vectors
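Comparing embeddings can be sketched with cosine similarity, one common choice of similarity measure. The vectors, the 0.8 threshold, and the helper `same_person` below are invented for illustration; real systems use embeddings with far more dimensions and tune the threshold (or use a distance measure instead) on validation data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def same_person(a, b, threshold=0.8):
    """Hypothetical decision rule: treat embeddings above the threshold as a match."""
    return cosine_similarity(a, b) >= threshold

# Made-up 3-dimensional embeddings (real ones are much larger)
anchor = [0.9, 0.1, 0.4]
same_face = [0.8, 0.2, 0.5]
other_face = [-0.3, 0.9, 0.1]

match = same_person(anchor, same_face)
mismatch = same_person(anchor, other_face)
```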
Pose Estimation
Detect human body keypoints (joints such as elbows, knees, and wrists) for motion capture and AR
Image Generation
GANs and diffusion models (DALL-E 2, Stable Diffusion) create photorealistic images
3D Reconstruction
SLAM, NeRF - reconstruct 3D scenes from 2D images
Key Takeaways
- CNNs are the foundation of modern computer vision
- Object detection extends classification with localization
- Segmentation provides pixel-level understanding
- Vision transformers are challenging CNN dominance