Computer Vision

Teaching machines to see, understand, and interpret visual information from the world.

What is Computer Vision?

Computer vision enables machines to derive meaningful information from digital images, videos, and other visual inputs. It involves understanding scenes, detecting objects, recognizing faces, tracking motion, and reconstructing 3D environments.

Image Classification

The fundamental computer vision task: given an image, assign it to one of several predefined categories. Convolutional neural networks (CNNs) revolutionized this field, matching or exceeding estimated human accuracy on some benchmarks such as ImageNet.
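
To make "assign it to one of several predefined categories" concrete, the sketch below shows only the last step of a classifier: turning raw class scores into probabilities with softmax and picking the most likely label. The labels and scores are hypothetical placeholders; in a real system a CNN would produce the scores from the image.

```python
import math

def softmax(scores):
    """Convert raw class scores (logits) into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(scores, labels):
    """Return the most probable label and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Hypothetical labels and network outputs, for illustration only.
labels = ["cat", "dog", "car"]
label, prob = classify([2.0, 0.5, -1.0], labels)
print(label, round(prob, 3))
```

The key point is that the model always chooses among a fixed label set; an image of anything else is still forced into one of those categories.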

Key Architectures

AlexNet (2012): 8 layers, about 60M parameters. Won the ImageNet 2012 challenge by a wide margin
VGGNet (2014): 16-19 layers with 3x3 convolutions throughout
ResNet (2015): Skip (residual) connections enabling networks of 152 layers and deeper
EfficientNet (2019): Compound scaling of width, depth, and resolution
Vision Transformers (2020): Apply the transformer architecture to sequences of image patches
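
The skip connections listed for ResNet can be sketched in a few lines. This is a simplified illustration, not the actual ResNet block: it assumes small fully connected layers in place of convolutions and omits batch normalization. The function names are hypothetical.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]

def linear(v, weights):
    """Multiply vector v by a square weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]

def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is two linear layers with a ReLU between."""
    fx = linear(relu(linear(x, w1)), w2)
    # The skip connection: add the input back onto the transformed signal.
    return relu([f + xi for f, xi in zip(fx, x)])

x = [1.0, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
out = residual_block(x, zeros, zeros)
print(out)  # [1.0, 2.0]
```

With all-zero weights the block passes its (non-negative) input through unchanged. That is the intuition behind very deep networks: an extra block can default to the identity instead of degrading the signal.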

Object Detection

More complex than classification: detect multiple objects in an image and localize them with bounding boxes. Critical for autonomous vehicles, robotics, and surveillance systems.
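
Localization quality is commonly measured with intersection over union (IoU), the overlap between a predicted box and a reference box divided by the area they cover together. The text above does not name this metric, so treat the sketch as one common convention; it assumes boxes in (x_min, y_min, x_max, y_max) form.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 7
```

An IoU of 1 means a perfect match and 0 means no overlap; benchmarks often count a detection as correct above a chosen threshold such as 0.5.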

R-CNN Family

Region-based CNNs are two-stage detectors: first propose candidate regions, then classify each one and refine its bounding box.

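R-CNN (2014): Runs a CNN separately on each proposed region
Fast R-CNN (2015): Computes one shared feature map and pools features per region
Faster R-CNN (2015): Replaces external proposals with a learned Region Proposal Network
Mask R-CNN (2017): Adds a mask branch for instance segmentation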
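
The propose-then-classify flow can be sketched as a toy pipeline. Everything here is a stand-in: the original R-CNN used selective search for proposals and Faster R-CNN uses a learned proposal network, whereas this sketch assumes a plain sliding window and a hypothetical brightness-based classifier.

```python
def propose_regions(width, height, size, stride):
    """Stage 1 stand-in: enumerate square windows as candidate regions."""
    return [
        (x, y, x + size, y + size)
        for y in range(0, height - size + 1, stride)
        for x in range(0, width - size + 1, stride)
    ]

def detect(image, classify_region, size=2, stride=2, threshold=0.5):
    """Stage 2: crop each proposal, classify it, keep confident results."""
    height, width = len(image), len(image[0])
    detections = []
    for box in propose_regions(width, height, size, stride):
        x1, y1, x2, y2 = box
        crop = [row[x1:x2] for row in image[y1:y2]]
        label, score = classify_region(crop)
        if score >= threshold:
            detections.append((box, label, score))
    return detections

def toy_classifier(crop):
    """Hypothetical classifier: the score is the mean pixel brightness."""
    pixels = [p for row in crop for p in row]
    return "bright", sum(pixels) / len(pixels)

image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
detections = detect(image, toy_classifier)
print(detections)  # [((2, 0, 4, 2), 'bright', 1.0)]
```

The cost of this design is visible in the loop: the classifier runs once per proposal, which is the bottleneck the later family members reduce by sharing computation.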

YOLO (You Only Look Once)

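Single-stage detector: predicts boxes and class scores for the whole image in one forward pass, which makes it fast enough for real-time detection. YOLOv5 through YOLOv8 are popular versions.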
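
Detectors of this kind typically emit many overlapping candidate boxes for the same object, and a common post-processing step (not described above, so treat it as an assumption about the pipeline) is non-maximum suppression (NMS). A minimal sketch, assuming boxes in (x_min, y_min, x_max, y_max) form and a hypothetical overlap threshold of 0.5:

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(detections, iou_threshold=0.5):
    """Keep the highest-scoring box in each group of overlapping boxes.

    detections is a list of (box, score) pairs.
    """
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        # Drop every box that overlaps the chosen one too strongly.
        remaining = [d for d in remaining if iou(best[0], d[0]) < iou_threshold]
    return kept

detections = [
    ((0, 0, 10, 10), 0.9),
    ((1, 1, 11, 11), 0.8),    # near-duplicate of the first box
    ((20, 20, 30, 30), 0.7),  # a separate object
]
kept = non_max_suppression(detections)
print(kept)
```

Here the second box is suppressed because it overlaps the higher-scoring first box, while the third survives because it covers a different area.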

Semantic Segmentation

Classify every pixel in an image. More granular than object detection: the output is a class label for each pixel rather than a set of bounding boxes.
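
In practice "classify every pixel" usually means the network outputs one score map per class and each pixel takes the class with the highest score. The sketch below shows that final step only, on hypothetical 2x2 score maps.

```python
def segment(score_maps):
    """Return a label map from per-class score maps.

    score_maps[c][y][x] is the score of class c at pixel (x, y).
    """
    num_classes = len(score_maps)
    height, width = len(score_maps[0]), len(score_maps[0][0])
    return [
        [max(range(num_classes), key=lambda c: score_maps[c][y][x]) for x in range(width)]
        for y in range(height)
    ]

# Hypothetical scores for a 2x2 image and two classes.
scores = [
    [[0.9, 0.2], [0.8, 0.1]],  # class 0: background
    [[0.1, 0.8], [0.2, 0.9]],  # class 1: object
]
mask = segment(scores)
print(mask)  # [[0, 1], [0, 1]]
```

The result has the same height and width as the input, which is what distinguishes segmentation output from a list of boxes.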

Popular Architectures

FCN (Fully Convolutional Network): One of the first networks trained end-to-end for pixel-wise segmentation
U-Net: Encoder-decoder with skip connections. Popular in medical imaging
DeepLab: Uses atrous convolutions for larger receptive fields
Mask R-CNN: Extends Faster R-CNN with a mask branch for instance segmentation, which also separates individual objects of the same class
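
The atrous (dilated) convolutions used by DeepLab can be illustrated in one dimension. This is a simplified sketch, not DeepLab code: it assumes a 1D signal, no padding, and a hypothetical three-tap kernel, and it shows how spacing the taps apart widens the receptive field without adding weights.

```python
def atrous_conv1d(signal, kernel, rate=1):
    """1D convolution without padding, with kernel taps spaced `rate` apart."""
    span = (len(kernel) - 1) * rate + 1  # input samples covered by one output
    return [
        sum(k * signal[i + j * rate] for j, k in enumerate(kernel))
        for i in range(len(signal) - span + 1)
    ]

signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 1, 1]
dense = atrous_conv1d(signal, kernel, rate=1)    # each output sees 3 samples
dilated = atrous_conv1d(signal, kernel, rate=2)  # each output sees a span of 5
print(dense)    # [6, 9, 12, 15, 18]
print(dilated)  # [9, 12, 15]
```

Both calls use the same three weights, but with rate=2 each output draws on a span of five input samples, which is how the receptive field grows at no extra parameter cost.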

Advanced Topics

Facial Recognition

FaceNet and ArcFace learn embeddings in which images of the same person map to nearby vectors
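
Once faces are mapped to embeddings, verification reduces to comparing vectors. The sketch below assumes cosine similarity and a hypothetical threshold of 0.8; real systems choose the distance measure and threshold to suit the model, and the three-dimensional vectors here are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def same_person(emb_a, emb_b, threshold=0.8):
    """Hypothetical decision rule: similar enough embeddings count as a match."""
    return cosine_similarity(emb_a, emb_b) >= threshold

# Made-up embeddings; a real model would output hundreds of dimensions.
anchor = [0.9, 0.1, 0.4]
same = [0.8, 0.2, 0.5]
other = [-0.3, 0.9, 0.1]
print(same_person(anchor, same), same_person(anchor, other))  # True False
```

Because the comparison happens in embedding space, new identities can be enrolled by storing one vector, without retraining the network.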

Pose Estimation

Detect human body keypoints (joints such as elbows, knees, and wrists) for motion capture and augmented reality (AR)
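
Many pose estimators predict one heatmap per keypoint and read the joint location from its peak. That design is an assumption here, since the text does not specify a method. A minimal sketch with hypothetical 3x3 heatmaps and keypoint names:

```python
def keypoint_from_heatmap(heatmap):
    """Return (x, y, confidence) of the strongest response in a 2D heatmap."""
    best_x, best_y, best_v = 0, 0, heatmap[0][0]
    for y, row in enumerate(heatmap):
        for x, v in enumerate(row):
            if v > best_v:
                best_x, best_y, best_v = x, y, v
    return best_x, best_y, best_v

# Hypothetical network outputs: one small heatmap per keypoint.
heatmaps = {
    "left_wrist": [[0.0, 0.1, 0.0], [0.2, 0.9, 0.1], [0.0, 0.1, 0.0]],
    "right_wrist": [[0.7, 0.1, 0.0], [0.1, 0.0, 0.0], [0.0, 0.0, 0.0]],
}
pose = {name: keypoint_from_heatmap(h) for name, h in heatmaps.items()}
print(pose)
```

The confidence value lets downstream code discard keypoints that are occluded or outside the frame.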

Image Generation

GANs and diffusion models (e.g., DALL-E 2, Stable Diffusion) generate photorealistic images

3D Reconstruction

SLAM (simultaneous localization and mapping) and NeRF (neural radiance fields) reconstruct 3D scenes from 2D images

Key Takeaways

CNNs are the foundation of modern computer vision
Object detection extends classification with localization
Segmentation provides pixel-level understanding
Vision transformers are challenging CNN dominance
