Summary of the Video: "Computer Vision: 3rd lecture (object recognition, Convolutional Neural Networks)"
Overview
This lecture builds on prior computer vision topics such as image segmentation, feature detection, and clustering, and focuses on object recognition methods and Convolutional Neural Networks (CNNs). It covers:
- Definitions and challenges in object recognition
- Traditional approaches: Template Matching and histograms
- Introduction to machine learning for vision
- Fundamentals of neural networks and training
- Convolutional Neural Networks: architecture, advantages, and training
- Practical aspects: hyperparameter tuning, AutoML, hardware acceleration
- Popular CNN architectures and their impact
- Preview of next lecture on temporal models and video processing
Main Ideas and Concepts
1. Recap of Previous Content
- Segmentation: from simple binarization to complex methods using Gestalt theory, histograms, quad trees, and K-means clustering.
- Feature detection: Hough transform for lines, Harris corner detector for corners, RANSAC for robust parameter estimation.
- Challenges in feature selection and application in vision tasks.
2. Object Recognition Tasks and Terminology
- Classification: Assign a label to the entire image (e.g., dog, car).
- Localization: Find the location of an object in an image (bounding box).
- Detection: Combination of localization + classification of objects.
- Object recognition is an umbrella term covering these related but distinct tasks.
- Challenges include variations in translation, rotation, scale, occlusion, illumination, viewpoint, camera distortions, and intra-class variability.
3. Traditional Object Recognition Approaches
- Template Matching:
- Slide a template over the image and compute a correlation score at each position to find the best match (a minimal sketch follows this list).
- Advantages: simple, fast, works well in controlled environments.
- Disadvantages: sensitive to scale, rotation, illumination changes, and shape variations.
- Histogram-based Feature Voting:
- Extract features, build hash tables, and vote for candidate models.
- More robust than Template Matching to rotation and deformation.
- Computationally more expensive and requires feature detectors.
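To make template matching concrete, here is a minimal sketch using OpenCV's normalized cross-correlation; the filenames are placeholders, not from the lecture.

```python
import cv2

# Load a search image and a smaller template (placeholder filenames).
image = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("template.png", cv2.IMREAD_GRAYSCALE)

# Slide the template over the image and score every position with
# normalized cross-correlation (robust to uniform brightness changes).
scores = cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)

# The best match is the location with the highest correlation score.
_, max_val, _, max_loc = cv2.minMaxLoc(scores)
h, w = template.shape
top_left = max_loc
bottom_right = (top_left[0] + w, top_left[1] + h)
print(f"best score {max_val:.3f} at box {top_left} -> {bottom_right}")
```

Because the template is compared at a single fixed scale and orientation, the sketch also illustrates why the method degrades under scale, rotation, and strong illumination changes.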
4. Machine Learning for Object Recognition
- Traditional ML algorithms (SVM, k-NN, decision trees, Bayesian models) perform poorly on raw image pixels.
- Feature extraction (e.g., Harris corners, edge detectors) is therefore needed before applying ML (a pipeline sketch follows this list).
- Challenge: which features to choose? Often trial and error.
- Neural networks allow learning features and classification jointly, especially deep neural networks.
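As a concrete illustration of the "hand-crafted features, then a classical classifier" pipeline, here is a minimal sketch with scikit-image and scikit-learn. It uses HOG features and the small digits dataset purely as stand-ins; the specific features and data are assumptions for illustration, not from the lecture.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from skimage.feature import hog

# Small 8x8 digit images as a stand-in dataset.
digits = load_digits()

# Hand-crafted features: a histogram of oriented gradients per image.
features = np.array([
    hog(img, orientations=8, pixels_per_cell=(4, 4), cells_per_block=(1, 1))
    for img in digits.images
])

X_train, X_test, y_train, y_test = train_test_split(
    features, digits.target, test_size=0.25, random_state=0)

# Classical ML classifier trained on the extracted features.
clf = SVC(kernel="rbf").fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```

The feature-extraction step is a manual design choice; the lecture's point is that deep networks learn this step instead of requiring it to be engineered.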
5. Introduction to Neural Networks
- Structure: input layer, one or more hidden layers, output layer.
- Fully connected layers: every unit in one layer is connected to every unit in the next layer.
- Units compute weighted sums of their inputs followed by an activation function (illustrated in the sketch after this list).
- Activation functions:
- Sigmoid: outputs between 0 and 1; historically popular, but its saturating regions slow down training (vanishing gradients) and it is prone to overfitting.
- ReLU (Rectified Linear Unit): zero for negative inputs, linear for positive; leads to faster training and less overfitting.
- Neural networks are supervised learning models mainly used for classification and regression.
- History: from simple perceptrons in the 1950s (linear classifiers) to multilayer perceptrons (MLPs) and the deep learning resurgence around 2006, driven by better training methods and greater data availability.
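A minimal NumPy sketch of a forward pass through a small fully connected network, showing weighted sums followed by ReLU and sigmoid activations; the layer sizes and random weights are illustrative assumptions.

```python
import numpy as np

def relu(x):
    # Zero for negative inputs, identity for positive inputs.
    return np.maximum(0.0, x)

def sigmoid(x):
    # Squashes values into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# One hidden layer: 4 inputs -> 8 hidden units -> 1 output.
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)

x = rng.normal(size=4)        # example input vector
h = relu(W1 @ x + b1)         # hidden layer: weighted sum + ReLU
y = sigmoid(W2 @ h + b2)      # output layer: weighted sum + sigmoid
print("output:", y)
```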
6. Training Neural Networks
- Define a loss function (e.g., squared error, cross-entropy).
- Optimize parameters (weights) to minimize loss using gradient descent:
- Iteratively update the weights in the direction opposite to the gradient of the loss (a minimal sketch follows this list).
- Learning rate controls step size; too large causes overshooting, too small causes slow convergence.
- Extend gradient descent to multiple parameters using partial derivatives and vector notation (gradient).
- Use stochastic gradient descent (SGD) or mini-batches for efficiency.
- Loss surfaces in neural networks are non-convex with many local minima.
- The backpropagation algorithm efficiently computes gradients for all weights.
- Modern frameworks handle backpropagation automatically.
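A minimal sketch of mini-batch stochastic gradient descent on a toy linear-regression problem with a squared-error loss; the data, batch size, and learning rate are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3*x - 2 plus noise.
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x - 2.0 + 0.1 * rng.normal(size=200)

w, b = 0.0, 0.0
lr = 0.1                                   # learning rate (step size)
for epoch in range(100):
    for idx in rng.permutation(200).reshape(-1, 20):   # mini-batches of 20
        xb, yb = x[idx], y[idx]
        err = (w * xb + b) - yb
        # Gradients of the mean squared error w.r.t. w and b.
        grad_w = 2.0 * np.mean(err * xb)
        grad_b = 2.0 * np.mean(err)
        # Update each parameter opposite to its gradient.
        w -= lr * grad_w
        b -= lr * grad_b

print(f"learned w={w:.2f}, b={b:.2f} (target: 3, -2)")
```

In a neural network the same update rule is applied to every weight, with backpropagation supplying the gradients.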
7. Deep Learning and Feature Learning
- Deep neural networks with many hidden layers learn hierarchical features automatically:
- Early layers learn edges.
- Middle layers learn parts (faces, shapes).
- Later layers learn whole objects.
- This self-learning of features removes the need for manual feature engineering.
8. Convolutional Neural Networks (CNNs)
- CNNs operate directly on 2D image data, preserving its spatial structure.
- Key components:
- Convolutional layers: apply small kernels (filters) to local regions (receptive fields).
- Sparse interactions: kernels are smaller than image, reducing parameters.
- Parameter sharing: same kernel applied across different image locations, improving translation invariance and reducing parameters.
- Pooling (subsampling) layers: compress feature maps using max or average pooling, increasing invariance and reducing dimensionality (a small convolution-and-pooling sketch follows this list).
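A minimal sketch of a single convolution + ReLU + max-pooling block, assuming PyTorch is available; the image size and channel counts are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

# Convolution (local receptive fields, shared kernels), then ReLU,
# then 2x2 max pooling to downsample the feature maps.
block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
)

x = torch.randn(1, 1, 28, 28)   # one grayscale 28x28 image
out = block(x)
print(out.shape)                # torch.Size([1, 8, 14, 14])

# Parameter sharing keeps the layer small: 8 kernels of size 3x3 plus
# 8 biases -> 8 * 3 * 3 * 1 + 8 = 80 parameters, far fewer than a fully
# connected layer over all 28*28 inputs.
print(sum(p.numel() for p in block[0].parameters()))   # 80
```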