Prev: L6, Next: L8 , Assignment: A4 , Practice Questions: M19 M20 , Links: Canvas, Piazza, Zoom, TopHat (453473)

# Computer Vision Tasks

📗 Unsupervised:
➩ Image segmentation.
📗 Supervised:
➩ Image colorization.
➩ Image reconstruction.
➩ Image synthesis.
➩ Image captioning.
➩ Object detection and tracking.
➩ Medical image analysis.



# Convolution

📗 Using pixel intensities directly as features assumes pixels are independent of their neighbors, which is inappropriate for most computer vision tasks.
📗 Neighboring pixel intensities can be combined to create one feature that captures the information in the region around the pixel.
📗 Linearly combining pixels in a rectangular region is called convolution: Link, Wikipedia.
➩ The convolution of a vector \(x_{i} = \left(x_{i 1}, x_{i 2}, ..., x_{i m}\right)\) with a filter \(w = \left(w_{-k}, w_{-k + 1}, ..., w_{k-1}, w_{k}\right)\) is \(a_{i} = \left(a_{i 1}, a_{i 2}, ..., a_{i m}\right) = x_{i} \star w\), where \(a_{i j} = w_{-k} x_{i \left(j + k\right)} + w_{-k + 1} x_{i \left(j + k - 1\right)} + ... + w_{k} x_{i \left(j - k\right)}\) for \(j = 1, 2, ..., m\) (entries outside the vector are treated as zero, i.e. zero padding).
➩ The convolution of an \(m \times m\) matrix \(X_{i}\) with a \(\left(2 k + 1\right) \times \left(2 k + 1\right)\) filter \(W\) is \(A_{i} = X_{i} \star W\), where row \(j\) of the output is \(A_{i j} = W_{-k} \star X_{i \left(j + k\right)} + W_{-k + 1} \star X_{i \left(j + k - 1\right)} + ... + W_{k} \star X_{i \left(j - k\right)}\) for \(j = 1, 2, ..., m\); here \(W_{u}\) denotes row \(u\) of the filter, \(X_{i j'}\) denotes row \(j'\) of the image, and \(\star\) is the one-dimensional convolution defined above.
➩ 3D convolution can be defined the same way.
📗 The convolution filter is also called a "kernel", but it is different from the kernel matrix for SVMs.
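📗 The 2D convolution with zero padding can be sketched in NumPy as follows (a minimal illustration, not an efficient implementation; the function name `conv2d` and the example image values are made up for this sketch). Note the filter is flipped (rotated 180 degrees) before the elementwise multiply, which is what distinguishes convolution from cross-correlation:

```python
import numpy as np

def conv2d(image, filt):
    """2D convolution with zero padding so the output has the same size
    as the input; filt is a (2k+1) x (2k+1) filter."""
    k = filt.shape[0] // 2
    flipped = np.flip(filt)                 # rotate the filter 180 degrees
    padded = np.pad(image, k)               # zero padding of width k
    out = np.zeros(image.shape, dtype=float)
    for s in range(image.shape[0]):
        for t in range(image.shape[1]):
            region = padded[s:s + 2 * k + 1, t:t + 2 * k + 1]
            out[s, t] = np.sum(region * flipped)
    return out

image = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]], dtype=float)
identity = np.array([[0, 0, 0],
                     [0, 1, 0],
                     [0, 0, 0]], dtype=float)
print(conv2d(image, identity))  # identity filter leaves the image unchanged
```

Convolving with an asymmetric filter shows why the flip matters: a filter with a single \(1\) in the top-left corner shifts the image up and to the left, not down and to the right.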
TopHat Quiz (Past Exam Question) ID:
📗 [4 points] What is the convolution between the image and the filter using zero padding? Remember to flip the filter first.

📗 Answer (matrix with multiple lines, each line is a comma separated vector): .




# Image Gradients

📗 Image gradients are changes in pixel intensity due to the change in the location of the pixel: Wikipedia.
📗 Image gradients can be computed (approximated) by convolution with the following filters: \(\nabla_{x} I = W_{x} \star I\) and \(\nabla_{y} I = W_{y} \star I\).
➩ (Discrete) derivative filter: \(W_{x} = \begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix}\) and \(W_{y} = \begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix}\).
➩ Sobel filters: \(W_{x} = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}\) and \(W_{y} = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}\), which can be viewed as a combination of the Gaussian filter (to blur the image) and the derivative filter: Wikipedia.
📗 Image gradients can be used in edge detection: Link, Wikipedia.
📗 The gradient magnitude at a pixel \(\left(s, t\right)\) is \(G = \sqrt{\left(\nabla_{x} I\left(s, t\right)\right)^{2} + \left(\nabla_{y} I\left(s, t\right)\right)^{2}}\) and the gradient direction is \(\Theta = \arctan\left(\dfrac{\nabla_{y} I\left(s, t\right)}{\nabla_{x} I\left(s, t\right)}\right)\).
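📗 The gradient magnitude at the center pixel of a 3 by 3 image can be computed by hand, which mirrors the exam-style question below (a small sketch; the image with a vertical edge is a made-up example):

```python
import numpy as np

# A made-up image with a vertical edge: dark on the left, bright on the right.
I = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 0, 1]], dtype=float)
Wx = np.array([[-1, 0, 1],
               [-1, 0, 1],
               [-1, 0, 1]], dtype=float)
Wy = np.array([[-1, -1, -1],
               [ 0,  0,  0],
               [ 1,  1,  1]], dtype=float)

# Convolution at the center pixel: flip each filter 180 degrees, then take
# the elementwise product with the 3x3 neighborhood and sum.
gx = np.sum(np.flip(Wx) * I)
gy = np.sum(np.flip(Wy) * I)
G = np.sqrt(gx ** 2 + gy ** 2)    # gradient magnitude
Theta = np.arctan2(gy, gx)        # gradient direction
print(G)
```

Here the edge is vertical, so the \(y\) gradient is \(0\) and the magnitude comes entirely from the \(x\) gradient.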
TopHat Quiz (Past Exam Question) ID:
📗 [1 point] What is the gradient magnitude of the center element (pixel) of the image . Use the x gradient filter: \(\begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix}\), and the y gradient filter: \(\begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix}\). Remember to flip the filters.
📗 Answer: .




# Histogram of Gradients

📗 Histogram of Gradients (HOG) can be used as features for images and is often combined with SVMs for face detection and recognition tasks: Wikipedia.
📗 In every 8 by 8 pixel region of an image, the gradient vectors \(\begin{bmatrix} \nabla_{x} I\left(s, t\right) \\ \nabla_{y} I\left(s, t\right) \end{bmatrix}\) are put into 9 orientation bins, for example, \(\left[0, \dfrac{2}{9} \pi\right], \left[\dfrac{2}{9} \pi, \dfrac{4}{9} \pi\right], ..., \left[\dfrac{16}{9} \pi, 2 \pi\right]\), and the histogram count is used as the HOG features.
📗 The resulting bins are normalized within a block of 2 by 2 regions.
📗 Scale Invariant Feature Transform (SIFT) produces similar feature representation using histogram of oriented gradients: Wikipedia.
➩ It is location invariant.
➩ It is scale invariant: images at different scales are used to compute the features.
➩ It is orientation invariant: dominant orientation in a larger region is calculated and all gradients in the region are rotated by the dominant orientation.
➩ It is illumination and contrast invariant: feature vectors are normalized so that they sum up to 1, and thresholded (values smaller than a threshold, for example 0.2, are made 0).
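📗 The orientation binning for one cell can be sketched as follows (a simplified sketch: the function name `hog_cell` is made up, and real HOG implementations typically use unsigned orientations and interpolation between neighboring bins, which are omitted here):

```python
import numpy as np

def hog_cell(gx, gy, n_bins=9):
    """Histogram of gradient orientations for one cell (e.g., 8x8 pixels).
    Each pixel's gradient magnitude is added to the bin containing its
    orientation; the bins split [0, 2*pi) into n_bins equal parts."""
    mag = np.hypot(gx, gy)                          # gradient magnitudes
    theta = np.arctan2(gy, gx) % (2 * np.pi)        # orientations in [0, 2*pi)
    bins = (theta / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.zeros(n_bins)
    for b, m in zip(bins.ravel(), mag.ravel()):
        hist[b] += m                                # vote weighted by magnitude
    return hist

# All gradients point in the +x direction, so everything lands in bin 0.
hist = hog_cell(np.ones((8, 8)), np.zeros((8, 8)))
print(hist)
```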
📗 Example: visualization of HOG features (figure).




# Haar Features

📗 HOG and SIFT features are too expensive to compute for real-time face detection tasks.
📗 Each image contains a large number of candidate regions at different locations and scales, but faces occur in only a few of them.
📗 Each feature and classifier based on the feature should be easy to compute, and boosting can be used to combine simple features and classifiers.
📗 Haar features are differences between sums of pixel intensities in rectangular regions and can be obtained by convolution with filters such as \(\begin{bmatrix} 1 & 1 \\ -1 & -1 \end{bmatrix}\), \(\begin{bmatrix} 1 & -1 \\ 1 & -1 \end{bmatrix}\), ...: Wikipedia.
📗 Integral images (sum of pixel intensities above and to the left of every pixel) can be used to further speed up the computation: Wikipedia.
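📗 The integral image makes any rectangular sum a constant-time computation with four lookups, which is why Haar features are cheap to evaluate. A minimal sketch (the function names `integral_image` and `box_sum` are made up for this illustration):

```python
import numpy as np

def integral_image(I):
    """S[s, t] = sum of all pixels above and to the left of (s, t), inclusive."""
    return np.cumsum(np.cumsum(I, axis=0), axis=1)

def box_sum(S, top, left, bottom, right):
    """Sum of pixels in the rectangle [top..bottom] x [left..right] using
    at most 4 lookups into the integral image S."""
    total = S[bottom, right]
    if top > 0:
        total -= S[top - 1, right]
    if left > 0:
        total -= S[bottom, left - 1]
    if top > 0 and left > 0:
        total += S[top - 1, left - 1]
    return total

I = np.arange(16).reshape(4, 4)
S = integral_image(I)
# A two-rectangle Haar feature: top half minus bottom half of a region.
haar = box_sum(S, 0, 0, 1, 3) - box_sum(S, 2, 0, 3, 3)
```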



# Weak Classifiers

📗 Each weak classifier can be a decision stump (decision tree with only one split) based on a Haar feature.
📗 Finding the threshold by comparing information gain is computationally expensive, so it is usually computed as the mid-point of the averages of the two classes.
➩ Start with a classifier with a close to 100 percent detection rate but a possibly large false-positive rate.
➩ Train the next classifier on regions that are not rejected by the first classifier, again with a close to 100 percent detection rate and a possibly large false-positive rate.
➩ Repeat this process to get a sequence of weak classifiers. The combined classifier is a strong classifier.
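📗 The midpoint threshold rule for a decision stump can be sketched as follows (a sketch under the assumption of binary 0/1 labels; the function name `stump_threshold` is made up):

```python
import numpy as np

def stump_threshold(features, labels):
    """Decision stump threshold as the midpoint between the average feature
    value of the positive class and that of the negative class (much cheaper
    than searching over all splits by information gain)."""
    mu_pos = features[labels == 1].mean()
    mu_neg = features[labels == 0].mean()
    return (mu_pos + mu_neg) / 2

features = np.array([1.0, 2.0, 8.0, 9.0])   # one Haar feature per region
labels = np.array([0, 0, 1, 1])             # 1 = face, 0 = non-face
print(stump_threshold(features, labels))    # midpoint of the two class means
```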




# Viola Jones

📗 The Viola-Jones algorithm is a popular real-time face detection algorithm: Wikipedia.
➩ Each classifier operates on a 24 by 24 region of the image at multiple scales (scaling factor of 1.25).
➩ The regions can be overlapping. Nearby detections of faces are combined into a single detection.



# Learning Convolution

📗 The features can be engineered using computer vision techniques such as HOG or SIFT.
📗 They can also be learned as hidden units in a neural network; these networks are called convolutional neural networks (CNNs): Link, Link, Wikipedia.
📗 Instead of activation units \(a = g\left(w^\top x + b\right)\), the dot product can be replaced by convolution (in practice usually cross-correlation, which is convolution without flipping the filter). The resulting matrix of activation units is called an activation map, computed as \(A = g\left(W \star x + b\right)\).
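📗 Computing one activation map can be sketched as follows (a minimal sketch using cross-correlation, ReLU activation, and no padding; the function name `activation_map` and the example values are made up):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def activation_map(X, W, b):
    """One activation map A = g(W * X + b) via cross-correlation
    (no filter flip), computed only where the filter fits ('valid')."""
    k = W.shape[0]
    h = X.shape[0] - k + 1
    w = X.shape[1] - k + 1
    Z = np.zeros((h, w))
    for s in range(h):
        for t in range(w):
            Z[s, t] = np.sum(X[s:s + k, t:t + k] * W) + b
    return relu(Z)

X = np.ones((3, 3))          # toy input image
W = np.ones((2, 2))          # one 2x2 filter (shared weights)
A = activation_map(X, W, -3.0)
print(A)                     # each unit: relu(4 - 3) = 1
```

The same filter weights \(W\) and bias \(b\) are shared across all positions, which is why a convolution layer has far fewer weights than a fully connected layer.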



# Convolution and Pooling Layers

📗 Convolution can also be applied to activation maps in the previous layer: \(A^{\left(l\right)} = g\left(W^{\left(l\right)} \star A^{\left(l - 1\right)} + b^{\left(l\right)}\right)\).
📗 Multiple units can be combined into one in pooling layers.
➩ Max pooling computes the maximum in a square region.
➩ Average pooling computes the average in a square region.
📗 The filter weights in convolution layers need to be trained using gradient descent. The pooling layer does not have weights that need to be trained.
➩ The gradient with respect to the weights in the convolution layers can be computed using convolution: \(\dfrac{\partial C}{\partial W} = X \star \dfrac{\partial C}{\partial A}\) and \(\dfrac{\partial C}{\partial X} = \text{rot} W \star \dfrac{\partial C}{\partial A}\), where \(\text{rot} W\) is the filter matrix rotated by 180 degrees (for example \(\begin{bmatrix} a & b \\ c & d \end{bmatrix}\) to \(\begin{bmatrix} d & c \\ b & a \end{bmatrix}\)).
➩ The gradient for the pooling layers is (i) for max pooling: \(1\) for the maximum input and \(0\) for the other units, (ii) for average pooling: \(\dfrac{1}{m^{2}}\) for each of the units in the \(m \times m\) region.
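📗 Max pooling and its gradient can be sketched as follows (a sketch for a single channel with non-overlapping pooling; the function names `max_pool` and `max_pool_grad` are made up). The backward pass routes each upstream gradient to the position of the maximum in its region and sets the other units to \(0\), matching the rule above:

```python
import numpy as np

def max_pool(A, m=2):
    """Non-overlapping m x m max pooling with stride m (no padding)."""
    h, w = A.shape
    A = A[:h - h % m, :w - w % m]              # drop any ragged edge
    blocks = A.reshape(A.shape[0] // m, m, A.shape[1] // m, m)
    return blocks.max(axis=(1, 3))

def max_pool_grad(A, dOut, m=2):
    """Gradient of max pooling: 1 * dOut at the argmax of each m x m
    region, 0 everywhere else."""
    grad = np.zeros_like(A, dtype=float)
    for i in range(dOut.shape[0]):
        for j in range(dOut.shape[1]):
            block = A[i * m:(i + 1) * m, j * m:(j + 1) * m]
            r, c = np.unravel_index(np.argmax(block), block.shape)
            grad[i * m + r, j * m + c] = dOut[i, j]
    return grad

A = np.array([[1, 2, 0, 1],
              [3, 4, 1, 0],
              [5, 0, 2, 2],
              [1, 1, 2, 3]], dtype=float)
P = max_pool(A)                       # 2x2 output of region maxima
g = max_pool_grad(A, np.ones((2, 2))) # one nonzero entry per region
```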
TopHat Quiz (Past Exam Question) ID:
📗 [4 points] A convolutional neural network has input image of size x that is connected to a convolutional layer that uses a x filter, zero padding of the image, and a stride of 1. There are activation maps. (Here, zero-padding implies that these activation maps have the same size as the input images.) The convolutional layer is then connected to a pooling layer that uses x max pooling, a stride of (non-overlapping, no padding) of the convolutional layer. The pooling layer is then fully connected to an output layer that contains output units. There are no hidden layers between the pooling layer and the output layer. How many different weights must be learned in this whole network, not including any bias.
📗 Answer: .




# Examples of Convolutional Neural Networks

📗 LeNet is a simple convolutional neural network: Link, Wikipedia.
📗 AlexNet is another deep CNN architecture: Wikipedia.
📗 InceptionNet (GoogLeNet) introduced the Inception module and auxiliary classifiers to improve training of CNNs with a large number of layers: Link.
➩ 1 by 1 convolutions are used to reduce the number of activation maps.
➩ Auxiliary classifiers are added so that the gradients in earlier layers do not become zero even when many of the weights in later layers are close to 0.
📗 ResNet introduces additional skip connections to improve training of networks that are very deep: Wikipedia.
📗 Adversarial attacks on CNNs have been proposed, motivating the training of more robust neural networks: Link.
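📗 The ResNet skip connection can be sketched as follows (a simplified sketch with dense layers standing in for convolutions; the function name `residual_block` is made up). Adding the input back to the transformed output gives gradients an identity path through the block, which is what makes very deep networks trainable:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def residual_block(x, W1, W2):
    """Residual block: output = g(F(x) + x), where F is a small
    two-layer transformation and x is added via the skip connection."""
    h = relu(W1 @ x)
    return relu(W2 @ h + x)   # the "+ x" is the skip connection

# With zero weights the block reduces to the identity on nonnegative inputs,
# illustrating that a residual block can easily represent "do nothing".
x = np.array([1.0, 2.0, 3.0])
out = residual_block(x, np.zeros((3, 3)), np.zeros((3, 3)))
```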



📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Yingyu Liang, and Charles Dyer.






Last Updated: July 03, 2024 at 12:23 PM