CS 540 Lecture Notes, Fall 1996
Computer Vision (Chapter 24)
What is Computer Vision?
- "The central problem of computer vision is from one or
a sequence of images to understand the object or scene and its
3D properties." -- Y. Aloimonos
- "Vision is the process by which descriptions of physical
scenes are inferred from images of them." -- S. Zucker
- "A process that produces from images of the external 3D world
a description that is useful to the viewer and not cluttered
by irrelevant information." -- D. Marr
Applications
- Medical image analysis
- Aerial photo interpretation
- Vehicle exploration and mobility
- Material handling
For example, part sorting and picking
- Inspection
For example, integrated circuit board and chip inspection
- Assembly
- Navigation
- Human-computer interfaces
For example, handwriting recognition, OCR, face recognition,
gesture recognition, gaze tracking, 3D model acquisition
- Multimedia
For example, video databases,
image compression, image browsing, content-based retrieval
- Telepresence/Tele-reality
For example, tele-medicine, virtual classrooms, video conferencing,
interactive walkthroughs
Example: Vehicle Navigation using Neural Networks
The ALVINN system by D. Pomerleau is an example of a 2-layer,
feedforward neural network for "vision-based lane keeping"
that was described earlier in the section on
Neural Networks.
Example: Face Recognition using an Eigenspace Representation
- Problem Statement: Given a training set of M images,
each of size N x N pixels,
where each image contains a single person's face, approximately
registered for face position, orientation, scale, and brightness,
and a test image, determine if the person in the test image is
one of the people in the training set, and, if so, indicate which
person it is.
- Need a similarity metric for measuring the "distance" between
two face images
- Need a way of representing face features to be compared within
the similarity metric
- One approach is due to M. Turk and A. Pentland (see
"Eigenfaces for Recognition," J. Cognitive Neuroscience 3,
1991, pp. 71-86). An online description of their current work
and demos is available at their
Eigenfaces/Photobook web page.
For other information and research on the problem of face
recognition, see the
Face Recognition Home Page.
Eigenspace Representation of Images
- An N x N image can be "represented" as a point in
an N^2 dimensional
image space, where each dimension is associated with one of
the pixels in the image and the possible values in each dimension
are the possible gray levels of each pixel. For example, for
a 512 x 512 image where each pixel is an integer in the range 0, ..., 255
(i.e., a pixel is stored in one byte), image space is a
262,144-dimensional space and each dimension has 256 possible values.
- If we represented our M training images as M points in image
space, then one way of recognizing the person in a new test image
would be to find its nearest neighbor training image in image space.
But this approach would be very slow since the size of image space
is so large, and would not exploit the fact that since all of our
images are of faces, they will likely be clustered relatively near
one another in image space. So, instead, let's represent each image
in a lower-dimensional feature space, called face space or
eigenspace.
- Say we have M' images, E1, E2, ..., EM', called
eigenfaces or eigenvectors. These images define a
basis set, so that each face image will be defined in terms
of how similar it is to each of these basis images. That is, we
can represent an arbitrary image I as a weighted combination of these
eigenvectors as follows:
- Compute the average image, A, from all of the training images
I1, I2, ..., IM:
A = (1/M) * sum_{i=1}^{M} Ii
- For k = 1, ..., M' do
wk = Ek^T * (I - A)
where I is a given image represented as a column vector of
length N^2, Ek is the kth eigenface image (also a column vector
of length N^2), A is a column vector of length N^2, ^T denotes
transpose, * is the dot product operation, and - is pixel-by-pixel
subtraction. Thus wk is a real-valued scalar.
- W = [w1, w2, ..., wM'] is a column vector of weights
that indicates the contribution of each eigenface image in representing
image I. So, instead of representing image I in image
space, we'll represent it as a point W
in the M'-dimensional weight space that we'll call face space
or eigenspace. Hence, each image is projected from
a point in the high dimensional image space down to a point in the
much lower dimensional eigenspace.
- Notice that image I can be reconstructed from W
as follows:
I = A + sum_{i=1}^{M'} wi * Ei
though in general this reconstruction is exact only if M' = min(M, N^2).
Hence, the eigenspace representation of an image is lossy in that the
image cannot be reconstructed exactly, but the approximation is good
enough for differentiating between faces.
- Now, select a value for M' and then determine the
M' "best" eigenvector images (i.e., eigenfaces). How?
Answer: Use the statistics technique called
Principal Components Analysis (also called the
Karhunen-Loeve transform in communications theory).
Intuitively, this technique selects the M' images
that maximize the information content in the compressed (i.e.,
eigenspace) representation.
The best M' eigenface images are computed as follows:
- For each training image Ii, normalize it by subtracting
the mean (i.e., the "average image"): Yi = Ii - A
- Compute the N^2 x N^2 Covariance Matrix:
C = (1/M) * sum_{i=1}^{M} Yi * Yi^T
- Find the eigenvectors of C that are associated with
the M' largest eigenvalues. Call the eigenvectors
E1, E2, ..., EM'. These are the eigenface images
used by the algorithm given above.
Note: C is very large (N^2 x N^2), so this method is computationally
very intensive. However, there are relatively fast methods for
finding just the M' largest eigenvectors, which is all we need; the
sketch after this list uses one such shortcut, based on an SVD of the
much smaller M x N^2 data matrix.
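To make the offline computation above concrete, here is a minimal NumPy
sketch (not part of the original notes; the function and variable names
are illustrative). It computes the average image A, the top M' eigenfaces,
and the projection weights W for an image, using an SVD of the centered
data matrix rather than forming the N^2 x N^2 covariance matrix C
explicitly.

    import numpy as np

    def compute_eigenfaces(images, num_eigenfaces):
        # images: array of shape (M, N*N), one flattened training image per row.
        # Returns the average image A (length N*N) and an array E of shape
        # (num_eigenfaces, N*N) holding one eigenface per row.
        A = images.mean(axis=0)                 # average image A
        Y = images - A                          # normalized images Yi = Ii - A
        # The eigenvectors of C = (1/M) * sum Yi Yi^T are the right singular
        # vectors of Y, so an SVD of the M x N^2 matrix Y avoids ever forming
        # the N^2 x N^2 covariance matrix.
        U, S, Vt = np.linalg.svd(Y, full_matrices=False)
        E = Vt[:num_eigenfaces]                 # top M' eigenfaces (unit vectors)
        return A, E

    def project(image, A, E):
        # wk = Ek^T * (I - A) for each eigenface Ek; returns the weight vector W.
        return E @ (image - A)

Calling project(Ii, A, E) on each training image gives the points in
eigenspace used by the recognition algorithm below.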
Face Recognition Algorithm
The entire face recognition algorithm can now be given (a code sketch of
the decision procedure follows the steps below):
- Given a training set of face images, compute the M'
largest eigenvectors, E1, E2, ..., EM'.
M' = 10 or 20 is a typical value used. Notice that this
step is done once "offline."
- For each different person in the training set, compute the
point associated with that person in eigenspace. That is, use
the formula given above to compute W = [w1, ..., wM'].
Note that this step is also done once offline.
- Given a test image, Itest, project it to the M'-dimensional
eigenspace by computing the point Wtest, again using the formula
given above.
- Find the closest training face to the given test face:
d = min_k || Wtest - Wk ||
where Wk is the point in eigenspace associated with the kth person
in the training set, and || * || denotes the Euclidean distance in
eigenspace.
- Find the distance of the test image from eigenspace (that is,
compute the projection distance so that we can estimate the likelihood
that the image contains a face):
dffs = || Y - Yf ||
where Y = Itest - A, and Yf = sum_{i=1}^{M'} wi * Ei, where the wi
are the components of Wtest.
- If dffs < Threshold1
; Test image is "close enough" to the eigenspace
; associated with all of the training faces to
; believe that this test image is likely to be some
; face (and not a house or a tree or something
; other than a face)
then if d < Threshold2
then classify Itest as containing the face of person k,
where k is the closest face in the eigenspace to
Wtest, the projection of Itest to eigenspace
else classify Itest as an unknown person
else classify Itest as not containing a face
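Continuing the sketch above (same assumed helper names), the decision
rule with the two thresholds might look as follows; the thresholds and
the people dictionary are placeholders to be chosen for a particular
face database.

    import numpy as np

    def recognize(test_image, A, E, people, threshold1, threshold2):
        # people: dict mapping a person's id to that person's point W in eigenspace.
        Y = test_image - A                      # Y = Itest - A
        W_test = E @ Y                          # project Itest to eigenspace
        Y_f = E.T @ W_test                      # Yf = sum_i wi * Ei
        d_ffs = np.linalg.norm(Y - Y_f)         # distance from face space
        if d_ffs >= threshold1:
            return "not a face"                 # too far from face space
        # distance to the closest person's point in eigenspace
        best_id, d = min(((pid, np.linalg.norm(W_test - W_k))
                          for pid, W_k in people.items()),
                         key=lambda pair: pair[1])
        return best_id if d < threshold2 else "unknown person"

Each entry of people could be, for example, the eigenspace point of that
person's single training image, or the average of the points for several
images of that person, as noted in the next section.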
Face Recognition Accuracy and Extensions to Eigenspace Approach
- Performance using a 20-dimensional eigenspace resulted in
about 95% correct classification on a database of about 7,500 images
of about 3,000 people
- If the training set contains multiple images of each person, then
for each person compute the average point in eigenspace from the
points computed for each image of that person
- Method requires that all images in the database contain faces
of about the same size, position, and orientation, so they can be
compared using this global distance function in eigenspace
- If there are multiple images of a 3D object (e.g., a person's
head from many different positions and orientations), then the points
in eigenspace corresponding to the different 3D views can be
combined by fitting a hypersurface to all the points, and storing
this hypersurface in eigenspace as the description of that person.
Then, classify a test image as the person corresponding to the
closest hypersurface
Template Matching
- One common approach to detecting image features of various kinds
is to define a template, which is usually an m x n
"window" or sub-image representing some pattern of interest. For
example, we could define templates for left-eye, right-eye, nose, and
mouth, and then use these to detect face features. If these four
templates match in the right relative locations to one another, then
we can detect a face. Two major advantages of using multiple "local"
templates such as these for face recognition are that (1) they allow
us to focus attention on the key features for recognizing faces, and
(2) they make the algorithm more robust in the sense that occlusion or
noise in other parts of the face has no effect on our ability to
recognize a face.
- To use templates, we must define a measure for matching a template
with various positions (and possibly orientations) in an image
to determine the best match locations
- Sum of Squared Difference
Let f denote the template, and g denote an image,
then one measure of match is called the Sum of Squared Difference
and is defined by:
SSD(x,y) = sum_u sum_v (f(u,v) - g(x+u, y+v))^2
Example
Assume the image and template are one-dimensional, indexed from 0,
and defined as follows:
f = 1 2 3
g = 0 0 0 1 2 3 3 3
SSD = 14 9 3 0 2 5 14 17
For example, SSD(0) = (1-0)^2 + (2-0)^2 + (3-0)^2 = 14
- SSD >= 0, where 0 indicates a perfect match and a large
number means a poor match. Hence SSD can be considered as a
"mismatch" measure.
- Values of f and g
that are not given are assumed to be 0
- We can expand the above formula as follows:
Sum Sum (f - g)^2 = Sum Sum f^2 + Sum Sum g^2 - 2 Sum Sum fg
Since Sum Sum f^2 is fixed (the template does not change) and
Sum Sum g^2 is roughly constant over nearby positions, minimizing SSD
roughly corresponds to maximizing Sum Sum fg, so we can use Sum Sum fg
as a match measure, leading to the following alternative to SSD.
- Cross-Correlation
CC(x,y) = sum_u sum_v f(u,v) * g(x+u, y+v)
The only problem with this definition is that CC depends on the
overall magnitude of the pixel values in the image, which varies over
the positions where the template is overlaid. So, to avoid this
dependence, we usually use the Normalized Cross-Correlation:
NCC(x,y) = CC(x,y) / ( sum_u sum_v g(x+u, y+v)^2 )^(1/2)
So, for the above example we get:
CC      =  0   3    8    14   17   18   ...
Sum g^2 =  0   1    5    14   22   27   ...
NCC     =  -   3.0  3.6  3.7  3.6  3.5  ...
- 0 <= NCC <= (Sum Sum f^2)^(1/2)
and 0 means a poor match and a large value means a good match
- If the template is size m x m and
the image is size n x n, then (m^2)(n^2)
multiplications are required to compute CC.
- The degree of match falls off slowly away from the position
of a perfect match, implying that it may be hard to localize the
best match position precisely. A short sketch that reproduces the
1-D example above in code follows this list.
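Here is the promised sketch (again not part of the original notes). It
computes SSD, CC, and NCC for f = 1 2 3 and g = 0 0 0 1 2 3 3 3, treating
values of g beyond the end of the array as 0, exactly as in the tables
above.

    import numpy as np

    f = np.array([1, 2, 3])                       # 1-D template
    g = np.array([0, 0, 0, 1, 2, 3, 3, 3])        # 1-D image
    g_pad = np.concatenate([g, np.zeros(len(f) - 1, dtype=int)])  # missing values = 0

    ssd, cc, ncc = [], [], []
    for x in range(len(g)):
        window = g_pad[x:x + len(f)]              # part of g under the template
        ssd.append(int(np.sum((f - window) ** 2)))    # mismatch: 0 = perfect match
        cc.append(int(np.sum(f * window)))            # cross-correlation
        energy = np.sqrt(np.sum(window ** 2))         # sqrt of Sum g^2 under the window
        ncc.append(round(float(cc[-1] / energy), 1) if energy > 0 else None)

    print(ssd)    # [14, 9, 3, 0, 2, 5, 14, 17] -- matches the SSD row above
    print(cc)     # [0, 3, 8, 14, 17, 18, 9, 3] -- first six match the CC row above
    print(ncc)    # [None, 3.0, 3.6, 3.7, 3.6, 3.5, 2.1, 1.0]

Note that the SSD minimum and the NCC maximum both occur at position 3,
where the template matches the image exactly.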
Edge Detection
A problem that is closely related to template matching is detecting
edges, which are one of the most basic low-level features that are
commonly detected in images. An edge is a pixel where there is a
large change in the intensity (brightness) value in a local neighborhood
around the given pixel. The following figure shows a one-dimensional
cross-section of an image in terms of the pixel intensity values
in the row corresponding to y = y0.
There are many physical causes of edges occurring in images.
The main reasons are (1) depth discontinuity,
(2) surface orientation discontinuity,
(3) reflectance discontinuity, and
(4) illumination discontinuity. Examples
of these four are shown in the following image of a scene
containing a cylindrical object.
There are many different approaches to edge detection, but one
that bears a close relationship to the template matching method
described earlier is to define a template corresponding to a
straight edge passing through the center of the template at some
desired orientation. A set of templates is defined, one for each
orientation of interest. So, for example, we might define templates for
detecting edges at orientations of 0 (i.e., a horizontal edge), 45, 90 (i.e.,
a vertical edge), and 135 degrees
relative to the x-axis as follows:
[Figure: the four edge templates, one for each of the orientations 0, 45, 90, and 135 degrees]
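Since the template images themselves are not reproduced above, the
following sketch shows one plausible choice of 3 x 3 oriented edge masks
(Prewitt-style "compass" masks; these are an assumption, not necessarily
the masks in the original figure) and scores an image patch against each
of them by cross-correlation.

    import numpy as np

    # Hypothetical 3 x 3 edge templates, one per orientation: positive weights
    # on one side of the edge, negative weights on the other, zeros along it.
    templates = {
          0: np.array([[-1, -1, -1],
                       [ 0,  0,  0],
                       [ 1,  1,  1]]),    # horizontal edge
         45: np.array([[ 0,  1,  1],
                       [-1,  0,  1],
                       [-1, -1,  0]]),    # diagonal edge
         90: np.array([[-1,  0,  1],
                       [-1,  0,  1],
                       [-1,  0,  1]]),    # vertical edge
        135: np.array([[ 1,  1,  0],
                       [ 1,  0, -1],
                       [ 0, -1, -1]]),    # the other diagonal
    }

    def edge_orientation(patch):
        # Score a 3 x 3 patch against each template by cross-correlation and
        # return the orientation with the strongest response (in magnitude).
        scores = {theta: float(np.sum(t * patch)) for theta, t in templates.items()}
        return max(scores, key=lambda theta: abs(scores[theta])), scores

    # A patch that is dark on top and bright on the bottom, i.e. a horizontal edge.
    patch = np.array([[ 10,  10,  10],
                      [ 10,  10,  10],
                      [200, 200, 200]])
    print(edge_orientation(patch))    # the 0-degree (horizontal) template wins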
Last modified December 10, 1996
Copyright © 1996 by Charles R. Dyer. All rights reserved.