CS 540 Lecture Notes: Computer Vision

University of Wisconsin - Madison

C. R. Dyer

Computer Vision (Chapter 24.1, 24.5)

What is Computer Vision?

"The central problem of computer vision is from one or a sequence of images to understand the object or scene and its 3D properties." -- Y. Aloimonos
"Vision is the process by which descriptions of physical scenes are inferred from images of them." -- S. Zucker
"A process that produces from images of the external 3D world a description that is useful to the viewer and not cluttered by irrelevant information." -- D. Marr

Applications

Medical image analysis
Aerial photo interpretation
Vehicle exploration and mobility
Material handling: for example, part sorting and picking
Inspection: for example, integrated circuit board and chip inspection
Assembly
Navigation
Human-computer interfaces: for example, handwriting recognition, optical character recognition (OCR), face recognition, gesture recognition, gaze tracking, 3D model acquisition
Multimedia: for example, video databases, image compression, image browsing, content-based retrieval
Telepresence/Tele-immersion/Tele-reality: for example, tele-medicine, virtual classrooms, video conferencing, interactive walkthroughs

Example: Vehicle Navigation using Neural Networks

The ALVINN system by D. Pomerleau is an example of a 2-layer, feedforward neural network for "vision-based lane keeping" that was described earlier in the section on Neural Networks.

Example: Face Recognition using an Eigenspace Representation

Problem Statement: Given a training set of M images, each of size N x N pixels, where each image contains a single person's face, approximately registered for face position, orientation, scale, and brightness, and a test image, determine if the person in the test image is one of the people in the training set, and, if so, indicate which person it is.
Need a similarity metric for measuring the "distance" between two face images
Need a way of representing face features to be compared within the similarity metric
One approach is due to M. Turk and A. Pentland (see "Eigenfaces for Recognition," J. Cognitive Neuroscience 3, 1991, pp. 71-86). An online description of their work and demos is available on the Eigenfaces/Photobook web page. For other information and research on the problem of face recognition, see the Face Recognition Home Page. For information on face detection, see the Face Detection Home Page.

Eigenspace Representation of Images

An N x N image can be "represented" as a point in an N² dimensional image space, where each dimension is associated with one of the pixels in the image and the possible values in each dimension are the possible gray levels of each pixel. For example, for a 512 x 512 image where each pixel is an integer in the range 0, ..., 255 (i.e., a pixel is stored in one byte), image space is a 262,144-dimensional space and each dimension has 256 possible values.
If we represented our M training images as M points in image space, then one way of recognizing the person in a new test image would be to find its nearest neighbor training image in image space. But this approach would be very slow since the size of image space is so large, and would not exploit the fact that since all of our images are of faces, they will likely be clustered relatively near one another in image space. So, instead, let's represent each image in a lower-dimensional feature space, called face space or eigenspace.
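As a rough illustration of the image-space view described above, the following sketch flattens small gray-level images into vectors and finds the nearest training image by Euclidean distance. The function names and the tiny 2 x 2 images are made up for this example; they are not from the lecture.

```python
import numpy as np

def to_image_vector(image):
    """Flatten an N x N gray-level image into a vector of length N*N (row by row)."""
    return np.asarray(image, dtype=float).reshape(-1)

def nearest_neighbor(test_vec, training_vecs):
    """Return (index, distance) of the training vector closest to test_vec."""
    dists = np.linalg.norm(training_vecs - test_vec, axis=1)  # one distance per row
    k = int(np.argmin(dists))
    return k, float(dists[k])

# Hypothetical toy data: M = 2 training images of size 2 x 2 (so N*N = 4).
train = np.stack([
    to_image_vector([[0, 0], [10, 10]]),
    to_image_vector([[10, 0], [10, 0]]),
])
test = to_image_vector([[1, 0], [9, 10]])

print(512 * 512)                      # dimensions of image space for a 512 x 512 image
print(nearest_neighbor(test, train))  # index of the closest training image and its distance
```

Each distance computation touches all N² pixels, which is the cost the eigenspace representation is meant to reduce.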
Say we have M' images, E₁, E₂, ..., E_M', called eigenfaces or eigenvectors. These images define a basis set, so that each face image will be defined in terms of how similar it is to each of these basis images. That is, we can represent an arbitrary image I as a weighted (linear) combination of these eigenvectors as follows:
1. Compute the average image, A, from all of the training images I₁, I₂, ..., I_M:
```
        M
      -----
    1 \
A = -  \   I_i
    M  /
      /
      -----
       i=1
```
2. For k = 1, ..., M' compute a real-valued weight, w_k, indicating the similarity between the input image, I, and the kth eigenvector, E_k:
```
     w_k = E_k^T  * (I - A)
```
  where I is a given image and is represented as a column vector of length N², E_k is the kth eigenface image and is a column vector of length N², A is a column vector of length N², * is the dot product operation, and - is pixel by pixel subtraction. Thus w_k is a real-valued scalar.
3. W = [w₁, w₂, ..., w_M']^T is a column vector of weights that indicates the contribution of each eigenface image in representing image I. So, instead of representing image I in image space, we'll represent it as a point W in the M'-dimensional weight space that we'll call face space or eigenspace. Hence, each image is projected from a point in the high dimensional image space down to a point in the much lower dimensional eigenspace. In terms of compression, each image is represented by M' real numbers, which means that for a typical value of M'=10 and 32 bits per weight, we need only 320 bits/image to encode it in face space. (Of course, we must also store the M' eigenface images, which are each N² pixels, but this cost is amortized over all of the training images, so it can be considered to be a small additional cost.)
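The three steps above can be sketched in a few lines of NumPy. This is only an illustration: the function names are hypothetical, and the two "eigenfaces" are hand-picked orthonormal vectors rather than the result of Principal Components Analysis.

```python
import numpy as np

def average_image(training):
    """Step 1: A = (1/M) * sum_i I_i, with one training image per row."""
    return training.mean(axis=0)

def project(image, eigenfaces, A):
    """Steps 2-3: W[k] = E_k^T (I - A), with one eigenface E_k per row."""
    return eigenfaces @ (image - A)

# Hypothetical toy data: M = 3 training images with N*N = 4 pixels each.
training = np.array([[2.0, 0.0, 2.0, 0.0],
                     [0.0, 2.0, 0.0, 2.0],
                     [4.0, 4.0, 4.0, 4.0]])
# Two made-up orthonormal basis images (M' = 2); real eigenfaces come from PCA.
eigenfaces = np.array([[0.5,  0.5, 0.5,  0.5],
                       [0.5, -0.5, 0.5, -0.5]])

A = average_image(training)
W = project(np.array([4.0, 0.0, 4.0, 0.0]), eigenfaces, A)
print(A)   # [2. 2. 2. 2.]
print(W)   # [0. 4.]

# Storage comparison using the numbers quoted in the text.
bits_face_space = 10 * 32          # M' = 10 weights at 32 bits each
bits_image_space = 512 * 512 * 8   # 512 x 512 pixels at one byte each
print(bits_face_space, bits_image_space)   # 320 2097152
```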
Notice that image I can be approximately reconstructed from W as follows:
```
           M'
         -----
         \
I ~ A +   \    w_i * E_i
          /
         /
         -----
          i=1
```
This reconstruction will be exact if M' = min(M, N²). In practice we choose M' to be much smaller than this, so the representation in eigenspace won't be exact (the image can't be perfectly reconstructed from W), but it will be a pretty good approximation that's sufficient for differentiating between faces.
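To make the trade-off concrete, here is a small sketch that reconstructs an image from its weights and measures the error. It assumes an orthonormal basis that is chosen by hand (not computed by PCA) and spans all of a 4-pixel image space, so using every basis vector gives an exact reconstruction while using only two does not. All names and values are hypothetical.

```python
import numpy as np

def project(image, eigenfaces, A):
    """W[k] = E_k^T (I - A), one basis image per row of `eigenfaces`."""
    return eigenfaces @ (image - A)

def reconstruct(W, eigenfaces, A):
    """I ~ A + sum_i W[i] * E_i."""
    return A + W @ eigenfaces

def reconstruction_error(image, eigenfaces, A):
    """Euclidean distance between an image and its reconstruction."""
    approx = reconstruct(project(image, eigenfaces, A), eigenfaces, A)
    return float(np.linalg.norm(image - approx))

# A complete orthonormal basis for 4-pixel images (illustrative only).
basis = 0.5 * np.array([[1,  1,  1,  1],
                        [1, -1,  1, -1],
                        [1,  1, -1, -1],
                        [1, -1, -1,  1]], dtype=float)
A = np.array([2.0, 2.0, 2.0, 2.0])   # assumed average image
I = np.array([5.0, 1.0, 3.0, 3.0])   # assumed input image

print(reconstruction_error(I, basis[:2], A))  # M' = 2: approximate
print(reconstruction_error(I, basis, A))      # M' = 4: exact up to rounding
```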
Now, select a value for M' and then determine the M' "best" eigenvector images (i.e., eigenfaces). How?
Answer: Use the statistical technique called Principal Components Analysis (also called the Karhunen-Loeve transform in communications theory). Intuitively, this technique selects the M' directions in image space along which the training images vary the most, which maximizes the information content in the compressed (i.e., eigenspace) representation.
The best M' eigenface images are computed as follows:
1. For each training image I_i, normalize it by subtracting the mean (i.e., the "average image"): Y_i = I_i - A
2. Compute the N² x N² Covariance Matrix:
```
        M
      -----
    1 \
C = -  \    Y_i Y_i^T
    M  /
      /
      -----
       i=1
```
3. Find the eigenvectors of C that are associated with the M' largest eigenvalues. Call the eigenvectors E₁, E₂, ..., E_M'. These are the eigenface images used by the algorithm given above.
Note: C is very large (N² x N²), so this method is computationally very intensive. However, there are relatively fast methods for finding the eigenvectors associated with the k largest eigenvalues, which is all we need.
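The three steps can be sketched with NumPy as follows. The first function follows the steps literally and builds the full N² x N² matrix C. The second shows one commonly used shortcut for the case M << N²: it works with the much smaller M x M matrix Y Y^T / M and maps its eigenvectors back to image space. Function names and the random test data are assumptions made for this illustration, and the shortcut assumes the requested eigenvalues are nonzero.

```python
import numpy as np

def eigenfaces_direct(training, num_components):
    """Steps 1-3 as written: eigenvectors of the N^2 x N^2 covariance matrix C."""
    M = training.shape[0]
    A = training.mean(axis=0)
    Y = training - A                       # rows are Y_i = I_i - A
    C = (Y.T @ Y) / M                      # equals (1/M) * sum_i Y_i Y_i^T
    eigvals, eigvecs = np.linalg.eigh(C)   # C is symmetric; eigenvalues ascending
    order = np.argsort(eigvals)[::-1][:num_components]
    return A, eigvecs[:, order].T, eigvals[order]

def eigenfaces_small(training, num_components):
    """Shortcut for M << N^2: use the M x M matrix L = Y Y^T / M instead of C."""
    M = training.shape[0]
    A = training.mean(axis=0)
    Y = training - A
    L = (Y @ Y.T) / M
    eigvals, V = np.linalg.eigh(L)
    order = np.argsort(eigvals)[::-1][:num_components]
    # If L v = lam * v, then C (Y^T v) = lam * (Y^T v), so Y^T v is an eigenvector of C.
    E = (Y.T @ V[:, order]).T
    E /= np.linalg.norm(E, axis=1, keepdims=True)   # rescale to unit length
    return A, E, eigvals[order]

# Hypothetical data: M = 5 random "images" with N*N = 16 pixels, keep M' = 3.
rng = np.random.default_rng(0)
training = rng.random((5, 16))
A1, E_big, vals_big = eigenfaces_direct(training, 3)
A2, E_small, vals_small = eigenfaces_small(training, 3)

print(np.allclose(vals_big, vals_small))                           # same eigenvalues?
print(np.allclose(np.abs(np.sum(E_big * E_small, axis=1)), 1.0))   # same vectors up to sign?
```

Eigenvectors are only determined up to sign, which is why the comparison uses the absolute value of the dot product.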

Face Recognition Algorithm

The entire face recognition algorithm can now be given:

1. Given a training set of face images, compute the eigenvectors E₁, E₂, ..., E_M' associated with the M' largest eigenvalues. M' = 10 or 20 is a typical value used. Notice that this step is done once "offline."
2. For each different person in the training set, compute the point associated with that person in eigenspace. That is, use the formula given above to compute W = [w₁, ..., w_M']. Note that this step is also done once offline.
3. Given a test image, I_test, project it to the M'-dimensional eigenspace by computing the point W_test, again using the formula given above.
4. Find the closest training face to the given test face:
```
d = min || W_test - W_k ||
     k
```
where W_k is the point in eigenspace associated with the kth person in the training set, and || X || denotes the Euclidean norm defined as (x₁² + x₂² + ... + x_n²)^(1/2) where X is the vector [x₁, x₂, ..., x_n].
5. Find the distance of the test image from eigenspace (that is, compute the projection distance so that we can estimate the likelihood that the image contains a face):
```
dffs = || Y - Yf ||
```
where Y = I_test - A, and Yf is the sum over i = 1, ..., M' of (w_test,i * E_i).

If dffs < Threshold1
    ; Test image is "close enough" to the eigenspace
    ; associated with all of the training faces to
    ; believe that this test image is likely to be some
    ; face (and not a house or a tree or something
    ; other than a face)
then if d < Threshold2
     then classify I_test as containing the face of person k,
          where k is the closest face in the eigenspace to
          W_test, the projection of I_test to eigenspace
     else classify I_test as an unknown person
else classify I_test as not containing a face
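The test-time part of the algorithm (projection, the two distances d and dffs, and the two-threshold decision) might look like the following sketch. The function name, the return convention, the toy basis, and both threshold values are hypothetical; the lecture does not say how the thresholds are chosen.

```python
import numpy as np

def recognize(I_test, A, eigenfaces, known_points, threshold1, threshold2):
    """Classify I_test as a known person, an unknown person, or not a face.

    eigenfaces:   one (orthonormal) eigenface per row, shape (M', N*N)
    known_points: one eigenspace point W_k per known person, shape (P, M')
    """
    Y = I_test - A
    W_test = eigenfaces @ Y                  # project into eigenspace
    Yf = W_test @ eigenfaces                 # part of Y that lies in eigenspace
    dffs = np.linalg.norm(Y - Yf)            # distance from eigenspace
    dists = np.linalg.norm(known_points - W_test, axis=1)
    k = int(np.argmin(dists))                # closest known person
    d = dists[k]
    if dffs < threshold1:
        if d < threshold2:
            return ("person", k)
        return ("unknown person", None)
    return ("not a face", None)

# Hypothetical setup: 4-pixel images, M' = 2, two known people.
A = np.zeros(4)
eigenfaces = 0.5 * np.array([[1, 1, 1, 1], [1, -1, 1, -1]], dtype=float)
known_points = np.array([[4.0, 0.0], [0.0, 4.0]])
T1, T2 = 1.0, 2.0   # made-up thresholds

print(recognize(np.array([2.0, 2.0, 2.0, 2.0]), A, eigenfaces, known_points, T1, T2))
print(recognize(np.array([0.0, 0.0, 0.0, 0.0]), A, eigenfaces, known_points, T1, T2))
print(recognize(np.array([3.0, 3.0, -3.0, -3.0]), A, eigenfaces, known_points, T1, T2))
```

The three calls exercise the three branches: a known face, a face-like image far from every stored person, and an image that lies far from eigenspace.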

Example

Say we have two 3 x 3 training images, so N=3 and M=2, defined as follows:

Image I₁
0	0	0
10	10	10
0	0	0

Image I₂
0	10	0
0	10	0
0	10	0

We represent these two images as column vectors of length 3*3=9, so we have

I₁ = [0 0 0 10 10 10 0 0 0]^T
I₂ = [0 10 0 0 10 0 0 10 0]^T

Now assume that we use a subspace of dimension 1, i.e., M'=1, and the eigenvector computed from the two training images is:

E₁ = [5 0 5 10 5 10 5 0 5]^T

(Note: This is not the true eigenvector but is used here to keep the example simple.)

The average image, A, is computed from I₁ and I₂ by computing for each pixel, the average gray level from the two images' corresponding pixels. Thus, the second pixel in A is (0+10)/2 = 5. Hence,

A = [0 5 0 5 10 5 0 5 0]^T

We can now compute how the first training image, I₁, is projected into the one-dimensional eigenspace by computing W₁ = [w_1,1], where w_1,1 = E₁^T * (I₁ - A). So, here we have

I_1' = I₁ - A = [0 -5 0 5 0 5 0 -5 0]^T
w_1,1 = 5*0 + 0*(-5) + 5*0 + 10*5 + 5*0 + 10*5 + 5*0 + 0*(-5) + 5*0 = 100
W₁ = [100]

In other words, image I₁ projects to the point 100 in this one-dimensional subspace defined by basis image E₁.

Similarly, for I₂ we get W₂ = [w_2,1], where

w_2,1 = E₁^T * (I₂ - A)
    = [5 0 5 10 5 10 5 0 5] * [0 5 0 -5 0 -5 0 5 0]^T
    = -100

So, W₂ = [-100].

Now, say we are given the following test image

Image I_test
0	7	3
0	10	10
0	10	0

Projecting I_test into face space we get W_test = [w_test,1], where

w_test,1 = E₁^T * (I_test - A)
    = [5 0 5 10 5 10 5 0 5] * [0 2 3 -5 0 5 0 5 0]^T
    = 15

So, W_test = [15]. Its distance to W₂ = [-100] is |15 - (-100)| = 115, which is larger than its distance to W₁, so I_test is more similar to image I₁ than to image I₂. Therefore, we would classify I_test as the same class as I₁.
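The arithmetic in this example is easy to check mechanically. The short script below recomputes the average image, the three projections onto E₁, and the distances from W_test to each training point, using only the vectors given in the example; the helper names are made up.

```python
def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

I1 = [0, 0, 0, 10, 10, 10, 0, 0, 0]
I2 = [0, 10, 0, 0, 10, 0, 0, 10, 0]
I_test = [0, 7, 3, 0, 10, 10, 0, 10, 0]
E1 = [5, 0, 5, 10, 5, 10, 5, 0, 5]

# Average image: pixel-by-pixel mean of the two training images.
A = [(a + b) / 2 for a, b in zip(I1, I2)]

def project(image):
    """w = E1^T * (image - A) for the one-dimensional eigenspace."""
    return dot(E1, [p - a for p, a in zip(image, A)])

w1, w2, w_test = project(I1), project(I2), project(I_test)
print(A)                                        # [0.0, 5.0, 0.0, 5.0, 10.0, 5.0, 0.0, 5.0, 0.0]
print(w1, w2, w_test)                           # 100.0 -100.0 15.0
print(abs(w_test - w1), abs(w_test - w2))       # 85.0 115.0
```

Since 85 < 115, the test image is closer to I₁ in this one-dimensional face space.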

Face Recognition Accuracy and Extensions to Eigenspace Approach

Using a 20-dimensional eigenspace resulted in about 95% correct classification on a database of about 7,500 images of about 3,000 people
If the training set contains multiple images of each person, then for each person compute the average point in eigenspace from the points computed for each image of that person
The method requires that all images in the database contain faces of about the same size, position, and orientation, so they can be compared using this global distance function in eigenspace
If there are multiple images of a 3D object (e.g., a person's head from many different positions and orientations), then the points in eigenspace corresponding to the different 3D views can be combined by fitting a hypersurface to all the points, and storing this hypersurface in eigenspace as the description of that person. Then, classify a test image as the person corresponding to the closest hypersurface
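The simplest of these extensions, averaging the eigenspace points of each person, can be sketched as follows. The function names, person labels, and points are hypothetical, and the hypersurface-fitting variant is not shown.

```python
from collections import defaultdict
import numpy as np

def person_centroids(points, labels):
    """Average the eigenspace points of each person; returns {person: mean point}."""
    groups = defaultdict(list)
    for W, person in zip(points, labels):
        groups[person].append(np.asarray(W, dtype=float))
    return {person: np.mean(Ws, axis=0) for person, Ws in groups.items()}

def closest_person(W_test, centroids):
    """Return the person whose average point is nearest to W_test."""
    return min(centroids, key=lambda p: np.linalg.norm(centroids[p] - W_test))

# Hypothetical eigenspace points (M' = 2) for two people, two images each.
points = [[1.0, 0.0], [3.0, 0.0], [0.0, 5.0], [0.0, 7.0]]
labels = ["person_a", "person_a", "person_b", "person_b"]

centroids = person_centroids(points, labels)
print(centroids["person_a"], centroids["person_b"])     # [2. 0.] [0. 6.]
print(closest_person(np.array([1.0, 1.0]), centroids))  # person_a
```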

Applications of Eigenfaces

There are a variety of commercial products that are now available based on the eigenface method. See, for example, TrueFace PC, which does computer logins by face recognition, and Viisage and Identix for face recognition products for various biometric applications.