CS 540 Lecture Notes, Fall 1996
Computer Vision (Chapter 24)
What is Computer Vision?
- "The central problem of computer vision is from one or
a sequence of images to understand the object or scene and its
3D properties." -- Y. Aloimonos
- "Vision is the process by which descriptions of physical
scenes are inferred from images of them." -- S. Zucker
- "A process that produces from images of the external 3D world
a description that is useful to the viewer and not cluttered
by irrelevant information." -- D. Marr
Applications
- Medical image analysis
- Aerial photo interpretation
- Vehicle exploration and mobility
- Material handling
For example, part sorting and picking
- Inspection
For example, integrated circuit board and chip inspection
- Assembly
- Navigation
- Human-computer interfaces
For example, handwriting recognition, OCR, face recognition,
gesture recognition, gaze tracking, 3D model acquisition
- Multimedia
For example, video databases,
image compression, image browsing, content-based retrieval
- Telepresence/Tele-reality
For example, tele-medicine, virtual classrooms, video conferencing,
interactive walkthroughs
Example: Vehicle Navigation using Neural Networks
The ALVINN system by D. Pomerleau is an example of a 2-layer,
feedforward neural network for "vision-based lane keeping"
that was described earlier in the section on
Neural Networks.
Example: Face Recognition using an Eigenspace Representation
- Problem Statement: Given a training set of M images,
each of size N x N pixels,
where each image contains a single person's face, approximately
registered for face position, orientation, scale, and brightness,
and a test image, determine if the person in the test image is
one of the people in the training set, and, if so, indicate which
person it is.
- Need a similarity metric for measuring the "distance" between
two face images
- Need a way of representing face features to be compared within
the similarity metric
- One approach is due to M. Turk and A. Pentland (see
"Eigenfaces for Recognition," J. Cognitive Neuroscience 3,
1991, pp. 71-86). An online description of their current work
and demos is available at their
Eigenfaces/Photobook web page.
For other information and research on the problem of face
recognition, see the
Face Recognition Home Page.
Eigenspace Representation of Images
- An N x N image can be "represented" as a point in
an N^2 dimensional
image space, where each dimension is associated with one of
the pixels in the image and the possible values in each dimension
are the possible gray levels of each pixel. For example, for
a 512 x 512 image where each pixel is an integer in the range 0, ..., 255
(i.e., a pixel is stored in one byte), image space is a
262,144-dimensional space and each dimension has 256 possible values.
- If we represented our M training images as M points in image
space, then one way of recognizing the person in a new test image
would be to find its nearest neighbor training image in image space.
But this approach would be very slow since the size of image space
is so large, and would not exploit the fact that since all of our
images are of faces, they will likely be clustered relatively near
one another in image space. So, instead, let's represent each image
in a lower-dimensional feature space, called face space or
eigenspace.
- Say we have M' images, E1, E2, ..., EM', called
eigenfaces or eigenvectors. These images define a
basis set, so that each face image will be defined in terms
of how similar it is to each of these basis images. That is, we
can represent an arbitrary image I as a weighted combination of these
eigenvectors as follows:
- Compute the average image, A, from all of the training images
I1, I2, ..., IM:
A = (1/M) * sum_{i=1}^{M} Ii
- For k = 1, ..., M' do
wk = Ek^T * (I - A)
where I is a given image represented as a column vector of
length N^2, Ek is the kth eigenface image (also a column vector
of length N^2), A is a column vector of length N^2, ^T denotes
transpose, * is the dot product operation, and - is pixel-by-pixel
subtraction. Thus wk is a real-valued scalar.
- W = [w1, w2, ..., wM'] is a column vector of weights
that indicates the contribution of each eigenface image in representing
image I. So, instead of representing image I in image
space, we'll represent it as a point W
in the M'-dimensional weight space that we'll call face space
or eigenspace. Hence, each image is projected from
a point in the high dimensional image space down to a point in the
much lower dimensional eigenspace.
- Notice that image I can be reconstructed from W
as follows:
I = A + sum_{i=1}^{M'} wi * Ei
though in general this reconstruction is exact only if M' = min(M, N^2).
Hence, the eigenspace representation of an image is lossy in that the
image cannot be reconstructed exactly, but the approximation is good
enough for differentiating between faces.
- Now, select a value for M' and then determine the
M' "best" eigenvector images (i.e., eigenfaces). How?
Answer: Use the statistics technique called
Principal Components Analysis (also called the
Karhunen-Loeve transform in communications theory).
Intuitively, this technique selects the M' images
that maximize the information content in the compressed (i.e.,
eigenspace) representation.
The best M' eigenface images are computed as follows:
- For each training image Ii, normalize it by subtracting
the mean (i.e., the "average image"): Yi = Ii - A
- Compute the N^2 x N^2 Covariance Matrix:
C = (1/M) * sum_{i=1}^{M} Yi * Yi^T
- Find the eigenvectors of C that are associated with
the M' largest eigenvalues. Call the eigenvectors
E1, E2, ..., EM'. These are the eigenface images
used by the algorithm given above.
Note: C is very large (N^2 x N^2), so this method is computationally
very intensive. However, there are relatively fast methods for
finding just the M' largest eigenvectors, which is all we need; the
sketch after this list uses one such shortcut, based on an SVD of the
much smaller M x N^2 data matrix.
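To make the offline computation above concrete, here is a minimal NumPy
sketch (not part of the original notes; the function and variable names
are illustrative). It computes the average image A, the top M' eigenfaces,
and the projection weights W for an image, using an SVD of the centered
data matrix rather than forming the N^2 x N^2 covariance matrix C
explicitly.

    import numpy as np

    def compute_eigenfaces(images, num_eigenfaces):
        # images: array of shape (M, N*N), one flattened training image per row.
        # Returns the average image A (length N*N) and an array E of shape
        # (num_eigenfaces, N*N) holding one eigenface per row.
        A = images.mean(axis=0)                 # average image A
        Y = images - A                          # normalized images Yi = Ii - A
        # The eigenvectors of C = (1/M) * sum Yi Yi^T are the right singular
        # vectors of Y, so an SVD of the M x N^2 matrix Y avoids ever forming
        # the N^2 x N^2 covariance matrix.
        U, S, Vt = np.linalg.svd(Y, full_matrices=False)
        E = Vt[:num_eigenfaces]                 # top M' eigenfaces (unit vectors)
        return A, E

    def project(image, A, E):
        # wk = Ek^T * (I - A) for each eigenface Ek; returns the weight vector W.
        return E @ (image - A)

Calling project(Ii, A, E) on each training image gives the points in
eigenspace used by the recognition algorithm below.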
Face Recognition Algorithm
The entire face recognition algorithm can now be given (a code sketch of
the decision procedure follows the steps below):
- Given a training set of face images, compute the M'
largest eigenvectors, E1, E2, ..., EM'.
M' = 10 or 20 is a typical value used. Notice that this
step is done once "offline."
- For each different person in the training set, compute the
point associated with that person in eigenspace. That is, use
the formula given above to compute W = [w1, ..., wM'].
Note that this step is also done once offline.
- Given a test image, Itest, project it to the M'-dimensional
eigenspace by computing the point Wtest, again using the formula
given above.
- Find the closest training face to the given test face:
d = min_k || Wtest - Wk ||
where Wk is the point in eigenspace associated with the kth person
in the training set, and || * || denotes the Euclidean distance in
eigenspace.
- Find the distance of the test image from eigenspace (that is,
compute the projection distance so that we can estimate the likelihood
that the image contains a face):
dffs = || Y - Yf ||
where Y = Itest - A, and Yf = sum_{i=1}^{M'} wi * Ei, where the wi
are the components of Wtest.
- If dffs < Threshold1
; Test image is "close enough" to the eigenspace
; associated with all of the training faces to
; believe that this test image is likely to be some
; face (and not a house or a tree or something
; other than a face)
then if d < Threshold2
then classify Itest as containing the face of person k,
where k is the closest face in the eigenspace to
Wtest, the projection of Itest to eigenspace
else classify Itest as an unknown person
else classify Itest as not containing a face
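Continuing the sketch above (same assumed helper names), the decision
rule with the two thresholds might look as follows; the thresholds and
the people dictionary are placeholders to be chosen for a particular
face database.

    import numpy as np

    def recognize(test_image, A, E, people, threshold1, threshold2):
        # people: dict mapping a person's id to that person's point W in eigenspace.
        Y = test_image - A                      # Y = Itest - A
        W_test = E @ Y                          # project Itest to eigenspace
        Y_f = E.T @ W_test                      # Yf = sum_i wi * Ei
        d_ffs = np.linalg.norm(Y - Y_f)         # distance from face space
        if d_ffs >= threshold1:
            return "not a face"                 # too far from face space
        # distance to the closest person's point in eigenspace
        best_id, d = min(((pid, np.linalg.norm(W_test - W_k))
                          for pid, W_k in people.items()),
                         key=lambda pair: pair[1])
        return best_id if d < threshold2 else "unknown person"

Each entry of people could be, for example, the eigenspace point of that
person's single training image, or the average of the points for several
images of that person, as noted in the next section.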
Face Recognition Accuracy and Extensions to Eigenspace Approach
- Performance using a 20-dimensional eigenspace resulted in
about 95% correct classification on a database of about 7,500 images
of about 3,000 people
- If the training set contains multiple images of each person, then
for each person compute the average point in eigenspace from the
points computed for each image of that person
- Method requires that all images in the database contain faces
of about the same size, position, and orientation, so they can be
compared using this global distance function in eigenspace
- If there are multiple images of a 3D object (e.g., a person's
head from many different positions and orientations), then the points
in eigenspace corresponding to the different 3D views can be
combined by fitting a hypersurface to all the points, and storing
this hypersurface in eigenspace as the description of that person.
Then, classify a test image as the person corresponding to the
closest hypersurface
Template Matching
- One common approach to detecting image features of various kinds
is to define a template, which is usually an m x n
"window" or sub-image representing some pattern of interest. For
example, we could define templates for left-eye, right-eye, nose, and
mouth, and then use these to detect face features. If these four
templates match in the right relative locations to one another, then
we can detect a face. Two major advantages of using multiple "local"
templates such as these for face recognition are that (1) they allow
us to focus attention on the key features for recognizing faces, and
(2) they make the algorithm more robust in the sense that occlusion or
noise in other parts of the face has no effect on our ability to
recognize a face.
- To use templates, we must define a measure for matching a template
with various positions (and possibly orientations) in an image
to determine the best match locations
- Sum of Squared Difference
Let f denote the template, and g denote an image,
then one measure of match is called the Sum of Squared Difference
and is defined by:
SSD(x,y) = sum_u sum_v (f(u,v) - g(x+u, y+v))^2
Example
Assume the image and template are one-dimensional, indexed from 0,
and defined as follows:
f = 1 2 3
g = 0 0 0 1 2 3 3 3
SSD = 14 9 3 0 2 5 14 17
For example, SSD(0) = (1-0)^2 + (2-0)^2 + (3-0)^2 = 14
- SSD >= 0, where 0 indicates a perfect match and a large
number means a poor match. Hence SSD can be considered as a
"mismatch" measure.
- Values of f and g
that are not given are assumed to be 0
- We can expand the above formula as follows:
Sum Sum (f - g)^2 = Sum Sum f^2 + Sum Sum g^2 - 2 Sum Sum fg
Since Sum Sum f^2 is fixed (the template does not change) and
Sum Sum g^2 is roughly constant over nearby positions, minimizing SSD
roughly corresponds to maximizing Sum Sum fg, so we can use Sum Sum fg
as a match measure, leading to the following alternative to SSD.
- Cross-Correlation
CC(x,y) = sum_u sum_v f(u,v) * g(x+u, y+v)
The only problem with this definition is that CC depends on the
overall magnitude of the pixel values in the image, which varies over
the positions where the template is overlaid. So, to avoid this
dependence, we usually use the Normalized Cross-Correlation:
NCC(x,y) = CC(x,y) / ( sum_u sum_v g(x+u, y+v)^2 )^(1/2)
So, for the above example we get:
CC      =  0   3    8    14   17   18   ...
Sum g^2 =  0   1    5    14   22   27   ...
NCC     =  -   3.0  3.6  3.7  3.6  3.5  ...
- 0 <= NCC <= (Sum Sum f^2)^(1/2)
and 0 means a poor match and a large value means a good match
- If the template is size m x m and
the image is size n x n, then (m^2)(n^2)
multiplications are required to compute CC.
- The degree of match falls off slowly away from the position
of a perfect match, implying that it may be hard to localize the
best match position precisely. A short sketch that reproduces the
1-D example above in code follows this list.
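Here is the promised sketch (again not part of the original notes). It
computes SSD, CC, and NCC for f = 1 2 3 and g = 0 0 0 1 2 3 3 3, treating
values of g beyond the end of the array as 0, exactly as in the tables
above.

    import numpy as np

    f = np.array([1, 2, 3])                       # 1-D template
    g = np.array([0, 0, 0, 1, 2, 3, 3, 3])        # 1-D image
    g_pad = np.concatenate([g, np.zeros(len(f) - 1, dtype=int)])  # missing values = 0

    ssd, cc, ncc = [], [], []
    for x in range(len(g)):
        window = g_pad[x:x + len(f)]              # part of g under the template
        ssd.append(int(np.sum((f - window) ** 2)))    # mismatch: 0 = perfect match
        cc.append(int(np.sum(f * window)))            # cross-correlation
        energy = np.sqrt(np.sum(window ** 2))         # sqrt of Sum g^2 under the window
        ncc.append(round(float(cc[-1] / energy), 1) if energy > 0 else None)

    print(ssd)    # [14, 9, 3, 0, 2, 5, 14, 17] -- matches the SSD row above
    print(cc)     # [0, 3, 8, 14, 17, 18, 9, 3] -- first six match the CC row above
    print(ncc)    # [None, 3.0, 3.6, 3.7, 3.6, 3.5, 2.1, 1.0]

Note that the SSD minimum and the NCC maximum both occur at position 3,
where the template matches the image exactly.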
Edge Detection
A problem that is closely related to template matching is detecting
edges, which are one of the most basic low-level features that are
commonly detected in images. An edge is a pixel where there is a
large change in the intensity (brightness) value in a local neighborhood
around the given pixel. The following figure shows a one-dimensional
cross-section of an image in terms of the pixel intensity values
in the row corresponding to y = y0.
There are many physical causes of edges occurring in images.
The main reasons are (1) depth discontinuity,
(2) surface orientation discontinuity,
(3) reflectance discontinuity, and
(4) illumination discontinuity. Examples
of these four are shown in the following image of a scene
containing a cylindrical object.
There are many different approaches to edge detection, but one
that bears a close relationship to the template matching method
described earlier is to define a template corresponding to a
straight edge passing through the center of the template at some
desired orientation. A set of templates is defined, one for each
orientation of interest. So, for example, we might define templates for
detecting edges at orientations of 0 (i.e., a horizontal edge), 45, 90 (i.e.,
a vertical edge), and 135 degrees
relative to the x-axis as follows:
[Figure: the four edge templates, one for each of the orientations 0, 45, 90, and 135 degrees]
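Since the template images themselves are not reproduced above, the
following sketch shows one plausible choice of 3 x 3 oriented edge masks
(Prewitt-style "compass" masks; these are an assumption, not necessarily
the masks in the original figure) and scores an image patch against each
of them by cross-correlation.

    import numpy as np

    # Hypothetical 3 x 3 edge templates, one per orientation: positive weights
    # on one side of the edge, negative weights on the other, zeros along it.
    templates = {
          0: np.array([[-1, -1, -1],
                       [ 0,  0,  0],
                       [ 1,  1,  1]]),    # horizontal edge
         45: np.array([[ 0,  1,  1],
                       [-1,  0,  1],
                       [-1, -1,  0]]),    # diagonal edge
         90: np.array([[-1,  0,  1],
                       [-1,  0,  1],
                       [-1,  0,  1]]),    # vertical edge
        135: np.array([[ 1,  1,  0],
                       [ 1,  0, -1],
                       [ 0, -1, -1]]),    # the other diagonal
    }

    def edge_orientation(patch):
        # Score a 3 x 3 patch against each template by cross-correlation and
        # return the orientation with the strongest response (in magnitude).
        scores = {theta: float(np.sum(t * patch)) for theta, t in templates.items()}
        return max(scores, key=lambda theta: abs(scores[theta])), scores

    # A patch that is dark on top and bright on the bottom, i.e. a horizontal edge.
    patch = np.array([[ 10,  10,  10],
                      [ 10,  10,  10],
                      [200, 200, 200]])
    print(edge_orientation(patch))    # the 0-degree (horizontal) template wins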
Last modified December 10, 1996
Copyright © 1996 by Charles R. Dyer. All rights reserved.