Object Recognition using Pictorial Structure Models

Abstract

Pictorial structure models, originally introduced by Fischler and Elschlager, provide a statistical model of objects.  Using these pictorial structure models, objects in an image can be recognized and their constituent parts can be located in the image.  Work by Felzenszwalb provides a probabilistic approach to the training and recognition of pictorial structure models in an image.  For this paper, I implemented Felzenszwalb's approach for face recognition in order to gauge its effectiveness and determine the advantages and limitations of his method.

Introduction

Object recognition in images is an increasingly important aspect of vision research and, specifically, methods to recognize general object classes are receiving more attention.  Pictorial structure models, introduced in [1] by Fischler and Elschlager, provide a framework in which to recognize generic classes of objects in an image.

 

Conceptually, pictorial structure models are quite simple.  A model consists of a number of parts.  Each part is responsible for recognizing a single component or feature of the model.  The parts of the model are attached to each other via a set of deformable connections that help control the overall placement of the various parts of the model. 

Parts of the model typically have, at minimum, a location within the image space.  However, they can also have additional attributes, such as width, height, and orientation, which further describe a part.
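
To make this structure concrete, a minimal C++ sketch of the data types this description implies is shown below.  The names and fields (PartConfig, Connection, and so on) are illustrative only and are not taken from the actual implementation.

```cpp
#include <vector>

// Illustrative pictorial-structure data types (hypothetical names and fields).
struct PartConfig {
    int x = 0, y = 0;        // location in image space (always present)
    double orientation = 0;  // optional additional attributes
    double scale = 1.0;
};

struct Connection {
    int parentPart = 0;      // index of the parent part
    int childPart = 0;       // index of the child part
    double idealDx = 0;      // ideal offset of the child relative to the parent
    double idealDy = 0;
    double varX = 1.0;       // allowed deformation in x (larger = looser connection)
    double varY = 1.0;       // allowed deformation in y
};

struct PictorialModel {
    int numParts = 0;
    std::vector<Connection> connections;  // one entry per edge of the model graph
};
```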

 

In general, the process of recognizing an object class in an image, given its model, involves maximizing the model score given the image.  The model score, for any arrangement of parts, depends upon two factors: the score of each part at its selected location and the locations of the parts relative to the ideal connection configuration.
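
For reference, this trade-off can be written as a single objective over configurations L = (l_1, ..., l_n), where m_i(l_i) denotes the match cost of part i at location l_i and d_ij(l_i, l_j) denotes the deformation cost of the connection between parts i and j; this notation follows [2]:

```latex
L^{*} = \arg\min_{L=(l_1,\ldots,l_n)}
        \left( \sum_{i=1}^{n} m_i(l_i)
             + \sum_{(v_i, v_j) \in E} d_{ij}(l_i, l_j) \right)
```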

 

In order to perform this maximization against an image, two operations must be performed.  First, the model must be trained.  Second, the model space must be searched to find the best configuration of the model given the image.  It should be noted that the search space can be quite large, given both the number of possible part configurations, including location and other attributes, and the relative placement of the parts.

 

In [2], Felzenszwalb provides approaches to solve both the training problem and the search problem.  His methodology is a probabilistic approach and is tailored to solve the search problem efficiently.  The algorithms described in this paper are based entirely upon this approach and are described in the following section.

Approach and Algorithms

The Felzenszwalb approach to solving the search problem uses a probabilistic formulation.  Specifically, in order to find the best configuration of the parts in an image, we model the likelihood of seeing that image given a part configuration and model.  This is given by p(I | L, θ), where I is the image, L is the configuration of the model parts, and θ is the pictorial model.  By application of Bayes' rule we get the posterior distribution p(L | I, θ) ∝ p(I | L, θ) p(L | θ), which is the probability of a part configuration, L, given a specific image and our model.

 

Given the posterior distribution, we can perform a number of operations.  First, training becomes a simple maximum likelihood calculation to determine the model, given a number of training images and their corresponding part configurations.  Second, the search problem can be solved by finding the maximum a posteriori (MAP) estimate of L.  And third, we can sample from the posterior distribution in order to generate guesses as to likely locations of the model in an image.

 

The model, θ, is formulated as a graph, with the vertices representing the parts and the edges representing the connections between the parts.  Given an arbitrary graph of parts and connections, solving for either the MAP estimate or sampling from the posterior distribution requires considering on the order of H^N joint configurations, which is exponential in the number of parts.

 

In order to reduce the time of the MAP estimation and sampling, we impose two restrictions upon the model.  First, instead of an arbitrary graph, we restrict the graph to a tree structure.  This allows us to use a dynamic programming approach to solve both the MAP estimation and sampling problems.  Second, we require that the cost of each connection be derived from a normal distribution with zero mean and a diagonal covariance matrix.  This second limitation allows us to use a modified chamfer algorithm to efficiently calculate the deformation cost as connections stretch or contract from their optimal position.
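
For reference, the negative log-likelihood of a zero-mean Gaussian with diagonal covariance reduces, up to an additive constant, to an independent weighted cost in each dimension.  For a two-dimensional offset (Δx, Δy) between a connection's actual and ideal displacement, this is

```latex
d(\Delta x, \Delta y) = \frac{\Delta x^{2}}{2\sigma_x^{2}}
                      + \frac{\Delta y^{2}}{2\sigma_y^{2}} + \text{const},
```

which is the property that makes the fast per-dimension minimization and smoothing techniques described below applicable.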

 

Once the above restrictions are in place, both the MAP estimation and the sampling problem can be solved using dynamic programming.  Specific details of the algorithm can be found in [2]; the general algorithm follows.

 

To calculate the MAP estimate, we first start at the leaf nodes of the model tree.  For each leaf part, we generate a table indexed by all the possible configurations of the leaf's parent part, i.e. location, orientation, height, width, scale, etc.  We then fill the table with the score of the best possible leaf configuration given each parent configuration.  Once the leaves are complete, we can move up to the next level of the tree and calculate the same table with respect to that level's parent part, all the way up to the root.  In the end, we have a table that provides the posterior probability over all possible root locations, given the best possible locations of all of the other parts.
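
A hedged C++ sketch of this leaves-to-root pass is shown below, assuming each part has been discretized to H candidate configurations.  The structure and names are illustrative rather than the report's actual code.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

struct Node {
    std::vector<double> matchCost;   // -log p(image | this part at configuration h), size H
    std::vector<Node*> children;
    std::vector<double> bestCost;    // bestCost[p] = best cost of this subtree given
                                     // that the parent takes configuration p
};

// Placeholder deformation cost between a parent and a child configuration; a
// real model would use the connection's learned ideal offset and variance.
double deformCost(int parentCfg, int childCfg) {
    double d = static_cast<double>(parentCfg - childCfg);
    return d * d;
}

void upwardPass(Node& node, int H) {
    // Cost of this node at each of its own configurations, including the
    // already-computed best costs of its subtrees.
    std::vector<double> ownCost = node.matchCost;
    for (Node* c : node.children) {
        upwardPass(*c, H);
        for (int h = 0; h < H; ++h) ownCost[h] += c->bestCost[h];
    }
    // Brute-force O(H^2) minimization over child configurations; the chamfer
    // technique discussed below reduces this inner loop to roughly O(H).
    node.bestCost.assign(H, std::numeric_limits<double>::infinity());
    for (int p = 0; p < H; ++p)
        for (int c = 0; c < H; ++c)
            node.bestCost[p] = std::min(node.bestCost[p], ownCost[c] + deformCost(p, c));
}
```

At the root there is no parent, so its table is simply its own match cost plus its children's best costs; the MAP configuration can then be read off by taking the best entry and backtracking through the stored choices.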

 

In general, this operation has a time complexity of O(NH²), where N is the number of parts and H is the number of possible configurations for each part.  This can become quite costly when H is large, as it typically is.  It is here that the restriction on the connection cost becomes important.

 

Since the connection deformation cost is derived from a normal distribution with zero mean, we can use a modified form of a chamfer algorithm to solve for the minimum 1-norm cost of a child part configuration given the parent configuration.  Using the chamfer algorithm, we reduce the time complexity to approximately O(NH).
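
A minimal sketch of the idea, for a single one-dimensional table of H configuration costs and a 1-norm deformation weight w (both hypothetical parameters), is shown below; it computes out[i] = min over j of ( f[j] + w * |i - j| ) with one forward and one backward pass instead of an O(H²) scan.

```cpp
#include <algorithm>
#include <vector>

// Two-pass chamfer-style minimization: out[i] = min_j ( f[j] + w * |i - j| ).
std::vector<double> chamferL1(const std::vector<double>& f, double w) {
    std::vector<double> out(f);
    if (out.size() < 2) return out;
    // Forward pass: best cost reachable from the left.
    for (std::size_t i = 1; i < out.size(); ++i)
        out[i] = std::min(out[i], out[i - 1] + w);
    // Backward pass: best cost reachable from the right.
    for (std::size_t i = out.size() - 1; i-- > 0; )
        out[i] = std::min(out[i], out[i + 1] + w);
    return out;
}
```

In the two-dimensional case the same pass is applied independently along x and then along y, which is why the independent-x-and-y (diagonal covariance) restriction on the connections matters.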

 

The sampling problem can be solved similarly to the MAP estimation.  With the sampling algorithm, at each tree depth we generate the probability distribution of a child part configuration given the parent configuration, for each possible configuration of the parent.  This distribution depends only upon an individual node's children's distributions, so again if we solve first for the leaves and proceed up the tree, we can generate the probability distribution of the root node.  Once we have the distribution of the root, we can sample from it to generate a root configuration.  Given the root configuration, we can then recurse down the tree to generate the distributions of each of the other parts, sample from them, and repeat.
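
A compact sketch of the root-to-leaf portion of this procedure follows.  It assumes each node stores, for every parent configuration, a normalized distribution over its own configurations (built during the leaves-to-root pass), and that the root has already been sampled from its marginal; the names are illustrative.

```cpp
#include <random>
#include <vector>

struct SampleNode {
    std::vector<std::vector<double>> childProb;  // childProb[parentCfg][ownCfg]
    std::vector<SampleNode*> children;
    int sampledCfg = -1;                         // filled in during sampling
};

// Given the already-sampled configuration of the parent, draw this part's
// configuration and recurse into its subtree.
void sampleDown(SampleNode& node, int parentCfg, std::mt19937& rng) {
    const std::vector<double>& dist = node.childProb[parentCfg];
    std::discrete_distribution<int> pick(dist.begin(), dist.end());
    node.sampledCfg = pick(rng);
    for (SampleNode* c : node.children)
        sampleDown(*c, node.sampledCfg, rng);
}
```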

 

Again, as in the MAP estimation, this operation would generally have a time complexity of O(NH²).  However, since we restricted the connection cost to the form of a normal distribution, we can quickly calculate the probability distribution of the children by applying a Gaussian kernel, based upon that normal distribution, to a seed table of the probabilities of the child part.  Since the Gaussian is separable, it can be applied in O(H) time, resulting in approximately O(NH) for N parts.
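
As an illustration of the separability argument, the sketch below smooths an H = W × Ht table with a Gaussian using two one-dimensional passes (rows, then columns) rather than a full two-dimensional convolution.  The kernel radius, border handling, and function names are assumptions, not the report's code.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Build a normalized 1D Gaussian kernel of radius roughly 3*sigma.
static std::vector<double> gaussKernel(double sigma) {
    int r = static_cast<int>(std::ceil(3.0 * sigma));
    std::vector<double> k(2 * r + 1);
    double sum = 0.0;
    for (int i = -r; i <= r; ++i)
        sum += (k[i + r] = std::exp(-(i * i) / (2.0 * sigma * sigma)));
    for (double& v : k) v /= sum;
    return k;
}

// Apply the Gaussian separably: one horizontal pass, then one vertical pass.
void separableBlur(std::vector<double>& table, int W, int Ht, double sigma) {
    const std::vector<double> k = gaussKernel(sigma);
    const int r = (static_cast<int>(k.size()) - 1) / 2;
    std::vector<double> tmp(table.size(), 0.0);
    for (int y = 0; y < Ht; ++y)                       // horizontal pass
        for (int x = 0; x < W; ++x) {
            double s = 0.0;
            for (int i = -r; i <= r; ++i) {
                int xx = std::min(std::max(x + i, 0), W - 1);  // clamp at borders
                s += table[y * W + xx] * k[i + r];
            }
            tmp[y * W + x] = s;
        }
    for (int y = 0; y < Ht; ++y)                       // vertical pass
        for (int x = 0; x < W; ++x) {
            double s = 0.0;
            for (int i = -r; i <= r; ++i) {
                int yy = std::min(std::max(y + i, 0), Ht - 1);
                s += tmp[yy * W + x] * k[i + r];
            }
            table[y * W + x] = s;
        }
}
```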

Implementation

For the purposes of this project, I implemented the training, MAP estimation, and sampling algorithms proposed by Felzenszwalb.  The implementation was done in C++ and used object-oriented design in order to make it extensible to different parts and different connections.

 

Additionally, I wrote a simple Java application to automate the generation of training data for use with the C++ code.

 

All code was written for flexibility and for ease of reconfiguration.  Performance, in terms of both clock cycles and memory, was not a chief concern, although the code was optimized whenever possible.

 

This implementation did not include articulated part models, such as those which would be necessary to recognize the shapes of people against a background.

Results

As in the original work by Felzenszwalb, I attempted face recognition with the implemented algorithms.  The algorithms were run against faces from the Yale face database.

 

The model parts were single point parts.  Each part had a location attribute; no other attributes, such as orientation or scale, were necessary.  The model of an individual part consisted of a vector of 18 values based upon the image pixels directly beneath and around the part location.  The values were derived from a number of Gaussian derivative filters that were applied to the image in a preprocessing step.  The vector was normalized so that all values ranged from 0 to 1, in an attempt to make the part invariant to the illumination conditions of the image.
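
A hedged sketch of the kind of per-part appearance scoring this describes is given below: an 18-value response vector (taken from the precomputed Gaussian-derivative filter outputs at and around a pixel) is rescaled to [0, 1] and compared against the part's trained mean response with a simple squared-error cost.  The exact scoring function used in the report may differ; the names here are illustrative.

```cpp
#include <algorithm>
#include <array>

constexpr int kNumResponses = 18;                 // one value per filter response
using Responses = std::array<double, kNumResponses>;

// Rescale the response vector so its values span [0, 1], as a rough
// illumination normalization.
Responses normalize(Responses v) {
    double lo = v[0], hi = v[0];
    for (double x : v) { lo = std::min(lo, x); hi = std::max(hi, x); }
    const double range = (hi > lo) ? (hi - lo) : 1.0;
    for (double& x : v) x = (x - lo) / range;
    return v;
}

// Squared-error appearance cost of a pixel's normalized responses against the
// part's trained mean responses (lower is a better match).
double matchCost(const Responses& atPixel, const Responses& trainedMean) {
    double cost = 0.0;
    for (int i = 0; i < kNumResponses; ++i) {
        const double d = atPixel[i] - trainedMean[i];
        cost += d * d;
    }
    return cost;
}
```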

 

The connections between parts were simple distance connections, with deformation costs applied independently in the x and y directions.

 

Shown below are the results of one of the best training/test cycles.  Figure 1 shows the five training images used and illustrates the structure model that was learned during the training phase.  Notice that the structure of this model uses the right eye as the root.  It is hypothesized that this is due to the small amount of variance in the placement of the parts across the various training images.  It would be expected that with more training images, at different scales, a more natural model would be generated.

 

 

Figure 1 – Training Images

 

Figure 2 shows the MAP estimation results for the five original training images and ten additional test images.  These images have been scaled down substantially for display in this report.

 

Figure 2 – Test Images

 

Although it is hard to tell from the images above, the images can generally be placed into three categories:  correct, mostly correct, and incorrect.  Based on these categories, the original training images scored 80% correct and 20% incorrect.  Of the non-training images, 40% were correct, 30% were mostly correct, and 30% were incorrect. 

Given the large number of possible part configurations, these results are not unreasonable.  However, they are far from perfect.

 

It should be noted that the inaccuracies above cannot be attributed to Felzenszwalb's training and search algorithms, although they do have their limitations.  Instead, the error most likely lies in the calculation of where good matches for each part occur.  In order to understand the robustness of the simple filter-based model used to recognize each individual part, it would be useful to see an estimate of the probability of each part being at a particular position, independent of its connections to other parts.

 

It just so happens that, during the process of sampling, exactly that information is available for every part except the root.  Figure 3 shows six images, corresponding to each of the non-root parts.  The images have been normalized such that black indicates low probability and white indicates high probability.  These particular images are based upon the first training image.

 

These depictions show that if we considered only a single part at a time, the parts could be located in a wide variety of positions.  This is especially true of the left and right corners of the mouth and the tip of the nose for this particular image.  However, when the parts' relative positions are taken into account, there is enough information to reconstruct the face.

 

 

Figure 3 – Normalized images depicting probability of individual parts for first training image.  Images correspond, from left to right, top to bottom, to the likelihood of left eye, tip of nose, left corner of mouth, right corner of mouth, center of brow, and center of mouth.

 

The same analysis can be done for an image that was recognized incorrectly.  The results are shown in Figure 4.  As can be seen, the left eye, left corner of the mouth, and right corner of the mouth do not deviate much from the previous set of images.  However, the response for the tip of the nose (top center image) does not identify the nose well, and the brow and center of the mouth look completely incorrect.

 

Figure 4 – Normalized images depicting probability of individual parts for failed test image.  Images correspond, from left to right, top to bottom, to the likelihood of left eye, tip of nose, left corner of mouth, right corner of mouth, center of brow, and center of mouth.

 

It is believed that if the individual part recognition were more robust, this methodology would work much better than it did for this set of test images.

Conclusions

The test results shown above are far from spectacular.  However, most of the error can be attributed not to Felzenszwalb's search technique but to the technique used to recognize individual parts.  It should also be noted that the Felzenszwalb approach is actually very restrictive in both the structure of the model and the complexity of the individual parts.  Since the complete model space is searched, albeit efficiently, the individual part probabilities need to be calculated for every possible configuration.  This necessarily limits the complexity of the individual parts and therefore weakens the algorithm as a whole.  Despite this limitation, the algorithm still appears to be effective and, with only some additional work, could be made much more effective.

References

[1] M.A. Fischler and R.A. Elschlager.  The representation and matching of pictorial structures.  IEEE Transactions on Computers, 22(1):67-92, January 1973.

 

[2] P.F. Felzenszwalb and D.P. Huttenlocher.  Pictorial Structures for Object Recognition.