In the foregoing sections we have emphasized that VIM systems have much in common with databases and need to be designed through a data model. In this section we discuss the characteristics of such a data model and the relationship between Computer Vision and the data model.
The role of a data model in database systems is to provide the user with a textual or visual language to express the properties of the data objects that are to be stored and retrieved using the system. The database language should allow users to define, update (insert, delete, modify), and search objects and their properties. For Visual Information Management systems the data model assumes the additional role of specifying and computing different levels of abstraction from images and videos. Accordingly, the data model needs to satisfy the following properties:
Figure 2: The Layered VIMSYS Data Model
Figure 3: Results of local color and structure query in the Virage System.
An example of the results obtained by using a set of simple features, such as local color histograms and local edge properties, for image retrieval can be seen in Figure 3. Here, in a database of more than 3000 images, the first image was used as the query image, and the rest form a ranked list of the 50 most similar images. Interestingly, although object recognition has not been attempted, 42 of the 50 are flowers, and 41 of them are roses.
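To make the flavor of such a feature-based ranking concrete, the following Python sketch computes grid-based local color histograms and ranks a collection by histogram intersection with the query. It is only a minimal illustration under assumed choices (grid size, bin counts, intersection score); it is not the actual Virage feature set, whose details differ.

import numpy as np

def local_color_histograms(image, grid=(4, 4), bins=8):
    """Concatenate per-block color histograms over a grid of image regions.

    `image` is an H x W x 3 uint8 array; per-block histograms stand in for
    the "local color" feature discussed in the text.
    """
    h, w, _ = image.shape
    feats = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            block = image[i * h // grid[0]:(i + 1) * h // grid[0],
                          j * w // grid[1]:(j + 1) * w // grid[1]]
            hist, _ = np.histogramdd(block.reshape(-1, 3),
                                     bins=(bins, bins, bins),
                                     range=((0, 256),) * 3)
            hist = hist.ravel()
            feats.append(hist / max(hist.sum(), 1))   # normalize each block
    return np.concatenate(feats)

def rank_by_similarity(query_feat, db_feats, k=50):
    """Rank database feature vectors by histogram intersection with the query."""
    scores = np.array([np.minimum(query_feat, f).sum() for f in db_feats])
    return np.argsort(-scores)[:k]                    # indices of the top-k matches

A local edge feature (e.g., per-block edge-orientation histograms) could be concatenated to the same vector and ranked in the same way.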
The feature layer poses some interesting problems. We mention two such problems relevant to Computer Vision and pattern recognition.
The first problem concerns the correspondence between perceptual similarity and computable feature distance. Most of the feature distances used in the literature are convex functions that also satisfy the metric conditions. This is a limitation in many cases, because the clusters in feature space that correspond to perceptual similarity can be far from convex. An obvious alternative is to use unsupervised clustering, which yields nonlinear regions in the feature space. However, clustering can be very expensive on the one hand, and not amenable to dynamic updates on the other. The problem of dynamic update becomes important because maintaining dynamic statistics of the data population introduces a significant overhead in a multi-user, fast-transaction environment. Research is needed on algorithms that strike a suitable tradeoff between the perceptual and computational criteria. Moreover, suppose perceptually meaningful clusters are determined by some method and have arbitrary shapes in the feature space. The questions are: what families of distance functions should be used to rank visual objects belonging to a given cluster? Would such a function need to adapt itself as the cluster evolves and changes shape and topology?
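One illustrative direction, sketched below in Python, is to condition the ranking distance on the cluster itself, for example a Mahalanobis-style distance computed from the current cluster members. This is only a minimal sketch of the tradeoff discussed above, not a proposed solution; in particular, the covariance would have to be recomputed or incrementally updated as the cluster evolves, which is exactly the dynamic-update cost noted above.

import numpy as np

def rank_within_cluster(query, members, candidates, eps=1e-6):
    """Rank candidate feature vectors by distance to the query under a
    metric derived from the cluster's own shape.

    `members` (N x d) are feature vectors already judged perceptually
    similar; their covariance stretches the distance along the cluster's
    principal directions instead of assuming spherical (Euclidean)
    neighbourhoods.
    """
    cov = np.cov(members, rowvar=False) + eps * np.eye(members.shape[1])
    inv = np.linalg.inv(cov)
    dists = [float(np.sqrt((c - query) @ inv @ (c - query))) for c in candidates]
    return np.argsort(dists)          # candidate indices, most similar first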
The second problem relates to the issue of containment search. Consider the query ``Show all images that contain a table and a chair like this'', where the system is provided with a reference image containing the desired objects. The correct result is for the system to retrieve all images that contain the objects while preserving their spatial relationship. The first difficulty is to find a segmentation scheme that, given a specification of ``what'' it is expected to segment, produces a reasonable segmentation despite the inaccuracies caused by unpredictable contrast and by occlusions from other parts of the query and candidate images. The second difficulty is to find a set of shape-based features that are robust to the segmentation inaccuracies inevitable in an uncontrolled collection. Also, although complete invariance to affine transformations may not be required, the features should be tolerant to translation and to some perturbations of rotation and scale. One approach, adopted by the QBIC group, is to have the user segment all the important objects in every image that is inserted into the collection. While this is very practical, and possibly the surest method to ensure correctness of results, more automatic methods from Computer Vision are warranted.
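For the shape-feature requirement, a minimal Python sketch is given below. It assumes the segmentation has already produced binary object masks, which is precisely the hard part described above, and compares log-scaled Hu moment invariants, which are insensitive to translation, rotation, and scale. It uses OpenCV's moment routines and is not the method used in QBIC or Virage.

import cv2
import numpy as np

def hu_signature(mask):
    """Log-scaled Hu moment invariants of a binary object mask.

    `mask` is a single-channel uint8 array (nonzero pixels mark the object).
    Hu moments are invariant to translation, rotation, and scale, matching
    the tolerance (rather than full affine invariance) argued for above.
    """
    hu = cv2.HuMoments(cv2.moments(mask, binaryImage=True)).ravel()
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)

def shape_distance(query_mask, candidate_mask):
    """L1 distance between Hu signatures; smaller means more similar shapes."""
    return float(np.abs(hu_signature(query_mask) - hu_signature(candidate_mask)).sum())

Spatial relationships between the matched objects (e.g., relative positions of the table and chair) would still have to be checked separately once candidate objects are found.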