In the foregoing sections we have emphasized that VIM systems have much in common with databases and need to be designed through a data model. In this section we discuss the characteristics of such a data model and the relationship between Computer Vision and the data model.
The role of a data model in database systems is to provide the user with a textual or visual language to express the properties of the data objects that are to be stored and retrieved using the system. The database language should allow users to define, update (insert, delete, modify), and search objects and their properties. For Visual Information Management systems the data model assumes the additional role of specifying and computing different levels of abstraction from images and videos. Accordingly, the data model needs to satisfy the following properties:
Figure 2: The Layered VIMSYS Data Model
Figure 3: Results of local color and structure query in the Virage System.
An example of the results obtained by using a set of simple features, such as local color histograms and local edge properties, for image retrieval can be seen in Figure 3. Here, in a database of more than 3000 images, the first image was used as the query image, and the rest form a ranked list of the 50 most similar images. Interestingly, although object recognition has not been attempted, 42 of the 50 are flowers, and 41 of them are roses.
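To make the flavor of such a feature-based ranking concrete, the following Python sketch computes grid-based local color histograms and ranks a collection by histogram intersection with the query. It is only a minimal illustration under assumed choices (grid size, bin counts, intersection score); it is not the actual Virage feature set, whose details differ.

import numpy as np

def local_color_histograms(image, grid=(4, 4), bins=8):
    """Concatenate per-block color histograms over a grid of image regions.

    `image` is an H x W x 3 uint8 array; per-block histograms stand in for
    the "local color" feature discussed in the text.
    """
    h, w, _ = image.shape
    feats = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            block = image[i * h // grid[0]:(i + 1) * h // grid[0],
                          j * w // grid[1]:(j + 1) * w // grid[1]]
            hist, _ = np.histogramdd(block.reshape(-1, 3),
                                     bins=(bins, bins, bins),
                                     range=((0, 256),) * 3)
            hist = hist.ravel()
            feats.append(hist / max(hist.sum(), 1))   # normalize each block
    return np.concatenate(feats)

def rank_by_similarity(query_feat, db_feats, k=50):
    """Rank database feature vectors by histogram intersection with the query."""
    scores = np.array([np.minimum(query_feat, f).sum() for f in db_feats])
    return np.argsort(-scores)[:k]                    # indices of the top-k matches

A local edge feature (e.g., per-block edge-orientation histograms) could be concatenated to the same vector and ranked in the same way.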
The feature layer poses some interesting problems. We mention two such problems relevant to Computer Vision and pattern recognition.
The first problem concerns the correspondence between perceptual similarity and computable feature distance. Most of the feature distances used in the literature are convex functions that also satisfy the metric conditions. This is a limitation in many cases, because the clusters in feature space that correspond to perceptual similarity can be far from convex. An obvious alternative is to use unsupervised clustering, which yields nonlinear regions in the feature space. However, clustering can be very expensive on the one hand, and not amenable to dynamic updates on the other. The problem of dynamic update becomes important because maintaining dynamic statistics of the data population introduces a significant overhead in a multi-user, fast-transaction environment. Research is needed on algorithms that strike a suitable tradeoff between the perceptual and computational criteria. Moreover, suppose perceptually meaningful clusters are determined by some method and have arbitrary shapes in the feature space. The questions are: what families of distance functions should be used to rank visual objects belonging to a given cluster? Would such a function need to adapt itself as the cluster evolves and changes shape and topology?
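One illustrative direction, sketched below in Python, is to condition the ranking distance on the cluster itself, for example a Mahalanobis-style distance computed from the current cluster members. This is only a minimal sketch of the tradeoff discussed above, not a proposed solution; in particular, the covariance would have to be recomputed or incrementally updated as the cluster evolves, which is exactly the dynamic-update cost noted above.

import numpy as np

def rank_within_cluster(query, members, candidates, eps=1e-6):
    """Rank candidate feature vectors by distance to the query under a
    metric derived from the cluster's own shape.

    `members` (N x d) are feature vectors already judged perceptually
    similar; their covariance stretches the distance along the cluster's
    principal directions instead of assuming spherical (Euclidean)
    neighbourhoods.
    """
    cov = np.cov(members, rowvar=False) + eps * np.eye(members.shape[1])
    inv = np.linalg.inv(cov)
    dists = [float(np.sqrt((c - query) @ inv @ (c - query))) for c in candidates]
    return np.argsort(dists)          # candidate indices, most similar first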
The second problem relates to the issue of containment search. Consider the query ``Show all images that contain a table and a chair like this'', where the system is provided with a reference image containing the desired objects. The correct result is for the system to retrieve all images that contain the objects while preserving their spatial relationship. The first difficulty is to find a segmentation scheme that, given a specification of ``what'' it is expected to segment, produces a reasonable segmentation despite the inaccuracies caused by unpredictable contrast and by occlusions from other parts of the query and candidate images. The second difficulty is to find a set of shape-based features that are robust to the segmentation inaccuracies inevitable in an uncontrolled collection. Also, although complete invariance to affine transformations may not be required, the features should be tolerant to translation and to some perturbations of rotation and scale. One approach, adopted by the QBIC group, is to have the user segment all the important objects in every image that is inserted into the collection. While this is very practical, and possibly the surest method to ensure correctness of results, more automatic methods from Computer Vision are warranted.
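For the shape-feature requirement, a minimal Python sketch is given below. It assumes the segmentation has already produced binary object masks, which is precisely the hard part described above, and compares log-scaled Hu moment invariants, which are insensitive to translation, rotation, and scale. It uses OpenCV's moment routines and is not the method used in QBIC or Virage.

import cv2
import numpy as np

def hu_signature(mask):
    """Log-scaled Hu moment invariants of a binary object mask.

    `mask` is a single-channel uint8 array (nonzero pixels mark the object).
    Hu moments are invariant to translation, rotation, and scale, matching
    the tolerance (rather than full affine invariance) argued for above.
    """
    hu = cv2.HuMoments(cv2.moments(mask, binaryImage=True)).ravel()
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)

def shape_distance(query_mask, candidate_mask):
    """L1 distance between Hu signatures; smaller means more similar shapes."""
    return float(np.abs(hu_signature(query_mask) - hu_signature(candidate_mask)).sum())

Spatial relationships between the matched objects (e.g., relative positions of the table and chair) would still have to be checked separately once candidate objects are found.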