EAV/CSG - Entity-Attribute-Value with Classes, Sets and Graphs.


Overview

Introduction

Entity-Attribute-Value (EAV) databases have been used for a long time, mostly in medical and bioinformatic databases, as a way to store complex heterogeneous data in a simple and extensible schema. The EAV model has many benefits when conventional relational-database approaches become overly complex, difficult to extend, and waste space when complex and sparse data is stored. Because of the problems that arise basic EAV models are rarely used, instead more complex models are created around the basic EAV model to provide more structure and performance.

The basic EAV model uses one table with three rows: Entity, Attribute, and Value. An Entity is a single piece of information. In a conventional relational database, and entity would consist one a single table row, or a collection of rows from multiple related tables. An Attribute correlates to a conventional column name, and a Value would be a value in a row. So a conventional table stores an entity in one row, each element in the row, and its column name represents the attribute-value pairs. In an EAV database, each row represents one attribute-value pair. When a new attribute needs to be created in a conventional database, either a new column needs to be added to a table, or a new table needs to be created. In an EAV model, no schema changes need to be made, rather a new attribute name is used. EAV models have many advantages over relational databases, but have many problems including complex query performance.

The EAV/CR (Entity-Attribute-Value with Classes and Relations) model adds the idea of entities being instances of classes, very similar in idea with the object-oriented programming (OOP) sense of classes and objects. It also adds the ability for an attribute to be able to reference another entity, and therefore creates relationships between the stored entities.

The EAV/CSG (Entity-Attribute-Value with Classes, Sets and Graphs) model that I propose in this project is similar, but provides a more complex system for inter-entity relationships via the use of sets and directed-acyclic-graphs instead of using attribute based relations. The design came out of frustration and a need for a database that could handle storage of physical and logical data in relation to systems administration, and be able to store that in one integerated place. Physical information had to be stored about computers and parts including serial numbers, model numbers, physical location (building and room), warranty information, etc. Logical information was also needed such as host configuration data, network configuration, operating system, etc. All of these bits of information were stored in multiple databases, on different platforms and were all stored in conventional relational tables that offered little or no room for expansion (that was needed to fix limitations add provide for new features). Unfortunately this design was never used for the reason it was designed, for political (not design) reasons.. however I am completing it as a 'proof of design'.

This project has a working database schema using PostgreSQL, an API and set of utilities based in C++ using the pqxx library for database access. The idea is that a database and application can be designed around the schema and API with no direct postgres programming required. The schema defined here should easily be implemented using other relational database engines and programming languages (that is for you to do, if you so wish), but I am hoping that the schema and API should be advanced enough that it should prove useful to people developing their own databases. Comments, questions, recommendations, ideas, use of this stuff, ranting, etc. are all welcome, please contact me (the author) Adam J Kunen.

Concept of design

I believe we can split all EAV models into three main components, which I call the EAV Core, Entity Management, and Entity Relationships. The EAV Core is the meat of all EAV models; it is where the actuall attribute-value pairs are stored in relation to a specific entity. The Entity Management relates to how an entity is instantiated and defines a more abstract meaning of what that entity is. In a bare bones EAV model, the Entity Management is trivial, but when classes are introduced it becomes more complex. The Entity Relationships, while not necessarily required, defines how each entity in the model are related.

There are two main camps when it comes to EAV cores. One approach stores all values are stored in the database as text strings. This requires data types other than strings to be converted back-and-forth to text which is a problem when it comes to complex queries (numeric comparisons) and leaves the data-type checking up to the application. The main benefits of this approach are that only one EAV table is needed, and query programming is simplified (in some cases). The other approach uses strict data typing where one defines attributes having a specific data type. Each data type has its own EAV table where the value column has the appropriate data type. This allows for the underlying database itself to strictly enforce data types and allows for data type specific comparisons in queries. The EAV/CR model uses strict typing, and actually implements entity relations via a special attribute type. The only disadvantage of this approach is the added code needed on the application side, however the advantages over the string based appraoch outweight the negatives. EAV/CSG uses the strict data typing approach.

Entity Management in EAV/CSG uses a class based system very similar to the EAV/CR model. Because of this, and because we want to conform more closely with the Object/Class naming schemes, Entities in the EAV/CSG model will be refered to as Objects. An object is an instance of a class. A class is a definition of a 'kind' of object that defines what attributes an object can contain. This is the same concept used in OOP, including the idea of class inheritance, except with out the idea of instance-methods (functions that act upon the objects data). This is also the same concept used by EAV/CR.

Entity Relationships are what really bring out the complexity in EAV models. This is where the

Definition of terms

Attribute A label, data-type, and possibly a value
Object An instance of an Object Class. Contains a set of Attribute-Value pairs. An Object is the Entity in EAV terms.
Object Class A definiton of a type of Object that can be instantiated. Much like the OOP idea of the Class/Object relationship, minus the idea of methods that act upon the objects... This is left to the application developer.
Set An instance of a Set Class. Contains a set of Objects, a set of Attributes and a set of Sub-Classes
Set Class A definition of a type of Set that can be instantiatied. A Set Class defines


Index

(c)2005 by Adam J Kunen. Please send comments/questions to ajkunen@cs.wisc.edu