The Grid File: An Adaptable, Symmetric Multikey File Structure

J. Nievergelt, Hans Hinterberger, and Kenneth C. Sevcik.
Institut for Informatik, ETH, and University of Toronto

One-line Summary

A balanced multikey file structure provides efficient, uniform access to multidimensional data which identified by several attributes in each record of highly dynamic files.

Overview/Main Points

Problem statement
- A file F consisting of a collection of records R = [a₁, a₂, ..., a_k], where the a are fields containing attribute values.
  - [LastName, FirstName, MiddleInitial, YearOfBirth, SSN]
  - [Doe, John, -, 1951, 123456789]
- Query classification
  - exact match query, point query: entire record specified
  - partially specified query: Doe born in 1951
  - single-key query: all records with last name Doe
  - presumably unique: ssn 987654321
  - range or interval query: everybody born between 1940 and 1960
- Multikey access problem
  - a record is characterized by a small number of attributes (less than 10)
  - the domain S_i of each attributes is large, independent, and linearly ordered (l_i ≤ a_i ≤ u_i)
- Idea for multikey access file structure
  - bitmap representation of the attribute space: reserve one bit for each possible record in the space
  - k-dimensional matrix represents all possible values of k attributes
  - dynamic partition (directory) on all key-value space for compressed bitmap
- Performance criteria adapts from response times to multikey access request to the number of disk access due to fixed data transfer rate of the disks
- Design principles
  - Two-disk-access principle: at most two disk accesses for a single record retrieve (one for the directory, the other data bucket)
  - Efficient range queries with respect to all attributes
Grid partitions
- Two categories of searching techniques
  - organize the specific set of data to be stored: comparative search techniques, such as binary search trees, since the boundaries between different regions are determined by values of stored data.
  - organize the embedding space from which the data is drawn: address computation techniques, including radix trees, since region boundaries are drawn at fixed places regardless of the content of the file.
- Space partition techniques
  - recursively divide and conquer results in progressively shorter region boundaries
  - inverted file where the primary key determines region boundaries for a uniform bucket occupancy, while the secondary key is partitioned independently of the data to be stored (the value of the key domain may be in its own region).
  - grid file whose each region boundary seperates the entire search space at fixed value of the domain, but all dimensions are treated symmetrically
- grid blocks divided by the k-dimension record space
- an one-dimensional parition is created by either splitting or merging
Grid file
- a bucket contains c records where 10 < c < 1000.
- grid directory
  - functionalities of a bucket management system
    - a class to assign grid blocks in the record space to data buckets
    - a data structure (called grid directory) that represents the assignment and supports operations to update the assignments when bucket overflow or underflow
    - an algo to update the directory for the changes of the assignment
  - all the records in one grid block store in the same bucket, but several grid blocks (called region of the bucket) may share a bucket.
  - convex assignment of grid blocks to buckets
    - the union of the shared assignments of the grid blocks forms a rectangular box in the space of records.
    - the regions of buckets are pairwise disjoint, and together span the space of records
  - grid directory
    - grid array: a dynamic k-dimensional array whose elements (pointers to data buckets) are one-to-one correspondent with the gird blocks of the partition.
    - linear scales: k one-dimensional arrays whose scale defines a partition of a domain S.
  - Example of two-dimensional space S = X × Y
    - nx > 0, ny > 0, 0 ≤ cx < nx, 0 ≤ cy < ny
    - G(0, ..., nx-1, 0, ... ny-1)
    - X(0, ..., nx), Y(0, ..., ny)
  - Operations
    - direct acess: G(cx, cy)
    - Next in each direction
      - nextxabove: cx ← (cx + 1) mod nx
      - nextxbelow: cx ← (cx - 1) mod nx
      - nextyabove: cy ← (cy + 1) mod ny
      - nextybelow: cy ← (cy - 1) mod ny
    - Merge
      - mergex
        
        given px, 1 ≤ px < nx, merge px with nextxbelow
        
        rename all elements above px
        
        adjust X-scale
      - mergey: for any py, 1 ≤ py < ny
    - Split
      - splitx
        
        given px, 0 ≤ px < nx, create new element px+1
        
        rename all cells above px
        
        adjust X-scale
      - splity: for any py, 0 ≤ py < ny (≤ ny in the paper)
  - Constrains on the values: if G(i', j') = G(i'', j''), then for all i, j such that i' ≤ i ≤ i'' and j' ≤ j ≤ j'', G(i, j) = G(i', j') = G(i'', j'')
  - A query on 2-dim space using grid directory
    - [a, z]. A1 = (a, f, k, p, z)
    - [0, 10]. A2 = (0, 2, 7, 10)
    - Query: A1 = 'x' AND A2 = 6
- Decisions
  - grid partitions of the search space
  - convex assignments of grid blocks (region) to buckets
  - grid directory, a (large) dynamic array but small linear scales
- Performance with respect to partition size
- Issues
  - splitting and merging policy
  - implementation of the grid directory
  - concurrent access
  - Precision: ^{the number of records retrieved that meet the query specification} / _{total number of records retrieved}
    - High precision VS mapping of cell to bucket
    - High precision VS contiguous access

The Grid File: An Adaptable, Symmetric Multikey File Structure

One-line Summary

Overview/Main Points

Relevance

Flaws