Ubiquitous B-Tree

Douglas Comer
Purdue University

One-line Summary

B-tree, a balanced, multiway, and external data structure is efficient and versatile for organizating files, without massive re-organization. Its variation, B+-tree allows efficient sequential processing of the file.

Overview/Main Points

File
- abstract model: n records, each of which is a <key, value> pair.
- access patterns
  - sequential access
  - random access
- Operations
  - insert
  - delete
  - find
  - next
index
- the external index file to speed retrieval
- Factors to evaluate index methods
  - main cost: the number of secondary storage accesses
  - the time of index process
  - the secondary storage space utilization
  - the ratio of the space required by the index to the space required by the associated information
- Indexed Sequential Access Method (ISAM) for sequential accesses

B-tree, de facto, a standard for file organization

the root node has at least 2 descendants.
order d, except for the root, each node has
- at least d keys and d+1 pointers
- at most 2d keys and 2d+1 pointers
the number of nodes at depths i (the root at 0) is at least 2 * d^i
a tree with all leaves lay at the same depth h has ^{(d^h - 1)}/_(2d-1) nodes with at least d keys each
a tree with n total keys has the height of at most log_d^{^{(n + 1)}/₂}.
the balancing mechanism
- requires some computation time to perform the balancing
- uses extra storage to lower the balancing costs
Insertion may involve split operations all ways to the root so that the height of the tree increases one level.
Deletion
- key to delete
  - in a nonleaf node: substitute by the successor key in key-sequence order resides in the leftmost leaf in the right substree of the deleted key
  - in a leaf node
- if an underflow of keys in a node occurs
  - key redistribution if two neighbors have at least 2d keys
  - key concatenation o.w.
Operation complexity
- find: logarithm of the file size
- insertion & deletion: O(log_dⁿ) in the worst case
- next: complicated
Perform not well in a sequential processing environment
- a simple preorder tree walk [KNUT68] extracts keys in order
- require space for at least log_d^{(n + 1)} nodes to stack the nodes along a path from the root to avoid reading them twice.
- require tracing a path through several nodes before reaching the desired key: e.g., find the smallest key located in the leftmost leaf

Limitations on large order d

_{Node size}\^{File size (records)}	10^3	10^4	10^5	10^6	10^7
10	3	4	5	6	7
50	2	3	3	4	4
100	2	2	3	3	4
150	2	2	3	3	4

HW systems have a upper bound on the amount of data could be transfered with one access to secondary storage.
each device has some fixed trace size which must be accommodated to avoid wasting large amounts of space.

Variants

Optimization for insertion and deletion
- For the underflow resulting from a deletion, redistribute keys from nieghboring nodes, instead of concatenation (unless the requisite number of keys cannot be obtained).
- For the overflow resulting from an insertion, distribute keys into a neighboring node to delay splitting (unless two neighbors fill up).
Improvements in the secondary costs
- Use a binary search instead of a linear lookup to locate the proper descendent pointer when processing a node [Clam64]
- use a binary search for large nodes, while a sequential search for small [Knut73]
- the square root search, an extrapolation technique [Maru77]
an interpolation search to eliminate a secondary storage access for index creation for a file
Various orders at each depth [Knut73]
B^*-tree: a B-tree whose node is at least ⅔ full[Knut73].
- insertion: a local redistribution scheme to delay spliting until 2 sibling nodes are full.
- speed up the search due to the smaller height of the resulting tree from the higher storage utilization (at least 66%)
B⁺-tree: independent index and key set
Virtual B-Trees
- each node of the B-tree maps into one page of the virtual address space
- use LRU for page replacement so that the most active nodes (which close to the root) are stay in memory
- at least the root should remain in main memory for each search
- the memory protection mechanism isolates other users

Compression decreases the retrieval costs by increasing the capacity of each node

compressed keys use several standard techniques for removing redundancy, as well as both front and rear compression for prefix B⁺-trees

compressed pointers (good for virtual B-trees) use a base/displacement from of node address rather than an absolute address value:

base

offset₀

key₁

offset₁

key₂

...

offset_2d-1

key_2d

offset_2d

variable length entries promotes shorter keys during insertion for a resulting tree with better storage utilization and faster access times.
Binary B-trees
- a B-tree whose order is 1 for a one-level store
- each node has 1 or 2 keys and 2 or 3 pointers
- a linked representation for a better space utilization
- one extra bit to indicate whether the right pointer in a node points to a sibling or a descendant
- Performance: insertion, deletion, and find take only log n steps, although searching the rightmost path requires twice as many nodes to be accessed as the leftmost
- To maintain logarithmic cost, no two right links points to sibling nodes in a row.
- Symmetric Binary B-trees an extension of Binary B-trees that allows for both left and right links pointing to sibling nodes, containing AVL trees as a subclass.
2-3 trees for one-level store
- each node has 2 or 3 sons for 1 or 2 keys
- a 2-3 tree is a B-tree of order 1, and vice versa.
- quite appropriate for an internal data structure, rather than external
a B-tree variant for maintaining a list of keys which have highly skewed probability of access by keep a set of fingers which point to localities of interest so that updating items within p locations from a finger takes log_d^p time.
a B-tree scheme without upward splitting for a better performance

Theoretical results
- a linear time construction algo [Rose78, Mill77] for an optimal 2-3 tree from the sorted list of keys under the cost crierion of the number of comparisions or node accesses
- Both an upper and lower bound for tree construction from a uniformly distribution set of n keys [Yao78]
Concurrent control
- use a set of locking protocols to insure the integrity of access
- find
  - hold an S lock on the current node, and release the lock on the ancestor after processing to the next depth.
  - a reader locks at most two nodes
  - other readers are free to lock other parts of the tree simultaneously
- update
  - leave a reservation (I lock) on each node it access for locking the node later
  - the reservation (I lock) may be converted to an absolute lock (X lock) in a top-down manner if any changes will apply to the reserved node. o.w. cancelled
  - the reservation (I lock) is compatible with the reader lock (S lock), but not another reservation.
- trade-off
  - reserving an entire path from the root to a leaf to prevent other updates
  - most updates affect only a few levels near a leaf, but reserving too few nodes might result in beginning again at the root.

B⁺-tree
- leaf nodes: all keys reside in the leaves, and linked together left-to-right to improve the next operation in a sequential processing
- index nodes: the upper level, which are organized as a B-tree, consist only of an index locating the index and key parts.
- sequence set: the linked list of leaves
- index set, independent with sequence set
- deletion: no changes for index nodes as long as the leaf remains at least half full. O.w., the redistribution or concatenation procedures are required for the index as well as the leaves nodes.
- insertion: when a leaf splits in two, promote a copy of the key instead of the midle key, in the index node, retaining the actual key in the right leaf.
- find: search the nearest right pointer if a key in the index equals the query key, until reaching a leaf node.
- Operation complexity
  - find: logarithm of the file size
  - insertion & deletion: O(log_dⁿ) in the worst case
  - next: at most 1 access
- Sequential processing needs space for only 1 node since no node will be accessed more than once.
- Prefix B⁺ tree
  - the index is a prefix of the actual keys for saving space and decreasing access time
  - Choosing the shortest unique prefix of the key: scan a small neighborhood of keys to obtain a good pair for the separation

Ubiquitous B-Tree

One-line Summary

Overview/Main Points

Relevance

Flaws