// -*- mode:c++; c-basic-offset:4 -*-
/*<std-header orig-src='shore' incl-file-exclusion='INTERNAL_H'>

 $Id: internal.h,v 1.5 2010/07/07 20:50:10 nhall Exp $

SHORE -- Scalable Heterogeneous Object REpository

Copyright (c) 1994-99 Computer Sciences Department, University of
Wisconsin -- Madison
All Rights Reserved.

Permission to use, copy, modify and distribute this software and its
documentation is hereby granted, provided that both the copyright
notice and this permission notice appear in all copies of the
software, derivative works or modified versions, and any portions
thereof, and that both notices appear in supporting documentation.

THE AUTHORS AND THE COMPUTER SCIENCES DEPARTMENT OF THE UNIVERSITY
OF WISCONSIN - MADISON ALLOW FREE USE OF THIS SOFTWARE IN ITS
"AS IS" CONDITION, AND THEY DISCLAIM ANY LIABILITY OF ANY KIND
FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE.

This software was developed with support by the Advanced Research
Project Agency, ARPA order number 018 (formerly 8230), monitored by
the U.S. Army Research Laboratory under contract DAAB07-91-C-Q518.
Further funding for this work was provided by DARPA through
Rome Research Laboratory Contract No. F30602-97-2-0247.

*/

/* -- do not edit anything above this line -- </std-header>*/

/* This file contains doxygen documentation only */

/**\page IMPLNOTES Implementation Notes
 *
 * \section MODULES Storage Manager Modules
 * The storage manager code contains the following modules (with related C++ classes):
 *
 * - \ref SSMAPI (ss_m)
 *   Most of the programming interface to the storage manager is encapsulated
 *   in the ss_m class.
 * - \ref IO_M (io_m), \ref VOL_M (vol_m) and \ref DIR_M (dir_m)
 *   These managers handle volumes, page allocation and stores, which are the
 *   structures underlying files of records, B+-Tree indexes, and
 *   spatial indexes (R*-Trees).
 * - \ref FILE_M (file_m), \ref BTREE_M (btree_m), and \ref RTREE_M (rtree_m)
 *   handle the storage structures available to servers.
 * - \ref LOCK_M (lock_m)
 *   The lock manager is quasi-stand-alone.
 * - \ref XCT_M (xct_t) and \ref LOG_M (log_m) handle transactions,
 *   logging, and recovery.
 * - \ref BF_M (bf_m)
 *   The buffer manager works closely with \ref XCT_M and \ref LOG_M.
 *
 * \section IO_M I/O Manager
 * The I/O manager was, in the early days of SHORE, expected to
 * have more responsibility than it now has; now it is little more
 * than a wrapper for the \ref VOL_M.
 *
 * \section VOL_M Volume Manager
 * The volume manager handles formatting of volumes,
 * allocation and deallocation of pages and extents in stores.
 * Within a page, allocation of space is up to the manager of the
 * storage structure (btree, rtree, or file).
 *
 * Pages are reserved for a store in units of ss_m::ext_sz,
 * a compile-time constant that indicates the size of an extent.
 * An extent is a set of contiguous pages. Extents are represented
 * by persistent data structures, extnode_t, which are
 * linked together to form the entire structure of a store.
 * A stnode_t holds metadata for a store and sits at the
 * head of the chain of extents that forms the store, and
 * the extents in the store list are marked as having an owner,
 * which is the store id of the store to which they belong.
 * A store id is a number of type snum_t, and an extent id is
 * a number of type extnum_t.
 * Scanning the pages in a store can be accomplished by scanning the
 * list of extnode_t.
 * Each extent has a number, and the pages in the extent are arithmetically
 * derived from the extent number; likewise, from any page,
 * its extent can be computed. Free extents are not linked together;
 * they simply have no owner (signified by an extnode_t::owner == 0).
 *
 * \subsection PAGES Page Types
 * Pages in a volume come in a variety of page types, all the same size.
 * The size of a page is a compile-time constant.
 * \bug GNATS 127: Larger page sizes are broken. We temporarily support
 * only 8KB pages.
 *
 * Each page type is a C++ class that derives from the base class
 * page_p. Page_p implements functionality that is common to all
 * (or most) page types. The types are as follows:
 *
 * - extlink_p : extent-link pages, used by vol_m
 * - stnode_p : store-node pages, used by vol_m
 * - file_p : slotted pages of file-of-record, used by file_m
 * - lgdata_p : data pages of large records, used by file_m
 * - lgindex_p : index pages of large records, used by file_m
 * - keyed_p : slotted pages of indexes, used by btree_m
 * - zkeyed_p : slotted pages of indexes, used by btree_m
 * - rtree_p : slotted pages of spatial indexes, used by rtree_m
 *
 * Issues specific to the page types will be dealt with in the descriptions of the modules.
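 *
 * The arithmetic derivation of an extent's pages (and of a page's extent)
 * described earlier in this section can be sketched as below. This is an
 * illustration only: EXT_SZ stands in for the compile-time constant
 * ss_m::ext_sz, and the real numbering scheme may differ in detail.
 *
```cpp
#include <cstdint>

// Hypothetical extent size; the storage manager's value is ss_m::ext_sz.
const uint32_t EXT_SZ = 8;

// First page number belonging to a given extent.
uint32_t first_page_of(uint32_t extnum) { return extnum * EXT_SZ; }

// Extent owning a given page: derived arithmetically, no lookup needed.
uint32_t ext_of(uint32_t pagenum) { return pagenum / EXT_SZ; }

// Position of a page within its extent.
uint32_t slot_in_ext(uint32_t pagenum) { return pagenum % EXT_SZ; }
```
 *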
 *
 * \subsection RSVD_MODE Space Reservation on a Page
 *
 * All pages are \e slotted (those that don't need the slot structure
 * may use only one slot) and have the following layout:
 * - header, including
 *   - lsn_t log sequence number of last page update
 *   - page id
 *   - links to next and previous pages (used by some storage structures)
 *   - page tag (indicates type of page)
 *   - space management metadata (space_t)
 *   - store flags (logging level metadata)
 * - slots (grow down)
 * - slot table array of pointers to the slots (grows up)
 * - footer (copy of log sequence number of last page update)
 *
 * Different storage structures offer different opportunities for fine-grained
 * locking and need different means of allocating space within a page.
 * Special care is taken to reserve space on a page when slots
 * are freed (records are deleted) so that rollback can restore
 * the space on the page.
 * Page types that use this space reservation have
 * \code page_p::rsvd_mode() == true \endcode.
 *
 * In the case of B-trees, this is not used because
 * undo and redo are handled logically -- entries
 * can be re-inserted in a different page. But in the case of files,
 * records are identified by physical ID, which includes page and slot number,
 * so records must be reinserted just where they first appeared.
 *
 * Holes in a page are coalesced (moved to the end of the page) as needed,
 * when the total free space on the page satisfies a need but the
 * contiguous free space does not. Hence, a record truncation followed
 * by an append to the same record does not necessarily cause the
 * shifting of other records on the same page.
 *
 * A count of free bytes is maintained for all pages. More space-allocation
 * metadata is maintained for rsvd_mode() pages.
 *
 * When a transaction releases a slot on a page with rsvd_mode(), the slot
 * remains "reserved" for use by the same transaction. That slot is not free to
 * be allocated by another transaction until the releasing transaction
 * commits. This is because if the transaction aborts, the slot must
 * be restored with the same slot number. Not only must the slot number be
 * preserved, but the number of bytes consumed by that slot must remain
 * available lest the transaction abort.
 * The storage manager keeps track of the youngest active transaction
 * that is freeing space (i.e., "reserving" it) on the page
 * and the number of bytes freed ("reserved")
 * by the youngest transaction.
 * When the youngest transaction to reserve space on the page becomes
 * older than the oldest active transaction in the system, the reserved
 * space becomes free. This check for freeing up the reserved space happens
 * whenever a transaction tries to allocate space on the page.
 *
 * During rollback, a transaction can use \e any amount of
 * reserved space, but during forward processing, it can only use space
 * it reserved, and that is known only if the transaction in question is
 * the youngest transaction described in the above paragraph.
 *
 * The changes to space-reservation metadata (space_t) are not logged.
 * The actions that result in updates to this metadata are logged (as
 * page mark and page reclaim).
 *
 * \bug GNATS 138 (performance) The allocation of
 * space for records is known to have performance problems and this portion
 * of the storage manager needs redesign.
 * This is a problem inherited from the original SHORE storage manager.
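 *
 * The aging rule above -- reserved space becomes free once the youngest
 * reserving transaction is older than the oldest active transaction -- can
 * be sketched as follows. Names and layout are illustrative, not the real
 * space_t; the sketch assumes transaction IDs are assigned in increasing
 * order, so a smaller ID means an older transaction.
 *
```cpp
#include <cstdint>

typedef uint64_t tid_t; // monotonically increasing transaction IDs

struct space_t_sketch {
    tid_t    youngest_reserver; // youngest xct that "reserved" space here
    uint32_t reserved_bytes;    // bytes it reserved
    uint32_t free_bytes;        // bytes free for anyone

    // Called when a transaction tries to allocate space on the page:
    // if the youngest reserver is older than the oldest active xct in
    // the system, it must have ended, so its reservation can be freed.
    void maybe_release(tid_t oldest_active) {
        if (youngest_reserver < oldest_active) {
            free_bytes += reserved_bytes;
            reserved_bytes = 0;
        }
    }

    // During forward processing, a transaction may use the reserved
    // bytes only if it is itself the youngest reserver.
    uint32_t usable_bytes(tid_t me) const {
        return free_bytes + (me == youngest_reserver ? reserved_bytes : 0);
    }
};
```
 *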
 *
 * \subsection ALLOCEXT Allocation of Extents and Pages
 * Extents are linked through extlink_t to form a linked list, which
 * composes the allocated extents of a store. The volume manager
 * handles allocating extents to stores. Deallocation
 * of extents is shared by the volume manager and the lock manager
 * (see the discussion in lockid_t::set_ext_has_page_alloc).
 *
 * The volume manager keeps a cache of {store, extent} pairs, to find
 * extents already allocated to a store that contain free pages. This
 * cache is consulted before new extents are allocated to a store.
 * Since after restart the cache is necessarily empty, it is primed when
 * first needed for the purpose of allocating anything for the store.
 *
 * Priming the cache is an expensive operation.
 * It is not done on each volume mount, because volumes are mounted and
 * dismounted several times during recovery, and priming on each
 * mount would be prohibitive.
 *
 * \section FILE_M File Manager
 * A file is a group of variable-sized records (or objects).
 * A record is the smallest persistent datum that has identity.
 * A record may also have a "user header", whose contents are
 * for use by the server.
 * As records vary in size, so their storage representation varies.
 * The storage manager changes the storage representation as needed.
 * A file comprises two stores.
 * One store is allocated for slotted (small-record) pages, called file_p
 * pages.
 * One store is allocated for large records, and contains lgdata_p and
 * lgindex_p pages.
 * Small records are those whose size is less than or equal to
 * sm_config_info_t.max_small_rec. A record larger than this
 * has a slot on a small-record page, which slot contains metadata
 * referring to pages in the large-record store.
 * The scan order of a file is the physical order of the records
 * in the small-record store.
 *
 * Every record, large or small, has the following metadata in the
 * record's slot on the file_p page; these data are held in a rectag_t
 * structure:
 * \code
 * struct rectag_t {
 *     uint2_t   hdr_len;  // length of user header, may be zero
 *     uint2_t   flags;    // enum recflags_t: indicates internal implementation
 *     smsize_t  body_len; // true length of the record
 * };
 * \endcode
 * The flags have these values:
 * - t_small : a small record, entirely contained on the file_p
 * - t_large_0 : a large record; the slot on the file_p contains the
 *   user header, while the body is a list
 *   of chunks (pointers to contiguous lgdata_p pages)
 * - t_large_1 : a large record; the slot on the file_p contains the
 *   user header, while the body is a reference to a single
 *   lgindex_p page, which is the root of a 1-level index of
 *   lgdata_p pages.
 * - t_large_2 : like t_large_1 but the index may be two levels deep. This
 *   has not been implemented.
 *
 * Internally (inside the storage manager), the class record_t is a
 * handle on the record's tag and is the class through which the
 * rectag_t is manipulated.
 *
 * A record is exposed to the server through a set of ss_m methods (create_rec,
 * append_rec, etc.), and through the pin_i class.
 *
 * All updates to records are accomplished by copying out part or all of
 * the record from the buffer pool to the server's address space, performing
 * updates there, and handing the new data to the storage manager.
 * User (server) data are not updated directly in the buffer pool.
 *
 * The server may cause the file_p and at most one large data page to
 * be pinned for a given record through the pin_i class; the server must
 * take care not to create latch-latch deadlocks by holding a record pinned
 * while attempting to pin another. An ordering protocol among the pages
 * pinned must be observed to avoid such deadlocks.
 *
 * \note The system only detects lock-lock deadlocks. Deadlocks involving
 * mutexes or latches or other blocking mechanisms will cause the server to
 * hang.
 *
 * \subsection HISTO Finding a Page with Space
 *
 * When a record is created, the file manager tries to use an already-allocated
 * page that has space for the record (based on the length hint and
 * the data given in the create_rec call). The file manager caches information
 * about page utilization for pages in each store. The page utilization
 * data take the form of a histoid_t, which contains a heap
 * and a histogram. The heap keeps track of the amount of
 * free space in (recently-used) pages in the store, and it is
 * searchable so that it can
 * return the page with the smallest free space that is larger than a
 * given value.
 * The histogram has a small number of buckets, each of which counts
 * the number of pages in the file containing free space between
 * the bucket min and the bucket max.
 *
 * Three policies can be used in combination to search for space:
 * - t_cache : look in the heap for a page with space.
 * - t_compact : if the histograms say there are any pages with
 *   sufficient space somewhere in the file,
 *   do a linear search of the file for such a page.
 * - t_append : append the new record to the file.
 *
 * Using append_file_t to create records means only t_append is used,
 * ensuring that the record will always be appended to the file.
 * ss_m::create_rec uses t_cache | t_compact | t_append.
 *
 * \bug GNATS 33 (performance) t_compact is turned off. It is expensive and has not been
 * tested lately, thus there is presently no policy that will keep
 * files compact. If files become too sparse, the server must reorganize
 * the database.
 *
 * Once the file manager has located a page with sufficient space to
 * create the record, the I/O and volume managers worry about
 * \ref RSVD_MODE.
 *
 * \subsection ALLOCFILEPAGE Allocating a New File Page
 *
 * If the file manager cannot find an already-allocated page with
 * sufficient space, or if it is appending a record to a file and
 * needs to allocate a new page, it first descends to the I/O manager
 * to find an extent with a free page (or the last extent in the file)
 * (\ref ALLOCEXT).
 *
 * Once it has found an extent with a free page, it allocates a page in
 * that extent.
 *
 * IX-locks are acquired on file pages for the purpose of fine-grained locking.
 * There is no file structure superimposed on the store, with which
 * to link the file pages together, so as soon as an extent is allocated
 * to a store, it is visible to any transaction; in particular, it is
 * visible between the time the page is allocated by the I/O manager and
 * the time the file manager formats that page for use.
 * For this reason, the allocation of a file page must be a top-level action
 * so that
 * undo of the allocation is logical: it must check for the file page
 * still being empty before the page can be freed.
 * If another transaction slipped in and allocated a record on the same page,
 * the page can't be freed on undo of the page allocation.
 *
 * Protocol for allocating a file page:
 * - Enter I/O manager (mutex).
 * - Start top-level action (grab log anchor).
 * - IX-lock (long duration) the page.
 * - Allocate the page.
 * - End top-level action (free the log anchor) to compensate around
 *   the changes to the extent map.
 * - Log the file-page allocation.
 *
 * \bug GNATS 129 There is a gap between the end of the top-level action
 * and the logging of the file-page allocation during which a crash could
 * cause a page to be allocated but unformatted. This needs to be fixed; the
 * fix is to support undoable-compensation log records so that the log insert
 * and the compensation can be done in one operation. The issue is in
 * handling these records on restart.
 *
 * To free a page, the transaction must acquire an EX lock on the page;
 * this prevents the page from being used by any other transaction until
 * the freeing transaction commits. If the EX lock cannot be acquired,
 * the page is in use and will not be freed (e.g., the other transaction
 * could abort and re-insert something on the page).
 *
 * \section BTREE_M B+-Tree Manager
 *
 * The values associated with the keys are opaque to the storage
 * manager, except when IM (Index Management locking protocol) is used,
 * in which case the value is
 * treated as a record ID, but no integrity checks are done.
 * It is the responsibility of the server to see that the value is
 * legitimate in this case.
 *
 * B-trees can be bulk-loaded from files of sorted key-value pairs,
 * as long as the keys are in \ref LEXICOFORMAT "lexicographic form".
 * \bug GNATS 116 Btree doesn't sort elements for duplicate keys in bulk-load.
 * This is a problem inherited from the original SHORE storage manager.
 *
 * The implementation of B-trees is straight from the Mohan ARIES/IM
 * and ARIES/KVL papers. See \ref MOH1, which covers both topics.
 *
 * Those two papers give a thorough explanation of the arcane algorithms,
 * including logging considerations.
 * Anyone considering changing the B-tree code is strongly encouraged
 * to read these papers carefully.
 * Some of the performance tricks described in these papers are
 * not implemented here.
 * For example, the ARIES/IM paper describes performing logical
 * undo of insert operations only when physical undo
 * is not possible.
 * The storage manager always undoes inserts logically.
 *
 * \bug GNATS 137 Latches can now be downgraded; btree code should use this.
 *
 * \section RTREE_M R*-Tree Manager
 *
 * The spatial indexes in the storage manager are R*-trees, a variant
 * of R-trees that performs frequent restructuring to yield higher
 * performance than normal R-trees. The entire index is locked.
 * See \ref BKSS.
 *
 * \section DIR_M Directory Manager
 * All storage structures created by a server
 * have entries in a B+-Tree index called the
 * \e store \e directory or just \e directory.
 * This index is not exposed to the server.
 *
 * The storage manager maintains some transient and some persistent data
 * for each store. The directory's key is the store ID, and the value it
 * returns from a lookup is a
 * sdesc_t ("store descriptor") structure, which
 * contains both the persistent and transient information.
 *
 * The persistent information is in a sinfo_s structure; the
 * transient information is resident only in the cache of sdesc_t
 * structures that the directory manager
 * maintains.
 *
 * The metadata include:
 * - what kind of storage structure uses this store (btree, rtree, file)
 * - if a B-tree, is it unique and what kind of locking protocol does it use?
 * - what stores compose this storage structure (e.g., if a file, what is the
 *   large-object store and what is the small-record store?)
 * - what is the root page of the structure (if an index)
 * - what is the key type if this is an index
 *
 * \section LOCK_M Lock Manager
 *
 * The lock manager understands the following kinds of locks:
 * - volume
 * - extent
 * - store
 * - page
 * - kvl
 * - record
 * - user1
 * - user2
 * - user3
 * - user4
 *
 * Lock requests are issued with a lock ID (lockid_t), which
 * encodes the identity of the entity being locked, the kind of
 * lock, and, by inference, a lock hierarchy for a subset of the
 * kinds of locks above.
 * The lock manager does not insist that lock identifiers
 * refer to any existing object.
 *
 * The lock manager enforces two lock hierarchies:
 * - Volume - store - page - record
 * - Volume - store - key-value
 *
 * Note that the following lock kinds are not in any hierarchy:
 * - extent
 * - user1, user2, user3, user4
 *
 * Other than the way the lock identifiers are inspected for the purpose
 * of enforcing the hierarchy, lock identifiers are considered opaque
 * data by the lock manager.
 *
 * The lockid_t structure can be constructed from the IDs of the
 * various entities in (and out of) the hierarchy; see lockid_t and
 * the example lockid_test.cpp.
 *
 * The lock manager escalates up the hierarchy by default.
 * The escalation thresholds are based on run-time options.
 * They can be controlled (set, disabled) on a per-object level.
 * For example, escalation to the store level can be disabled when
 * increased concurrency is desired.
 * Escalation can also be controlled on a per-transaction or per-server basis.
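 *
 * A count-based escalation check of the sort described above might look
 * like the following sketch. The names and the bookkeeping are illustrative
 * only; the storage manager's actual thresholds and per-object controls
 * differ.
 *
```cpp
#include <map>
#include <string>

// Sketch: when a transaction holds `threshold` or more child locks under
// one parent, the caller should escalate to the parent lock instead.
// A threshold of 0 disables escalation for all parents.
struct escalation_sketch {
    std::map<std::string, int> child_locks; // parent id -> child-lock count
    int threshold;

    // Record one more child lock under `parent`; returns true if the
    // caller should now escalate to the parent lock.
    bool note_child_lock(const std::string& parent) {
        if (threshold == 0) return false; // escalation disabled
        return ++child_locks[parent] >= threshold;
    }
};
```
 *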
 *
 * Locks are acquired by storage manager operations as appropriate to the
 * use of the data (read/write). (Update locks are not acquired by the
 * storage manager.)
 *
 * The storage manager's API allows explicit acquisition
 * of locks by a server.
 * Freeing locks is automatic at transaction commit and rollback.
 * There is limited support for freeing locks in the middle of
 * a transaction:
 * - locks of duration less than t_long can be unlocked with unlock(), and
 * - quarks (sm_quark_t) simplify acquiring and freeing locks mid-transaction:
 *
 * A quark is a marker in the list of locks held by a transaction.
 * When the quark is destroyed, all locks acquired since the
 * creation of the quark are freed. Quarks cannot be used while more than
 * one thread is attached to the transaction, although the storage
 * manager does not strictly enforce this (due to the cost).
 * When a quark is in use for a transaction, the locks acquired
 * will be of short duration, the assumption being that the quark
 * will be closed before commit-time.
 *
 * Extent locks are an exception; they must be held long-term for
 * page allocation and deallocation to work, so even in the context
 * of an open quark, extent locks will be held until end-of-transaction.
 *
 * The lock manager uses a hash table whose size is determined by
 * a configuration option.
 * The hash function used by the lock manager is known not
 * to distribute locks evenly among buckets.
 * This is partly due to the nature of lock IDs.
 *
 * To avoid expensive lock manager queries, each transaction
 * keeps a cache of the last <N> locks acquired (the number
 * <N> is a compile-time constant).
 * This close association between the transaction manager and
 * the lock manager is encapsulated in several classes in the file lock_x.
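 *
 * The per-transaction cache of the last <N> locks can be sketched as a
 * small fixed-size table with round-robin replacement. This is an
 * illustration only; the real cache lives in lock_x and differs in
 * structure and replacement policy.
 *
```cpp
#include <array>
#include <cstdint>

const int N = 4; // stand-in for the compile-time constant <N>

struct lock_cache_sketch {
    std::array<uint64_t, N> ids{}; // recently acquired lock ids (0 = empty)
    int next = 0;                  // round-robin replacement cursor

    // Remember a newly acquired lock, possibly evicting the oldest entry.
    void remember(uint64_t lockid) {
        ids[next] = lockid;
        next = (next + 1) % N;
    }

    // Consulted before querying the lock manager.
    bool holds(uint64_t lockid) const {
        for (uint64_t id : ids)
            if (id == lockid) return true;
        return false;
    }
};
```
 *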
 *
 * \subsection DLD Deadlock Detection
 * The lock manager uses a statistical deadlock-detection scheme
 * known as "Dreadlocks" [KH1].
 * Each storage manager thread (smthread_t) has a unique fingerprint, which is
 * a set of bits; the deadlock detector ORs together the bits of the
 * elements in a waits-for-dependency-list; each thread, when
 * blocking, holds a digest (the ORed bitmap).
 * It is therefore cheap for a thread to detect a cycle when it needs to
 * block awaiting a lock: it looks at the holders
 * of the lock, and if it finds itself in any of their digests, a
 * cycle will result.
 * This works well when the total number of threads relative to the bitmap
 * size is such that it is possible to assign a unique bitmap to each
 * thread.
 * If you cannot do so, you will have false-positive deadlocks
 * "detected".
 * The storage manager counts, in its statistics, the number of times
 * it could not assign a unique fingerprint to a thread.
 * If you notice excessive transaction-aborts due to false-positive
 * deadlocks,
 * you can compile the storage manager to use a larger
 * number of bits in the
 * \code sm_thread_map_t \endcode
 * found in
 * \code smthread.h \endcode.
 *
 * \section XCT_M Transaction Manager
 * When a transaction commits, the stores that are marked for deletion
 * are deleted, and the stores that were given sm_store_property_t
 * t_load_file or t_insert_file are turned into t_regular stores.
 * Because these are logged actions, and they occur if and only if the
 * transaction commits, the storage manager guarantees that the ending
 * of the transaction and re-marking and deletion of stores is atomic.
 * This is accomplished by putting the transaction into a state,
 * xct_freeing_space, and writing a log record to that effect.
 * The space is freed, the stores are converted, and a final log record
 * is written before the transaction is truly ended.
 * In the event of a crash while a transaction is freeing space,
 * recovery searches all the
 * store metadata for stores marked for deletion
 * and deletes those that would otherwise have been missed in redo.
 *
 * \section LOG_M Log Manager
 *
 * \subsection LOG_M_USAGE How the Server Uses the Log Manager
 *
 * Log records for redoable-undoable operations contain both the
 * redo- and undo- data, hence an operation never causes two
 * different log records to be written for redo and for undo.
 * This, too, controls logging overhead.
 *
 * The protocol for applying an operation to an object is as follows:
 * - Lock the object.
 * - Fix the page(s) affected in exclusive mode.
 * - Apply the operation.
 * - Write the log record(s) for the operation.
 * - Unfix the page(s).
 *
 * The protocol for writing log records is as follows:
 * - Grab the transaction's log buffer, in which the last log record is to be
 *   cached, by calling xct_t::get_logbuf().
 * - Write the log record into the buffer (the idiom is to construct it
 *   there using C++ placement-new).
 * - Release the buffer with xct_t::give_logbuf(),
 *   passing in as an argument the fixed page that was affected
 *   by the update being logged. This does several things:
 *   - writes the transaction ID and the previous LSN for this transaction
 *     into the log record
 *   - inserts the record into the log and remembers this record's LSN
 *   - marks the given page dirty.
 *
 * Between the time the xct log buffer is grabbed and the time it is
 * released, the buffer is held exclusively by the one thread that
 * grabbed it, and updates to the xct log buffer can be made freely.
 * (Note that this per-transaction log buffer is unrelated to the log buffer
 * internal to the log manager.)
 *
 * The above protocol is enforced by the storage manager in helper
 * functions that create log records; these functions are generated
 * by Perl scripts from the source file logdef.dat. (See \ref LOGDEF.)
 *
 * \subsection LOGRECS Log Record Types
 * \anchor LOGDEF
 * The input to the above-mentioned Perl script is the source of all
 * log record types. Each log record type is listed in the file
 * \code logdef.dat \endcode
 * which is fairly self-explanatory, reproduced here:
 * \include logdef.dat
 *
 * The bodies of the methods of the class <log-rec-name>_log
 * are hand-written and reside in \code logrec.cpp \endcode.
 *
 * Adding a new log record type consists in adding a line to
 * \code logdef.dat, \endcode
 * adding method definitions to
 * \code logrec.cpp, \endcode
 * and adding the calls to the free function log_<log-rec-name>(args)
 * in the storage manager.
 * The base class for every log record is logrec_t, which is worth study
 * but is not documented here.
 *
 * Some log records are \e compensated, meaning that the
 * log records are skipped during rollback.
 * Compensations may be needed because some operation simply cannot
 * be undone. The protocol for compensating actions is as follows:
 * - Fix the needed pages.
 * - Grab an \e anchor in the log.
 *   This is an LSN for the last log record written for this transaction.
 * - Update the pages and log the updates as usual.
 * - Write a compensation log record (or piggy-back the compensation on
 *   the last-written log record for this transaction to reduce
 *   logging overhead) and free the anchor.
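 *
 * The anchor-and-compensate idiom can be illustrated with a toy log in
 * which a compensation record simply redirects rollback to the anchor,
 * so the records logged after the anchor are skipped during undo. This
 * is a sketch only, not the logrec_t/xct_t machinery.
 *
```cpp
#include <vector>

struct toy_log {
    // undo_next == -1 means "undo this record, then the one before it";
    // undo_next >= 0 means "skip back to that LSN" (a compensation).
    struct rec { int undo_next; };
    std::vector<rec> recs;

    // An anchor is just the LSN of the last record written so far.
    int anchor() const { return (int)recs.size() - 1; }

    void log_update()       { recs.push_back({-1}); }
    void compensate(int to) { recs.push_back({to}); }

    // Count the records that rollback would actually undo.
    int count_undone() const {
        int n = 0;
        for (int i = (int)recs.size() - 1; i >= 0; ) {
            if (recs[i].undo_next >= 0)
                i = recs[i].undo_next; // jump over the compensated span
            else {
                ++n;                   // this record is undone
                --i;
            }
        }
        return n;
    }
};
```
 *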
 *
 * \note Grabbing an anchor prevents all other threads in a multi-threaded
 * transaction from gaining access to the transaction manager. Be careful
 * with this, as it can cause mutex-latch deadlocks where multi-threaded
 * transactions are concerned. In other words, two threads cannot concurrently
 * update in the same transaction.
 *
 * In some cases, the following protocol is used to avoid excessive
 * logging by general update functions that, if logging were turned
 * on, would generate log records of their own.
 * - Fix the pages needed in exclusive mode.
 * - Turn off logging for the transaction.
 * - Perform the updates by calling some general functions. If an error
 *   occurs, undo the updates explicitly.
 * - Turn on logging for the transaction.
 * - Log the operation. If an error occurs, undo the updates with logging
 *   turned off.
 * - Unfix the pages.
 *
 * The mechanism for turning off logging for a transaction is to
 * construct an instance of xct_log_switch_t.
 *
 * When the instance is destroyed, the original logging state
 * is restored. The switch applies only to the transaction that is
 * attached to the thread at the time the switch instance is constructed,
 * and it prevents other threads of the transaction from using
 * the log (or doing much else in the transaction manager)
 * while the switch exists.
 *
 * \subsection LOG_M_INTERNAL Log Manager Internals
 *
 * The log is a collection of files, all in the same directory, whose
 * path is determined by a run-time option.
 * Each file in the directory is called a "log file" and represents a
 * "partition" of the log. The log is partitioned into files to make it
 * possible to archive portions of the log to free up disk space.
 * A log file has the name \e log.<n> where <n> is a positive integer.
 * The log file name indicates the set of logical sequence numbers (lsn_t)
 * of log records (logrec_t) that are contained in the file. An
 * lsn_t has a \e high part and a \e low part, and the
 * \e high part (a.k.a., \e file part) is the <n> in the log file name.
 *
 * The user-settable run-time option sm_logsize indicates the maximum
 * number of KB that may be opened at once; this, in turn, determines the
 * size of a partition file, since the number of partition files is
 * a compile-time constant.
 * The storage manager computes partition sizes based on the user-provided
 * log size, such that partition sizes are a convenient multiple of blocks
 * (more about which below).
 *
 * A new partition is opened when the tail of the log approaches the end
 * of a partition, that is, when the next insertion into the log
 * is at an offset larger than the maximum partition size. (There is a
 * fudge factor of BLOCK_SIZE in here for convenience in implementation.)
 *
 * The \e low part of an lsn_t represents the byte-offset into the log file
 * at which the log record with that lsn_t sits.
 *
 * Thus, the total file size of a log file \e log.<n>
 * is the size of all log records in the file,
 * and the lsn_t of each log record in the file is
 * lsn_t(<n>, <byte-offset>) of the log record within the file.
 *
 * The log is, conceptually, a forever-expanding set of files. The log
 * manager will open at most PARTITION_COUNT log files at any one time.
 * - PARTITION_COUNT = smlevel_0::max_openlog
 * - smlevel_0::max_openlog (sm_base.h) = SM_LOG_PARTITIONS
 * - SM_LOG_PARTITIONS is a compile-time constant (which can be overridden in
 *   config/shore.def).
 *
 * The log is considered to have run out of space if logging requires
 * more than smlevel_0::max_openlog partitions.
 * Partitions are needed only as long as they contain log records
 * needed for recovery, which means:
 * - log records for pages not yet made durable (min recovery lsn)
 * - log records for uncommitted transactions (min xct lsn)
 * - log records belonging to the last complete checkpoint
 *
 * After a checkpoint is taken and its log records are durable,
 * the storage manager tries to scavenge all partitions that do not
 * contain necessary log records. The buffer manager provides the
 * min recovery lsn; the transaction manager provides the min xct lsn;
 * and the log manager keeps track of the location of the last
 * completed checkpoint in its master_lsn. The \e file part of the
 * minimum of these lsns indicates the lowest partition
 * that cannot be scavenged; all earlier partitions are removed.
 *
 * When the log is in danger of running out of space
 * (because there are long-running transactions, for example)
 * the server may be called via the
 * LOG_WARN_CALLBACK_FUNC argument to ss_m::ss_m. This callback may
 * abort a transaction to free up log space, but the act of aborting
 * consumes log space. It may also archive a log file and remove it.
 * If the server provided a
 * LOG_ARCHIVED_CALLBACK_FUNC argument to ss_m::ss_m, this callback
 * can be used to retrieve archived log files when needed for
 * rollback.
 * \warning This functionality is not complete and has not been
 * well-tested.
 *
 * Log files (partitions) are written in fixed-sized blocks. The log
 * manager pads writes, if necessary, to make them BLOCK_SIZE.
 * - BLOCK_SIZE = 8192, a compile-time constant.
 *
 * A skip_log record indicates the logical end of a partition.
 * The log manager ensures that the last log record in a file
 * is always a skip_log record.
 *
 * Log files (partitions) are composed of segments. A segment is
 * an integral number of blocks.
 * - SEGMENT_SIZE = 128*BLOCK_SIZE, a compile-time constant.
 *
 * The smallest partition is one segment plus one block; a partition
 * may be many segments plus one block. The last block enables
 * the log manager to write the skip_log record to indicate the
 * end of the file.
 *
 * The partition size is determined by the storage manager run-time option,
 * sm_logsize, which determines how much log can be open at any time,
 * i.e., the combined sizes of the PARTITION_COUNT partitions.
 *
 * The maximum size of a log record (logrec_t) is 3 storage manager pages.
 * The page size happens to match the block size, but the two compile-time
 * constants are not inter-dependent.
 * A segment is substantially larger than a block, so it can hold at least
 * several maximum-sized log records, preferably many.
 *
 * Inserting a log record consists of copying it into the log manager's
 * log buffer (1 segment in size). The buffer wraps as long as there
 * is room in the partition. Meanwhile, a log-flush daemon thread
 * writes out unflushed portions of the log buffer.
 * The daemon can lag behind insertions, so each insertion checks for
 * space in the log buffer before it performs the insert. If there isn't
 * enough space, it waits until the log-flush daemon has made room.
 *
 * When insertion of a log record would wrap around the buffer and the
 * partition has no room for more segments, a new partition is opened,
 * and the newly-inserted log record goes, in its entirety, into that
 * new partition.
 * Meanwhile, the log-flush daemon sees that the rest of the log
 * buffer is written to the old partition, and the next time it
 * performs a flush, it will be flushing to the
 * new partition.
 *
 * The bookkeeping of the log buffer's free and used space is handled
 * by the notion of \e epochs.
 * An epoch keeps track of the start and end of the unflushed portion
 * of the segment (log buffer). Thus, an epoch refers to only one
 * segment (logically, a log buffer copy within a partition).
 * When an insertion fills the log buffer and causes it to wrap, a new
 * epoch is created for the portion of the log buffer representing
 * the new segment, and the old epoch keeps track of the portion of the
 * log buffer representing the old segment. The inserted log record
 * usually spans the two segments, as the segments are written contiguously
 * to the same log file (partition).
 *
 * When an insertion causes a wrap and there is no more room in the
 * partition to hold the new segment, a new
 * epoch is created for the portion of the log buffer representing
 * the new segment, and the old epoch keeps track of the portion of the
 * log buffer representing the old segment, as before.
 * Now, however, the inserted log record is inserted, in its entirety,
 * into the new segment. Thus, no log record spans partitions.
 *
 * Meanwhile, the log-flush daemon knows about the possible existence of
 * two epochs. When an old epoch is valid, it flushes that epoch.
 * When a new epoch is also valid, it flushes that one as well.
 * If the two epochs have the same target partition, the two flushes are
 * done with a single write.
 *
 * The act of flushing an epoch to a partition consists of a single
 * write of a size that is an even multiple of BLOCK_SIZE. The
 * flush appends a skip_log record, and zeroes as needed, to round out the
 * size of the write.
Writes re-write portions of the log already 00763 * written, in order to overwrite the skip_log record at the tail of the 00764 * log (and put a new one at the new tail). 00765 * 00766 * 00767 *\subsection RECOV Recovery 00768 * The storage manager performs ARIES-style logging and recovery. 00769 * This means the logging and recovery system has these characteristics: 00770 * - uses write-ahead logging (WAL) 00771 * - repeats history on restart before doing any rollback 00772 * - all updates are logged, including those performed during rollback 00773 * - compensation records are used in the log to bound the amount 00774 * of logging done for rollback 00775 * and guarantee progress in the case of repeated 00776 * failures and restarts. 00777 * 00778 * Each time a storage manager (ss_m class) is constructed, the logs 00779 * are inspected, the last checkpoint is located, and its lsn is 00780 * remembered as the master_lsn, then recovery is performed. 00781 * Recovery consists of three phases: analysis, redo and undo. 00782 * 00783 *\subsubsection RECOVANAL Analysis 00784 * This pass analyzes the log starting at the master_lsn, and 00785 * reading log records written thereafter. Reading the log records for the 00786 * last completed checkpoint, it reconstructs the transaction table, the 00787 * buffer-pool's dirty page table, and mounts the devices and 00788 * volumes that were mounted at the time of the checkpoint. 00789 * From the dirty page table, it determines the \e redo_lsn, 00790 * the lowest recovery lsn of the dirty pages, which is 00791 * where the next phase of recovery must begin. 00792 * 00793 *\subsubsection RECOVREDO Redo 00794 * This pass starts reading the log at the redo_lsn, and, for each 00795 * log record thereafter, decides whether that log record's 00796 * work needs to be redone. 
 The general protocol is:
 * - if the log record is not redoable, it is ignored
 * - if the log record is redoable and contains a page ID, the
 *   page is inspected and its lsn is compared to that of the log
 *   record. If the page lsn is later than the log record's sequence number,
 *   the page does not need to be updated per this log record, and the
 *   action is not redone.
 *
 *\subsubsection RECOVUNDO Undo
 * After redo, the state of the database matches that at the time
 * of the crash. Now the storage manager rolls back the transactions that
 * remain active.
 * Care is taken to undo the log records in reverse chronological order,
 * rather than allowing several transactions to roll back
 * at their own pace. This is necessary because some operations
 * use page-fixing for concurrency control: pages are protected
 * only with latches if there is no page lock in
 * the lock hierarchy, which occurs when
 * logical logging and high-concurrency locking are used,
 * in the B-trees, for example. A crash in the middle of
 * a compensated action such as a page split must result in
 * the split being undone before any other operations on the
 * tree are undone.
 * \bug GNATS 49 (performance) There is no concurrent undo.
 *
 * After the storage manager has recovered, control returns from its
 * constructor method to the caller (the server).
 * There might be transactions left in prepared state.
 * The server is now free to resolve these transactions by
 * communicating with its coordinator.
 *
 *\subsection LSNS Log Sequence Numbers
 *
 * Write-ahead logging requires a close interaction between the
 * log manager and the buffer manager: before a page can be flushed
 * from the buffer pool, the log might have to be flushed.
 *
 * This also requires a close interaction between the transaction
 * manager and the log manager.
 *
 * All three managers understand a log sequence number (lsn_t).
 * Log sequence numbers serve to identify and locate log records
 * in the log, to timestamp pages, to identify the last
 * update performed by a transaction, and to identify the last log record
 * written by a transaction. Since every update is logged, every update
 * can be identified by a log sequence number. Each page bears
 * the log sequence number of the last update that affected that
 * page.
 *
 * A page cannot be written to disk until the log record with that
 * page's lsn has been written to the log (and is on stable storage).
 * A log sequence number is a 64-bit structure, with one part identifying
 * a log partition (file) number and the rest identifying an offset within
 * the file.
 *
 * \subsection LOGPART Log Partitions
 *
 * The log is partitioned to simplify archiving to tape (not implemented).
 * The log comprises 8 partitions, where each partition's
 * size is limited to approximately 1/8 the maximum log size given
 * in the run-time configuration option sm_logsize.
 * A partition resides in a file named \e log.<n>, where \e n
 * is the partition number.
 * The configuration option sm_logdir names a directory
 * (which must exist before the storage manager is started)
 * in which the storage manager may create and destroy log files.
 *
 * The storage manager may have at most 8 active partitions at any one time.
 * An active partition is one that is needed because it
 * contains log records for running transactions. Such partitions
 * could (if it were supported) be streamed to tape and their disk
 * space reclaimed.
 Space is reclaimed when the oldest transaction
 * ends and the new oldest transaction's first log record is
 * in a newer partition than that in which the old oldest
 * transaction's first log record resided.
 * Until tape archiving is implemented, the storage
 * manager issues an error (eOUTOFLOGSPACE)
 * if it consumes sufficient log space to be unable to
 * abort running transactions and perform all resulting necessary logging
 * within the 8 partitions available.
 * \note Determining the point at which there is insufficient space to
 * abort all running transactions is a heuristic matter and it
 * is not reliable.
 *
 * Log records are buffered by the log manager until forced to stable
 * storage to reduce I/O costs.
 * The log manager keeps a buffer of a size that is determined by
 * a run-time configuration option.
 * The buffer is flushed to stable storage when necessary.
 * The last log record in the buffer is always a skip log record,
 * which indicates the end of the log partition.
 *
 * Ultimately, archiving to tape is necessary. The storage manager
 * does not perform write-aside or any other work in support of
 * long-running transactions.
 *
 * The checkpoint manager chkpt_m sleeps until kicked into action
 * by the log manager, and when it is kicked, it takes a checkpoint,
 * then sleeps again. Taking a checkpoint amounts to these steps:
 * - Write a chkpt_begin log record.
 * - Write a series of log records recording the mounted devices and volumes.
 * - Write a series of log records recording the buffer pool's dirty pages.
 *   For each dirty page in the buffer pool, the page id and its recovery lsn
 *   are logged.
 *
 * \anchor RECLSN
 * A page's recovery lsn is metadata stored in the buffer
 * manager's control block, but is not written on the page.
 * It represents an lsn prior to or equal to the log's current lsn at
 * the time the page was first marked dirty. Hence, it
 * is less than or equal to the LSN of the log record for the first
 * update to that page after the page was read into the buffer
 * pool (and remained there until this checkpoint). The minimum
 * of all the recovery lsns written in this checkpoint
 * will be a starting point for crash-recovery, if this is
 * the last checkpoint completed before a crash.
 * - Write a series of log records recording the states of the known
 *   transactions, including the prepared transactions.
 * - Write a chkpt_end log record.
 * - Tell the log manager where this checkpoint is: the lsn of the chkpt_begin
 *   log record becomes the new master_lsn of the log. The master_lsn is
 *   written in a special place in the log so that it can always be
 *   discovered on restart.
 *
 * These checkpoint log records may interleave with other log records, making
 * the checkpoint "fuzzy"; this way the world doesn't have to grind to
 * a halt while a checkpoint is taken, but there are a few operations that
 * must be serialized with all or portions of a checkpoint. Those operations
 * use mutex locks to synchronize. Synchronization of operations is
 * as follows:
 * - Checkpoints cannot happen simultaneously -- they are serialized with
 *   respect to each other.
 *
 * - A checkpoint and the following are serialized:
 *   - mount or dismount a volume
 *   - prepare a transaction
 *   - commit or abort a transaction (a certain portion of this must
 *     wait until a checkpoint is not happening)
 *   - heroics to cope with a shortage of log space
 * - The portion of a checkpoint that logs the transaction table is
 *   serialized with the following:
 *   - operations that can run only with one thread attached to
 *     a transaction (including the code that enforces this)
 *   - transaction begin, end
 *   - determining the number of active transactions
 *   - constructing a virtual table from the transaction table
 *
 * \section BF_M Buffer Manager
 * The buffer manager is the means by which all other modules (except
 * the log manager) read and write pages.
 * A page is read by calling bf_m::fix.
 * If the requested page cannot be found in the buffer pool,
 * the requesting thread reads the page and blocks waiting for the
 * read to complete.
 *
 * \subsection BFHASHTAB Hash Table
 * The buffer manager maintains a hash table mapping page IDs to
 * buffer control blocks. A control block points to its frame, and
 * from a frame one can arithmetically locate its control block (in
 * bf_m::get_cb(const page_s *)).
 * The hash table for the buffer pool uses cuckoo hashing
 * (see \ref P1) with multiple hash functions and multiple slots per bucket.
 * These are compile-time constants and can be modified (bf_htab.h).
 *
 * Cuckoo hashing is subject to cycles, in which making room in one
 * hash-table bucket A would require moving something else into A.
 * Using at least two slots per bucket reduces the chance of a cycle.
 *
 * The implementation limits the number of times it looks for
 * an empty slot or moves entries to make room.
 * If cycles are present, the limit will be hit, but hitting the limit
 * does not necessarily indicate a cycle. If the limit is hit,
 * the insert fails.
 * The "normal" solution in this case is to rebuild the table with
 * different hash functions. The storage manager does not handle this case.
 * \bug bf_core_m::htab
 * GNATS 47 :
 * In the event of insertion failure, the hash table will have to be rebuilt
 * with different hash functions, or will have to be modified in some way.
 *
 * \bug GNATS 35 The buffer manager hash table implementation contains a race.
 * While a thread performs a hash-table
 * lookup, an item could move from one bucket to another (but not
 * from one slot to another within a bucket).
 * The implementation contains a temporary work-around for
 * this, until the problem is more gracefully fixed: if a lookup fails to
 * find its target, it performs an expensive exhaustive lookup; the
 * statistics record these as bf_harsh_lookups.
 *
 * \subsection REPLACEMENT Page Replacement
 * When a page is fixed, the buffer manager looks for a free buffer-pool frame,
 * and if one is not available, it has to choose a victim to replace.
 * It uses a clock-based algorithm to determine where in the buffer pool
 * to start looking for an unlatched frame:
 * on the first pass of the buffer pool it considers only clean frames;
 * on the second pass it will consider dirty pages;
 * and on the third or subsequent pass it will consider any frame.
 *
 * The buffer manager forks background threads to flush dirty pages.
00997 * The buffer manager makes an attempt to avoid hot pages and to minimize 00998 * the cost of I/O by sorting and coalescing requests for contiguous pages. 00999 * Statistics kept by the buffer manager tell the number of resulting write 01000 * requests of each size. 01001 * 01002 * There is one bf_cleaner_t thread for each volume, and it flushes pages for that 01003 * volume; this is done so that it can combine contiguous pages into 01004 * single write requests to minimize I/O. Each bf_cleaner_t is a master thread with 01005 * multiple page-writer slave threads. The number of slave threads per master 01006 * thread is controlled by a run-time option. 01007 * The master thread can be disabled (thereby disabling all background 01008 * flushing of dirty pages) with a run-time option. 01009 * 01010 * The buffer manager writes dirty pages even if the transaction 01011 * that dirtied the page is still active (steal policy). Pages 01012 * stay in the buffer pool as long as they are needed, except when 01013 * chosen as a victim for replacement (no force policy). 01014 * 01015 * The replacement algorithm is clock-based (it sweeps the buffer 01016 * pool, noting and clearing reference counts). This is a cheap 01017 * way to achieve something close to LRU; it avoids much of the 01018 * overhead and mutex bottlenecks associated with LRU. 01019 * 01020 * The buffer manager maintains a hash table that maps page IDs to buffer 01021 * frame control blocks (bfcb_t), which in turn point to frames 01022 * in the buffer pool. The bfcb_t keeps track of the page in the frame, 01023 * the page ID of the previously-held page, 01024 * and whether it is in transit, the dirty/clean state of the page, 01025 * the number of page fixes (pins) held on the page (i.e., reference counts), 01026 * the \ref RECLSN "recovery lsn" of the page, etc. 01027 * The control block also contains a latch. A page, when fixed, 01028 * is always fixed in a latch mode, either LATCH_SH or LATCH_EX. 
 * \bug GNATS 40 bf_m::upgrade_latch() drops the latch and re-acquires it in
 * the new mode if it cannot perform the upgrade without blocking.
 * This is an issue inherited from the original SHORE storage manager.
 * To block in this case
 * would enable a deadlock in which two threads hold the latch in SH mode
 * and both want to upgrade to EX mode. When this happens, the statistics
 * counter \c bf_upgrade_latch_race is incremented.
 *
 * Page fixes are expensive (in CPU time, even if the page is resident).
 *
 * Each page type defines a set of fix methods that are virtual in
 * the base class for all pages. The rest of the storage manager
 * interacts with the buffer manager primarily through these methods
 * of the page classes.
 * The macros MAKEPAGECODE are used for each page subtype; they
 * define all the fix methods on the page in such a way that bf_m::fix()
 * is properly called in each case.
 *
 * A page frame may be latched for a page without the page being
 * read from disk; this
 * is done when a page is about to be formatted.
 *
 * The buffer manager is responsible for maintaining WAL; this means it may
 * not flush to disk dirty pages whose log records have not yet reached
 * stable storage.
 * Temporary pages (see sm_store_property_t) are not logged, so they do not
 * have page lsns to assist in determining their clean/dirty status; since
 * pages may change from temporary (unlogged) to logged, they require special
 * handling, described below.
 *
 * When a page is unfixed, sometimes it has been updated and must be marked
 * dirty. The protocol used in the storage manager is as follows:
 *
 * - Fixing with latch mode EX signals intent to dirty the page. If the page
 *   is not already dirty, the buffer control block for the page is given a
 *   recovery lsn of the page's lsn.
 This means that any dirtying of the page
 *   will be done with a log record whose lsn is larger than this recovery
 *   lsn. Fixing an already-dirty page in EX mode does not change
 *   the recovery lsn for the page.
 *
 * - Clean pages have a recovery lsn of lsn_t::null.
 *
 * - A thread updates a page in the buffer pool only when it has the
 *   page EX-fixed (latched).
 *
 * - After the update to the page, the thread writes a log record to
 *   record the update. The log functions (generated by Perl)
 *   determine whether a log record should be written (not if it is a tmp
 *   page, or if logging is turned off, for example);
 *   if not, they call page.set_dirty() so that any subsequent
 *   unfix notices that the page is dirty.
 *   If the log record is written, the modified page is unfixed with
 *   unfix_dirty() (in xct_impl::give_logbuf).
 *
 * - Before unfixing a page, if it was written, it must be marked dirty first
 *   with
 *   - set_dirty followed by unfix, or
 *   - unfix_dirty (which is set_dirty + unfix).
 *
 * - Before unfixing a page, if it was NOT written, unfix it with bf_m::unfix
 *   so its recovery lsn gets cleared. This happens only if this is the
 *   last thread to unfix the page. The page could have multiple fixers
 *   (latch holders) only if it were fixed in SH mode. If fixed (latched)
 *   in EX mode, this will be the only thread to hold the latch and the
 *   unfix will clear the recovery lsn.
 *
 * It is possible that a page is fixed in EX mode, marked dirty, but never
 * updated after all, then unfixed. The buffer manager attempts to recognize
 * this situation and clear the control block's dirty bit and recovery lsn.
 *
 * Things get a little complicated where the buffer-manager's
 * page-writer threads are
 * concerned.
 The page-writer threads acquire share latches and copy
 * dirty pages; this is faster than holding the latch for the duration of the
 * write to disk.
 * When the write is finished, the page-writer re-latches the page with the
 * intention of marking it clean if no intervening updates have occurred. This
 * means changing the \e dirty bit and updating the recovery lsn in the buffer
 * control block. The difficulty lies in determining if the page is indeed
 * clean, that is, matches the latest durable copy.
 * In the absence of unlogged (t_temporary) pages, this would not be terribly
 * difficult, but it would still have to cope with the case that the page was
 * (updated and) written by another thread between the copy and the re-fix.
 * It might have been cleaned, or that other thread might be operating in
 * lock-step with this thread.
 * The conservative handling would be not to change the recovery lsn in the
 * control block if the page's lsn has changed; however, this has
 * serious consequences
 * for hot pages: their recovery lsns might never be moved toward the tail of
 * the log (the recovery lsns remain artificially low) and
 * thus the hot pages can prevent scavenging of log partitions. If log
 * partitions cannot be scavenged, the server runs out of log space.
 * For this reason, the buffer manager goes to some lengths to update the
 * recovery lsn if at all possible.
 * To further complicate matters, the page could have changed stores,
 * and thus its page type or store (logging) property could differ.
 * The details of this problem are handled in a function called
 * determine_rec_lsn().
 *
 * \subsection PAGEWRITERMUTEX Page Writer Mutexes
 *
 * The buffer manager keeps a set of \e N mutexes to synchronize the various
 * threads that can write pages to disk.
 Each of these mutexes covers a
 * run of pages of size smlevel_0::max_many_pages. N is substantially smaller
 * than the number of "runs" in the buffer pool (size of
 * the buffer pool/max_many_pages), so each of the N mutexes actually covers
 * several runs:
 * \code
 * page-writer-mutex = page / max_many_pages % N
 * \endcode
 *
 * \subsection BFSCAN Foreground Page Writes and Discarding Pages
 * Pages can be written to disk by "foreground" threads under several
 * circumstances.
 * All foreground page-writing goes through the method bf_m::_scan.
 * This is called for:
 * - discarding all pages from the buffer pool (bf_m::_discard_all)
 * - discarding all pages belonging to a given store from the buffer pool
 *   (bf_m::_discard_store), e.g., when a store is destroyed.
 * - discarding all pages belonging to a given volume from the buffer pool
 *   (bf_m::_discard_volume), e.g., when a volume is destroyed.
 * - forcing all pages to disk (bf_m::_force_all) with or without invalidating
 *   their frames, e.g., during clean shutdown.
 * - forcing all pages of a store to disk (bf_m::_force_store) with or
 *   without invalidating
 *   their frames, e.g., when changing a store's property from unlogged to
 *   logged.
 * - forcing all pages of a volume to disk (bf_m::_force_store) with or
 *   without invalidating the frames, e.g., when dismounting a volume.
 * - forcing all pages whose recovery lsn is less than or equal to a given
 *   lsn_t, e.g., for a clean shutdown, after restart.
 */