Sorting
[Storage Structures]

Collaboration diagram for Sorting:


Detailed Description

The storage manager provides two tools for sorting, the class sort_stream_i and the method ss_m::sort_file.

The class sort_stream_i implements sorting of <key,elem> pairs with a simple interface. It is an adjunct of bulk-loading indexes. One inserts <key,elem> pairs into the stream, and then iterates over the stream with its sort_stream_i::get_next method to retrieve the pairs in order. The sort_stream_i uses temporary files as necessary to store the data stream.

The storage manager function ss_m::sort_file implements a polyphase merge-sort.

Both methods offer a variety of run-time options. The options are specified with different APIs:

Note:
Historical note: The original implementation of the sort_file method used the same APIs (for specifying sort behavior and describing keys) as sort_stream_i, but that implementation for sort_file was inadequate for the needs of the projects using the storage manager at the time (viz user-defined key types, marshalling and unmarshalling, etc.). The new implementation is more general, but the sort_stream_i has not been rewritten to use the new options- and key- descriptor classes.
See sort_stream_i for details and sort_stream::cpp for an example of its use. The balance of this section describes the use of ss_m::sort_file and its application to bulk-loading.

Sort_file supports:

For all this generality, the sort uses callbacks to the server for such things as

Runs

The sort takes as many runs as required to read and sort the entire input file, while limiting the amount of memory used. The run size is given as an argument to sort_file.

The sort uses temporary files when the input file contains more records than can fit in one run (determined by the run size). These temporary files may be spread across multiple volumes, which is useful if the volumes reside on different spindles.

If the entire input file fits into the buffer pool pages of one run, much of the complexity of the sort is eliminated, because the copying of objects and metadata to scratch files is unnecessary, but for the general (polyphase) case, behavioral options describe how the writing to scratch files is handled, and these options depend on the kind of output desired (see Behavior and Results of Sort).

Callbacks

The sort has to call back to the server for

Behavior and Results of Sort

Behaviors that can be controlled are the following; these behaviors are determined by the contents of the sort_keys_t argument to sort_file:

Many behavioral options depend on other options, as discussed below.

Result is Sorted Copy of Input File

Use ssm_sort::sort_keys_t::set_for_file to create a sorted copy of the input file.

Applicable key description data:

Applicable sort behavior options:

Result is Input to Bulk-Load

Applicable key description data:

Applicable sort behavior options:

Keys

Sort keys may be located in the input records or derived from the input records. Index keys that are different from the sort key are derived.

When the output is to be a sorted copy of the input file, the keys do not appear in the output file except inasmuch as they are embedded in the original input records. The sort keys in this case may be derived from the record contents, in which case they truly do not appear in the output. Multiple keys may be used for the sort, and they may be a combination of fixed-location keys and derived keys. See Result is Sorted Copy of Input File.

When the output is to be a bulk-loadable file, its records takes the form header = key, body = element, and sort-file gives the caller no control over the elements' contents: they are the record-IDs of the original records. If something other than the record-ID is desired for bulk-loading an index, the output can be made to be a sorted copy of the original file, with a suitable unmarshal (UMOF) applied.

Only one sort key can be used in this case, but the index key can differ differ from the sort key. See Result is Input to Bulk-Load.


Classes

class  ssm_sort::key_cookie_t
 Input, output argument to CSKF, MOF, UMOF callbacks. More...
class  ssm_sort::factory_t
 A memory allocator (abstract base class) used by ss_m::sort_file and its adjuncts. More...
class  ssm_sort::key_location_t
 Descriptor for fixed-location keys. More...
class  ssm_sort::object_t
 Handle on a record in the buffer pool or in scratch memory. More...
class  ssm_sort::skey_t
 The result of a CSKF function. More...
struct  ssm_sort::generic_CSKF_cookie
 A cookie passed to generic_CSKF callback must point to one of these. More...
class  ssm_sort::sort_keys_t
 Parameter to control behavior of sort_file. More...
class  sort_stream_i
 Sorting tool. More...

Typedefs

typedef w_rc_t(*) ssm_sort::CSKF (const rid_t &rid, const object_t &in_obj, key_cookie_t cookie, factory_t &f, skey_t *out)
 Create-Sort-Key Function.
typedef w_rc_t(*) ssm_sort::MOF (const rid_t &rid, const object_t &obj_in, key_cookie_t cookie, object_t *obj_out)
 Marshal Object Function.
typedef w_rc_t(*) ssm_sort::UMOF (const rid_t &rid, const object_t &obj_in, key_cookie_t cookie, object_t *obj_out)
 Un-marshal Object Function.
typedef int(*) ssm_sort::CF (w_base_t::uint4_t length1, const void *key1, w_base_t::uint4_t length2, const void *key2)
 key Comparison Function
typedef w_rc_t(*) ssm_sort::LEXFUNC (const void *source, smsize_t len, void *sink)
 Lexify key function.

Functions

static rc_t ss_m::sort_file (const stid_t &fid, const stid_t &sorted_fid, int nvids, const vid_t *vid, sort_keys_t &kl, smsize_t min_rec_sz, int run_size, int temp_space)
 Sort a file.


Typedef Documentation

ssm_sort::CSKF

Create-Sort-Key Function.

Parameters:
[in] rid Record ID of the record containing the key.
[in] in_obj Handle (object_t) on the record whose key we need.
[in] cookie Describes the location and type of key in the source record.
[in] f Class for managing allocated space.
[out] out Result is written to this object_t.
This type of callback function fills in the skey_t out for a key in an object (record). It is called only when the object is first encountered in its run in a sort.

A set of predefined callbacks are provided, q.v.:

See generic_CSKF_cookie for an example of a simple user-defined CSKF.

The skey_t out is pre-allocated and must be populated by this callback function. The factory_t f may be used for this allocation, or a user-defined factory_t may be used instead. Whatever factory is used to allocate the buffer to hold a key, that same factory must be placed in the skey_t via the placement-new constructor call. Examples are given in the description for skey_t.

This callback must not free any space for in_obj.

Definition at line 663 of file sort_s.h.

ssm_sort::MOF

Marshal Object Function.

Parameters:
[in] rid Record id of the source record to marshal
[in] obj_in Handle on the source record (object_t)
[in] cookie Describes location and type of key in the source record
[out] obj_out Result is to be written to this object_t.
The obj_out is already allocated; this function must populate it. The obj_out has no buffers or factory associated with it. The MOF must populate the obj_out from its own factory (it may use factory_t::cpp_vector, which uses the global heap, or it may use the factory from the obj_in, but ONLY if that factory is not factory_t::none. The obj_out object carries its factory along with it throughout the sort process.

This callback must not free any space for obj_in.

This type of callback is used

A predefined callback is provided:

The sort code checks for

in which case, it does not call a marshal function at all. (This is less work than allocating the output object_t to call a vacuous function.)

Definition at line 707 of file sort_s.h.

ssm_sort::UMOF

Un-marshal Object Function.

Parameters:
[in] rid Record id of the source record to marshal
[in] obj_in Handle on the source record (object_t)
[in] cookie Describes location and type of key in the destination record
[out] obj_out Result is to be written to this object_t.
The obj_out is already allocated; this function must populate it. The obj_out has no buffers or factory associated with it. The MOF must populate the obj_out from its own factory (it may use factory_t::cpp_vector, which uses the global heap, or it may use the factory from the obj_in, but ONLY if that factory is not factory_t::none. The obj_out object carries its factory along with it throughout the sort process.

This callback must not free any space for obj_in.

This type of callback is used

A predefined callback is provided:

The sort code checks for

in which case, it does not call a marshal function at all. (This is less work than allocating the output object_t to call a vacuous function.)

Definition at line 755 of file sort_s.h.

ssm_sort::CF

key Comparison Function

Parameters:
[in] length1 Length of first key in bytes
[in] key1 Pointer to start of first key
[in] length2 Length of second key in bytes
[in] key2 Pointer to start of second key
Return values:
int Negative if key1 < key2, 0 if equal, Positive if key1 > key2
Used on every key comparison.
Note:
It is up to the server to determine if the keys are properly aligned for comparison as fundamental types. Alignment is determined by the combined behavior of the CSKF and MOF callbacks, as well as of the location of the keys in the original records.

Definition at line 778 of file sort_s.h.

ssm_sort::LEXFUNC

Lexify key function.

Parameters:
[in] source Pointer to start of key
[in] len Length of key
[out] sink Pointer to output buffer (preallocated, of length len)
This type of function is called by the generic_CSKF. It may be useful for user-defined CKSF callbacks. Its purpose is to reformat keys into lexicographic form so that simple string-type (byte-by-byte) comparisons yield the same results as typed comparisons would. An alternative to lexifying keys and using string comparisons is to use typed comparisons, however, that has its disadvantages, particularly for keys that might be spread across page boundaries or are not aligned.

Definition at line 801 of file sort_s.h.


Function Documentation

static rc_t ss_m::sort_file ( const stid_t fid,
const stid_t sorted_fid,
int  nvids,
const vid_t vid,
sort_keys_t kl,
smsize_t  min_rec_sz,
int  run_size,
int  temp_space 
) [static, inherited]

Sort a file.

Parameters:
[in] fid File to sort.
[in] sorted_fid File to which to write the results.
[in] nvids Size of array vid.
[in] vid Array of IDs of scratch files created by the caller.
[in] kl See sort_keys_t.
[in] min_rec_sz Hint of minimum record size in input file.
[in] run_size Number of pages in buffer pool to use for a run.
[in] temp_space Number of pages to use for scratch space. (This limits the amount of memory used by the sort).
Before you call sort_file, you must create an output file sorted_fid into which sort_file will write the results.

The sort uses temporary files when the input file contains more records than can fit in one run (determined by run_size). These temporary files may be spread across multiple volumes, which is useful if the volumes reside on different spindles. The arguments nvids and vid are for indicating the volumes to use for these scratch files.

The caller can provide a clue in min_rec_size about the minimum record size of the input file, which can help the sort's efficiency.

The run_size indicates how many buffer-pool pages to use for each run. Since at all times one page is fixed for output, while the rest are for reading the input in runs, the real run size is run_size-1.


Generated on Wed Jul 7 17:22:37 2010 for Shore Storage Manager by  doxygen 1.4.7