Notes on Shore

These are my notes about the Shore Storage Manager. Some are bugs, some are oddities, some are things to wish for, and some are annoying behaviors. Others are things that I have thought about and need to investigate to find out what the actual case is. Yet others are refinements and ideas that I want to keep track of.

Page Size
The volumes and logs are not marked with any page size attributes. So, when you change the page size and restart the system, it gets really confused. It may even irreversibly damage the existing log and pages!
Sorting
The new sort code works ok for the most part. When it was run with Paradise it performed just ok. However, logical ID support is not very good with the new sort code, and some of it may be missing entirely. In addition, the ssh test suite has problems if you try to use the old sort code in compatibility mode, and many tests will not run due to these problems. btree.9 tries to run but fails with a NOTIMPLEMENTED error.
ssh testing
rtree.4 fails with eBADVOL when logical IDs are turned on.
SSH error reporting
I have a hack (SSH_DUMPRC) to print out full RC codes for ssh so the *location* of the RC can be found and the particular error can be tracked down. However, it would be better if that happened only for unexpected errors, rather than for all errors; as it stands, the facility is less than generally usable.
Possible disk-space problems
When my home NT box was running short on disk space, the SM went really wacky. It kept on running, but may have been having problems. However, one restart after a ^C caused a core dump with what looked like invalid data coming from the volume or the log. Some testing should probably be done when there isn't enough space for the log, to be certain that it detects errors. Unix log only, of course; the raw log has all the space it needs.
Record re-sizing is Odd.
There exists a grey area with odd behavior anomalies in the operation of the SM. If you take an existing record and expand it in one shot so that it is larger than can fit on an SM page, then the SM just makes it into a large record and it works. But if you expand the same record to a size that is less than the large record size, but it won't fit on the page, then the SM errors out on you. In that case you need to delete the existing record and create_rec it somewhere else in the file where it will fit. If you have a bunch of records that are growing, this can produce horrendously bad behavior, as random record growth causes these record forwards (and corresponding index updates) to just drive records all over the place until they are all large. At which point we have way too many underutilized pages, unless we also have a bunch of smaller records.

One solution to this would be to add an "enlarge in place" facility to record handling, so I can just say I want to keep the rid and enlarge the record where it is.
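A rough sketch of what that might look like; this is a hypothetical interface, not an existing SM call, and the name and signature are made up:

    // Hypothetical interface sketch -- not an existing SM call.
    // Enlarge a record in place, keeping its rid.  If the page cannot hold
    // the larger body, convert it to a large record rather than returning
    // an error and forcing the caller to delete/create_rec elsewhere.
    static rc_t enlarge_rec_in_place(const rid_t& rid, smsize_t new_size);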

There is also an inconsistency here, in that truncate_rec, when a record becomes small enough to be an on-page object but the page won't hold the record because it is a bit full, will just leave the record as a large record! It seems that it should behave orthogonally with regard to append_rec, to minimize the number of small large records. Perhaps, again, a flag should be provided so the user knows that the record is small and should be moved.

I don't think this is kept track of at the page level, and the append_rec() scenario can NOT happen in reverse, which screws things up another way ... as you have a bunch of records shrinking, a lot of them will be left as large records until the next truncate_rec on the same record. Also, once space becomes available on the page for a "small" large record, there is nothing to convert that record back to fit on the page! It seems that to do this we need a logical-level logging operation that says "record x converted from large rep to small" ... while remaining the same size. The locking has to be right (a temporary hold-all-comers lock) while the operation is in progress, so that it proceeds invisibly to all users. This isn't a problem for append_rec(), since the record being enlarged past the amount the page will hold is also the record being modified, instead of a delete/truncate of another record in a page allowing "small large records" to be coalesced. If locks are allowed on *pieces* of a record (large object) this could be a problem too, since those locks would need to be converted to the lock on the now smaller rid. If only records are locked, though, there is no problem there.

truncate_rec
It isn't immediately obvious that truncate_rec truncates the record BY that many bytes, not that it truncates it TO that many bytes. The ftruncate() Unix system call is a popular example of the truncate-TO methodology.
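A tiny illustration of the difference; the signatures here are from memory, so treat them as approximate:

    // Suppose the record (and the file) currently holds 100 bytes.
    W_DO(ss_m::truncate_rec(rid, 40));  // Shore: removes 40 bytes, record is now 60 bytes
    ftruncate(fd, 40);                  // POSIX: file is now exactly 40 bytes long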
update_rec() Record update with different size
There isn't an easy way to replace the contents of a record with contents of a different size. To do it you need to either grow the record with append_rec() using zeros, or shrink it with truncate_rec() by the appropriate number of bytes, and then update_rec() with the new content.

An optimization is available for record expansion: you can append_rec() the new bytes, and then update_rec() the original bytes. It still requires multiple SM calls and the associated overhead, though.
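A rough sketch of the current dance; the ss_m call signatures are from memory and the vec_t details are approximate:

    // Replace an old_len-byte record body with new_len bytes at new_data (a const char*).
    if (new_len <= old_len) {
        // shrink first, then overwrite what remains
        W_DO(ss_m::truncate_rec(rid, old_len - new_len));
        W_DO(ss_m::update_rec(rid, 0, vec_t(new_data, new_len)));
    } else {
        // the optimization above: append the tail bytes, then overwrite the original bytes
        W_DO(ss_m::append_rec(rid, vec_t(new_data + old_len, new_len - old_len)));
        W_DO(ss_m::update_rec(rid, 0, vec_t(new_data, old_len)));
    }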

build system versus languages
gcc -x c++ and gcc are used for linking c++ programs. This is sorta a problem because then the build system has to know all the language-specific libraries that c++ needs to link against, instead of g++ just adding them in. It also falls down if you want to compile with multiple languages, since there isn't a way, for example, to say that a c++ program needs a c++ linker, a pascal program needs a pascal linker, etc.
output streams
For cerr/cout streams to work really well, each thread really needs to have its own "context". Otherwise, changes that other threads make, such as numeric precision or output justification, will magically appear in the output of another thread trying to do the same thing. Locking really won't even help this, as the format changes would have to be "transaction based" to have any meaning, and that would lock the stream up completely. This is a big issue when going multi-cpu capable, but could also be an issue if sthread non-blocking streams are used to improve I/O capability.
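A minimal standalone illustration of the shared-state hazard, using plain C++ iostreams (nothing Shore-specific is assumed):

    #include <iostream>
    #include <iomanip>

    void thread_a() {
        // setprecision() sets a sticky flag on the shared cout stream
        std::cout << std::setprecision(2) << 3.14159 << '\n';   // prints 3.1
    }

    void thread_b() {
        // expects the default precision (6), but inherits whatever A left behind
        std::cout << 3.14159 << '\n';   // prints 3.1 if thread_a ran first
    }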
out-of-memory handler
Install an out-of-memory new handler that will W_FATAL the system when an out-of-memory error occurs. Have to be careful not to try to allocate additional resources, so that error messages will still print OK and stuff like that.
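A minimal sketch of such a handler; W_FATAL and an error code along the lines of fcOUTOFMEMORY are assumed to be available, and the names are from memory:

    #include <new>

    // Called by operator new when an allocation fails.  It must not try to
    // allocate anything itself, or the error path can fail right back.
    static void out_of_memory_handler()
    {
        W_FATAL(fcOUTOFMEMORY);   // assumed Shore fatal-error macro and code
    }

    // ... somewhere during startup:
    std::set_new_handler(out_of_memory_handler);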
SSH doesn't give error information for TCL scripts
If something goes wrong in the execution of a TCL script (runscripts random.all and see what happens), there is no indication of what file and line number the error happened at!
ssh script lgrec.2 uses lots of memory
The lgrec.2 script will often trigger segmentation violations on NetBSD. This happens because it allocates a lot of memory (for reading large records into the tcl result buffer). If the memory limit is too low, something happens and a SEGV results. This is really bad behavior; I would vastly prefer the process to say it was out of memory, rather than making it look like something has gone wrong. This may be due to a bad allocator on NetBSD 1.2, but the same thing happens on Solaris too, just with lower memory limits. So, this isn't a Shore problem, but it is annoying that the system doesn't crash in a better, more informative, fashion.
hash values are fixed at uint4_t
Currently the hash values for the SM are "stuck" at 4 byte unsigned integers. This is a problem because as machine architectures grow wider, the size of hash values does not. A w_hashval_t should be created and used for all hash values, so it can be changed as needed. Note that there is no reason that hash values should be a fixed size type; e.g. they could be a 'long' to follow the largest native machine type. Also note that lots of places assume that hash values are 4 byte integers and do arithmetic on them as such.
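Something as simple as the following would do it, as long as the arithmetic call sites go through the typedef; the hash function name below is made up for illustration:

    // w_hashval_t: one place to change if hash values ever need to grow.
    // 'unsigned long' follows the largest native machine word on most ports.
    typedef unsigned long w_hashval_t;

    // call sites then look like:
    w_hashval_t h = hash_of(key);        // hash_of() is a placeholder name
    h = (h * 31) ^ extra_bits;           // arithmetic stays valid at any width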
smthread_t constructors
The smthread_t constructors should be a super-set of sthread_t constructors, so that things that are sthreads can just become smthreads without having to know anything SM-specific. The flip-side is that the default sthread arguments have to be specified to get non-default SM arguments.
xct_t printing
The xct_t operator<<() should be usable by ordinary users wanting to print transaction info. Currently, I believe you need to include xct.h and the xct guts to just print a transaction. Perhaps the xct_t printer should be declared in sm.h so that random SM users can print transactions.
sm core dumps when dismount_all
If you have a shore configured and you try a dismount_all when no volumes are mounted, the SM core dumps, perhaps on a mutex-mutex deadlock. Look at running 'Index' when the volume hasn't been formatted for an example of this behavior.
keep track of max stack depth
[I keep forgetting to do this, and I've been meaning to add it to NewThreads (and sthreads) for a couple of years. I had a prototype at one time, but I wanted to fix some problems with it and I never completed it.] Keep track of the maximum amount of stack used by a thread. Currently only the current stack depth is tracked, not the historical maximum. In addition, an interrupt could be used to preemptively sample stack usage, finding deeper stack use than is visible just at the cooperative context switches that happen courtesy of synchronization.
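One cheap way to capture the historical part at context-switch time; this is a sketch only, the stack_base()/max_stack_depth() accessors are made up, and it assumes downward-growing stacks:

    // Called at every cooperative context switch for the outgoing thread.
    void note_stack_depth(sthread_t* t)
    {
        char marker;                                        // address of a local == current stack position
        size_t depth = size_t(t->stack_base() - &marker);   // assumes stacks grow down
        if (depth > t->max_stack_depth())
            t->set_max_stack_depth(depth);
    }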
sthread sleep(0) causes an assertion failure
Sleep(0) should not cause a panic; it should either do nothing or cause a context switch (aka yield). What happens now is that it trips the w_assert1(timeout != TIMEOUT_IMMEDIATE) in block().
sthread sleep() and yield() are non-orthogonal
One requires a thread object and the other is static. This is somewhat unexpected and a bit of a hassle to remember which does which. It could be better.
sinfo_s
The sinfo_s and sdesc_t structures and classes used for representing meta information about stores are bogus. There is one sinfo_t which is supposed to be used for all files, indexes, etc. This means that the content of this needs to be a non-shared union, and all types pay for each other. It also means that adding per-store data breaks everything in the world. And, the sinfo_s constructor for setting these things up is a big mess; everyone has to know *exactly* what all the arguments do, even for classes they don't care about. This should really be changed to have an sinfo_t per store type, so each type of store can have its own info. Since the sinfo_t is stored in an index or a file, the differing sizes won't (or shouldn't) matter -- except for index retrieval problems, and the page size could be used as the retrieval buffer size until that problem is fixed (or use the cursor interface, which avoids it).
bulkload sort_duplicates
It appears that the bulk load code for btrees (the problem may be present elsewhere, I haven't looked) will only sort duplicates that are on the same page in the bulk load file. So, if you have duplicate values that spread across multiple pages of the bulkload file, they won't be sorted properly, only on a page-by-page basis. To fix this the bulk load code needs to either create runs of duplicate-value items and sort the runs, or use a multi-value sort and sort the entire input (which sort of makes sort_duplicates do 2x the work, as the user should have done that in the first place), or do something else to deal with this case correctly.

I may have stated the above wrong, for the code actually will sort things across two adjacent pages. But if the 3rd subsequent page contains something which should be sorted onto the 1st (or a prior) page, then it is problem-land again.

bulkload fill factor
It is currently not possible to specify a fill factor for bulk loads of btrees. You may want to do this to load the tree quickly, yet still have inserts succeed without causing immediate page splits. Note that the index split factor acts as an index fill factor while bulk loading is running.
bulkload btree performance
Performance of bulk-loaded btrees is often worse than that of randomly loaded btrees. This is just not expected, as there should be fewer pages with higher fill factors in the bulk-loaded btree. The problem apparently is that the bulkloaded btree is deeper than the randomly loaded btree. This extra tree depth hurts performance considerably. Yet another thing to track down.

This occurs because there is a greater fill factor (70% according to the papers) in the interior pages of a randomly built btree than there is in the 50% split factor (aka 50% fill factor) pages that are created by the bulkload.

Once I finish the btree split factor addition to Shore, it can be used (or a load-time parameter can be used) to control the split/fill factor during bulkload. That will allow a higher utilization in the interior nodes, and the tree depth should go down. Why not 100% fill? Well, what if you want to bulkload a btree that you expect random inserts to happen to in short order!

Performance deteriorates with larger volumes
As you increase the Shore volume size, runtime with a fixed data size increases. For example, with a 10GB volume, runtime for ssh's all script is around 8 minutes. With a 30GB volume, the same script takes about 17 minutes. With a 100GB volume, the script takes about 48 minutes. I need to put a profiler on it and see where the time is going.

This may only be a problem on Linux; need to check other platforms first.

Pin::next_bytes() not intuitive
next_bytes in a pin goes to infinity in some cases, such as when you are at the end of a pin. I think a more reasonable result would be for it to go to 0, indicating that no bytes remain.
no replace_assoc()
There is no replace_assoc() method to update the value of an association. The only way to do this is to delete_assoc(); create_assoc();. This sorta sucks because it means that updates have to walk the btree 2x as much as needed, once for the delete and then another time for the create. Doing this could improve update rates quite a bit.
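Today the update looks roughly like the following; a real replace_assoc() could do the same work with a single btree traversal. The call names are the ones used in this note, and the signatures are sketched from memory:

    // Poor man's replace_assoc(): two full btree traversals.
    W_DO(ss_m::delete_assoc(stid, key, old_el));
    W_DO(ss_m::create_assoc(stid, key, new_el));

    // Desired (hypothetical) single-traversal version:
    // rc_t replace_assoc(const stid_t& stid, const vec_t& key,
    //                    const vec_t& old_el, const vec_t& new_el);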
diskrw missing
This may be bogus due to other problems, but I moved it here as a reminder. In some cases when diskrw is missing there are no errors and no crash; things just act weird and nothing happens.
compiler link options
When DEBUGGERSYMBOLS are used, the link command is not passed any information about that state. This is a problem since the linker may need to do things differently and/or add on extra files to the link (such as end.o on HPUX).
Alignment issues painful
Alignment issues are becoming quite painful. For example, I added a 'type *_exception' member to sthread_t to point to a per-thread exception context. This just happened to break btrees (presumably because of the kc_buf in smthread.h, or some other derived alignment problem). Really need to have real fixes for this; the mess is just like a house of cards.
Short Keys
The shore key specification system has no way to specify typed variable-length keys. For example, if I have a key that could be "u4u4" or "u4u4u4", you can't do it. You need to specify a "u4u4u4" key and zero-pad the unused last "u4" element. This also causes a problem if you want to scan with a partial key, for example a scan of "u4u4" while specifying just "u4". Gotta look at this; it is more painful than it should be.
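The current workaround looks roughly like this; the key-description strings are the real syntax, but the calls and type names are sketched from memory and approximate:

    // The index is declared with the widest key it will ever hold: "u4u4u4".
    // a and b are the two key components actually in use.
    w_base_t::uint4_t key[3] = { a, b, 0 };     // zero-pad the unused third element
    vec_t key_vec(key, sizeof(key));
    W_DO(ss_m::create_assoc(stid, key_vec, el));

    // A "u4u4" prefix scan then has to supply (and range over) the padded
    // third element as well, which is what makes partial-key scans painful.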
Pin w/ offset
pin_i::pin(..., start) doesn't work correctly with small records. Instead of returning a body() that is start bytes into the short tuple, it returns the start of the tuple. This is just plain wrong, as it means that a user needs to determine what size a tuple is before accessing a portion of it! This needs to be fixed soon. The pin_i code is a real mess with regard to start numbers, offsets, and page handling, and some time should be spent to see if it could be improved.
Duplicate Btree Entries
Shore lets you write duplicate keys into a btree once. However, if you try to write duplicate keys a 2nd time, it generates errors in btree_impl.cpp:922 that there are duplicate keys in the tree. Either the first duplicate insert should NOT be allowed to happen, or something sane should happen. A scan reveals the duplicate entries; probes just fetch one, presumably the first of the duplicates with the same body value. You can insert two identical key-value pairs; errors start occurring afterwards. However, even once that happens you can still insert one more key/value pair with the same key and a different value. Then, you can no longer insert values. The order of the above max 3 inserts doesn't matter.
TCL assert failures when running rtree.all on NT multiple times
Caused by the NT random number generator producing a duplicate key in the rtree. This occurs somewhat randomly, but repeatably. The assert goes off when the retrieved key doesn't contain the expected value ... because the key was generated more than once. I'm not 100% certain about this, but I saw the same kinds of problems in the MacOSX version when I had a poor random generator hacked together for testing.
Rollback bugs with full pages
You will see these either on an abort_xct() or when the log is cleaned up on system start. An assert goes off in page.cpp complaining about _nfree >= _nrsvd or something close. This started with the 64 bit port, which aligns data on a page to an x8 (previously x4) alignment ... thereby reducing the space on the page by 4 bytes. The difference between nfree and nrsvd in the assert failure is always an off-by-4 error. It only happens when rolling back a page that is almost full. I believe that somewhere there is a chunk of code with a magic number in it ... instead of using the appropriate constant. It "knows" about the old amount of free space, and when the page gets full, this assertion goes off.
Btree Index Key length sometimes ignored???
Create a btree with a key such as 'u4'. Store a value with a key longer than 'u4', such as a string, into the btree. Ooops, the key value is silently truncated and shore doesn't generate an error ... which I've seen it do in other situations. Retrieve a value with the same key ... again no complaint. Of course, if you store several keys with the same 4 byte prefix, the mapping is many-to-one and you lose stored info. Need to investigate this and find why the key length is ignored.
Buffer Manager Dirty Threshold
The Buffer Manager's dirty page threshold is currently fixed at arbitrary values. It is initially calculated as one-eighth (1/8) of the number of pages in the bufferpool. However, a maximum dirty threshold of 100 pages is enforced on top of that. Or is it a minimum of 100 pages? It depends on the interpretation -- I need to examine this more closely.
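The two readings give very different numbers; here is the arithmetic for a 32,000-page buffer pool, just to make the ambiguity concrete:

    int npages = 32000;                // buffer pool size in pages
    int eighth = npages / 8;           // 4000

    // "at most 100 dirty pages" reading:
    int as_maximum = (eighth < 100) ? eighth : 100;   // -> 100

    // "at least 100 dirty pages" reading:
    int as_minimum = (eighth > 100) ? eighth : 100;   // -> 4000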

It might be good if this was adjustable, since it only allows a very small portion of a large buffer pool to be dirty. If you are really doing that much writing, a larger number could be better to try and get more sequential I/O going on writes to extents.

Log Master Records
What happens here is that the write of a new master log record file and the unlink of the old one are both done synchronously. That is, direct local system calls in the shore process. This of course brings everything to a screaming HALT while it happens.

I think the thing to do here is to upgrade the diskrw processes to also do random file system operations on behalf of shore. And, we can always leave _1_ sitting around to do these operations in an asynchronous manner. For example, once the new log master is created, the unlink could be done asynchronously, and the process would not have to wait around for it unless it wanted an error return. If the logging system had more parallelism, other people could be writing to the log tail while the checkpoint happens, so switching that part of the log to use the async i/o system might be good too.

Log LSN Checkpoints
Is it possible to optimize the buffer-pool LSN lists of log checkpoints by only recording dirty pages in the bufferpool, rather than all pages in the bufferpool? If so, it would also take a lot less time by keeping a list of dirty pages and just traversing it, rather than groping through the entire BP for the few pages which might be dirty.

The answer to that is YES, and it already does it.

dir_vol_m::_destroy_temps()
aka ... Temporary Extent/Store Removals
These happen at volume unmount, and can take longer than one would want -- especially if there is nothing to do! What happens is that a get_volume_quota() is run to find out how many extents are in use. That takes a while, BTW, since it has to grope every extent on the disk (see another entry for problems with this). And, a memory allocation comparable to the number of extents is done! Then each store is examined to see if it is a temporary store. Since the minimum store allocation is an extent (or two, if a large object store exists for the store), and it may be possible that all stores are temporary ... that is why it allocates all that space ahead of time. If the store is temporary, it is put on that in-memory list of temporary extents, which are later iterated through to delete them.

Note, that on a big store, this can be a several MB allocation, all for naught :(

No Read-Ahead on shore-internal things
Note that there is a great way to speed up many shore-internal operations. They are all iterating through stuff on disk, and know exactly where the next page is. For example the get_volume_quota() mess. If read-ahead was enabled, or even if extents were I/O'd as a whole, this process would go way faster, since it is mostly latency bound. Once the I/O pipeline of read-ahead starts, performance would be much better.
is_int_aligned() is broken
is_int_aligned() and other functions of its ilk take int as pointer arguments, not the proper (void *) which is needed to hold a pointer value. This works OK on 32 bit machines, but breaks on 64 bit machines. Worst of all, most users of this class use a subtraction instead of some cast to turn the pointer into an integer, so warnings don't appear and you are left holding the bag.
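A version that survives 64 bit pointers is straightforward; this is just a sketch of how the functions could be declared instead:

    #include <stdint.h>
    #include <stddef.h>

    // Take a real pointer type and convert through uintptr_t, so the check is
    // valid at any pointer width and misuse gets a compiler diagnostic.
    inline bool is_aligned(const void* p, size_t alignment)
    {
        return (reinterpret_cast<uintptr_t>(p) % alignment) == 0;
    }

    inline bool is_int_aligned(const void* p)
    {
        return is_aligned(p, sizeof(int));
    }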
Buffer Page Dirty List
Whenever a log checkpoint is created, it needs to record the LSNs of all dirty pages in the buffer pool. Currently it does this by iterating through all pages in the buffer pool. That is needlessly slow if you have a large buffer pool with few dirty pages. A better solution is to maintain a linked list of dirty pages, or a bitmap of dirty pages, so that the dirty pages can be found much faster. The choice of implementation depends upon the number of dirty pages, since the linked list scales with the number of dirty pages, while the bitmap is always the size of the bufferpool.

DeWitt was suggesting a bitmap would be a better solution than a list. I was resisting, since the bitmap is static and is another chunk of memory. However, I realized that if you want to retrieve modified buffer pool pages in sorted order, the bitmap gives you the advantage of providing everything sorted. The downside, again, is that you need to examine the entire bitmap, not just the active entries.
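A rough sketch of the bitmap flavor, just to make the tradeoff concrete; all of the names here are made up for illustration:

    #include <vector>
    #include <cstddef>

    // One bit per buffer-pool frame.  Memory cost is fixed at npages bits,
    // no matter how many pages are actually dirty.
    class dirty_map {
    public:
        explicit dirty_map(std::size_t npages) : bits_(npages, false) {}

        void mark_dirty(std::size_t frame) { bits_[frame] = true; }
        void mark_clean(std::size_t frame) { bits_[frame] = false; }

        // Checkpoint scan: still visits every frame, but hands back the dirty
        // frames in frame-table (i.e. sorted) order for free.
        template <class Fn>
        void for_each_dirty(Fn fn) const {
            for (std::size_t i = 0; i < bits_.size(); ++i)
                if (bits_[i]) fn(i);
        }

    private:
        std::vector<bool> bits_;
    };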

Suspend Signal hoses shore command lines
This has been a problem for years and years; I'm just writing it down in my notes to self. If you try suspending a shore process from an interactive command line, it will never resume. This occurs in the debugger with Predator. If you suspend a non-debugged process, something goes wrong with it and it never starts again. This may have something to do with the diskrw procs. My random guess is that perhaps the job control signals need to be handled to make it work.
Imake subtly broken
This is a good one ... if a token in the path is also a #define somewhere in imake ... then when Imake.tmpl tries to include the Imakefile, the filename is cpp-expanded ... and the pathname is broken.

For example shore-linux as a directory name is mapped to shore-1 which of course doesn't work :(

Notes on Debugging Shore

Arcane reminders to myself about what works and what doesn't.

To run quantify against shore, this is what you need to do:

  1. Configure shore with STHREAD_CORE_PTHREAD. I haven't been able to get quantify to work with the "everything done right" pure threads interface that fixes all other problems.
  2. Configure shore with LINK_PTHREAD. Just a dependency for the above, but there aren't any dependencies in the config system.

Last Modified: Thu Apr 6 22:42:43 CDT 2006
Bolo (Josef Burger) <bolo@cs.wisc.edu>