These are my notes about the Shore Storage Manager.
Some are bugs, some are oddities, some are things to wish for,
and some are annoying behaviors.
Others are things that I have thought about and need to
investigate to find out what the actual case is.
Yet others are refinements and ideas that I want to keep track of.
- Page Size
- The volumes and logs are not marked with any page size
attributes.
So, when you change the page size and restart the system,
it gets really confused.
It may even irreversibly damage the existing log and pages!
- Sorting
- The new sort code works OK for the most part. When run
with paradise it performed just OK. However, logical ID support
is not very good with the new sort code, and some of it may be
missing entirely. In addition, the ssh test suite has problems
if you try to use the old sort code in compatibility mode,
and many tests will not run because of these problems. btree.9
tries to run but fails with a NOTIMPLEMENTED error.
- ssh testing
- rtree.4 fails with eBADVOL when logical IDs are turned on.
- SSH error reporting
- I have a hack (SSH_DUMPRC) to print out full RC codes for
ssh so the *location* of an RC can be found and the particular
error tracked down.
However, it would be better if that happened only for unexpected
errors, rather than for all errors. As it stands, the facility is
less than generally usable.
- Possible disk-space problems
- When my home NT box was running short on disk space, the SM
was going really wacky.
It kept on running, but may have been having problems.
However, one restart after a ^C caused a core dump with what looked like
invalid data coming from the volume or the log.
Some testing should probably be done with too little space for
the log, to be certain that the SM detects the errors.
This applies only to the Unix log, of course; the raw log has all
the space it needs.
- Record re-sizing is Odd.
- There exists a grey area with odd behavior anomalies in the
operation of the SM.
If you take an existing record and expand it in one shot so that
it is larger than can fit on a SM page, then the SM just makes it
into a large record and it works.
But if you expand the same record to a size that is less than the large
record size, but won't fit on the page, then the SM errors
out on you. In that case you need to delete the existing record and
create_rec it somewhere else in the file where it will fit (see the
sketch at the end of this entry).
If you have a bunch of records that are growing, this can cause
horrendously bad behavior, as random record growth causes
these record forwards (and corresponding index updates) to just
drive records all over the place until they are all large.
At which point we have way too many underutilized pages, unless
we also have a bunch of smaller records.
One solution to this would be to add an "enlarge in place"
facility to record handling, so I can just say I want to keep the
rid and enlarge it here.
There is also an inconsistency here, in that truncate_rec,
when a record is brought small enough to be an on-page object,
but the page won't hold the record because it is a bit full,
will just leave the record as a large record! It seems
that it should behave orthogonally with regards to
append_rec to minimize the number of small large records.
Perhaps, again, a flag should be provided so the user knows
that the record is small and should be moved.
I don't think this is kept track of at the page level,
and the append_rec() scenario can NOT happen in reverse,
which screws things up another way: as you have a bunch
of records shrinking, a lot of them will be left as large
records until the next truncate_rec on the same record.
Also, once space becomes available on the page for a
"small" large record, there is nothing to convert that
record back to fit on the page!
It seems that to do this we need a logical-level logging
operation that says "record x converted from large rep
to small" ... while remaining the same size.
The locking has to be right (a temporary hold all comers lock)
while the operation is in progress, so that it proceeds invisibly
to all users.
This isn't a problem for append_rec(), since the record being
enlarged past the amount the page will hold is also the record
being modified, instead of a delete/truncate of another record
in a page allowing "small large records" to be coalesced.
If locks are allowed on *pieces* of a record (large object)
this could be a problem too, since those locks would need
to be converted to the lock on the now smaller rid.
If only records are locked, though, no problem there.
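For reference, the forced forwarding that you're stuck with today
looks roughly like this (a sketch only; the ss_m signatures are from
memory and probably not exact, and the caller still has to fix up any
index entries that pointed at the old rid):

    w_rc_t forward_record(stid_t file, rid_t &rid,
                          const vec_t &hdr, const vec_t &new_body)
    {
        W_DO(ss_m::destroy_rec(rid));                 // old rid is gone
        W_DO(ss_m::create_rec(file, hdr,
                              new_body.size(),        // length hint
                              new_body, rid));        // new rid, somewhere
                                                      // else in the file
        return RCOK;
    }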
- truncate_rec
- It isn't immediately obvious that truncate_rec truncates
the record BY that many bytes, rather than truncating it TO
that many bytes. The ftruncate() unix system call is a popular
example of the truncate-TO methodology.
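For example (assuming the current by-N semantics):

    // rid currently points at a 100-byte record:
    W_DO(ss_m::truncate_rec(rid, 40));   // leaves 60 bytes: truncates BY 40.
                                         // ftruncate(fd, 40) would leave 40.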
- update_rec() Record update with different size
- There isn't an easy way to replace the contents of a record
with contents of a different size.
To do it you need to either grow the record by append_rec()ing
zeros, or shrink it by truncate_rec()ing the appropriate number
of bytes, and then update_rec() the content with the new content.
An optimization is available for record expansion:
append_rec() with just the new tail bytes,
and then update_rec() the original bytes.
It still requires multiple SM calls and the associated overhead, though.
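A sketch of that expansion path (ss_m signatures from memory;
buf/old_len/new_len are just the obvious arguments a caller would have):

    // Replace the contents of rid (currently old_len bytes) with
    // new_len > old_len bytes taken from buf.
    w_rc_t grow_and_update(const rid_t &rid, smsize_t old_len,
                           const char *buf, smsize_t new_len)
    {
        vec_t tail(buf + old_len, new_len - old_len);
        W_DO(ss_m::append_rec(rid, tail));      // grow with the new tail bytes

        vec_t head(buf, old_len);
        W_DO(ss_m::update_rec(rid, 0, head));   // overwrite the original bytes

        return RCOK;
    }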
- build system versus languages
- gcc -x c++ and gcc are used for linking c++ programs.
This is sorta a problem because then the build system has
to know all the language-specific libraries that c++ needs to link
against, instead of g++ just adding them in. It also falls down
if you want to compile with multiple languages, since there isn't
a way, for example, to say that a c++ program needs a c++ linker,
and a pascal program needs a pascal linker, etc.
- output streams
- For cerr/cout streams to work really well, each thread really
needs to have its own "context". Otherwise, changes that other
threads make, such as numeric precision or output justification will
magically appear in the output of another thread trying to do the
same thing. Locking really won't even help this, as the format
changes would have to be "transaction based" to have any meaning,
and that would lock the stream up completely. This is a big issue
when going multi-cpu capable, but could also be an issue if
sthread-non-blocking streams are used to improve I/O capability.
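A plain single-threaded illustration of the shared-state problem
(nothing SM-specific; imagine the two functions running in different
threads against the one shared cout):

    #include <iostream>
    #include <iomanip>
    using namespace std;

    void thread_a() { cout << setprecision(2) << 3.14159 << endl; }  // "3.1"
    void thread_b() { cout << 2.718281828 << endl; }  // "2.7" -- thread A's
                                                      // precision setting is
                                                      // still in effect
    int main() { thread_a(); thread_b(); return 0; }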
- out-of-memory handler
- Install an out-of-memory new handler that will W_FATAL the
system when an out-of-memory error occurs. Have to be careful
not to try to allocate additional resources in the handler, so that
error messages will print OK, and stuff like that.
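A minimal sketch of such a handler (standard set_new_handler
machinery; the W_FATAL/fcOUTOFMEMORY spelling is a guess at what the
SM would actually use):

    #include <new>
    #include <unistd.h>

    // Message is pre-formatted so the handler never allocates anything.
    static const char oom_msg[] = "sm: fatal: out of memory in operator new\n";

    static void sm_new_handler()
    {
        // write()/_exit() don't touch malloc or iostreams.
        // In the SM proper this would be W_FATAL(fcOUTOFMEMORY) or similar.
        (void) write(2, oom_msg, sizeof(oom_msg) - 1);
        _exit(1);
    }

    // once, at startup:
    //   std::set_new_handler(sm_new_handler);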
- SSH doesn't give error information for TCL scripts
- If something goes wrong in the execution of a TCL
script (runscripts random.all, see what happens) there
is no indication of what file and line number the error
happened at!
- ssh script lgrec.2 uses lots of memory
- The lgrec.2 script will often trigger segmentation
violations on NetBSD. This happens because it allocates a
lot of memory (for reading large records in to the tcl result
buffer). If the memory limit is too low, something goes wrong
and a SEGV results. This is really bad behavior; I would
vastly prefer it if the process said it was out of memory,
rather than making it look like something else has gone wrong.
This may be due to a bad allocator on NetBSD 1.2,
but the same thing happens on Solaris too, just with lower
memory limits.
So this isn't a shore problem, but it is annoying that the
system doesn't die in a better, more informative fashion.
- hash values are fixed at uint4_t
- Currently the hash values for the SM are "stuck" at 4-byte unsigned
integers. This is a problem because as the machine architecture's word
size expands, the size of hash values does not.
A w_hashval_t should be created and used for all hash values, so it
can be changed as needed.
Note that there is no reason that hash values should be a fixed
size type; e.g. they could be a 'long' to follow the largest native
machine type.
Also note that lots of places assume that hash values are 4-byte
integers and do arithmetic on them as such.
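Something along these lines, perhaps (the name and placement are just
a suggestion):

    // in w_base.h (or wherever the other basic typedefs live):
    typedef unsigned long w_hashval_t;   // tracks the largest native word size

    // and then, e.g.
    //     w_hashval_t hash() const;
    // in place of the current
    //     uint4_t hash() const;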
- smthread_t constructors
- The smthread_t constructors should be a super-set of sthread_t
constructors, so that things that are sthreads can just become
smthreads without having to know anything SM-specific.
The flip-side is that the default sthread arguments have to be
specified to get non-default SM arguments.
- xct_t printing
- The xct_t operator<<() should be usable by ordinary users
wanting to print transaction info.
Currently, I believe you need to include xct.h and the xct guts
to just print a transaction. Perhaps the xct_t printer
should be declared in sm.h so that random SM users can print
transactions.
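i.e. something like this in sm.h, assuming the operator itself already
exists in xct.h and only the declaration needs to move where users can
see it:

    class xct_t;                                     // forward declaration
    extern ostream &operator<<(ostream &o, const xct_t &x);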
- sm core dumps when dismount_all
- If you have shore configured and you try a dismount_all
when no volumes are mounted, the SM core dumps, perhaps
on a mutex-mutex deadlock.
Look at running 'Index' when the volume hasn't been formatted
for an example of this behavior.
- keep track of max stack depth
- [I keep forgetting to do this, and I've been meaning
to add it to NewThreads (and sthreads) for a couple of years.
I had a prototype at one time, but I wanted to fix some problems
with it and I didn't complete it.]
Keep track of the maximum amount of stack used by a thread.
Right now only the current stack depth is visible, not the
historical maximum.
In addition, an interrupt could be used to sample stack usage
preemptively, catching deeper stack usage than is visible at the
cooperative context switches caused by synchronization.
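The cheap version is just a high-water mark updated at each
cooperative switch; a sketch (assumptions: the thread knows the high
end of its stack, and stacks grow down on our platforms):

    #include <stddef.h>

    // Called from the context-switch path for the thread being switched out.
    inline void note_stack_depth(const char *stack_base, size_t &max_used)
    {
        char marker;    // lives at (roughly) the current stack pointer
        size_t used = (size_t) (stack_base - &marker);
        if (used > max_used)
            max_used = used;
    }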
- sthread sleep(0) causes an assertion failure
- Sleep(0) should not cause a panic, it should either
do nothing or cause a context switch (aka yield).
What happens now is that it catches the
w_assert1(timeout != TIMEOUT_IMMEDIATE)
in block.
- sthread sleep() and yield() are non-orthogonal
- One requires a thread object and the other is static.
This is somewhat unexpected and a bit of a hassle to remember
which does which. It could be better.
- sinfo_s
- The sinfo_s and sdesc_t structures and classes used for representing
meta information about stores are bogus.
There is one sinfo_t which is supposed to be used for all files,
indexes, etc.
This means that the content of this needs to be a non-shared union,
and all types pay for each other.
It also means that adding per-store data breaks everything in the world.
And the sinfo_s constructor for setting these things up is a big
mess: everyone has to know *exactly* what all the arguments do,
even for store classes they don't care about.
This should really be changed to have an sinfo_t per store type,
so each type of store can have its own info.
Since the sinfo_t is stored in an index or a file, the differing sizes
won't (or shouldn't) matter -- except for index retrieval problems,
and a page-sized buffer could be used to retrieve the data into
until that problem is fixed (or use the cursor interface,
which fixes it).
- bulkload sort_duplicates
- It appears that the bulk load code for btrees (problem may be
present elsewhere, I haven't looked) will only sort duplicates that
are on the same page in the bulk load file.
So, if you have duplicate values that spread across multiple pages
of the bulkload file, they won't be sorted properly, only on
a page-by-page basis.
To fix this, the bulk load code needs to either create runs
of duplicate-value items and sort the runs, or use
a multi-value sort and sort the entire input (which sort of makes
sort_duplicates do 2x the work, since the user should have done that in
the first place), or do something else to deal with this case correctly.
I may have stated the above wrong, for the code actually will sort
things across two pages. But if the 3rd or a later page contains something
which should be sorted onto the 1st (or an earlier) page, then it
is problem-land again.
- bulkload fill factor
- It is currently not possible to specify a fill factor for bulk loads
of btrees.
You may want to do this to load the tree quickly, yet still have inserts
succeed without causing immediate page splits.
Note that the index split factor acts as an index fill factor while
bulk loading is running.
- bulkload btree performance
- Performance of bulk-loaded btrees is often worse
than that of randomly loaded btrees.
This is just not expected, as there should be fewer pages with
higher fill factors in the bulk-loaded btree.
The problem apparently is that the bulkloaded btree is deeper
than the randomly loaded btree.
This extra tree depth hurts performance considerably.
Yet another thing to track down.
This occurs because there is a greater % fill factor (70% according
to papers) in the interior pages of a random-built btree
than there is in the 50% split factor (aka 50% fill factor) pages
that are created by the bulkload.
Once I finish the btree split factor addition to Shore, it
can be used (or a load-time parameter can be used) to control
the split/fill factor during bulkload.
That will allow a higher utilization in the interior nodes and
the tree depth should go down.
Why not 100% fill -- well, what if you want to bulkload a btree that
you expect random inserts to happen to in short order!
- Performance deteriorates with larger volumes
- As you increase the Shore volume size, the runtime
for a fixed data size increases.
For example with a 10GB volume, runtime for ssh's all
script is around 8 minutes.
With a 30GB volume, the same script takes about 17 minutes.
With a 100GB volume, the script takes about 48 minutes.
I need to put a profiler on and see where the time is going.
This may only be a problem on Linux; need to check other platforms
first.
- Pin::next_bytes() not intuitive
- next_bytes() in a pin goes to infinity in some cases,
such as when you are at the end of the pinned record.
I think a more reasonable error would be for it to go to 0,
indicating that no bytes remain.
- no replace_assoc()
- There is no replace_assoc() method to update the value
of an association.
The only way to do this is delete_assoc() followed by create_assoc().
This sorta sucks because it means that updates have to walk the
btree 2x as much as needed, once for the delete and
then another time for the create.
Doing this could improve update rates quite a bit.
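What an update amounts to today, roughly (a sketch; the ss_m
signatures are from memory, and the call may actually be named
destroy_assoc() rather than delete_assoc()):

    w_rc_t replace_assoc_workaround(stid_t btree, const vec_t &key,
                                    const vec_t &old_el, const vec_t &new_el)
    {
        W_DO(ss_m::destroy_assoc(btree, key, old_el));   // traversal #1
        W_DO(ss_m::create_assoc(btree, key, new_el));    // traversal #2
        return RCOK;
    }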
- diskrw missing
- This may be bogus due to other problems, but I moved it here
as a reminder.
In some cases when diskrw is missing there are no errors and
no crash. Things just act weird and nothing happens.
- compiler link options
- When DEBUGGERSYMBOLS are used, the link command is not
passed any information about that state.
This is a problem since the linker may need to do things differently
and/or add on extra files to the link (such as end.o on HPUX).
- Alignment issues painful
- Alignment issues are becoming quite painful.
For example, I added a 'type *_exception' member to sthread_t to point
to a per-thread exception context. This just happened to break
btrees (presumably because of the kc_buf in smthread.h, or some other
derived alignment problem). We really need real fixes for this;
the mess is just like a house of cards.
- Short Keys
- The shore key specification system has no way to specify
typed variable length keys.
For example, a key that could be either "u4u4" or "u4u4u4"
can't be expressed.
You need to specify a "u4u4u4" key and zero-pad the unused
last "u4" element.
This also causes a problem if you want to use a scan of a
partial key.
For example, a scan of "u4u4" but just specifying "u4".
Gotta look at this; it is more painful than it should be.
- Pin w/ offset
- pin_i::pin(..., start) doesn't work correctly with small records.
Instead of returning a body() that is start bytes into the short tuple,
it returns the start of the tuple.
This is just plain wrong, as it means that a user needs to determine
what size a tuple is before accessing a portion of it!
This needs to be fixed soon.
The pin_i code is a real mess with
regards to start number and offset and paging handling, and some time
should be spent to see if it could be improved.
- Duplicate Btree Entries
- Shore lets you write duplicate keys into a btree once.
However, if you try to write duplicate keys a 2nd time, then it
generates errors in btree_impl.cpp:922 that there are duplicate
keys in the tree.
Either the first duplicate insert should NOT be allowed to happen,
or something sane should happen.
A scan reveals the duplicate entries, probes just fetch one, presumably
the first of the duplicates with the same body value.
You can insert two identical key-value pairs; errors start occurring
afterwards.
However, even once that happens you can still insert one more
key/value pair with the same key and a different value.
Then, you can no longer insert values.
The order of the above max 3 inserts doesn't matter.
- TCL assert failures when running rtree.all on NT multiple times
- Caused by the NT random number generator generating a duplicate key
in the rtree.
This occurs somewhat randomly, but repeatably.
The assert goes off when the retrieved key doesn't contain the
expected value ... because the key was generated more than once.
I'm not 100% certain about this, but saw the same kinds of problems
in the MacOSX version when I had a poor random-generator hacked
together for testing.
- Rollback bugs with full pages
- You will see these either on a abort_xct() or when the
log is cleaned up on system start.
An assert goes off in page.cpp complaining about _nfree >= _nrsvd
or something close.
This started with the 64-bit port, which aligns data on a page
to x8 (previously x4) alignment ... thereby reducing the space on the
page by 4 bytes.
The difference between nfree and nrsvd in the assert failure is
always an off-by-4 error.
It only happens when rolling back a page that is almost full.
I believe that somewhere there is a chunk of code with a magic number
in it, instead of using the appropriate constant.
It "knows" about the old amount of free space, and when the page
gets full, this assertion goes off.
- Btree Index Key length sometimes ignored???
- Create a btree with a key such as 'u4'.
Store a value with a key longer than 'u4', such as a string into
the btree. Oops, the key value is silently truncated and
shore doesn't generate an error ... which I've seen it do in
other situations.
Retrieve a value with the same key .. again no complaint.
Of course, if you store several keys with the same 4-byte prefix,
the mapping is many-to-one and you lose stored info.
Need to investigate this and find why the key length is ignored.
- Buffer Manager Dirty Threshold
- The Buffer Manager's dirty page threshold is currently
fixed at arbitrary values.
It is initially calculated as one-eighth (1/8) of the
number of pages in the bufferpool.
However, a maximum dirty threshold of 100 pages is enforced on
top of that.
Or is it a minimum of 100 pages? It depends
on the interpretation -- I need to examine this more closely.
It might be good if this was adjustable, since it only allows a
very small portion of a large buffer pool to be dirty.
If you are really doing that much writing, a larger number could
be better to try and get more sequential I/O going on writes
to extents.
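The current behavior, as I read it (pseudocode; whether the 100 acts
as a maximum or a minimum is exactly the thing to check):

    // npages = number of frames in the buffer pool
    int dirty_threshold = npages / 8;
    if (dirty_threshold > 100)
        dirty_threshold = 100;    // reads like a *maximum* of 100 dirty pages,
                                  // which pins big buffer pools at 100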
- Log Master Records
- What happens here is that the write of a new master
log record file, and the unlink of the old one are
all done synchronously.
That is, direct local system calls in the shore process.
This of course brings everything to a screaming HALT
while this happens.
I think the thing to do here is to upgrade the diskrw
processes to also do random file system operations on behalf
of shore.
And we can always leave _1_ sitting around to do these operations
in an asynchronous manner.
For example, once the new log master is created, the unlink
could be done asynchronously, and the process would
not have to wait around for it, unless it wanted an error
return.
If the logging system had more parallelism, other people could be
writing to the log tail while the checkpoint happens, so switching
that part of the log to use the async i/o system might be good too.
- Log LSN Checkpoints
- Is it possible to optimize the buffer-pool LSN lists of
log checkpoints by only recording dirty pages in the
bufferpool, rather than all pages in the bufferpool?
If so, it would also take a lot less time by
keeping a list of dirty pages and just traversing it,
rather than groping through the entire BP for the few pages
which might be dirty.
The answer to that is YES, and it already does it.
- dir_vol_m::_destroy_temps() aka Temporary Extent/Store Removals
- These happen at volume unmount, and can take longer than one
would want -- especially if there is nothing to do!
What happens is that a get_volume_quota() is run
to find out how many extents there are in use.
That takes a while, BTW, since it has to grope every extent on the
disk (see another entry for problems with this).
And, a memory allocation comparable to the number of extents is done!
Then each store is examined to see if it is a temporary store.
Since the minimum store allocation is an extent (or two, if a large
object store exists for the store), and it is possible
that all stores are temporary, it allocates all that space
ahead of time.
If the store is temporary, it is put on that in-memory list of temporary
extents, which are later iterated through to delete them.
Note, that on a big store, this can be a several MB allocation,
all for naught :(
- No Read-Ahead on shore-internal things
- Note that there is a great way to speed many shore internal
operations up.
They are all iterating through stuff on disk, and know exactly where
the next page is.
For example the get_volume_quota mess.
If read-ahead was enabled, or even if extents were I/O'd as a whole,
this process would go way faster, since it is mostly latency-bound.
Once the I/O pipeline of read-ahead starts performance would be
much better.
- is_int_aligned() is broken
- is_int_aligned() and other functions of its ilk
take int arguments where they should take the pointer itself
(a void *, or an integer type wide enough to hold a pointer value).
This works OK on 32-bit machines, but breaks on 64-bit
machines.
Worst of all, most users of this code use a subtraction instead
of a cast to turn the pointer into an integer, so compiler warnings
don't appear and you are left holding the bag.
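A 64-bit-safe sketch would take the pointer itself and go through an
integer type wide enough to hold it; something like this (assuming
size_t can hold a pointer on the platforms we care about, and using a
generic alignment argument rather than whatever constant the real
code uses):

    #include <stddef.h>

    /* Alignment test that works the same on 32- and 64-bit machines. */
    inline bool is_aligned(const void *p, size_t alignon)
    {
        return ((size_t) (const char *) p % alignon) == 0;
    }

    /* e.g.  is_aligned(buf, sizeof(int))  instead of passing buf as an int */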
- Buffer Page Dirty List
- Whenever a log checkpoint is created, it needs to record
the LSNs of all dirty pages in the buffer pool.
Currently it does this by iterating through all pages in the
buffer pool.
That is needlessly slow if you have a large buffer pool with few
dirty pages.
A better solution is to maintain a linked list of dirty pages,
or a bitmap of dirty pages, so that the dirty pages can be
found much faster.
Choice of implementation depends upon the number of dirty pages,
since the linked list is always the size of the dirty-page set, but
the dirty bitmap is always the size of the bufferpool.
Dewitt was suggesting a bitmap would be a better solution
than a list.
I was resisting, since the bitmap is static and is another chunk of
memory.
However, I realized that if you want to retrieve modified buffer pool pages
in sorted order, the bitmap gives you the advantage of
providing everything already sorted.
The downside, again, is that you need to examine the entire bitmap,
not just the active entries.
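A rough sketch of the bitmap flavor (names and layout made up; one
bit per buffer-pool frame, set when a frame is first dirtied and
cleared when it is cleaned):

    #include <string.h>

    class dirty_map_t {
        unsigned char *bits;     // one bit per buffer-pool frame
        int            nframes;
    public:
        dirty_map_t(int n) : nframes(n) {
            bits = new unsigned char[(n + 7) / 8];
            memset(bits, 0, (n + 7) / 8);
        }
        ~dirty_map_t() { delete [] bits; }

        void set(int f)          { bits[f >> 3] |= (unsigned char) (1 << (f & 7)); }
        void clear(int f)        { bits[f >> 3] &= (unsigned char) ~(1 << (f & 7)); }
        bool is_set(int f) const { return (bits[f >> 3] >> (f & 7)) & 1; }
        // A checkpoint scans frames 0..nframes-1; hits come back in frame
        // (hence page-table) order, i.e. already "sorted" for free.
    };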
- Suspend Signal hoses shore command lines
- This has been a problem for years and years; I'm just writing
it down in my notes to self.
If you try suspending a shore process from an interactive command
line, it will never resume.
This occurs in the debugger with Predator.
If you suspend a non-debugged process something goes wrong with
it and it never starts again.
This may have something to do with the diskrw procs.
My random guess is that perhaps the job control signals need to be
handled to make it work.
- Imake subtely broken
- This is a good one ... if a token in the path is also a #define
somewhere in imake, then when Imake.tmpl tries to include the Imakefile,
the filename is cpp-expanded ... and the pathname is broken.
For example, the directory name shore-linux is mapped to shore-1,
which of course doesn't work :(