Citation:
V. Pai, P. Druschel, and W. Zwaenepoel, "IO-Lite: A Unified I/O Buffering and
Caching System", 3rd Symposium on Operating Systems Design and Implementation,
New Orleans, February 1999.

* Summary

IO-Lite is a unified I/O buffering and caching system for general purpose
operating systems. It allows applications, interprocess communication, 
the filesystem, the file cache, and the network subsystem to share a single
physical copy of the data safely and concurrently. IO-Lite eliminates all
copying and multiple buffering of I/O data, and enables various cross-
subsystem optimizations.

* Why Do We Need IO-Lite?

For many users, the perceived speed of computing is increasingly dependent
on the performance of networked server systems. However, general-purpose
operating systems do not provide sufficient support for high-performance
server applications. One of the major problems is lack of integration
among the various I/O subsystems and the application, each of which typically
uses its own buffering and caching mechanisms. This leads to repeated data
copying, multiple buffering, and other performance-degrading anomalies.
The primary goal of IO-Lite is to improve the performance of server
applications such as those running on networked servers, and other I/O
intensive applications.

* IO-Lite Design

IO-Lite uses immutable buffers and buffer aggregates. All I/O data buffers
are immutable, such that after initialization they may not be modified. This
implies a read-only sharing model, eliminating problems of synchronization,
protection, consistency, and fault isolation among OS subsystems and 
applications. The price of immutable buffers is that data cannot be
modified in place. To alleviate the impact of this restriction, all
data buffers are encapsulated inside buffer aggregates, which are instances
of an ADT that represent I/O data. All OS subsystems access data through
this abstraction. Data contained in a buffer aggregate is not necessarily
in contiguous storage. Rather, buffer aggregates contain an ordered list
of <pointer,length> pairs that represent contiguous sections of immutable
buffers. Buffer aggregates support operations for truncating, prepending,
appending, concatenation, splitting, and mutating data. Although buffer
aggregates are passed by value among subsystems, the underlying immutable
buffers are passed by reference. Conventional access control ensures that
a process can only access I/O buffers associated with buffer aggregates
explicitly passed to that process. A system-wide reference counting mechanism
for I/O buffers allows safe reclamation of unused buffers.
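
The design above can be illustrated with a small user-space model. This is
only a sketch, not IO-Lite's kernel implementation: the names ImmutableBuffer,
Slice, and Aggregate are made up for illustration, Python bytes objects stand
in for read-only pages, and a plain counter stands in for the system-wide
reference counts.

```python
from dataclasses import dataclass


class ImmutableBuffer:
    """Hypothetical stand-in for an I/O buffer: fixed after initialization."""

    def __init__(self, data):
        self._data = bytes(data)  # bytes cannot be modified in place
        self.refcount = 0         # stand-in for system-wide reference counting

    def read(self, offset, length):
        return self._data[offset:offset + length]


@dataclass(frozen=True)
class Slice:
    """One <pointer,length> pair: a contiguous section of one buffer."""
    buf: ImmutableBuffer
    offset: int
    length: int


class Aggregate:
    """Ordered list of slices; copied by value, buffers shared by reference."""

    def __init__(self, slices=()):
        self.slices = tuple(slices)
        for s in self.slices:
            s.buf.refcount += 1

    @classmethod
    def from_bytes(cls, data):
        return cls([Slice(ImmutableBuffer(data), 0, len(data))])

    def release(self):
        """Drop this aggregate's references; refcount 0 means reclaimable."""
        for s in self.slices:
            s.buf.refcount -= 1
        self.slices = ()

    def concat(self, other):
        return Aggregate(self.slices + other.slices)

    def split(self, pos):
        """Return (first pos bytes, remainder) without copying any data."""
        head, tail = [], []
        for s in self.slices:
            if pos >= s.length:
                head.append(s)
                pos -= s.length
            elif pos <= 0:
                tail.append(s)
            else:
                head.append(Slice(s.buf, s.offset, pos))
                tail.append(Slice(s.buf, s.offset + pos, s.length - pos))
                pos = 0
        return Aggregate(head), Aggregate(tail)

    def to_bytes(self):
        """Flatten for inspection only; this is the copy IO-Lite avoids."""
        return b"".join(s.buf.read(s.offset, s.length) for s in self.slices)
```

Concatenating a header aggregate with a body aggregate yields a new list of
slices that points at the same two buffers, and splitting only adjusts
offsets and lengths. No payload bytes move in either operation.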

* Interprocess Communication

To support caching in a unified buffer system, an IPC mechanism must allow
safe concurrent sharing of buffers among different protection domains. IO-Lite
uses an IPC mechanism similar to fbufs to support safe concurrent sharing.
IO-Lite extends the fbuf approach from the network subsystem to the filesystem,
including the file data cache. It also adapts the fbuf approach to a general
purpose operating system. IO-Lite IPC combines page remapping and shared
memory. When a buffer is initially transferred, VM mappings are updated to
grant the receiving process read access. When a buffer is deallocated, the
mappings persist and the buffer is added to a free pool for the
associated I/O stream, so a reused buffer needs no new read mappings.
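
The benefit of keeping mappings alive across deallocation can be seen in a
toy model. This is a sketch under stated assumptions: StreamPool and its
methods are hypothetical names, buffers are plain integers, and a set of
process names stands in for VM read mappings. It counts how many mapping
updates a stream incurs.

```python
class StreamPool:
    """Hypothetical per-I/O-stream pool of buffers with persistent mappings."""

    def __init__(self):
        self.free = []      # deallocated buffers, ready for reuse
        self.mapped = {}    # buffer id -> processes that hold a read mapping
        self.remaps = 0     # number of (simulated) VM mapping updates
        self._next_id = 0

    def allocate(self):
        if self.free:
            return self.free.pop()       # reuse: existing mappings still valid
        buf = self._next_id
        self._next_id += 1
        self.mapped[buf] = set()
        return buf

    def transfer(self, buf, receiver):
        """Grant read access; only the first transfer costs a mapping update."""
        if receiver not in self.mapped[buf]:
            self.mapped[buf].add(receiver)
            self.remaps += 1

    def deallocate(self, buf):
        self.free.append(buf)            # mappings are intentionally kept


pool = StreamPool()
for _ in range(3):                       # three messages on the same stream
    b = pool.allocate()
    pool.transfer(b, "receiver")
    pool.deallocate(b)
```

In this model, three transfers on the same stream cost a single mapping
update, because the second and third reuse a buffer the receiver can already
read.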

* Access Control and Allocation

IO-Lite maintains pools of buffers with the same ACL. The choice of a pool
from which a new buffer is allocated determines the ACL of the data stored
in the buffer. The access control model requires apps to determine the ACL
of an I/O data object prior to storing it in main memory.
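
A minimal sketch of pool selection by ACL follows. The class names and the
representation of an ACL as a set of process names are assumptions made for
illustration; the notes do not specify how IO-Lite encodes ACLs.

```python
class BufferPool:
    """Hypothetical pool: every buffer allocated here carries the pool's ACL."""

    def __init__(self, acl):
        self.acl = frozenset(acl)
        self.buffers = []

    def allocate(self, data):
        buf = bytes(data)        # immutable stand-in for an I/O buffer
        self.buffers.append(buf)
        return buf

    def readable_by(self, process):
        return process in self.acl


class PoolManager:
    """Maps each distinct ACL to exactly one pool."""

    def __init__(self):
        self._pools = {}

    def pool_for(self, acl):
        key = frozenset(acl)     # order of principals does not matter
        if key not in self._pools:
            self._pools[key] = BufferPool(key)
        return self._pools[key]


manager = PoolManager()
# The ACL must be known before the data is stored: it selects the pool.
web_pool = manager.pool_for({"httpd", "kernel"})
page = web_pool.allocate(b"<html>...</html>")
```

Because the ACL is a property of the pool rather than of each buffer, the
caller has to know who may read the data before choosing where to put it,
which is the requirement stated above.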

IO-Lite buffers are allocated in a region of the virtual address space called
the IO-Lite window. The IO-Lite window appears in the address spaces of all
applications and the kernel. A buffer always consists of an integral number
of virtually contiguous VM pages, all of which share the same access control
attributes. Buffer aggregates contain a list of tuples representing slices.
Slices are always fully-contained within a single IO-Lite buffer, but slices
may overlap. To avoid wasting memory, data objects with the same ACL can
be combined in a single IO-Lite buffer and on the same page.

* Application Interface

IO-Lite provides an extended I/O API that is based on buffer aggregates to
application programs. IOL_read and IOL_write are the two core calls. IOL_read
takes a standard file descriptor and size, and returns a buffer aggregate
containing at most the amount of data given as an argument. IOL_write takes
a file descriptor and a buffer aggregate, and replaces the data in an external
data object with that of the buffer aggregate parameter.

* Filesystem Interaction

In IO-Lite, buffer aggregates form the basis of the filesystem cache. The 
rest of the filesystem remains unchanged. The IO-Lite file cache consists of
a data structure that maps triples of the form <fileid, offset, length> to 
buffer aggregates that contain the corresponding data extents.
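
The cache structure might look roughly like the following. This is a sketch
only: FileCache is a hypothetical name, a memoryview over bytes stands in for
a buffer aggregate, and the linear scan in lookup is a simplification (a real
cache would index extents).

```python
class FileCache:
    """Hypothetical file cache keyed by <fileid, offset, length> triples."""

    def __init__(self):
        self._entries = {}   # (fileid, offset, length) -> cached extent

    def insert(self, fileid, offset, data):
        # memoryview gives zero-copy, read-only access to the bytes object.
        self._entries[(fileid, offset, len(data))] = memoryview(data)

    def lookup(self, fileid, offset, length):
        """Return a view of the requested range if one extent covers it."""
        for (fid, start, size), extent in self._entries.items():
            if fid == fileid and start <= offset and offset + length <= start + size:
                rel = offset - start
                return extent[rel:rel + length]   # slicing a view copies nothing
        return None                               # miss: caller reads from disk


cache = FileCache()
cache.insert(7, 4096, b"0123456789abcdef")
hit = cache.lookup(7, 4100, 4)    # bytes 4..7 of the cached extent
miss = cache.lookup(7, 0, 16)     # not cached
```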

* Network Interaction

The network subsystem uses buffer aggregates to store and manipulate network
packets. However, to meet the requirement that the ACL of a data
object be determined before the object is stored, network drivers must filter
each incoming packet to identify its associated I/O stream, a process
known as early demultiplexing.
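
Early demultiplexing can be sketched as a filter that runs before the packet
is stored. Everything here is illustrative: the three-byte header layout
(protocol, destination port) is invented for the example and is not a real
IP/TCP header, and Demux, register, and receive are hypothetical names.

```python
import struct

# Invented header for illustration: 1-byte protocol, 2-byte destination port.
HEADER = struct.Struct("!BH")
DEFAULT_STREAM = "kernel-default"


class Demux:
    """Hypothetical driver-level filter that picks a pool before storing."""

    def __init__(self):
        self._filters = {}                  # (proto, port) -> stream name
        self.pools = {DEFAULT_STREAM: []}   # stream name -> stored packets

    def register(self, proto, port, stream):
        self._filters[(proto, port)] = stream
        self.pools.setdefault(stream, [])

    def receive(self, frame):
        # Inspect the header first, so the packet lands directly in a buffer
        # from the pool (and therefore with the ACL) of its I/O stream.
        proto, port = HEADER.unpack_from(frame)
        stream = self._filters.get((proto, port), DEFAULT_STREAM)
        self.pools[stream].append(bytes(frame))
        return stream


demux = Demux()
demux.register(6, 80, "httpd")
stream = demux.receive(HEADER.pack(6, 80) + b"GET / HTTP/1.0")
```

Without this step the driver would have to store the packet in a buffer with
some default ACL and copy it later once the destination became known.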

* Cache Replacement and Paging

Cache replacement in a unified caching/buffering system is different from
that done for conventional file caches, since cached data is potentially
concurrently accessed by applications. Thus, replacement must consider both
references to a cache entry and virtual memory accesses to the buffers
associated with that entry. IO-Lite uses a simple strategy for selecting cache
victims. The cache entries are maintained in a list ordered first by current
use, then by time of last access. When a cache entry needs to be evicted, the
least recently used among the unreferenced cache entries is chosen, if one
exists; otherwise the least recently used among the currently referenced
entries is chosen.
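
The victim selection rule can be written down directly. This is a sketch that
assumes each entry exposes a reference count and a last-access timestamp;
CacheEntry and choose_victim are hypothetical names, and the sketch scans a
list rather than maintaining the ordered list described above.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    key: str
    refs: int            # number of current references to the entry
    last_access: float   # time of last access (larger means more recent)


def choose_victim(entries):
    """Prefer the LRU unreferenced entry; fall back to the LRU referenced one."""
    entries = list(entries)
    if not entries:
        return None
    unreferenced = [e for e in entries if e.refs == 0]
    candidates = unreferenced or entries
    return min(candidates, key=lambda e: e.last_access)


entries = [
    CacheEntry("a", refs=1, last_access=1.0),   # oldest, but still in use
    CacheEntry("b", refs=0, last_access=5.0),
    CacheEntry("c", refs=0, last_access=3.0),
]
victim = choose_victim(entries)   # "c": the oldest entry nobody references
```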

* Impacts of Immutable Buffers

All modifications to data objects stored in buffer aggregates require storing
the new values in a newly allocated buffer. If every word in the data object
is modified, the only additional cost is the buffer allocation. If only a
subset of the words is modified, the new values are stored in a new buffer,
which is logically chained with the unmodified portions of the original using
buffer aggregate operations. When modifications to a data object are so
widely scattered that the costs of chaining and indexing exceed the cost of a
redundant copy of the entire object, contiguous storage and in-place
modification are required. To support this case, IO-Lite allows mmap'ing of
data objects, which permits in-place modification.
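
A partial modification by chaining might look like the sketch below. It
assumes an aggregate is a list of (buffer, offset, length) tuples over
immutable bytes objects, and that the overwritten range lies within the
object; take, overwrite, and flatten are hypothetical helper names.

```python
def take(agg, start, end):
    """Slices of agg covering the logical byte range [start, end)."""
    out, pos = [], 0
    for buf, off, length in agg:
        lo, hi = max(start, pos), min(end, pos + length)
        if lo < hi:
            out.append((buf, off + lo - pos, hi - lo))
        pos += length
    return out


def total_length(agg):
    return sum(length for _, _, length in agg)


def overwrite(agg, pos, new_data):
    """Return a new aggregate with new_data replacing the bytes at pos.

    Only new_data goes into a newly allocated buffer; the unmodified prefix
    and suffix are slices that still point at the original buffers.
    """
    new_buf = bytes(new_data)
    return (take(agg, 0, pos)
            + [(new_buf, 0, len(new_buf))]
            + take(agg, pos + len(new_buf), total_length(agg)))


def flatten(agg):
    """Materialize the contents, for inspection only."""
    return b"".join(buf[off:off + length] for buf, off, length in agg)


original = [(b"hello world", 0, 11)]
patched = overwrite(original, 6, b"WORLD")
```

The original aggregate is untouched, so other holders of it keep seeing the
old contents. Each such edit adds slices to the list, which is the chaining
and indexing cost that eventually makes a contiguous copy cheaper.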

* Cross-Subsystem Optimizations

Unified buffering/caching enables certain optimizations across apps and OS
subsystems not possible in conventional I/O systems. Such optimizations
leverage the ability to uniquely identify a particular I/O data object
throughout the system.

An example of an optimization of this type is the Internet checksum used by
TCP and UDP. With IO-Lite, this checksum can be computed for each slice
of a buffer aggregate and cached, such that future transmits of the same
slice can reuse the cached checksum.

To support such optimizations, IO-Lite provides a generation number for each
buffer that is incremented every time a buffer is re-allocated. This number,
when combined with the buffer's address, provides a system-wide unique id for
the contents of the buffer.
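
A checksum cache keyed this way can be sketched as follows. The checksum
routine is the standard 16-bit one's-complement Internet checksum; the Buffer
and ChecksumCache classes, the integer address, and the cache key layout are
assumptions made for illustration. Combining per-slice checksums into a
packet checksum is not shown.

```python
def internet_checksum(data):
    """16-bit one's-complement checksum as used by IP, TCP, and UDP."""
    if len(data) % 2:
        data += b"\x00"                          # pad to a whole 16-bit word
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                           # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


class Buffer:
    """Hypothetical buffer with an address and a generation number."""

    def __init__(self, address, data):
        self.address = address
        self.generation = 0
        self.data = bytes(data)

    def reallocate(self, data):
        self.generation += 1                     # contents changed: new id
        self.data = bytes(data)


class ChecksumCache:
    def __init__(self):
        self._cache = {}
        self.computed = 0                        # how often we did real work

    def checksum(self, buf, offset, length):
        # <address, generation> identifies the contents; offset and length
        # identify the slice within the buffer.
        key = (buf.address, buf.generation, offset, length)
        if key not in self._cache:
            self._cache[key] = internet_checksum(buf.data[offset:offset + length])
            self.computed += 1
        return self._cache[key]
```

After a buffer is re-allocated, its generation number differs, so stale
entries are never matched even though the address is the same.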