Threads and Input/Output in the Synthesis Kernel

Henry Massalin & Calton Pu @ Columbia Univ

Proceedings of the 12th ACM Symposium on Operating Systems Principles (SOSP), 1989

1. Introduction

Design goals

High performance

Self-tuning: adapts to dynamic load and configuration changes

Simple, uniform and intuitive model of computation with high-level interface

Problems & solutions

How to make OS kernel calls fast?

Use specialized code -- synthesize a special version of the code for each case

Once kernel calls are made fast enough, synchronization becomes a new bottleneck. Then what should we do?

Avoid synchronization - three techniques

Optimize synchronization

2. Synthesis Background

2.1 Synthesis Model of Computation

Model: a von Neumann machine with threads, memory protection boundary, and I/O devices

Thread graph

Node - thread

Directed edge - data flow channel from a source thread to a destination thread

Thread

User threads executing user jobs & kernel threads not associated with any user job

Threads run programs in a quaspace (quasi address space)

All threads on a machine share one address space: there are no per-thread virtual addresses

Each quaspace is a subspace of the global address space

Parts of a quaspace are blanked out by the kernel so that the thread cannot see them

I/O devices move data between threads

2.2 Kernel Code Synthesis

Code synthesizer generates special versions of code for specific situations

Methods of code synthesis

Factoring Invariant - like constant folding

Collapsing Layers

collapse vertical layers defined by the layered structure of modules

collapse horizontal layers defined by pipelined threads

Executable Data Structures

shorten data structure traversal time when the data structure is always traversed the same way
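The Factoring Invariant idea can be sketched in plain C, with a function pointer installed at open time standing in for run-time code generation (the real kernel emits machine code); the names and the in-memory file are hypothetical:

```c
#include <string.h>

/* A generic read path would re-check the file type and permissions on
 * every call. Those checks are invariant after open(), so a specialized
 * routine with the invariants factored out is selected ("synthesized")
 * once and installed in the open-file record; later reads skip them. */

typedef struct file file_t;
typedef int (*read_fn)(file_t *, char *, int);

struct file {
    const char *data;   /* in-memory backing store for this sketch */
    int len, pos;
    read_fn read;       /* synthesized entry point, called directly */
};

/* specialized reader for a plain in-memory file: no per-call checks */
static int read_mem_fast(file_t *f, char *buf, int n) {
    int avail = f->len - f->pos;
    if (n > avail) n = avail;
    memcpy(buf, f->data + f->pos, n);
    f->pos += n;
    return n;
}

/* "open" factors the invariants and installs the specialized routine
 * (the dynamic-link step of quaject execution, Sec. 2.3) */
void file_open_mem(file_t *f, const char *data, int len) {
    f->data = data; f->len = len; f->pos = 0;
    f->read = read_mem_fast;
}
```

A caller then invokes `f.read(&f, buf, n)` with no dispatch through generic checking code.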

2.3 Basic Kernel Components

Synthesis kernel is composed of quajects

Quaject

consists of procedures and data structures

Example quajects: threads and I/O device servers

is implemented by combining building blocks

Building blocks

Representative building blocks

queue

switch - similar to C switch statement

pump - connects passive producer to passive consumer and actively copies data from input end to output end

gauge - counts events such as procedure calls, data arrival, etc.

scheduler - uses gauge to collect data for scheduling decisions

monitor

Each building block may have several implementations, and the most efficient one is used for each situation. For example:

queue - two types

synchronous - two implementations

dedicated: for a single producer and a single consumer

optimistic: for multiple producers and consumers

asynchronous - likewise two implementations

dedicated queue

optimistic queue

Quaject creation

Created by quaject creator

Steps

allocation of memory for quaject and all associated synthesized procedures

factorization - Factoring Invariant

optimization - peephole optimization

Quaject execution

Done by the quaject interfacer, which installs the quaject in the invoking thread

Steps

combination - find the appropriate connecting mechanism (queue, pump, ...)

factorization - Factoring Invariant

optimization - peephole optimization

dynamic link - store the entry points of synthesized codes into the quaject

3. Reduced Synchronization

3.1 Overview: Methods to reduce synchronization

Code Isolation: If two processes work on different parts of the shared data, why synchronize?

Isolate and separate code fragments which manipulate the data structure

Do NOT synchronize fragments that deal only with their own part of the data structure

For example, the Unix kernel locks the process table to update a process's information; in fact there is no reason to lock, because each process updates only its own slot
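A minimal sketch of code isolation, using a hypothetical proc-table-like array where each thread writes only its own slot, so the update itself needs no lock:

```c
/* Each thread owns exactly one slot, indexed by its own id. Since no two
 * writers ever share a slot, the update needs no lock. (Readers wanting a
 * consistent snapshot across slots are a separate problem.) */
#define NPROC 64

typedef struct { long cpu_time; int state; } proc_slot;
static proc_slot proctab[NPROC];

void charge_cpu(int self, long ticks) {
    proctab[self].cpu_time += ticks;   /* self-owned slot: no lock taken */
}
```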

Procedure Chaining

If synchronization is unduly complex, costly, or simply unprofitable, do NOT synchronize; take the most naive approach instead -- do not run the threads concurrently, but chain the procedures to run one after another

Optimistic Synchronization - Interference between threads is rare. So,

Go ahead and make changes to the data structure without locking

Check whether our assumption is true

Roll back and retry if not
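The make-change / check / retry pattern can be sketched with a compare-and-swap loop; C11 atomics stand in here for the 68030 CAS instruction the paper relies on:

```c
#include <stdatomic.h>

/* Optimistic update of a shared counter: no lock is taken. Read the
 * current value, compute the new one, then try to commit with a
 * compare-and-swap. If another thread interfered, the CAS fails,
 * 'old' is refreshed with the interfering value, and we retry. */
void optimistic_add(atomic_long *shared, long delta) {
    long old = atomic_load(shared);
    while (!atomic_compare_exchange_weak(shared, &old, old + delta)) {
        /* interference detected: 'old' was reloaded; roll back and retry */
    }
}
```

Because interference is assumed rare, the common case costs one read and one CAS, with no lock traffic.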

3.2 Optimistic Queues: The most important example of optimistic synchronization

SP-SC (Single Producer & Single Consumer): Fig-1

Get or put data simply by checking tail or head

Block when the queue is empty or full respectively

Compare this algorithm with the conventional one, which:

Gets exclusive access to the queue

Puts or gets an item

Releases the access right
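The SP-SC case can be sketched as a lock-free ring buffer: the producer owns the tail, the consumer owns the head, and each side only reads the other's index (a simplified sketch in the spirit of Fig. 1, not the paper's exact code):

```c
#include <stdatomic.h>
#include <stdbool.h>

#define QSIZE 8   /* one slot is kept empty to distinguish full from empty */

typedef struct {
    long buf[QSIZE];
    atomic_uint head;   /* next slot to read; advanced only by the consumer */
    atomic_uint tail;   /* next slot to write; advanced only by the producer */
} spsc_queue;

/* Producer: no lock, no CAS; just check head, write, publish tail. */
bool spsc_put(spsc_queue *q, long item) {
    unsigned t = atomic_load(&q->tail);
    unsigned next = (t + 1) % QSIZE;
    if (next == atomic_load(&q->head))
        return false;                 /* full: caller blocks or retries */
    q->buf[t] = item;
    atomic_store(&q->tail, next);     /* publish only after data is written */
    return true;
}

/* Consumer: symmetric; just check tail, read, advance head. */
bool spsc_get(spsc_queue *q, long *item) {
    unsigned h = atomic_load(&q->head);
    if (h == atomic_load(&q->tail))
        return false;                 /* empty */
    *item = q->buf[h];
    atomic_store(&q->head, (h + 1) % QSIZE);
    return true;
}
```

Since each index has exactly one writer, there is nothing to lock; this is why the SP-SC queue needs no synchronization primitive at all.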

MP-SC: Multiple producers & single consumer, with atomic insertion of multiple items

Algorithm Fig-2

Each producer atomically reserves space in the queue, then inserts as many items as it reserved into the reserved slots

Uses an extra array of flags to indicate whether real data is in each slot yet
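The reserve-then-fill idea behind Fig. 2 can be sketched as follows; this simplified version reserves one slot per put with a fetch-and-add and omits the full-queue and rollback handling of the paper's algorithm:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MQSIZE 16

typedef struct {
    long        buf[MQSIZE];
    atomic_bool full[MQSIZE];   /* true once the slot's data is in place */
    atomic_uint tail;           /* reservation counter, shared by producers */
    unsigned    head;           /* consumed count, single consumer only */
} mpsc_queue;

/* Producer: atomically reserve a slot, fill it, then mark it ready.
 * Other producers may be filling later slots concurrently. */
void mpsc_put(mpsc_queue *q, long item) {
    unsigned slot = atomic_fetch_add(&q->tail, 1) % MQSIZE;  /* reserve */
    q->buf[slot] = item;                                     /* fill */
    atomic_store(&q->full[slot], true);                      /* publish */
}

/* Consumer: the flag array tells it whether the next reserved slot has
 * actually been filled yet, or is still in flight. */
bool mpsc_get(mpsc_queue *q, long *item) {
    unsigned slot = q->head % MQSIZE;
    if (!atomic_load(&q->full[slot]))
        return false;            /* empty, or reserved but not yet filled */
    *item = q->buf[slot];
    atomic_store(&q->full[slot], false);
    q->head++;
    return true;
}
```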

4. Threads

4.1 Synthesis Threads

Synthesis thread - lightweight process

Each thread executes in a context, called a TTE (Thread Table Entry), which has fields:

register save area

vector table - pointers to:

thread-specific system calls

interrupt handlers

error traps

signal vectors

address map table

context-in & context-out procedures

Unix-like system call handling

Waiting list per resource

4.2 Context Switches

Floating-point registers are not saved during a context switch, on the assumption that most threads do not use them

An illegal-instruction trap handler is invoked the first time a thread executes a floating-point operation

This saves about half of the context-switch time
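The lazy floating-point save can be sketched in plain C, with the hardware trap simulated by an explicit handler call; all names here are ours, not the paper's:

```c
#include <stdbool.h>
#include <string.h>

/* On a context switch the FP registers are NOT saved; FP access is merely
 * disabled for the incoming thread. Its first FP instruction then traps,
 * and only the handler pays for saving the old owner's FP state and
 * restoring the new thread's. Threads that never touch FP never pay. */
typedef struct { double fpregs[8]; bool used_fp; } thread_t;

static double    fpu[8];        /* the (simulated) hardware FP registers */
static thread_t *fpu_owner;     /* thread whose state is live in 'fpu' */
static bool      fpu_enabled;

void context_switch_to(thread_t *next) {
    /* integer state would be swapped here; FP state is left in place */
    fpu_enabled = false;        /* arm the illegal-instruction trap */
    (void)next;
}

void fp_trap_handler(thread_t *current) {
    if (fpu_owner && fpu_owner != current)
        memcpy(fpu_owner->fpregs, fpu, sizeof fpu);  /* save old owner */
    if (current->used_fp)
        memcpy(fpu, current->fpregs, sizeof fpu);    /* restore current */
    current->used_fp = true;
    fpu_owner = current;
    fpu_enabled = true;
}
```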

No 'dispatcher' procedure

Ready threads are chained together, forming a ready queue

Context-out procedure of the leaving thread directly jumps to the context-in procedure of the next thread

Could a fair and responsive scheduling algorithm be implemented with this approach?

4.3 Thread Operations

Unix-like signal handling

Debug support by signal

User threads can catch error traps and handle them as they do signals!

4.4 Scheduling

Policy

The shorter the previous run, the longer the quantum given

Priority

Give longer quantum, or

Move to the head of the waiting queue when the thread is woken up
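The inverse relation between previous CPU usage and the next quantum might look like the following sketch; the formula and all constants are illustrative, not the paper's:

```c
/* A thread that used little of its previous quantum (e.g. it blocked on
 * I/O quickly) gets a longer next quantum, keeping interactive threads
 * responsive; a CPU-bound thread's quantum shrinks toward a floor. */
unsigned next_quantum_us(unsigned prev_run_us) {
    unsigned q = 16000 / (1 + prev_run_us / 250);  /* inverse relation */
    if (q < 100)  q = 100;                          /* floor */
    if (q > 4000) q = 4000;                         /* ceiling */
    return q;
}
```

Raising a thread's priority by giving it a longer quantum then amounts to scaling this value up.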

5. Input/Output

I/O - all data flow among hardware devices and quaspaces

Stream - data move along logical channels, which connect source and destination

5.1 I/O Device Servers

Device server - physical I/O device encapsulated in a quaject

write - data movement from caller to callee

read - data movement from callee to caller

A device server may or may not have its own thread

High-level server may be composed from more basic servers

For example, the Unix tty device

connected to the keyboard server by a dedicated queue

connected to the screen server by an optimistic queue: user threads can also write to the screen server

processes 'erase' and 'control' characters and passes other characters through from the keyboard server to the screen server

File system server is composed of several filter stages

5.2 Producer/Consumer

The stream model can be described as a composition of different types of producer/consumer pairs

Passive producer - passive consumer

Example

producer: xclock program

consumer: screen which displays the clock

implementation: pump connecting the consumer and producer

Active-passive

Single-single

Active part calls the procedure which passes or receives data

Multiple-single

Attach a monitor to the end to which multiple participants are attached

Active-active: a queue needs to be inserted between producer and consumer

Single-single: SP-SC queue

Multiple-single or multiple-multiple: MP-SC or MP-MC queue
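The passive-passive case above needs a pump to supply the missing activity; a minimal sketch with hypothetical entry-point types (the demo endpoints below are ours, for illustration):

```c
/* Both ends are passive (plain procedures), so an active element must move
 * the data. The pump's thread body repeatedly calls the producer's read
 * entry and hands the result to the consumer's write entry. */
typedef int  (*produce_fn)(char *buf, int max);   /* returns bytes, 0 = done */
typedef void (*consume_fn)(const char *buf, int n);

void pump_run(produce_fn in, consume_fn out) {
    char buf[256];
    int n;
    while ((n = in(buf, sizeof buf)) > 0)
        out(buf, n);
}

/* demo passive endpoints: a string source and an accumulating sink */
static const char *src = "hello";
static int src_pos;
static char sink[16];
static int sink_len;

static int demo_produce(char *buf, int max) {
    int n = 0;
    while (src[src_pos] && n < max) buf[n++] = src[src_pos++];
    return n;
}
static void demo_consume(const char *buf, int n) {
    for (int i = 0; i < n; i++) sink[sink_len++] = buf[i];
}
```

In the xclock example, `in` would be the clock program's output entry and `out` the screen server's write entry.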

5.3 Interrupt Handling

Difference from Unix interrupt handling

Each thread synthesizes its own interrupt handler

Interrupt handling is done in the context of the currently running thread

Signal handling within interrupt handling routine

Signal handling is delayed by chaining the signal handling routine at the end of the chain of interrupt handling routines

Each thread has its own interrupt, trap, system call routines

Usually system calls are synthesized when a quaject (for example, a file) is opened

E.g., specialized write function for a file with a single writer

Evaluations

Incremental optimization!
Quaject synthesis seems quite dynamic. Running quajects could be very fast, but what is the overhead of fabricating quajects dynamically at run time?