Remzi's notes:
--------------
- locks are cheap for both user-level and kernel-level threads
- hence, in the paper, the benchmark workload is just creating a thread
- User-level: most operations (create, lock, sync) are fast, but there is a lack of system integration (the kernel doesn't know what is going on)
- Kernel-level: most operations are slow (crossing the protection boundary, checking arguments, ...), but system integration is good

**Scheduler Activations: Effective Kernel Support**
**for the User-Level Management of Parallelism**
===================================================

# 0. Take away
--------------
- user-level threads: lack of system integration
    + think about what happens when a thread blocks (page fault, I/O)
    + or when a thread is preempted while holding a lock
- kernel threads:
    + poor performance: crossing the protection boundary on every operation, checking, ...
    + lack of flexibility: one size no longer fits all; many different application needs must be supported
- high-level solution: tighter user-level / kernel integration. Instead of giving the user-level system the abstraction of a processor (the traditional approach), give it the abstraction of *a multiprocessor*.
    + the kernel upcalls to the user-level system on events that affect scheduling decisions ==> the user-level system is free to enforce its own scheduling policy on its threads
    + the user-level system notifies the kernel of events that affect processor allocation
- dealing with preemption while a thread is holding a spin lock:
    + let the thread continue a little while, to the end of the critical section, then return control
    + how: make a copy of each critical section, and place return code at the end of the copy
    + check the program counter (PC) to see whether the thread is inside a critical section (details in section 4.3)
- vs. asynchronous I/O:
    + SA works uniformly for I/O, page faults, and processor preemption
    + fewer changes to user-level code and to the kernel
- vs. Hydra's scheduling policy:
    + Hydra also lets apps decide which thread to run,
      ~ i.e., moves scheduling out of the kernel
      ~ BUT: at a performance cost; scheduling decisions require communication through the kernel to a scheduling policy server, and then back to the kernel to perform the context switch
    + SA: an app can set its own policy AND implement that policy without trapping to the kernel

# 1. INTRODUCTION
-----------------
- user-level threads:
    + managed by a run-time library linked into the application code
    + no kernel intervention
    + good performance (as long as there are no system events like I/O, page faults, ...)
    + flexible: can be customized to the needs of the application
    + the thread package views each process as a *virtual processor*, which the kernel multiplexes across the real physical processors
    > poor performance under page faults, I/O, and multiprogramming (think about why: a user thread blocks on I/O, the kernel thread backing it blocks too, leaving the physical processor unused)
      e.g., if the kernel is not aware of user threads, one thread blocking can block all the threads that share its kernel thread
    > the kernel interprets these events on its own
- kernel threads:
    + direct kernel support for multiple threads per address space
    + threads are scheduled directly by the kernel (not by a user-level thread scheduler)
    > hence, events like I/O and page faults are handled correctly
    > a blocked thread does not affect the others
    + but *poor performance*: heavyweight
- the dilemma: what to use? kernel threads "work right" but perform poorly; user-level threads on top of kernel threads or processes perform well but are functionally deficient.
Need to differentiate:
- an application using a user-level thread library
- an application using a kernel-thread library
- a user-level thread management system built on top of kernel thread support
==> this last model leads to a lot of problems, which are the main topic of this paper

# 1.2 Goals
- functionality: the pros of a kernel thread management system (even under I/O, page faults, multiprogramming)
    + no processor idles while there are ready threads
    + no priority inversion
    + when a thread blocks, the processor it was running on can be used to run another thread (in the same or a different address space)
- performance: the pros of user-level threads
- flexibility: freedom to change the scheduling policy (keeping the decision at user level) or the concurrency model

*The challenges*: the gap between user level and kernel
- the kernel needs user-level scheduling information (e.g., how much parallelism there is in each address space)
- the user level needs to be aware of certain transitions that are normally hidden in the kernel (e.g., processor reallocations and I/O events)

# 1.3 Approach: information exchange between user level and kernel
- the application is given the abstraction of a *virtual multiprocessor* (instead of a virtual processor as in traditional user-level thread systems)
- the application knows the number of processors allocated to it
- the application decides the scheduling policy for its threads (not the kernel)
- the kernel controls processor allocation
- the kernel notifies the address space of events affecting user-level scheduling (I/O, a thread blocking, preemption, ..., rather than interpreting these events on its own) ==> increased functionality
- the application notifies the kernel of events affecting processor allocation (e.g., when it needs more processors) ==> good performance is preserved, since most events (like ordinary scheduling decisions) do not need to be reflected to the kernel

All of this is encapsulated in a kernel mechanism, the *scheduler activation* (interface sketched below):
- an execution context (like a kernel thread) for vectoring events from the kernel to the address space
- the address space uses this context to handle the event:
    + execute user-level threads
    + modify its data structures
    + make requests to the kernel
... plus a modified user-level thread package that leverages scheduler activations.

QUESTION: what are the differences and similarities between kernel threads and scheduler activations?
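A rough sketch of the resulting kernel/user interface, paraphrasing the upcall and downcall points the paper lists; the C types and exact signatures here are my own illustrative assumptions, not the paper's actual API:

```c
/* Sketch of the scheduler-activation interface, paraphrased from the
 * paper's upcall/downcall points. Types and exact names are invented
 * for illustration. */

typedef int processor_id;
typedef int activation_id;
struct machine_state;            /* saved registers, PC, SP of a thread */

/* Upcalls: kernel -> user-level thread system. Each is delivered on a
 * fresh scheduler activation running on a physical processor. */
void add_this_processor(processor_id p);
void processor_has_been_preempted(activation_id preempted,
                                  struct machine_state *state);
void sa_has_blocked(activation_id blocked);
void sa_has_unblocked(activation_id unblocked,
                      struct machine_state *state);

/* Downcalls: user-level thread system -> kernel. Only events that
 * affect processor allocation cross the boundary; ordinary thread
 * operations (create, switch, lock) never do. */
void add_more_processors(int how_many);
void this_processor_is_idle(void);
```

Note the asymmetry: every scheduling-relevant event is vectored up to the address space, while only allocation-relevant events ever go down to the kernel.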
# 2. USER-LEVEL THREADS: good performance but limited functionality
-------------------------------------------------------------------
- user-level threads:
    + perform better than kernel threads
    + flexible
    + but lack system integration due to inadequate kernel support

# 2.1 The case for user-level thread management
- kernel threads have *poor performance*: the cost of accessing thread management operations
    > extra protection boundary
    > extra kernel trap
    > extra parameter copying and checking
  ==> user-level threads are faster:
    > no cross-boundary checking
    > the address space itself is the protection vehicle
    > can leverage compiler techniques: register allocation, inline expansion
- kernel threads have *poor flexibility*: the cost of generality; one size does not fit all
    > a single kernel thread model cannot satisfy all applications' needs
  ==> user-level threads are flexible, because each app is linked with the run-time library that best fits its needs

# 2.2 Sources of poor integration in user-level threads built on the traditional kernel interface
- lack of kernel support
- kernel threads are the wrong abstraction for supporting user-level thread management:
    + kernel events are handled *invisibly* to the user level
    + kernel threads are scheduled *obliviously* with respect to user-level thread state
  This is bad, because it causes performance penalties for user-level threads (see the examples below).

Example 1: #kernel threads = #physical processors
    + when a user thread blocks on a page fault, the kernel thread serving it blocks too --> the physical processor is *lost*, because there is no spare kernel thread to run other user-level threads

Example 2: allocate more kernel threads than physical processors, so that when a kernel thread blocks on a page fault, another kernel thread is available to run user-level threads on that processor
    ==> problem: when the page fault returns, there are more runnable kernel threads than processors, each carrying a user-level thread context
    ==> scheduling these kernel threads implicitly schedules the user-level threads, which causes several problems (see below)
- problems of time-slicing kernel threads (when there are more kernel threads than processors):
    + *spin locks*: if a kernel thread is preempted while its user-level thread holds a spin lock, the other user-level threads spin-wait until the lock holder is rescheduled ==> poor performance (a minimal spin lock is sketched at the end of this section)
    + *wasted CPU time*: a kernel thread running a user-level thread can be time-sliced out in favor of a kernel thread that has no user-level work available to it
    + *priority inversion*: a kernel thread running a high-priority user thread can be descheduled in order to run a kernel thread that happens to be running a low-priority user thread
    ==> all due to the semantic gap between user threads and kernel threads
  The same problems occur under multiprogramming (see the paper).

Example 3: ensuring *logical correctness* of a user-level thread system built on kernel threads can be difficult.
- because of the problems above, kernel threads can be preempted regardless of the state of the corresponding user threads (idle, holding a lock, high priority, and so on)
--> the assumptions that applications built on user-level threads rely on (e.g., all runnable threads eventually receive processor time) may no longer hold
--> can lead to deadlock
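To make the spin-lock hazard above concrete, here is a minimal C11 test-and-set spin lock (my own example, not code from the paper); the comment marks where oblivious kernel scheduling hurts:

```c
#include <stdatomic.h>

typedef struct { atomic_flag held; } spinlock;

void spin_lock(spinlock *l) {
    /* If the kernel time-slices out the kernel thread whose user-level
     * thread currently holds 'held', every other user-level thread
     * reaching this loop burns its whole quantum spinning: the kernel
     * schedules kernel threads obliviously to this user-level state. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ; /* spin */
}

void spin_unlock(spinlock *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```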
# 3. EFFECTIVE KERNEL SUPPORT for USER-LEVEL Management of Parallelism
-----------------------------------------------------------------------
- each user-level thread system is given its own *virtual multiprocessor*:
    + the kernel has *complete* control over (virtual) processor allocation
    + the user-level thread system (ULTS) has *complete* control over its scheduling policy
    + the kernel notifies the ULTS of events affecting its scheduling:
        > changes to the number of processors assigned to the ULTS
        > a user-level thread blocks
        > a user-level thread wakes up
    + the ULTS notifies the kernel when it wants more or needs fewer processors (the events that affect kernel processor allocation) ==> allows the kernel to distribute processors efficiently among address spaces
    + the application programmer sees no difference

Question: what if a user-level thread system lies to the kernel by asking for more and more processors?
- OS profiling
- use multi-level feedback to encourage applications to provide honest information for processor allocation decisions:
  favor address spaces that use fewer processors and penalize those that use more.

# 3.1 Explicit vectoring of kernel events to the ULTS
- scheduler activation (SA): used to communicate between kernel and ULTS
    + an execution context (for running user-level threads, making requests to the kernel, ...)
    + notifies the ULTS of a kernel event
    + provides space to save the processor context of the user-level thread (associated with that SA) when the thread is stopped by the kernel
- quite similar to a kernel thread:
    + 2 execution stacks:
        > kernel stack: used e.g. when executing a system call
        > user-level stack
- how does it work?
    + program start: the kernel creates an SA and upcalls to user level; the ULTS uses that SA as the context to initialize itself and run the main application thread
    + adding a processor:
        1) the ULTS requests an additional processor
        2) the kernel creates a new SA and upcalls to the ULTS
        3) the ULTS chooses a thread to execute in the context of that SA
    + notification of a kernel event (a thread blocks, ...):
        1) the kernel creates a new SA and upcalls to the ULTS
        2) this new SA can be used to handle the event, run user threads, etc. (see the handler sketch at the end of this section)
- SA vs. kernel thread:
    + the data structures are somewhat similar (2 execution stacks, ...)
    + but when the user-level thread running on an SA is stopped by the kernel (e.g., due to multiprogramming), that SA is never resumed by the kernel; instead, a new SA is created and upcalled to the ULTS, which removes the thread state from the old activation, tells the kernel that the old activation can be reused, and decides which thread to run on the (just handed over) processor
    + with kernel threads, once a user thread blocks, its kernel thread blocks too, and it is resumed by the kernel, not by the ULTS
- SAs can be used with any application programming model (SA is a mechanism, not a policy)

*Example* (refer directly to the paper, page 100):
- what happens when a user-level thread blocks in the kernel
- what happens when a processor is preempted
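A minimal sketch of what the ULTS side of the "activation has blocked" upcall could look like under the flow above; every name here (ready queue, context-switch helper, downcall) is a hypothetical placeholder, not the paper's code:

```c
/* Hypothetical ULTS handler for the "scheduler activation has blocked"
 * upcall. The kernel delivers it on a brand-new activation that now
 * owns the processor the blocked activation was using. */

typedef int activation_id;
typedef struct thread thread;          /* ULTS thread control block */

extern void    mark_blocked(activation_id a);  /* bookkeeping: its thread is off CPU */
extern thread *ready_dequeue(void);            /* next runnable user-level thread    */
extern void    run_thread(thread *t);          /* user-level context switch; no return */
extern void    this_processor_is_idle(void);   /* downcall: hand the processor back  */

void sa_has_blocked(activation_id blocked) {
    mark_blocked(blocked);            /* record that its user thread is blocked */

    thread *next = ready_dequeue();   /* reuse this processor immediately */
    if (next)
        run_thread(next);             /* does not return */
    else
        this_processor_is_idle();     /* nothing runnable: return the processor */
}
```

The key contrast with kernel threads is visible here: the blocked activation's processor comes straight back to the ULTS, which decides what runs next, instead of the kernel silently picking another kernel thread.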
What if...
- user-level threads have priorities? ==> more downcalls: the ULTS asks the kernel to stop the processor running the low-priority thread
- preemption (or a page fault) hits the user-level thread manager itself, when no user-level thread is running? ==> the kernel saves the state for the ULTS; subsequent upcalls allow the ULTS to recover
- a user-level thread blocked in the kernel still needs to execute in kernel mode when its I/O completes? ==> the kernel resumes the thread *temporarily* until it blocks again or reaches the point where it would leave the kernel

# 3.2 Notifying the kernel of user-level events affecting processor allocation
- the ULTS notifies the kernel only when
    + it needs more processors (it has more runnable threads than processors)
    + it needs fewer processors
  (Hence the ULTS does not need to tell the kernel about *every* thread operation ==> a performance gain compared to using kernel threads directly, where every thread operation means a kernel trap.)
- note that these notifications are *hints*, not guarantees
    + e.g., if the ULTS requests more processors, the kernel searches for an idle processor (one that some other address space is not using). If none is found, nothing happens, but the address space may get a processor in the future. If the address space no longer needs a processor, it simply returns it to the kernel.
- problem of dishonest address spaces: an address space can misbehave by telling the kernel it needs more and more processors
  ==> solution: favor address spaces that use fewer processors and penalize those that use more (i.e., multi-level feedback; a similar idea appears in ESX Server)
    + this encourages address spaces to give up processors they do not need
- if #threads in the system < #processors ==> idle processors should be left in the address spaces most likely to create work in the near future
    + avoids a reallocation right when the work arrives

# 3.3 Critical sections
- what if a user-level thread is blocked or preempted while in a critical section?
    + poor performance (if it is holding a spin lock, everyone else spins)
    + deadlock (e.g., if it holds the lock on the ready queue and the upcall handler then tries to put it on the ready queue)
- two possible solutions:
    + prevention: a scheduling and locking protocol between kernel and user level, but this has drawbacks:
        > requires the kernel to yield control over processor allocation (violates the semantics of address space priorities)
        > inconsistent with an efficient implementation of critical sections
        > requires pinning to physical memory all virtual pages that might be touched while in a critical section; identifying these pages can be cumbersome
    + recovery: the user-level thread system checks whether the thread was executing in a critical section (how? by a flag, for example), and if so, continues that thread via a user-level context switch until it reaches a safe point, where it relinquishes its processor
      Features:
        > free from deadlock: once a lock is acquired, it will eventually be released
        > works with any user-level spin lock
        > however, there is wasted spin-waiting when a spin-lock holder takes a page fault ==> spinning threads can relinquish the processor after spinning for a while

# 4. IMPLEMENTATION
--------------------
- the kernel deals only with scheduling processors among address spaces
- the thread scheduling policy lives in the user-level run-time library
- critical section optimization: how to detect whether a thread is in a critical section?
    1) set a flag when a lock is acquired ==> performance overhead on every lock acquisition and release (sketched below)
    2) make an exact copy of every low-level critical section (explained in section 4.3)
- creating an SA is not free, hence:
    + cache discarded SAs so they can be reused
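A minimal sketch of the flag scheme from option 1) above (all names invented): the per-thread flag is what the preemption upcall would consult, and maintaining it is exactly the per-lock/unlock overhead that motivates the copy-based scheme in section 4.3:

```c
#include <stdatomic.h>

/* Sketch of the flag-based critical-section check (option 1).
 * Each ULTS thread carries a flag that the preemption upcall consults
 * before deciding whether the thread can safely be stopped. */

typedef struct {
    _Atomic int in_critical;       /* 1 while this thread holds any lock */
    /* ... rest of the thread control block ... */
} thread;

typedef struct { atomic_flag held; } spinlock;

void lock(spinlock *l, thread *self) {
    atomic_store(&self->in_critical, 1);  /* set BEFORE acquiring, so a
                                             preemption between the acquire
                                             and the flag write cannot be
                                             missed; cost paid on EVERY lock */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ; /* spin */
}

void unlock(spinlock *l, thread *self) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
    atomic_store(&self->in_critical, 0);  /* ...and on EVERY unlock */
}
```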
# 4.1. Processor allocation policy
- how to allocate processors across address spaces? Time-sharing vs. space-sharing:
    + the paper chooses space-sharing:
        > reduces the number of processor reallocations
        > (may also make better use of locality / cache / TLB, but that is not fundamental; it could be addressed with longer time slices)

# 4.2. Thread scheduling policy
- the kernel has no knowledge of an app's concurrency model or scheduling policy
- the app is completely free to choose these as appropriate
    > e.g., per-processor LIFO ready queues to improve cache locality

# 4.3. Performance enhancements

1. Dealing with critical sections:
----------------------------------
- how does the user-level thread system check whether a thread is in a critical section (i.e., holding a lock)?
- Approach 1: set a *flag* when a thread acquires a lock; clear the flag when it releases the lock. NOT so good:
    + overhead on every lock acquisition and release, even though preemptions, page faults, and I/O are infrequent
      (Lesson: optimize for the common case, not the infrequent one)
    + the thread holding the lock can still be allocated CPU time while it waits for I/O or a page fault
- Approach 2: no overhead in the common case (i.e., the thread is not holding a lock when preempted). How?
    + make a copy of each critical section
      ~ requires knowledge of the code and compiler support
    + at the end of each copy (not the original), place code that returns control to the original upcall
    + how it works (still a check, but a lightweight one, paid only on preemption):
      ~ when a preemption occurs, the kernel starts a new SA to notify the ULTS
      ~ the activation checks the preempted thread's PC to see whether it is inside a critical section
      ~ if not, just do a normal preemption
      ~ if yes, jump to the corresponding place in the copy and continue until the safe point
      ~ control returns to the original upcall at the end of the copied critical section
    ==> no flag-setting overhead on every lock/unlock, at the cost of some code-space overhead at compile time
    (the PC check is sketched below)

2. Speeding up the creation of scheduler activations:
-----------------------------------------------------
- creating an SA is expensive: data structures must be allocated and initialized
    + solution: cache discarded SAs for reuse
- when the ULTS returns a discarded SA to the kernel, it makes a kernel call, which is also expensive
    + solution: the ULTS returns discarded SAs to the kernel in bulk
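A minimal sketch of the Approach 2 PC check, assuming (this is my assumption, not the paper's stated data structure) that the build emits a table mapping each original critical section to its relocated copy:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of the PC check for Approach 2. We assume the tool chain
 * emits, for each low-level critical section, a copy whose last
 * instructions return control to the upcall, plus this lookup table.
 * Names and layout are invented for illustration. */

typedef struct {
    uintptr_t orig_begin, orig_end;   /* original critical-section code  */
    uintptr_t copy_begin;             /* copy ending in return-to-upcall */
} crit_range;

extern const crit_range crit_table[];
extern const size_t     crit_table_len;

/* Called from the preemption upcall with the preempted thread's PC.
 * Returns the address at which to resume the thread: unchanged in the
 * common case, or the matching offset inside the copy, which runs to
 * the end of the critical section and then jumps back to the upcall. */
uintptr_t resume_pc(uintptr_t pc) {
    for (size_t i = 0; i < crit_table_len; i++) {
        if (pc >= crit_table[i].orig_begin && pc < crit_table[i].orig_end)
            return crit_table[i].copy_begin + (pc - crit_table[i].orig_begin);
    }
    return pc;  /* common case: zero cost was paid at lock/unlock time */
}
```

The design choice this illustrates: the cost of the check is shifted entirely onto the rare preemption path, so the common lock/unlock path pays nothing, at the price of duplicated critical-section code in the binary.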