
Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism

Thomas Anderson, Brian Bershad, Edward Lazowska, and Henry Levy. Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism. ACM Trans. on Computer Systems 10(1), February 1992, pp. 53-79.

Reviews due Thursday, 2/25.

Comments

1. Summary
The paper argues that kernel threads are the wrong abstraction on which to support user-level threads. It describes the design and implementation of a new kernel mechanism, scheduler activations, and a user-level thread package that together provide the performance and flexibility of user-level threads along with the functionality of kernel threads, by giving every address space the abstraction of a virtual multiprocessor.

2. Problem
Threads can be implemented at either user level or kernel level. Kernel-level threads (KLTs) are inherently costly and offer poor performance, with no flexibility for applications to implement their own scheduling policies. User-level threads (ULTs) are managed by run-time libraries linked into application code and are typically implemented on top of kernel-level threads. ULTs require no kernel intervention in normal operation, offering good performance while giving applications the flexibility of customized scheduling policies. However, when a ULT blocks on I/O or a page fault, the KLT serving it blocks as well, and all related ULTs are blocked with it, leaving the physical processor unused. One could allocate more KLTs than physical processors, but KLTs still block and resume invisibly to the user-level library, and the kernel schedules KLTs oblivious to user-level thread state. Hence KLTs are the wrong abstraction for supporting ULTs.

3. Contributions
The paper provides a new kernel mechanism together with a user-level thread package that achieve the performance and flexibility of ULTs when there are no page faults or I/O, and that mimic KLTs on I/O or page faults, ensuring that no processor idles while there are ready threads and that there is no priority inversion. The main contributions are: i) an abstraction of a dedicated physical machine for each application, the virtual multiprocessor, in which the kernel allocates processors to address spaces while the user-level thread library of each address space has complete control over which ULTs run on its allocated processors. ii) scheduler activations, the kernel mechanism that realizes this functionality. A scheduler activation provides the execution context within the kernel on which ULTs execute, notifies the ULT library of kernel events (when an activation blocks, resumes, or is preempted), and provides data structures in the kernel to save the processor context of its current ULT (the interface is sketched below). The key idea is that the kernel keeps exactly as many running scheduler activations as there are processors assigned to the address space; it achieves this neatly by creating an extra activation when one activation blocks and preempting another activation when a blocked one resumes. iii) a way for the user-level library to tell the kernel that it requires more processors or that a processor is idle. iv) a deadlock-free technique: a preempted thread is allowed to temporarily continue executing until it finishes its critical section. Together, these provide a mechanism for proper kernel support of user-level threads.
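For concreteness, the kernel/user interface reduces to a handful of upcalls and downcalls. The sketch below paraphrases the upcall and downcall points the paper lists as C prototypes; the names follow the paper, while the C types are illustrative assumptions:

    typedef struct machine_state machine_state_t;   /* saved registers, PC, etc. */

    /* Upcalls: kernel -> user-level thread system */
    void add_this_processor(int processor);
    void processor_has_been_preempted(int preempted_activation,
                                      machine_state_t *state);
    void scheduler_activation_has_blocked(int blocked_activation);
    void scheduler_activation_has_unblocked(int unblocked_activation,
                                            machine_state_t *state);

    /* Downcalls: user-level thread system -> kernel */
    void add_more_processors(int additional_processors_needed);
    void this_processor_is_idle(void);

Every upcall runs on a fresh scheduler activation, which is what keeps the number of running activations equal to the number of allocated processors.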

4. Evaluation
The paper begins with a motivating measurement showing that thread operations on ULTs are orders of magnitude faster than their KLT counterparts. One key question is whether providing KLT-like functionality on I/O or page faults via scheduler activations hurts performance in the case where there is no I/O or page fault. The paper evaluates this directly and shows that ULTs on scheduler activations incur very little overhead compared to ULTs running on Topaz KLTs. The next question is what happens under heavy I/O or page faults, where plain ULTs are known to suffer: can ULTs running on scheduler activations perform like KLTs? By limiting available memory (and hence inducing more page faults), the paper shows that ULTs on scheduler activations perform better than both ULTs on KLTs and plain KLTs. The next question is whether ULTs benefit parallel applications, which is what the paper evaluates next: ULTs on scheduler activations provide a speedup slightly better than the original ULTs and far better than KLTs. Finally, what happens in the presence of multiple applications? The paper runs two instances of the parallel program used earlier and shows a speedup close to the theoretical limit. This is the only evaluation where I thought they could have done more, say, by running a mix of I/O- and CPU-intensive workloads. They do mention that they are limited by the hardware, which could be the reason this evaluation was not performed.

5. Confusion
Are ULTs supported by modern operating systems?

1. Summary
The paper introduces a hybrid approach combining user threads and kernel threads, using scheduler activations. It provides the flexibility of user threads, application-controlled scheduling, and low overhead for thread management within a single address space, while at the same time remaining aware of kernel events such as processor reallocation, page faults, and I/O completion.

2. Problem
In previous implementations, threads were supported either at user level or at the kernel level. Support at user level could hurt program execution (in performance and, occasionally, correctness) because the thread library is unaware of multiprogramming, I/O, page faults, etc. Implementing kernel threads to counter these issues led to multiple other problems: the threads were too heavyweight for standard parallel programs, and thread scheduling within the same process required kernel traps and unnecessary safety checks. Also, a single general-purpose kernel implementation had to be used across all programs, reducing efficiency and occasionally leading to fairness problems in allocation as well.

3. Contributions
Firstly, the authors present measurements showcasing the overheads of kernel threads in comparison to user threads in common scenarios, and reason about the worst-case scenarios for user-thread mechanisms.

The authors introduce a hybrid approach, combining the strengths of both kernel-thread and user-thread implementations. The kernel allocates a set of processors entirely to a particular address space, within which thread scheduling is left to the user-level thread scheduler, which can prioritize threads and switch between them without involving the kernel. The kernel controls the number of processors allocated to each address space and can add or remove processors based on machine idleness, priorities between address spaces, I/O events, page faults, etc.

The key mechanism behind this approach is the scheduler activation, which the kernel creates and upcalls into a particular address space when it needs to perform a reallocation or inform the address space of a kernel event. The user-level thread scheduler, in turn, informs the kernel about idle processors, requests for extra processors, allocation priorities, etc. The number of kernel traps is greatly reduced in comparison to older kernel-thread implementations.

The authors also study the deadlock and performance problems caused by preemption during critical sections and implement a recovery mechanism to avoid them.

4. Evaluation
The authors implement their modifications atop the Topaz OS for the DEC SRC Firefly multiprocessor system and the FastThreads package. They first evaluate the common case, with limited I/O, and show that their implementation performs akin to user threads with marginal overhead. They then grow the test program's working set so that its I/O increases; at this point plain FastThreads degrades quickly, while the paper's implementation scales well under higher I/O and under multiprogramming as well. While the evaluation supports their claims, it would have been nice to see a broader spectrum of use cases to show the impact versus user threads at a broader scale.

5. Confusion
How real is the argument of kernels implementing some standard scheduling policy vs. applications being suited to a different one? While this is acceptable in theory, don't kernels provide enough knobs to suit a wide range of applications? Even hardware implements sandboxing techniques to find the best policies for applications (e.g., in cache prefetching); couldn't the OS do a reasonable job of the same?

It is not clear why there might be scenarios in which some threads receive no processor time (leading to deadlock) when many user threads are multiplexed onto a smaller number of kernel threads. How would this fail with, say, a two-level round robin?

Late discard of activation A in the illustrated example.

Problem:
Managing parallelism at the user level is essential to high-performance parallel computing, but kernel threads and processes are a poor abstraction on which to support it.

Summary:
The paper describes the design, implementation, and performance of a kernel interface and a user-level thread package that together combine the performance of user-level threads with the functionality of kernel-level threads. Each address space is given a virtual multiprocessor in which the application knows exactly how many processors it has and exactly which of its threads are running on those processors. Responsibilities are divided between the kernel and each application address space.

Contribution:
In my opinion:
i) The whole concept of the scheduler activation, through which the kernel and user application interact and notify each other of important events. The address-space thread scheduler uses this context (the scheduler activation) to handle the event, modify its data structures, and execute user-level threads.
ii) The idea that the kernel should not relinquish its power to allocate processors and schedule address spaces based on priorities, while at the same time notifying the application so it can take the steps necessary to prevent performance degradation from those kernel actions. The overall handshake between the two is elegant.
iii) The performance optimization for management of critical sections is very innovative.

Evaluation :
The design was implemented by modifying the FastThreads user-level package on top of Topaz kernel threads on a DEC Firefly multiprocessor system. Overall, all major aspects of the system are evaluated across three configurations (FastThreads on Topaz threads, FastThreads on scheduler activations, and Topaz threads) in three categories: the cost of user-level operations, the cost of communication between the kernel and user level, and overall application performance.
When the costs of the basic Null Fork and Signal-Wait operations are compared, the authors not only show that they are nearly the same as in the original FastThreads, but also explain that they would have been worse without the critical-section optimization (zero overhead for marking that a lock is held). I like this because it helps quantify the benefit of that optimization.

Upcall performance: the overhead added by the scheduler-activation machinery is seen to be 5x worse than Topaz threads, evaluated by measuring the time for two user-level threads to signal and wait through the kernel. The authors claim that this difference is not inherent to the scheduler activation mechanism but due to implementation details which, if tuned, would bring upcall performance in line with Topaz threads. However, no profiling of these issues is presented, so the claim remains somewhat obscure; the overall argument is that the analysis is pessimistic. In my opinion, upcall performance is a very important aspect, and the communication time between two threads should have been evaluated under different conditions (memory available during communication, varying number of threads, varying thread-to-processor ratios) to show that nothing inherent to scheduler activations is responsible for the constant 5x degradation.

Application performance: measured for different scenarios: no kernel involvement, increasing kernel involvement, and a multiprogrammed environment. I really like the table and the explanation of the speedup of two N-body applications running on six processors with 100% of memory available, shown to be 2.45 for the new FastThreads, which clearly shows that the new FastThreads does better in this base case. This explains the concept clearly but does not establish performance in corner cases. More complicated scenarios could have been evaluated, such as increasing the application-to-processor ratio; just two applications and six processors is not sufficient for a complete analysis.

Confusion :
Where does the user-level scheduler run from? Which stack does it live on?

1) Summary: The authors begin by laying out the pros and cons of kernel-level and user-level threads. User-level threads are more flexible and perform better but integrate poorly with the rest of the system, whereas kernel-level threads have fewer restrictions when accessing kernel services. Taking their cue from this, the paper discusses the design and implementation of a hybrid thread interface (a user-level package plus a kernel interface) to support parallelism, in which the interface presented to application programmers remains unchanged.

2) Problem: The authors state that user-level and kernel-level threads have their own advantages and drawbacks. User-level threads exhibit fast performance, flexibility, and no need to modify the underlying kernel, yet performance can dip due to the mismatch between virtual and physical processors in scenarios such as page faults, I/O, and multiprogramming. Kernel-level threads efficiently avoid idle processors and integrate well with the system, but are an order of magnitude slower than user-level threads. The authors attempt to solve the problem of how a programmer can combine the functionality of kernel-level threads with the performance and flexibility of user-level threads.

3) Contribution: The bottleneck in the old paradigm is that kernel-level thread operations are executed as system calls, with extra context switches, protection, and generality, whereas user-level threads are application-specific and their operations are mere function calls into a user-level library. In a single-threaded kernel, a blocking system call can block the entire process, and the absence of an integrated system hurts performance. The primary problem is addressed in terms of sub-problems: i] how to avoid idle processors when a kernel call is needed, maintain thread priorities, and return processors to the address space after a system trap; ii] an easy way to support application customization by changing policies; iii] how user threads that never call into the kernel can achieve performance comparable to ordinary user-level threads. The design contributions are: a] scheduler activations, a mechanism by which the kernel vectors control to an address space on kernel events: it creates a scheduler activation and performs an upcall into the application address space; b] the kernel is responsible for allocating processors to application address spaces, while each address space does its own thread scheduling; c] the user-level scheduler, on being notified of kernel events, passes back information about events that affect processor-allocation decisions; d] the kernel is notified by the user application when there are more runnable threads than processors or vice versa, and a multilevel feedback mechanism ensures resources are shared proportionally; e] checking whether preempted or blocked user threads are in a critical section; deadlock is prevented by continuing the thread temporarily, with user-level context switching, until the critical section is exited; f] the processor-allocation policy guarantees that no processor idles while there is work to do and shares processors according to priorities.

4) Evaluation: The authors evaluated their design by modifying the Topaz OS on the DEC SRC Firefly multiprocessor. The (fast) user-level thread package and the original Topaz kernel threads are compared against FastThreads on scheduler activations. FastThreads on scheduler activations demonstrated thread performance similar to plain FastThreads on the two microbenchmark operations, and much better than kernel threads. In the case of negligible I/O in an N-body simulation, FastThreads on scheduler activations outperformed both plain FastThreads and kernel threads. When I/O mattered, FastThreads on scheduler activations performed better than plain FastThreads owing to its better integration with the kernel. The authors state that an overall three-fold speedup was achieved. Although the authors carried out a satisfactory evaluation, it would have been more useful to evaluate the system using realistic workloads; profiling the workloads could help show where the majority of time is spent in the hybrid interface.

5) Confusion: If this hybrid interface obtains better performance, why do prevalent operating systems still opt for the conventional one-to-one kernel-thread to user-thread design?

1. summary
The paper proposes kernel support for user-level parallelism that combines the functionality of kernel threads with the flexibility and performance of user-level threads. It distributes the necessary control and scheduling information between the kernel and the application's address space using the scheduler activation and virtual multiprocessor abstractions.

2. Problem
User-level threads are managed by runtime library routines linked directly into the application, so thread operations require no kernel intervention. This yields good performance and flexibility, but the approach suffers from poor performance during page faults and I/O. On the other hand, the kernel can provide direct support for multiple threads per address space, but this suffers from performance problems due to heavy kernel involvement in every thread operation. It is also possible for user-level threads to be implemented on top of kernel-level threads, but this approach suffers from similar problems.

3. Contributions
The operating system kernel provides each user-level thread system with its own virtual multiprocessor, an abstraction of the physical machine. The user-level thread system has complete control over which threads run on the allocated processors. It is notified when the kernel changes the number of allocated processors or when a thread blocks or wakes up, and it in turn notifies the kernel when it needs more or fewer processors. A drawback here is that applications might not be honest about their processor requirements, but a feedback system can be used to enforce honesty.

The scheduler activation is the abstraction that vectors kernel events to the user-level thread scheduler. It provides space in the kernel to save the context of the activation's current user-level thread. When the kernel needs to notify the user-level system of an event, it assigns a processor to the address space and initiates an upcall on that processor.

To prevent poor performance or deadlock when a thread running in a critical section is preempted, a recovery-based approach is proposed: if a thread executing in a critical section is preempted, it is allowed to continue, via a user-level context switch, until it reaches the end of the critical section, and only then is it truly preempted. A sketch of this idea follows.
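A minimal sketch of the recovery idea, with hypothetical names (the paper does this with compiler and assembler support; plain C stands in here): each critical section is registered along with an exact copy whose last instruction yields back to the scheduler, and the preemption handler redirects a caught thread into the copy.

    #include <stdint.h>

    typedef struct uthread uthread_t;          /* user-level thread descriptor */
    extern uintptr_t thread_pc(uthread_t *t);  /* assumed: saved program counter */
    extern void set_thread_pc(uthread_t *t, uintptr_t pc);
    extern void run_until_yield(uthread_t *t); /* assumed: resume at user level */
    extern void make_ready(uthread_t *t);      /* assumed: push on ready list */

    /* Each critical section [start, end) has a copy whose final
     * instruction yields control back to the user-level scheduler. */
    typedef struct {
        uintptr_t start, end;    /* bounds of the original section */
        uintptr_t copy_start;    /* start of the yielding copy */
    } crit_section_t;

    extern crit_section_t sections[];
    extern int n_sections;

    /* Invoked from the preemption upcall with the preempted thread's state. */
    void recover_preempted(uthread_t *t)
    {
        uintptr_t pc = thread_pc(t);
        for (int i = 0; i < n_sections; i++) {
            crit_section_t *cs = &sections[i];
            if (pc >= cs->start && pc < cs->end) {
                /* Continue the thread in the copy so it finishes the
                 * critical section, then yields, before being descheduled. */
                set_thread_pc(t, cs->copy_start + (pc - cs->start));
                run_until_yield(t);
                break;
            }
        }
        make_ready(t);  /* safe now: no lock is held */
    }

The benefit over a held-lock flag is that the common (no-preemption) case pays zero overhead.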
4. Evaluation
The Topaz kernel thread management was modified to support scheduler activations, and FastThreads was modified to process upcalls. The cost of user-level thread operations is shown to be roughly the same as in plain FastThreads, so the system preserves the order-of-magnitude advantage of user-level threads over kernel-level threads. The small degradation observed for the Null Fork and Signal-Wait operations is well justified. When an application makes minimal use of kernel services, it runs as quickly as on the original FastThreads; when the application requires kernel involvement (e.g., when it does I/O), the scheduler-activation approach outperforms the FastThreads approach. Upcall performance is significantly slower than Topaz kernel-thread operations; this is attributed to implementation specifics (more state to maintain) and not directly to scheduler activations. Since upcall performance is critical to this design, more emphasis should have been placed at the implementation stage on making upcalls fast.
5. Confusion
Comparison of FastThreads vs scheduler activation vs Linux thread in terms of communication between kernel and user address space and handling of critical sections.

1. Summary
The paper describes the design, implementation and performance of a new kernel interface and a user-level thread package that together provide the same functionality as kernel threads without compromising the performance benefits of user-level threads. The design also provides the advantages of flexible user-level management of parallelism. The paper also compares kernel- and user-mode thread schemes and presents the various trade-offs.

2. Problem
Threads can be supported at either user level or kernel level. User-level threads have the performance advantage and can be customized to the needs of the language or user without kernel intervention. However, under multiprogramming, I/O, and page faults, user-level threads suffer from poor performance or incorrect behaviour. Kernel-level threads address these problems but are very heavyweight. The paper addresses this by combining a kernel interface with a user-level thread package to harness the advantages of both user- and kernel-level threads.


3. Contributions
The paper points out the disadvantages of both kernel-level and user-level threads and presents a new thread-management design.
Each application is provided with an abstraction of a dedicated physical machine called the virtual multiprocessor. Each application has complete control over which of its threads run on its allocated processors, while the kernel retains complete control over the allocation of processors among address spaces. To achieve this, the kernel mechanism of scheduler activations is used, allowing the application to keep full knowledge of its scheduling state.
A scheduler activation vectors control from the kernel to the address-space thread scheduler on a kernel event; the thread scheduler can use the activation to modify user-level thread data structures, execute user-level threads, and make requests of the kernel.

4. Evaluation
The authors evaluate their implementation of the thread-management scheme by measuring the cost of user-level thread operations such as fork and signal-wait. Their implementation preserves the order-of-magnitude speedup over kernel threads. Upcall performance, however, is much slower than kernel threads; it would have been interesting to see experimental data analyzing that overhead. Their implementation shows better speedup with the number of processors than kernel threads and unmodified user threads, and also performs faster when the percentage of available memory is low. However, this result was presented using only one application; they do not present results on a mix of compute- and I/O-intensive applications.

1. Summary
This paper describes a new design for user-level thread management. It claims that neither kernel threads nor user-level threads alone achieve satisfactory performance, and that for user-level threads much of the disadvantage can be avoided with a better kernel abstraction.
2. Problem
Threads can be supported either at user level or in the kernel, but each approach has benefits and weaknesses. For kernel threads, the overhead of crossing into the kernel for every thread operation is heavy. For user-level threads, lack of system integration is the problem: when a user-level thread blocks on I/O or a page fault, the whole process blocks with it.
3. Contributions
The goal of the paper is to design a user-level thread management system that does not suffer from the problems of traditional user-level thread libraries. The kernel abstracts the physical processors as a virtual multiprocessor, and their allocation and deallocation are communicated to the user-level thread scheduler using a structure called a scheduler activation. When the kernel intervenes in the activity of user-level threads, as on an I/O block or a hardware preemption of a user thread, it notifies the user-level thread scheduler via a new scheduler activation carrying information about the virtual processors, so that the scheduler can redistribute its user-level threads across the currently available processors. In the other direction, the user-level scheduler may ask the kernel for more processors, or return some when they are idle.
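A hedged kernel-side sketch of that flow (all names are hypothetical; the real Topaz changes are more involved). When an activation blocks, the kernel hands the processor back to the same address space through a fresh activation; when the I/O completes, it must first obtain a processor to deliver the notification:

    typedef struct activation { int id; } activation_t;
    typedef struct processor processor_t;
    typedef struct space space_t;   /* an address space */

    extern activation_t *new_activation(space_t *s);
    extern processor_t *preempt_some_processor(space_t *s, activation_t **preempted);
    extern void upcall_blocked(activation_t *fresh, processor_t *p, int blocked_id);
    extern void upcall_unblocked_and_preempted(activation_t *fresh, processor_t *p,
                                               int unblocked_id, int preempted_id);

    /* A user-level thread has blocked in the kernel (e.g., on a page fault). */
    void on_block(activation_t *old, processor_t *p, space_t *s)
    {
        activation_t *fresh = new_activation(s);
        /* The processor stays with the address space: the fresh activation
         * carries an upcall so the user-level scheduler can run another thread. */
        upcall_blocked(fresh, p, old->id);
    }

    /* The blocked thread can continue: the kernel needs a processor to say so. */
    void on_unblock(activation_t *blocked, space_t *s)
    {
        activation_t *preempted;
        processor_t *p = preempt_some_processor(s, &preempted);
        activation_t *fresh = new_activation(s);
        /* One upcall reports both events; the user-level scheduler decides
         * which of the two ready threads (if either) now runs on p. */
        upcall_unblocked_and_preempted(fresh, p, blocked->id, preempted->id);
    }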
4. Evaluation
The paper evaluates performance by comparing a modified version of the Topaz operating system that implements scheduler activations, running the FastThreads user-level thread library, against Topaz kernel threads and against FastThreads running on top of them. In terms of thread-operation latency, scheduler activations show slightly higher delay than user-level threads running on top of kernel threads, but are an order of magnitude better than naïve kernel threads. The paper gives a detailed explanation of where the scheduler-activation overhead comes from. It also tests an N-body application to measure application performance, and the results show that user-level threads running on scheduler activations benefit the most from multiple processors.
5. Confusion
How can a user-level thread be preempted in favor of the user-level scheduler without trapping into the kernel?

Summary
This paper details the design, development, and evaluation of a user-level thread system and the required kernel support to achieve effective user-level management of parallelism on a multiprocessor system. This arrangement combines the functionality of kernel threads with the performance and flexibility of user-level threads.

Problem
User-level threads are effective in managing parallelism in user-level applications due to their higher performance (over kernel threads) and the flexibility they give the application programmer to select a concurrency model. However, user-level threads without kernel support do not integrate well with the rest of the system: performance degrades when the system is multiprogrammed or dealing with I/O, making both performance and correctness hard to ensure. User-level threads built on top of kernel-level threads suffer from similar limitations. Kernel threads, on the other hand, carry high overhead, requiring a kernel trap plus copying and checking of parameters for security. Also, using kernel threads forces the kernel to offer a single general-purpose implementation that must serve every possible application, which keeps it generic and slow.

Contribution
The authors developed a system that provides effective kernel support for user-level management of parallelism. The kernel gives each address space its own virtual multiprocessor, with a user-level thread manager responsible for interacting with the kernel and making user-thread scheduling decisions. The kernel is responsible for allocating processors to address spaces and for notifying each process's user-level thread system of processors allocated to or deallocated from its address space. The kernel also communicates key thread events (blocked, woken up, preempted) to the thread scheduler instead of interpreting them itself, while the user-level thread system notifies the kernel of the subset of thread operations that may affect processor-allocation decisions. The application programmer sees no difference, yet the underlying thread concurrency model can now be chosen to suit the needs of the application.

Scheduler activations (execution contexts for user threads) and the upcalls the kernel makes on them communicate key events to the user-level thread system, such as processor allocation and deallocation and activation transitions to blocked or unblocked states. When a user-level thread blocks in the kernel or is preempted, no expensive data copying from the kernel is required to restore the user thread state, since that state is kept in the thread control block at user level. As opposed to kernel threads, preemption of a scheduler activation requires creating a new scheduler activation to inform the corresponding user-level thread system of the preemption. The implementation also uses notifications/hints that allow processes to request more processors from the kernel; mismanagement of this feature is discouraged by a multilevel feedback scheme that penalizes processes that hog processors. The authors also developed a recovery-based solution to the problem of inopportune preemption, which can cause poor performance and/or deadlock: instead of using a flag to indicate holding of a lock, they execute a modified copy of the critical section that makes the thread yield on completing it. Another optimization performed is the caching of discarded scheduler activations (a minimal sketch of this follows). The implementation of this design uses dynamic space-sharing of processors to reduce the number of processor reallocations.
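A minimal sketch of that caching optimization, assuming a hypothetical activation struct with an intrusive free-list link; the point is simply that discarded activations (and their two stacks) are recycled rather than deallocated:

    typedef struct activation {
        struct activation *next;        /* free-list link */
        void *user_stack, *kernel_stack;
        int id;
    } activation_t;

    extern activation_t *allocate_fresh(void);   /* assumed slow path */

    static activation_t *free_list;              /* discarded activations */

    activation_t *get_activation(void)
    {
        activation_t *a = free_list;
        if (a != NULL) {                 /* fast path: reuse stacks and slot */
            free_list = a->next;
            return a;
        }
        return allocate_fresh();         /* slow path: new stacks, new slot */
    }

    void discard_activation(activation_t *a)
    {
        a->next = free_list;             /* cache for the next kernel event */
        free_list = a;
    }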

Evaluation
The design suggested by the authors was implemented by modifying the FastThreads user-level thread library and the Topaz kernel thread-management routines. The authors compare their implementation, on the Topaz OS running on a Firefly machine, against kernel-level Topaz threads and the original FastThreads. The first test evaluates the null-fork and signal-wait calls, showing around 10% performance degradation for Null Fork and 15% for Signal-Wait. The upcall overhead was indirectly measured through a signal-wait test with kernel-level synchronization, which found the authors' implementation to perform significantly worse. Another analysis ran the N-body problem with an increasing number of processors and found the authors' implementation to scale best. The same problem was run with increasing memory pressure, where the scheduler-activation-based FastThreads implementation gave the best performance over the other two (original FastThreads and kernel-level Topaz threads).

This evaluation demonstrates the basic viability of using user-level threads with explicit kernel support for running parallel, inherently scalable workloads such as the N-body problem. However, the authors could have better explained the performance degradation observed in the Null Fork, Signal-Wait, and upcall measurements, or rerun the tests against an improved implementation. Also, the execution-time analysis for the N-body problem could have been improved with profiling to evaluate exactly the overhead of the user-level thread system.

Questions/ Confusion
1. Is this design being currently used?
2. The handling of the page fault for the user-level thread manager was not clear.

1. Summary
The authors of this article describe the design, implementation, and performance of a kernel interface and a user-level thread package that together manage parallelism to attain high-performance parallel computing in a flexible manner. They use the scheduler activation mechanism to share useful information between user space and the kernel, without the kernel needing to know user-level data structures. This was implemented on the Firefly using Topaz and FastThreads.

2. Problem
Threads were introduced to remove the notion of a single execution context per process in order to support parallel programming; shared-memory processes handled only coarse-grained parallelism, since they were designed for multiprogramming on a uniprocessor. Threads can be kernel- or user-level, but neither is fully satisfactory on multiprocessors. User-level threads are managed by runtime libraries and require no kernel intervention, which makes them flexible and gives excellent performance, but multiprogramming, I/O, or page faults can render them useless, to the extent of exhibiting incorrect behavior. Deadlock can even arise when the user threads have consumed all kernel threads and none remains to make progress. Kernel threads, meanwhile, are too heavyweight and lack the user-level knowledge of thread state within an address space; the kernel may preempt a running thread to schedule an idle user thread, wasting time.

3. Contribution
The idea provides kernel-level thread functionality with user-level thread flexibility and performance. The mechanism uses a so-called N:M strategy that maps N application threads onto M kernel entities, or virtual processors. A single uniform mechanism handles processor preemption, I/O, and page faults. The application can set its own policy for scheduling its threads onto its processors and implement it without trapping into the kernel each time; the kernel eliminates the need for time-slicing by notifying the application's thread system of each event and keeping the number of execution contexts constant. Each time the kernel decides to take a processor away from an address space, a multi-step handoff follows (sketched below): send the processor an interrupt and stop the old activation; use that processor to do an upcall into the new address space with a fresh activation; then preempt a second processor of the old address space and use it to notify that address space of both preemptions, so its user-level thread scheduler can decide which threads to run on the remaining processors, even giving it the option of maintaining cache locality. This respects the priorities of the threads contending for a processor and ensures no processor is idling. A few optimizations are made: discarded scheduler activations are cached for reuse, and critical sections are handled with special assembler labels so that, instead of setting and clearing a flag, a preempted thread continues in a copy of the critical-section code until it reaches a safe place.
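A commented sketch of that handoff, with hypothetical helper names, when the kernel moves a processor from address space A to address space B:

    typedef struct activation { int id; } activation_t;
    typedef struct processor processor_t;
    typedef struct space space_t;

    extern activation_t *interrupt_and_stop(processor_t *p);  /* assumed */
    extern processor_t *pick_processor(space_t *s);
    extern void upcall_add_processor(space_t *b, processor_t *p);
    extern void upcall_two_preemptions(space_t *a, processor_t *q,
                                       activation_t *x, activation_t *y);

    void reallocate(processor_t *p, space_t *A, space_t *B)
    {
        /* 1-2: interrupt p and stop A's current activation on it. */
        activation_t *oldA = interrupt_and_stop(p);

        /* 3: reuse p immediately for an upcall into B on a fresh activation. */
        upcall_add_processor(B, p);

        /* 4: preempt a second processor still belonging to A... */
        processor_t *q = pick_processor(A);
        activation_t *oldA2 = interrupt_and_stop(q);

        /* 5: ...and use it for a single upcall into A reporting both
         * preempted activations, so A's user-level scheduler chooses
         * which threads keep running on its remaining processors. */
        upcall_two_preemptions(A, q, oldA, oldA2);
    }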

4. Evaluation
Thread performance on the CVAX Firefly is slightly worse than FastThreads on Topaz threads without scheduler activations, owing to added logic for checking whether the kernel must be notified, resuming preempted threads, and maintaining the count of busy threads; this was found via the null-fork and signal-wait operations. The authors claim that because they did not build the system from scratch, overhead from storing extra state is the cause; they could have backed that up with latency profiling. The upcall path adds significant latency, which again could have been better profiled to isolate the parts adding to the delay. A proof of concept mirroring how SRC RPC costs were reduced by recoding Modula-2+ in assembler would have better supported the claim of better application performance. The speedups and execution times are compared with Topaz and the original FastThreads: the new FastThreads performs best, with only slight degradation from ideal due to bus contention and the periodic donation of a processor to the kernel daemon. In my opinion, they could have compared critical-section handling across all three prevention and recovery methods (special assembler labels, set/clear flag, check for imminent preemption) in cases where preemption actually happens, to understand the latency introduced by copying the code, running it at another location, and yielding at the end of its execution. Using a single upcall for multiple notifications is a good decision, since a new exclusive virtual processor need not be spawned for every event to be notified. But they never explicitly mention the memory footprint incurred by scheduler activations, each of which adds two execution stacks.

5. Question
I gather that this approach was actually implemented in NetBSD and FreeBSD, but both reverted to a 1:1 kernel-to-user thread mapping. I want to understand why this design was not adopted commercially.

1. Summary
Threads can be supported either by the operating system kernel or by user-level library code in the application's address space; the authors claim that neither approach alone is fully satisfactory. The paper first describes this dilemma and its associated problems, and then proposes a new kernel interface, termed scheduler activations, together with a user-level thread package. The authors' main aim is to provide the same functionality as kernel threads without losing the performance and flexibility of user-level threads.
2. Problem
Concurrency required in parallel programming can be achieved using either user-level or kernel-level threads. The authors claim and show that user-level threads are inherently better: kernel-level threads have higher thread-management overheads and tend to be overly general in order to support all applications, while user-level threads are lightweight and perform better because they can be optimized for the particular application in question. Yet user-level threads are typically built on kernel threads, and the authors argue that kernel threads are the wrong abstraction on which to support user-level thread management. They solve this problem by designing a new abstraction in the kernel to support user-level threads.
3. Contribution
The main contribution of the proposed solution is the introduction of a new abstraction for the user-level thread system, the virtual multiprocessor, which behaves like a dedicated physical machine. Firstly, the kernel provides each user-level application with its own virtual multiprocessor. Secondly, the kernel has complete control over the number of processors allocated to each address space, whereas the user-level thread system has complete control over which threads run on its allocated processors. This clean separation of responsibilities, together with the design being transparent to applications, is something I find very important and interesting. Lastly, communication between the kernel and the user-level thread system happens via scheduler activations. A scheduler activation serves several important roles: it provides the execution context for running user-level threads, notifies the user-level thread system of kernel events, and provides space in the kernel for saving the processor context when a user-level thread is stopped by the kernel (an illustrative layout follows). The authors also discuss using copying, rather than deadlock prevention, to handle critical sections, and briefly talk about reusing scheduler activations.
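An illustrative layout of those roles (field names are mine, not the paper's); the paper notes that each activation carries two execution stacks, one for the kernel and one mapped into the application's address space:

    typedef struct { unsigned long regs[16]; } context_t;   /* illustrative */

    typedef struct scheduler_activation {
        int       id;             /* named in kernel<->user notifications */
        void     *user_stack;     /* runs user-level threads and upcalls */
        void     *kernel_stack;   /* used when its thread enters the kernel */
        context_t saved_context;  /* kernel-held state of a stopped user thread */
    } scheduler_activation_t;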
4. Evaluation
The authors implement the proposed solution on the Firefly multiprocessor and carry out a thorough evaluation. Firstly, they show that using scheduler activations has no significant negative impact: performance is comparable to plain FastThreads on Topaz threads. Secondly, they try to quantify the cost of upcalls; they observed a 5x drop in performance but claimed this degradation was implementation-specific and could be dealt with. It would have been interesting to see the authors mitigate the problem and then re-measure upcall performance. Lastly, the authors show the proposed abstraction offers better speedup and scales well, even under memory pressure, and also performs well in multiprogrammed scenarios. Though their evaluation is satisfying, it would have been interesting to evaluate their strategy for handling critical sections and quantify the overheads involved in copying. It would also have been beneficial to evaluate the system on real applications rather than simple benchmarks.
5. Confusion
I did not quite understand the usage of the term ‘address space’ in this paper. Does it simply refer to a user-level application? Also, why are KLTs still being used as an abstraction to support ULTs?

Summary
This paper presents the design and implementation of an efficient thread-management mechanism: user-level threads with kernel interface support, scheduled with the help of upcalls that deliver hints from the kernel to user level via scheduler activations. The paper argues that kernel threads are an expensive and wrong abstraction for existing multiprocessor systems.

Problem
User-level threads built on the traditional kernel interface are fast to schedule but suffer from system-integration issues (e.g., stalls and starvation caused by blocking system calls), which kernel-level threads avoid. But kernel-level threads sacrifice flexibility and performance due to a general-purpose scheduling policy and longer context switches. The authors strive to get the best of both worlds by combining the functionality of kernel threads with the performance and flexibility of user-level threads.


Contribution
In this paper the authors provide a hybrid approach to threading. Each address space is provided with a virtual multiprocessor and has complete control over which of its threads run on those processors, while the kernel has complete control over the allocation of processors among address spaces. To achieve this, the kernel mechanism of scheduler activations is used.
A scheduler activation vectors control from the kernel to the address-space thread scheduler on a kernel event; the thread scheduler can use the activation to modify user-level thread data structures, execute user-level threads, and make requests of the kernel.


Evaluation
The authors evaluated the null-fork and signal-wait operations and demonstrated that the performance of their design is almost the same as ordinary user-level threads. They show that there is a performance degradation when kernel intervention is needed, via upcalls in scheduler activations. The authors mention that with more processors, both the original user-level thread package and the new package far outperform Topaz threads.
They also showed that, as available main memory is varied, the new design provides much better speedup than the original mechanisms and Topaz threads. They could have done more evaluation on mixed workloads that are both I/O- and CPU-intensive, and could have discussed the overhead and cost of adding scheduler activations to the system.

Confusion
With the increased system complexity and the required changes to both the kernel and user-level code, integrating this solution into an OS could be a challenge. Are the gains achieved enough to justify patching all current OSes with this design?

1. Summary
This paper describes the design and implementation of a new mechanism for scheduling threads on multiprocessors, called the scheduler activation. This mechanism abstracts a processor that is available to a user-level thread system, which allows the user-level scheduler to be simpler and improves performance on real workloads.

2. Problem
When this paper was written, kernel threads were typically too inefficient for most applications, so many user-level thread libraries had been implemented on top of kernel threads. However, kernel threads are the wrong abstraction on which to build user threads, because they can block and resume without notifying the user threads running on them, and because they are scheduled without information about those user-level threads. For example, a kernel thread could be blocked while one of its user threads holds a lock needed by user threads running on other kernel threads, potentially creating deadlock.

3. Contributions
This paper contributes a new mechanism based on scheduler activations. Each activation is associated with one physical processor and represents a virtual processor which may be added to or removed from a user-level thread system at any point. When a scheduler activation is added to a thread system, the thread system consults its ready list and runs one of those threads on the processor. When a scheduler activation is preempted, the thread that was running on it is put back on the ready list.

The crucial difference between scheduler activations and kernel threads is that a scheduler activation is never resumed once it blocks: if the thread running on the activation's processor blocks, that activation is removed from the user-level thread system and discarded rather than resumed. Instead, the kernel supplies a new scheduler activation, associated with the same processor, on which the thread system can run another thread; when the blocked thread can continue, the kernel notifies the thread system rather than resuming the thread itself. This creates a much improved interface for scheduling user threads.
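A hedged sketch of what the user-level side of such an upcall reduces to (names are hypothetical, except this_processor_is_idle, which is a downcall the paper names):

    typedef struct uthread uthread_t;

    extern uthread_t *dequeue_ready(void);     /* user-level ready list */
    extern void switch_to(uthread_t *t);       /* pure user-level dispatch */
    extern void this_processor_is_idle(void);  /* downcall named in the paper */

    /* Body of the "new processor" / "activation has blocked" upcalls:
     * the thread system just picks another ready thread for this processor. */
    void schedule_on_this_processor(void)
    {
        uthread_t *t = dequeue_ready();
        if (t == NULL)
            this_processor_is_idle();   /* nothing to run: return the processor */
        else
            switch_to(t);               /* no kernel involvement */
    }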

The paper also describes mechanisms by which a user-level thread scheduler provides the kernel with information it can use to allocate processors better: one call asks the kernel for more processors, and another notifies the kernel that a processor is idle.

The authors also contribute an implementation of their system. While they claim the mechanism is flexible enough to handle many different policies for scheduling threads onto processors, their implementation uses shared-memory threads as the model for concurrency. The implementation gained facilities for debugging the code with little extra work, and it included performance enhancements for running critical sections and for reusing the data structures of scheduler activations.

4. Evaluation
This paper evaluates the implementation both on microbenchmarks and on a real-world program. With microbenchmarks, the authors find that their implementation is an order of magnitude faster than the kernel threads of the Topaz system on which they build. They also find that an implementation of a user-level thread library, FastThreads, on top of their new mechanism is slightly slower on the microbenchmarks than the same library on top of kernel threads.

They also evaluate the speed of calls from the kernel into user space and find them to be several times slower than kernel threads on their system. They attribute this to implementation issues: their code is written as a modification of that kernel thread system, and in a high-level language (Modula-2+) rather than carefully tuned assembler.

Moreover, the authors evaluate their mechanism on a real-world application that solves the N-body problem using an O(n log n) algorithm. This application can be either compute-bound or I/O-bound, depending on the amount of available memory. The authors find that in both cases, the application performs better with FastThreads running on their mechanism than with FastThreads running on kernel threads.

This evaluation seems appropriate to the paper. The authors give applicable microbenchmarks without obfuscating where their mechanism performs worse than a competitor, and they test a scientific program, a realistic application for a multiprocessor computer of the time.

5. Confusion
Could you explain more about upcalls and what makes them fast or slow?

1. Summary
This paper describes the design, implementation, and evaluation of scheduler activations, a hybrid mechanism for effective multithreading support. The authors discuss the inherent problems with both kernel threads and user threads to provide the necessary context.

2. Problem
Threads are used to achieve parallelism. Typically, user-level threads are flexible and high-performance, but they perform poorly when the system encounters "real world" activity like multiprogramming, I/O, or page faults, sometimes even exhibiting incorrect behaviour. Kernel-level threads are expensive, incurring high cost due to their generality of design. The authors aim to solve this by describing a kernel interface and a user-level thread package that provide the necessary performance and flexibility.

3. Contributions
The primary contribution of this paper is the design of the new hybrid mechanism. In the proposed solution, the OS kernel provides each user-level thread system with its own virtual multiprocessor, and the kernel can change the number of processors in that abstraction during program execution, vectoring an event to the user-level thread system on every change. The user-level thread system has complete control over thread scheduling and notifies the kernel when it requires more or fewer resources. This design is independent of the application's scheduling policy and concurrency model.

4. Evaluation
The authors implemented a prototype by modifying the Topaz kernel thread-management system and FastThreads for the Topaz OS on a DEC SRC Firefly multiprocessor machine. They evaluate thread performance for cases requiring minimal kernel support by measuring the latencies of Null Fork and Signal-Wait on three different systems, showing that FastThreads on scheduler activations is comparable to FastThreads on Topaz. They also evaluate upcall performance, which involves significant kernel support, and observe that their system is slower than Topaz kernel threads; the authors state that this might be due to an inefficient implementation in Modula-2+. They additionally evaluate application performance by solving the N-body problem: speedup as a function of the number of processors is comparable to the FastThreads implementation, and execution time as a function of available memory on a six-processor system is significantly better than the other mechanisms, even under memory pressure.
Overall, the authors present a strong design for a performant, flexible thread-management system. They could have performed a study comparing the performance of their system when multiple applications are running, and could have given a breakdown of the cost associated with creating and managing scheduler activations.

5. Confusion
What is the difference between coarse-grained and fine-grained parallelism?
The strategy employed to recover from deadlocks isn't very clear; more details in class would be useful.

Summary:
The paper describes the design and implementation of a new thread-management system consisting of a user-level thread package and a new kernel interface that together provide both the flexibility of user-level threads and the functionality of kernel threads, using scheduler activations, implemented transparently to the user.

Problem:
The author is firmly in support of user-level threads, claiming they do better than kernel-level threads for a range of use cases that do not need the support the kernel provides; moreover, the kernel does not support the full range of parallel programming models either. The author also argues that kernel threads are not the right abstraction on which to build user-level threads, and that their performance is significantly lower than that of user-level threads. Though user-level threads are extremely flexible from a programming-languages perspective, they lack the necessary kernel support, which causes system-integration problems. So the author intends to bring out a solution that combines the best of both worlds.

Contribution:
The paper introduces virtual multiprocessors: an abstraction over the physical processors in which virtual processors are allotted to applications to schedule their user-level threads on. This way, the kernel lets the user application make its own decisions during thread scheduling.

Scheduler activations: this is how the kernel provides the virtual multiprocessor support to the user level. They are used to communicate kernel events to the user-level thread system: allocation and deallocation of virtual processors (whether in response to a user request for more processors or as a preemption by the kernel in the event of a resource crunch), and thread-blocking information, so that the user-level thread system can schedule another thread on the affected processor instead of the kernel deciding itself. An activation also provides the execution context for a user thread to run on. Kernel events are delivered to the user-level thread system via upcalls, which enter at fixed entry points in the address space; when a processor's event must be reported, the kernel uses a different processor and scheduler activation to perform the upcall. And when a thread is preempted while executing a critical section, the thread is allowed to complete the critical section before it is truly stopped.

Evaluation:
The design was implemented by modifying the FastThreads user-level package on top of Topaz kernel threads on a DEC Firefly multiprocessor system; the author claims the changes to both the kernel and the user-level thread-management library were minimal. The cost of Null Fork and Signal-Wait under scheduler activations was compared to that of the plain user-level thread system (FastThreads) and shown to be similar. Application performance was also measured by running a parallel application: with minimal kernel interaction it runs faster than Topaz kernel threads and comparably to FastThreads. Since the problem being solved in the paper can only be tested across a range of interactions between the kernel and the user-level thread system, the authors also tested the parallel application under heavy I/O and thread blocking, where scheduler activations perform better than FastThreads, which I think is the right way to measure this system. The author could have tested the system under excessive load, such as many more requests for virtual processors than physical processors, a load involving heavy preemption, or a load spending most of its time in critical sections rather than just doing many user-level thread switches. There is not much information on the additional space that scheduler activations require, but looking at it from a high level, I suspect it isn't a huge issue.

Doubts:
How does it handle very malicious applications?

1. Summary
This paper introduces a new paradigm for thread management that borrows the performance and flexibility of user-level thread libraries and the functionality of kernel-level thread management. The paradigm comprises scheduler activations, which let the kernel coordinate with user-level thread-management libraries. The authors implement and analyze a prototype on top of the Topaz kernel.
2. Problem
Two thread-management paradigms existed at the time. User-level thread management was faster, as it avoided costly kernel calls for its operations, and more flexible, as any parallelism policy could be implemented without changing the kernel. However, the thread library would not know when one of its threads was blocked or preempted and so could not make smart decisions about resource utilization. Kernel-level scheduling solved this problem but suffered in performance due to the high cost of kernel calls, and was inflexible, since the kernel could not implement new policies without adding complexity.
3. Contribution
This paper introduces scheduler activations, which serve as the vessel by which the kernel informs a user-space threading library about system changes. This lets the user-level thread library make smart decisions when a kernel event occurs: a processor being preempted, or a thread blocking or unblocking on a condition such as a page fault. Resources are used more dynamically, since an address space's virtual processors aren't blocked just because one thread is. Conversely, the user-level thread library informs the kernel when it needs more computing resources and when it can discard resources, rather than spawning extra kernel entities in an uncoordinated bid for more time on a physical processor (scheduling remains up to the kernel); a sketch of this common path follows. Most importantly, beyond these scenarios no other communication is required between user and kernel space, so operations on the performance-critical path of a parallel application, such as scheduling-policy decisions and the creation of threads, occur without costly system calls or kernel bookkeeping. Corner cases such as critical-section execution and deadlock avoidance are specially handled by letting the user-level threading module run a preempted thread just long enough for it to exit its critical section.
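For instance, a common-path operation like thread creation stays entirely at user level in this design; the only kernel interaction is an occasional hint. A sketch under assumed names (add_more_processors is the downcall the paper names; the rest are hypothetical):

    typedef struct uthread uthread_t;

    extern uthread_t *build_thread(void (*fn)(void *), void *arg); /* user level */
    extern void enqueue_ready(uthread_t *t);
    extern int ready_count(void);
    extern int processors_held(void);
    extern void add_more_processors(int n);   /* downcall named in the paper */

    void uthread_create(void (*fn)(void *), void *arg)
    {
        uthread_t *t = build_thread(fn, arg);   /* no system call here */
        enqueue_ready(t);
        /* Hint the kernel only when parallelism exceeds the allocation. */
        if (ready_count() > processors_held())
            add_more_processors(1);
    }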
4. Evaluation
The authors evaluate their solution by prototyping it using the FastThreads user-level package on top of Topaz kernel threads. They compare the basic cost of a Null Fork and a Signal-Wait combination to see the cost added by Scheduler Activations. They then test application performance scaling on a multiprocessor system. Here they demonstrate that the SA solution scales as well as FastThreads in the case with no kernel involvement, and beats both alternatives when an application requires kernel involvement, as the threading library and the kernel are able to cooperate around I/O bottlenecks. The authors' evaluation seems sufficient, as they show the performance improvements for the best case as well as how the system handles non-ideal workloads while still scaling. The implementation is not completely tuned, so the results should be taken with a grain of salt; a productized version should achieve better results.
5. Confusion
I do not understand, and hence have not mentioned, the processor allocation policy of space-sharing compute resources in a multiprocessor, multiprogrammed system.

Summary 
This paper talks about the design and implementation of a kernel interface and a user-level thread package that combine the performance of user-level threads with the functionality of kernel threads. The kernel interface exports an abstraction of a virtual multiprocessor and vectors kernel events to the user level through specially designed scheduler activation structures, while allowing the user-level thread manager to efficiently create and schedule threads in user space.

Problem           
When designing parallel computing applications, a programmer often faces the difficult dilemma of choosing between user-level threads and kernel-level threads, because both have inherent advantages and disadvantages. The flexibility and performance of user-level threads are restricted to uniprogrammed applications in the absence of I/O. Kernel threads do not have these restrictions, but they suffer from poor performance in general. This paper proposes a solution that addresses this dilemma by providing the best of both worlds.

Contributions
In my view, the following are the novel contributions of this paper:
(1) a virtual multiprocessor abstraction for each address space, in which the application knows exactly how many processors it has and what to schedule on them;
(2) the separation of responsibilities between kernel and user-level thread management, where the kernel is responsible only for processor allocation and the user-level thread system is in full control of thread scheduling;
(3) a new kernel mechanism called scheduler activations, which provides an execution context for vectoring control from the kernel to the address space on kernel events (see the sketch after this list);
(4) the user-level thread scheduler has to communicate to the kernel only those events that affect processor allocation decisions, and hence many thread operations can be implemented purely at user level;
(5) a huge advantage of the design is that the kernel makes no assumptions about the application's concurrency model or scheduling policy. Thus it provides a great deal of flexibility and better performance, as applications are free to choose their own data structures to represent parallelism at user level.
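To make contributions (3) and (4) concrete, here is a rough C-style sketch of the upcalls and allocation notifications the paper describes; the names and signatures are my approximations, not the actual Topaz interface.

    /* Sketch of the scheduler-activations interface; names are illustrative. */
    typedef int sa_id_t;                  /* identifies a scheduler activation */
    typedef struct machine_state {
        long regs[32];                    /* saved user-level registers, PC, SP */
    } machine_state_t;

    /* Upcalls: kernel -> user-level thread system (vectored kernel events). */
    void add_this_processor(void);                 /* a new processor was granted */
    void processor_has_been_preempted(sa_id_t preempted, machine_state_t state);
    void activation_has_blocked(sa_id_t blocked);  /* e.g. page fault or I/O */
    void activation_has_unblocked(sa_id_t unblocked, machine_state_t state);

    /* Downcalls: user-level thread system -> kernel (allocation hints only). */
    void add_more_processors(int how_many);        /* more runnable threads than processors */
    void this_processor_is_idle(void);             /* processor can be reclaimed */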

Evaluation
The authors implemented their design by modifying the Topaz kernel thread management routines and FastThreads (a user-level thread package) for the Topaz operating system on the DEC SRC Firefly multiprocessor workstation. In their evaluations using Null Fork and Signal-Wait operations, the authors show that the cost of user-level thread operations in their design remains almost the same as in the original FastThreads, while still providing the benefits of better system integration. However, their upcalls are slower than Topaz kernel thread operations, which the authors speculate is due to an inefficient implementation in a higher-level language, Modula-2+. Finally, they evaluate the effect of their solution on application performance using a parallel application that solves the N-body problem. When the application makes minimal use of kernel services, their solution runs as fast as FastThreads and still much faster than Topaz kernel threads; it really shines, beating the original FastThreads, once the application starts interacting with the kernel through I/O.
I believe the authors justify most of the advantages of their proposed solution in the evaluation, but there are still a couple of points that deserve more detailed analysis. For example, they have not presented an evaluation of the case where multiple applications run at the same time, sharing the underlying physical processors; this would have tested the overhead of moving processors between address spaces with different priorities. Additionally, the evaluation does not discuss the memory requirements and overhead of creating and managing scheduler activation structures, which could be significant if the kernel has to multiplex processors among many multithreaded applications running simultaneously.

Confusion 
The explanation of the internal data structures for scheduler activations is somewhat unclear. How are a scheduler activation's data structures manipulated by the kernel and the user-level thread manager during an upcall?

1. Summary The authors combine the efficiency and control of user-level thread libraries with the kernel's concrete knowledge of physical processor state by providing an up-and-down-call interface centered around "scheduler activations." Through this interface, user processes and the kernel can communicate information about their respective states.

2. Problem User-level threads provide a lightweight means of implementing parallel programs. As a program's thread scheduling occurs entirely in userspace, standard thread control actions, such as waiting and synchronization, can be implemented without incurring the costs inherent in context switching into the kernel. However, in a conventional system, the kernel is still in charge of multiplexing physical processors, so user-level threads must be built on top of whatever abstraction the kernel provides and are thus subject to the kernel's scheduling policies. For example, if a user-level thread library possesses a set of kernel threads as "virtual CPUs", even though the user-level policy may intelligently schedule the process's threads across the virtual CPUs, I/O interrupts and kernel traps cause the kernel-level thread to become unavailable to the user-level library until the kernel has finished handling the event. Moreover, the kernel scheduler is ignorant of program state, and may schedule threads that are available but do no useful work, due to spin locking or being idle. While using kernel threads alone obviates some of these pathological interactions, every thread manipulation and communication call then requires communication with the kernel, incurring significant overheads.

3. Contributions The authors replace threads as the kernel's vessel for user process state with a new abstraction, the "scheduler activation," and present a bidirectional communication interface through which user processes and the kernel can communicate their respective scheduling states. A scheduler activation is essentially a data structure representing a virtual CPU upon which a user-level threading library can execute a single thread as it sees fit. The userspace threading library can request additional virtual CPUs via a new system call, and can declare a virtual CPU idle, allowing the kernel to reclaim it. Through a set of four upcalls, the kernel communicates changes in physical CPU state to the user-level library: whenever the kernel changes the state of a process's virtual CPUs, e.g. when a new one is allocated, when a thread is blocked or preempted, or when a thread returns from blocking, the kernel presents a fresh scheduler activation to the user process. The user-level thread library can thus deschedule threads and schedule ready threads on the activation in accordance with its own policies, based on the notice from the kernel. The user-level threading system is also aware of critical sections, so the scheduler can always make choices that keep the program making progress.
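To illustrate how such an upcall might be handled, here is a hedged sketch of a user-level preemption handler; all types and helpers are invented for illustration, and the real FastThreads code differs.

    #include <stddef.h>

    typedef struct machine_state { long regs[32]; } machine_state_t;

    typedef struct thread {
        machine_state_t ctx;                 /* saved user-level context */
        struct thread  *next;                /* ready-list link */
    } thread_t;

    static thread_t *ready_head, *ready_tail;  /* simple FIFO ready list */

    static void ready_enqueue(thread_t *t) {
        t->next = NULL;
        if (ready_tail) ready_tail->next = t; else ready_head = t;
        ready_tail = t;
    }

    static thread_t *ready_dequeue(void) {
        thread_t *t = ready_head;
        if (t) { ready_head = t->next; if (!ready_head) ready_tail = NULL; }
        return t;
    }

    /* Load t's registers onto this activation and resume it; this is
     * machine-specific assembly in a real system, stubbed out here. */
    static void dispatch(thread_t *t) { (void)t; }

    /* Upcall: "processor has been preempted". The kernel runs this on a
     * fresh activation and hands over the preempted thread's state. */
    void on_preempted(thread_t *preempted, machine_state_t state) {
        preempted->ctx = state;              /* remember where it stopped */
        ready_enqueue(preempted);            /* it is still runnable */
        dispatch(ready_dequeue());           /* the library, not the kernel, picks next */
    }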

4. Evaluation The authors show that for basic wait and fork operations, user threads are an order of magnitude faster than kernel threads, and that both are at least an order of magnitude faster than the standard process abstraction. The authors repeat this experiment with scheduler activations, and show that the operations are only a few microseconds slower than standard user threads. To evaluate a more complex workload, the authors execute a parallelizable solution of the n-body problem on each threading model, and show scheduler activations scale well with the number of CPUs, and I/O volume. I am disappointed in the evaluation of scheduler activations with respect to multiprogramming, however - while they show that their implementation performs well with two parallel versions of the problem, their simplistic results give no sense of how well it scales with respect to higher degrees of multiprogramming; I would like to see how well their FastThreads implementation handles a higher degree of processor contention. I would also like to see more explicit discussion of FastThreads' performance with respect to locking - the problems they discuss are interesting, but we don't get quantitative evaluation of how their solution affects performance.

5. Confusion
I still don't understand their discussion of how user threads can potentially deadlock if the kernel schedules the underlying threads badly.

1. Summary
The paper proposes a new technique which strives to achieve the best of both user-level and kernel thread implementations. The kernel creates scheduler activations on which the user-level thread scheduler can run application threads; they also serve as a communication medium for making good decisions on processor allocation and thread scheduling.

2. Problem
User-level threads execute an order of magnitude faster than kernel threads, but they face significant overheads in the case of I/O operations and multiprogramming. Even in the best-performing cases, kernel threads show mediocre performance because of thread management overheads. Also, kernel threads do not have any information about the application with which to make an informed decision about which thread to schedule next. Systems that let the user level influence kernel decisions spend a significant chunk of time in communication, which negates performance and flexibility. This work tries to solve these problems with minimal overhead.

3. Contribution
The proposed technique abstracts the processors as a virtual multiprocessor presented to the application, over which the user-level scheduler can schedule application threads. The kernel decides the processor allocation for each application and lets the user-level scheduler take informed decisions about the threads running on those processors. This is realised through scheduler activations, which let the kernel vector control into the thread scheduler's address space. An activation serves as a medium to communicate user-level decisions that affect processor allocation and vice versa. It also serves as the context for user thread execution and for saving processor context when execution is stopped by events like I/O. The method ensures that the kernel stays agnostic to user scheduler policies.
The paper describes an implementation of the technique on a multiprocessor system called the Firefly. It discusses the overheads involved in executing critical sections, borrowing a concept from Trellis/Owl to avoid deadlocks while executing faster than techniques that set/clear flags. The implementation also reuses discarded scheduler activations to avoid the cost of creating new ones.
The authors also present measurements of existing techniques: user threads, kernel threads, and processes. The study shows that each differs from the next by roughly an order of magnitude (~10x), and analyzes the cause of the overheads in each.

4. Evaluation
The authors implemented their proposed abstraction by modifying Topaz, the native operating system of the Firefly. They evaluated the idea on several fronts: overhead characterization, and memory- and compute-bound performance. They use simple microbenchmarks, Null Fork and Signal-Wait, to characterize the overhead of creating threads and of communicating through the kernel; scheduler activations perform nearly on par with user threads here. To evaluate upcall performance, Signal-Wait is done through the kernel; this performs about five times worse than Topaz kernel threads, which the authors attribute to implementation inefficiencies. It would have been interesting to see the real overhead of an upcall, as it is the cornerstone of the scheduler activation abstraction. They also evaluate the system by running an application (a solution to the N-body problem) and show how it scales with processor count; the proposed technique achieves speedup on par with user threads. They further evaluate the same application in a memory-constrained environment and show that execution time does not degrade as badly as with the original FastThreads (user threads), demonstrating that SA handles I/O as well as kernel threads. It also performs well in a multiprogrammed environment.
Even though the evaluation covers most of the cases, it does not evaluate realistic scenarios like running multiple thread-intensive processes. Running a single application gives a good picture of the various latencies and points out inefficiencies in the system, but does not present a true picture for real deployments. Also, since scheduler activations take up memory (they serve as the execution context and save processor context), it would have been interesting to see the memory footprint of this technique compared to FastThreads.

5. Confusion
How do user-level context switches happen? How is thread switching possible when the user has no control over changing the PC to point to another thread?
Are there any threading libraries which have incorporated this mechanism?
How does the Trellis/Owl technique solve nested locks in critical sections?

1.Summary:
This paper is about the design and implementation of 'Scheduler Activations' as a mechanism for efficient user-level thread management. The authors implemented and evaluated Scheduler Activations by modifying the Topaz kernel thread management and FastThreads, a user-level thread package.

2.Problem:
The existing kernel-level thread management systems are too heavyweight, while the user-level ones, though more efficient, suffer from poor system integration: there is no effective communication between the kernel and these thread systems regarding events such as I/O and page faults that impact resource allocation. The authors argue that this is because kernel-level threads are not the correct abstraction for supporting user-level parallelism, and come up with the idea of scheduler activations.

3.Contributions:
Scheduler activations are an upcall mechanism that the kernel fires when events occur that can impact thread scheduling. The kernel allocates 'virtual processors' across address spaces, and from then on the scheduling of individual threads within an address space is handled by the user-level thread scheduler. Such an interface allows a full-fledged scheduling mechanism to run at user level. The authors also describe other performance enhancements, such as keeping multiple copies of critical-section code, which lets the thread scheduler temporarily run a preempted critical section to completion before resuming its own work (so other threads do not wait on a lock held by the preempted thread), and reusing scheduler activation data structures.

4.Evaluations:
The authors evaluate the cost of user-level thread operations, upcall communication, and application performance on scheduler activations. The performance of a single null thread with scheduler activations is shown to be of the same order as in existing user-level thread systems. The measurements also show that upcalls are considerably slow as a mechanism; the authors say this is due to their implementation constraints rather than any inherent issue with the upcall design. With respect to general application performance, the scheduler activations version outperforms both kernel-level threads and user-level thread systems. The latter, while performing relatively close in non-I/O-intensive cases, degrades much more quickly as I/O blocking increases, since in the baseline system, once a user thread blocks, the corresponding kernel thread also blocks, rendering the physical processor unusable for the address space until unblocking.

5.Confusion:
What is the overhead of checking for critical section code execution?

1.Summary
This paper discusses the limitations of kernel and user-mode thread management schemes and their various trade-offs. The authors propose a new scheme in which the kernel provides generalized support for user-mode management of parallelism. They describe a kernel interface and a user-level thread package that combine the performance and flexibility of user-level threads with the functionality of kernel threads.

2.Problem
Kernel threads are too heavyweight and slow, since every thread operation requires a call into the kernel, and supporting various threading schemes adds complexity to the kernel. User-mode threads have excellent performance and require no kernel intervention in the common case; they can be customized to fit the needs of a language or user without modifying the kernel. However, they perform poorly on blocking I/O, page faults, and multiprogramming in general. User-mode threads run on top of kernel-mode threads, which are scheduled with no knowledge of the needs of the user-mode threads. This leads to underutilization of processing power and potential deadlock for user threads.

3.Contributions
* The first major contribution of this paper is that it points out some of the existing (at that time) inefficiencies of the kernel thread abstractions.
* The design of giving each application a virtual multiprocessor machine and the interface for taking processors away from this virtual machine. Processor allocation is done by kernel and scheduling done by user-level packages, separating policy and mechanism.
* Scheduler activation mechanism allows user scheduler to act on thread events and thus separates the user thread management from kernel.

4.Evaluation
The overall evaluation approach seems reasonable for arguing the effectiveness and promise of the design, but there is some room for improvement.
On thread performance, the authors aim to demonstrate that the design preserves the order-of-magnitude advantage that user-level threads offer over kernel threads. It would have been more convincing had the authors justified the overheads with actual data.
The evaluation of upcall performance shows that their implementation is considerably slower. The authors defend this by claiming that upcalls are infrequent and hence the added overhead is not significant, but they do not provide data to justify this.
Finally, the authors could have used a more diverse set of applications instead of comparing two instances of the same application.

5.Confusion
How exactly are "scheduler activations" implemented?
I am curious to know what the added complexity of this would be in the kernel and in user-space code. Would it make application development (in terms of human time) simpler?

1. Summary

The paper describes the design, implementation and performance of a new kernel interface and user-level thread package that together provide the same functionality as kernel threads without compromising the performance and flexibility advantages of user-level management of parallelism. They demonstrate the performance of their prototype through their implementation on the DEC SRC Firefly multiprocessor workstation.

2. Problem

User-level threads have high performance and flexibility but lack functionality. Kernel threads have poor performance and flexibility but high functionality. Thus, the authors describe a kernel interface and a user-level thread management system that together combine the functionality of kernel threads with the performance and flexibility of user-level threads.

3. Contribution

After a detailed analysis of the performance and overheads of existing thread management systems, a new system is proposed. Each application is provided with a virtual multiprocessor, and the application knows how many (and which) processors the kernel has allocated to it. The application has complete control over which threads run on those processors. The kernel notifies the thread scheduler of events affecting the address space, and the thread scheduler notifies the kernel of events affecting processor allocation. The kernel mechanism used is called scheduler activations: one scheduler activation per assigned processor is given to the address space, and the concurrency model is built on top of these activations. The system tries to avoid idle processors in the presence of runnable threads. Processor allocation is based on the available parallelism in each address space: the kernel is notified when the user level has more runnable threads than processors, or more processors than threads, and uses these notifications as hints for actual processor allocation. A copy of each critical section is used to avoid deadlock.

4. Evaluation

The authors implemented the system on the DEC SRC Firefly multiprocessor workstation by modifying the Topaz operating system and the user-level thread package FastThreads. They show that relatively few modifications were required. The cost of user-level thread operations remained essentially unchanged. Upcall performance, which reflects the communication cost, was slower; this was analysed and attributed to the implementation, and tuning should improve it. For application performance, compute- and I/O-bound loads were used, and the outcome in each case was broken down and analysed.

5. Confusion

1) What are the real-world implementations of this system?

1. Summary
The paper discusses efficiently decoupling User Level Threads (ULT) from Kernel Level Threads (KLT) by making changes in the kernel. It also isolates scheduling in user space, where it needs no kernel intervention, so that the performance benefits and flexibility of ULTs and the privileges of KLTs can go hand in hand.

2. Problem
Previous solutions (Psyche and Symunix) only used shared memory to inform the kernel about the requirements of user processes. These solutions still had to deal with system integration problems, such as priority management, even though they provided user-level thread support.

3. Contribution
The authors try to optimize the common case by minimizing the interaction of user space with kernel space. They introduce the concept of Scheduler Activations (SA), similar in design to the kernel threads in Topaz. SAs are used to inform the user-level thread system about kernel events by generating an upcall, similar to a trap; other interactions between user and kernel space are thus avoided.
The authors give the threads within an address space the view of running on a set of physical processors. This set is managed by the kernel and can be grown if more processors are required (requested through a kernel call) or shrunk if user space informs the kernel that fewer processors are needed. The user-space scheduler then has full control over scheduling threads on the available processors.
They also discuss handling preemption during critical sections by allowing a preempted thread to run to the end of its critical section and only then yield the processor. This is achieved by making a copy of the critical-section code and using compiler techniques to identify and mark critical sections. In case a thread blocks on I/O inside a critical section, they suggest spin-waiting for a while before relinquishing the CPU. This technique does not hurt performance, as it eliminates lock contention via scheduling decisions. A minimal sketch of the recovery path follows.
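This is a minimal sketch of the recovery idea, assuming the library can tell from the saved program counter whether the thread was inside a critical section; the range check and the continue_in_copy helper are hypothetical stand-ins for the paper's compiler-generated copies.

    typedef struct { unsigned long lo, hi; } cs_range_t;  /* one critical section's PC range */

    typedef struct machine_state { unsigned long pc; long regs[31]; } machine_state_t;
    typedef struct thread { machine_state_t ctx; struct thread *next; } thread_t;

    static int in_critical_section(unsigned long pc, const cs_range_t *cs) {
        return pc >= cs->lo && pc < cs->hi;
    }

    /* Hypothetical: resume t in the copied section, whose exit yields back
     * to the scheduler instead of continuing normal execution. */
    static void continue_in_copy(thread_t *t) { (void)t; }
    static void ready_enqueue(thread_t *t) { (void)t; }

    /* On preemption, finish the critical section first to avoid deadlock. */
    void handle_preemption(thread_t *t, machine_state_t state, const cs_range_t *cs) {
        t->ctx = state;
        if (in_critical_section(state.pc, cs))
            continue_in_copy(t);     /* run just past the lock release */
        else
            ready_enqueue(t);        /* ordinary case: re-queue and move on */
    }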
They also discuss an optimization that caches the state of a blocked thread in user space, so that no copying of the data structure (thread control block) from the kernel to user space is needed.
By using techniques such as multilevel feedback and reducing the priority of an address space that demands more resources (processors), they address the issue of applications gaming the system. Hints provided by user-space applications are also passed to the kernel so that, if idle processors are available, the kernel can allocate them to another process.

4. Evaluation
The authors present the overhead of their implementation, which they report as about 5 microseconds extra for a null call; this works against their basic design choice of improving the common case, amounting to almost 15% extra overhead. Also, the authors did not profile the implementation further (e.g., measure the time for an upcall, or the number of extra context switches made to report blocking or resuming threads).

5. Confusion
I did not understand the terms 'bracket calls' and 'vectored events'. Also, the case of reporting a blocked thread by descheduling another thread was not very intuitive; I would like to know if there is a better way to handle it.

Summary
The paper presents the design and implementation of a threading system implemented at user level with a new kernel interface to support it. This system is designed to provide a combination of the functionality of kernel threads with the performance and flexibility of user-level threads. The paper also lists the challenges and motivation for such a design, and backs the design with a detailed evaluation of its performance and associated costs.

Problem
Traditional user-level implementations of threads provided the needed flexibility and ease of managing parallelism, but at the cost of performing poorly in the presence of I/O and multiprogramming. The kernel-level thread implementation had no such restrictions but performed very poorly compared to the user-level implementation. The authors aim to provide a new design combining the good parts of both.

Contributions
1. The new design divides the threading responsibilities between the user level and the kernel.
2. Policies that need no kernel involvement, like scheduling among threads and application-specific choices, are kept in a user-level library. This makes it easy to modify the policies to fit an application's needs, and the programmer can remain agnostic of the details.
3. The kernel provides an abstraction of the physical processors as a set of virtual processors and is responsible for allocating/deallocating these virtual processors to address spaces. It also handles kernel events like I/O and page faults.
4. The user-level thread library notifies the kernel of an application's processor needs, and the kernel notifies the library of increases/decreases in its processor allocation through upcalls. The kernel also notifies the library of kernel events, leaving scheduling policy decisions to the library.
5. Communication between the kernel interface and the user-level library happens through scheduler activation data structures. Each includes a kernel-mapped execution stack for kernel calls and a user-mapped execution stack for running user-level threads.
6. Further, a new policy is introduced for good performance of threads executing in critical sections, where the thread system makes scheduling decisions by monitoring the execution of critical sections.

Evaluation
The performance of a single null thread with scheduler activations was found to be of the same order of magnitude as traditional user-level thread implementations. The authors claim the upcall implementation was slower due to overheads, mainly because they built scheduler activations on top of the existing kernel rather than from scratch. But they do not seem to have backed this with any numbers, so it reads to me more like speculation than evaluation. An evaluation of how much extra overhead comes from storing extra state, and how this could have been avoided in their design, would have made the argument stronger. Additionally, the new design was tested against the traditional user-level and kernel-level implementations with a parallel application in two environments: one with kernel involvement, i.e. with some I/O and multiprogramming, and one without. In both cases the new design outperforms both traditional ones. Overall, the evaluation is good enough to show the design's effectiveness, but it could have been strengthened with more extensive testing on more applications.

Confusion
Is the overhead of user-level libraries checking for critical sections among threads significant?

Summary:
The paper describes the design and implementation of a kernel interface and a user-level thread package that combines the performance of user-level threads and the functionality of kernel threads.

Problem:
The authors note that user threads perform very well in comparison to kernel threads; the caveat is that this holds only in a uniprogrammed environment in the absence of I/O. Kernel threads offer environment awareness, but at the cost of being heavyweight and hence slower. The paper attempts to take the best of both worlds with an efficient interface between the kernel and a user-level thread package.

Contributions:
1) Each user-level thread system is given a set of virtual processors that the thread manager could use for thread operations within that address space. By providing this abstraction, the kernel isolates itself from making the performance critical thread scheduling decisions. All it does is manage the physical CPUs and respect the global address space priorities.

2) The kernel provides the virtual processors to the user-level thread system through scheduler activations. Scheduler activations are execution contexts, much like kernel threads, for running user-level threads. They are also used to communicate with the user-level thread system when a kernel event occurs. Scheduler activations provide memory in the address space on which the user threads run, and they have pre-allocated kernel memory to save the current user-level thread state on preemption or other kernel events.
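Point 2) suggests a natural layout for the per-activation state. The struct below is purely illustrative; the field names are invented, not taken from the paper.

    struct machine_state { long regs[32]; };   /* saved registers, PC, SP */

    /* Illustrative per-activation state: two stacks plus a save area. */
    struct scheduler_activation {
        int   id;                      /* identifies this activation to the kernel */
        void *user_stack;              /* mapped in the address space: upcalls and
                                          user-level threads execute here */
        void *kernel_stack;            /* used when the running thread traps into
                                          the kernel */
        struct machine_state saved;    /* pre-allocated area where the kernel saves
                                          the user thread's state on preemption or
                                          blocking */
    };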

3) Kernel events are communicated to the user-level thread system through upcalls, which enter at a fixed entry point in the address space. A notification could be about a user-level thread that has blocked in the kernel, the allocation/deallocation of a virtual processor, or the preemption of a processor running a scheduler activation. The last one is tricky in that the kernel needs a different processor/scheduler activation to notify the address space that another of its scheduler activations has been preempted.

4) The user-level thread system could also notify the kernel about events that could possibly alter the allocation of virtual processors to that address space. One such event is change in the parallelism of the application.

5) The system also handles preemption of threads that have entered a critical section. In such situations, the user-level thread is run until it comes out of the critical section, after which it is preempted. This is achieved through a combination of flags and instrumenting the application binary.

Evaluation:
The paper evaluates the design on three parameters. First, thread performance in a uniprogrammed environment, demonstrating that the idea attains the performance of user threads under those circumstances. Second, the authors evaluate the upcall performance of the design and go on to say that the added overhead is not significant because upcalls are infrequent. This claim, however, is not justified: a heavily loaded machine will see an increase in upcalls due to frequent preemptions and page faults. Third, they evaluate overall application performance by running two instances of the same application, which again seems too little to convince readers of the design's performance. Overall, although the idea seems promising, it needs more rigorous benchmarking to justify the added complexity of a user-level thread system and application binary instrumentation.

Confusion:
More insights on debugging in this system.

Summary
The paper presents a new mechanism for implementing user-level threads that has the performance of user-level threads, the functionality of kernel threads, and the added flexibility of choosing the thread scheduling policy. This is achieved by introducing a coordination mechanism in which the user thread library and the kernel can notify each other, while keeping the interactions as infrequent as possible.
Problem
Even though user-level threads perform better than kernel threads, the problem lies in integrating user-level threads with other system services: there is very little kernel support in existing multiprocessor operating systems, and the kernel schedules its threads onto physical processors oblivious to user-level state. Kernel-level threads are the wrong abstraction on which to build user-level threads, and their own performance is poor because they are heavyweight. So the paper describes a kernel interface and a user-level thread package that combine the functionality of kernel threads with the performance and flexibility of user-level threads.
Contributions
1] The user-level thread system now gets notified when a thread is rescheduled by the kernel.
2] The user-level thread system can now notify the kernel when its processor allocation needs to be reconsidered.
3] There is no effect on lock latency when a thread executing in a critical section is preempted, because the user-level scheduling library learns of the preemption through an upcall; no checking of shared variables in the kernel or of per-thread bits is required, as in the Psyche and Symunix mechanisms. (In those systems the user level is notified before preemption so the application can place the thread in a safe state and voluntarily relinquish the processor.)
4] The user-level scheduling library is now notified by the kernel when a thread blocks on I/O and when the I/O completes, so the library can use the otherwise idle processor for another thread.
5] Better asynchronous kernel I/O: a] if a thread blocks on I/O, the current scheduler activation is preempted and a new one is assigned to the address space; b] fewer code changes are required in both application and kernel to handle this, compared to other asynchronous kernel I/O mechanisms.
6] An application can now decide its own policy for scheduling its threads onto its processors, and can implement this without trapping into the kernel (a small illustrative sketch follows).
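To make item 6] concrete: a user-level switch is just a register save/restore, so no trap is needed. Below is a tiny illustrative sketch using POSIX ucontext; it is only a stand-in, since real thread libraries hand-code the switch in assembly to avoid even the signal-mask system call that glibc's swapcontext performs.

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, th_ctx;
    static char th_stack[64 * 1024];          /* stack for the user-level thread */

    static void worker(void) {
        puts("user-level thread running");
        swapcontext(&th_ctx, &main_ctx);      /* yield: save our registers, load main's */
    }

    int main(void) {
        getcontext(&th_ctx);
        th_ctx.uc_stack.ss_sp   = th_stack;
        th_ctx.uc_stack.ss_size = sizeof th_stack;
        th_ctx.uc_link          = &main_ctx;  /* return here if worker falls off the end */
        makecontext(&th_ctx, worker, 0);
        swapcontext(&main_ctx, &th_ctx);      /* switch without any scheduling help from the kernel */
        puts("back in main");
        return 0;
    }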
Evaluation
The evaluation is very clear, and the authors do a good job comparing the modified FastThreads with the original FastThreads and Topaz kernel threads. They show that Null Fork and Signal-Wait costs are almost the same as the original FastThreads and better than kernel threads. The speedup of the N-body application with the new FastThreads is almost the same as the original FastThreads as the number of processors increases (Figure 2). The new FastThreads also reduces the application's execution time as available memory decreases (Figure 3), and Table V shows that with 100% of memory available the speedup of the new FastThreads is better. One weak point is upcall performance, which is worse than Topaz threads; the authors suggest this is due to quick modifications of the existing Topaz kernel threads in Modula-2+, and that a tuned assembly version would perform better, as demonstrated by Schroeder and Burrows [19].
What I feel is missing is an evaluation showing that the suggested preemption-control approach does not cause much overhead: for example, a graph with multiple threads executing in critical sections, since there will now be multiple copies of the critical sections.
Another useful evaluation would be one where there are many threads, half of which do I/O for very short periods and come back again; since the processor might by then have been reallocated to other threads, regaining it may reduce performance, I think.
Confusion
The paper mentions in section 3.2 how dishonest applications are handled. I am not sure how this works.

1. Summary
This paper discusses the concept of Scheduler Activations, a new threading mechanism. The purpose of this was to describe a kernel interface and a user-level thread package that enable the combination of the functionality of kernel threads with the flexibility and performance of user threads, while remaining transparent to the application developer.


2. Problem
User-level threads are flexible in terms of programming languages and environments, and have excellent performance. However, when implemented on top of traditional processes, system integration problems cause them to exhibit poor performance. The authors argue that this is not an inherent issue with user-level threads, but due to inadequate kernel support, and that kernel threads are the wrong abstraction to support user-level parallelism management.
Kernel Threads do not have these system integration problems, but are too rigid and heavyweight (cost of generality) for use in many parallel programs, i.e., they are slower than user-level threads.
Kernel threads are the wrong abstraction because they are scheduled without taking user-level thread state into account, and because they change state without notifying the user level; for example, preemption while a spin lock is held.


3. Contributions
Virtual multiprocessor : abstraction of a dedicated physical machine, with a dynamically changing number of processors.
N user-level threads are mapped to M virtual processors, which diverges from both 1:1 (kernel threads) and N:1 (user-level).
Combines salient features of user-level and kernel-level threads.
There is a user level Thread Scheduler system.
Kernel has no knowledge of the application's concurrency model, or scheduling policy.
Kernel events are explicitly vectored / upcalled to the user level Thread Scheduler.
Scheduler activation: upcall - the kernel calls into the user-level process; the reverse direction of system calls.
Processes notify the kernel when they have more runnable threads than processors, or vice versa. This is used in policy and in dynamic allocation / reclamation of processors.
Pre-emption in critical section - Prevention is hard and has drawbacks. Recovery - user level context switch is done until thread exits critical section.

4. Evaluation
Applications might be dishonest in reporting their parallelism to the OS.
Benchmarks:
FastThreads on scheduler activations have performance comparable to FastThreads on Topaz kernel threads. Upcall performance is slightly worse. Application performance is comparable when I/O is negligible, and better when I/O is involved.
My Opinion:
Thread operation latency profiling - benchmarks chosen were sensible, and comparison with procedure call was a good addendum.
Numerical tabled data, in addition to charts - useful.


5. Confusion
"address spaces" vs "processes" ?
Upcalls.

1. Summary
In this paper the authors present the design of a new kernel interface called Scheduler Activation and user-level thread management package to support parallel programming. The main idea is to combine the performance and flexibility of user-level thread management systems with the integration semantics of kernel threads.
2. Problem
Using threads is the most common parallel programming approach, and modern systems support them in two ways: user-level threads, where a user-space thread scheduler makes scheduling decisions to suit the application, and kernel threads, where the kernel scheduler decides which kernel thread to run. Kernel threads offer low performance, since they have high thread management overhead and the kernel scheduler cannot make the best decision for the application. The authors claim that user-level thread management is well suited to supporting parallel programming, but it faces challenges in integrating with the operating system; for example, user-level threads implemented on top of kernel threads perform poorly in I/O-heavy and multiprogrammed environments. The bad interface offered by operating systems is to blame for these integration challenges, which the authors set out to solve.
3. Contributions
The biggest contribution of this work is the new abstraction of scheduler activations, which allows the user-level thread scheduler and the operating system kernel to cooperate with each other. A scheduler activation provides a way to communicate to the user-level thread scheduler information about kernel events that can be crucial to the performance of an application: I/O, page faults, and increases or decreases in the number of physical processors allocated to the address space. This eliminates the weaknesses of traditional user-thread-based systems. In the other direction, scheduler activations allow the user-level scheduler to convey to the kernel any information that can help the kernel manage resources better; for example, if an address space needs more processors or can afford to relinquish one, it can tell the kernel so. This eliminates the drawbacks that kernel threads suffered from due to lack of knowledge of application state. A crucial optimization in this work is the copying of critical sections, which helps when a thread takes a page fault or gets preempted within a critical section; it allows the system to match the performance of traditional user threads by handling these infrequent events gracefully, without degrading common-case performance. Lastly, the design can cache old scheduler activations and reuse them later to avoid the overhead of creating new ones.
4. Evaluation
The authors have done a good job of backing their initial hypothesis experimentally and then evaluating the proposed design thoroughly. They use Null Fork and Signal-wait workloads to estimate the overheads in user-level scheduler and user-threads. In fact, they also highlight the tremendous benefit offered by the critical-section-copying optimization that allows them to match traditional user-threads’ performance. However, while evaluating upcall performance, they explain the observed bad performance by blaming the implementation without substantiating their claim of a from-scratch implementation offering much better performance.
The authors also show that under no memory pressure, the design scales well with number of processors and is able to keep up with user-threads’ performance. Further they show that under memory pressure, which leads to IO/page faults, the design does much better than the conventional user-threads, which is what the work aimed for. They also show how the system performs much better than traditional designs in multiprogrammed environment, another major motivation behind the work.
Though the evaluation is almost satisfying, it would help if it were done with more workloads, including some mainstream applications. This would clarify which events (and at what frequency) are associated with common workloads and how well the proposed design handles them. Lastly, the paper stuck to the invariant of maintaining as many kernel threads as physical processors. It would be interesting to see results that break this invariant: for example, would user threads perform so badly on I/O-intensive workloads if the number of kernel threads were much greater than the number of physical processors, since a blocked thread could then be replaced with another instead of leaving the physical processor idle?
5. Confusion
How does the user-level thread scheduler get invoked in absence of timer interrupt? What are the semantics of user-level thread/context switch?

Summary
This paper proposes a high-performance, flexible user-level thread management mechanism with a modified kernel interface called Scheduler Activations. The mechanism provides a virtualized multiprocessor and well-defined notifications to the user level, making parallel programming at user level easy, efficient, flexible, and fast.
The problem
User-level threads are performance-efficient and support application-customizable scheduling, but the processor gets blocked during system services. The issue with kernel threads, on the other hand, is that they are slow (they require kernel calls to manage), and making them support various threading schemes introduces complexity into the kernel. Running user-mode thread packages on top of kernel-mode threads, which are scheduled with no knowledge of the needs of the user-mode threads, results in underutilization of processing power, and user threads can remain blocked unnecessarily long or deadlock. Hence the goal of the paper is a system with the functionality of kernel threads and the performance and flexibility of user-level threads.
Contributions
1. The paper introduces a threading mechanism that implements an "N:M" strategy, mapping N application threads onto M kernel threads, or "virtual processors." This is a hybrid between kernel-level ("1:1") and user-level ("N:1") threading. Each application address space is provided the abstraction of a virtual multiprocessor.
2. Another novel feature is that the kernel allocates processors explicitly to different address spaces.
3. The kernel notifies the address space's thread scheduler of every event affecting the address space.
For example, when a user-mode thread issues a blocking I/O, the kernel executes an "upcall" to the user-mode scheduler, which can then move the other user-mode threads (which would otherwise have been stuck behind the blocked kernel thread) onto a different "virtual processor." In this way, the user scheduler gets the opportunity to intercept kernel events and keep user threads scheduled on unblocked kernel threads. Likewise, when an I/O completes, the user-mode scheduler is upcalled again and has the option of continuing the thread that issued the I/O or continuing another thread. The address space notifies the kernel of the subset of user-level events that can affect processor allocation decisions. In addition, the user-mode scheduler can request more virtual processors from the kernel or relinquish an idle processor, allowing information to pass in the opposite direction; a small sketch of these hints follows.
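This is a hedged sketch of the two hints flowing from the address space to the kernel; the call names follow the paper's descriptions, but the exact signatures are assumptions, and the stubs stand in for real kernel entry points.

    /* Stand-ins for the assumed kernel entry points; in a real system
     * these trap into the kernel. Hints, not demands -- the kernel still
     * owns the final processor allocation. */
    static void add_more_processors(int how_many) { (void)how_many; }
    static void this_processor_is_idle(void) { }

    /* User-level scheduler deciding whether to ask for or return processors. */
    void rebalance(int runnable_threads, int my_processors) {
        if (runnable_threads > my_processors)
            add_more_processors(runnable_threads - my_processors);
        else if (runnable_threads < my_processors)
            this_processor_is_idle();    /* kernel may reassign this processor */
    }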
Evaluation
The performance of the scheduler activation system was compared against a kernel thread system (Topaz) and an existing user-mode thread system (FastThreads). The latency of thread management calls on the scheduler activation system was only slightly longer than on the existing user-mode thread package, and much shorter than on the kernel thread system. Further, in a variety of experiments manipulating 1) processor allocation, 2) I/O activity, and 3) contention with other processes, the scheduler activation system performed significantly better than the two other systems. What I really appreciate about the evaluation is the detailed explanation of each observation, such as the reasoning for why upcall performance is 5x worse than Topaz threads (they blame the implementation). I also like that the evaluation is carried out in a three-fold way, with separate analyses of thread, upcall, and application performance, which helps in understanding the overhead breakdown. However, I feel the application performance analysis should have been carried out on a larger number of workloads with varying characteristics. A mixture of several I/O-bound and memory-bound workloads running together would also have given a better idea of performance under multiprogramming than just running two copies of the N-body application. Another interesting evaluation would have been the memory footprint of maintaining a copy of every critical section.
Confusion
I understand that there is a single user-level thread scheduler scheduling threads over multiple processors, hence multiple scheduler activations. It is said that the scheduler runs on an activation's user-level stack. My confusion is: which activation's stack does it use?

1. Summary
This paper talks about an efficient kernel interface and a novel user level thread scheduling implementation. This allows a user-level thread scheduler to achieve the same level of effectiveness as a kernel scheduler but with much lower overheads on common operations like context-switching and thread creation.

2. Problem
User-level threads in the best case give an order of magnitude better performance than kernel threads, mainly due to fewer kernel crossings and efficient context switches. They also give the application programmer full flexibility over thread scheduling. However, user-level thread performance degrades in the presence of kernel events such as preemption (multiprogramming), page faults, and disk I/O. The user-level scheduler is not aware of blocked threads, and if a thread holding a lock is preempted it may cause unnecessary spinning. This lack of information exchange makes it difficult for the system and the user-mode scheduler to make informed decisions. The paper aims to bridge this gap.

3. Contributions
One key contribution of this work is identifying that user-level threads have both better performance (~10x) and more flexibility than kernel threads, and that they suffer on system events such as I/O and page faults only because of poor integration and interfacing with the kernel.
The authors describe a new abstraction called “scheduler activations” that defines an efficient method for the kernel to communicate system events to the user thread system.
1 - Scheduler activations define an execution context for an address space, but differ from a usual kernel thread in the way a thread resumes: rather than restoring context, a new scheduler activation always makes an upcall to the user-level thread scheduler.
2 - Using an upcall, the kernel notifies the user thread scheduler of events such as a thread blocking on I/O or the addition of a new logical execution context. It informs the application of its share of allocated execution contexts and lets user space make the scheduling decision.
3 - The thread scheduler can make kernel calls to request more processors or to preempt a concurrently running lower-priority thread. It may also relinquish an idle scheduler activation to be used by another program.
4 - Interaction with the kernel is required only when a scheduler activation is scheduled out, not on every context switch as with kernel threads. The subsequent upcall lets the kernel avoid making a scheduling decision on behalf of the application, which is what happens with kernel threads.
5 - Preempting a thread inside a critical section can degrade performance and possibly deadlock. To deal with this, the user thread scheduler identifies whether a preempted thread held a lock and, if so, lets the thread resume execution until it releases the lock and yields to the scheduler. This is done efficiently using (an overly complex!) code replication scheme.

4. Evaluation
To start with, the authors evaluate the common case of thread management using two microbenchmarks, and show that their modified user thread library closely tracks the performance of the best-case user thread library. They also measure the overhead of the upcall interface, which turns out to be much higher than for kernel threads; their justification that this is due to implementation issues is not very convincing. An interesting way the authors evaluate the final system is with a benchmark that initially fits in memory, whose problem size is then increased so that more I/O is involved. As the proportion of I/O increases, the original user thread library suffers due to blocked kernel threads, while the upcalls allow the new user thread system to scale well. They observe similar results for multiprogrammed workloads.

5. Confusion
Why do we have a deadlock possibility when the number of user threads is more than kernel threads?

1. Summary
This paper provides a new threading mechanism that gives the performance and flexibility of a user library managed system and the non blocking ability of kernel level threads.

2. Problem
Threads are the basic platform for many parallel programming environments. There are two kinds of thread management systems: supported by the OS kernel, or by user-level library code. User threads are lightweight and can have application-aware thread scheduling policies, but can get blocked by I/O, page faults, etc. Kernel-level threads alleviate that issue but involve more overhead than user threads and use an application-agnostic scheduling policy. Moreover, kernel-level threads are built for generic applications, which results in inherent inefficiencies that user-level threads can avoid. This makes parallel programming inefficient and prevents applications from reaching their maximum performance.

3. Contribution
The authors quantify the overhead of kernel threads relative to user threads, providing significant insight into the scope for optimization. They identified that the only way to achieve the benefits of both systems was some form of communication between the application/user-level library and the kernel: this helps the user-level library manage threads according to the application's requirements and kernel event status, and helps the kernel allocate resources more appropriately. They achieve this using Scheduler Activations, which is the main contribution of the paper. They also discuss the bottleneck of preempting threads inside critical sections and an optimization to handle it.

4. Evaluation
They implemented the design by modifying the Topaz OS for the DEC SRC Firefly multiprocessor system, along with FastThreads. For the Null Fork and Signal-Wait operations they perform almost as well as FastThreads. For applications not performing I/O, their performance is again very similar to FastThreads. As an application starts performing I/O, we see significant degradation in the FastThreads implementation; beyond a certain point it performs worse than both Topaz threads and their implementation. In all these experiments they performed significantly better than Topaz threads. Thus they manage to create a thread library that performs as well as a user-managed library without the bottlenecks associated with one. They also measure upcall performance, though their justification for the numbers is not very convincing. Some of the performance benefit comes from the optimization of marking critical sections, which requires the application developer to delimit critical code sections; I feel this is burdensome for an application developer. The authors also maintain the invariant that the number of kernel threads equals the number of processors in the system; the justification for this is not concrete, especially for applications with significant I/O, and some performance evaluation around it would have helped.

5. Questions
The authors mention a deadlock scenario when user-level threads are multiplexed across a fixed number of kernel threads. Some more explanation of this scenario would be helpful.


1. Summary
Threads are very important for concurrency in parallel programming. This paper discusses the problems associated with kernel-level and user-level threads and puts forward a design called "scheduler activations" that combines the functionality of kernel threads with the performance and flexibility of user-level threads. The authors also describe their implementation and how it compares with traditional kernel and user threads.

2. Problem
Both user-level and kernel-level threads had flaws when dealing with concurrency on multiprocessor systems. Kernel threads have poor performance and poor flexibility, whereas user threads lack functionality that requires kernel support. There were also integration problems when trying to build user-level threads on a traditional kernel interface (for example, the kernel not considering user-level information in its decision making).

3. Contributions
The main contribution is the creation of a new interface between kernel and user that passes information in both directions (for use in decision making), with the kernel allocating processors and the user-level thread system scheduling threads on them. They propose "scheduler activations", which are used for running user-level threads, notifying the thread system of kernel events, and saving processor context. The kernel uses events to inform the user-level thread system, while the thread system uses hints (an idea that reduces the amount of communication) to inform the kernel. They also discuss the problem of dealing with critical sections; their idea of avoiding deadlock by copying critical sections is pretty interesting, although an argument could be made that, because deadlocks are rare, a simpler recovery scheme might have been fine. Another performance enhancement mentioned in the paper is reusing scheduler activations (an idea that has come up in previous lectures too).

4. Evaluation
They implemented their design on the Firefly multiprocessor and justified the idea of scheduler activations by showing that the performance of FastThreads with scheduler activations was comparable to unmodified FastThreads. They also showed that their upcall performance was worse by a factor of five, though they believe this is implementation-specific and removable (did they actually achieve this?). Finally, they evaluated their design on an application and showed better speedup and scaling. One evaluation I would have liked to see is how much overhead the copying of critical sections added to the system. Otherwise, they justify most of their claims.

5. Confusion
My main questions concern copying critical sections and avoiding deadlock rather than recovering from it: were deadlocks common? Could they have assumed deadlock is rare and recovered only when necessary? Was the overhead insignificant? I also did not completely understand the mechanism behind preemption/relocation.

summary~
In this paper, the authors present a new approach to managing parallelism that gives better parallel computing performance. They first argue about the advantages and disadvantages of the two current approaches, kernel-level threads and user-level threads, then present their design of scheduler activations, which provides the advantages of both.

problem~
Current approaches to managing parallelism are not satisfying. User-level threads are more flexible and perform better than kernel-level threads because they are lightweight and do not need intervention by the kernel. But the lack of kernel support for user-level threads limits their performance when operations like I/O and exception handling are frequent. So new abstractions from the kernel are needed to address this issue.

contributions~
They did a detailed analysis of the performance and functionality of both user-level and kernel-level threads, reasoned that the performance of user-level threads is inherently better than that of kernel-level threads, and then identified that current user-level thread implementations suffer from the bad abstractions provided by the kernel.
So they introduced the abstraction of a virtual multiprocessor and the mechanism of scheduler activations. Responsibilities are divided between the kernel and each application address space: the kernel makes processor allocation decisions, while each address space takes care of thread scheduling. The kernel and the address spaces also notify each other of events that might affect the other.

evaluation~
The authors justify the effectiveness of their design by comparing the performance of their implementation in various scenarios.
The evaluation of non-I/O-bound workloads shows that scheduler activations impose small overhead in this common case, with performance similar to the unmodified FastThreads library; the evaluation of an I/O-bound workload shows significant performance gains over unmodified FastThreads.

confusion~
The part where the kernel and the address space notify each other about relevant events seems to involve a lot of communication, but from the evaluation it appears that not much overhead was introduced by this process?

1. Summary: This paper is about scheduler activations, which provide an effective way to manage user-level threads by modifying the kernel, thus providing the best of both worlds: user-level threads (ULTs) and kernel-level threads (KLTs).
2. Problem: User-level threads provide flexibility and performance benefits, since they avoid unnecessary kernel intervention. In the presence of activities like I/O or page faults (which require kernel intervention), though, ULTs perform poorly: if one user-level thread blocks, all others in the process get blocked too. KLTs do not suffer in this scenario, since only the corresponding kernel thread gets blocked and others can still be used. However, creating a kernel thread each time is expensive. Plus, the scheduling of kernel threads is oblivious to user-level thread state, resulting in poor performance: for example, a thread in its critical section can have its kernel thread preempted, keeping all other contending threads blocked too. Moreover, KLTs implement generic policies, making them inherently slower. To solve the problem, the authors introduce scheduler activations, which keep control of physical processors in the kernel but give control of scheduling threads within a process to the user-level library. They also aim to keep the number of user/kernel crossings to a minimum to avoid communication overhead.
3. Contributions: One of the major contributions of this paper was the concept of scheduler activations, which later featured in systems like NetBSD and FreeBSD and was considered for the Linux kernel too! Thus, the contribution was big. Their idea gives each application the freedom to decide its own thread scheduling policy, an idea also predominant in systems like exokernels. Each application is provided with a virtual multiprocessor, and the actual number of physical processors assigned to each address space is decided by the kernel based on priority. Scheduler activations provide a way to exchange knowledge between user and kernel level by means of upcalls, and also provide a context for user-level threads to execute in. Another key idea is that the kernel and the user level each inform the other of what they are doing, without asking or waiting for anything, keeping their functionality separate. An innovative implementation detail is the use of a copy of each critical section to handle deadlocks during preemption or blocking; this provides a zero-overhead way of marking when a lock is held, helping maintain performance (see the sketch below). One other neat detail is that they kept the interface of the user-level library the same and manage the virtual multiprocessor transparently, helping backward compatibility.
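To make the copied-critical-section trick concrete, here is a hand-written C sketch; the paper relies on compiler support to generate the copy and the address mapping, so everything here is purely illustrative.

    /* Sketch of the recovery technique: each critical section has a copy whose
       only difference is that it ends by yielding back to the user-level
       scheduler. Normal execution pays no extra cost; only a preempted thread
       is ever steered into the copy. */
    static volatile int lock;
    static int shared_counter;

    static void user_level_yield(void) { /* stub: reenter the thread scheduler */ }

    static void critical_section(void) {       /* original path: zero overhead */
        shared_counter++;
        lock = 0;                               /* release */
    }

    static void critical_section_copy(void) {  /* recovery path */
        shared_counter++;                       /* same body ...               */
        lock = 0;
        user_level_yield();                     /* ... then relinquish the CPU */
    }

When an upcall reports a preempted thread, the scheduler compares the thread's saved program counter against the critical section's address range; if it falls inside, the thread is resumed at the corresponding offset in the copy, so the lock gets released before the thread goes back on the ready list.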
4. Evaluation: They show that their implementation is as good as ULTs for a CPU-bound workload and better than KLTs for an I/O-bound workload. A comparison involving a mixed workload would have made the picture complete, and results on more benchmarks would have helped. They also note the slow performance of their upcall routine and attribute it to the unoptimized implementation on Topaz. I would have liked to see the memory footprint of their implementation compared to ULTs and KLTs, especially with a copy of each critical section. Also, they do not provide an analysis of how to choose the hysteresis time period; without an appropriate parameter, the system may lose performance if applications change their processor requirements too frequently, resulting in a kernel notification each time. They also don't measure the responsiveness of the system. It seems that the priorities of all threads within a process are presented to the kernel as a single priority per address space; this is quite unlike OSs such as Linux, where priority is per thread, helping responsiveness.
5. Confusion: Why do current OSs use a KLT implementation? It seems that once a scheduler activation is provided to a high-priority process, that process will run to completion unless it blocks for I/O or is preempted by another, higher-priority process. How do they handle starvation?

1. Summary
This paper introduces a new kernel interface and user-level thread package that maintain the functionality of kernel threads and the flexibility of user-level threads. The authors' approach provides a virtual multiprocessor to each application, using scheduler activations to modify user-level data structures, execute user-level threads, and handle requests to and from the kernel, with the kernel retaining control over the allocation of physical processors.
2. Problem
User-level threads without any kernel modification, though theoretically high-performance, in reality exhibit poor performance or incorrectness, primarily due to the mismatch between virtual and physical processors caused by multiprogramming, I/O, and page faults. Kernel threads, on the other hand, being heavyweight, have worse performance both in theory and in practice. The parallel programmer is thus left in a conundrum, forced to weigh this tradeoff.
3. Contributions
The abstraction of a virtual multiprocessor helps the kernel maintain fairness between address spaces, moves thread scheduling logic to user level, and remains transparent to the programmer behind a normal thread interface. To support this, a scheduler activation serves as an execution context, notifies the user level of kernel events, and saves the processor context of its current user-level thread. This also makes it easier at user level to explicitly manage cache locality and priorities. The kernel maintains the invariant that running activations match assigned processors. Because the kernel is unaware of the data structures used to represent parallelism, the programmer can build any concurrency model. The user level can also request more processors from the kernel, which, if none are available, stores the request as a hint should resources become available later. To restrict unfair multiplexing of resources between address spaces, the design accommodates policies such as multi-level feedback, equal distribution, and least remaining service. The authors use a recovery mechanism to deal with threads blocked in critical sections, preserving address space semantics and consistency. Due to the transparency of their design, the authors were able to achieve object code compatibility without static partitioning of processors.
4. Evaluation
The authors used the FastThreads package and the Topaz kernel for their implementation. For thread operations, their design shows degradation on Null Fork, due to managing the active thread count and polling for processor need, and on Signal-Wait, due to resetting the state of resumed threads and partly due to implementation issues in the kernel. By evaluating and comparing kernel and user-level thread speedups, the authors show that kernel threads with lock contention and thread creation do indeed cause performance issues. Once available memory decreases, the modified user-level threads and kernel threads take advantage of I/O latency to schedule other threads. The system also performs well in a multiprogrammed environment, though degradation is observed from donating a processor to kernel daemon threads. The authors' evaluation misses the overhead required by the user-level thread system to serialize its notifications to the kernel for ordering. They also do not discuss in detail the implications of implementing user-level threads on top of kernel threads versus scheduler activations. It would have been better if the authors had experimented with multiple, multiprogrammed parallel applications.
5. Confusion
BSD kernels seem to have discarded this design in favor of one kernel thread per user thread. What was the reason for reverting to a slower scheduling mechanism?

Summary
In this paper, the authors argue that kernel threads' performance is inherently worse than that of user-level threads and that kernel threads are not the right abstraction to support user-level management of parallelism. Based on this, the authors propose a new kernel-level interface, scheduler activations, and a new user-level thread package that together provide the same functionality as kernel threads without compromising the performance and flexibility advantages of user-level threads.

Problem
User-level threads provide excellent performance and flexibility, as thread management operations require no kernel intervention and are managed by a run-time library linked into the application. However, multiprogramming, I/O, and page faults can lead to poor performance or even incorrect behavior. On the other hand, kernel threads are too heavyweight for use in many parallel programs, though they avoid system integration problems. Implementations that build user-level threads on top of kernel threads instead of traditional processes also exist, but the same problems remain: kernel threads are scheduled without knowledge of user-level thread state and can block, resume, or be preempted without informing the user-level threads, leading to poor performance.

Contribution
The authors start by specifying the goals for their design of the new kernel interface and user-level package. First, they want the performance of user-level threads to be on par with the best existing user-level thread management systems when no kernel intervention is needed. Second, when kernel intervention is needed, no processor should be idle while a ready thread is present, and if a thread blocks in the kernel, another thread should be schedulable on the processor where the blocked thread was running. To this end, scheduler activations are proposed: they act as a bridge that conveys any kernel event affecting user-level threads to the user-level thread scheduler. The information passed helps the thread scheduler modify the user-level thread data structures, execute user-level threads, and make requests to the kernel. This communication is flexible enough to be used with any scheduling policy or concurrency model. The user-level thread package has control over which threads to schedule and notifies the kernel if the application needs more or fewer processors. The authors adopt a recovery-based approach to handling critical sections instead of a prevention-based one: if a user-level thread is running in a critical section when it is preempted or blocked, the thread is allowed to complete the critical section and then yield the processor.

Evaluation
The authors start the evaluation by showing that when kernel intervention is not needed, on the null-fork and signal-wait operations, the performance of their design is almost the same as that of a pure user-level thread package (with slight degradation). Next, they show the performance penalty of upcalls when kernel intervention is needed; the authors concede a considerable performance degradation even though they expected otherwise, attributing it to the fact that their scheduler activation implementation was not developed from scratch. This sounds like an excuse to me rather than a proper answer. They then evaluate an application's performance and show that with more processors, both the original user-level thread package and the new package are far better than Topaz threads. The divergence between the new package and the original at four to five processors is explained by daemon processes, which cause preemptions when no idle processors are available. Varying the available main memory showed that the new package performs better than both the original threads and Topaz threads; this is attributed to the fact that parallelism can hide I/O latency, which the original threads cannot exploit. Finally, they show that the new package gives a speedup of 2.45 (out of a theoretically possible 3) when the application and multiprogramming generate kernel events.
An evaluation on a mixed workload (I/O-intensive and CPU-intensive) would have better shown whether applications interfere with each other under scheduler activations. The evaluation also fails to show what happens when multiple scheduler activations are being delivered in the system for different user-level threads. What overhead does this add to performance, and how much do scheduler activations cost the system in resources in such a scenario?

Confusion
Is this idea actually implemented by any OS? How resource-intensive is scheduler activation creation and maintenance, and what burden does it put on the system in terms of resources?

1. Summary
Scheduler activations are a new kernel abstraction for parallelism that achieves both the performance of user-level threads and the integration that kernel-level threads provide.

2. Problem
Existing multithreading support, whether at kernel level or user level, is unsatisfactory. Kernel-level threads require a trap every time scheduling happens, which makes them inherently slow. User-level threads cannot react properly when kernel events like blocking on an I/O operation happen, as this information is hidden in the kernel.

3. Contributions
The kernel allocates physical processors directly to address spaces. The user-level scheduler has full knowledge of which thread is running on which processor. Common scheduling can be done without communicating with the kernel, and different applications can use different scheduling policies.
The kernel notifies the user-level scheduler of certain kernel events using upcalls. An upcall is done by creating a new scheduler activation and assigning it a new or preempted processor. Events are sent when a new processor is allocated, an old one is preempted, or a scheduler activation is blocked or resumed. The user-level scheduler can run on the new scheduler activation and schedule threads properly.
The user-level scheduler requests or returns processors with minimal communication with the kernel. Messages are only sent on a transition between having more runnable threads than processors and having more processors than runnable threads. The kernel is responsible for reallocating processors so that no two address spaces remain in the opposite states above.
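A minimal C sketch of that transition rule, assuming hypothetical syscall wrappers for the two hints; the point is that the kernel is told only when the thread/processor balance flips, not on every scheduling operation.

    static int runnable_threads;   /* threads ready to run                   */
    static int processors_held;    /* processors currently allocated to us   */

    extern void sys_add_more_processors(int n);   /* hypothetical syscalls */
    extern void sys_processor_is_idle(void);

    static void thread_became_runnable(void) {
        runnable_threads++;
        /* transition: just crossed from "enough CPUs" to "too much work" */
        if (runnable_threads == processors_held + 1)
            sys_add_more_processors(1);
    }

    static void thread_finished_or_blocked(void) {
        runnable_threads--;
        /* transition: just crossed into having a spare processor */
        if (runnable_threads == processors_held - 1)
            sys_processor_is_idle();
    }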
Threads preempted in critical sections are temporarily resumed until reaching a safe place. To avoid incurring overheads for all locking operations, the resumed thread runs a modified copy of the critical section, which relinquishes the processor at the end.

4. Evaluation
The authors first show that the performance of common thread operations is comparable to that of the user-level FastThreads. Then they measure the latency when kernel upcalls are involved, which is a factor of five worse than the original kernel-level Topaz threads. They attribute this to a poor implementation and cite a factor-of-four improvement in RPC achieved in the same environment simply by recoding Modula-2+ in assembly. Obviously it would be more convincing if they could test an optimized implementation.
They use a parallel algorithm that solves the N-body problem to evaluate application performance. When the workload is compute-bound, their implementation runs as fast as user-level FastThreads. When the workload is I/O-bound, they show a large advantage over FastThreads. They also run two copies of the application at the same time to confirm that the new mechanism performs better in multiprogrammed settings.
The evaluation is nearly thorough, except for the performance of dynamically reallocating processors. The N-body program seems to create all the threads it needs at the beginning, and in Null Fork the authors admit there is a degradation for this reason. But how severe is this problem in real applications?

5. Confusion
The kernel gets control back from running processes via clock interrupts. How can the user-level scheduler get control back from threads without the kernel's help?

1. Summary
Scheduling in a multithreaded, multiprocessor environment is a difficult problem if one wants both high performance and flexibility. In this paper, the authors propose a novel technique that combines kernel support with user-level threading through scheduler activations to attain parallelism. In their approach, each application has complete knowledge of and control over the processors in its virtual multiprocessor, and the kernel notifies the address space's thread scheduler of kernel events using upcalls.
2. Problem
To achieve parallelism, multithreaded processes execute either with user-level threads, which are flexible and fast but perform poorly during page faults and I/O, when the kernel preempts user threads and priority scheduling might go wrong; or with kernel threads, which avoid the system integration problems but have expensive thread management functions (traps, copying, checks) and are hard to generalize. The ultimate goals are to keep processors from idling, maintain priorities, and reschedule the processors of blocked threads.
3. Contributions
This work urges the reader to consider the right abstractions, the right division of function between kernel and user-level libraries, and the information flow needed to make better decisions. In this design the kernel allocates processors while the user-level thread system (ULTS) schedules threads on them, being notified about changes in #(processors) and letting the kernel know its requirements. A scheduler activation is an allocated processor delivered to the ULTS via an upcall, like an allocation of a CPU time slice; user code runs in SAs, not kernel threads, which is a neat abstraction. What they achieve with this is the best of both worlds with respect to the stated goals. They avoid deadlocks in an interesting way: with compiler support they mark critical sections, copy them, and return control only after completion. The optimizations of reusing SAs and providing fair resource allocation through multi-level feedback are apt. For preemption, the kernel lets the user level decide which threads should be preempted, and may require pinning memory to avoid page faults.
4. Evaluation
With SA, they aim to show that it is as fast as FastThreads (user-level threads) when CPU-bound and better than kernel threads when I/O-bound. They implement a prototype of their design on the DEC SRC Firefly multiprocessor. The design almost matches the performance of FastThreads, given the rearrangements and notifications, on the null-fork and signal-wait workloads. For upcall performance they observe that with I/O requests their design performs worse (5x!) than Topaz threads (kernel threads) and blame this on implementation limitations. But then they analyze, in detail, the performance of a relevant parallel application and demonstrate the efficiency of SA: better speedup with increasing #(processors), and better execution times. They carefully examined each aspect and were honest about their results, justifying them and proposing possible solutions. I could only point out that they could have tested on more cores (~100s), although they do say at the end that they plan to implement it in C Threads and Mach.
5. Comments/Confusion
How do marking critical sections and then copying and checking them not cause overhead? The section on preemption handling using SAs was hard to comprehend. Is this still a poor design for I/O-intensive applications (the future work suggested they would build the kernel support from scratch)?

1. Summary
Threads are very commonly used in parallel programming, and they can be supported either by the kernel or by user-level library code. Both have their own pros and cons, though. In this paper, the authors propose a new system involving communication between the application address space and the kernel that allows them to combine the advantages of both.

2. Problem
Traditionally, there are two types of threads: user-level and kernel-level. User-level threads are fast and flexible with respect to programming models and environments, but they are built without kernel support, which can cause incorrect behavior. Kernel-level threads, on the other hand, are slow due to the extra cost of kernel accesses for thread operations, and they cannot be flexibly adapted to each application. Their main benefit is the added coordination and functionality when kernel operations are invoked.
In general, though, kernel threads are functionally worse than user-level threads, and neither is an entirely satisfactory alternative. The authors argue that kernel threads are the "wrong abstraction" entirely for supporting user-level threads, as they block, resume, and are scheduled without input from user-level state. This can cause physical processors to be "lost", as kernel-level threads may be unnecessarily blocked, or there may be more kernel-level threads than processors available.

3. Contributions
The authors propose a mixed system that relies on kernel and address space coordination to enhance performance. The system relies on a structure called a "scheduler activation" (SA) to carry messages from the kernel to the user level, and it serves as an execution context for running user-level threads. When a program starts, the kernel creates an SA, assigns it to a processor, and upcalls into the application address space, which then initializes and starts the application thread. There are three main sets of interactions:
1) kernel → address spaces: allocates processors. Can change the number of processors assigned to an address space at any given point. Notifies the user-level thread system when # is changed via a scheduler activation.
2) address space → kernel: notifies the kernel of the subset of operations that can affect processor allocation decisions. Can ask for more or fewer processors. It does not need to notify the kernel of every little decision.
3) address space → threads: decides which threads to run and when on the set of the processors that it is given.
The main goal is to maintain several invariants (see the toy sketch below). First, the kernel has explicit and undeniable control over the allocation of processors. Second, there are always exactly as many running scheduler activations as processors assigned to an address space. Third, no processor will idle so long as there is work to be done. Most importantly, all of this must be hidden from the user.
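As a toy illustration of the second invariant, consider how kernel-side event handlers could keep the count of running activations equal to the processors assigned to an address space; this is a simplified C model, not the real bookkeeping.

    #include <assert.h>

    static int processors = 2;   /* processors the kernel assigned to us  */
    static int running    = 2;   /* activations currently running on them */

    /* An activation blocks in the kernel (I/O, page fault): its processor is
       freed, and the kernel creates a fresh activation on that processor to
       upcall the "blocked" notification. The running count is preserved. */
    static void on_block(void) {
        running--;                        /* blocked activation leaves its CPU */
        running++;                        /* fresh activation takes the CPU    */
        assert(running == processors);
    }

    /* An activation unblocks: to deliver the notification the kernel preempts
       some running activation and reuses its processor, so again one stops
       and one starts, and the invariant holds. */
    static void on_unblock(void) {
        assert(running == processors);
    }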

4. Evaluation
They implemented their system by modifying Topaz, the operating system for the Firefly multiprocessor. Instead of a pure scheduler activation system, though, they allow address spaces to use kernel threads. They test their custom threads against both user-level and kernel threads, with two main types of workloads: compute-bound and I/O-bound. For compute-bound workloads, the new threads perform a little better than user-level threads and much better than kernel-level threads percentage-wise. For an I/O workload, the custom threads have better execution time than both user and kernel threads regardless of the amount of available memory.
In general, they discuss the discrepancies in the results well, explaining how the different default behaviors of kernel and user threads can lead to different results. The way the results are displayed confuses me a little, though: why use % speedup for the compute workload and not show absolute numbers? Also, they present a dichotomy of workloads (i.e., either compute or I/O, but not both). Why not show a mixed workload?

5. Confusion
Would their results have changed if they had used their original implementation instead of letting address spaces use kernel threads? Also, are these threads used today? What exactly is the critical-section code processing they do, and how labor-intensive is it?

1. Summary
The paper introduces a user-thread scheduling method that lets a user thread get the advantages of a kernel thread. The kernel creates execution contexts called scheduler activations, runs a user-level thread on top of each, and deallocates an activation when its thread blocks or finishes its job.

2. Problem
Both kernel-level threads (KLTs) and user-level threads (ULTs) have advantages and disadvantages relative to each other. KLTs are less efficient than ULTs because kernel-level threads are heavier than user-level ones. On the other hand, KLTs avoid extra scheduling indirection because the kernel schedules them directly on physical processors, while ULTs run on virtual processors (processes or kernel threads) that the system schedules and time-multiplexes.
ULTs running on top of KLTs can cause several problems. First, the number of KLTs must be kept large enough to avoid idling processors when there are ready threads. Second, the combination suffers from scheduling problems: for example, a ULT may spin-wait on a lock whose holder's KLT has been preempted, or a KLT running a ready ULT may be preempted in favor of a KLT whose ULT is idling.

3. Contributions
The main contribution of the paper is providing an environment for executing user-level threads on top of kernel support in an efficient way. To do so, the paper adopts scheduler activations, each of which runs a thread on top of it. Scheduler activations are created by the kernel, and also on request from user-level threads: if there are fewer activations than runnable threads, the kernel creates more. An activation can sit idle when its thread finishes its job and no ready threads remain. Upcalls are used to establish the connection between user threads and scheduler activations.

4. Evaluation
To motivate the work, the paper evaluates Null Fork and Signal-Wait to check how much burden previous systems carry in the creation, scheduling, and execution of a thread or process; the results show that processes and kernel threads are slower than user-level threads. In the evaluation section, performance is compared on the same operations: user-level threads on top of scheduler activations are a little slower than FastThreads, and upcall execution is worse than Topaz threads due to limitations of the implementation. Is it acceptable to evaluate just one application?

5. Confusion
I cannot understand the difference between a scheduler activation and a kernel thread, or how scheduler activations work.

1. Summary
Threads are important abstractions for parallel programming. User-level and kernel-level thread implementations each have drawbacks that make them unsuitable for high-performance parallel computing. This article presents the design and development of a new mechanism for managing threads, scheduler activations, which gives the best of both the kernel-level and user-level worlds.

2. Problem
Thread support is implemented either in a user-level library or in the kernel itself, and each implementation has drawbacks. User-level threads execute within the context of a process or a kernel thread, treating it as a virtual processor; but these virtual processors can be preempted or made to wait for I/O or page faults, causing poor performance or incorrect behavior of the user-level threads. User threads are also difficult to implement with the same level of integration as kernel threads, which leads to issues like high-priority threads being preempted in favor of idle ones, and a high chance of deadlock when the library multiplexes user threads onto a fixed number of kernel threads. Kernel threads solve these issues, but they are an order of magnitude slower than user threads, making them unsuitable for high-performance parallel applications.

3. Contributions
The authors set out to design a threading mechanism that combines the functionality of kernel threads with the performance and flexibility of user threads, and the main contribution of this work is the design and development of scheduler activations, which meets that goal. The obstacle was that the necessary control and scheduling information was distributed across the kernel and the application; to address it, scheduler activations explicitly vector kernel events that can affect an address space's threads to that address space. The user-level thread scheduler can then act on the information sent by the kernel; these vectored events indicate whether a thread has blocked or unblocked in the kernel or has been preempted. And since user threads are not permanently attached to any scheduler activation, the user-level thread scheduler can schedule whatever thread its policy demands. The article also presents a novel way to handle threads in critical regions: the kernel notifies the user-level thread scheduler if it preempts a thread in a critical region, and the scheduler can run this thread at higher priority and have it yield as soon as the critical region finishes execution.

4. Evaluation
The article evaluates all the aspects that can affect thread performance. It presents the performance of non-I/O-bound threads using the null-fork and signal-wait benchmarks and explains the slight degradation of the scheduler activation mechanism relative to the pure user-level thread implementation. The paper also presents a thorough evaluation of I/O-bound processes in a multiprocessor environment and measures the upcall performance, which the authors admit is worse than expected, attributing it to the fact that the implementation was based on Topaz threads instead of being developed from scratch. But I think the authors could have explored this delay more and provided a better explanation than guesswork.

5. Confusion
The mechanism sounds really cool and implementable; is it already supported in any real systems?

1. Summary
This paper describes a new mechanism, scheduler activations, to tackle multithreading to provide benefits of both user-level threads (ULTs) and kernel-level threads (KLTs).
2. Problem
At the time, systems provided threads by supporting them either at user level or at kernel level, and each approach has its benefits and drawbacks. Among the benefits of ULTs are performance and flexibility: there is no need for a kernel trap and context switch when switching between threads, and the user-level library can implement a custom scheduling algorithm to fit the needs of the specific program. A drawback of ULTs, however, is that if a thread blocks (say on an I/O operation), the entire virtual processor it was assigned to will be blocked and wasted.
On the other hand, KLTs avoid this drawback of ULTs because the kernel schedules each thread on a physical processor, and it simply takes the blocked thread out of the runnable queue and allows other threads/processes to execute. However, KLTs are heavyweight and slow for reasons mentioned above.
3. Contributions
For these reasons, the authors explore a hybrid design to gain the benefits of both worlds. They introduce a kernel interface and a user-level thread library that work together to provide KLTs' functionality while gaining the performance and flexibility of ULTs.
In order to make this work, the kernel needs to have some information about the user-level scheduling, such as how many threads exist, and the user-level library needs to know about kernel events, such as I/O. The authors utilize system calls and upcalls to make this communication easy, and to make it even faster, the kernel might batch messages to send to the library in one upcall.
In this design, the library gets virtual processors and has complete control over how to schedule threads on them. These virtual processors are really scheduler activations that allow the library to run whatever thread it chooses; scheduler activations are also the vehicle by which the kernel makes an upcall into the library. Through the dynamic creation and deletion of scheduler activations, the kernel ensures a program has exactly as many running activations as processors assigned to it.
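As a rough picture of what a scheduler activation carries, here is a hypothetical C struct. The paper notes that each activation has two execution stacks, one in the kernel and one in the application address space where the library's upcall handler runs; the field names here are invented for illustration.

    typedef struct user_thread user_thread;   /* opaque, library-defined */

    typedef struct scheduler_activation {
        int          id;             /* kernel's identifier for this activation */
        void        *kernel_stack;   /* used while executing inside the kernel  */
        void        *upcall_stack;   /* user-space stack for the upcall handler */
        user_thread *current;        /* user-level thread running on it, if any */
    } scheduler_activation;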
Another contribution is making a copy of each critical section, ending with a yield call, so that once a thread in a critical section exits, it can give back control.
4. Evaluation
The authors tested everything: the performance of thread operations such as fork, the upcall performance, performance as the number of processes increases, performance as the amount of memory decreases (which is great because lots of applications perform I/O), and performance as multiple multithreaded programs run simultaneously. Regarding upcall performance, they admit it does not perform well and say that had it been written better and in assembler it would have performed much better, but I am not sure this is 100% true. Is Modula-2+ not compiled down to assembly language?
5. Confusion
Is this concept used today? Has it been evaluated on NUMA systems?
