Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism

This paper argues that the performance of kernel threads is inherently worse than that of user-level threads, rather than this being an artifact of existing implementations, and that managing parallelism at user level is essential to high-performance parallel computing. The problems encountered in integrating user-level threads with other system services are a consequence of the lack of kernel support for user-level threads in multiprocessor operating systems. In other words, kernel threads are the wrong abstraction on which to support user-level management of parallelism.

One way to construct a parallel program is to share memory between a collection of Unix-like processes, each consisting of a single address space and a sequential execution stream within that address space. Why are multiple processes a bad idea for achieving parallelism? Is it the kernel resources consumed by each process? Is it the context-switch overhead?

The shortcomings of traditional processes for general-purpose parallel programming have led to the use of threads. Threads separate the notion of a sequential execution stream from the other aspects of a process, such as the address space and I/O descriptors. How are threads better than traditional processes for achieving parallelism? Is it because multiple threads share the address space of a single process?

Threads can be supported either at user level or in the kernel. User-level threads are managed by runtime library routines linked into each application, so thread management operations require no kernel intervention. How does this approach provide better performance? Is it the reduced communication between user space and kernel space? Is it the flexibility of the model, which applications can exploit?

Uniprogramming: the simplest model for using memory is uniprogramming without memory protection, where each application runs in a hardwired range of physical memory addresses. Only one process runs at a time, so an application can use the same physical addresses on every run, even across reboots. (In this paper, "uniprogrammed" similarly means the application does not share the machine with other applications.)

Why can user-level threads exhibit incorrect behavior in the presence of multiprogramming? Is it the way the thread package allocates stack memory to each thread? Is it the thread package's management of memory? Why do user-level threads perform poorly in the presence of I/O? Is it because execution becomes sequential, there being only one kernel-level execution context? Is it the inability to overlap I/O with computation?

The parallel programmer has therefore faced a difficult dilemma: employ user-level threads, provided the application is uniprogrammed and does no I/O, or employ kernel threads, which perform worse but are not as restricted. This dilemma is what the paper addresses: "We describe a kernel interface and a user level thread package that together combine the functionality of kernel threads with the performance and flexibility of user level threads."
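As background on why user-level thread operations are cheap, here is a minimal sketch of a purely user-space context switch using the POSIX ucontext API. This is illustrative only, not the paper's FastThreads code: the point is that creating and switching a thread amounts to allocating a stack and saving/restoring registers, with no protection-boundary crossing.

    #include <stdio.h>
    #include <stdlib.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, thread_ctx;

    /* Body of a user-level "thread": runs entirely in user space. */
    static void thread_body(void) {
        printf("user-level thread running\n");
        /* "Yield" back to main: an ordinary user-space context switch,
           just register saves and restores, no system call or trap. */
        swapcontext(&thread_ctx, &main_ctx);
    }

    int main(void) {
        char *stack = malloc(64 * 1024);      /* stack comes from the library, not the kernel */
        getcontext(&thread_ctx);
        thread_ctx.uc_stack.ss_sp = stack;
        thread_ctx.uc_stack.ss_size = 64 * 1024;
        thread_ctx.uc_link = &main_ctx;
        makecontext(&thread_ctx, thread_body, 0);
        swapcontext(&main_ctx, &thread_ctx);  /* create + switch without kernel intervention */
        printf("back in main\n");
        free(stack);
        return 0;
    }

A kernel-thread create or switch, by contrast, pays for a trap, argument copying and validation, and the kernel's own scheduling and dispatching code on every operation.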
The difficulty in achieving these goals on a multiprogrammed multiprocessor is that the necessary control and scheduling information is distributed between the kernel and each application's address space. To be able to allocate processors among applications, the kernel needs access to user-level scheduling information (e.g., how much parallelism there is in each address space). To be able to manage the application's parallelism, the user-level support software needs to be aware of kernel events (e.g., processor reallocations and I/O requests/completions) that are normally hidden from the application.

In this design, each application knows exactly how many (and which) processors have been allocated to it and has complete control over which of its threads are running on those processors. The kernel has complete control over the allocation of processors among address spaces, including the ability to change the number of processors assigned to an application during its execution. To achieve this, the kernel notifies the address space's thread scheduler of every kernel event affecting the address space, allowing the application to have complete knowledge of its scheduling state. The thread system in each address space notifies the kernel of the subset of user-level thread operations that can affect processor allocation decisions, preserving good performance for the majority of operations that need not be reflected to the kernel.

The kernel mechanism used to realize these ideas is called scheduler activations. A scheduler activation vectors control from the kernel to the address space's thread scheduler on a kernel event; the thread scheduler can use the activation to modify user-level thread data structures, to execute user-level threads, and to make requests of the kernel.

The case for user-level thread management: There are significant inherent costs to managing threads in the kernel. One is the cost of accessing thread management operations: the program must cross an extra protection boundary on every thread operation. What are thread management operations? Are they operations like thread_create()/fork() and yield()? Why is this cost so significant? Is it the system calls those operations must make? Is it inherent in any kernel implementation of thread management? Table I shows the performance of implementations of user-level threads, kernel threads, and processes. Why does the table show the Null Fork benchmark? Is it because Null Fork exercises only thread management operations?

Sources of poor integration in user-level threads built on the traditional kernel interface: Kernel threads are the wrong abstraction for supporting user-level thread management, for the following reasons: kernel threads block, resume, and are preempted without notification to the user level, and kernel threads are scheduled without any knowledge of the user-level thread state. If the user-level thread system creates more kernel threads than there are physical processors in order to work around blocking, how does that affect performance? Is it the kernel's lack of knowledge of user-level thread state, i.e., user-level thread priorities and whether a thread is idle or running? Can it cause priority inversion?

Communication between the kernel processor allocator and the user-level thread system is structured in terms of scheduler activations. A scheduler activation serves the following roles: it notifies the user-level thread system of a kernel event, and it provides space in the kernel for saving the processor context of the activation's current user-level thread when the thread is stopped by the kernel. A scheduler activation has two execution stacks, one mapped into the kernel and one mapped into the application's address space. Each user-level thread is allocated its own user-level stack when it starts running; the user-level thread scheduler runs on the activation's user-level stack. Can it run on the user-level thread's stack instead? What would happen when the user-level thread completes and frees that stack?
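A hypothetical sketch in C of the state a scheduler activation carries. The paper specifies these roles (two stacks, space for a stopped thread's context) but gives no concrete layout, so every name and field here is invented for illustration:

    /* Placeholder for saved registers, PC, and stack pointer. */
    typedef struct { unsigned long regs[32]; } machine_state_t;

    /* A user-level thread, managed entirely by the library. */
    typedef struct uthread {
        void *stack;                 /* per-thread stack, allocated when it starts running */
        machine_state_t state;       /* saved context while not running */
        struct uthread *next;        /* ready-list link in the user-level scheduler */
    } uthread_t;

    /* An activation: the kernel's vessel for running code in this address space. */
    typedef struct sched_activation {
        void *kernel_stack;          /* mapped only into the kernel */
        void *user_stack;            /* stack the user-level thread scheduler runs on */
        machine_state_t stopped_ctx; /* kernel saves the current user-level thread's
                                        context here when it stops the thread */
        uthread_t *current;          /* user-level thread running on this activation */
    } sched_activation_t;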
Upcall points:
- Add this processor (processor #)
- Processor has been preempted (preempted activation # and its machine state)
- Scheduler activation has blocked (blocked activation #)
- Scheduler activation has unblocked (unblocked activation # and its machine state)

How does the kernel notify the application when one of its threads blocks in the kernel? Does the kernel initiate a new scheduler activation? Which processors can the kernel use to deliver a new activation to the application? Can the kernel use a processor assigned to one application to upcall into a different application? When the I/O completes, why does the kernel make an upcall? Is it to provide the flexibility that was inherent in user-level threads? Can this approach avoid priority inversion?

Notifying the kernel of user-level events affecting processor allocation: The user-level thread system should notify the kernel only of events that can affect processor allocation decisions. An address space notifies the kernel whenever it makes a transition to a state where it has more runnable threads than processors, or more processors than runnable threads:
- Add more processors (additional # of processors needed)
- This processor is idle ()

Does this model guarantee fairness? Can a misbehaving program consume an unfair proportion of resources?

Critical sections: If a user-level thread is preempted by the kernel or blocks while running in a critical section, performance suffers and deadlock is possible. This is resolved by checking whether the preempted thread was running in a critical section; if so, the thread is continued temporarily via a user-level context switch, and when it exits the critical section it hands control back to the original upcall.

Implementation: Changes are needed on two sides, the kernel and the user-level thread package. The paper makes its changes to the Topaz kernel and to FastThreads, a user-level thread package. The kernel was modified to implement scheduler activations and to perform explicit allocation of processors to address spaces. The user-level thread package was modified to process upcalls, to resume interrupted critical sections, and to provide the kernel with the information it needs for processor allocation.

Processor allocation policy: The policy "space-shares" processors; it respects priorities and guarantees that no processor idles if there is work to do. The implementation also allows an address space to use ordinary kernel threads rather than requiring every address space to use scheduler activations; this was done to preserve binary compatibility with existing applications.

Thread scheduling policy: The kernel has no knowledge of an application's concurrency model or scheduling policy; each application is free to choose its own.

Performance enhancements: Supporting critical sections seems to require a flag that is set whenever a thread is inside a critical section, but that scheme imposes overhead on every lock acquisition and release, whether or not a preemption or page fault ever occurs. The problem is resolved by making an exact copy of every low-level critical section; at the end of the copy, but not of the original, code is inserted to yield the processor back to the original upcall. The common case thus pays no overhead, and only a thread that is resumed after an untimely preemption executes the copy (see the sketch below). In addition, discarded scheduler activations are cached and reused, avoiding unnecessary creation and destruction of activations.
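A schematic sketch of the copied-critical-section idea; all names here are invented, and the real FastThreads implementation operates on the low-level lock code itself, relocating the preempted thread's program counter into the copy:

    typedef struct { volatile int held; } lock_t;

    /* Hypothetical helper: hands the processor back to the upcall that
       temporarily resumed this thread. */
    extern void yield_to_upcall(void);

    /* Original release path: the common case pays no extra cost. */
    void unlock_original(lock_t *l) {
        l->held = 0;
    }

    /* Exact copy of the critical-section code, with one addition at the end.
       A thread found preempted inside the critical section is resumed at the
       corresponding instruction in this copy, so it finishes the section and
       then yields the processor back to the original upcall. */
    void unlock_recovery_copy(lock_t *l) {
        l->held = 0;
        yield_to_upcall();
    }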
Performance:

Thread performance (Table IV): Why does the Null Fork operation take more time with scheduler activations than with the original FastThreads? Is it the extra bookkeeping and communication with the kernel for processor allocation? Why does the Signal-Wait operation take more time with scheduler activations? Is it the cost of supporting critical sections?

Upcall performance: If the cost of blocking or preempting a user-level thread in the kernel via scheduler activations is similar to the cost of blocking or preempting a kernel thread, then scheduler activations are practical. How does critical-section support affect upcall performance? The current upcall implementation is slow because of how it is written; the paper argues that, once tuned, upcall cost should be commensurate with that of kernel thread operations.

Application performance: Figure 2: Why do scheduler activations perform better than the original user-level thread package? Is it the extra preemptions caused by daemon activity under the original package? Figure 3: Why do scheduler activations perform better than kernel threads? Is it the reduced kernel interaction compared with kernel threads? Table V: Why is the performance of scheduler activations better than that of the original user-level threads when the machine is multiprogrammed? Is it because scheduler activations give the application more control over processor allocation?
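For reference, Null Fork measures the cost of creating, starting, and finishing a thread whose body does nothing, and Signal-Wait measures signalling a waiting thread and then blocking. A rough sketch of a Null Fork measurement loop, written here against POSIX threads purely to show what is being timed (the paper measured FastThreads and Topaz primitives, not pthreads):

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    /* The "null" thread body: fork/terminate overhead is all that remains. */
    static void *null_proc(void *arg) { (void)arg; return NULL; }

    int main(void) {
        enum { N = 10000 };
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++) {
            pthread_t t;
            pthread_create(&t, NULL, null_proc, NULL);  /* "fork" */
            pthread_join(t, NULL);                      /* wait for termination */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double usec = ((t1.tv_sec - t0.tv_sec) * 1e9
                       + (t1.tv_nsec - t0.tv_nsec)) / 1e3;
        printf("Null Fork: %.1f usec per create/terminate pair\n", usec / N);
        return 0;
    }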