Abstract: Capriccio is a scalable thread package for use with high-concurrency servers. While recent work has advocated event-based systems, we believe that thread-based systems can provide a simpler programming model that achieves equivalent or superior performance.

By implementing Capriccio as a user-level thread package, we have decoupled the thread package implementation from the underlying operating system.

What is achieved by using user-level threads?

As a result, we can take advantage of cooperative threading, new asynchronous I/O mechanisms, and compiler support.

Using this approach, we are able to provide (1) scalability to 100,000 threads, (2) efficient stack management, and (3) resource-aware scheduling.

Linked stack management: It minimizes the amount of wasted space by providing safe, small, and non-contiguous stacks that grow or shrink at run time.

Resource-Aware Scheduling: It allows thread scheduling and admission control to adapt to the system's current resource usage. This technique uses a blocking graph that is automatically derived from the application to describe the flow of control between blocking points in a cooperative thread package.
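A toy sketch of this blocking-graph idea (the class and field names here are invented for illustration, not taken from the paper): nodes are blocking points, edges carry running averages of the resources consumed between two points, and the scheduler can use those averages to favor threads that are predicted to release a scarce resource.

```python
class BlockingGraph:
    """Toy blocking graph: edges between blocking points carry
    running averages of resources consumed along that edge."""
    def __init__(self):
        self.edges = {}  # (src, dst) -> {"cpu": avg, "memory": avg, "n": count}

    def record(self, src, dst, cpu, memory):
        stats = self.edges.setdefault((src, dst),
                                      {"cpu": 0.0, "memory": 0.0, "n": 0})
        stats["n"] += 1
        stats["cpu"] += (cpu - stats["cpu"]) / stats["n"]
        stats["memory"] += (memory - stats["memory"]) / stats["n"]

    def predicted_memory(self, src, dst):
        return self.edges.get((src, dst), {"memory": 0.0})["memory"]

def pick_next(graph, runnable, memory_pressure):
    """Under memory pressure, favor the thread whose next edge is
    predicted to release the most memory; otherwise take the first
    runnable thread (a stand-in for round-robin)."""
    if memory_pressure:
        return min(runnable,
                   key=lambda t: graph.predicted_memory(*t["next_edge"]))
    return runnable[0]
```

Here a thread that has just written its response (and is about to free its buffers) would be preferred over one that is about to allocate more memory.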

Unfortunately, event-based programming has a number of drawbacks, such as stack ripping (splitting a function's control flow across handlers), manual stack management, and more difficult debugging.
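The stack-ripping drawback can be illustrated with a toy request handler (the handler names are made up): the event-based version splits one logical task across callbacks and must carry its state along by hand, while the threaded version is straight-line code whose state simply lives in local variables.

```python
# Event style: one logical task ripped into separate handlers; state
# must be packaged up and passed along by hand.
def on_request(state, done):
    state["path"] = state["raw"].strip()
    on_file_read(state, done)          # pretend the disk read completed

def on_file_read(state, done):
    state["body"] = "contents of " + state["path"]
    done(state["body"])

# Thread style: the same task as straight-line code; intermediate
# values simply live on the thread's stack.
def handle_request(raw):
    path = raw.strip()
    body = "contents of " + path       # a real server would block here
    return body
```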

Thread Design Principle: User-level versus kernel-level threads.
While user-level threads and kernel-level threads are both useful, they solve fundamentally different problems.

How are user-level threads different from kernel-level threads?
Do user-level threads require a user-level scheduler? How do fibers relate to user-level threads?

Kernel threads are primarily useful for enabling true concurrency via multiple devices, disk requests, or CPUs.

User-level threads are really logical threads that should provide a clean programming model with useful invariants and semantics.

How do user-level threads provide a clean programming model?
Do they avoid manual stack management?
Do they avoid the stack-ripping issue of event-based programming?
Do they avoid the debugging issues of event-based programming?
Do they provide automatic stack management?

It is argued that any clean semantics for threads requires decoupling the threads of the programming model (logical threads) from those of the underlying kernel. Decoupling the programming model from the kernel is important for two reasons:
	First, there are substantial variations in interfaces and semantics among modern kernels, despite the existence of the POSIX standard.
	Second, kernel threads and asynchronous I/O interfaces are areas of active research.

Capriccio: This thread package achieves our goals with the help of three key features:

	Scalability: Capriccio improves the scalability of basic thread operations by using user-level threads with cooperative scheduling, by taking advantage of a new asynchronous I/O interface, and by engineering the run-time system so that all thread operations are O(1).

How do user-level threads improve scalability?
Is there a relationship between scalability and cooperative scheduling?

	Linked stacks: A mechanism for dynamic stack growth that solves the problem of stack allocation for a large number of threads.

How do traditional thread systems allocate stack space?
Is there a relationship between scalability and stack-space allocation?

Capriccio uses a combination of compile-time analysis and run-time checks to limit the amount of wasted stack space in an efficient and application-specific manner.

	Resource-aware scheduler: It extracts information about the flow of control within a program in order to make scheduling decisions based on predicted resource usage.

User-level threads: They provide performance and flexibility, but unfortunately they also complicate preemption and can interact badly with the kernel scheduler.

How do the issues of preemption and bad interaction with the kernel scheduler arise?

	Flexibility: Kernel-level thread scheduling must be general enough to provide a reasonable level of service for all applications; the kernel cannot tailor the scheduling algorithm to fit a specific application. Fortunately, user-level threads do not suffer from this limitation, and the user-level scheduler can be built along with the application.

What makes user-level threads extremely lightweight?
Why are kernel-level threads comparatively heavier?

	Performance: User-level threads can greatly reduce the overhead of thread synchronization. In the simplest case of cooperative scheduling on a single CPU, synchronization is nearly free.

Why is synchronization nearly free under cooperative scheduling?
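A minimal sketch of why (thread bodies are modeled here as Python generators that yield at blocking points; the names are illustrative): because a cooperative thread on a single CPU cannot be preempted between testing and setting a flag, a mutex can be a plain Boolean, with no atomic instructions and no kernel crossings.

```python
class CoopMutex:
    def __init__(self):
        self.locked = False

    def acquire(self):
        while self.locked:
            yield            # cooperative yield; scheduler runs other threads
        self.locked = True   # no race: nothing ran between the test and the set

    def release(self):
        self.locked = False

def worker(mutex, log, name):
    yield from mutex.acquire()
    log.append(name + ":in")
    yield                    # a blocking point while holding the lock
    log.append(name + ":out")
    mutex.release()

def run(threads):
    # Trivial round-robin cooperative scheduler.
    while threads:
        t = threads.pop(0)
        try:
            next(t)
            threads.append(t)
        except StopIteration:
            pass
```

With this scheduler, thread "b" spins in acquire() only at yield points, so the critical sections never interleave even though the lock is just a flag.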

Even in the case of preemptive threading, user-level threads offer an advantage: they do not require a kernel crossing for every synchronization operation.

Does Capriccio implement thread preemption?
How do futexes help user-level threads?
http://www.kernel.org/doc/man-pages/online/pages/man7/futex.7.html
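A heavily simplified model of the futex idea from the page above (the class and the kernel_crossings counter are invented for illustration, and no real syscall is made): the uncontended acquire and release complete entirely in user space, and the kernel is entered only when a thread must actually sleep or be woken.

```python
class FutexLikeLock:
    """Illustrative two-phase lock: user-space fast path, kernel slow path.
    kernel_crossings counts the (stubbed) syscalls, not real ones."""
    def __init__(self):
        self.state = 0           # 0 = free, 1 = held, 2 = held with waiters
        self.kernel_crossings = 0

    def _futex_wait(self):
        # Stand-in for futex(FUTEX_WAIT, ...): the caller sleeps in the kernel.
        self.kernel_crossings += 1

    def acquire(self):
        if self.state == 0:      # fast path: resolved entirely in user space
            self.state = 1
            return
        self.state = 2           # slow path: mark contended, sleep in kernel
        self._futex_wait()

    def release(self):
        if self.state == 2:
            self.kernel_crossings += 1   # futex(FUTEX_WAKE, ...) for waiters
        self.state = 0
```

A real implementation would use an atomic compare-and-swap for the state transitions; the point here is only which paths touch the kernel.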


Finally, memory management is more efficient with user-level threads. Kernel-level threads require data structures that eat up valuable kernel address space, decreasing the space available for I/O buffers and other resources.

Disadvantages of user-level threads:

When a user-level thread executes a blocking I/O call, the user-level threading package intercepts the call and replaces it internally with a non-blocking equivalent.

How does a blocking I/O call get converted into a non-blocking one?
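A minimal sketch of one way such interposition can work (a Python stand-in for the C wrappers; the generator's yield models suspending the user-level thread): the descriptor is switched to non-blocking mode, and a would-block error becomes a cooperative yield instead of blocking the whole process.

```python
import os

def thread_read(fd, n):
    """Replacement for a blocking read() in a user-level thread package:
    it never blocks the process, only the calling (logical) thread."""
    os.set_blocking(fd, False)
    while True:
        try:
            data = os.read(fd, n)
            return data            # delivered via StopIteration.value
        except BlockingIOError:    # EAGAIN: no data yet
            yield                  # suspend; scheduler runs other threads
```

The scheduler would resume the generator once the I/O layer (e.g. epoll) reports the descriptor readable.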

Non-blocking I/O generally requires more kernel crossings than its blocking equivalent.

Why does non-blocking I/O require more kernel crossings?

What if the required disk data is already in the kernel buffer cache?

Why can't Capriccio take advantage of multiple processors? Is it due to cooperative scheduling?

Implementation: Internally, Capriccio uses the latest Linux asynchronous I/O mechanisms, i.e., epoll for pollable file descriptors (e.g., pipes and sockets) and Linux AIO for disk. If these mechanisms are not available, Capriccio falls back on the standard UNIX poll for pollable descriptors and a pool of kernel threads for disk I/O.
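A small Python stand-in for this selection logic (Capriccio itself is C; this only mirrors the epoll-with-poll-fallback structure for pollable descriptors, and omits disk I/O entirely):

```python
import os
import select

def wait_readable(fd, timeout=1.0):
    """Wait for fd to become readable, preferring epoll (Linux) and
    falling back on standard poll where epoll is unavailable."""
    if hasattr(select, "epoll"):            # Linux: the preferred mechanism
        ep = select.epoll()
        ep.register(fd, select.EPOLLIN)
        events = ep.poll(timeout)
        ep.close()
        return [f for f, _ in events]
    p = select.poll()                        # portable fallback
    p.register(fd, select.POLLIN)
    return [f for f, _ in p.poll(int(timeout * 1000))]
```

In a real scheduler this wait would cover all blocked threads' descriptors at once, rather than one descriptor per call.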


Scheduling: Capriccio has a modular scheduling mechanism that allows the user to easily select between different schedulers at run time.
Synchronization: Capriccio takes advantage of cooperative scheduling to improve synchronization.
How is it possible to synchronize using a simple Boolean locked/unlocked flag in Capriccio? Is this feasible because of cooperative scheduling?
What is a lightweight system call (vsyscall)?
Table 1: Latencies of thread primitives for different thread packages:
Context switch: Why does Capriccio perform better than the others?
What about the cost of the additional memory required for persisting thread state?
Synchronization: Why does Capriccio perform better than the others? Is it due to futexes?

Thread Scalability: Figure 1
Why does Capriccio perform better than the others? Is it due to user-level stack management? Is it due to futexes?

I/O performance: Figure 2
	Why does epoll perform better than the threaded version of Capriccio?
	When concurrency is low, why do the other mechanisms perform better than Capriccio?

I/O performance: Figure 4
	Why does NPTL perform better than Capriccio when the miss rate is very low? Is it due to the asynchronous I/O interface used by Capriccio?

Linked stack management: It uses dynamic stack allocation to reduce the virtual memory dedicated to stacks. In order to implement dynamically sized stacks, the call-graph analysis identifies the call sites at which checkpoints should be inserted.
What are checkpoints and checkpoint placement, and how do function pointers affect them?
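A toy model of the linked-stack mechanism (the names and bookkeeping are invented; real checkpoints are emitted by the compiler and manipulate actual stack chunks): the stack is a chain of small chunks, and a checkpoint links in a new chunk only when the frames reachable before the next checkpoint might not fit, so wasted space is bounded per chunk rather than per thread.

```python
CHUNK_SIZE = 4096          # illustrative chunk size in bytes

class LinkedStack:
    def __init__(self):
        self.chunks = [CHUNK_SIZE]   # free bytes remaining in each chunk

    def checkpoint(self, worst_case_frames):
        """Inserted at call sites chosen by the compile-time call-graph
        analysis; worst_case_frames bounds the stack needed before the
        next checkpoint."""
        if self.chunks[-1] < worst_case_frames:
            self.chunks.append(CHUNK_SIZE)       # link a new chunk
        self.chunks[-1] -= worst_case_frames

    def leave(self, worst_case_frames):
        """Undo a checkpoint on return, unlinking an emptied chunk."""
        self.chunks[-1] += worst_case_frames
        if self.chunks[-1] == CHUNK_SIZE and len(self.chunks) > 1:
            self.chunks.pop()
```

Function pointers complicate the compile-time side of this: the analysis cannot always tell which callee runs, so it must bound over all possible targets or insert a more conservative checkpoint.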
 Memory Benefits:
	How does dynamic stack allocation affect paging? Why is user-level time almost the same for dynamic and static stack allocation, but not kernel-level time?

Resource-aware scheduling: One of the advantages claimed for event systems is that their scheduling can easily adapt to the application's needs. Event-based applications are broken into distinct event handlers, and computation for a particular task proceeds as that task is passed from handler to handler. This architecture provides two pieces of information that are useful for scheduling.
	The current handler for a task provides information about the location of the task in the processing chain, and the associated queue lengths.
	How is this information used by resource-aware scheduling?

How is resource-aware scheduling different from the policy used by SEDA?

Yield Profiling: The author mentions that the close() system call, which closes a socket, can sometimes take 5 ms even though the socket was opened in non-blocking mode.
The author suggests using multiple kernel threads as a proper solution to this problem.

How can multiple kernel threads avoid this issue?
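One way to sketch that suggestion (Python's thread pool stands in for a pool of kernel threads, and slow_close is a made-up stand-in for the misbehaving close()): the potentially blocking syscall is handed to a worker kernel thread, so the cooperative user-level scheduler keeps running other logical threads in the meantime.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a pool of kernel threads dedicated to blocking syscalls.
pool = ThreadPoolExecutor(max_workers=4)

def slow_close(duration=0.01):
    # Hypothetical close() that unexpectedly blocks for a few milliseconds.
    time.sleep(duration)
    return "closed"

def close_async(duration=0.01):
    """Submit the blocking call to the pool; the caller is free at once
    and can collect the result at a later blocking point."""
    return pool.submit(slow_close, duration)
```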

Web Server Performance (Figure 9): It is stated that both Capriccio and Haboob (an event-based web server) perform non-blocking network I/O with the standard poll() and use a thread pool for disk I/O.

Why does Haboob perform better than Capriccio? Is it due to resource-aware scheduling?