

An Extensible Architecture for Building Certified Concurrent OS Kernels

Ronghui Gu, Zhong Shao et al, Yale University, FLINT Research Group

" Complete Formal Verification is the only way to guarantee that a system is free of programming errors "

- seL4 (SOSP '09)

#### Shared Memory Consistency ???

- "Proofs about concurrent programs are much harder than proof about sequential programs" (seL4, SOSP '09)
  - I/O concurrency
  - $\circ$  Multithreading
  - Multiprocessors
- Can't use a big lock, because performance.
  - $\circ$  Must use fine grained locks.
- "Verification to a kernel version with fine grained locking will far exceed the cost already paid for verifying the single core version" (Peters, ApSys '15)

#### What to prove to verify an OS kernel?

#### • Functional Correctness

- Specification S, Kernel code K, User Program P
- $\circ \quad \ \ \text{For all user program P, } K \bowtie P \text{ refines } S \bowtie P$

#### • Liveness

- All the system calls will eventually return
- Pretty hard, at the scale of the concurrent kernels

## ASIDE

# new technical contributions

certified concurrent layers

logical log + hardware scheduler + environment context

push/pull model

multicore machine lifting

## certified objects

specification of modules to trust









primitives































## verify a sequential kernel [POPL'15]

## kernel

code

#### seq machine



### seq machine

## concurrency

## extensibility is the key to support

## contributions

seq machine

## support concurrency

## contributions

| trap   |             |  |
|--------|-------------|--|
| virt   |             |  |
| proc   |             |  |
| thread |             |  |
| mem    |             |  |
|        | seq machine |  |

## contributions

- reuse

| trap   |
|--------|
| virt   |
| proc   |
| thread |
| mem    |
|        |

**CPU-local machine** 

## contributions

-reuse

| trap   |  |
|--------|--|
| virt   |  |
| proc   |  |
| thread |  |
| mem    |  |

spin-lock

**CPU-local machine** 







multico













## fine-grained lock
























# hardware scheduling step 1: hardware s private1 og atom2







































### **Certified Abstraction Layer**

- A language construct  $(L_1, M, L_2)$  and a mechanized proof object
  - layer implementation M, built on top of the interface L1 (the underlay) is a contextual refinement of the desirable interface L2 above (the overlay).
- A **deep specification**  $L_2$  of a module *M* captures everything *contextually observable* about running the module over its underlay L1.
  - Once we have certified M with a deep specification L2, there is no need to look at M again.
  - Any property about M can be proved by L2 alone.

#### Certified Layers : (L1, M, L2) and a mechanized proof object



Implementation Independence

#### **Simulation Relation**



#### **Composition of Deep Specifications**



# Layer Design



Figure 5: The contextual refinement chain from multicore hardware model  $\Pi_{x86mc}$  to CPU-local model  $\Pi_{loc}$ 

### **Multicore Hardware Model**

- Allows all CPUs to access the same piece of memory simultaneously
- Logically distinguish
  - Private Memory: single CPU/threads (no need to synchronize)
  - Shared Memory: multiple CPU/threads (synchronize using atomic hardware instructions)
- Each shared memory operation can be viewed as if it were atomic.

### Atomic Object

- Abstraction of well-synchronized shared memory.
  - $\circ$  Set of primitives
  - Initial State
  - Logical log
- Current state can be Initial State + Replay log

# Step 0: Multicore Hardware Model II, x86mc

- Allows arbitrary interleavings at the level of assembly instructions.
  - Hardware will non-deterministically choose 1 CPU.
  - Execute next assembly instruction on that CPU.
- Each instruction is classified as:
  - Atomic, Shared and Private
- Only atomic operations generate events.
  - Logical log = [0.atom1, 1.atom2]



## Step 1: Model + Hardware Scheduler $\Pi_{hs}$

- Add hardware scheduler  $\varepsilon_{hs}$  that specifies a particular interleaving.
  - Deterministic machine model
- Insert logical switch points.
  - Query the scheduler to get the CPU id that will execute the next instruction.
  - All the switch decisions are stored in the log.



### **Contextual Refinement**

- We say that layer  $L_0$  contextually refines layer  $L_1$  if and only if
  - $\circ$  ~ For any P that does not go wrong on  ${\rm I\!I}_{_{L1}}$ 
    - P does not go wrong with piLo on an configuration
- The behavior of running a program P over this model with a hardware scheduler  $\varepsilon_{hs}$  is denoted by  $[[P]]_{hs} = \{ \Pi_{hs}(P, \varepsilon_{hs}) | \varepsilon_{hs} \in EC_{hs} \}$ .
- Theorem:
  - $\circ \quad \forall \mathsf{P, [[P]]}_{\mathsf{x65mc}} \text{ refines [[P]]}_{\mathsf{hs}}$
#### Step 2: Machine with local copy of shared memory







# **END ASIDE**

# Certifying the mC2 Kernel

- Layer Architecture
- System Architecture
- Spinlocks
- Memory Management
- Thread Management

#### System Architecture



Figure 3: System architecture for the mC2 kernel

#### Layer Architecture



Figure 2: Contextual refinement between concurrent layers

## Certifying the mC2 Kernel

- Design abstraction layers in a way
  - complex interdependent kernel components are untangled
  - well-organized object stack with clean specification.
- Bottom layer that connects to the CPU-local machine model  $\Pi_{loc}$ 
  - instantiated with a particular active CPU.
- Trap handler is the top layer that provides system call interfaces
  - $\circ$  serves as a specification of the whole kernel
  - instantiated with a particular active thread running on that active CPU.
- Any global property proved at the top abstraction layer can be transferred down to the lowest hardware machine.

#### Spin lock module -- Ticket lock



```
typedef struct {
                     9 t= ►FAI(&L[i].ticket);
volatile uint ticket; 10 while(>L[i].now!=t){}
  volatile uint now;
                               ▶pull (i);
                           11
3
4 } ticket lock;
                           12 }
sticket lock L[NUM LOCK];
                           13 void rel lock (uint i) {
                               ▶push (i);
6
                           14
7 void acq lock (uint i) { 15
                               ▶L[i].now ++;
  uint t;
                           16 }
8
```

Figure 7: Pseudocode of the ticket lock implementation

## **Starvation Freedom for locks**

#### • Invariant

- Environment context that holds lock(i) will never acquire lock(i) again before releasing it.
- Always releases lock(i) within k steps
- Acquiring the ticket-lock eventually succeeds
  - $\circ \quad \text{Mechanized in Coq} \\$
  - Proof idea : upper bound on the number of events generated by context of threads before the current one.
- MCS locks
  - Better scalability than ticket locks
  - Similar Proof for starvation freedom

## Memory Management

- A thread acquiring a lower layer lock cannot acquire a lock defined at a higher layer.
  - Prevent deadlock
- Global invariants:
  - Paging is enabled only after all the page maps are initialized
  - Pages that store kernel specific data have kernel only permission in all page maps
  - The kernel page map is an identity map
  - Non shared parts of user processes' memory are isolated

```
int palloc (uint tid) { 10 if (fp != nps) {
  2
   return ERROR;
               12 AT[i].ref = 1;
3
  ▶acq lock (lock AT); 13 cn[tid].guota --;
4
  uint i=0,fp=nps;
                14 }
5
  while(fp==nps&&i<nps){    is else fp = ERROR;</pre>
6
   if (!AT[i].free) 16 ▶rel lock (lock AT);
7
   fp = i;
                     17 return fp;
8
   i++; }
                      18 }
9
```

Figure 8: Pseudocode of palloc



Figure 9: Scheduling routines yield, sleep, and wakeup

# Evaluation

#### Contents

- Proof effort and cost of change
- Performance evaluation
- Concurrency overhead
- IPC Performance
- Hypervisor Performance

## Proof Effort and Cost of Change

- Code : 6500 lines of C + x86 assembly over 2 person years
- Specification (these are part of the trusted computing base):
  - 943 lines of code for the lowest layer axiomatizing the hardware machine model.
  - 450 lines of code specifying the abstract system call interfaces.
- 5249 lines of additional specifications
  - For auxiliary definitions, theorems, invariants etc.
- 50k lines of Coq proofs.
  - $\circ$  At least  $\frac{1}{3}$  of this is semi-automatically generated and redundant.

## **Performance Evaluation**

- Not the main concern, but still run benchmarks.
- Since power control code has not been verified
  - Turbo boost and power management features of the hardware have been disabled.

## **Concurrency Overhead**

- Runtime overhead is dominated by:
  - Latency of the spinlocks
  - Contention of shared data.
- All shared objects pre-allocated a fine-grained lock.
- MCS locks vs Ticket Locks
  - Efficiency remains the same for ticket locks but increases for the MCS locks.



**Figure 11:** The comparison between actual efficiency of ticket lock and MCS lock implementations in mC2

## IPC Overhead mC2 vs seL4

- seL4 fastpath/slowpath:
  - 1200/1800 cycles
- Mc2 IPC:
  - ~3800 cycles
- seL4 is optimized and tailored for hardware performance.

#### Hypervisor Performance



**Figure 13:** Normalized performance for macro benchmarks running over Linux on KVM vs. Linux on mC2; the baseline is Linux on bare metal; a smaller ratio is better

# Conclusions

- Extensible architecture
- Clean, Rigorous, Layered, Practical