DoublePlay: Parallelizing Sequential Logging and Replay
-------------------------------------------------------

Thread parallelism  vs  Epoch parallelism

Run two instances of the system. Use the thread parallelism to get fast hints on future states of the system. Use epoch parallelism as the 'actual' execution. Record the order in which threads in an epoch are timesliced on the processor.

A thread-parallel execution runs ahead of the epoch-parallel execution and generates checkpoints from which to start future epochs.


Related Work
------------
SMP-revirt: cost of memory protection faults and false sharing

RecPlay logs only explicit synchronization operations

DoublePlay guarantees deterministic replay for programs with and without data races by logging all non-determinism in the epoch-parallel run.

Deterministic  replay based on search. The search space includes all possible orders of shared-memory accesses that are allowed by the logged information, so it grows exponentially with the number of racing accesses.

One can view DoublePlay as using extra cores to search for an equivalent replay during original run.

While DoublePlay must execute intervals that contain races sequentially, this is unlikely to slow the program significantly because these intervals are likely to be a small fraction of the overall execution time.

Uniparallelism
--------------
Epoch-parallel execution thus requires the ability to predict future program states; such predictions are generated by running the second, thread-parallel execution concurrently.

Design
------
DoublePlay is implemented inside Linux and its boundary of record and replay is the process abstraction. DoublePlay records the results and order of system calls executed by the process and returns this data to the application during replay.

All threads in a given epoch run on the same core.

DoublePlay achieves good performance by running different epochs of the same process concurrently.

When the therad-parallel execution reaches the start of the second epoch, DoublePlay checkpoints the process state and uses that state to start running the second epoch in the epoch-parallel execution.

Steps:
1. Record the initial state of the process at start of recording
2. Record the order and results of system calls, signals, and low-level synchronization operations in GNU libc
3. Record the schedule of thread execution (i.e., when context switches between threads occur) of the epoch-parallel execution.

Note: Because each epoch is executed on a single core, DoublePlay does not need to record the ordering or results of any shared memory operations performed by multiple threads.

As long as this hint is correct, the state of the process at the beginning of each epoch in the epoch-parallel execution will match the state of the process at the end of the previous epoch.

DoublePlay compares the process state (memory values and registers) of the thread-parallel and the epoch-parallel runs at each epoch transition. In case of mismatch, it restores the state of the thread-parallel execution to the checkpoint at the beginning of the epoch and restarts both executions from this point.

Two options for rollback:
1. Rollback both executions back to the start of the epoch that diverged and restart execution
2. Roll both executions to the state of the epoch-paralel execution at the divergence and restart from that state. More complicated to implement because need to bring the state of thread-parallel execution to the state of epoch-parallel execution.

Output cannot be externalized until all prior epochs have been found to be free of divergence.

When the epoch-parallel execution executes a system call or synchronization operation, DoublePlay returns the logged result instead of executing the operation. It also delays the thread execution to match the order of operations in the log.

The only situation in which the thread-parallel and epoch-parallel executions might diverge is when the program contains a data race and two threads execute unsynchronized, conflicting operations on a shared memory address.

During replay: current implementation uses single core to execute recorded trace. However could also use uniparallelism to speed up replay.

Implementation
--------------
Built on Respec. Respec ensures that two concurrent executions of the same process are externally deterministic, which means: two executions execute the same system calls in the same order, and that the program state (values in the address space and registers) of the two processes are identical at the end of each epoch of execution.

Respec:
------
Respec duplicates the process's state via a "multi-threaded fork".

Use sequence numbers to ensure that the system calls are called in the same order by the two processes. To enforce the order in the second execution, threads may be made to wait.

Records the args and results of system calls and replays them in the second process (exceptions clone and mmap which change the process address space). Logs the value and timing of signals. Assumes that the underlying file system is copy-on-write so that it can delay logging large file reads until such files are actually modified.

Respec also logs low-level synchronization operations in libc (locks, wait/wakeup).

This ensures that the original and cloned executions execute identically in the absence of data races. To deal with races, divide execution into epochs. Compare/verify memory and register state at the end of each epoch. To reduce cost, only compare memory pages that were modified.

DoublePlay:
----------
- Give half the cores to thread-parallel exec, and half to epoch-parallel
- Use sched_setaffinity system call to set CPU affinity of process.
- Maintain an epoch queue which is filled by thread-parallel exec
- When the process completes executing an epoch, it informs the scheduler that the core is now free and the scheduler allocates the core to the next process in the epoch queue.
- Set the epoch length to 50ms after a rollback. Each epoch without a rollback increases the epoch length by 50ms up to a maximum of one second.
  - End an epoch immediately if a system call requires synchronous external output
	- End an epoch within 50ms after asynchronous external output

Even though epochs may finish out of order, DoublePlay ensures that they commit or rollback in sequential order. On successful epoch completion, checks if all prior epochs have been committed. If so, commit this epoch and discard the associated checkpoint. Otherwise, simply mark the epoch as completed. After all prior epochs commit, commit this epoch.

Divergence check fails if:
1. an epoch-parallel thread calls a different syscall from the ones called by thread-parallel exec
2. the two threads call different libc synchronization operations
3. two threads call the same syscall/syncop but with different args
4. registers or memory state differ at the end of epoch

On failure:
1. terminate all threads executing the current epoch and any future epochs in the epoch-parallel execution
2. terminate all threads of the thread-parallel execution (through kill signal)
3. Allow epochs started prior to the failing epoch to finish and complete their divergence checks
   - If these checks succeed, restart from failed epoch.
	 - If a check for one of the prior epochs failed, restart execution from the earliest epoch that failed.

Logging thread schedules:
Two strategies: deterministic scheduler (poor because can't allow executions that have preemptions) or log the scheduling decisions (much more flexible with little overhead)
Use user-level scheduler (locks, condition variables to enable only one thread at a time)
For the second strategy, use hardware counters to count branches.

Any logging "perturbs" the system and may hide certain behaviour. DoublePlay is better than schemes which require instrumentation of memory accesses.

DoublePlay only has to "record" the results of syncops and syscalls (it buffers and then discards the args). This reduces logsize for syscalls like write.

The checkpoint includes the sequence numbers at the point in execution where the checkpoint was taken, so subsequent entries will have the correct sequence number even on a rollback. Thus, there is no indication of the divergence or rollback in the logs on disk.

Forward Recovery: on divergence, instead of rolling back both thread-level and epoch-level executions, roll to the state of the epoch-level. Guarantees forward progress. What about kernel-state? (associated only with thread-parallel execution). The kernel state of epoch-parallel (if it existed) would have been identical to that of thread-parallel at point of divergence, so okay to merge.

DoublePlay logs an "undo" operation for each system call that modifies kernel state, so that recover can roll back kernel state by applying undo operations -- seems like the most hairy part of the implementation.

In the worst case, every epoch may contain frequent data races, and divergence checks might always fail. In this case, the system proceeds at the speed of a uniprocessor replay since a single core always makes progress.