Flash paper
-----------

How to organize a web server? SPED (single-process event-driven: potentially blocks on disk), MP (multi-process: high memory and context-switching overhead), MT (multi-threaded: non-portable in 1999; the paper claims synchronization overheads, but the results don't necessarily agree), AMPED (asymmetric multi-process event-driven: the paper's contribution).

Single request at a time:

fd = listen();
while (1) {
  sock = accept(fd);
  request = read(sock);
  filename = resolve(request);
  filefd = open(filename);   /* don't clobber the listen fd */
  fstat(filefd);             /* size/mtime for the response header */
  send(sock, response_header);
  if (!(data = find_in_cache(filename))) {
    data = read(filefd);
  }
  send(sock, data);
}

What can block? All of these calls. Notice that even find_in_cache() can block, via a page fault on swapped-out cache data!

SPED:
fd = listen();
while (1) {
  nready = select(nfds, &readfds, ...);
  for (i = 0; i < nfds; i++) {
    if (FD_ISSET(i, &readfds)) {
      handle_event(i);   /* could modify/add/remove fds in the watched set */
    }
  }
}

handle_event() could potentially call read/stat/send/resolve/open, all of which should ideally return immediately; anything not yet ready is registered with select() and resumed later. You need to break up the program into units at every point that could block (usually hard to get exhaustively right -- recall find_in_cache()). BUT disk reads could still block on 1999 OSes, since non-blocking I/O did not cover disk files -- so there is a loss in performance.
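
The SPED pattern above can be sketched as runnable Python (a pipe stands in for a client socket, and the names are illustrative, not from the paper):

```python
import os
import select

def sped_demo():
    """One iteration of an event loop: wait for readiness with
    select(), then do a read that is guaranteed not to block."""
    r, w = os.pipe()
    os.write(w, b"GET /index.html")   # makes fd r readable

    # The server blocks only here, never inside a request handler.
    ready, _, _ = select.select([r], [], [])
    assert r in ready
    data = os.read(r, 4096)           # will not block: r is ready
    os.close(r)
    os.close(w)
    return data
```

The key property is that the only blocking point is select() itself; every handler runs only on an fd already reported ready.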

MP:
for (i = 0; i < 200; i++) {
  fork();   /* each child runs the single-request-at-a-time loop */
}

Blocking problems are solved, but memory consumption is high and context switches are expensive. The processes can't exploit shared optimizations like a shared cache: each process ends up with its own copy.

MT:
for (i = 0; i < 200; i++) {
  thread_create(...);   /* each thread runs the single-request-at-a-time loop */
}
Memory problem mostly solved, but now there are synchronization problems on user-level shared state (e.g. the cache). Also, many OSes did not support kernel-level threads in 1999! Would user-level threads help? (No -- a blocking disk read would still block the entire process.)
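
A minimal sketch of the MT trade-off: one shared cache, so every update needs a lock (names are illustrative):

```python
import threading

def mt_demo(nthreads=4):
    """Threads share a single cache, so access to it must be
    synchronized -- the cost MT pays for avoiding MP's copies."""
    cache = {}
    lock = threading.Lock()

    def worker(i):
        with lock:   # user-level synchronization on shared state
            cache[f"/page{i}"] = b"data"

    ts = [threading.Thread(target=worker, args=(i,)) for i in range(nthreads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return len(cache)
```

All threads see the same dictionary, so one insertion per thread yields nthreads entries -- the inverse of the MP situation, bought at the price of locking.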

AMPED:
Fix disk reads by converting them into IPCs to helper processes. A helper calls read() and communicates with the main process using shared memory; select() plays well with the IPC file descriptor, so completions look like any other event. The helper touches all the pages in its memory mapping (bringing them into memory) and notifies the main server when it's finished, so the main process's subsequent access won't fault. Helpers are lighter than MP processes, and only one helper is needed per concurrent disk operation. Extra cost: the IPC itself.
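
The helper mechanism can be sketched like this, using pipes for the IPC (the real Flash also uses shared memory for the data; all names here are illustrative):

```python
import os
import select

def amped_demo():
    """A helper process performs the potentially blocking disk work
    and notifies the main process over a pipe; the main event loop
    just select()s on the pipe fd like any other socket."""
    req_r, req_w = os.pipe()     # main -> helper: the request
    done_r, done_w = os.pipe()   # helper -> main: completion notice
    pid = os.fork()
    if pid == 0:                 # helper process
        fname = os.read(req_r, 4096)
        # Here the real helper would read()/mmap the file, touching
        # its pages so the main process will not fault on them.
        os.write(done_w, b"done:" + fname)
        os._exit(0)
    os.write(req_w, b"/index.html")
    # The completion fd fits straight into the existing event loop.
    ready, _, _ = select.select([done_r], [], [])
    msg = os.read(done_r, 4096)
    os.waitpid(pid, 0)
    return msg
```

Even if the helper blocks for a long time in the "disk read", the main process is never stuck: it only learns about the result when select() reports done_r ready.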

AMPED is really solving the portability problem. If you were not worried about portability, how would you design your web server?

AMPED/SPED allow faster maintenance of shared information/cache (no copies, no synchronization).
Cost per connection: AMPED/SPED (fd, kernel state, app-level info) < MT (full thread) < MP (full process)

Other Flash optimizations: pathname-translation caching, response-header caching, memory-mapped files (in 'chunks'). Memory-mapped files avoid data copying and double buffering. Alignment; memory-residence testing/locking by the main process. CGI processes can still block, but that does not bother the main process.
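
A sketch of the mapped-file idea, assuming a throwaway temp file (residence testing via mincore() isn't portably exposed in the Python stdlib, so this shows only the zero-copy mapping):

```python
import mmap
import os
import tempfile

def mmap_demo():
    """Serve file data from an mmap'd region: the response is built
    from the mapping directly, with no read() copy into a user buffer
    and no second copy in an application-level buffer cache."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"<html>hi</html>")
        m = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
        chunk = bytes(m[:6])   # one 'chunk' of the mapped file
        m.close()
        return chunk
    finally:
        os.close(fd)
        os.unlink(path)
```

The OS page cache backs the mapping, so the same physical pages serve every request for the file -- this is the double buffering that plain read() into a user buffer would reintroduce.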

Experiments: MP has smaller effective caches due to replication costs. Special client software simulates multiple HTTP clients (it is often hard to saturate the server!). Solaris has kernel-thread (MT) support; FreeBSD does not.

Results:
Figures 6/7: performance of the different variants is nearly identical. Why do you think Solaris is performing so poorly w.r.t. FreeBSD?
Figure 8: Why is SPED performance better on the Owlnet trace?
Figure 9: As data set size increases, performance decreases slowly before dropping sharply. The cache hit rate curve has a similar shape.
Figure 10: MT performs equal to or better than Flash most of the time.
Figure 11: all three caches help.
Figure 12: Initially, adding more clients improves parallelism, but it saturates after a point. For MP, more requests mean more cache copies, hence more swapping, so performance decreases even with a small number of concurrent clients.

Question: does AMPED make sense on multi-core/multi-processor systems? Why or why not? How would you architect it? Pure ED -- make the cache a separate process and spawn #ED processes = #processors. On shared-memory architectures, ED loops with MT may be better, with #threads = #processors. Also compare with pure MT.