Kartik Agaram edited this page Dec 11, 2016 · 11 revisions

High Level Goal

Understand enough about the various I/O implementations to decide on a server implementation to start experimenting with.

December 11, 2016

Goal

Dig into poll. Determine if it uses fo_poll under the hood and identify how it differs from select.

Notes

poll's man page states:

poll() provides a mechanism for multiplexing I/O across a set of file descriptors. It is similar in function to select(2). Unlike select(2), however, it is possible to only pass in data corresponding to the file descriptors for which events are wanted. This makes poll() more efficient than select(2) in most cases.

At first this was confusing, but then I realized the author is referring to select's fd_set implementation, which requires enough memory to encode every possible file descriptor in each fd_set. poll, on the other hand, takes a list of structs naming specific fds, reducing its memory requirement. It's hard for me to tell whether this still matters or whether it's an artifact of a more bitwise time. Now that I think about it, there's also a runtime benefit in poll's interface: select has to iterate through every slot in its fd_sets, whereas poll only has to look at the entries it was explicitly given.
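The interface difference is easy to see side by side. A minimal sketch (my own, not from the man page; function names are illustrative) of watching a single descriptor both ways:

```c
#include <sys/select.h>
#include <poll.h>
#include <stddef.h>

/* Wait until fd is readable, via select: the caller hands the kernel
 * a bitmask sized for every descriptor from 0 up to nfds - 1. */
int wait_readable_select(int fd) {
    fd_set readfds;
    FD_ZERO(&readfds);            /* clear all FD_SETSIZE bits */
    FD_SET(fd, &readfds);         /* set just the bit for fd */
    return select(fd + 1, &readfds, NULL, NULL, NULL);
}

/* Wait until fd is readable, via poll: we pass exactly one struct,
 * so the kernel only examines the entries we actually care about. */
int wait_readable_poll(int fd) {
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    return poll(&pfd, 1, -1);     /* -1 = no timeout */
}
```

Both return the number of ready descriptors (1 here), but poll's cost scales with the entries passed in rather than with the highest fd number.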

December 3, 2016

Goal

Compare and contrast the select and poll implementations. Figure out the source of the following two statements from the C10K problem, about select and poll respectively.

Unfortunately, select() is limited to FD_SETSIZE handles. This limit is compiled in to the standard library and user programs. (Some versions of the C library let you raise this limit at user app compile time.)

There is no hardcoded limit to the number of file descriptors poll() can handle, but it does get slow about a few thousand, since most of the file descriptors are idle at any one time, and scanning through thousands of file descriptors takes time.

Code Map

Select

  • Interface sits in /usr/src/sys/sys/select.h.
    • The header file defines key macros for manipulating the fd_set struct, which is passed as a parameter to select. Each fd_set struct represents a set of file descriptors as a bitmask, internally implemented as an array of unsigned 32-bit integers. The hardcoded limit (FD_SETSIZE, equal to 1024) on the number of socket descriptors select can handle arises because each fd_set stores 32 fd_mask unsigned integers (32 bits / mask * 32 masks = 1024 bits, so 1024 different file descriptors can be represented). In summary, since each fd_set represents a file descriptor as a single bit in its bitmask set to "1", each fd_set is capable of representing 1024 different file descriptors.

      The macros defined in the header file all leverage the fact that dividing a passed-in fd by the bits-per-mask constant (32) tells us which fd_mask this file descriptor would or does sit on. For example, an fd equal to 33 sits on the second fd_mask, because the first covers fds 0 through 31 (fds 0 through 2 being the standard streams). The implementation then uses modulo arithmetic and bitwise operations to set, test, or clear a bit on the selected mask. Modulo arithmetic works here because, once dividing by bits-per-mask has picked the right mask, we only care about the remaining offset within it. Continuing the example, to set fd 33 on this fd_set we'd modify fd_mask 1 (zero-indexed): we'd bitwise OR that fd_mask with 1 << (33 % 32) = 1 << 1, setting its second bit to 1.

  • Implementation lives in /usr/src/sys/kern/sys_generic.c (description is a work-in-progress).
    • Most of select just extracts the timeval argument and converts it to a timespec (or returns an error). Once that's done, it delegates to dopselect.
    • dopselect, assuming we ignore all the timespec and retry logic, polls file descriptors to determine if any are ready for data to be read or written at three different priority levels. dopselect uses pretty complicated bit-shifting and memory allocations to do so, but I consider those incidental to the meaning of the function. The main takeaway here is that dopselect uses something called fo_poll to do these poll operations. Is this the same or different from poll? If it's different (which I'm inclined to say it is), how does it differ? Does poll (the sys-call) use it under the hood as well?
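The divide-and-modulo scheme described above can be sketched in userspace C. These are simplified stand-ins for the real FD_SET/FD_ISSET macros (names are my own, and I've skipped the reserved double-underscore identifiers):

```c
#include <stdint.h>

#define NFDBITS 32                    /* bits per uint32_t mask */
#define NMASKS  (1024 / NFDBITS)      /* 32 masks cover 1024 fds */

/* Mirrors FD_SET: fd / 32 picks the mask, fd % 32 picks the bit. */
static void set_fd(uint32_t bits[NMASKS], int fd) {
    bits[fd / NFDBITS] |= (uint32_t)1 << (fd % NFDBITS);
}

/* Mirrors FD_ISSET: extract the same bit and test it. */
static int fd_is_set(const uint32_t bits[NMASKS], int fd) {
    return (bits[fd / NFDBITS] >> (fd % NFDBITS)) & 1u;
}
```

Setting fd 33, for example, ORs mask 1 (zero-indexed) with 1 << 1, leaving every other mask untouched.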

Questions

  • What's the difference between select and pselect?

    • pselect adds the option to specify a signal mask to apply atomically for the duration of the call, and takes its timeout as a timespec.
  • The select header file specifies FD_SETSIZE as 1024, but I don't see what aspect of the implementation prevents us from using more descriptors. The struct definition below (with comments about the constant values added) shows the fd_set structure that userspace programs pass into select.

      #define __NBBY  8                               /* number of bits in a byte */
      typedef uint32_t __fd_mask;
      #define __NFDBITS ((unsigned)(sizeof(__fd_mask) * __NBBY)) /* Bits per mask: 4 bytes * 8 = 32 for a standard uint32_t. */
      #define __howmany(x, y) (((x) + ((y) - 1)) / (y)) /* Ceiling division: adding (y - 1) before the integer divide rounds the quotient up whenever x isn't an exact multiple of y. */
      typedef struct fd_set {
            __fd_mask fds_bits[__howmany(FD_SETSIZE, __NFDBITS)]; /* howmany(1024, 32) = (1024 + 31) / 32 = 32 under integer division. This makes sense: the fds_bits array stores 32 __fd_masks, each of which stores 32 bits! */
      } fd_set;
    

December 1, 2016

Goal

Summarize the C10K article enough to understand which approaches it describes would benefit from a "rip out the middle man" improvement strategy.

Heuristics

Given that we know we want to use OpenBSD and intend to leverage a "kill the middle layer" strategy, what approach will work best?

Actual C10K Notes

Strategies Described

  1. Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness notification
    I don't fully understand this, but it sounds complicated and strictly worse than asynchronous I/O. The author highlights the difference between non-blocking and asynchronous I/O: non-blocking I/O doesn't block while waiting for network calls to finish, but does block while waiting for disk calls to finish. Asynchronous I/O handles both asynchronously (surprise!).
  2. Serve many clients with each thread, and use nonblocking I/O and readiness change notification
    The author describes one benefit of this model - ease of use with OpenSSL. I may be missing something, but I'm tentatively avoiding servers that use this model.
  3. Serve many clients with each server thread, and use asynchronous I/O
    In the earlier sections, the author highlights how asynchronous I/O doesn't block on disk reads and writes. However, in this section he mentions, "AIO doesn't provide a way to open files without blocking for disk I/O". I'm confused! That being said, I get the sense that async I/O, assuming reasonable implementations, presents a good balance of few warts, good performance, and reasonable semantics. Let's hope I'm not proven wrong if I choose this.
  4. Serve one client with each server thread
    The author described this as barely feasible when he wrote the article. While Moore's Law hardware progress has probably increased its feasibility since, I'm inclined to stay away from this given his warnings about having to reduce per-thread stack sizes.

Important Distinctions

  • Level vs. Edge Triggered Kernel Signals
    Level triggers fire continuously while a file descriptor remains in a given state. With level triggering, the socket-holding program keeps receiving the signal until the file descriptor (and socket) exits the ready state. This handles the edge case where a program registers interest in a socket only after the socket has already become ready: since level triggering notifies based on state, the program still receives a notification, even though the ready event technically already occurred. Edge triggers fire once, at the instant a file descriptor becomes ready. Edge triggering punishes programs that miss events: an edge-triggered event appears to the calling program only once, so the program can't recapture missed events or the I/O they signaled.

    In summary, I don't understand why one would use edge-triggering over level-triggering...

Questions

  • Does OpenBSD have a reasonable async I/O implementation?
  • How does Go implement high performance servers?
    • Can goroutines (which are essentially green threads, right?) make the one client per thread model feasible?

Filing Away for Later