Server Research
Understand enough about the various I/O implementations to decide on a server implementation to start experimenting with.
Dig into `poll`. Determine if it uses `fo_poll` under the hood and identify how it differs from `select`.
`poll`'s man page states:

> poll() provides a mechanism for multiplexing I/O across a set of file descriptors. It is similar in function to select(2). Unlike select(2), however, it is possible to only pass in data corresponding to the file descriptors for which events are wanted. This makes poll() more efficient than select(2) in most cases.
At first this was confusing, but then I realized that the author's referring to `select`'s `fd_set` implementation, which implicitly requires memory to encode all possible file descriptors in each `fd_set`. `poll`, on the other hand, takes a list of structs that encode specific `fd`s as arguments, reducing its memory requirement. It's hard for me to tell whether this actually matters anymore or if it's an artifact of a more bitwise time. Now that I think about it, there's also a runtime benefit in `poll`'s implementation: `select` has to iterate through every bit of each `fd_set`, whereas `poll` only has to look at the descriptors it knows it cares about.
Compare and contrast `select` and `poll` implementations. Figure out the source of the two following statements from the C10K problem about `select` and `poll` respectively.

> Unfortunately, select() is limited to FD_SETSIZE handles. This limit is compiled in to the standard library and user programs. (Some versions of the C library let you raise this limit at user app compile time.)

> There is no hardcoded limit to the number of file descriptors poll() can handle, but it does get slow about a few thousand, since most of the file descriptors are idle at any one time, and scanning through thousands of file descriptors takes time.
- Interface sits in `/usr/src/sys/sys/select.h`.
  - The header file defines key macros for manipulating the `fd_set` struct, passed as a parameter to `select`. Each `fd_set` struct represents file descriptors as a bitmask, internally implemented using arrays of unsigned 32-bit integers. The hardcoded limit (`FD_SETSIZE`, equal to 1024) on the number of socket descriptors handled by `select` arises because `fd_set` structs store 32 `fd_mask` unsigned integers (32 bits / mask * 32 masks = 1024 bits = 1024 different file descriptors can be represented). In summary, assuming each `fd_set` represents a file descriptor as a single bit in its bitmask set to "1", each `fd_set` is capable of representing 1024 different file descriptors. The macros defined in the header file all leverage the fact that dividing a passed-in `fd` by bits-per-mask (32) tells us the `fd_mask` on which this file descriptor would or does sit. For example, an `fd` equal to 40 sits on the second `fd_mask`, because the first covers fds 0 through 31 (the lowest bits are in theory unused, because 0 through 2 are the terminal descriptors stdin, stdout, and stderr?). The implementation then uses modulo arithmetic and bitwise operations to set, test, or clear a bit on the selected bitmask. Modulo arithmetic works here because we only care about the remainder number of bits left over after we divide by bits-per-mask to find the right mask. In our prior example, assuming we wanted to set `fd` 40 on this `fd_set`, we'd modify `fd_mask` 1 (zero-indexed): we'd bitwise OR this `fd_mask` with `1 << (40 % 32) = 1 << 8`, setting bit 8 of that mask to 1.
- Implementation lives in `/usr/src/sys/kern/sys_generic.c` (description is a work-in-progress).
  - Most of `select` just extracts timevalues and converts them to a timespec (or returns an error). Once that's done, it delegates to `dopselect`.
  - `dopselect`, assuming we ignore all the timespec and retry logic, polls file descriptors to determine if any are ready for data to be read or written at three different priority levels. `dopselect` uses pretty complicated bit-shifting and memory allocations to do so, but I consider those incidental to the meaning of the function. The main takeaway here is that `dopselect` uses something called `fo_poll` to do these poll operations. Is this the same as or different from `poll`? If it's different (which I'm inclined to say it is), how does it differ? Does `poll` (the syscall) use it under the hood as well?
- What's the difference between `select` and `pselect`?
  - `pselect` adds the option to specify a signal mask to apply atomically for the duration of the call, and it takes the timeout as a timespec.
- The `select` header file specifies `FD_SETSIZE` as 1024, but I don't see what aspect of the implementation prevents us from using more descriptors. The below struct definition (with comments for constant values added) defines the `fd_set` structure that userspace programs pass into `select`.

```c
#define __NBBY 8 /* number of bits in a byte */

typedef uint32_t __fd_mask;

#define __NFDBITS ((unsigned)(sizeof(__fd_mask) * __NBBY))
/* Bits per mask. Should equal 32 (2^5) assuming the standard 4-byte uint32_t. */

#define __howmany(x, y) (((x) + ((y) - 1)) / (y))
/* This breaks down to (x / y) + ((y - 1) / y). Is there unsigned int magic
   going on? What's the point of the second expression? OHHHH, it's because
   we want to round up if x isn't an exact multiple of y: integer division
   truncates, so adding (y - 1) first bumps any remainder to the next whole mask. */

typedef struct fd_set {
	__fd_mask fds_bits[__howmany(FD_SETSIZE, __NFDBITS)];
	/* Given my understanding of howmany, this produces (1024 + 31) / 32 = 32.
	   This makes sense: the fds_bits array stores 32 __fd_masks, each of
	   which stores 32 bits! */
} fd_set;
```
Summarize the C10K article enough to understand which approaches it describes would benefit from a "rip out the middle man" improvement strategy.
Given that we know we want to use OpenBSD and intend to leverage a "kill the middle layer" strategy, what approach will work best?
- Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness notification
  - I don't fully understand this, but it sounds complected and strictly worse than asynchronous I/O. He highlights the difference between non-blocking and asynchronous I/O: non-blocking I/O doesn't block when waiting for network calls to finish, but does when waiting for disk calls to finish. Asynchronous I/O handles both asynchronously (surprise!).
- Serve many clients with each thread, and use nonblocking I/O and readiness change notification
  - The author describes one benefit of this model: ease of use with OpenSSL. I may be missing something, but I'm tentatively avoiding servers that use this model.
- Serve many clients with each server thread, and use asynchronous I/O
  - In the earlier sections, the author highlights how asynchronous I/O doesn't block on disk reads and writes. However, in this section he mentions, "AIO doesn't provide a way to open files without blocking for disk I/O". I'm confused! That being said, I get the sense that async I/O, assuming reasonable implementations, presents a good balance of few warts, good performance, and reasonable semantics. Let's hope I'm not proven wrong if I choose this.
- Serve one client with each server thread
  - The author described this as barely feasible when he wrote this. While Moore's Law hardware progress has probably increased its feasibility, I'm inclined to stay away from this given his warnings about having to reduce thread stack frame sizes. On the other hand,
- Level vs. Edge Triggered Kernel Signals
  - Level triggers occur continuously for a file descriptor in a certain state. With level triggers, the socket-holding program will continue to receive a signal until the file descriptor (and socket) exit the ready state. This deals with an edge case where a program registers interest in a socket after the socket initially becomes ready: since level-triggering notifies based on state, the calling program will still receive a notification for the new socket, even though the ready event technically already occurred. Edge triggers occur at the instant a file descriptor becomes ready. Edge triggering punishes calling programs that miss events: edge-triggered events only appear to calling programs once, so calling programs can't recapture missed events and the I/O they signaled. In summary, I don't understand why one would use edge-triggering over level-triggering...
- Does OpenBSD have a reasonable async I/O implementation?
- How does Go implement high performance servers?
- Can goroutines (which are essentially green threads, right?) make the one client per thread model feasible?
- Notes on High Performance Server Design: Linked in the original C10K problem article. Describes 4 high-performance bottlenecks for message-handling applications (servers).
- People have built servers into the kernel before. May be worth looking into those implementations to understand whether we're re-treading through Linux's dust.
- Need to use the web archive to access this paper about system-wide profiling of non-blocking I/O.
- Interesting page on Fast Unix Servers. Seems to be in a similar vein to the main C10K page.
- Read the original paper that describes kqueue and the distinction between edge and level triggering.