Commit

Blog: async-io

Brian Shih authored and Brian Shih committed Oct 9, 2023
1 parent 07808fb commit a77dce9
Showing 9 changed files with 133 additions and 7 deletions.
8 changes: 8 additions & 0 deletions mdbook/src/SUMMARY.md
@@ -25,4 +25,12 @@
---

# Phase 2 - Asynchronous I/O
- [What is Asynchronous I/O?](./async_io/intro.md)
- [Prerequisites](./async_io/building_blocks.md)
- [Non-blocking I/O](./async_io/non_blocking_mode.md)
- [Io_uring](./async_io/io_uring.md)
- [API](./async_io/api.md)
- [Implementation Details](./async_io/implementation_details.md)



1 change: 1 addition & 0 deletions mdbook/src/async_io/api.md
@@ -0,0 +1 @@
# API
1 change: 1 addition & 0 deletions mdbook/src/async_io/building-blocks.md
@@ -0,0 +1 @@
# Prerequisites - Building Blocks
1 change: 1 addition & 0 deletions mdbook/src/async_io/building_blocks.md
@@ -0,0 +1 @@
# Prerequisites - Building Blocks
1 change: 1 addition & 0 deletions mdbook/src/async_io/implementation_details.md
@@ -0,0 +1 @@
# Implementation Details
9 changes: 9 additions & 0 deletions mdbook/src/async_io/intro.md
@@ -0,0 +1,9 @@
# What is Asynchronous I/O?

In this phase, we will add I/O to our runtime.

A simple approach to I/O would be to just wait for the I/O operation to complete. But such an approach, called **synchronous I/O** or **blocking I/O**, would block the single-threaded executor from performing any other tasks concurrently.

What we want instead is **asynchronous I/O**. In this approach, performing I/O won’t block the calling thread. This allows the executor to run other tasks and return to the original task once the I/O operation completes.

Before we implement asynchronous I/O, we first need to look at two things: how to make an I/O operation non-blocking, and how `io_uring` works.
43 changes: 43 additions & 0 deletions mdbook/src/async_io/io_uring.md
@@ -0,0 +1,43 @@
# Io_uring

On this page, I’ll provide a surface-level explanation of how `io_uring` works. If you want a more in-depth explanation, check out [this tutorial](https://unixism.net/loti/what_is_io_uring.html) or [this article](https://developers.redhat.com/articles/2023/04/12/why-you-should-use-iouring-network-io).

As mentioned, `io_uring` keeps track of the file descriptors we care about and lets us know when one or more of them are ready.

Each `io_uring` instance is composed of two ring buffers - the submission queue and the completion queue.

To register interest in a file descriptor, you add a submission queue entry (SQE) to the tail of the submission queue. Adding to the submission queue doesn’t automatically send the request to the kernel; you need to submit it via the `io_uring_enter` system call. `io_uring` supports batching by allowing you to add multiple SQEs to the ring before submitting them all at once.

The kernel processes the submitted entries and adds completion queue events (CQEs) to the completion queue as the operations complete. While the order of the CQEs might not match the order of the SQEs, there will be one CQE for each SQE, which you can identify by attaching user data to each SQE.

The user can then check the completion queue to see if there are any completed I/O operations.
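
To make the batching point concrete, here is a minimal sketch using the `iou` crate (introduced in the next section). The file descriptors `fd_a` and `fd_b` are hypothetical stand-ins for two sockets:

```rust
// A sketch of batching: queue several SQEs, then submit them all at once.
// `fd_a` and `fd_b` are placeholder raw file descriptors.
let mut ring = iou::IoUring::new(8).unwrap();
unsafe {
    for (i, fd) in [fd_a, fd_b].into_iter().enumerate() {
        let mut sqe = ring.prepare_sqe().expect("submission queue is full");
        sqe.prep_poll_add(fd, iou::sqe::PollFlags::POLLIN);
        // User data lets us match each CQE back to its SQE later.
        sqe.set_user_data(i as u64);
    }
}
// A single io_uring_enter syscall submits both entries.
ring.submit_sqes().unwrap();
```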

### Using io_uring for TcpListener

Let’s look at how we can use `IoUring` to manage the `accept` operation for a `TcpListener`. We will be using the `iou` crate, a library built on top of `liburing`, to create and interact with `io_uring` instances.

```rust
use std::os::unix::io::AsRawFd;

let l = std::net::TcpListener::bind("127.0.0.1:8080").unwrap();
l.set_nonblocking(true).unwrap();
let mut ring = iou::IoUring::new(2).unwrap();

unsafe {
    // Queue an SQE asking the kernel to notify us when the socket is readable.
    let mut sqe = ring.prepare_sqe().expect("failed to get sqe");
    sqe.prep_poll_add(l.as_raw_fd(), iou::sqe::PollFlags::POLLIN);
    sqe.set_user_data(0xDEADBEEF);
    ring.submit_sqes().unwrap();
}
let _ = l.accept();
let cqe = ring.wait_for_cqe().unwrap();
assert_eq!(cqe.user_data(), 0xDEADBEEF);
```

In this example, we first create a [`TcpListener`](https://doc.rust-lang.org/stable/std/net/struct.TcpListener.html) and set it to non-blocking. Next, we create an `io_uring` instance. We then register interest in the socket’s file descriptor by calling `prep_poll_add` (a wrapper around Linux’s [io_uring_prep_poll_add](https://man7.org/linux/man-pages/man3/io_uring_prep_poll_add.3.html) call). This adds an SQE to the submission queue, which will trigger a CQE to be posted [when there is data to be read](https://github.com/nix-rust/nix/blob/e7c877abf73f7f74e358f260683b70ce46db13b0/src/poll.rs#L127).

We then call `accept` to accept any incoming TCP connections. Finally, we call `wait_for_cqe`, which returns the next CQE, blocking the thread until one is ready if necessary. If we wanted to avoid blocking the thread in this example, we could have called `peek_for_cqe`, which checks for a completed CQE without blocking.
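
Here is a minimal sketch of that non-blocking variant, reusing the `ring` from the example above:

```rust
// A sketch: check for a completion without blocking the thread.
loop {
    if let Some(cqe) = ring.peek_for_cqe() {
        assert_eq!(cqe.user_data(), 0xDEADBEEF);
        // The socket is ready; `accept` will now succeed.
        break;
    }
    // No completion yet; an executor could run other tasks here.
}
```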

### Efficiently Checking the CQE

You might be wondering: if we potentially need to call `peek_for_cqe()` repeatedly until a CQE is ready, how is this different from calling `listener.accept()` repeatedly?

The difference is that `accept` is a system call, while `peek_for_cqe` (which calls `io_uring_peek_batch_cqe` under the hood) is not. This is because `io_uring`’s completion ring buffer is shared between the kernel and user space, so checking for completed I/O operations is just a read from memory rather than a trip into the kernel.
60 changes: 60 additions & 0 deletions mdbook/src/async_io/non_blocking_mode.md
@@ -0,0 +1,60 @@
# Nonblocking Mode

In Rust, by default, many I/O operations, such as reading a file, are blocking. For example, in the code snippet below, the `TcpListener::accept` call will block the calling thread until a new TCP connection is established.

```rust
use std::net::TcpListener;

let listener = TcpListener::bind("127.0.0.1:8080").unwrap();
// Blocks the calling thread until a new connection arrives.
listener.accept().unwrap();
```

### Nonblocking I/O

The first step towards asynchronous I/O is turning a blocking I/O operation into a non-blocking one.

In Linux, it is possible to do nonblocking I/O on sockets and files by setting the `O_NONBLOCK` flag on the file descriptors.

Here’s how you can set the file descriptor for a socket to be non-blocking:

```rust
use nix::fcntl::{fcntl, FcntlArg, OFlag};
use std::os::unix::io::AsRawFd;

let listener = std::net::TcpListener::bind("127.0.0.1:8080").unwrap();
let raw_fd = listener.as_raw_fd();
// Set the O_NONBLOCK flag so I/O calls on this fd return immediately.
fcntl(raw_fd, FcntlArg::F_SETFL(OFlag::O_NONBLOCK)).unwrap();
```

Setting the file descriptor for the `TcpListener` to nonblocking means that the next I/O operation will return immediately, whether or not it completed. To check whether the operation has completed, you have to manually poll the file descriptor.

Rust’s std library has helper methods such as `TcpListener::set_nonblocking` to set a file descriptor to be nonblocking:

```rust
let l = std::net::TcpListener::bind("127.0.0.1:8080").unwrap();
l.set_nonblocking(true).unwrap();
```

### Polling

As mentioned above, after setting a socket’s file descriptor to be non-blocking, you have to manually poll the file descriptor to check if the I/O operation has completed. In non-blocking mode, the `TcpListener::accept` method returns `Ok` if the I/O operation succeeded, or an error of kind `io::ErrorKind::WouldBlock` if it would otherwise have blocked.

In the following example, we `loop` until the I/O operation is ready by repeatedly calling `accept`:

```rust
use std::io;
use std::net::TcpListener;

let l = TcpListener::bind("127.0.0.1:8080").unwrap();
l.set_nonblocking(true).unwrap();

loop {
    match l.accept() {
        Ok((stream, _)) => {
            handle_connection(stream);
            break;
        }
        // WouldBlock means no connection is ready yet, so try again.
        Err(err) if err.kind() == io::ErrorKind::WouldBlock => continue,
        Err(err) => panic!("accept failed: {err}"),
    }
}
```

While this works, repeatedly calling `accept` in a loop is not ideal. Each call to `TcpListener::accept` is an expensive call to the kernel.

This is where system calls like [select](http://man7.org/linux/man-pages/man2/select.2.html), [poll](http://man7.org/linux/man-pages/man2/poll.2.html), [epoll](http://man7.org/linux/man-pages/man7/epoll.7.html), [aio](https://man7.org/linux/man-pages/man7/aio.7.html), and [io_uring](https://man.archlinux.org/man/io_uring.7.en) come in. These calls let you register interest in file descriptors and notify you when one or more of them are ready. This reduces the need for constant polling and makes better use of system resources.
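
To make that concrete, here is a sketch of readiness-based waiting using the `poll` bindings from the `nix` crate (my choice for illustration; the exact `PollFd` constructor varies across `nix` versions). A single blocking `poll` call replaces the busy loop above:

```rust
use nix::poll::{poll, PollFd, PollFlags};
use std::os::unix::io::AsRawFd;

let l = std::net::TcpListener::bind("127.0.0.1:8080").unwrap();
l.set_nonblocking(true).unwrap();

// Register interest in readability on the listener's file descriptor.
let mut fds = [PollFd::new(l.as_raw_fd(), PollFlags::POLLIN)];
// One syscall that sleeps until the fd is ready (-1 means no timeout).
poll(&mut fds, -1).unwrap();
// The listener is readable now, so accept should not return WouldBlock.
let (stream, _addr) = l.accept().unwrap();
```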

Glommio uses `io_uring`. One of the things that makes `io_uring` stand out compared to other system calls is that it presents a uniform interface for both sockets and files. This is a huge improvement over system calls like `epoll`, which doesn’t support files, and `aio`, which only works with a subset of files (linux-aio only supports `O_DIRECT` files). On the next page, we take a quick glance at how `io_uring` works.
16 changes: 9 additions & 7 deletions mdbook/src/motivation.md
@@ -1,24 +1,26 @@
# Motivation

I've always wondered how asynchronous runtimes like [Node.js](https://nodejs.org/en/about), [Seastar](https://seastar.io/), [Glommio](https://docs.rs/glommio/latest/glommio/), and [Tokio](https://tokio.rs/) work under the hood. I'm also curious how the [shared-nothing](https://seastar.io/shared-nothing/#:~:text=The%20Seastar%20Model%3A%20Shared%2Dnothing&text=Seastar%20runs%20one%20application%20thread,cores%20must%20be%20handled%20explicitly.), thread-per-core architecture that powers systems like [Redpanda](https://redpanda.com/) and [ScyllaDB](https://www.scylladb.com/) works at a deeper level.

In this blog series, I will explore building a toy version of [Glommio](https://docs.rs/glommio/latest/glommio/), an `asynchronous` framework for building `thread-per-core` applications.

### What is Thread-Per-Core?

A complex application may have many tasks that it needs to execute. Some of these tasks can be performed in parallel to speed up the application. The ability of a system to execute multiple tasks concurrently is known as **multitasking**.

Thread-based multitasking is one of the ways an operating system supports multitasking. In thread-based multitasking, an application can spawn a thread for each internal task. While the CPU can only run one thread at a time, the CPU scheduler can switch between threads to give the user the perception of two or more threads running simultaneously. The switching between threads is known as context switching.

While thread-based multitasking may allow better usage of the CPU by switching threads when a thread is blocked or waiting, there are a few drawbacks:

- The developer has very little control over which thread is scheduled at any moment. Only a single thread can run on a CPU core at any moment. Once a thread is spawned, it is up to the OS to decide which thread to run on which CPU.
- When the OS switches threads to run on a CPU core, it performs a context switch. A context switch is expensive and may take the kernel around 5 μs to perform.
- If multiple threads try to mutate the same data, they need to use locks to synchronize resource contention. Locks are expensive, and threads are blocked while waiting for the lock to be released.

Thread-per-core is an architecture that eliminates threads from the picture. In this programming paradigm, developers are not allowed to spawn new threads to run tasks. Instead, each core runs a single thread.

[Seastar](https://seastar.io/) (C++) and [Glommio](https://docs.rs/glommio/latest/glommio/) (Rust) are two frameworks that allow developers to write thread-per-core applications. Seastar is used in ScyllaDB and Redpanda, while Glommio is used by Datadog.

In this blog series, I will reimplement a lightweight version of Glommio by extracting bits and pieces from it. Throughout the series, I will explain the core abstractions that make up an asynchronous runtime.

I’ve split up the blog series into four phases:

