From fc5266b9626ac9798dce455f031871f4817c7cea Mon Sep 17 00:00:00 2001
From: CPestka <constantin.pestka@c-pestka.de>
Date: Sat, 5 Oct 2024 15:00:14 +0200
Subject: [PATCH] man/io_uring_internals: Add man page about relevant internals
 for users

Adds a man page with details about the inner workings of io_uring that
are likely to be useful for users as they relate to frequently misused
flags of io_uring such as IOSQE_ASYNC and the taskrun flags. This
mostly describes what needs to be done on the kernel side for each
request, who does the work and most notably what the async punt is.

Signed-off-by: Constantin Pestka <constantin.pestka@c-pestka.de>
---
 man/io_uring_internals.7 | 225 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 225 insertions(+)
 create mode 100644 man/io_uring_internals.7

diff --git a/man/io_uring_internals.7 b/man/io_uring_internals.7
new file mode 100644
index 000000000..1133118c7
--- /dev/null
+++ b/man/io_uring_internals.7
@@ -0,0 +1,225 @@
+.TH io_uring_internals 7 2024-10-05 "Linux" "Linux Programmer's Manual"
+.SH NAME
+io_uring_internals \- relevant internals of io_uring for users
+.SH SYNOPSIS
+.nf
+.B "#include <linux/io_uring.h>"
+.fi
+.PP
+.SH DESCRIPTION
+.PP
+.B io_uring
+is a Linux-specific, asynchronous API that allows the submission of requests to
+the kernel that are typically otherwise performed via a syscall. Requests are
+passed to the kernel via a shared ring buffer, the
+.I Submission Queue
+(SQ), and completion notifications are passed back to the application via the
+.I Completion Queue
+(CQ). An important detail here is that after a request has been submitted to
+the kernel some CPU time has to be spent in kernel space to perform the
+required submission and completion related tasks.
+The mechanism used to provide this CPU time, as well as which process does so
+and when, is different in
+.I io_uring
+than for the traditional API provided by regular syscalls.
+
+.PP
+.SH Traditional Syscall Driven I/O
+.PP
+For regular syscalls the CPU time for these tasks is directly provided by the
+process issuing the syscall, with the submission side tasks in kernel space
+being directly executed after the context switch. In the case of polled I/O,
+the time for completion related tasks is subsequently also provided directly by
+that process. In the case of interrupt driven I/O, the CPU time is provided,
+depending on the driver in question, by either the traditional top and bottom
+half IRQ approach or threaded IRQ handling. The CPU time for completion
+tasks is thus in this case provided by the CPU on which the hardware
+interrupt arrives, as well as the CPU to which the dedicated kernel worker
+thread for the threaded IRQ handling gets scheduled, if that is used.
+
+.PP
+.SH The Submission Side Work
+.PP
+
+The tasks required in kernel space on the submission side are mostly checking
+the SQ for newly arrived SQEs, parsing them and checking them for validity and
+permissions, and then passing them on to the responsible subsystem, such as a
+block device driver. An important note here is that
+.I io_uring
+guarantees that the process of submitting the request to the responsible
+subsystem, and thus in this case the
+.IR io_uring_enter (2)
+syscall made to submit the new requests, will never block. However,
+.I io_uring
+relies on the capabilities of the responsible subsystem to perform the
+submission without blocking.
+.I io_uring
+will first attempt to submit the request without blocking.
+If this fails, e.g. due to the respective subsystem not supporting
+non-blocking submissions,
+.I io_uring
+will
+.I async punt
+the request, i.e. offload it to the
+.I IO work queue
+(IO WQ), which is described below.
+
+.PP
+.SH The Completion Side Work
+.PP
+
+The tasks required in kernel space on the completion side mostly come in the
+form of various request type dependent tasks, such as copying buffers, parsing
+packet headers etc., as well as posting a CQE to the CQ to inform the
+application of the completion of the request.
+
+.PP
+.SH Who Does the Work
+.PP
+
+One of
+the primary motivations behind
+.I io_uring
+was to reduce or entirely avoid the overheads of syscalls to provide the
+required CPU time in kernel space. The mechanism that
+.I io_uring
+uses to achieve this differs depending on the configuration, with each
+configuration trading off e.g. CPU efficiency and latency differently.
+
+With the default configuration the primary mechanism to provide the kernel space
+CPU time in
+.I io_uring
+is also a syscall:
+.IR io_uring_enter (2).
+This still differs from requests made via their respective syscall directly,
+such as
+.IR read (2),
+in the sense that it allows for batching in a more flexible way than is
+possible e.g. via
+.IR readv (2),
+as different syscall types can be freely mixed and matched and chains of
+dependent requests, such as a
+.IR send (2)
+followed by a
+.IR recv (2),
+can be submitted with one syscall. Furthermore, it is possible to both process
+requests for submissions and process arrived completions within the same
+.IR io_uring_enter (2)
+call. Applications can set the flag
+.I IORING_ENTER_GETEVENTS
+to, in addition to processing any pending submissions, process any arrived
+completions and
+optionally wait until a specified number of completions have arrived before
+returning.
+
+If polled I/O is used, all completion related work is performed during the
+.IR io_uring_enter (2)
+call. For interrupt driven I/O, the CPU receiving the hardware interrupt
+schedules the remaining work, including posting the CQE, to be performed via
+task work. Any outstanding task work is performed during any user-kernel space
+transition. By default, the CPU that received the hardware interrupt will,
+after scheduling the task work, interrupt the user space process via an
+inter-processor interrupt (IPI), which will cause it to enter the kernel and
+thus perform the scheduled work. While this ensures a timely delivery of
+the CQE, it is a relatively disruptive and high overhead operation. To avoid
+this, applications can configure
+.I io_uring
+via
+.I IORING_SETUP_COOP_TASKRUN
+to elide the IPI. Applications must then ensure that they perform some syscall
+every so often to be able to observe new completions, but benefit from eliding
+the overheads of the IPIs. Additionally,
+.I io_uring
+can be configured to inform an application that it should now perform a
+syscall to reap new completions by setting
+.IR IORING_SETUP_TASKRUN_FLAG .
+This will result in
+.I io_uring
+setting
+.I IORING_SQ_TASKRUN
+in the SQ flags once the application should do so. This mechanism can be
+restricted further via
+.IR IORING_SETUP_DEFER_TASKRUN ,
+which results in the task work only being executed when
+.IR io_uring_enter (2)
+is called with
+.I IORING_ENTER_GETEVENTS
+set, rather than at any user-kernel space transition, which gives the
+application more control over when the work is executed, thus enabling e.g.
+more opportunities for batching.
+
+.PP
+.SH Submission Queue Polling
+.PP
+
+SQ polling introduces a dedicated kernel thread that performs essentially all
+submission and completion related tasks, from fetching SQEs from the SQ and
+submitting requests to polling requests, if configured for I/O poll, and
+posting CQEs. Notably, async punted requests are still processed by the IO WQ,
+to not hinder the progress of other requests. If the SQ thread does not have
+any work to do for a user supplied timeout, it goes to sleep. SQ polling removes
+the need for any syscall during operation, besides waking up the SQ thread
+after long periods of inactivity, and thus reduces per request overheads at the
+price of a high constant upkeep cost.
+
+.PP
+.SH IO Work Queue
+.PP
+
+The IO WQ is a kernel thread pool used to execute any requests that cannot be
+submitted in a non-blocking way to the underlying subsystem, due to missing
+support in said subsystem. After either the SQ poll thread or a user space
+thread calling
+.IR io_uring_enter (2)
+fails the initial attempt to submit the request without blocking, it passes the
+request on to an IO WQ thread that then performs the blocking submission. While
+this mechanism ensures that
+.IR io_uring ,
+unlike e.g. AIO, never blocks on any of the submission paths, it is, as the
+name of this mechanism, the async punt, suggests, not ideal. The blocking
+nature of the submission, the passing of the request to another thread, as
+well as the scheduling of the IO WQ threads are all overheads that are ideally
+avoided. Significant IO WQ activity can thus be seen as an indicator that
+something is very likely going wrong. Similarly, the flag
+.I IOSQE_ASYNC
+should only be used if the user knows that a request will always, or is very
+likely to, async punt, and not to ensure that the submission will not block, as
+.I io_uring
+guarantees to never block in any case.
+
+.PP
+.SH Kernel Thread Management
+.PP
+
+Each user space process utilizing
+.I io_uring
+possesses an
+.I io_uring
+context, which manages all
+.I io_uring
+instances created within said process via
+.IR io_uring_setup (2).
+By default, both the SQ poll thread and the IO WQ thread pool are
+dedicated to each
+.I io_uring
+instance and are thus not shared within a process and are never shared between
+different processes. However, sharing these between two or more instances can
+be achieved during setup via
+.IR IORING_SETUP_ATTACH_WQ .
+The threads of the IO WQ are created lazily in response to requests being async
+punted and fall into two accounts: the bounded account, responsible for
+requests with a generally bounded execution time, such as block I/O, and the
+unbounded account, for requests with unbounded execution time, such as recv
+operations. By default, bounded workers are limited to the smaller of the SQ
+size and 4 times the number of online CPUs, and unbounded workers by
+RLIMIT_NPROC. Both limits can be adjusted via
+.IR IORING_REGISTER_IOWQ_MAX_WORKERS .
+Their CPU affinity can be adjusted via
+.IR IORING_REGISTER_IOWQ_AFF .
+
+.PP
+.SH SEE ALSO
+.BR io_uring (7),
+.BR io_uring_enter (2),
+.BR io_uring_register (2),
+.BR io_uring_setup (2)