Deadlocks caused by the root task calling other PDs #13
-
QNX

While searching for solutions to deadlocks in microkernels, I came across the QNX manual, which had a few interesting things to say.

IPC

The manual has a section on IPC, which is relevant to us since QNX's IPC primitives are similar to seL4's. In short, it recommends arranging threads in a hierarchy where blocking sends only travel up the tree, so no send cycle can form. This is a simple and effective way to avoid deadlocks, but what if we need to send messages down the tree? The manual suggests two solutions. The first is to use non-blocking event messages, so the higher-level thread never blocks on a send to a lower-level one.
The second solution is for the lower-level thread to initiate a send to the higher-level thread ("reporting for work"), so that the higher-level thread can reply with work at its own convenience.
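On seL4 (non-MCS), this second pattern could look roughly like the sketch below, stashing the reply capability with seL4_CNode_SaveCaller; all names are illustrative, and the next paragraph explains why it does not fit our setup:

```c
#include <sel4/sel4.h>

/* Worker (lower level): "reports for work" with a blocking Call and
 * stays blocked until the higher-level thread replies with a job. */
seL4_MessageInfo_t report_for_work(seL4_CPtr boss_ep)
{
    seL4_MessageInfo_t info = seL4_MessageInfo_new(0, 0, 0, 0);
    return seL4_Call(boss_ep, info); /* the reply carries the work item */
}

/* Higher level: receive the report and stash the one-shot reply
 * capability in our own CSpace, so we are free to do other things. */
void accept_worker(seL4_CPtr boss_ep, seL4_CNode cspace, seL4_Word reply_slot)
{
    seL4_Word badge;
    seL4_Recv(boss_ep, &badge);
    seL4_CNode_SaveCaller(cspace, reply_slot, seL4_WordBits);
}

/* Later, at the higher level's convenience: reply to unblock the worker. */
void hand_out_work(seL4_CPtr saved_reply, seL4_Word work_item)
{
    seL4_MessageInfo_t info = seL4_MessageInfo_new(0, 0, 0, 1);
    seL4_SetMR(0, work_item);
    seL4_Send(saved_reply, info); /* consumes the one-shot reply cap */
}
```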
The second solution is interesting, but ultimately not possible for us, because our server PDs cannot "report for work" while they are serving client PDs. Even if a server PD had a dedicated thread that waits for the root task to give it work, the root task would still have to be prepared to handle the result asynchronously.

Signals

QNX implements signals using its asynchronous event-delivery mechanisms. Signals are targeted at a particular thread or at an entire process, and threads can block certain signals. If a signal is directed at an entire process, it is delivered to the first thread that does not have the signal blocked; if all threads have the signal blocked, the signal is queued. I think our server PDs are analogous to a process with a single thread, where the "extract model" and "free object" signals should be blocked while the PD is serving a request. If the root task sends these signals while they are blocked, the signal should be queued for the server to handle ASAP.

Takeaways

We need some principled approach to preventing deadlocks, like the one the QNX manual describes. We need to ensure that the root task never sends a blocking request to a PD, and for this we need a form of message queue.
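As a rough sketch of how the queued-signal analogy could look for one of our server PDs, assuming the root task only ever does a non-blocking seL4_Signal on a dedicated notification, with badge bits standing in for the "signals"; all names here are illustrative:

```c
#include <sel4/sel4.h>

/* Illustrative stand-ins for the real handlers. */
extern seL4_MessageInfo_t handle_client(seL4_MessageInfo_t info, seL4_Word badge);
extern void handle_rt_work(seL4_Word bits); /* e.g. extract-model / free-object */

void server_loop(seL4_CPtr client_ep, seL4_CPtr rt_ntfn)
{
    while (1) {
        /* Serve one client request; the "signals" stay blocked meanwhile. */
        seL4_Word badge;
        seL4_MessageInfo_t info = seL4_Recv(client_ep, &badge);
        seL4_MessageInfo_t reply = handle_client(info, badge);
        seL4_Reply(reply);

        /* "Unblock the signals": drain any root-task work that was queued
         * (ORed into the notification word) while we were busy. */
        seL4_Word bits;
        seL4_Poll(rt_ntfn, &bits);
        if (bits != 0) {
            handle_rt_work(bits);
        }
    }
}
```

If the server can sit idle in seL4_Recv for long stretches, the notification could also be bound to its TCB (seL4_TCB_BindNotification) so queued work is noticed without waiting for the next client request.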
-
FUSE (File System in Userspace)

FUSE is a kernel module and userspace library that together allow a file system to run in userspace on top of the Linux kernel. Applications request regular file system operations on the mounted FUSE file system; these first go to the kernel, and the kernel forwards the request to the userspace file system. This was of interest since, like our setup, it involves a privileged component making requests of a less-privileged one.

Kernel -> FS communication

The kernel module and the FUSE file system maintain request queues by reading and writing the '/dev/fuse' file. There are several queues with different priorities, and the file system handles them accordingly. When requests are complete, the file system writes replies to another queue; the kernel module reads the replies and unblocks any waiting application. The format of the requests is straightforward: every request has a header structure including an opcode, followed by an opcode-specific data struct, plus up to two filenames. If the FUSE file system needs to make a request of the kernel, e.g. to notify a user about events on a polled file, it writes a "notification" to '/dev/fuse'. I'm not certain of this, but I think the kernel replies synchronously with an error code.
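For reference, the request header is roughly the following (paraphrased from include/uapi/linux/fuse.h; the exact layout varies across kernel versions):

```c
#include <stdint.h>

/* Every request the kernel writes to /dev/fuse starts with this header,
 * followed by an opcode-specific struct and up to two filenames. */
struct fuse_in_header {
    uint32_t len;     /* total length of this request, header included */
    uint32_t opcode;  /* e.g. FUSE_LOOKUP, FUSE_READ, FUSE_WRITE */
    uint64_t unique;  /* request ID, echoed back in the reply header */
    uint64_t nodeid;  /* inode the operation targets */
    uint32_t uid;     /* credentials of the requesting process */
    uint32_t gid;
    uint32_t pid;
    uint32_t padding;
};
```

The 'unique' field is what lets replies come back on a separate queue and still be matched to the blocked application.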
-
Barrelfish

Upcalls

Takeaways
-
I think it's important to clearly articulate what the purpose of the root task is in the grand scheme of things -- depending on that, certain aspects are possible and others are not. It seems that the root task is acting a bit as a monitor / trusted entity in the system. Given that, having it block on other non-trusted entities is probably not great, unless those entities become part of the trusted domain. One thing to avoid here is entities being able to create dependencies from the root task onto themselves. Even if we were to make the root task event-based or multi-threaded, another thing to avoid is resources being clobbered, e.g. through the allocation of continuations or thread contexts that are never cleaned up because a domain never replies to the request. Another aspect here is whether all IPC has RPC semantics, or whether it is event-driven.
-
Forming a Game Plan

I'm going to lay out a plan for the two routes I am considering.

Option 1: Async Message Queue

Design: (see the queue sketch at the end of this comment)

Steps:

Option 2: Uphill-only

Design:

Steps:
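Since the queue structure matters for the mutex discussion below, here is a minimal sketch of the kind of structure Option 1 implies, assuming a single producer (the root task) and a single consumer (a server PD) sharing a frame; the names and layout are illustrative, not the actual design:

```c
#include <stdatomic.h>
#include <stdint.h>

#define QUEUE_SLOTS 16

typedef struct {
    uint64_t opcode; /* e.g. EXTRACT_MODEL, FREE_OBJECT */
    uint64_t arg;
} rt_msg_t;

typedef struct {
    _Atomic uint32_t head; /* next slot to read  (consumer-owned) */
    _Atomic uint32_t tail; /* next slot to write (producer-owned) */
    rt_msg_t slots[QUEUE_SLOTS];
} rt_queue_t;

/* Root task side: never blocks; returns 0 if the queue is full. */
int rt_queue_push(rt_queue_t *q, rt_msg_t msg)
{
    uint32_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == QUEUE_SLOTS) {
        return 0; /* full: retry later, never block */
    }
    q->slots[tail % QUEUE_SLOTS] = msg;
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return 1;
}

/* Server PD side: drains pending work at its convenience. */
int rt_queue_pop(rt_queue_t *q, rt_msg_t *out)
{
    uint32_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail) {
        return 0; /* empty */
    }
    *out = q->slots[head % QUEUE_SLOTS];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return 1;
}
```

With exactly one writer and one reader, atomic head/tail indices are enough on their own; the mutex question in the comments below arises as soon as either end can have more than one writer.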
-
More on mutexes

In both of the scenarios from the last comment, I mentioned that we would need a shared mutex to prevent a particular deadlock. I was considering this alternative: we could have a "polling PD", trusted but separate from the root task. It needs to know which PDs have a pending message / work from the RT, which could be tracked via some structure shared with the RT. It would do NBSends to those PDs until they have received the message / work. This might be a bad idea, though, because it would flood the seL4 kernel with NBSend calls, and the message would be dropped most of the time. That actually sounds worse; I feel better about a mutex in comparison. I think a mutex would also be relatively easy to implement using seL4 notifications, but I will confirm it.
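As a sanity check on that last point, here is a minimal sketch of a mutex built from a single seL4 notification (assuming the notification capability is already set up; names are illustrative):

```c
#include <sel4/sel4.h>

/* A binary mutex built on one seL4 notification. The notification starts
 * out signaled once, so it behaves like a semaphore with count 1. */
typedef struct {
    seL4_CPtr ntfn; /* capability to a notification object */
} notif_mutex_t;

void notif_mutex_init(notif_mutex_t *m, seL4_CPtr ntfn)
{
    m->ntfn = ntfn;
    seL4_Signal(ntfn); /* prime it: the mutex begins unlocked */
}

void notif_mutex_lock(notif_mutex_t *m)
{
    seL4_Word badge;
    /* Blocks until a signal is pending, then consumes it: exactly one
     * caller at a time can get past this line. */
    seL4_Wait(m->ntfn, &badge);
}

void notif_mutex_unlock(notif_mutex_t *m)
{
    seL4_Signal(m->ntfn); /* admit exactly one waiter (or future locker) */
}
```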
-
I really like option 2 (that doesn't mean it is better) because it is easier to follow, IMO. Again, that is just my opinion. However, I am not sure why a deadlock is inevitable in scenario 2; we can talk about it when you are in. I don't think we need to resolve this right away. Re option 1, I think we need a mutex any time we have tail and head pointers, right? So a mutex is necessary for basic operations, leaving deadlock aside.
-
Update on mutex: what I don't understand is why seL4_libs has a mutex that uses both a notification and a variable (code here). Why did they choose to use both, and am I missing something?
-
That is a classic optimization, so that makes sense. Let's look at the code to make sure that is indeed what's happening.
Best,
Sid Agrawal
sid-agrawal.ca
…On Tue, Jun 25, 2024 at 9:28 AM Arya Stevinson ***@***.***> wrote:
One thought about this: potentially the reason for seL4_libs using both a notification and a variable is that, when you only need the mutex within one address space, a test-and-set is a cheaper operation than a notification Signal/Wait when nobody is holding or waiting on the mutex. Then we can fall back to the notification if necessary.
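A minimal sketch of that optimization as described above: an atomic counter provides the uncontended fast path, and the notification is only touched under contention. This illustrates the idea (a benaphore), not the exact seL4_libs code:

```c
#include <sel4/sel4.h>
#include <stdatomic.h>

/* value starts at 1 (free); the notification starts unsignaled. */
typedef struct {
    atomic_int value; /* 1 = free, 0 = held, negative = held + waiters */
    seL4_CPtr ntfn;   /* only touched on the contended slow path */
} hybrid_mutex_t;

void hybrid_mutex_lock(hybrid_mutex_t *m)
{
    /* Fast path: one atomic op and no kernel entry when uncontended. */
    if (atomic_fetch_sub(&m->value, 1) <= 0) {
        /* Contended: sleep until the holder signals. Notifications are
         * "sticky", so a signal sent before we block here is not lost. */
        seL4_Word badge;
        seL4_Wait(m->ntfn, &badge);
    }
}

void hybrid_mutex_unlock(hybrid_mutex_t *m)
{
    /* A negative previous value means somebody is (about to be) waiting. */
    if (atomic_fetch_add(&m->value, 1) < 0) {
        seL4_Signal(m->ntfn);
    }
}
```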
-
The Issue
The current state of the implementation allows the root task to make requests of other PDs (specifically, PDs that manage a resource space) for two reasons: to extract the model state from a server, and to tell a server to free an object.
If the root task messages a resource server while the resource server is busy (like handling some client request), then we will get a deadlock if the resource server also needs to request something from the root task.
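To make the cycle concrete, here is a hedged sketch using seL4's call semantics (endpoint names are illustrative):

```c
#include <sel4/sel4.h>

/* Root task thread: */
void root_task_side(seL4_CPtr server_ep)
{
    seL4_MessageInfo_t info = seL4_MessageInfo_new(0, 0, 0, 0);
    /* Blocks: the server is busy with a client, not sitting in Recv. */
    seL4_Call(server_ep, info);
}

/* Resource server thread, in the middle of serving a client: */
void resource_server_side(seL4_CPtr root_task_ep)
{
    seL4_MessageInfo_t info = seL4_MessageInfo_new(0, 0, 0, 0);
    /* Also blocks: the root task is stuck in the Call above, so it never
     * reaches its Recv. Neither side can make progress -- deadlock. */
    seL4_Call(root_task_ep, info);
}
```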
Potential Solutions
This thread is to track investigation into possible solutions and the potential avenues we could consider.
Next Steps
I will be looking into other microkernels and similar architectures (where a "more trusted" part of the system makes calls to a "less trusted" part) to evaluate the potential solutions.