
BullMQ Pro with DragonFly - warnings/errors/slow processing #3640

Closed
wernermorgenstern opened this issue Sep 3, 2024 · 14 comments
Labels
bug Something isn't working

Comments

@wernermorgenstern

Describe the bug
We are running DragonFly in EKS as the backing store for BullMQ queues.
We currently want to process 90,000 messages within 1 to 10 seconds (and, in the future, even more than 90,000 messages).

In the DragonFly container, we see these errors:

W20240903 16:00:11.101274    11 transaction.cc:63] TxQueue is too long. Tx count:97, armed:95, runnable:0, total locks: 4, contended locks: 4
max contention score: 360192, lock: 17340088452209433226, poll_executions:297968 continuation_tx: BZPOPMIN@1778056/1 {id=8646}            

And then also:

E20240903 15:14:55.102573     8 scheduler.cc:336] Failed to pull active fiber from remote_ready_queue, iteration 0 remote_empty: 1, current_epoch: 130453822, push_epoch: 130453821, next:0
E20240903 15:45:04.683014     8 scheduler.cc:336] Failed to pull active fiber from remote_ready_queue, iteration 0 remote_empty: 1, current_epoch: 186866416, push_epoch: 186866415, next:0

In our application we have 47 queues, and in the consumer service we run 30 BullMQ Pro workers per queue, with a concurrency of 1000 and a batch size of 10.

So I am not sure whether the issue is the BullMQ Lua scripting falling behind because of the locks, or whether there is anything else we can do.
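
For reference, each worker is created roughly like this (a simplified sketch, not our actual code; the connection options and the processor body are placeholders, and the batch handling assumes the getBatch() helper from the BullMQ Pro batches feature):

```
import { WorkerPro } from '@taskforcesh/bullmq-pro';

// One of the 30 workers for a single queue. The queue name is wrapped in a
// hashtag so all of the queue's keys hash to the same Dragonfly thread.
const worker = new WorkerPro(
  '{bridge-processor-1}',
  async (job) => {
    // With the batch option enabled, the received job wraps up to 10 jobs.
    for (const child of job.getBatch()) {
      console.log('processing', child.id); // placeholder: process each message
    }
  },
  {
    connection: { host: 'dragonfly', port: 6379 }, // placeholder connection
    concurrency: 1000,
    batch: { size: 10 },
  },
);
```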

Expected behavior
Be able to process 90,000 messages in less than 10 seconds.

Environment (please complete the following information):

  • Kernel: Linux dragonfly-bullmq-bridge-processor-0 5.10.220-209.869.amzn2.aarch64 #1 SMP Wed Jul 17 15:10:20 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
  • Containerized?: Kubernetes
  • Dragonfly Version: 1.21.4

Resources

    Limits:
      cpu:     47
      memory:  89Gi
    Requests:
      cpu:      47
      memory:   89Gi

Configuration

```
        - --alsologtostderr
        - --lock_on_hashtags
        - --cluster_mode=emulated
        - --maxclients=1000000
        - --maxmemory=0
        - --dbnum=1
        - --max_client_iobuf_len=1000000000
        - --max_multi_bulk_len=1000000000
        - --shard_repl_backlog_len=1024000000
        - --replication_stream_timeout=5000
        - --proactor_threads=4
        - --conn_io_threads=2
        - --lua_auto_async=true
        ```

AWS Instance Type: `c7g.12xlarge`
@wernermorgenstern wernermorgenstern added the bug Something isn't working label Sep 3, 2024
@chakaz
Collaborator

chakaz commented Sep 3, 2024

Hi @wernermorgenstern
A few questions to help us help you:

  1. Can you tell how many of the Dragonfly threads are active?
  2. Why did you set --conn_io_threads=2?
  3. What queue names are you using? Could it be that you only use a single {hashtag} common to all (many) queues?

@wernermorgenstern
Author

@chakaz ,

  1. How can I tell how many threads are active? Is there a command I can run when I connect to the container in EKS?
  2. I was trying to play around with that. I can remove it, if that is the issue. Even before I set that though, we still had the same errors.
  3. Queue names are: {bridge-processor-1}, {bridge-processor-2}, and so on.

@chakaz
Collaborator

chakaz commented Sep 4, 2024

@wernermorgenstern

  1. I meant via a monitoring tool like htop; I want to verify that all threads are indeed active, rather than a single thread handling all queues.
  2. I would remove --conn_io_threads unless hard data shows that it's preferable in your case. In previous benchmarks I've made, it was not preferable in the general case. By using this flag you are in fact limiting the data-handling threads to 4 - 2 = 2 threads.
  3. That's good. I was trying to rule out a case in which all queues used a single {hashtag}; that's not the case for you.

Some follow up questions:
4. I see that you use c7g.12xlarge, which is quite beefy with 48 vCPUs, but on the other hand you're running Dragonfly with only 4 threads. Is that because you intend to run other programs on this server? Generally Dragonfly is designed to run as a standalone server, and I wonder whether in your case other loads degrade Dragonfly's performance.
5. Related to the above, you could try setting --proactor_affinity_mode=off to un-pin Dragonfly from specific CPUs. This is usually not a problem, unless there are, again, other loads on the same (virtual) machine.
6. Could you try running your tests against a standalone Dragonfly on a c7g.xlarge machine (which has 4 vCPUs), and see if you get different numbers?

Thanks

@wernermorgenstern
Author

wernermorgenstern commented Sep 4, 2024 via email

@chakaz
Collaborator

chakaz commented Sep 4, 2024

@wernermorgenstern you passed the following flags:

--proactor_threads=4
--conn_io_threads=2

The first means Dragonfly will only ever use 4 threads (except for an idle main() thread which is not important). Out of these 4 threads, --conn_io_threads=2 means 2 threads will only do I/O (handle connections), which leaves 2 threads for data handling.

I'd start by not passing either of these flags and measuring your performance.
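
For example, the container args from the original report would then look something like this (a sketch; only the two thread-count flags are dropped, with the --proactor_affinity_mode=off suggestion from my earlier comment shown commented out as an optional experiment):

```
        - --alsologtostderr
        - --lock_on_hashtags
        - --cluster_mode=emulated
        - --maxclients=1000000
        - --maxmemory=0
        - --dbnum=1
        - --max_client_iobuf_len=1000000000
        - --max_multi_bulk_len=1000000000
        - --shard_repl_backlog_len=1024000000
        - --replication_stream_timeout=5000
        - --lua_auto_async=true
        # optional, only if other workloads share the node:
        # - --proactor_affinity_mode=off
```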

@wernermorgenstern
Author

So I made the recommended changes:

Here is the log when it starts up:

I20240904 14:26:11.813076     1 dfly_main.cc:640] Starting dragonfly df-v1.22.0-67117ff081b60f28a49c4c5db383a62953f71f3e
I20240904 14:26:11.813205     1 dfly_main.cc:684] maxmemory has not been specified. Deciding myself....
I20240904 14:26:11.813216     1 dfly_main.cc:693] Found 89.00GiB available memory. Setting maxmemory to 71.20GiB
I20240904 14:26:11.813663    12 uring_proactor.cc:180] IORing with 1024 entries, allocated 102720 bytes, cq_entries is 2048
I20240904 14:26:11.927672     1 proactor_pool.cc:147] Running 47 io threads
I20240904 14:26:11.932770     1 server_family.cc:783] Host OS: Linux 5.10.220-209.869.amzn2.aarch64 aarch64 with 47 threads
I20240904 14:26:11.933192     1 snapshot_storage.cc:112] Load snapshot: Searching for snapshot in directory: "/data"
W20240904 14:26:11.933254     1 server_family.cc:888] Load snapshot: No snapshot found
I20240904 14:26:11.937389    13 listener_interface.cc:101] sock[97] AcceptServer - listening on port 6379

@wernermorgenstern
Author

It shows Running 47 io threads

What does this log line mean:
uring_proactor.cc:180] IORing with 1024 entries, allocated 102720 bytes, cq_entries is 2048

@wernermorgenstern
Author

@chakaz ,
so I am doing another load test.
We have 30 queues in the same BullMQ Pro DragonFly Instance.
Queue Names are: {bridge-processor-1}, and {bridge-processor-2}, and so on.

Right now, in DragonFly, we have 47 threads.
However, when I run htop in the DragonFly container, I see only about 15 threads where CPU is above 10%. The rest are mostly idle, running at around 2%.

One thread, though, runs at 93%, and a few run at 20% to 30%.

So shouldn't the thread allocation be based on the key, i.e. on the {hashtag}?
For example, if I have 30 queues (unique by key name) and 30 threads, shouldn't each queue run in its own thread?
It looks like a DragonFly thread handles more than one queue, instead of only one.

@chakaz
Collaborator

chakaz commented Sep 4, 2024

@wernermorgenstern
Keys (and {hashtags} when using --lock_on_hashtags) are assigned to specific threads based on their hash. This is part of our shared-nothing architecture.
So if you use 30 queues (which are 30 keys with regard to thread ownership), you will likely have 2 or more land on the same thread by accident. This is likely what you see in your benchmark.

If this is a pain point, you could try using --shard_round_robin_prefix=bridge-processor, which will assign keys starting with bridge-processor in a round-robin fashion instead of based on their hashes. This is not without downsides, though, so it's not general advice. I wrote about this topic here:
https://www.dragonflydb.io/blog/running-bullmq-with-dragonfly-part-2-optimization
(search for subtitle "Round Robin Key Placement")

A few notes / questions:

  1. If all this Dragonfly server does is handle these 30 queues (no other loads), then I don't see a reason to run on a machine with more than 30 CPUs. There might not be a machine with exactly 30 CPUs, but I'm just pointing out that some CPUs will be (close to) idle, except for connection I/O handling.
  2. What read/write QPS do you see when you run your benchmarks? If you look at the blog post above, you'll see that on c7i.2xlarge (much smaller than what you're using) we reached a peak of 250k reads/s. That is of course reads only, so not production conditions, but I'd still expect you to see more than 90k on such a strong machine. If the load does end up spread close to equally across all threads, I'd recommend looking at the client code (i.e. the code that uses BullMQ); perhaps the bottleneck is there. Try adding more connections or writers/processors, for example (see the sketch below).
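
For instance, on the producer side, jobs could be added in bulk to reduce per-job round trips (a rough sketch; the queue name, payloads, and connection details are placeholders, not taken from this issue):

```
import { QueuePro } from '@taskforcesh/bullmq-pro';

// Hypothetical producer: enqueue messages in bulk instead of one add() per message.
const queue = new QueuePro('{bridge-processor-1}', {
  connection: { host: 'dragonfly', port: 6379 }, // placeholder connection
});

const messages = Array.from({ length: 1000 }, (_, i) => ({ id: i })); // placeholder payloads
await queue.addBulk(messages.map((data) => ({ name: 'process', data })));
```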

@wernermorgenstern
Author

@chakaz , I will look into that article.
A few questions:

  1. For the benchmark mentioned in the post, reaching 250k reads/s, is that with BullMQ? I am a bit unclear on that. Also, what about writes? So adding jobs to the queue, and then consuming and processing them. Does the benchmark take that into account?
  2. For the round-robin placement: our queue names are wrapped in {} and have a number suffix (as I mentioned before). What should the xx in --shard_round_robin_prefix=xx be? Should it be --shard_round_robin_prefix=bridge-processor? Should I then still enable --lock_on_hashtags?

A note: On this DragonFly Instance, it only contains the queue. No other data is stored, or processed

@chakaz
Collaborator

chakaz commented Sep 4, 2024

  1. For the benchmark mentioned in the post, reaching 250k reads/s, is that with BullMQ? I am a bit unclear on that.

Yes, we benchmarked BullMQ with Dragonfly.

Also, what about writes? So adding jobs to the queue, and then consuming and processing them. Does the benchmark take that into account?

We mention this in the beginning of the post, but I now realize I was mistaken in my previous comment here.
In the benchmark we only test adding jobs to the queue. This is to reduce noise that might come from attempting to read from an empty queue. It's a benchmark designed to be as "clean" as possible, not to replicate production conditions. Again, it's all in the post. Sorry.

  1. For the round-robin placement: our queue names are wrapped in {} and have a number suffix (as I mentioned before). What should the xx in --shard_round_robin_prefix=xx be? Should it be --shard_round_robin_prefix=bridge-processor?

Yes

Should I then still enable --lock_on_hashtags?

Yes
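
Put together, the relevant part of the args would then look something like this (a sketch; all other flags from your original configuration stay as they are):

```
        - --lock_on_hashtags
        - --shard_round_robin_prefix=bridge-processor
```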

@wernermorgenstern
Author

@chakaz , ok, great. Thank you very much for your help.

One more question (the documentation is a bit unclear): can --shard_round_robin_prefix=bridge-processor contain only one prefix, or can it contain multiple prefixes (e.g. if we have multiple high-performance queues running on the same DragonFly instance)?

@chakaz
Collaborator

chakaz commented Sep 4, 2024

I'm afraid that, at least currently, one can specify only one prefix. Perhaps you could rename the queues such that high performance queues have their own prefix, different from other queues.

@chakaz
Collaborator

chakaz commented Sep 24, 2024

I'll close this issue now, please let me know if there's anything else.

@chakaz chakaz closed this as completed Sep 24, 2024