
BullMQ Pro with DragonFly - warnings/errors/slow processing #3640

Closed
wernermorgenstern opened this issue Sep 3, 2024 · 14 comments
Labels
bug Something isn't working

Comments

@wernermorgenstern

Describe the bug
We are running DragonFly in EKS as the backing store for BullMQ queues.
We currently want to process 90,000 messages within 1 to 10 seconds (and, in the future, even more than 90,000 messages).

In the DragonFly container, we see these errors:

W20240903 16:00:11.101274    11 transaction.cc:63] TxQueue is too long. Tx count:97, armed:95, runnable:0, total locks: 4, contended locks: 4
max contention score: 360192, lock: 17340088452209433226, poll_executions:297968 continuation_tx: BZPOPMIN@1778056/1 {id=8646}            

And then also:

E20240903 15:14:55.102573     8 scheduler.cc:336] Failed to pull active fiber from remote_ready_queue, iteration 0 remote_empty: 1, current_epoch: 130453822, push_epoch: 130453821, next:0
E20240903 15:45:04.683014     8 scheduler.cc:336] Failed to pull active fiber from remote_ready_queue, iteration 0 remote_empty: 1, current_epoch: 186866416, push_epoch: 186866415, next:0

In our application we have 47 queues, and in the consumer service we run 30 BullMQ Pro workers per queue, with a concurrency of 1000 and a batch size of 10.

So I am not sure whether the issue is the BullMQ Lua scripting falling behind because of the locks, or whether there is anything else we can do.
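
For reference, each worker is created roughly like this (a simplified sketch, not our actual code; the connection options and the processor body are placeholders, and the batch handling assumes the getBatch() helper from the BullMQ Pro batches feature):

```
import { WorkerPro } from '@taskforcesh/bullmq-pro';

// One of the 30 workers for a single queue. The queue name is wrapped in a
// hashtag so all of the queue's keys hash to the same Dragonfly thread.
const worker = new WorkerPro(
  '{bridge-processor-1}',
  async (job) => {
    // With the batch option enabled, the received job wraps up to 10 jobs.
    for (const child of job.getBatch()) {
      console.log('processing', child.id); // placeholder: process each message
    }
  },
  {
    connection: { host: 'dragonfly', port: 6379 }, // placeholder connection
    concurrency: 1000,
    batch: { size: 10 },
  },
);
```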

Expected behavior
Be able to process 90,000 messages in less than 10 seconds.

Environment (please complete the following information):

  • Kernel: Linux dragonfly-bullmq-bridge-processor-0 5.10.220-209.869.amzn2.aarch64 #1 SMP Wed Jul 17 15:10:20 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
  • Containerized?: Kubernetes
  • Dragonfly Version: 1.21.4

Resources

    Limits:
      cpu:     47
      memory:  89Gi
    Requests:
      cpu:      47
      memory:   89Gi

Configuration

```
        - --alsologtostderr
        - --lock_on_hashtags
        - --cluster_mode=emulated
        - --maxclients=1000000
        - --maxmemory=0
        - --dbnum=1
        - --max_client_iobuf_len=1000000000
        - --max_multi_bulk_len=1000000000
        - --shard_repl_backlog_len=1024000000
        - --replication_stream_timeout=5000
        - --proactor_threads=4
        - --conn_io_threads=2
        - --lua_auto_async=true
        ```

AWS Instance Type: `c7g.12xlarge`
@wernermorgenstern wernermorgenstern added the bug Something isn't working label Sep 3, 2024
@chakaz
Collaborator

chakaz commented Sep 3, 2024

Hi @wernermorgenstern
A few questions to help us help you:

  1. Can you tell how many of the Dragonfly threads are active?
  2. Why did you set --conn_io_threads=2?
  3. What queue names are you using? Could it be that you only use a single {hashtag} common to all (many) queues?

@wernermorgenstern
Author

@chakaz ,

  1. How can I tell how many threads are active? Is there a command I can run when I connect to the container in EKS?
  2. I was trying to play around with that. I can remove it, if that is the issue. Even before I set that though, we still had the same errors.
  3. Queue names are: {bridge-processor-1}, {bridge-processor-2}, and so on.

@chakaz
Collaborator

chakaz commented Sep 4, 2024

@wernermorgenstern

  1. I meant via a monitoring tool like htop; I want to verify that all threads are indeed active, rather than a single thread handling all queues.
  2. I would remove --conn_io_threads unless hard data shows that it's preferable in your case. In previous benchmarks I've made, it was not preferable in the general case. By using this flag you are in fact limiting the data-handling threads to 4 - 2 = 2 threads.
  3. That's good. I was trying to rule out a case in which all queues used a single {hashtag}; that's not the case for you.

Some follow up questions:
4. I see that you use c7g.12xlarge, which is quite beefy with 48 vCPUs, but on the other hand you're running Dragonfly with only 4 threads. Is that because you intend to run other programs on this server? Generally Dragonfly is designed to run as a standalone server, and I wonder whether in your case other loads degrade Dragonfly's performance.
5. Related to the above, you could try setting --proactor_affinity_mode=off to un-pin Dragonfly from specific CPUs. This is usually not a problem, unless there are, again, other loads on the same (virtual) machine.
6. Could you try running your tests against a standalone Dragonfly on a c7g.xlarge machine (which has 4 vCPUs), and see if you get different numbers?

Thanks

@wernermorgenstern
Author

wernermorgenstern commented Sep 4, 2024 via email

@chakaz
Collaborator

chakaz commented Sep 4, 2024

@wernermorgenstern you passed the following flags:

--proactor_threads=4
--conn_io_threads=2

The first means Dragonfly will only ever use 4 threads (except for an idle main() thread which is not important). Out of these 4 threads, --conn_io_threads=2 means 2 threads will only do I/O (handle connections), which leaves 2 threads for data handling.

I'd start by not passing either of these flags and measuring your performance.
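
For example, the container args from the original report would then look something like this (a sketch; only the two thread-count flags are dropped, with the --proactor_affinity_mode=off suggestion from my earlier comment shown commented out as an optional experiment):

```
        - --alsologtostderr
        - --lock_on_hashtags
        - --cluster_mode=emulated
        - --maxclients=1000000
        - --maxmemory=0
        - --dbnum=1
        - --max_client_iobuf_len=1000000000
        - --max_multi_bulk_len=1000000000
        - --shard_repl_backlog_len=1024000000
        - --replication_stream_timeout=5000
        - --lua_auto_async=true
        # optional, only if other workloads share the node:
        # - --proactor_affinity_mode=off
```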

@wernermorgenstern
Author

So I made the recommended changes:

Here is the log when it starts up:

I20240904 14:26:11.813076     1 dfly_main.cc:640] Starting dragonfly df-v1.22.0-67117ff081b60f28a49c4c5db383a62953f71f3e
I20240904 14:26:11.813205     1 dfly_main.cc:684] maxmemory has not been specified. Deciding myself....
I20240904 14:26:11.813216     1 dfly_main.cc:693] Found 89.00GiB available memory. Setting maxmemory to 71.20GiB
I20240904 14:26:11.813663    12 uring_proactor.cc:180] IORing with 1024 entries, allocated 102720 bytes, cq_entries is 2048
I20240904 14:26:11.927672     1 proactor_pool.cc:147] Running 47 io threads
I20240904 14:26:11.932770     1 server_family.cc:783] Host OS: Linux 5.10.220-209.869.amzn2.aarch64 aarch64 with 47 threads
I20240904 14:26:11.933192     1 snapshot_storage.cc:112] Load snapshot: Searching for snapshot in directory: "/data"
W20240904 14:26:11.933254     1 server_family.cc:888] Load snapshot: No snapshot found
I20240904 14:26:11.937389    13 listener_interface.cc:101] sock[97] AcceptServer - listening on port 6379

@wernermorgenstern
Author

It shows Running 47 io threads

What does this log line mean:
uring_proactor.cc:180] IORing with 1024 entries, allocated 102720 bytes, cq_entries is 2048

@wernermorgenstern
Author

@chakaz ,
so I am doing another load test.
We have 30 queues in the same BullMQ Pro DragonFly Instance.
Queue Names are: {bridge-processor-1}, and {bridge-processor-2}, and so on.

Right now, in DragonFly, we have 47 threads.
However, when I run htop in the DragonFly container, I see only about 15 threads where CPU is above 10%. The rest are mostly idle, running at around 2%.

One thread, though, runs at 93%, and a few run at 20% to 30%.

So shouldn't the thread allocation be based on the key, i.e. on the {hashtag}?
For example, if I have 30 queues (unique by key name) and 30 threads, shouldn't each queue run in its own thread?
It looks like a DragonFly thread handles more than one queue, instead of only one.

@chakaz
Collaborator

chakaz commented Sep 4, 2024

@wernermorgenstern
Keys (and {hashtags} when using --lock_on_hashtags) are assigned to specific threads based on their hash. This is part of our shared-nothing architecture.
So if you use 30 queues (which are 30 keys with regard to thread ownership), you will likely have 2 or more land on the same thread by accident. This is likely what you see in your benchmark.

If this is a pain point, you could try using --shard_round_robin_prefix=bridge-processor, which will assign keys starting with bridge-processor in a round-robin fashion instead of based on their hashes. This is not without downsides, though, so it's not general advice. I wrote about this topic here:
https://www.dragonflydb.io/blog/running-bullmq-with-dragonfly-part-2-optimization
(search for subtitle "Round Robin Key Placement")

A few notes / questions:

  1. If all this Dragonfly server does is handle these 30 queues (no other loads), then I don't see a reason to run on a machine with more than 30 CPUs. There might not be a machine with exactly 30 CPUs, but I'm just pointing out that some CPUs will be (close to) idle, except for connection I/O handling.
  2. What read/write QPS do you see when you run your benchmarks? If you look at the blog post above, you'll see that on c7i.2xlarge (much smaller than what you're using) we reached a peak of 250k reads/s. That is of course reads only, so not production conditions, but I'd still expect you to see more than 90k on such a strong machine. If the load does end up spread close to equally across all threads, I'd recommend looking at the client code (i.e. the code that uses BullMQ); perhaps the bottleneck is there. Try adding more connections or writers/processors, for example (see the sketch below).
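
For instance, on the producer side, jobs could be added in bulk to reduce per-job round trips (a rough sketch; the queue name, payloads, and connection details are placeholders, not taken from this issue):

```
import { QueuePro } from '@taskforcesh/bullmq-pro';

// Hypothetical producer: enqueue messages in bulk instead of one add() per message.
const queue = new QueuePro('{bridge-processor-1}', {
  connection: { host: 'dragonfly', port: 6379 }, // placeholder connection
});

const messages = Array.from({ length: 1000 }, (_, i) => ({ id: i })); // placeholder payloads
await queue.addBulk(messages.map((data) => ({ name: 'process', data })));
```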

@wernermorgenstern
Author

@chakaz , I will look into that article.
A few questions:

  1. For the benchmark mentioned in the post, reaching 250k reads/s, is that with BullMQ? I am a bit unclear on that. Also, what about writes? So adding jobs to the queue, and then consuming and processing them. Does the benchmark take that into account?
  2. For the round-robin placement: our queue names are wrapped in {} and have a number suffix (as I mentioned before). What should the xx in --shard_round_robin_prefix=xx be? Should it be --shard_round_robin_prefix=bridge-processor? Should I then still enable --lock_on_hashtags?

A note: On this DragonFly Instance, it only contains the queue. No other data is stored, or processed

@chakaz
Collaborator

chakaz commented Sep 4, 2024

  1. For the benchmark mentioned in the post, reaching 250k reads/s, is that with BullMQ? I am a bit unclear on that.

Yes, we benchmarked BullMQ with Dragonfly.

Also, what about writes? So adding jobs to the queue, and then consuming and processing them. Does the benchmark take that into account?

We mention this in the beginning of the post, but I now realize I was mistaken in my previous comment here.
In the benchmark we only test adding jobs to the queue. This is to reduce noise that might come from attempting to read from an empty queue. It's a benchmark designed to be as "clean" as possible, not to replicate production conditions. Again, it's all in the post. Sorry.

  1. For the round-robin placement: our queue names are wrapped in {} and have a number suffix (as I mentioned before). What should the xx in --shard_round_robin_prefix=xx be? Should it be --shard_round_robin_prefix=bridge-processor?

Yes

Should I then still enable --lock_on_hashtags?

Yes
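
Put together, the relevant part of the args would then look something like this (a sketch; all other flags from your original configuration stay as they are):

```
        - --lock_on_hashtags
        - --shard_round_robin_prefix=bridge-processor
```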

@wernermorgenstern
Author

@chakaz , ok, great. Thank you very much for your help.

One more question (the documentation is a bit unclear): can --shard_round_robin_prefix=bridge-processor contain only one prefix, or can it contain multiple prefixes (e.g. if we have multiple high-performance queues running on the same DragonFly instance)?

@chakaz
Collaborator

chakaz commented Sep 4, 2024

I'm afraid that, at least currently, one can specify only one prefix. Perhaps you could rename the queues such that high performance queues have their own prefix, different from other queues.

@chakaz
Collaborator

chakaz commented Sep 24, 2024

I'll close this issue now, please let me know if there's anything else.

@chakaz chakaz closed this as completed Sep 24, 2024