iperf3 single-stream low bandwidth with small message sizes (1KB, 1500B, 2000B, etc.) #1078
@noamsto, please re-run the test with the following options and let us know the results:
Hi @davidBar-On,
Hi @noamsto, this is interesting. It is expected that throughput will decrease when the message size is decreased, so it is not clear what iperf2 is doing to keep the same throughput with small messages. Besides overhead in iperf3's internal processing of the TCP messages, there are two directions to investigate:
One more thing to try (in addition to the window-size and Wireshark above) is the iperf3 burst option; see issue #899. Sending packets in bursts has less iperf3 internal overhead, so this may help to understand whether the difference between iperf2 and iperf3 throughput is related to internal processing. Can you try running the client with the burst option?
Hi @davidBar-On,
Another indication that iperf3 might have an issue here is that netperf also reports a much higher bandwidth for the same case:
@noamsto, thanks for the input. As none of the options I suggested helped, maybe the issue is related to CPU usage by iperf3. Can you run both client and server and check the CPU usage? It would also help if you used the latest iperf3 version (3.9). Version 3.5 is from the beginning of 2018, and it would be difficult to evaluate the issue using test inputs from a relatively old version.
Hi @davidBar-On, sorry for the long delay. I've tested with version 3.9 and still see similar behavior: 128k -> ~20Gbps, 1500B -> ~2Gbps.
It seems the CPU is not working harder with 1500B (as we would expect it to).
Here I would expect smaller message sizes -> more interrupts. Maybe iperf3 is not generating enough work for the CPUs when the message size is small?
I agree that somehow this is the case. The following two tests may help to get better insight into the issue:
Hi, I recently hit this "issue", and had the chance of doing some debugging on the iperf3 implementation.

$ iperf3 -v
iperf 3.11 (cJSON 1.7.13)
Linux wonderland.rsevilla.org 6.2.10-200.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr 6 23:30:41 UTC 2023 x86_64
Optional features available: CPU affinity setting, IPv6 flow label, SCTP, TCP congestion algorithm setting, sendfile / zerocopy, socket pacing, authentication, bind to device, support IPv4 don't fragment

The main problem with iperf3 in small packet size scenarios is that the server implementation performs too many select() syscalls (lines 530 to 534 in 10b1797). These syscalls don't come for free and they have a CPU impact on the process. On the other hand, the client is not that affected by this behavior, since it only calls select() once per batch of writes (see lines 1882 to 1888 and line 3150 in 10b1797).

The above means that the client side will perform a ratio of 1:10 select calls per write/read, unlike the server side where the ratio is 1:1. Running a simple test with 64-byte messages, we can probe this behavior:

$ iperf3 -l 64B localhost -t 5s -c localhost
Connecting to host localhost, port 5201
[ 5] local ::1 port 44422 connected to ::1 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 46.7 MBytes 392 Mbits/sec 0 320 KBytes
[ 5] 1.00-2.00 sec 42.7 MBytes 359 Mbits/sec 0 320 KBytes
[ 5] 2.00-3.00 sec 41.8 MBytes 350 Mbits/sec 0 320 KBytes
[ 5] 3.00-4.00 sec 41.1 MBytes 345 Mbits/sec 0 320 KBytes
[ 5] 4.00-5.00 sec 40.3 MBytes 338 Mbits/sec 0 320 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-5.00 sec 213 MBytes 357 Mbits/sec 0 sender
[ 5] 0.00-5.00 sec 211 MBytes 354 Mbits/sec receiver
iperf Done.

And tracing the server side:

$ sudo /usr/share/bcc/tools/syscount -L -p 1640422
Tracing syscalls, printing top 10... Ctrl+C to quit.
[12:17:37]
SYSCALL COUNT TIME (us)
read 3458349 1864759.911
pselect6 3459046 1759740.940
write 23 336.073
accept 2 63.444

As shown above, the number of pselect6 syscalls is very close to the number of read ones, and they add a latency of ~1.76s to this 5s test. Reducing the number of select syscalls the server side performs should be the way to go to optimize the performance of this scenario.
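To make the 1:1 vs. 1:10 ratio concrete, here is a minimal sketch of the two loop shapes described above. This is illustrative only, not iperf3's actual code; `sockfd`, `buf` and `blksize` are placeholder names.

```c
#include <sys/select.h>
#include <unistd.h>

/* Server-style loop: one select() per read(), i.e. a 1:1 ratio. */
static void server_style_loop(int sockfd, char *buf, size_t blksize)
{
    fd_set rfds;
    struct timeval tv;

    for (;;) {
        FD_ZERO(&rfds);
        FD_SET(sockfd, &rfds);
        tv.tv_sec = 0;
        tv.tv_usec = 100000;                     /* wake up periodically for timers */
        if (select(sockfd + 1, &rfds, NULL, NULL, &tv) > 0)
            if (read(sockfd, buf, blksize) <= 0) /* one read per select */
                return;
    }
}

/* Client-style loop: one select() guards a batch of writes (roughly 1:10). */
static void client_style_loop(int sockfd, const char *buf, size_t blksize)
{
    fd_set wfds;

    for (;;) {
        FD_ZERO(&wfds);
        FD_SET(sockfd, &wfds);
        if (select(sockfd + 1, NULL, &wfds, NULL, NULL) > 0)
            for (int i = 0; i < 10; i++)         /* batch of writes per select */
                if (write(sockfd, buf, blksize) < 0)
                    return;
    }
}
```

With 64-byte messages, the per-message syscall overhead of the first loop dominates, which matches the syscount numbers above.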
As a side note, it's possible to improve the server's performance by configuring select's timeout argument to zero (as in the patch below).

Default values:

$ taskset -c 1 iperf3 -l 64B localhost -t 30s -c localhost
Connecting to host localhost, port 5201
[ 5] local ::1 port 33346 connected to ::1 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 60.8 MBytes 510 Mbits/sec 0 320 KBytes
[ 5] 1.00-2.00 sec 55.5 MBytes 466 Mbits/sec 0 320 KBytes
[ 5] 2.00-3.00 sec 54.7 MBytes 459 Mbits/sec 0 320 KBytes
[ 5] 3.00-4.00 sec 53.4 MBytes 448 Mbits/sec 0 320 KBytes
[ 5] 4.00-5.00 sec 52.3 MBytes 439 Mbits/sec 0 320 KBytes
[ 5] 5.00-6.00 sec 51.9 MBytes 435 Mbits/sec 0 320 KBytes
[ 5] 6.00-7.00 sec 51.7 MBytes 433 Mbits/sec 0 320 KBytes
[ 5] 7.00-8.00 sec 50.5 MBytes 424 Mbits/sec 0 320 KBytes
[ 5] 8.00-9.00 sec 51.5 MBytes 432 Mbits/sec 0 320 KBytes
[ 5] 9.00-10.00 sec 51.1 MBytes 429 Mbits/sec 0 320 KBytes
[ 5] 10.00-11.00 sec 51.0 MBytes 428 Mbits/sec 0 320 KBytes
[ 5] 11.00-12.00 sec 48.7 MBytes 408 Mbits/sec 0 320 KBytes
[ 5] 12.00-13.00 sec 46.7 MBytes 392 Mbits/sec 0 320 KBytes
[ 5] 13.00-14.00 sec 48.6 MBytes 408 Mbits/sec 0 320 KBytes
[ 5] 14.00-15.00 sec 49.0 MBytes 411 Mbits/sec 0 320 KBytes
[ 5] 15.00-16.00 sec 48.0 MBytes 403 Mbits/sec 1 320 KBytes
[ 5] 16.00-17.00 sec 49.6 MBytes 416 Mbits/sec 0 320 KBytes
[ 5] 17.00-18.00 sec 46.7 MBytes 392 Mbits/sec 0 320 KBytes
[ 5] 18.00-19.00 sec 49.2 MBytes 413 Mbits/sec 0 320 KBytes
[ 5] 19.00-20.00 sec 48.9 MBytes 410 Mbits/sec 0 320 KBytes
[ 5] 20.00-21.00 sec 48.2 MBytes 404 Mbits/sec 0 320 KBytes
[ 5] 21.00-22.00 sec 46.8 MBytes 393 Mbits/sec 0 320 KBytes
[ 5] 22.00-23.00 sec 46.6 MBytes 391 Mbits/sec 0 320 KBytes
[ 5] 23.00-24.00 sec 47.9 MBytes 402 Mbits/sec 0 320 KBytes
[ 5] 24.00-25.00 sec 48.6 MBytes 407 Mbits/sec 0 320 KBytes
[ 5] 25.00-26.00 sec 47.0 MBytes 394 Mbits/sec 0 320 KBytes
[ 5] 26.00-27.00 sec 47.3 MBytes 397 Mbits/sec 0 320 KBytes
[ 5] 27.00-28.00 sec 47.2 MBytes 396 Mbits/sec 0 320 KBytes
[ 5] 28.00-29.00 sec 44.6 MBytes 374 Mbits/sec 0 320 KBytes
[ 5] 29.00-30.00 sec 46.4 MBytes 389 Mbits/sec 0 320 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 1.46 GBytes 417 Mbits/sec 1 sender
[ 5] 0.00-30.00 sec 1.45 GBytes 416 Mbits/sec receiver

With this small patch:

$ git diff
diff --git a/src/iperf_server_api.c b/src/iperf_server_api.c
index 18f105d..3c7f637 100644
--- a/src/iperf_server_api.c
+++ b/src/iperf_server_api.c
@@ -516,8 +516,8 @@ iperf_run_server(struct iperf_test *test)
} else if (test->mode != SENDER) { // In non-reverse active mode server ensures data is received
timeout_us = -1;
if (timeout != NULL) {
- used_timeout.tv_sec = timeout->tv_sec;
- used_timeout.tv_usec = timeout->tv_usec;
+ used_timeout.tv_sec = 0;
+ used_timeout.tv_usec = 0;
timeout_us = (timeout->tv_sec * SEC_TO_US) + timeout->tv_usec;
}
if (timeout_us < 0 || timeout_us > rcv_timeout_us) {

Client-side:

taskset -c 1 iperf3 -l 64B localhost -t 30s -c localhost
Connecting to host localhost, port 5201
[ 5] local ::1 port 33844 connected to ::1 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 77.3 MBytes 649 Mbits/sec 0 320 KBytes
[ 5] 1.00-2.00 sec 74.1 MBytes 621 Mbits/sec 0 320 KBytes
[ 5] 2.00-3.00 sec 67.6 MBytes 567 Mbits/sec 0 320 KBytes
[ 5] 3.00-4.00 sec 69.5 MBytes 583 Mbits/sec 0 320 KBytes
[ 5] 4.00-5.00 sec 68.8 MBytes 577 Mbits/sec 0 320 KBytes
[ 5] 5.00-6.00 sec 67.2 MBytes 564 Mbits/sec 0 320 KBytes
[ 5] 6.00-7.00 sec 66.8 MBytes 561 Mbits/sec 0 320 KBytes
[ 5] 7.00-8.00 sec 62.9 MBytes 528 Mbits/sec 0 320 KBytes
[ 5] 8.00-9.00 sec 64.4 MBytes 540 Mbits/sec 0 320 KBytes
[ 5] 9.00-10.00 sec 65.1 MBytes 546 Mbits/sec 0 320 KBytes
[ 5] 10.00-11.00 sec 63.4 MBytes 532 Mbits/sec 0 320 KBytes
[ 5] 11.00-12.00 sec 64.6 MBytes 542 Mbits/sec 0 320 KBytes
[ 5] 12.00-13.00 sec 64.1 MBytes 537 Mbits/sec 0 320 KBytes
[ 5] 13.00-14.00 sec 64.2 MBytes 538 Mbits/sec 0 320 KBytes
[ 5] 14.00-15.00 sec 64.0 MBytes 537 Mbits/sec 0 320 KBytes
[ 5] 15.00-16.00 sec 62.1 MBytes 521 Mbits/sec 0 320 KBytes
[ 5] 16.00-17.00 sec 60.4 MBytes 507 Mbits/sec 0 320 KBytes
[ 5] 17.00-18.00 sec 62.2 MBytes 522 Mbits/sec 0 320 KBytes
[ 5] 18.00-19.00 sec 62.4 MBytes 523 Mbits/sec 0 320 KBytes
[ 5] 19.00-20.00 sec 61.5 MBytes 516 Mbits/sec 0 320 KBytes
[ 5] 20.00-21.00 sec 60.9 MBytes 511 Mbits/sec 0 320 KBytes
[ 5] 21.00-22.00 sec 62.7 MBytes 526 Mbits/sec 0 320 KBytes
[ 5] 22.00-23.00 sec 61.7 MBytes 517 Mbits/sec 0 320 KBytes
[ 5] 23.00-24.00 sec 61.9 MBytes 519 Mbits/sec 0 320 KBytes
[ 5] 24.00-25.00 sec 62.4 MBytes 523 Mbits/sec 0 320 KBytes
[ 5] 25.00-26.00 sec 61.8 MBytes 519 Mbits/sec 0 320 KBytes
[ 5] 26.00-27.00 sec 56.1 MBytes 471 Mbits/sec 0 320 KBytes
[ 5] 27.00-28.00 sec 62.0 MBytes 520 Mbits/sec 0 320 KBytes
[ 5] 28.00-29.00 sec 61.8 MBytes 518 Mbits/sec 0 320 KBytes
[ 5] 29.00-30.00 sec 61.5 MBytes 516 Mbits/sec 0 320 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 1.88 GBytes 538 Mbits/sec 0 sender
[ 5] 0.00-30.00 sec 1.88 GBytes 538 Mbits/sec receiver

I haven't analyzed the impact this change could have on other workloads, so just take it as an example.
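For reference on what the patch changes: a zero timeval makes select() return immediately instead of blocking until the socket is ready. A minimal standalone illustration (assumed names, not iperf3 code):

```c
#include <sys/select.h>

/* Returns >0 if the socket is readable right now, 0 if not, -1 on error.
 * With both timeval fields set to 0, select() never blocks. */
static int socket_ready_now(int sockfd)
{
    fd_set rfds;
    struct timeval tv = { 0, 0 };   /* zero timeout => return immediately */

    FD_ZERO(&rfds);
    FD_SET(sockfd, &rfds);
    return select(sockfd + 1, &rfds, NULL, NULL, &tv);
}
```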
Hi @rsevilla87, very good and useful analysis! I tried your suggested change on my computer, and indeed the throughput increased dramatically (in my case from 70Mbps to 93Mbps). From what you found, I think a "receiving burst" option should be added to iperf3. If you submit such a PR, please note the following:
There's some interesting and worthy analysis going on here! I kind of wonder if the multi-threaded iperf3 would help here.

To wit: According to the above, one of the leading factors limiting iperf3 performance is a large number of select(2) calls and their impact on the sending of test data. This comes directly from an early design decision to have iperf3 run as a single thread. Because of this, the iperf3 process can't block in send() or recv() type system calls, because there are multiple sockets that need servicing, as well as various timers. This basically forces the use of select(2) with some timeout values.

The multi-threaded iperf3 assigns a different thread to every I/O stream. Because every stream/connection has its own dedicated thread, that thread can be allowed to block, and we no longer need to do select(2) calls inside the threads doing I/O. We only use select(2) in the main thread, which manages the control connection and reporting.

Note that in general, small messages will still be less efficient than larger ones. That's generally true for almost all I/O. In fact, there are iperf3 use cases that rely on this behavior to simulate different applications' performance.
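A rough sketch of the per-stream-thread idea described here, with hypothetical `stream` and `stream_reader` names (this is not the actual multi-threaded iperf3 code):

```c
#include <pthread.h>
#include <unistd.h>

struct stream {
    int  sockfd;
    char buf[128 * 1024];
};

/* Each stream gets its own thread, which owns exactly one socket and can
 * therefore block directly in read(); no select() on the data path. */
static void *stream_reader(void *arg)
{
    struct stream *s = arg;
    ssize_t n;

    while ((n = read(s->sockfd, s->buf, sizeof(s->buf))) > 0)
        ;   /* account received bytes here */
    return NULL;
}

/* The main thread creates one reader thread per stream and keeps select()
 * only for the control connection and reporting timers. */
static int start_stream_thread(struct stream *s, pthread_t *tid)
{
    return pthread_create(tid, NULL, stream_reader, s);
}
```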
@davidBar-On @bmah888 Thanks for your thoughts! To give you some context, the root of this issue is that I have been trying to characterize network throughput/latency performance in different scenarios by comparing the results from different perf tools like netperf, uperf or iperf3. Keep in mind that the uperf test was also single-threaded.

I've taken a look at the source code of these tools to find the main differences on the receiver side, and these tools are not using select to poll the socket fd. I wonder why iperf3 uses it? I think the server side could avoid such a large number of select syscalls, since read() is already a blocking operation that waits for the socket data to be available.
I believe I found the root cause of iperf3's low performance with small message sizes. While iperf3 uses the same send and receive message sizes, iperf2 uses different message lengths for the client and the server. That is, although the iperf2 client sends 1500-byte messages, the server receives 128KB messages (the default size). I believe netperf's behavior is similar, based on the 13K "Recv socket size bytes" and the 1500 "Send message size bytes" in its report titles.

I tried a version of iperf3 that reads 10 times the message size, i.e. sending 1500-byte messages and receiving 15,000-byte messages. Throughput was improved by 35% for a single-stream test and over 50% for multi-stream tests. Submitted PR #1691 with a suggested enhancement: TCP receive reads "burst * message length" bytes each time.
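A sketch of the idea behind that enhancement: the receiver asks the kernel for burst * blksize bytes per read(), so one syscall can drain many small sender messages (TCP is a byte stream, so message boundaries don't matter). Names and structure are illustrative only, not the code from PR #1691:

```c
#include <stdlib.h>
#include <unistd.h>

/* Read up to burst * blksize bytes in a single syscall. */
static ssize_t burst_read(int sockfd, size_t blksize, int burst)
{
    size_t  want = blksize * (size_t)burst;   /* e.g. 1500 * 10 = 15000 bytes */
    char   *buf  = malloc(want);
    ssize_t n    = -1;

    if (buf != NULL) {
        /* One read() may return data written by many small client writes. */
        n = read(sockfd, buf, want);
        free(buf);
    }
    return n;
}
```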
Context
Version of iperf3: iperf 3.5
Hardware:
Nvidia ConnectX-6 Dx 100GbE
Intel XXV710 25GbE
Operating system (and distribution, if any):
Red Hat Enterprise Linux Server release 7.5 (Maipo) - kernel 5.10.0-rc4
CentOS Linux release 8.1.1911 (Core) - kernel 5.2.9
Bug Report
We are observing low performance with iperf3 while sending single-stream traffic with small message sizes.
Other benchmarks report significantly better results for the same scenario, so we assume it's iperf3 that's limiting the performance we observe.
For example:
Running
iperf3 -c fd12:fd07::200:105 -t 10 -l1500B
we see ~7.9Gbps. Running the same test with 1 stream and 1500B with iperf2, we see 23Gbps, which is what we would expect with the current system.
cmd:
iperf -c fd12:fd07::200:105 -V -l1500 -P1
iperf3 should be able to achieve at least the same performance reported by iperf2 for a single stream.
iperf3 small-message-size performance is low; iperf3 is the limiting factor.
Send traffic with message size 1500B (on a high-performance NIC over 10/25Gbps).
I've noted that iperf2 suffers a similar issue when running with the '-i' flag; might this be related to the reporting/results-gathering flow?