Fastest possible timeout for rd_kafka_flush() to cleanly shut down a Producer #4021

Quuxplusone · 2022-10-17T19:52:26Z

Under "Proper termination sequence for Producers", the documentation says:

The proper termination sequence for Producers is:

/* 1) Make sure all outstanding requests are transmitted and handled. */
rd_kafka_flush(rk, 60*1000); /* One minute timeout */

/* 2) Destroy the topic and handle objects */
rd_kafka_topic_destroy(rkt);  /* Repeat for all topic objects held */
rd_kafka_destroy(rk);

My question is, where does that 60*1000 one-minute timeout come from? What goes wrong if I lower that timeout to "one second" instead? Is there a concrete cutoff where the behavior suddenly goes wrong?

Worse, is the behavior dependent on any config options? For example, suppose I have set my message.timeout.ms to 700ms; can I get away with a lower rd_kafka_flush timeout then? Or suppose I have set my message.timeout.ms to 300*1000 ms; must I increase the rd_kafka_flush timeout correspondingly?

What I would really like is a 100% foolproof way to shut down the producer, instantly timing out any undelivered messages. There is a recommended approach in the documentation here:

   char tmp[16];
   snprintf(tmp, sizeof(tmp), "%i", SIGIO);  /* Or whatever signal you decide */
   rd_kafka_conf_set(rk_conf, "internal.termination.signal", tmp, errstr, sizeof(errstr));

but I would prefer a synchronous way, such as simply calling rd_kafka_flush(rk, 0). My understanding from the current documentation is that if I just write

rd_kafka_flush(rk, 1);  // one millisecond timeout
rd_kafka_topic_destroy(rkt);
rd_kafka_destroy(rk);

then I will definitely see hangs and/or undefined behavior. Is my understanding accurate?

Update 1

I guess I also don't understand the use of rd_kafka_flush there. According to the header file, rd_kafka_flush(rk, timeout) might return either NO_ERROR or TIMED_OUT, and if it returns TIMED_OUT then it seems there would still be messages in the out queue that haven't been flushed yet, right? So I don't understand (A) how the example code gets away with not checking the return value of rd_kafka_flush; (B) how rd_kafka_flush is different from rd_kafka_poll; nor (C) whether it actually suffices to call rd_kafka_flush once as shown in the example, or whether I need to call it in a loop until rd_kafka_outq_len reaches zero (which I don't want to do, because that might take up to message.timeout.ms, which defaults to 5 minutes).

Update 2

Looking at the source code, I see this in rd_kafka_flush():

        /* Wake up all broker threads to trigger the produce_serve() call.
         * If this flush() call finishes before the broker wakes up
         * then no flushing will be performed by that broker thread. */
        rd_kafka_all_brokers_wakeup(rk, RD_KAFKA_BROKER_STATE_UP, "flushing");
        [...]
        return msg_cnt > 0 ? RD_KAFKA_RESP_ERR__TIMED_OUT
                           : RD_KAFKA_RESP_ERR_NO_ERROR;

This strongly indicates to me two things. (1) No matter what N you put into rd_kafka_flush(rk, N), unless all your brokers happen to wake up within N ms, you will still have undelivered messages left over, so it is still unsafe to call rd_kafka_destroy(rk). In other words, rd_kafka_flush does not block until the queue is empty, only until the timeout, and is thus technically "unsafe at any speed"? (2) A correct shutdown procedure always involves calling rd_kafka_flush in a loop like this:

    /* 1) Make sure all outstanding requests are transmitted and handled. */
    while (rd_kafka_flush(rk, 100) != RD_KAFKA_RESP_ERR_NO_ERROR) {
        /* In-flight messages remain; continue waiting at 100ms intervals until they've all been delivered or failed.
         * Expect this to take up to message.timeout.ms milliseconds in the worst case. */
    }
    /* 2) Destroy the topic and handle objects */
    rd_kafka_topic_destroy(rkt);  /* Repeat for all topic objects held */
    rd_kafka_destroy(rk);

Does that sound right? I hope I'm wrong, because I really do want to find a 100% foolproof way of shutting down a Producer within a bounded amount of time, preferably a small one (less than 1000 ms), and it's looking more and more like the answer is "you can't, it always takes at least message.timeout.ms and potentially unboundedly longer."
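
For illustration, a loop like the one above can be given a total time budget so that it cannot spin forever. The following is a minimal sketch and not librdkafka code: `flush_fn`, `flush_with_budget`, `fake_queue`, and `fake_flush` are hypothetical names. The callback stands in for a thin wrapper around rd_kafka_flush(rk, timeout_ms) that returns 0 when the call reports RD_KAFKA_RESP_ERR_NO_ERROR, and the sketch assumes that a timed-out call consumed its whole slice.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: returns 0 when nothing is outstanding,
 * nonzero when the wait timed out. In real code this would wrap
 * rd_kafka_flush(rk, timeout_ms). */
typedef int (*flush_fn)(void *opaque, int timeout_ms);

/* Call flush() in slices of at most slice_ms until it succeeds or
 * budget_ms has been spent. Returns 1 if drained, 0 if we gave up. */
static int flush_with_budget(flush_fn flush, void *opaque,
                             int slice_ms, int budget_ms)
{
    assert(flush != NULL && slice_ms > 0 && budget_ms >= 0);

    int waited_ms = 0;
    while (waited_ms < budget_ms) {
        int timeout_ms = budget_ms - waited_ms;
        if (timeout_ms > slice_ms)
            timeout_ms = slice_ms;

        if (flush(opaque, timeout_ms) == 0)
            return 1; /* queue is empty */

        /* Assumption: a timed-out call used up its full timeout. */
        waited_ms += timeout_ms;
    }
    return 0; /* budget exhausted, messages may still be outstanding */
}

/* Fake queue for demonstration: becomes empty on the Nth flush call. */
struct fake_queue {
    int calls_until_empty;
    int calls_made;
};

static int fake_flush(void *opaque, int timeout_ms)
{
    struct fake_queue *q = opaque;
    (void)timeout_ms; /* the fake does not actually wait */
    q->calls_made++;
    return q->calls_made >= q->calls_until_empty ? 0 : 1;
}
```

With a 100 ms slice and a 1000 ms budget, the helper makes at most ten attempts. If the budget runs out, messages are still outstanding and the caller has to decide what to do with them before destroying the handle.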

Quuxplusone · 2022-10-18T21:49:11Z

I ended up doing this:

    /* 1) Make sure all outstanding requests are transmitted (or failed) and handled. */
    rd_kafka_purge(rk, RD_KAFKA_PURGE_F_QUEUE | RD_KAFKA_PURGE_F_INFLIGHT);
    while (rd_kafka_flush(rk, 100) != RD_KAFKA_RESP_ERR_NO_ERROR) {
        // Continue waiting until everything has been handled.
    }
    /* 2) Destroy the topic and handle objects */
    rd_kafka_topic_destroy(rkt);  /* Repeat for all topic objects held */
    rd_kafka_destroy(rk);

Now I'm moving on to figure out the same sequence for a consumer...
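
To show why purging first lets the flush loop finish quickly, here is a small self-contained model of the sequence above. It is a sketch under stated assumptions, not librdkafka code: `producer_ops`, `purge_and_drain`, and the `fake_*` names are hypothetical. The three callbacks stand in for wrappers around rd_kafka_purge(rk, RD_KAFKA_PURGE_F_QUEUE | RD_KAFKA_PURGE_F_INFLIGHT), rd_kafka_flush, and rd_kafka_outq_len. The fake producer models an unreachable broker: queued messages never get delivered, while purged messages turn into failed delivery reports that the next flush call serves. The loop is also capped at a fixed number of attempts, which the original snippet does not do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical abstraction over a producer handle. */
struct producer_ops {
    void (*purge)(void *handle);                /* fail queued and in-flight messages */
    int (*flush)(void *handle, int timeout_ms); /* 0 = nothing outstanding */
    int (*outq_len)(void *handle);              /* messages plus unserved reports */
};

/* Purge, then flush up to max_attempts times.
 * Returns 0 on a clean drain, else the number of items still outstanding. */
static int purge_and_drain(const struct producer_ops *ops, void *handle,
                           int slice_ms, int max_attempts)
{
    assert(ops != NULL && slice_ms > 0 && max_attempts > 0);

    ops->purge(handle);
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        if (ops->flush(handle, slice_ms) == 0)
            return 0;
    }
    return ops->outq_len(handle);
}

/* Fake producer whose broker is unreachable. */
struct fake_producer {
    int queued;          /* messages waiting to be sent; never delivered */
    int pending_reports; /* failed messages awaiting their delivery report */
};

static void fake_purge(void *handle)
{
    struct fake_producer *p = handle;
    p->pending_reports += p->queued; /* purged messages become failures */
    p->queued = 0;
}

static void fake_no_purge(void *handle)
{
    (void)handle; /* models skipping the purge step */
}

static int fake_outq_len(void *handle)
{
    const struct fake_producer *p = handle;
    return p->queued + p->pending_reports;
}

static int fake_producer_flush(void *handle, int timeout_ms)
{
    struct fake_producer *p = handle;
    (void)timeout_ms;       /* the fake does not actually wait */
    p->pending_reports = 0; /* serve all pending delivery reports */
    return fake_outq_len(p) == 0 ? 0 : 1;
}
```

In this model, five queued messages drain on the first flush call when the purge is performed. Without the purge, every attempt times out and all five messages are still outstanding when the helper gives up.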
