
Added mutex to MLT code to protect against a race condition in which a TD is sent to the DFO after triggers are disabled #249

Closed

Conversation

@bieryAtFnal (Contributor) commented Nov 22, 2023

When running automated integration tests in fddaq-v4.2.0 and later systems, I occasionally see complaints from the DFO that it received a TriggerDecision message with the wrong run number. This always happens at the beginning of a 2nd or 3rd run in a single DAQ session.

I believe that the problem is that the MLT has sent a TD after triggers have been disabled. This should not happen, but I believe that a bug is allowing it to happen.

I believe that the sequence goes something like this:

  • run control sends a 'disable_triggers' command to the MLT, and the MLT replies that it has done so. However, there is a bug/race condition in the MLT code that allows it to send another TD even after it says that it has disabled triggers.
  • run control sends 'stop' commands to the MLT, DFO, and every other process and module in the system.
  • the stop command arrives at the DFO very quickly, before the DFO has a chance to read in the new TD message from the MLT.
  • when the next run is started, the DFO finds the stale TD in its connection to the MLT, and it issues the warning message.

The main change that I made is to protect the whole block of code in the call_tc_decision() method that depends on the m_paused variable, so that it cannot be interleaved with changes to that variable from a different thread. With this change, the m_paused variable can probably be made a simple boolean variable instead of an atomic one, but I haven't made that change yet. (We can talk about that or leave it as it is.)
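To make the intent concrete, here is a minimal sketch of the kind of locking described above. It is not the actual dunedaq code; the mutex name (m_td_send_mutex), the do_pause() handler, and the surrounding class structure are assumptions for illustration only.

// Minimal sketch, not the actual trigger code: one mutex serializes the
// m_paused check inside call_tc_decision() with the handler that sets
// m_paused, so a TD cannot be sent after triggers have been disabled.
#include <atomic>
#include <mutex>

class ModuleLevelTrigger {
  std::mutex m_td_send_mutex;         // hypothetical name for the added mutex
  std::atomic<bool> m_paused{true};   // could likely become a plain bool once guarded

  void call_tc_decision(/* const PendingTD& pending_td */)
  {
    std::lock_guard<std::mutex> lock(m_td_send_mutex);
    if (m_paused) {
      return;                         // triggers disabled: do not send a TD
    }
    // ... build the TriggerDecision and send it to the DFO ...
  }

  void do_pause()                     // hypothetical 'disable_triggers' handler
  {
    std::lock_guard<std::mutex> lock(m_td_send_mutex);
    m_paused = true;                  // no TD send can interleave with this
  }
};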

There was also a change to the initialization of the m_bitword_check variable. I'm not sure why this isn't just a local variable in the call_tc_decision() method, but if it does need to be a class data member, then it should be initialized before the thread that makes use of it is started.

I've tried to confirm that these changes effectively prevent the complaints from the DFO about stale TD messages, but the problem happens so rarely that I don't yet have statistics like "it happens once in every N test runs before the change, and it happens zero times in 2N runs after the change". But I wanted to go ahead and file the PR so that others are aware of this problem and a possible solution.

Kurt Biery added 2 commits November 18, 2023 22:07
…he use of the m_paused flag (the symptom before the change was a message from the DFO complaining about a TD with the wrong number at the start of a second run in a single DAQ session).
@jrklein (Contributor) commented Nov 22, 2023

We should be a little careful here: we will always want a TD to get sent and executed if it has happened. Imagine, for example, that we had a SN burst just as someone decided to pause the triggers for some other reason. Is the problem here different, in that the TD has not yet occurred and then triggers are paused, or is it that the TD is not finished when triggers are paused...? This is a bigger problem, perhaps, because the "race condition" may not just be an explicitly logical one, but a physics one: how long do we wait to ensure we haven't lost physics data that is already in the buffers? Sorry if I have misunderstood the exact problem.

@bieryAtFnal (Contributor, Author) commented Nov 24, 2023 via email

@bieryAtFnal (Contributor, Author) commented:

After I filed this PR, I discovered that my theory about what was causing the stale TriggerDecision messages to arrive at the DFO in run N+1 was incorrect. With the proposed code change in this PR, the problem still occurs, so clearly the problem lies somewhere else.

I still feel that the changes in this PR are useful, but maybe we should put them on hold, since the race condition they protect against is not causing any observable problems.

@bieryAtFnal changed the title from "Added mutex to MLT code to protect against a race condition in which a TD is sent to the DFO after triggers are paused" to "Added mutex to MLT code to protect against a race condition in which a TD is sent to the DFO after triggers are disabled" on Nov 26, 2023
@MRiganSUSX (Contributor) commented Dec 14, 2023

Hi @bieryAtFnal, apologies for a very late answer.

So I haven't tested this, but I can confirm that the behaviour you have described can (and does) indeed happen.
Since the introduction of TC merging when forming TDs, we have introduced these PendingTD structs. One vector of these (m_pending_tds) holds TDs that were already created but have not yet expired (TDs now have a buffer period, walltime_expiration, during which they wait for potential incoming TCs so that they can be extended before they are sent onwards).
This means that, in an unlucky scenario, we can have one (or multiple) PendingTDs that were already created but have not yet expired when the 'stop' command comes in.
The flush_td_vectors() function was created to deal with this scenario.
As you can see, the flush function calls call_tc_decision() with an override:
call_tc_decision(pending_td, true);
This override was meant to allow sending trigger decisions to the DFO even in the paused state (ie after receiving the stop command). However, I think this was in the period when I had just joined: I was tasked with creating this override in trigger while Phil was working on the DFO side (ie also allowing some form of override after the stop command in the DFO). I assume that part never materialized; apologies for that!
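For readers less familiar with this part of the code, here is a minimal sketch of the mechanism being described. It is not the real trigger code; the PendingTD contents, the vector mutex name, and the exact signatures around m_pending_tds, flush_td_vectors(), and call_tc_decision() are simplified assumptions.

// Minimal sketch, not the actual trigger code: pending TDs that have not yet
// expired are flushed at stop time by calling call_tc_decision() with the
// override flag set, which bypasses the paused check.
#include <mutex>
#include <vector>

struct PendingTD {
  // merged TCs, readout window, walltime_expiration, ...
};

class ModuleLevelTrigger {
  std::vector<PendingTD> m_pending_tds;  // TDs created but not yet expired
  std::mutex m_td_vector_mutex;          // hypothetical guard for the vector
  bool m_paused = false;

  void call_tc_decision(const PendingTD& pending_td, bool override_flag = false)
  {
    if (m_paused && !override_flag) {
      return;                            // normal path: no TDs while paused
    }
    // ... convert pending_td into a TriggerDecision and send it to the DFO ...
    (void)pending_td;
  }

  void flush_td_vectors()
  {
    std::lock_guard<std::mutex> lock(m_td_vector_mutex);
    for (const PendingTD& pending_td : m_pending_tds) {
      call_tc_decision(pending_td, true);  // override: send even though paused
    }
    m_pending_tds.clear();
  }
};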

There are several options to solve this:

  • propagate this override to DFO so that it allows the 'usual' flow regardless of the state (ie stop)
  • add a small timeout to DFO to allow MLT to send these before stop operations take place in the DFO
  • move this earlier in the flow (you mentioned disable_triggers, which sounds like the best candidate)
  • something else...

One caveat is that there may be other (and possibly many) places that would need to be aware of this override, if that is what we go with.

Regardless, if this is indeed the cause (which I'm almost sure it is), to make this more likely to happen for testing, you could use:

"mlt_merge_overlapping_tcs": true,
"mlt_send_timed_out_tds": true,
"mlt_buffer_timeout": <some big number here>,  <-this is in ms and this is the expiration that we wait for other TCs

and combine this with some high-rate, ideally overlapping, TCs. So either a low TPG threshold or use of the custom TC maker:

"use_custom_maker": true,
        "ctcm_trigger_intervals": [6250000, 31250000, 15625000],
        "ctcm_trigger_types": [1,2,3],
        "ctcm_timestamp_method": "kSystemClock",

I can test this more to find a scenario where this always happens.
Also happy to help with implementing a solution once we agree on one.

@bieryAtFnal (Contributor, Author) commented:

Hi @MRiganSUSX , thanks for the update and information.

I vote for the third option ("move this to sooner in the flow").

To help with debugging the late-arriving TriggerDecision messages, I created an integtest in the trigger repo to demonstrate the behavior. It is called integtest/td_leakage_between_runs_test.py, and it is available on the kbiery/integtest_for_the_disable_trigger_transition branch.

@bieryAtFnal (Contributor, Author) commented:

I'm closing this Pull Request without merging it or taking any further action, since, as discussed above, it did not address the reported problem.

We will continue reporting on progress in this area in trigger Issue 250 and an upcoming PR.
