Added mutex to MLT code to protect against a race condition in which a TD is sent to the DFO after triggers are disabled #249
Conversation
…he use of the m_paused flag (the symptom before the change was a message from the DFO complaining about a TD with the wrong number at the start of a second run in a single DAQ session).
We should be a little careful here: we will always want a TD to get sent and executed if it has happened. Imagine, for example, that we had a SN burst just as someone decided to pause the triggers for some other reason. Is the problem here different, in that the TD has not yet occurred and then triggers are paused, or is it that the TD is not finished when triggers are paused...? This is a bigger problem, perhaps, because the "race condition" may not just be an explicitly logical one, but a physics one: how long do we wait to ensure we haven't lost physics data that is already in the buffers? Sorry if I have misunderstood the exact problem.
Hi Josh,
Good points, and your email ties in nicely with what I’m about to say…
My original theory about the cause of the problem was wrong. In further system testing that included my proposed fix, I still saw problems with a TriggerDecision from run N showing up at the DFO in run N+1. And, I think that the real problem is tied into some of the points that you made in your email.
My new theory is that the MLT is knowingly/deliberately sending one or more TriggerDecisions at “stop” time. The evidence for this is shown in the TRACE messages that are included at the bottom of my reply.
I’m guessing that an early version of the MLT code sent out all “pending” TDs at “stop” time, exactly in the spirit of what you’ve described (leave no TD unsent). However, I suspect that when the “disable triggers” command was added to the overall system, the MLT logic was not fully updated to move that sending of pending triggers from “stop” time to “disable triggers” time. But I believe that it should have been. Clearly, the MLT should not accept the “disable_triggers” command from run control, tell RC that it has successfully made that transition, and then later send out a TD.
Maybe the short-term fix is simply to move the “flush” call from do_stop() to do_pause(). That is something for you to decide; I’ll ask you, Michal, and other trigger folks to come up with the right plan forward.
And, yes, there may be discussions, planning, etc. within the DataSelection group to decide how to appropriately wind down the various internal operations at ‘disable_triggers’ time. But, the bottom line is that the MLT should not send TDs to the DFO after it has made the 'disable_triggers’ transition.
I’d appreciate it if a solution (intermediate or otherwise) could be implemented on a reasonably short time scale.
Thanks,
Kurt
Here, in reverse time order, are TRACE messages that show the MLT pausing triggers (orange text); the DFO completing the “stop” transition (blue text); the MLT flushing pending TDs, sending a TD to the DFO, and completing the “stop” transition (green text), and the DFO complaining about a TD with the wrong run number at the start of the next run (red text).
Recall that run control commands are broadcast to all DAQ applications without any deliberate ordering. So, it is not surprising that the DFO could complete “stop” before the MLT does. That is exactly why the “disable_triggers” command/transition was added (to ensure that the flow of TriggerDecisions is stopped before the DFO is told to stop the run).
13412 11-23 20:17:46.058799 672746 673582 78 DataFlowOrchestrator ERR receive_trigger_decision: DataFlowOrchestrator encountered run number mismatch: recvd (101) != 102 from MLT for trigger_number 462
13418 11-23 20:17:46.058168 672746 673061 120 DataFlowOrchestrator LOG add_callback_impl: Registering callback.
13419 11-23 20:17:46.058004 672746 673061 120 DataFlowOrchestrator LOG add_callback_impl: Registering callback.
13451 11-23 20:17:43.217897 672967 673077 32 ModuleLevelTrigger NFO do_stop: End of run 101
13453 11-23 20:17:43.217178 672967 673077 32 ModuleLevelTrigger LOG do_stop: LivetimeCounter - total deadtime+paused: 2396
13454 11-23 20:17:43.216914 672967 673335 44 ModuleLevelTrigger LOG send_trigger_decisions: Run 101: Received 494 TCs. Sent 462 TDs consisting of 462 TCs. 29 TDs (29 TCs) were created during pause, and 0 TDs (0 TCs) were inhibited. 0 TDs (0 TCs) were dropped.
13455 11-23 20:17:43.211346 672967 673077 32 ModuleLevelTrigger LOG call_tc_decision: Sending a decision with triggernumber 462 timestamp 79554163409484103 number of links 20 based on TC of type 5
13456 11-23 20:17:43.211339 672967 673077 32 ModuleLevelTrigger D03 call_tc_decision: Override?: 1
13457 11-23 20:17:43.211326 672967 673077 32 ModuleLevelTrigger D03 create_decision: HSI passthrough: 0, TC detid: 15, TC type: 5, TC cont number: 1, DECISION trigger type: 1, DECISION timestamp: 79554163409484103, request window begin: 79554163409484103, request window end: 79554163409484135
13458 11-23 20:17:43.211286 672967 673077 32 ModuleLevelTrigger D03 create_decision: earliest TC index: 0
13459 11-23 20:17:43.210549 672967 673077 32 ModuleLevelTrigger D03 flush_td_vectors: Flushing TDs. Size: 1
13516 11-23 20:17:42.295068 672967 673335 44 ModuleLevelTrigger D03 call_tc_decision: Override?: 0
13517 11-23 20:17:42.295065 672967 673335 44 ModuleLevelTrigger D03 create_decision: HSI passthrough: 0, TC detid: 15, TC type: 5, TC cont number: 1, DECISION trigger type: 1, DECISION timestamp: 79554163405562183, request window begin: 79554163405562183, request window end: 79554163405562215
13518 11-23 20:17:42.295064 672967 673335 44 ModuleLevelTrigger D03 create_decision: earliest TC index: 0
13519 11-23 20:17:42.254728 672967 673335 44 ModuleLevelTrigger D03 call_tc_decision: Override?: 0
13520 11-23 20:17:42.254723 672967 673335 44 ModuleLevelTrigger D03 create_decision: HSI passthrough: 0, TC detid: 15, TC type: 5, TC cont number: 1, DECISION trigger type: 1, DECISION timestamp: 79554163402746183, request window begin: 79554163402746183, request window end: 79554163402746215
13521 11-23 20:17:42.254705 672967 673335 44 ModuleLevelTrigger D03 create_decision: earliest TC index: 0
13537 11-23 20:17:42.204360 672967 673335 44 ModuleLevelTrigger D03 call_tc_decision: Override?: 0
13538 11-23 20:17:42.204357 672967 673335 44 ModuleLevelTrigger D03 create_decision: HSI passthrough: 0, TC detid: 15, TC type: 5, TC cont number: 1, DECISION trigger type: 1, DECISION timestamp: 79554163399930183, request window begin: 79554163399930183, request window end: 79554163399930215
13539 11-23 20:17:42.204342 672967 673335 44 ModuleLevelTrigger D03 create_decision: earliest TC index: 0
13549 11-23 20:17:42.194916 672746 673061 120 DataFlowOrchestrator LOG do_stop: dfo successfully stopped
13552 11-23 20:17:42.071710 672967 673077 99 ModuleLevelTrigger D03 do_pause: TS End: 1700792262071710
13553 11-23 20:17:42.071678 672967 673077 99 ModuleLevelTrigger NFO do_pause: Trigger is paused
13554 11-23 20:17:42.071589 672967 673077 99 ModuleLevelTrigger LOG do_pause: ******* Triggers PAUSED! in run 101 *********
On Nov 22, 2023, at 4:56 PM, jrklein wrote:
After I filed this PR, I discovered that my theory about what was causing the stale TriggerDecision messages to arrive at the DFO in run N+1 was incorrect. With the proposed code change in this PR, the problem still occurs, so clearly the problem lies somewhere else. I still feel that the changes in this PR are useful, but maybe we should put them on hold, since they are not causing any observable problems.
Hi @bieryAtFnal, apologies for a very late answer. So I haven't tested this, but I can confirm that the behaviour you have described can (and does) indeed happen. There are several options to solve this:
Regardless, if this is indeed the cause (which I'm almost sure it is), to make this more likely to happen for testing, you could use:
and combine this with some high rate, ideally overlapping TCs. So either low TPG threshold or use of the custom TC maker:
I can test this more to find a scenario where this always happens.
Hi @MRiganSUSX, thanks for the update and information. I vote for the third option ("move this to sooner in the flow"). To help with debugging the late-arriving TriggerDecision messages, I created an integtest in the
I'm closing this Pull Request without merging it or taking any further action, because, as discussed above, it didn't address the reported problem. We will continue reporting on progress in this area in trigger Issue 250 and an upcoming PR.
When running automated integration tests in fddaq-v4.2.0 and later systems, I occasionally see complaints from the DFO that it received a TriggerDecision message with the wrong run number. This always happens at the beginning of a 2nd or 3rd run in a single DAQ session.
I believe that the problem is that the MLT has sent a TD after triggers have been disabled. This should not happen, but I believe that a bug is allowing it to happen.
I believe that the sequence goes something like this:
The main change to the code that I made is to protect the whole block of code in the call_tc_decision() method that depends on the m_paused variable against changes to that variable from a different thread. With this change, the m_paused variable can probably be made a simple boolean variable instead of an atomic one, but I haven't made that change yet (we can discuss that, or leave it as it is).
There was also a change to the initialization of the m_bitword_check variable. I'm not sure why this isn't just a local variable in the call_tc_decision() method, but if it does need to be a class data member, then it should be initialized before the thread that makes use of it is started.
I've tried to confirm that these changes effectively prevent the complaints from the DFO about stale TD messaes, but the problem happens so rarely, I don't yet have statistics like "it happens once in every N test runs before the change, and it happens zero times in 2N runs after the change". But, I wanted to go ahead and file the PR so that others are aware of this problem and possible solution.