
Added mutex to MLT code to protect against a race condition in which a TD is sent to the DFO after triggers are disabled #249

Closed

Conversation

@bieryAtFnal (Contributor) commented Nov 22, 2023

When running automated integration tests in fddaq-v4.2.0 and later systems, I occasionally see complaints from the DFO that it received a TriggerDecision message with the wrong run number. This always happens at the beginning of a 2nd or 3rd run in a single DAQ session.

I believe that the problem is that the MLT has sent a TD after triggers have been disabled. This should not happen, but I believe that a bug is allowing it to happen.

I believe that the sequence goes something like this:

  • run control sends a 'disable_triggers' command to the MLT, and the MLT replies that it has done so. However, there is a bug/race condition in the MLT code that allows it to send another TD even after it says that it has disabled triggers.
  • run control sends 'stop' commands to the MLT, DFO, and every other process and module in the system.
  • the stop command arrives at the DFO very quickly, before the DFO has a chance to read in the new TD message from the MLT.
  • when the next run is started, the DFO finds the stale TD in its connection to the MLT, and it issues the warning message.

The main change that I made is to protect the whole block of code in the call_tc_decision() method that depends on the m_paused variable, so that it cannot be interleaved with changes to that variable from a different thread. With this change, the m_paused variable can probably be made a simple boolean variable instead of an atomic one, but I haven't made that change yet. (We can talk about that or leave it as it is.)
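To make the intent concrete, here is a minimal sketch of the kind of locking described above. It is not the actual dunedaq code; the mutex name (m_td_send_mutex), the do_pause() handler, and the surrounding class structure are assumptions for illustration only.

// Minimal sketch, not the actual trigger code: one mutex serializes the
// m_paused check inside call_tc_decision() with the handler that sets
// m_paused, so a TD cannot be sent after triggers have been disabled.
#include <atomic>
#include <mutex>

class ModuleLevelTrigger {
  std::mutex m_td_send_mutex;         // hypothetical name for the added mutex
  std::atomic<bool> m_paused{true};   // could likely become a plain bool once guarded

  void call_tc_decision(/* const PendingTD& pending_td */)
  {
    std::lock_guard<std::mutex> lock(m_td_send_mutex);
    if (m_paused) {
      return;                         // triggers disabled: do not send a TD
    }
    // ... build the TriggerDecision and send it to the DFO ...
  }

  void do_pause()                     // hypothetical 'disable_triggers' handler
  {
    std::lock_guard<std::mutex> lock(m_td_send_mutex);
    m_paused = true;                  // no TD send can interleave with this
  }
};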

There was also a change to the initialization of the m_bitword_check variable. I'm not sure why this isn't just a local variable in the call_tc_decision() method, but if it does need to be a class data member, then it should be initialized before the thread that makes use of it is started.

I've tried to confirm that these changes effectively prevent the complaints from the DFO about stale TD messages, but the problem happens so rarely that I don't yet have statistics like "it happens once in every N test runs before the change, and it happens zero times in 2N runs after the change". But I wanted to go ahead and file the PR so that others are aware of this problem and a possible solution.

Kurt Biery added 2 commits November 18, 2023 22:07
…he use of the m_paused flag (the symptom before the change was a message from the DFO complaining about a TD with the wrong number at the start of a second run in a single DAQ session).
@jrklein (Contributor) commented Nov 22, 2023

We should be a little careful here: we will always want a TD to get sent and executed if it has happened. Imagine, for example, that we had a SN burst just as someone decided to pause the triggers for some other reason. Is the problem here different, in that the TD has not yet occurred and then triggers are paused, or is it that the TD is not finished when triggers are paused...? This is a bigger problem, perhaps, because the "race condition" may not just be an explicitly logical one, but a physics one: how long do we wait to ensure we haven't lost physics data that is already in the buffers? Sorry if I have misunderstood the exact problem.

@bieryAtFnal (Contributor, Author) commented Nov 24, 2023 via email

@bieryAtFnal (Contributor, Author) commented:

After I filed this PR, I discovered that my theory about what was causing the stale TriggerDecision messages to arrive at the DFO in run N+1 was incorrect. With the proposed code change in this PR, the problem still occurs, so clearly the problem lies somewhere else.

I still feel that the changes in this PR are useful, but maybe we should put them on hold, since the race condition they protect against is not causing any observable problems.

@bieryAtFnal changed the title from "Added mutex to MLT code to protect against a race condition in which a TD is sent to the DFO after triggers are paused" to "Added mutex to MLT code to protect against a race condition in which a TD is sent to the DFO after triggers are disabled" on Nov 26, 2023
@MRiganSUSX (Contributor) commented Dec 14, 2023

Hi @bieryAtFnal, apologies for a very late answer.

So I haven't tested this, but I can confirm that the behaviour you have described can (and does) indeed happen.
Since the introduction of TC merging when forming TDs, we have introduced these PendingTD structs. One vector of these (m_pending_tds) holds TDs that were already created but have not yet expired (TDs now have a buffer period, walltime_expiration, during which they wait for potential incoming TCs so that they can be extended before they are sent onwards).
This means that, in an unlucky scenario, we can have one (or multiple) PendingTDs that were already created but have not yet expired when the 'stop' command comes in.
The flush_td_vectors() function was created to deal with this scenario.
As you can see, the flush function calls call_tc_decision() with an override:
call_tc_decision(pending_td, true);
This override was meant to allow sending trigger decisions to the DFO even in the paused state (ie after receiving the stop command). However, I think this was in the period when I had just joined: I was tasked with creating this override in trigger while Phil was working on the DFO side (ie also allowing some form of override after the stop command in the DFO). I assume that part never materialized; apologies for that!
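For readers less familiar with this part of the code, here is a minimal sketch of the mechanism being described. It is not the real trigger code; the PendingTD contents, the vector mutex name, and the exact signatures around m_pending_tds, flush_td_vectors(), and call_tc_decision() are simplified assumptions.

// Minimal sketch, not the actual trigger code: pending TDs that have not yet
// expired are flushed at stop time by calling call_tc_decision() with the
// override flag set, which bypasses the paused check.
#include <mutex>
#include <vector>

struct PendingTD {
  // merged TCs, readout window, walltime_expiration, ...
};

class ModuleLevelTrigger {
  std::vector<PendingTD> m_pending_tds;  // TDs created but not yet expired
  std::mutex m_td_vector_mutex;          // hypothetical guard for the vector
  bool m_paused = false;

  void call_tc_decision(const PendingTD& pending_td, bool override_flag = false)
  {
    if (m_paused && !override_flag) {
      return;                            // normal path: no TDs while paused
    }
    // ... convert pending_td into a TriggerDecision and send it to the DFO ...
    (void)pending_td;
  }

  void flush_td_vectors()
  {
    std::lock_guard<std::mutex> lock(m_td_vector_mutex);
    for (const PendingTD& pending_td : m_pending_tds) {
      call_tc_decision(pending_td, true);  // override: send even though paused
    }
    m_pending_tds.clear();
  }
};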

There are several options to solve this:

  • propagate this override to DFO so that it allows the 'usual' flow regardless of the state (ie stop)
  • add a small timeout to DFO to allow MLT to send these before stop operations take place in the DFO
  • move this earlier in the flow (you mentioned disable_triggers, which sounds like the best candidate)
  • something else...

One caveat is that there may be other (and possibly many) places that would need to be aware of this override, if that is what we go with.

Regardless, if this is indeed the cause (which I'm almost sure it is), to make this more likely to happen for testing, you could use:

"mlt_merge_overlapping_tcs": true,
"mlt_send_timed_out_tds": true,
"mlt_buffer_timeout": <some big number here>,  <-this is in ms and this is the expiration that we wait for other TCs

and combine this with some high-rate, ideally overlapping, TCs. So either a low TPG threshold or use of the custom TC maker:

"use_custom_maker": true,
        "ctcm_trigger_intervals": [6250000, 31250000, 15625000],
        "ctcm_trigger_types": [1,2,3],
        "ctcm_timestamp_method": "kSystemClock",

I can test this more to find a scenario where this always happens.
Also happy to help with implementing a solution once we agree on one.

@bieryAtFnal (Contributor, Author) commented:

Hi @MRiganSUSX , thanks for the update and information.

I vote for the third option ("move this to sooner in the flow").

To help with debugging the late-arriving TriggerDecision messages, I created an integtest in the trigger repo to demonstrate the behavior. It is called integtest/td_leakage_between_runs_test.py, and it is available on the kbiery/integtest_for_the_disable_trigger_transition branch.

@bieryAtFnal (Contributor, Author) commented:

I'm closing this Pull Request without merging it or taking any further action, since, as discussed above, it did not address the reported problem.

We will continue reporting on progress in this area in trigger Issue 250 and an upcoming PR.
