Unit status not properly synced with long-running generators #1038

PReithofer · 2023-07-31T22:23:00Z

As I understand it, the only place where the unit status can transition to AssignmentState.COMPLETED is in
mephisto/data_model/unit.py in get_status(). This method needs to be called to keep the agent status and unit status in sync. However, when checking if a worker can work on another assignment, this is probably not called. Normally this is not an issue because the operator regularly (indirectly) calls this here:

async def _track_and_kill_runs(self):
...
if not tracked_run.force_shutdown:
    if tracked_run.task_launcher.finished_generators is False:
        # If the run can still generate assignments, it's
        # definitely not done
        continue
    task_run = tracked_run.task_run
    task_run.update_completion_progress(
        task_launcher=tracked_run.task_launcher
    )
    if not task_run.get_is_completed(): # <- here
        continue
...

However, when the using a long-running generator, it can happen that a unit finishes before the generator does.
If you want to try this, on this branch I replaced the static_task_data with the following generator:

def slow_generator():
    import time
    i=0
    while(True):
        i+=1
        time.sleep(3)
        yield {"text": f"This is assignment {i}"}

One quick fix would be to simply move the emptiness check of the generator below get_is_completed see #1039. But a cleaner solution would probably be to either

always sync the states when accessing the assignment status (i.e. also when checking if workers can take on more units) or
directly set this when the agent status changes.

EDIT: this only happens for screening units!

The text was updated successfully, but these errors were encountered:

JackUrb · 2023-08-01T15:32:47Z

Thanks for all the details here @PReithofer - this is an interesting bug you've stumbled on, as the limitations for workers' active tasks requires information in the MephistoDB to be up-to-date, but indeed only a get_status() call actually pushes that status update.

Your quick fix in #1039 seems OK to me, though it definitely brings up questions on how we manage state syncing correctly. Indeed I've left this up to the _track_and_kill_runs thread, but it may be time to rethink, we haven't used the generator pattern extensively.

meta-paul added bug Something isn't working question Further information is requested and removed question Further information is requested labels Apr 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unit status not properly synced with long-running generators #1038

Unit status not properly synced with long-running generators #1038

PReithofer commented Jul 31, 2023 •

edited

Loading

JackUrb commented Aug 1, 2023

Unit status not properly synced with long-running generators #1038

Unit status not properly synced with long-running generators #1038

Comments

PReithofer commented Jul 31, 2023 • edited Loading

JackUrb commented Aug 1, 2023

PReithofer commented Jul 31, 2023 •

edited

Loading