This repository has been archived by the owner on Jan 2, 2024. It is now read-only.

Feature/new update submission status algorithm #832

Closed

Conversation

@toan-quach (Member) commented Nov 27, 2023

Purpose:

This PR is an attempt to improve the performance of the submission status update algorithm.

Changes:

Instead of looping over all jobs of a submission entity whenever one job status is updated, we now update the submission status based only on the recently updated job status.
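For illustration, here is a minimal sketch of the incremental idea, using hypothetical names rather than the actual submission.py code: the submission keeps one set of job IDs per status and only moves the updated job between sets, instead of re-scanning every job.

```python
from enum import Enum


class JobStatus(Enum):
    BLOCKED = "blocked"
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"


class SubmissionSketch:
    """Hypothetical sketch of the approach, not the actual Taipy code."""

    def __init__(self, job_ids):
        # All jobs start blocked; later updates only apply deltas.
        self._blocked = set(job_ids)
        self._pending = set()
        self._running = set()

    def update_status(self, job_id: str, new_status: JobStatus) -> str:
        # O(1) bookkeeping: move the job between the per-status sets.
        for bucket in (self._blocked, self._pending, self._running):
            bucket.discard(job_id)
        if new_status is JobStatus.BLOCKED:
            self._blocked.add(job_id)
        elif new_status is JobStatus.PENDING:
            self._pending.add(job_id)
        elif new_status is JobStatus.RUNNING:
            self._running.add(job_id)
        # Derive the submission status without looping over all jobs.
        if self._running:
            return "RUNNING"
        if self._pending:
            return "PENDING"
        if self._blocked:
            return "BLOCKED"
        return "COMPLETED"
```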

Remaining tasks:

  • Look into adding a check for the submission status in test_orchestrator.py

Result

Previously, running pytest tests/enterprise/core/test_end_to_end.py::test_without_authorization_standalone on my local machine (the test used during our discussion) took 22.55s; after applying these changes, it now takes only 5.39s!

Results from running profiling:

------------ Old version ------------

(profiling screenshot: old version)

------------ New version ------------

(profiling screenshot: new version)


github-actions bot commented Nov 27, 2023

☂️ Python Cov

current status: ✅

Overall Coverage

| Lines | Covered | Coverage | Threshold | Status |
| ----- | ------- | -------- | --------- | ------ |
| 8717  | 8256    | 95%      | 85%       | 🟢     |

New Files

No new covered files...

Modified Files

| File | Coverage | Status |
| ---- | -------- | ------ |
| src/taipy/core/_orchestrator/_orchestrator.py | 98% | 🟢 |
| src/taipy/core/submission/_submission_converter.py | 100% | 🟢 |
| src/taipy/core/submission/_submission_manager.py | 100% | 🟢 |
| src/taipy/core/submission/_submission_model.py | 100% | 🟢 |
| src/taipy/core/submission/submission.py | 94% | 🟢 |
| TOTAL | 98% | 🟢 |

updated for commit: c3f2a21 by action🐍

@gmarabout (Contributor) left a comment


There are actually two independent parts in this PR.

The first is to add the entity_type to the submission.
I fully approve this part! 👏🏻

The second is to change how the status of the submission is updated. I have comments on how it's done: duplicated status information (the _submission_status and the various boolean fields), and the additional sets that store job IDs per status but look unused... I added some comments on this.

(resolved review threads on src/taipy/core/submission/submission.py)
# Excerpt from src/taipy/core/submission/submission.py: one set of job
# IDs per status (requires `from typing import MutableSet`).
self.__running_jobs: MutableSet[str] = set()
self.__blocked_jobs: MutableSet[str] = set()
self.__pending_jobs: MutableSet[str] = set()

Contributor:


❓ What is the purpose of these additional set fields? As far as I can see, we add/remove job IDs in _update_submission_status, but we never read from them!
If it is just to track the statuses of the jobs, why not just use a single Dict[JobId, Status]?

Member Author:


Oh, it's just something to track whether we still have running, blocked, or pending jobs. Since we still need access to the other jobs' statuses, using a dict would force us to loop through all job statuses to do this check; by keeping each status in its own set, we can check for the existence of "running jobs", "blocked jobs", etc. in O(1) time. Does that make sense to you?
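To illustrate the trade-off being discussed (a toy example, not the PR's code):

```python
# Toy example contrasting the two designs discussed above.
jobs = {"job_1": "RUNNING", "job_2": "BLOCKED", "job_3": "PENDING"}

# With a single Dict[JobId, Status], "is anything still running?" is O(n):
any_running = any(status == "RUNNING" for status in jobs.values())

# With one set per status, the same check is O(1):
running_jobs = {"job_1"}
blocked_jobs = {"job_2"}
pending_jobs = {"job_3"}
any_running = bool(running_jobs)
```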

Contributor:


OK, but is it used anywhere in the code? I can't find any usage...
If it is not used, why do we need those sets?

@gmarabout (Contributor) commented Nov 27, 2023


Sorry, but I have concerns about this implementation...
Let me elaborate, so you can tell me I shouldn't actually be worried :-)

Keeping such "state" in the object without any DB persistence makes any Taipy Core service stateful. Example: a pod running in Kubernetes.

I see at least two problems now:

  • If we need to "scale up" the pod (because we need more throughput/horsepower), then the state remains in one pod while the other has an inconsistent state. The behaviour might become erratic.
  • If the pod dies for some reason, then the state/data is lost forever.

So either this state is important, and thus we need to somehow persist it, or it is not useful and we should just get rid of it.

Let me know if we need to discuss this in more detail.
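One possible direction for the persistence option (a sketch under the assumption that the sets could be serialized into the submission model; this is not something the PR implements): store the per-status job sets so a restarted service can restore them instead of losing the state.

```python
import json


def serialize_job_sets(running: set, blocked: set, pending: set) -> str:
    # Sets are not JSON-serializable, so store them as sorted lists.
    return json.dumps({
        "running_jobs": sorted(running),
        "blocked_jobs": sorted(blocked),
        "pending_jobs": sorted(pending),
    })


def restore_job_sets(raw: str) -> tuple:
    # Rebuild the in-memory sets from the persisted representation.
    data = json.loads(raw)
    return (set(data["running_jobs"]),
            set(data["blocked_jobs"]),
            set(data["pending_jobs"]))
```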

@toan-quach (Member Author) commented Nov 28, 2023


Hmmm, for now, from what I understand, as long as the orchestrator and dispatcher are in the same application, sharing the same memory pool, this shouldn't be an issue. Previously we also relied on this fact when initiating the callback from the job to update the submission status.

But you made a very valid point: in the scenario where we want to scale up by separating the dispatcher into a separate application, or even a separate machine, this won't work anymore. I think we should discuss this point further :D and we should also involve @jrobinAV, as I think this topic would also be interesting for this ticket: https://app.zenhub.com/workspaces/taipy-613ef328affde6000ff77fb3/issues/gh/avaiga/taipy-enterprise/268

Member:


Shouldn't the dispatcher and the orchestrator always be on the same machine, with only the jobs being split off?

Contributor:


@trgiangdo In theory, if you consider Kubernetes as the deployment platform, you shouldn't make such assumptions. Services running on K8s must natively be stateless and restartable without breaking the app (the so-called "cloud native" approach).
And if it can only run once (i.e., a single process that won't scale), then it must also be ready to be restarted on other nodes without notice (K8s does that). That means the state must be easy to restore or recompute. In this case, it would just be emptied...
I don't want to block the PR, but design choices imply things that go beyond the scope of the code.

Member Author:


I believe we can close this topic for now.
To summarize:

  • We acknowledge that several places are stateful and we don't yet have an approach to serialize their state across multiple machines (including job callbacks, the state of the submission entity, the event notification, etc.)
  • We will not address these issues right now and will review them again later

Contributor:


Yes, we can close the topic.

@toan-quach requested a review from gmarabout on November 27, 2023, 13:36
(resolved review threads on src/taipy/core/submission/submission.py and tests/core/submission/test_submission.py)
Review thread on the following diff excerpt:

def _update_submission_status(self, job: Job):
Member:


I am concerned about the possibility of this method being called for the same job for two (or more) status changes in the wrong order.

Member Author:


As discussed, the status of a job is fully updated by the orchestrator before the job is handed over to the dispatcher, so no job status update should create a collision. The only scenario where this might happen is when we try to cancel a job, since that action is triggered directly from the main thread while the job could be running in a subthread. But we have various conditions that check whether we can cancel a job or not: if it's running, we can't; if it's not running, I think this process can still be triggered safely when we update the job's status to canceled.
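A minimal sketch of the guard described above (hypothetical helpers and class, not the actual Taipy code): cancellation only touches jobs that are not running, so the main thread never races with a subthread that owns a running job's status.

```python
class JobSketch:
    """Hypothetical stand-in for a job; not the Taipy Job class."""

    def __init__(self, status: str):
        self.status = status

    def is_running(self) -> bool:
        return self.status == "RUNNING"

    def is_finished(self) -> bool:
        return self.status in {"COMPLETED", "FAILED", "CANCELED"}


def try_cancel(job: JobSketch) -> bool:
    # A running job is owned by its subthread, so we must not touch it;
    # a finished job has nothing left to cancel.
    if job.is_running() or job.is_finished():
        return False
    # Safe: only the main thread reaches this path.
    job.status = "CANCELED"
    return True
```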

@jrobinAV (Member) commented Dec 4, 2023

The code is already merged in taipy repo, right?
Can we close this PR?

@toan-quach (Member Author):

@jrobinAV yep we can! Sorry for that!

@toan-quach closed this Dec 4, 2023
@toan-quach deleted the feature/new-update-submission-status-algorithm branch December 4, 2023 13:15