Feature/new update submission status algorithm #832
Conversation
There are actually two independent parts in this PR.
The first is to add the entity_type to the submission. I fully approve this part! 👏🏻
The second is to change how the status of the submission is updated. I have comments on how it is done: duplicated status information (the _submission_status field and the various boolean fields), and the additional sets that store job IDs per status but look unused... I added some comments on this.
self.__running_jobs: MutableSet[str] = set()
self.__blocked_jobs: MutableSet[str] = set()
self.__pending_jobs: MutableSet[str] = set()
❓ What is the purpose of these additional set fields? As far as I can see, we add/remove job IDs from them in _update_submission_status, but we never read from them!
If it is just to track the statuses of the jobs, why not just use a single Dict[JobId, Status]?
Oh, it's just something to track whether we still have running, blocked, or pending jobs. Since we still need access to the other job statuses, if we used a dict we would have to loop through all job statuses to do this check, whereas if we keep one set per status, we can check for the existence of "running jobs", "blocked jobs", etc. in O(1) time. Does that make sense to you?
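For illustration, here is a minimal sketch of the trade-off being discussed; the class and method names are hypothetical, not the actual taipy-core code:

```python
from typing import Dict, MutableSet


class _StatusTracker:
    """Hypothetical sketch comparing the two approaches discussed above."""

    def __init__(self) -> None:
        # One set per status: "is there any job in this status?" is O(1).
        self._running_jobs: MutableSet[str] = set()
        # Single-dict alternative suggested in the review.
        self._job_statuses: Dict[str, str] = {}

    def set_running(self, job_id: str) -> None:
        self._running_jobs.add(job_id)
        self._job_statuses[job_id] = "RUNNING"

    def has_running_jobs_via_set(self) -> bool:
        return bool(self._running_jobs)  # O(1)

    def has_running_jobs_via_dict(self) -> bool:
        # O(n): must scan every job's status.
        return any(status == "RUNNING" for status in self._job_statuses.values())
```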
Ok, but is it used anywhere in the code? I can't find any usage...
If it is not used, why do we need those sets?
Sorry, but I have concerns about this implementation...
Let me elaborate, so you can tell me I shouldn't actually be worried :-)
Keeping such "state" in the object without any DB persistence makes any Taipy-core service stateful. Example: a pod running in Kubernetes.
I see at least two problems:
- If we need to "scale up" the pod (because we need more throughput/horsepower), then the state remains in one pod while the other has an inconsistent state. The behaviour might become erratic.
- If the pod dies for some reason, then the state/data is lost forever.
So either this state is important, and then we need to somehow persist it, or it's not useful and we should just get rid of it.
Let me know if we need to discuss it in more detail.
Hmmm, for now, from what I can understand, as long as the orchestrator and the dispatcher are in the same application, sharing the same memory pool, this shouldn't be an issue. Previously, we also relied on this fact when initiating the callbacks from a job to update the submission status:
- To run callbacks (https://github.com/Avaiga/taipy-core/blob/develop/src/taipy/core/job/job.py#L31): this function relies on callbacks stored in memory, not on disk/in the DB.
- Assigning callbacks to jobs (https://github.com/Avaiga/taipy-core/blob/develop/src/taipy/core/_orchestrator/_orchestrator.py#L81): we never store these callbacks until after we have finished running the job in https://github.com/Avaiga/taipy-core/blob/develop/src/taipy/core/_orchestrator/_dispatcher/_job_dispatcher.py#L124, where we set everything and consequently also save the job's callbacks to disk/the DB.

But you made a very valid point: in the scenario where we want to scale up, separating the dispatcher into a separate application or even a separate machine, this won't work anymore. I think we should discuss this point further :D and we should also involve @jrobinAV, as I think this topic would also be interesting for this ticket: https://app.zenhub.com/workspaces/taipy-613ef328affde6000ff77fb3/issues/gh/avaiga/taipy-enterprise/268
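To make the "in-memory callbacks" point concrete, here is a rough, simplified sketch (hypothetical names, not the actual taipy-core implementation):

```python
from typing import Callable, List


class Job:
    """Simplified, hypothetical sketch of in-memory job callbacks."""

    def __init__(self, job_id: str) -> None:
        self.id = job_id
        # Plain Python callables held only in process memory: they are
        # never serialized, so they only work while the orchestrator and
        # the dispatcher share the same memory pool.
        self._subscribers: List[Callable[["Job"], None]] = []

    def subscribe(self, callback: Callable[["Job"], None]) -> None:
        self._subscribers.append(callback)

    def _on_status_change(self) -> None:
        # Runs in-process only; if the process dies or the dispatcher
        # moves to another machine, these callbacks are lost.
        for callback in self._subscribers:
            callback(self)
```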
Isn't it that the dispatcher and the orchestrator should always be on the same machine, and we only split off the jobs?
@trgiangdo In theory, if you consider Kubernetes as the deployment platform, you shouldn't make such assumptions. Services running on K8s must natively be stateless and restartable without breaking the app (the so-called "cloud native" approach).
And if a service can only run once (i.e. a single process that won't scale), then it must also be ready to be restarted on other nodes without notice (K8s does that). It means the state must be easily restored or recomputed. In this case, it would just be emptied...
I don't want to block the PR, but design choices imply things that go beyond the scope of the code.
I believe we can close this topic for now.
To summarize:
- We acknowledge that there are several stateful places for which we don't yet have an approach to serialize the state across multiple machines (including job callbacks, the state of the submission entity, the event notification, etc.)
- We will not address these issues right now and will review them again later.
Yes, we can close the topic.
if job.is_abandoned():
    abandoned = True
if canceled:
def _update_submission_status(self, job: Job):
I am concerned about the possibility of calling this method for the same job for 2 (or more) status changes in the wrong order.
As discussed, the orchestrator finishes updating a job's status before handing it over to the dispatcher, so no job status update should create a collision. The only scenario where this might happen is when we try to cancel a job, since that action is triggered directly from the main thread while the job could be running in a subthread. But we have various conditions to check whether we can cancel a job or not: if it's running, we can't; if it's not running, I think this process can still be triggered safely when we update the job's status to canceled.
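A hedged sketch of that cancellation guard (simplified; the method names is_running() and canceled() are assumptions for illustration, and the real checks live in the orchestrator):

```python
def try_cancel(job) -> bool:
    """Hypothetical sketch of the cancellation guard described above."""
    if job.is_running():
        # The job is executing in a subthread; cancelling now could
        # collide with a concurrent status update from the dispatcher.
        return False
    # The job is not running, so the main thread can safely mark it
    # canceled and propagate the change to the submission status.
    job.canceled()
    return True
```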
The code is already merged into the taipy repo, right?
@jrobinAV yep we can! Sorry for that!
Purpose:
This PR is an attempt to improve the performance of the submission status change algorithm.
Changes:
Instead of looping over all jobs of a submission entity whenever one job's status is updated, we now update the submission status based only on the recently updated job status (see the simplified sketch below).
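Conceptually, the change looks something like this sketch (hypothetical names and statuses, not the exact PR code):

```python
from typing import MutableSet


class Submission:
    """Minimal stand-in for the submission entity used in this sketch."""

    def __init__(self, jobs) -> None:
        self.jobs = jobs
        self.status = "SUBMITTED"
        self.running_jobs: MutableSet[str] = set()
        self.pending_jobs: MutableSet[str] = set()


def update_status_old(submission) -> None:
    # Old approach: rescan every job of the submission on each job
    # status change -- O(number of jobs) per update.
    if any(job.is_running() for job in submission.jobs):
        submission.status = "RUNNING"
    elif any(job.is_pending() for job in submission.jobs):
        submission.status = "PENDING"
    else:
        submission.status = "COMPLETED"


def update_status_new(submission, job) -> None:
    # New approach: adjust per-status sets using only the job whose
    # status just changed -- O(1) per update.
    submission.running_jobs.discard(job.id)
    submission.pending_jobs.discard(job.id)
    if job.is_running():
        submission.running_jobs.add(job.id)
    elif job.is_pending():
        submission.pending_jobs.add(job.id)
    if submission.running_jobs:
        submission.status = "RUNNING"
    elif submission.pending_jobs:
        submission.status = "PENDING"
    else:
        submission.status = "COMPLETED"
```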
Remaining tasks:
test_orchestrator.py
Result
Previously, running
pytest tests/enterprise/core/test_end_to_end.py::test_without_authorization_standalone
on my local machine (the test used during our discussion) took 22.55s. After applying these changes, it now takes only 5.39s!
Results from running profiling:
------------ Old version ------------
------------ New version ------------