
fix: time out of range #409

Merged: 3 commits into main, Jul 11, 2023

Conversation

@fudongyingluck (Contributor) commented Jun 30, 2023

We found that when the current time is greater than nextExecutionTime, the TimeValue passed to threadPool.schedule throws an IllegalArgumentException like the following:

java.lang.IllegalArgumentException: duration cannot be negative, was given [-2965077933106]
        at org.elasticsearch.common.unit.TimeValue.<init>(TimeValue.java:52) ~[elasticsearch-core-7.10.2.jar:7.10.2]
        at com.amazon.opendistroforelasticsearch.jobscheduler.scheduler.JobScheduler.reschedule(JobScheduler.java:190) ~[?:?]
        at com.amazon.opendistroforelasticsearch.jobscheduler.scheduler.JobScheduler.lambda$reschedule$0(JobScheduler.java:177) ~[?:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684) ~[elasticsearch-7.10.2.jar:7.10.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]

The job is then never scheduled again.
This change fixes that by setting nextExecutionTime to the current time.
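
A minimal sketch of the idea, in isolation (the class and method names below are illustrative, not the actual JobScheduler internals): clamp a nextExecutionTime that is already in the past to the current instant, so the delay handed to the thread pool is never negative.

    import java.time.Duration;
    import java.time.Instant;

    final class RescheduleDelaySketch {
        // Compute a non-negative delay for rescheduling. If the node stalled and
        // nextExecutionTime is already in the past, treat "now" as the execution
        // time instead of producing the negative duration that made TimeValue
        // throw IllegalArgumentException in the stack trace above.
        static Duration delayUntil(Instant now, Instant nextExecutionTime) {
            if (nextExecutionTime.isBefore(now)) {
                nextExecutionTime = now; // run as soon as possible
            }
            return Duration.between(now, nextExecutionTime); // always >= 0
        }

        public static void main(String[] args) {
            Instant now = Instant.now();
            System.out.println(delayUntil(now, now.minusSeconds(1800))); // PT0S
            System.out.println(delayUntil(now, now.plusSeconds(60)));    // PT1M
        }
    }

In the plugin itself the resulting delay is wrapped in a TimeValue and passed to threadPool.schedule; the guard above is the essence of the change.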

Thanks to my colleague @kkewwei for working this out.

Signed-off-by: fudongying <fudongying@bytedance.com>
Signed-off-by: kewei.11 <kewei.11@bytedance.com>

Signed-off-by: fudongying <fudongying@bytedance.com>
@joshpalis (Member) commented

Thanks for raising this PR @fudongyingluck. Checks are failing due to stale artifacts; I'll wait to re-run them until the next 2.9.0 build is successful.

* What went wrong:
Could not determine the dependencies of task ':opensearch-job-scheduler-sample-extension:jobSchedulerBwcCluster#fullRestartClusterTask'.
> Server returned HTTP response code: 403 for URL: https://ci.opensearch.org/ci/dbc/distribution-build-opensearch/2.9.0/8039/linux/x64/tar/builds/opensearch/plugins/opensearch-job-scheduler-2.9.0.0.zip

Signed-off-by: fudongying <fudongying@bytedance.com>
@fudongyingluck (Contributor, Author) commented Jul 3, 2023

@joshpalis We found that stale data remains in memory after the exception. If the index is migrated to another node and later migrates back to this one, the job is lost again. The second commit deals with that condition.
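
A hypothetical sketch of that cleanup, assuming a simple in-memory map of scheduled jobs (the names are placeholders, not the plugin's actual fields): if submitting the job to the thread pool throws, remove the in-memory entry so the job can be scheduled again once the index migrates back to this node.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    final class ScheduleCleanupSketch {
        // In-memory bookkeeping of scheduled jobs (placeholder for the real state).
        private final Map<String, Runnable> scheduledJobs = new ConcurrentHashMap<>();

        void schedule(String jobId, Runnable task) {
            scheduledJobs.put(jobId, task);
            try {
                submitToThreadPool(task); // may throw, e.g. on an invalid delay
            } catch (RuntimeException e) {
                // Without this cleanup, the stale entry would make the job look
                // scheduled and block rescheduling after the index migrates back.
                scheduledJobs.remove(jobId);
                throw e;
            }
        }

        // Stand-in for threadPool.schedule(...); runs the task inline here.
        private void submitToThreadPool(Runnable task) {
            task.run();
        }
    }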

codecov bot commented Jul 6, 2023

Codecov Report

Merging #409 (6defabc) into main (0132436) will increase coverage by 0.42%.
The diff coverage is 91.66%.

@@             Coverage Diff              @@
##               main     #409      +/-   ##
============================================
+ Coverage     28.77%   29.19%   +0.42%     
- Complexity       97       98       +1     
============================================
  Files            22       22              
  Lines          1178     1185       +7     
  Branches        109      109              
============================================
+ Hits            339      346       +7     
  Misses          818      818              
  Partials         21       21              
Impacted Files Coverage Δ
...pensearch/jobscheduler/scheduler/JobScheduler.java 74.73% <91.66%> (+2.00%) ⬆️

@fudongyingluck (Contributor, Author) commented

@joshpalis Checks seem successful now. Would you have time to review the changes? Thanks for your time!

@dbwiddis (Member) left a comment

Thanks for this fix!

In the future, it would be helpful to create an issue reporting the details of the bug and then reference that issue in the PR. That keeps the discussion separate in case there's a need to discuss appropriate fixes that the PR might not handle.

Fix LGTM with a few nits.

Signed-off-by: fudongying <fudongying@bytedance.com>
@fudongyingluck (Contributor, Author) commented

@dbwiddis Thanks very much for your time and advice. I'll create a bug issue next time. The code has been updated per your comments in the latest commit.

@cwperks (Member) left a comment

@fudongyingluck This PR looks good to me, but I'd be curious to know how to reproduce the scenario. Would you be able to provide steps to recreate the case where the duration can be negative? I can see in the code that it's possible if the current instant is after nextExecutionTime, but does that mean a job had previously failed to run and nextExecutionTime was not updated?

How can the situation arise where nextExecutionTime is in the past? Thank you.

@fudongyingluck (Contributor, Author) commented

@cwperks Good question. We were also curious about why the job disappeared until we found the logs. The cloud service that the ES k8s instance runs on was unavailable for a while, and the ES instance apparently did not run during that time; I don't know exactly how that happened. After about 30 minutes the ES instance resumed, and the exception occurred.
I know we should fix the stalled ES instance problem, but that is more complex, and the production ES instance can't be restarted for stability reasons. To avoid the problem recurring, I raised this PR.

@cwperks (Member) commented Jul 11, 2023

@fudongyingluck Thank you for the context!

@joshpalis joshpalis merged commit 9f4ec67 into opensearch-project:main Jul 11, 2023
11 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jul 11, 2023
* fix: time out of range

Signed-off-by: fudongying <fudongying@bytedance.com>

* fix: deschedule failed after schedule exception

Signed-off-by: fudongying <fudongying@bytedance.com>

* chore: dbwiddis's comments

Signed-off-by: fudongying <fudongying@bytedance.com>

---------

Signed-off-by: fudongying <fudongying@bytedance.com>
(cherry picked from commit 9f4ec67)
joshpalis pushed a commit that referenced this pull request Jul 11, 2023
* fix: time out of range

Signed-off-by: fudongying <fudongying@bytedance.com>

* fix: deschedule failed after schedule exception

Signed-off-by: fudongying <fudongying@bytedance.com>

* chore: dbwiddis's comments

Signed-off-by: fudongying <fudongying@bytedance.com>

---------

Signed-off-by: fudongying <fudongying@bytedance.com>
(cherry picked from commit 9f4ec67)
Signed-off-by: Joshua Palis <jpalis@amazon.com>
joshpalis pushed a commit that referenced this pull request Jul 11, 2023
* fix: time out of range

* fix: deschedule failed after schedule exception

* chore: dbwiddis's comments

---------

(cherry picked from commit 9f4ec67)

Signed-off-by: fudongying <fudongying@bytedance.com>
Signed-off-by: Joshua Palis <jpalis@amazon.com>
Co-authored-by: fudongying <30896830+fudongyingluck@users.noreply.github.com>