Tracking issue for CI build timeouts #76454

Closed
jkotas opened this issue Sep 30, 2022 · 16 comments
Labels
area-Infrastructure blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms'
Milestone

Comments

@jkotas
Member

jkotas commented Sep 30, 2022

 {
    "ErrorMessage" : "##[error]The operation was canceled.",
    "BuildRetry": false
 }

See dotnet/arcade#11072 for more context.

Report

Build | Definition | Step Name | Console log | Pull Request
2409598 | dotnet-runtime | Build product | Log | -
610058 | dotnet/runtime | Send to Helix | Log | #97792
610152 | dotnet/runtime | Send to Helix | Log | -
610011 | dotnet/runtime | Send tests to Helix (Unix) | Log | #99992
609972 | dotnet/runtime | Send tests to Helix (Unix) | Log | #100014
609923 | dotnet/runtime | Send to Helix | Log | -
609966 | dotnet/runtime | Send to Helix | Log | #100018
609950 | dotnet/runtime | Send to Helix | Log | #99608
609925 | dotnet/runtime | Send to Helix | Log | #100016
609818 | dotnet/runtime | Send tests to Helix (Unix) | Log | #100011
609824 | dotnet/runtime | Send tests to Helix (Unix) | Log | #100012
609832 | dotnet/runtime | Send tests to Helix (Unix) | Log | #99859
609788 | dotnet/runtime | Send tests to Helix (Unix) | Log | #99953
609791 | dotnet/runtime | Send to Helix | Log | #99909
609762 | dotnet/runtime | Send to Helix | Log | #99371
2409365 | dotnet-runtime | Send job to Helix (Windows) | Log | -
609735 | dotnet/runtime | Send to Helix | Log | -
609729 | dotnet/runtime | Send to Helix | Log | -
609736 | dotnet/runtime | Send to Helix | Log | -
609730 | dotnet/runtime | Send to Helix | Log | -
609709 | dotnet/runtime | Send to Helix | Log | -
609711 | dotnet/runtime | Send to Helix | Log | -
609742 | dotnet/runtime | Send to Helix | Log | -
609715 | dotnet/runtime | Send to Helix | Log | -
609716 | dotnet/runtime | Send to Helix | Log | -
609720 | dotnet/runtime | Send to Helix | Log | -
609701 | dotnet/runtime | Send tests to Helix (Windows) | Log | #99778
609693 | dotnet/runtime | Send tests to Helix (Windows) | Log | #100004
609667 | dotnet/runtime | Send to Helix | Log | #99608
609662 | dotnet/runtime | Send to Helix | Log | #100005
609653 | dotnet/runtime | Send to Helix | Log | #100001
609637 | dotnet/runtime | Send to Helix | Log | #100002
609610 | dotnet/runtime | Send to Helix | Log | #99972
609607 | dotnet/runtime | Send to Helix | Log | #100000
609604 | dotnet/runtime | Send tests to Helix (Unix) | Log | #99999
609600 | dotnet/runtime | Send to Helix | Log | #99998
609196 | dotnet/runtime | Send tests to Helix (Unix) | Log | #99985
609587 | dotnet/runtime | Send tests to Helix (Unix) | Log | -
609327 | dotnet/runtime | Send to Helix | Log | -
609031 | dotnet/runtime | Send tests to Helix (Unix) | Log | #99688
608673 | dotnet/runtime | Send to Helix | Log | #99836
609589 | dotnet/runtime | Send to Helix | Log | -
609577 | dotnet/runtime | Send to Helix | Log | -
609546 | dotnet/runtime | Send tests to Helix (Unix) | Log | #99996
608721 | dotnet/runtime | Send tests to Helix (Unix) | Log | #99956
608715 | dotnet/runtime | Send to Helix | Log | #99873
608540 | dotnet/runtime | Send tests to Helix (Windows) | Log | #99908
609321 | dotnet/runtime | Send to Helix | Log | -
609516 | dotnet/runtime | Send tests to Helix (Unix) | Log | #99995
2409150 | dotnet-runtime | Send job to Helix (Windows) | Log | -
609454 | dotnet/runtime | Send tests to Helix (Unix) | Log | #99849
609442 | dotnet/runtime | Send tests to Helix (Unix) | Log | #99982
609436 | dotnet/runtime | Send tests to Helix (Unix) | Log | #97529
609405 | dotnet/runtime | Send to Helix | Log | #99183
608686 | dotnet/runtime | Send to Helix | Log | #99973
609391 | dotnet/runtime | Send to Helix | Log | #99992
609368 | dotnet/runtime | Send tests to Helix (Windows) | Log | #99991
608833 | dotnet/runtime | Send to Helix | Log | -
609361 | dotnet/runtime | Send to Helix | Log | #99990
609316 | dotnet/runtime | Send tests to Helix (Unix) | Log | -
609355 | dotnet/runtime | Send to Helix | Log | #99662
609318 | dotnet/runtime | Send tests to Helix (Unix) | Log | -
609323 | dotnet/runtime | Send tests to Helix (Unix) | Log | -
609345 | dotnet/runtime | Send tests to Helix (Unix) | Log | #99784
609324 | dotnet/runtime | Send tests to Helix (Unix) | Log | -
2408902 | dotnet-runtime | Send job to Helix (Windows) | Log | -
609315 | dotnet/runtime | Send tests to Helix (Unix) | Log | -
609332 | dotnet/runtime | Send tests to Helix (Unix) | Log | #99970
609329 | dotnet/runtime | Send to Helix | Log | #99989
609312 | dotnet/runtime | Send tests to Helix (Unix) | Log | -
2408912 | dotnet-runtime | Build product | Log | -
609326 | dotnet/runtime | Send to Helix | Log | -
609320 | dotnet/runtime | Send to Helix | Log | -
609229 | dotnet/runtime | Send to Helix | Log | #99987
609199 | dotnet/runtime | Send tests to Helix (Unix) | Log | #99986
609077 | dotnet/runtime | Send tests to Helix (Unix) | Log | -
609073 | dotnet/runtime | Send tests to Helix (Unix) | Log | -
2408699 | dotnet-runtime | Send job to Helix (Windows) | Log | -
609163 | dotnet/runtime | Send to Helix | Log | #99983
608351 | dotnet/runtime | Send tests to Helix (Unix) | Log | #99958
609089 | dotnet/runtime | Send to Helix | Log | #99584
609081 | dotnet/runtime | Send to Helix | Log | -
609092 | dotnet/runtime | Send to Helix | Log | #99584
609062 | dotnet/runtime | Send to Helix | Log | #99981
609059 | dotnet/runtime | Send to Helix | Log | #99981
609069 | dotnet/runtime | Send tests to Helix (Unix) | Log | -
2408649 | dotnet-runtime | Send job to Helix (Windows) | Log | -
608477 | dotnet/runtime | Send to Helix | Log | #99889
609048 | dotnet/runtime | Build native test components | Log | #99980
609053 | dotnet/runtime | Build product | Log | #99980
609003 | dotnet/runtime | Send to Helix | Log | #99909
609026 | dotnet/runtime | Send to Helix | Log | #99781
609080 | dotnet/runtime | Send to Helix | Log | -
608992 | dotnet/runtime | Send to Helix | Log | #99183
608987 | dotnet/runtime | Send tests to Helix (Unix) | Log | #99566
608889 | dotnet/runtime | Send tests to Helix (Windows) | Log | #99468
608897 | dotnet/runtime | Send to Helix | Log | -
608843 | dotnet/runtime | Send to Helix | Log | #99955
608820 | dotnet/runtime | Send to Helix | Log | #99976
608830 | dotnet/runtime | Send tests to Helix (Unix) | Log | -
Displaying 100 of 670 results

Summary

24-Hour Hit Count | 7-Day Hit Count | 1-Month Count
98 | 258 | 662

Known issue validation

Build: 🔎
Result validation: ⚠️ Provided build not found. Provide a valid build in the "Build: 🔎" line.
Validation performed at: 11/30/2023 9:39:02 PM UTC

@jkotas jkotas added area-Infrastructure Known Build Error Use this to report build issues in the .NET Helix tab labels Sep 30, 2022
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Sep 30, 2022
@ghost

ghost commented Sep 30, 2022

Tagging subscribers to this area: @dotnet/runtime-infrastructure
See info in area-owners.md if you want to be subscribed.

Issue Details
 {
    "ErrorMessage" : "The operation was canceled.",
    "BuildRetry": false
 }

See dotnet/arcade#11072 for more context.

Author: jkotas
Assignees: -
Labels:

area-Infrastructure, Known Build Error

Milestone: -

@jkotas jkotas added the blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' label Oct 1, 2022
@jeffschwMSFT jeffschwMSFT added this to the 8.0.0 milestone Oct 4, 2022
@jeffschwMSFT jeffschwMSFT removed the untriaged New issue has not been triaged by the area owner label Oct 4, 2022
hoyosjs added a commit that referenced this issue Feb 22, 2023
Reverts version bump component of #81164

This PR caused heavy managed build regressions hitting all PR builds. See #82458 and #76454
@kg
Member

kg commented Feb 23, 2023

#82476 can be ignored; it's intentionally running tests in a slower mode to flush out failures.

@janvorli
Member

I have noticed that the osx-x64 Release NativeAOT leg timed out in my PR #85366, but from its log, it is obvious that it has completed running (https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_apis/build/builds/252395/logs/1833). The leg took a few seconds over two hours. Maybe we could bump the timeout just a bit.

@jkotas
Member Author

jkotas commented Apr 27, 2023

I have noticed that the osx-x64 Release NativeAOT leg timed out in my PR #85366, but from its log, it is obvious that it has completed running (https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_apis/build/builds/252395/logs/1833)

The log shows that the leg was waiting for 1+ hour for a Helix machine. This happens when we are sending more work to a Helix queue than what it has capacity for. Increasing the timeouts is not going to fix it.
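
To illustrate the capacity point, here is a minimal sketch with made-up numbers: the submission rate, queue capacity, and run time below are assumptions for illustration, not measurements from any Helix queue. When work arrives faster than the queue drains it, the backlog (and with it the wait) grows every hour, so a larger fixed timeout only delays the first timed-out leg.

```python
def queue_wait_hours(arrivals_per_hour, capacity_per_hour, hours):
    """Estimated wait (in hours) for a work item submitted at the end of each hour.

    Simplified model: constant arrival and service rates, FIFO queue.
    """
    backlog = 0.0
    waits = []
    for _ in range(hours):
        # Backlog grows by whatever the queue could not drain this hour.
        backlog = max(0.0, backlog + arrivals_per_hour - capacity_per_hour)
        waits.append(backlog / capacity_per_hour)
    return waits


def first_timeout_hour(waits, run_time_hours, timeout_hours):
    """First hour (1-based) whose submissions exceed the timeout, or None."""
    for hour, wait in enumerate(waits, start=1):
        if wait + run_time_hours > timeout_hours:
            return hour
    return None


# Hypothetical load: 120 work items/hour submitted, 100 work items/hour capacity.
waits = queue_wait_hours(arrivals_per_hour=120, capacity_per_hour=100, hours=8)
for timeout in (2.0, 2.5):
    hour = first_timeout_hour(waits, run_time_hours=1.0, timeout_hours=timeout)
    print(f"timeout {timeout}h: submissions start timing out after hour {hour}")
```

In this toy model, raising the timeout from 2 to 2.5 hours moves the first timed-out submission only a couple of hours later; as long as arrivals exceed capacity, the backlog keeps growing.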

@janvorli
Copy link
Member

Ah, I can see it now; I missed that.

@hoyosjs
Member

hoyosjs commented Apr 27, 2023

Looking at this, our queue depth is not looking great for OSX 12.

image

Should we be trying to spill some work to 13 and 11?

cc: @dotnet/runtime-infrastructure @dotnet/dnceng

@hoyosjs
Member

hoyosjs commented Apr 27, 2023

Hmm, these are the largest work producers:

image

PRs like #85389 and #85343 account for a ton of the work.

@hoyosjs
Member

hoyosjs commented Apr 27, 2023

image

The runtime time distribution seems to have been consistent over the last couple of weeks. We account for 263043 of the 307320 work items sent to the queue over the last 3 days. That being said, in terms of work item compute time, it looks like SDK is sending very long-lived work items; what makes SDK take so much compute time is the different branches.
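
For scale, the two counts above work out as follows. This is a small sketch that assumes the first number is the runtime repo's work items and the second is the total sent to the queue, which is how the comment reads.

```python
# Counts quoted in the comment above (work items over the last 3 days).
runtime_items = 263043   # assumed: work items sent by dotnet/runtime
total_items = 307320     # assumed: all work items sent to the queue

share = runtime_items / total_items
print(f"runtime share of work items: {share:.1%}")
```

Work item count is only part of the picture: as the comment notes, a producer with fewer but longer-lived work items can still consume a large share of compute time.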

@adamsitnik
Member

The following two CI legs have been failing due to timeouts for at least two weeks:

  • Libraries Test Run release mono linux x64 Debug
  • Libraries Test Run release coreclr linux x64 Debug

@vcsjones is facing it in #87099 as well.

Is anyone investigating this issue?

@lambdageek
Member

lambdageek commented Jun 16, 2023

Seeing the same legs time out as @adamsitnik, too (Libraries Test Run release {mono,coreclr} linux x64 Debug)

I see something like this in both of the console logs:

  Sending Job to (Centos.8.Amd64.Open)Ubuntu.1804.Amd64.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:centos-stream8-helix...
  Sent Helix Job; see work items at https://helix.dot.net/api/jobs/f0271bc6-76fb-445b-a33f-c02c71caa5fb/workitems?api-version=2019-06-17
  Sending Job to (AlmaLinux.8.Amd64.Open)RedHat.7.Amd64.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:almalinux-8-helix-amd64...
  Sent Helix Job; see work items at https://helix.dot.net/api/jobs/c66c9c1c-8014-44d0-a94d-9aa1697c740d/workitems?api-version=2019-06-17
  Sending Job to (Debian.11.Amd64.Open)Ubuntu.1804.Amd64.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:debian-11-helix-amd64...
  Sent Helix Job; see work items at https://helix.dot.net/api/jobs/16e4db17-152c-4b75-af4a-d6e4e924cb10/workitems?api-version=2019-06-17
  Sending Job to Ubuntu.1804.Amd64.Open...
  Sent Helix Job; see work items at https://helix.dot.net/api/jobs/a7884197-1f40-4f66-bcdd-ae3b24af1100/workitems?api-version=2019-06-17
  Waiting for completion of job a7884197-1f40-4f66-bcdd-ae3b24af1100 on Ubuntu.1804.Amd64.Open (Details: https://helix.dot.net/api/jobs/a7884197-1f40-4f66-bcdd-ae3b24af1100/details?api-version=2019-06-17 )
  Waiting for completion of job c66c9c1c-8014-44d0-a94d-9aa1697c740d on (AlmaLinux.8.Amd64.Open)RedHat.7.Amd64.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:almalinux-8-helix-amd64 (Details: https://helix.dot.net/api/jobs/c66c9c1c-8014-44d0-a94d-9aa1697c740d/details?api-version=2019-06-17 )
  Waiting for completion of job f0271bc6-76fb-445b-a33f-c02c71caa5fb on (Centos.8.Amd64.Open)Ubuntu.1804.Amd64.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:centos-stream8-helix (Details: https://helix.dot.net/api/jobs/f0271bc6-76fb-445b-a33f-c02c71caa5fb/details?api-version=2019-06-17 )
  Waiting for completion of job 16e4db17-152c-4b75-af4a-d6e4e924cb10 on (Debian.11.Amd64.Open)Ubuntu.1804.Amd64.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:debian-11-helix-amd64 (Details: https://helix.dot.net/api/jobs/16e4db17-152c-4b75-af4a-d6e4e924cb10/details?api-version=2019-06-17 )
  Job f0271bc6-76fb-445b-a33f-c02c71caa5fb on (Centos.8.Amd64.Open)Ubuntu.1804.Amd64.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:centos-stream8-helix is completed with 243 finished work items.
  Job 16e4db17-152c-4b75-af4a-d6e4e924cb10 on (Debian.11.Amd64.Open)Ubuntu.1804.Amd64.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:debian-11-helix-amd64 is completed with 243 finished work items.
  Job a7884197-1f40-4f66-bcdd-ae3b24af1100 on Ubuntu.1804.Amd64.Open is completed with 243 finished work items.
##[error]The operation was canceled.

On both legs, it looks like it's the (AlmaLinux.8.Amd64.Open) job that didn't finish.
I wonder if it is the same distro on other runs, too.
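
Checking other runs for the same pattern can be automated. Below is a minimal sketch that scans a console log for jobs that were waited on but never reported as completed. The regular expressions are inferred from the log lines quoted above, and the function name `unfinished_jobs` is hypothetical; other logs may use a different format.

```python
import re

# Patterns inferred from the console log excerpt above (an assumption about the format).
WAITING = re.compile(r"Waiting for completion of job (\S+) on (.+?) \(Details:")
COMPLETED = re.compile(r"Job (\S+) on (.+?) is completed with (\d+) finished work items")


def unfinished_jobs(log_text):
    """Map job id -> queue name for jobs that were awaited but never completed."""
    pending = {}
    for line in log_text.splitlines():
        line = line.strip()
        match = WAITING.search(line)
        if match:
            pending[match.group(1)] = match.group(2)
            continue
        match = COMPLETED.search(line)
        if match:
            pending.pop(match.group(1), None)
    return pending


SAMPLE_LOG = """
  Waiting for completion of job a7884197-1f40-4f66-bcdd-ae3b24af1100 on Ubuntu.1804.Amd64.Open (Details: https://helix.dot.net/api/jobs/a7884197-1f40-4f66-bcdd-ae3b24af1100/details?api-version=2019-06-17 )
  Waiting for completion of job c66c9c1c-8014-44d0-a94d-9aa1697c740d on (AlmaLinux.8.Amd64.Open)RedHat.7.Amd64.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:almalinux-8-helix-amd64 (Details: https://helix.dot.net/api/jobs/c66c9c1c-8014-44d0-a94d-9aa1697c740d/details?api-version=2019-06-17 )
  Waiting for completion of job f0271bc6-76fb-445b-a33f-c02c71caa5fb on (Centos.8.Amd64.Open)Ubuntu.1804.Amd64.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:centos-stream8-helix (Details: https://helix.dot.net/api/jobs/f0271bc6-76fb-445b-a33f-c02c71caa5fb/details?api-version=2019-06-17 )
  Waiting for completion of job 16e4db17-152c-4b75-af4a-d6e4e924cb10 on (Debian.11.Amd64.Open)Ubuntu.1804.Amd64.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:debian-11-helix-amd64 (Details: https://helix.dot.net/api/jobs/16e4db17-152c-4b75-af4a-d6e4e924cb10/details?api-version=2019-06-17 )
  Job f0271bc6-76fb-445b-a33f-c02c71caa5fb on (Centos.8.Amd64.Open)Ubuntu.1804.Amd64.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:centos-stream8-helix is completed with 243 finished work items.
  Job 16e4db17-152c-4b75-af4a-d6e4e924cb10 on (Debian.11.Amd64.Open)Ubuntu.1804.Amd64.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:debian-11-helix-amd64 is completed with 243 finished work items.
  Job a7884197-1f40-4f66-bcdd-ae3b24af1100 on Ubuntu.1804.Amd64.Open is completed with 243 finished work items.
##[error]The operation was canceled.
"""

for job_id, queue in unfinished_jobs(SAMPLE_LOG).items():
    print(f"never completed: {job_id} on {queue}")
```

Run against the excerpt above, the only job left pending is the one on the AlmaLinux queue.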

@lambdageek
Member

lambdageek commented Jun 16, 2023

There's another issue about the recent AlmaLinux timeouts #87667 - the AlmaLinux leg has been removed by #87668

@agocke agocke modified the milestones: 8.0.0, 9.0.0 Sep 5, 2023
@build-analysis build-analysis bot removed this from the 9.0.0 milestone Nov 15, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Nov 15, 2023
@akoeplinger akoeplinger added this to the 9.0.0 milestone Nov 24, 2023
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Nov 24, 2023
@rzikm
Member

rzikm commented Nov 30, 2023

This issue also seems to match generic test timeout failures, such as

System.Net.Http.HttpRequestException : Requesting HTTP version 3.0 with version policy RequestVersionExact while unable to establish HTTP/3 connection.
---- System.OperationCanceledException : The operation was canceled.

is that the intent here?

@jkotas
Member Author

jkotas commented Nov 30, 2023

is that the intent here?

It is not the intent. I have tweaked the matching string.
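
The effect of the tweak can be sketched as follows, under the assumption that "ErrorMessage" is applied as a plain substring match against log lines; the thread does not describe the matcher itself, so treat `matches` as a hypothetical stand-in. The two strings are the one quoted in the bot's copy of the issue details and the one in the current issue description.

```python
def matches(error_message, log_lines):
    """Return the log lines containing error_message (assumed substring semantics)."""
    return [line for line in log_lines if error_message in line]


log_lines = [
    # Generic test failure quoted in the comment above.
    "---- System.OperationCanceledException : The operation was canceled.",
    # Pipeline-level cancellation that this issue is meant to track.
    "##[error]The operation was canceled.",
]

broad = matches("The operation was canceled.", log_lines)
narrow = matches("##[error]The operation was canceled.", log_lines)

print(len(broad))   # both lines contain the shorter string
print(narrow)       # only the pipeline cancellation line remains
```

Including the `##[error]` prefix keeps the match on the pipeline cancellation line and leaves test failures that merely mention a canceled operation unmatched.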

@JulieLeeMSFT JulieLeeMSFT removed the Known Build Error Use this to report build issues in the .NET Helix tab label Mar 20, 2024
@JulieLeeMSFT
Member

Removing the Known Build Error label because it is marking Build Analysis green. We want developers either to wait until the infra timeout issue is resolved and rerun their tests, or to use their judgement and take the escape path to merge.

@JulieLeeMSFT
Member

Known Build Error label is useful to get stats on the infra issues such as timeout. On the other hand, it allows PRs to get merged while tests are not fully run. So we need to think about how to handle such cases.

@jkotas
Member Author

jkotas commented Mar 20, 2024

Known Build Error label is useful to get stats on the infra issues such as timeout

This was the sole purpose of this tracking issue. If we do not have it marked as Known Build Error, it can be closed.

@jkotas jkotas closed this as completed Mar 20, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Apr 20, 2024