Some AWS CI updates #910

Merged: 16 commits merged into main from testing-aws-ci on Aug 27, 2024
Conversation

@IAlibay
Member

IAlibay commented Jul 29, 2024

Checklist

  • Added a news entry

Developer Certificate of Origin


codecov bot commented Jul 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.57%. Comparing base (a9d8fb5) to head (aaa83eb).
Report is 127 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #910   +/-   ##
=======================================
  Coverage   92.57%   92.57%           
=======================================
  Files         134      134           
  Lines        9917     9917           
=======================================
  Hits         9181     9181           
  Misses        736      736           
Flag         Coverage Δ
fast-tests   92.57% <ø> (ø)

Flags with carried forward coverage won't be shown.

@mikemhenry
Contributor

The c4.xlarge seemed fine IMHO: 2 failed, 855 passed, 40 skipped, 1 xfailed, 2 xpassed, 3594 warnings, 3 rerun in 4059.65s (1:07:39). The tests ran super fast on it.

@IAlibay
Member Author

IAlibay commented Jul 30, 2024

The c4.xlarge seemed fine IMHO: 2 failed, 855 passed, 40 skipped, 1 xfailed, 2 xpassed, 3594 warnings, 3 rerun in 4059.65s (1:07:39). The tests ran super fast on it.

@mikemhenry I'm so confused, are we looking at the same thing?

I'm seeing the c4.xlarge jobs at > 5h on your branch (4h+ on tests)

[screenshot: c4.xlarge job durations]

Also - fun story, c4.xlarge is more expensive per hour than c6 or c7 🙈

[screenshot: instance pricing comparison]

@mikemhenry
Contributor

@IAlibay it is this one: https://github.com/OpenFreeEnergy/openfe/actions/runs/10143940924/job/28046616317#step:10:2709, which is the one from your PR where you changed to the faster instance. I just got my wires crossed and copied and pasted what I thought was a c4 run with the CPU affinity set manually. It looks like, as long as this instance isn't 4x the price (it's not), it saves us money to spend more on the instance that takes less time.

I think at this point, if you got a GPU quota bump, it is time to test on those, since it looks like we have everything working CPU-wise. Unless we want to use AWS for CPU tests, I don't think it's worth optimizing this.

@IAlibay
Member Author

IAlibay commented Jul 31, 2024

it saves us money to spend more on the instance that takes less time.

@mikemhenry just to point out, I think the main win in the other PR is removing coverage, not the instance type (needs testing, but I've seen the same thing locally).

@mikemhenry
Contributor

botocore.exceptions.ClientError: An error occurred (VcpuLimitExceeded) when calling the RunInstances operation: You have requested more vCPU capacity than your current vCPU limit of 0 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.

Okay @IAlibay, looks like you either need to make me an IAM account so I can request a quota increase for the G family of instances (4 vCPUs is all we need), or you will need to do it yourself. I will see if we can run on the P family.
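
For reference, the relevant vCPU quota can be inspected, and a bump requested, programmatically. A minimal boto3 sketch, matching the G/VT instance quota by name; the quota name pattern and the target region are assumptions here, and Service Quotas are per-region:

```python
# Sketch: check the EC2 vCPU quota for G-family instances and (optionally)
# request an increase. Quota name matching and the region are assumptions;
# quotas are per-region, so region_name matters.
import boto3

sq = boto3.client("service-quotas", region_name="us-east-1")

for page in sq.get_paginator("list_service_quotas").paginate(ServiceCode="ec2"):
    for quota in page["Quotas"]:
        if "G and VT instances" in quota["QuotaName"]:
            print(quota["QuotaName"], quota["Value"])
            # Uncomment to ask for 4 vCPUs (one 4-vCPU G-family runner):
            # sq.request_service_quota_increase(
            #     ServiceCode="ec2",
            #     QuotaCode=quota["QuotaCode"],
            #     DesiredValue=4.0,
            # )
```

A quota value of 0 here corresponds to the VcpuLimitExceeded error above.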

@mikemhenry
Contributor

botocore.exceptions.ClientError: An error occurred (VcpuLimitExceeded) when calling the RunInstances operation: You have requested more vCPU capacity than your current vCPU limit of 0 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.

Looks like we didn't end up getting our quota request approved for the P family either?

@ethanholz

Okay @IAlibay, looks like you either need to make me an IAM account so I can request a quota increase for the G family of instances (4 vCPUs is all we need), or you will need to do it yourself. I will see if we can run on the P family.

This is odd; I would double-check the quota amount on the G-instance types. I would advise against using the p2.xlarge, as they are not compatible with the Deep Learning AMI we recommend, so you would need a custom AMI (Eco Infra built one, but the cost-benefit did not seem worthwhile).

@mikemhenry
Contributor

I don't have access to the AWS account, so I am just doing what I can on the guess-and-check side of things in terms of which GPU quota we might have.

And yes, we want to avoid using a custom AMI; in my experience they are easy to set up but hard to keep up to date.

@IAlibay
Member Author

IAlibay commented Aug 7, 2024

@mikemhenry looks like all my quota increases were in us-east-2 not us-east-1, which would explain things.

@mikemhenry
Contributor

Okay, giving this another shot here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10322623363

@ethanholz

You may need to update the AMI since AMIs are region-specific.

@mikemhenry
Contributor

Ah, at least it is a public AMI, so I should be able to fix it even though I don't have access.
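
Since AMI IDs differ between regions but public image names generally do not, the matching AMI in another region can be looked up by name. A minimal boto3 sketch; the name filter is a placeholder, not the actual AMI used in this workflow:

```python
# Sketch: find the equivalent public AMI in a different region by name.
# The name pattern below is a placeholder for whichever Deep Learning AMI
# the workflow actually uses.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
images = ec2.describe_images(
    Owners=["amazon"],
    Filters=[{"Name": "name", "Values": ["Deep Learning Base*Ubuntu*"]}],
)["Images"]

# Take the most recently created match and use its region-local ImageId.
latest = max(images, key=lambda img: img["CreationDate"])
print(latest["ImageId"], latest["Name"])
```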

@mikemhenry
Contributor

Okay, test here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10322863477

@mikemhenry
Contributor

New test here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10324157429/job/28583139693

I didn't have the shell set up correctly.

@mikemhenry
Contributor

@IAlibay Do we want to do the integration tests as well? I was thinking of at least turning them on so we can get some timing data (after the current test finishes).

ethanholz added a commit to omsf-eco-infra/openmm-gpu-test that referenced this pull request on Aug 9, 2024:
"Recommended shell fix (OpenFreeEnergy/openfe#910 (comment)) is a more realistic option."
@mikemhenry
Contributor

Sweet, it ran in 1h 27m 20s, so one minute longer than it took to run on a CPU-only runner, but at least now we are checking our GPU code paths for ~75 cents.

@mikemhenry
Contributor

I am going to run it again but this time do our integration tests as well to see how long they take

@IAlibay
Member Author

IAlibay commented Aug 9, 2024

The integration tests are really the ones that have any kind of GPU testing. In many ways we might want to only run those.

@mikemhenry
Contributor

Yeah, I was thinking about whether it's worth running only the GPU tests or all the tests. We might have to tweak our tests to support that option, and it would save us some money, but if it took an hour to do, how many CI runs would we need to break even?

@mikemhenry
Contributor

Long tests ran in 1h 53m 39s

 ============================= slowest 10 durations =============================
1899.24s call     openfe/tests/protocols/test_openmm_rfe_slow.py::test_run_eg5_sim
916.34s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0--1-False-11-1-4]
842.12s call     openfe/tests/protocols/test_openmmutils.py::TestOFFPartialCharge::test_am1bcc_conformer_nochange
824.47s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0-0-True-14-1-4]
762.52s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_aniline_mapping-0-1-False-11-4-1]
712.34s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[independent]
658.25s call     openfe/tests/protocols/test_openmm_afe_slow.py::test_openmm_run_engine[CPU]
630.85s setup    openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::TestTyk2XmlRegression::test_particles
555.15s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_benzoic_mapping-0-0-True-14-3-1]
537.59s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[repex]
= 861 passed, 37 skipped, 1 xfailed, 2 xpassed, 5557 warnings in 6553.91s (1:49:13) =

It looks like most of the skips were from missing openeye and espaloma; we should probably install those packages so we can run everything.

Thoughts, @IAlibay?

@mikemhenry
Contributor

Added openeye and espaloma to the env, running test here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10390261526

@mikemhenry
Contributor

Typo in channel name; running test here:
https://github.com/OpenFreeEnergy/openfe/actions/runs/10390460059

@mikemhenry
Contributor

Ran into this issue:

  critical libmamba Multiple errors occured:
      Download error (28) Timeout was reached [https://conda.anaconda.org/jaimergp/label/unsupported-cudatoolkit-shim/noarch/repodata.json.zst]
      Operation too slow. Less than 30 bytes/sec transferred the last 60 seconds
      Subdir jaimergp/label/unsupported-cudatoolkit-shim/noarch not loaded!

I removed the channel from our env.yaml; I will check to make sure it doesn't cause a large download in the other jobs.

New test here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10390626657

@mikemhenry
Contributor

Some tests are skipped:

2024-08-14T16:20:29.4649694Z [gw1] [ 88%] SKIPPED openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_dict_roundtrip 
2024-08-14T16:20:29.4660788Z openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_dict_roundtrip_clear_registry 
2024-08-14T16:20:29.4676526Z [gw1] [ 88%] SKIPPED openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_dict_roundtrip_clear_registry 
2024-08-14T16:20:29.4690414Z openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_keyed_dict_roundtrip 
2024-08-14T16:20:29.4705426Z [gw1] [ 88%] SKIPPED openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_keyed_dict_roundtrip 
2024-08-14T16:20:29.4715536Z openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_shallow_dict_roundtrip 
2024-08-14T16:20:29.4730663Z [gw1] [ 88%] SKIPPED openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_shallow_dict_roundtrip 
2024-08-14T16:20:29.4740311Z openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_key_stable 
2024-08-14T16:20:29.4755475Z [gw1] [ 88%] SKIPPED openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_key_stable 
2024-08-14T16:20:29.4765931Z openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_repr 
2024-08-14T16:20:29.4779914Z [gw1] [ 88%] SKIPPED openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_repr 

I will have to look into that, but it took about 2 hours to run:

2024-08-14T17:46:31.3838640Z ============================= slowest 10 durations =============================
2024-08-14T17:46:31.3840434Z 2023.28s call     openfe/tests/protocols/test_openmm_rfe_slow.py::test_run_eg5_sim
2024-08-14T17:46:31.3842885Z 883.92s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_aniline_mapping-0-1-False-11-4-1]
2024-08-14T17:46:31.3845828Z 740.35s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0-0-True-14-1-4]
2024-08-14T17:46:31.3848191Z 696.44s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[sams]
2024-08-14T17:46:31.3850752Z 691.12s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0--1-False-11-1-4]
2024-08-14T17:46:31.3853070Z 681.27s call     openfe/tests/protocols/test_openmm_afe_slow.py::test_openmm_run_engine[CPU]
2024-08-14T17:46:31.3855546Z 651.47s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_benzoic_mapping-0-0-True-14-3-1]
2024-08-14T17:46:31.3858084Z 577.98s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[repex]
2024-08-14T17:46:31.3860668Z 538.88s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzoic_to_benzene_mapping-0-0-True-14-1-3]
2024-08-14T17:46:31.3863590Z 531.53s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_benzoic_mapping-0--1-False-11-3-1]
2024-08-14T17:46:31.3865818Z = 872 passed, 24 skipped, 4 xfailed, 5 xpassed, 5718 warnings in 6914.35s (1:55:14) =

I think now we just need to decide how often we want to schedule this run and it will be good to merge!

@mikemhenry
Contributor

@IAlibay any thoughts on how often we want to run this? It costs us ~$1.06 per run
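
As a rough sanity check on that figure, the numbers are consistent with a 4-vCPU G-family instance billed by the hour; the instance type and hourly rate below are assumptions, not stated anywhere in this thread:

```python
# Back-of-envelope cost per run, assuming a 4-vCPU G-family runner
# (e.g. g4dn.xlarge) at roughly $0.53/hr on-demand. Both the instance type
# and the rate are assumptions for illustration only.
hourly_rate = 0.53           # USD per hour, assumed
unit_run_hours = 87 / 60     # ~1h27m run without integration tests
full_run_hours = 115 / 60    # ~1h55m run including integration tests

print(f"unit tests ~ ${hourly_rate * unit_run_hours:.2f}")  # ~ $0.77
print(f"full run   ~ ${hourly_rate * full_run_hours:.2f}")  # ~ $1.02
```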

@IAlibay
Member Author

IAlibay commented Aug 26, 2024

@IAlibay any thoughts on how often we want to run this? It costs us ~$1.06 per run

I think we need to have a chat / re-organize things - we're running a ton of CPU-only tests on a GPU runner, which seems like a bit of a waste (especially if we're doing this regularly).

We probably want to move GPU-only tests to a specific set of test files and only run those (e.g. the rfe slow tests).

@mikemhenry
Contributor

I disagree. Running only the tests that could take advantage of the GPU would save us some money, but:

  1. It requires some dev time to enable (I don't think there is a way to do this without adding some logic to conftest.py, something like an --only-gpu-tests option; see the sketch below).
  2. Looking at the timing data, the slowest 5 tests take 1 hour and 20 minutes, so trimming the tests down won't save us that much.
  3. One of the slow tests is slow because it uses the CPU, so then we would have our "normal tests", our "slow GPU tests", and our "slow CPU tests", which means we would have ~3 distinct testing modes to collate.
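
A minimal sketch of what that conftest.py logic could look like; the --only-gpu-tests flag and the gpu marker are hypothetical names, not existing openfe options:

```python
# conftest.py sketch: opt-in switch to run only GPU-marked tests.
# Assumes GPU-dependent tests would be tagged with @pytest.mark.gpu;
# neither the flag nor the marker exists in openfe today.
import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--only-gpu-tests",
        action="store_true",
        default=False,
        help="run only tests marked with @pytest.mark.gpu",
    )


def pytest_configure(config):
    config.addinivalue_line("markers", "gpu: test exercises a GPU code path")


def pytest_collection_modifyitems(config, items):
    if not config.getoption("--only-gpu-tests"):
        return
    skip_non_gpu = pytest.mark.skip(reason="running the GPU-only subset")
    for item in items:
        if "gpu" not in item.keywords:
            item.add_marker(skip_non_gpu)
```

The AWS workflow would then call pytest with --only-gpu-tests, while the regular CI keeps running the full suite.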

I will say that a lot of this analysis hinges on there being some benefit to running these slow tests on a GPU. The same test, openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzoic_to_benzene_mapping-0-1-False-11-1-3], for example:

github: 1250s
aws: 531s

runs much faster on AWS, but if that is just because the VM is better and not because it is actually taking advantage of the GPU, then perhaps we can cut down our AWS time by quite a bit.
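
One quick way to check whether the runner is actually using the GPU is to list the OpenMM platforms available in that environment; a small diagnostic sketch, not something wired into these tests:

```python
# Diagnostic: list the OpenMM platforms visible in this environment.
# If CUDA/OpenCL is absent, the "GPU" runner is doing CPU work and a
# cheaper CPU instance would give similar timings.
from openmm import Platform

for i in range(Platform.getNumPlatforms()):
    platform = Platform.getPlatform(i)
    print(platform.getName(), platform.getSpeed())
```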

@IAlibay
Member Author

IAlibay commented Aug 26, 2024

@mikemhenry we should discuss this more, but I don't think the tests you're describing even use the GPU. We would be better served using a c7 instance.

@mikemhenry
Contributor

The plan is to merge this in and then add some proper GPU tests later.

mikemhenry enabled auto-merge (squash) on August 27, 2024, 14:22
mikemhenry merged commit c4951c0 into main on Aug 27, 2024
11 checks passed
mikemhenry deleted the testing-aws-ci branch on August 27, 2024, 14:40