Some AWS CI updates #910

Merged: 16 commits merged into main from testing-aws-ci on Aug 27, 2024
Conversation

@IAlibay
Member

IAlibay commented Jul 29, 2024

Checklist

  • Added a news entry

Developer Certificate of Origin


codecov bot commented Jul 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.57%. Comparing base (a9d8fb5) to head (aaa83eb).
Report is 127 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #910   +/-   ##
=======================================
  Coverage   92.57%   92.57%           
=======================================
  Files         134      134           
  Lines        9917     9917           
=======================================
  Hits         9181     9181           
  Misses        736      736           
Flag         Coverage Δ
fast-tests   92.57% <ø> (ø)

Flags with carried forward coverage won't be shown.

@mikemhenry
Contributor

The c4.xlarge seemed fine IMHO: 2 failed, 855 passed, 40 skipped, 1 xfailed, 2 xpassed, 3594 warnings, 3 rerun in 4059.65s (1:07:39). The tests ran super fast on it.

@IAlibay
Member Author

IAlibay commented Jul 30, 2024

The c4.xlarge seemed fine IMHO: 2 failed, 855 passed, 40 skipped, 1 xfailed, 2 xpassed, 3594 warnings, 3 rerun in 4059.65s (1:07:39). The tests ran super fast on it.

@mikemhenry I'm so confused, are we looking at the same thing?

I'm seeing the c4.xlarge jobs at > 5h on your branch (4h+ on tests)

[screenshot: c4.xlarge job durations]

Also - fun story, c4.xlarge is more expensive per hour than c6 or c7 🙈

[screenshot: instance pricing comparison]

@mikemhenry
Contributor

@IAlibay it is this one: https://github.com/OpenFreeEnergy/openfe/actions/runs/10143940924/job/28046616317#step:10:2709, which is the one from your PR where you changed to the faster instance. I just got my wires crossed and copied and pasted what I thought was a c4 run with the CPU affinity set manually. It looks like, as long as this instance isn't 4x the price (it's not), it saves us money to spend more on the instance that takes less time.

I think at this point, if you got a GPU quota bump, it is time to test on those, since it looks like we have everything working CPU-wise. Unless we want to use AWS for CPU tests, I don't think it's worth optimizing this.

@IAlibay
Member Author

IAlibay commented Jul 31, 2024

it saves us money to spend more on the instance that takes less time.

@mikemhenry just to point out, I think the main win in the other PR is removing coverage, not the instance type (needs testing, but I've seen the same thing locally).

@mikemhenry
Contributor

botocore.exceptions.ClientError: An error occurred (VcpuLimitExceeded) when calling the RunInstances operation: You have requested more vCPU capacity than your current vCPU limit of 0 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.

Okay @IAlibay, looks like you either need to make me an IAM account so I can request a quota increase for the G family of instances (4 vCPUs is all we need), or you will need to do it yourself. I will see if we can run on the P family.
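
For reference, the relevant vCPU quota can be inspected, and a bump requested, programmatically. A minimal boto3 sketch, matching the G/VT instance quota by name; the quota name pattern and the target region are assumptions here, and Service Quotas are per-region:

```python
# Sketch: check the EC2 vCPU quota for G-family instances and (optionally)
# request an increase. Quota name matching and the region are assumptions;
# quotas are per-region, so region_name matters.
import boto3

sq = boto3.client("service-quotas", region_name="us-east-1")

for page in sq.get_paginator("list_service_quotas").paginate(ServiceCode="ec2"):
    for quota in page["Quotas"]:
        if "G and VT instances" in quota["QuotaName"]:
            print(quota["QuotaName"], quota["Value"])
            # Uncomment to ask for 4 vCPUs (one 4-vCPU G-family runner):
            # sq.request_service_quota_increase(
            #     ServiceCode="ec2",
            #     QuotaCode=quota["QuotaCode"],
            #     DesiredValue=4.0,
            # )
```

A quota value of 0 here corresponds to the VcpuLimitExceeded error above.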

@mikemhenry
Contributor

botocore.exceptions.ClientError: An error occurred (VcpuLimitExceeded) when calling the RunInstances operation: You have requested more vCPU capacity than your current vCPU limit of 0 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.

Looks like we didn't end up getting our quota request approved for the P family either?

@ethanholz

Okay @IAlibay, looks like you either need to make me an IAM account so I can request a quota increase for the G family of instances (4 vCPUs is all we need), or you will need to do it yourself. I will see if we can run on the P family.

This is odd; I would double-check the quota amount on the G-instance types. I would advise against using the p2.xlarge, as they are not compatible with the Deep Learning AMI we recommend, so you would need a custom AMI (Eco Infra built one, but the cost-benefit did not seem worthwhile).

@mikemhenry
Contributor

I don't have access to the AWS account, so I am just doing what I can on the guess-and-check side of things in terms of which GPU quota we might have.

And yes, we want to avoid using a custom AMI; in my experience they are easy to set up but hard to keep up to date.

@IAlibay
Member Author

IAlibay commented Aug 7, 2024

@mikemhenry looks like all my quota increases were in us-east-2 not us-east-1, which would explain things.

@mikemhenry
Contributor

Okay, giving this another shot here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10322623363

@ethanholz

You may need to update the AMI since AMIs are region-specific.

@mikemhenry
Contributor

Ah, at least it is a public AMI, so I should be able to fix it even though I don't have access.
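
Since AMI IDs differ between regions but public image names generally do not, the matching AMI in another region can be looked up by name. A minimal boto3 sketch; the name filter is a placeholder, not the actual AMI used in this workflow:

```python
# Sketch: find the equivalent public AMI in a different region by name.
# The name pattern below is a placeholder for whichever Deep Learning AMI
# the workflow actually uses.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
images = ec2.describe_images(
    Owners=["amazon"],
    Filters=[{"Name": "name", "Values": ["Deep Learning Base*Ubuntu*"]}],
)["Images"]

# Take the most recently created match and use its region-local ImageId.
latest = max(images, key=lambda img: img["CreationDate"])
print(latest["ImageId"], latest["Name"])
```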

@mikemhenry
Contributor

Okay, test here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10322863477

@mikemhenry
Contributor

New test here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10324157429/job/28583139693

I didn't have the shell set up correctly.

@mikemhenry
Contributor

@IAlibay Do we want to do the integration tests as well? I was thinking of at least turning them on so we can get some timing data (after the current test finishes).

ethanholz added a commit to omsf-eco-infra/openmm-gpu-test that referenced this pull request on Aug 9, 2024:
"Recommended shell fix (OpenFreeEnergy/openfe#910 (comment)) is a more realistic option."
@mikemhenry
Contributor

Sweet, it ran in 1h 27m 20s, so one minute longer than it took to run on a CPU-only runner, but at least now we are checking our GPU code paths for ~75 cents.

@mikemhenry
Contributor

I am going to run it again but this time do our integration tests as well to see how long they take

@IAlibay
Member Author

IAlibay commented Aug 9, 2024

The integration tests are really the ones that have any kind of GPU testing. In many ways we might want to only run those.

@mikemhenry
Contributor

Yeah, I was thinking about whether it's worth running only the GPU tests or all the tests. We might have to tweak our tests to support that option, and it would save us some money, but if it took an hour to do, how many CI runs would we need to break even?

@mikemhenry
Contributor

Long tests ran in 1h 53m 39s

 ============================= slowest 10 durations =============================
1899.24s call     openfe/tests/protocols/test_openmm_rfe_slow.py::test_run_eg5_sim
916.34s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0--1-False-11-1-4]
842.12s call     openfe/tests/protocols/test_openmmutils.py::TestOFFPartialCharge::test_am1bcc_conformer_nochange
824.47s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0-0-True-14-1-4]
762.52s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_aniline_mapping-0-1-False-11-4-1]
712.34s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[independent]
658.25s call     openfe/tests/protocols/test_openmm_afe_slow.py::test_openmm_run_engine[CPU]
630.85s setup    openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::TestTyk2XmlRegression::test_particles
555.15s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_benzoic_mapping-0-0-True-14-3-1]
537.59s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[repex]
= 861 passed, 37 skipped, 1 xfailed, 2 xpassed, 5557 warnings in 6553.91s (1:49:13) =

It looks like most of the skips were from missing openeye and espaloma; we should probably install those packages so we can run everything.

Thoughts, @IAlibay?

@mikemhenry
Contributor

Added openeye and espaloma to the env, running test here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10390261526

@mikemhenry
Contributor

Typo in channel name; running test here:
https://github.com/OpenFreeEnergy/openfe/actions/runs/10390460059

@mikemhenry
Contributor

Ran into this issue:

  critical libmamba Multiple errors occured:
      Download error (28) Timeout was reached [https://conda.anaconda.org/jaimergp/label/unsupported-cudatoolkit-shim/noarch/repodata.json.zst]
      Operation too slow. Less than 30 bytes/sec transferred the last 60 seconds
      Subdir jaimergp/label/unsupported-cudatoolkit-shim/noarch not loaded!

I removed the channel from our env.yaml; I will check to make sure it doesn't cause a large download in the other jobs.

New test here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10390626657

@mikemhenry
Contributor

Some tests are skipped:

2024-08-14T16:20:29.4649694Z [gw1] [ 88%] SKIPPED openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_dict_roundtrip 
2024-08-14T16:20:29.4660788Z openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_dict_roundtrip_clear_registry 
2024-08-14T16:20:29.4676526Z [gw1] [ 88%] SKIPPED openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_dict_roundtrip_clear_registry 
2024-08-14T16:20:29.4690414Z openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_keyed_dict_roundtrip 
2024-08-14T16:20:29.4705426Z [gw1] [ 88%] SKIPPED openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_keyed_dict_roundtrip 
2024-08-14T16:20:29.4715536Z openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_shallow_dict_roundtrip 
2024-08-14T16:20:29.4730663Z [gw1] [ 88%] SKIPPED openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_shallow_dict_roundtrip 
2024-08-14T16:20:29.4740311Z openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_key_stable 
2024-08-14T16:20:29.4755475Z [gw1] [ 88%] SKIPPED openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_key_stable 
2024-08-14T16:20:29.4765931Z openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_repr 
2024-08-14T16:20:29.4779914Z [gw1] [ 88%] SKIPPED openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_repr 

I will have to look into that, but it took about 2 hours to run:

2024-08-14T17:46:31.3838640Z ============================= slowest 10 durations =============================
2024-08-14T17:46:31.3840434Z 2023.28s call     openfe/tests/protocols/test_openmm_rfe_slow.py::test_run_eg5_sim
2024-08-14T17:46:31.3842885Z 883.92s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_aniline_mapping-0-1-False-11-4-1]
2024-08-14T17:46:31.3845828Z 740.35s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0-0-True-14-1-4]
2024-08-14T17:46:31.3848191Z 696.44s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[sams]
2024-08-14T17:46:31.3850752Z 691.12s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0--1-False-11-1-4]
2024-08-14T17:46:31.3853070Z 681.27s call     openfe/tests/protocols/test_openmm_afe_slow.py::test_openmm_run_engine[CPU]
2024-08-14T17:46:31.3855546Z 651.47s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_benzoic_mapping-0-0-True-14-3-1]
2024-08-14T17:46:31.3858084Z 577.98s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[repex]
2024-08-14T17:46:31.3860668Z 538.88s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzoic_to_benzene_mapping-0-0-True-14-1-3]
2024-08-14T17:46:31.3863590Z 531.53s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_benzoic_mapping-0--1-False-11-3-1]
2024-08-14T17:46:31.3865818Z = 872 passed, 24 skipped, 4 xfailed, 5 xpassed, 5718 warnings in 6914.35s (1:55:14) =

I think now we just need to decide how often we want to schedule this run and it will be good to merge!

@mikemhenry
Contributor

@IAlibay any thoughts on how often we want to run this? It costs us ~$1.06 per run
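
As a rough sanity check on that figure, the numbers are consistent with a 4-vCPU G-family instance billed by the hour; the instance type and hourly rate below are assumptions, not stated anywhere in this thread:

```python
# Back-of-envelope cost per run, assuming a 4-vCPU G-family runner
# (e.g. g4dn.xlarge) at roughly $0.53/hr on-demand. Both the instance type
# and the rate are assumptions for illustration only.
hourly_rate = 0.53           # USD per hour, assumed
unit_run_hours = 87 / 60     # ~1h27m run without integration tests
full_run_hours = 115 / 60    # ~1h55m run including integration tests

print(f"unit tests ~ ${hourly_rate * unit_run_hours:.2f}")  # ~ $0.77
print(f"full run   ~ ${hourly_rate * full_run_hours:.2f}")  # ~ $1.02
```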

@IAlibay
Member Author

IAlibay commented Aug 26, 2024

@IAlibay any thoughts on how often we want to run this? It costs us ~$1.06 per run

I think we need to have a chat / re-organize things - we're running a ton of CPU-only tests on a GPU runner, which seems like a bit of a waste (especially if we're doing this regularly).

We probably want to move GPU-only tests to a specific set of test files and only run those (e.g. the rfe slow tests).

@mikemhenry
Contributor

I disagree. Running only the tests that could take advantage of the GPU would save us some money, but:

  1. It requires some dev time to enable (I don't think there is a way to do this without adding some logic to conftest.py, something like an --only-gpu-tests option; see the sketch below).
  2. Looking at the timing data, the slowest 5 tests take 1 hour and 20 minutes, so trimming the tests down won't save us that much.
  3. One of the slow tests is slow because it uses the CPU, so then we would have our "normal tests", our "slow GPU tests", and our "slow CPU tests", which means we would have ~3 distinct testing modes to collate.
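
A minimal sketch of what that conftest.py logic could look like; the --only-gpu-tests flag and the gpu marker are hypothetical names, not existing openfe options:

```python
# conftest.py sketch: opt-in switch to run only GPU-marked tests.
# Assumes GPU-dependent tests would be tagged with @pytest.mark.gpu;
# neither the flag nor the marker exists in openfe today.
import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--only-gpu-tests",
        action="store_true",
        default=False,
        help="run only tests marked with @pytest.mark.gpu",
    )


def pytest_configure(config):
    config.addinivalue_line("markers", "gpu: test exercises a GPU code path")


def pytest_collection_modifyitems(config, items):
    if not config.getoption("--only-gpu-tests"):
        return
    skip_non_gpu = pytest.mark.skip(reason="running the GPU-only subset")
    for item in items:
        if "gpu" not in item.keywords:
            item.add_marker(skip_non_gpu)
```

The AWS workflow would then call pytest with --only-gpu-tests, while the regular CI keeps running the full suite.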

I will say that a lot of this analysis hinges on there being some benefit to running these slow tests on a GPU. The same test, openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzoic_to_benzene_mapping-0-1-False-11-1-3], for example:

github: 1250s
aws: 531s

runs much faster on AWS, but if that is just because the VM is better and not because it is actually taking advantage of the GPU, then perhaps we can cut down our AWS time by quite a bit.
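
One quick way to check whether the runner is actually using the GPU is to list the OpenMM platforms available in that environment; a small diagnostic sketch, not something wired into these tests:

```python
# Diagnostic: list the OpenMM platforms visible in this environment.
# If CUDA/OpenCL is absent, the "GPU" runner is doing CPU work and a
# cheaper CPU instance would give similar timings.
from openmm import Platform

for i in range(Platform.getNumPlatforms()):
    platform = Platform.getPlatform(i)
    print(platform.getName(), platform.getSpeed())
```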

@IAlibay
Member Author

IAlibay commented Aug 26, 2024

@mikemhenry we should discuss this more, but I don't think the tests you're describing even use the GPU. We would be better served using a c7 instance.

@mikemhenry
Contributor

The plan is to merge this in and then add some proper GPU tests later.

mikemhenry enabled auto-merge (squash) on August 27, 2024, 14:22
mikemhenry merged commit c4951c0 into main on Aug 27, 2024
11 checks passed
mikemhenry deleted the testing-aws-ci branch on August 27, 2024, 14:40