Some AWS CI updates #910
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

@@           Coverage Diff           @@
##             main     #910   +/-   ##
=======================================
  Coverage   92.57%   92.57%
=======================================
  Files         134      134
  Lines        9917     9917
=======================================
  Hits         9181     9181
  Misses        736      736

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
@mikemhenry I'm so confused, are we looking at the same thing? I'm seeing the c4.xlarge jobs at > 5h on your branch (4h+ on tests). Also, fun story: c4.xlarge is more expensive per hour than c6 or c7 🙈
@IAlibay it is this one https://github.com/OpenFreeEnergy/openfe/actions/runs/10143940924/job/28046616317#step:10:2709 which is the one from your PR where you changed to the faster instance -- I just got my wires crossed and copied and pasted what I thought was a c4 run with the CPU affinity set manually. And it looks like as long as this instance isn't 4x the price (lol, it's not), it saves us money to spend more on the instance that takes less time. I think at this point, if you got a GPU quota bump, it is time to test on those, since it looks like we have everything working CPU-wise, and unless we want to use AWS for CPU tests, I don't think it's worth optimizing this.
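A quick back-of-the-envelope sketch of the "pay more per hour, finish sooner" trade-off discussed above; the hourly prices and runtimes below are illustrative assumptions, not actual AWS quotes.

```python
# Back-of-the-envelope cost comparison for a CI run on two instance types.
# The prices and runtimes are placeholders -- substitute the real on-demand
# rates and observed job times for the region in use.

def cost_per_run(hourly_price_usd: float, runtime_hours: float) -> float:
    """Cost of a single CI run: hourly price times wall-clock runtime."""
    return hourly_price_usd * runtime_hours

slow = cost_per_run(hourly_price_usd=0.20, runtime_hours=5.0)   # assumed c4-like
fast = cost_per_run(hourly_price_usd=0.17, runtime_hours=1.5)   # assumed c6/c7-like

print(f"slower instance: ${slow:.2f} per run")
print(f"faster instance: ${fast:.2f} per run")
# The faster instance wins on cost as long as its hourly price is below
# (slow_runtime / fast_runtime) times the slower instance's price.
```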
@mikemhenry just to point out, I think the main win in the other PR is removing coverage, not the instance type (needs testing, but I've seen the same thing locally).
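For context on how much line coverage alone can slow a pure-Python workload, here is a standalone sketch using the coverage.py API; the busy-loop is a stand-in for the test suite and is not the project's actual pytest invocation.

```python
import time

import coverage  # coverage.py, the library behind pytest-cov


def workload() -> int:
    # Stand-in for "run the test suite": pure-Python busywork that gives
    # the coverage tracer plenty of lines to instrument.
    total = 0
    for i in range(500_000):
        total += i % 7
    return total


start = time.perf_counter()
workload()
plain = time.perf_counter() - start

cov = coverage.Coverage()
cov.start()
start = time.perf_counter()
workload()
traced = time.perf_counter() - start
cov.stop()

print(f"without coverage: {plain:.3f} s")
print(f"with coverage:    {traced:.3f} s ({traced / plain:.1f}x)")
```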
testing commit 036e087 here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10221938633

testing commit 4687c52 here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10221976998
This is odd, I would double-check the quota amount on the G-instance types. I would advise against using p2.xlarge, as they are not compatible with the Deep Learning AMI we recommend and thus you would need a custom AMI (Eco Infra built one, but the cost-benefit did not seem worthwhile).
I don't have access to the AWS account, so I am just doing what I can on the guess-and-check side of things in terms of which GPU quota we might have. And yes, we want to avoid using a custom AMI; in my experience they are easy to set up but hard to keep up to date.
@mikemhenry looks like all my quota increases were in
Okay, giving this another shot here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10322623363
You may need to update the AMI since they are region-specific.

Ah, at least it is a public AMI, so I should be able to fix it even though I don't have access.
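Since AMI IDs are region-specific, one way to fix the workflow without console access is to look the public image up by name in the target region with boto3; the name pattern below is a placeholder for whichever Deep Learning / GPU driver AMI the runner actually uses.

```python
# Look up the region-specific ID of a public AMI by name. Requires AWS
# credentials with ec2:DescribeImages; the name pattern is a placeholder.
import boto3


def latest_ami_id(region: str, name_pattern: str) -> str:
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_images(
        Owners=["amazon"],
        Filters=[{"Name": "name", "Values": [name_pattern]}],
    )
    # CreationDate is ISO 8601, so lexicographic max gives the newest image.
    newest = max(resp["Images"], key=lambda img: img["CreationDate"])
    return newest["ImageId"]


if __name__ == "__main__":
    # Hypothetical usage: find the equivalent image in the runner's region.
    print(latest_ami_id("us-west-2", "Deep Learning Base*GPU*"))
```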
New test here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10324157429/job/28583139693 (didn't have the shell set up correctly).
@IAlibay Do we want to do the integration tests as well? I was thinking of at least turning them on so we can get some timing data (after the current test finishes).
Sweet, it ran in 1h 27m 20s, so one minute longer than it took on a CPU-only runner, but at least now we are checking our GPU code paths for ~75 cents.
I am going to run it again, but this time do our integration tests as well to see how long they take.

Longer tests running here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10326372977
The integration tests are really the ones that have any kind of GPU testing. In many ways we might want to only run those.
Yeah, I was thinking about whether it is worth running only the GPU tests or all of them. We might have to tweak our tests to support that option, and it will save us some money, but if it took an hour to do, how many CI runs do we need to break even?
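To make the break-even question concrete, a tiny sketch with made-up numbers (developer time, hourly rate, and per-run savings are all assumptions):

```python
# How many CI runs does a one-off optimization need to pay for itself?
# Every number here is an illustrative assumption, not a measurement.
dev_hours_to_split_tests = 1.0   # time spent separating GPU-only tests
dev_hourly_rate_usd = 75.0       # assumed fully-loaded developer cost
savings_per_run_usd = 0.50       # assumed saving from the shorter GPU job

runs_to_break_even = dev_hours_to_split_tests * dev_hourly_rate_usd / savings_per_run_usd
print(f"~{runs_to_break_even:.0f} CI runs to recoup the effort")
# At one scheduled GPU run per week, that is on the order of years.
```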
Long tests ran in 1h 53m 39s
It looks like most of the skips were from missing dependencies. Thoughts, @IAlibay?
Added openeye and espaloma to the env, running test here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10390261526
Typo in channel name, running test here:

Ran into this issue:

I removed the channel from our env.yaml; I will check to make sure it doesn't cause a large download in the other jobs. New test here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10390626657

Some tests are skipped:

Will have to look into that, but it took like 2 hours to run.
I think now we just need to decide how often we want to schedule this run and it will be good to merge!
@IAlibay any thoughts on how often we want to run this? It costs us ~$1.06 per run.
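Rough monthly cost at a few candidate schedules, using the ~$1.06/run figure above (run counts are approximate):

```python
cost_per_run = 1.06  # from the per-run estimate quoted above
for label, runs_per_month in [("nightly", 30), ("twice weekly", 8), ("weekly", 4)]:
    print(f"{label:>12}: ~${cost_per_run * runs_per_month:.2f}/month")
```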
I think we need to have a chat / re-organize things - we're running a ton of CPU-only tests on a GPU runner, which seems like a bit of a waste (especially if we're doing this regularly). We probably want to move GPU-only tests to a specific set of test files and only run those (e.g. the rfe slow tests).
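One possible way to split things, sketched below: gate GPU-exercising tests behind a custom pytest marker so the AWS job can run `pytest -m gpu` while the regular CI runners use `pytest -m "not gpu"`. The marker name and the CUDA check are assumptions, not an existing convention in this repository.

```python
# conftest.py -- sketch of gating GPU tests behind a "gpu" marker.
import pytest


def pytest_configure(config):
    config.addinivalue_line("markers", "gpu: tests that exercise GPU code paths")


def pytest_collection_modifyitems(config, items):
    # Skip gpu-marked tests when no CUDA-capable OpenMM platform is present,
    # so the marker is safe to use on CPU-only runners too.
    try:
        from openmm import Platform
        has_cuda = any(
            Platform.getPlatform(i).getName() == "CUDA"
            for i in range(Platform.getNumPlatforms())
        )
    except ImportError:
        has_cuda = False
    if has_cuda:
        return
    skip_gpu = pytest.mark.skip(reason="no CUDA platform available")
    for item in items:
        if "gpu" in item.keywords:
            item.add_marker(skip_gpu)
```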
I disagree; running only tests that could take advantage of the GPU will save us some money, but:
I will say a lot of this analysis hinges on there being some benefit to running these slow tests on a GPU. The same test runs much faster on AWS, but if that is just because the VM is better and not because it is actually taking advantage of the GPU, then perhaps we can cut down our AWS time by quite a bit.
@mikemhenry we should discuss this more, but I don't think the tests you're describing even use the GPU. We would be better served using a c7 instance.
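A quick way to check whether the slow tests are actually landing on the GPU, rather than just benefiting from a beefier VM, is to ask OpenMM which platform is in play; a minimal sketch (the query on a live simulation is noted in the comment):

```python
import openmm

# List every platform OpenMM can see on this machine; on the GPU runner we
# expect "CUDA" (or "OpenCL") alongside "Reference" and "CPU".
names = [
    openmm.Platform.getPlatform(i).getName()
    for i in range(openmm.Platform.getNumPlatforms())
]
print("available platforms:", names)

# For a running simulation, the platform actually in use can be read from the
# context, e.g. simulation.context.getPlatform().getName(); if that reports
# "CPU" on the AWS runner, the speedup is the VM, not the GPU. Watching
# `nvidia-smi` during the job is another easy cross-check.
```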
Plan is to merge this in and then add some proper GPU tests later.
Checklist
- news entry
- Developers certificate of origin