Sync meeting 2023-08-08
- Monthly, every 2nd Tuesday of the month at 10:00 CE(S)T
- Notes of previous meetings at https://github.com/multixscale/meetings/wiki
attending: ...
- overview of MultiXscale planning
  - quarterly report 2023Q2
    - ...
- WP status updates
- [SURF] WP1 Developing a Central Platform for Scientific Software on Emerging Exascale Technologies
- [UGent] T1.1 Stable (EESSI) - due M12+M24
- new Stratum-0 hardware at RUG
- still work-in-progress
- will hopefully be in place soon so we can switch to eessi.io domain
- performance issues with Stratum-1 replica server @ RUG to be figured out (issue #151)
- we don't actually need to wait until the new Stratum-0 is in place to start building the compat layer for eessi.io
- EasyBuild v4.8.0 was released on 7 July 2023
- incl. foss/2023a toolchain, software updates + new software, some bug fixes related to EESSI
- Building EESSI 2023.06 software layer
- incl. GROMACS and QuantumESPRESSO with foss/2021a
- all software installations are built and deployed by the bot (cfr. T5.3), working well!
- contributors who have access to Slurm cluster where bot is running can access build logs for failing builds in shared directory (which are automatically copied there by the bot)
- several installations blocked by failing test suite on (only) aarch64/neoverse_v1 (Graviton3)
  - for FFTW in foss/2022a, for OpenBLAS in foss/2022b, for numpy in SciPy-bundle v2021.10 with foss/2021b
  - no problem on aarch64/neoverse_n1 (Graviton2)
  - we need to come up with a workflow to deal with situations like this
    - initial proposal here
- draft contribution policy for adding software to EESSI
- should only be used under certain conditions, for example less than 1% of tests failing
- see PR #108
- needs some tweaking based on provided feedback - more feedback welcome! (Caspar?, Eli?)
- should maybe be extended with workflow for dealing with failing tests
- for ESPResSo, we need a CPU-only easyconfig (see sketch below)
- cfr. https://github.com/easybuilders/easybuild-easyconfigs/pull/17709/
- UCX-CUDA dep is missing in ESPResSo easyconfig with CUDA
- would be good to have this in place by the ESPResSo summer school (Oct'23) - https://espressomd.org/wordpress/community-and-support/espresso-summer-school/
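- a CPU-only variant could presumably be derived from the existing CUDA easyconfig by dropping the CUDA dependency and versionsuffix; a minimal sketch (untested — dependency versions and the exact CMake flag are assumptions that would need verifying):

```python
# Hypothetical CPU-only ESPResSo easyconfig, sketched by adapting the existing
# ESPResSo/4.2.1-foss-2022a-CUDA-11.7.0 easyconfig: drop the CUDA dependency
# and versionsuffix, and disable CUDA at configure time.
# Untested; dependency versions below are assumptions based on foss/2022a.
easyblock = 'CMakeMake'

name = 'ESPResSo'
version = '4.2.1'

homepage = 'https://espressomd.org/'
description = "Extensible Simulation Package for Research on Soft Matter"

toolchain = {'name': 'foss', 'version': '2022a'}

source_urls = ['https://github.com/espressomd/espresso/releases/download/%(version)s/']
sources = ['espresso-%(version)s.tar.gz']

builddependencies = [('CMake', '3.23.1')]

dependencies = [
    ('Python', '3.10.4'),
    ('Boost.MPI', '1.79.0'),
]

# explicitly disable CUDA support (flag name as used by ESPResSo 4.2.x CMake)
configopts = '-DWITH_CUDA=off'

moduleclass = 'chem'
```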
- GPU support
- https://github.com/multixscale/planning/issues/1
- dedicated meeting to figure out steps to take, who does what
- ship CUDA compat libraries: where (compat layer?), structure, etc.
- changes to script to launch build container to allow access to GPU
- etc.
- [RUG] T1.2 Extending support (starts M9, due M30)
- [SURF] T1.3 Test suite - due M12+M24
- TensorFlow test merged (PR #38)
- Support for skipping/running tests based on the GPU vendor (PR #60)
- E.g. for applications that only support NVIDIA GPUs
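- a minimal sketch of the idea (the actual mechanism in PR #60 may differ; the 'nvidia_gpu' partition feature name and the application are hypothetical placeholders):

```python
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class CudaOnlyAppTest(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'my_cuda_app'  # hypothetical NVIDIA-only application

    @run_after('setup')
    def skip_without_nvidia_gpu(self):
        # 'nvidia_gpu' is a hypothetical feature name set in the site config;
        # skip this test on partitions that do not advertise it
        self.skip_if('nvidia_gpu' not in self.current_partition.features,
                     'this test requires an NVIDIA GPU')

    @sanity_function
    def check_output(self):
        return sn.assert_found(r'done', self.stdout)  # hypothetical marker
```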
- Various ReFrame configs:
  - Snellius: merged (PR #66)
  - Vega: merged (PR #62); update merged (PR #76)
  - AWS: ready for review, not yet merged (PR #53)
- CPU autodetect issue resolved on AWS
- Scripts for daily runs of the test suite on Vega: PR #70 and PR #71 merged, PR #77 not yet merged
- Investigated hyperthreading
- E.g. on Vega: 256 hardware threads, 128 physical cores. The idea was to start 128 threads/processes in total instead of 256, and bind them to the 128 physical cores
- Difficult to achieve, and performance was actually higher with 256 hardware threads
- Also note: for testing, it is not a problem if performance is not "the best it can be" on a system, as long as it is consistent
- Parking this issue for now (#74); might revisit if other applications really should not use hyperthreading (see sketch below for the kind of setup that was tried)
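- a sketch of asking the scheduler to avoid SMT from a ReFrame test (assumption: ReFrame's use_multithreading flag, which the Slurm backend should translate to --hint=nomultithread; task counts are illustrative):

```python
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class NoSmtTest(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'hostname'  # placeholder workload
    num_tasks = 128          # e.g. one task per physical core on a Vega node
    num_tasks_per_node = 128
    # request physical cores only; with the Slurm backend this should be
    # translated to '--hint=nomultithread'
    use_multithreading = False

    @sanity_function
    def ran_ok(self):
        return sn.assert_found(r'\S+', self.stdout)
```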
- OSU benchmarks PR #54, WIP
- Meeting on ESPResSo test with Jean-Noël (meeting notes)
- Got a few Python scripts that can serve as a test ('p3m', 'lb' and 'lj' cases). Should scale to at least a few nodes. To-dos:
  - ESPResSo for a newer toolchain: ESPResSo/4.2.1-foss-2022a-CUDA-11.7.0 (done, but still needs a PR to EasyBuild)
  - make a CPU-only version
  - install it (in EESSI, or otherwise locally), to use when developing ReFrame tests
  - create ReFrame tests out of these (see sketch below)
- Alan: would be good to run tests with injected libraries to improve performance, like we did with AWS
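- a possible starting point for those ReFrame tests, as a sketch (module name, script invocation, task counts and sanity pattern are hypothetical placeholders):

```python
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class EspressoP3MTest(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    modules = ['ESPResSo/4.2.1-foss-2022a']  # hypothetical CPU-only module
    executable = 'python'
    executable_opts = ['p3m.py']   # one of the provided scripts ('p3m', 'lb', 'lj')
    num_nodes = parameter([1, 2])  # hypothetical scaling knob: a few nodes

    @run_after('init')
    def set_task_count(self):
        # assumption: 128 tasks per node; real values should come from the partition
        self.num_tasks_per_node = 128
        self.num_tasks = self.num_nodes * self.num_tasks_per_node

    @sanity_function
    def check_finished(self):
        # hypothetical completion marker printed by the script
        return sn.assert_found(r'simulation finished', self.stdout)
```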
- performance monitoring with ReFrame
- new ReFrame features to specify reference numbers for tests in configuration file?
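- for context, reference values currently live in each test; a sketch of the status quo that a config-file-based feature would replace (partition name, output format and numbers are made up for illustration):

```python
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class EspressoPerfTest(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'python'
    executable_opts = ['lj.py']  # hypothetical script with performance output

    # per-system references: (value, lower bound, upper bound, unit);
    # 'vega:cpu' and the numbers below are made up for illustration
    reference = {
        'vega:cpu': {'timesteps_per_s': (1000.0, -0.10, None, 'timesteps/s')},
        '*': {'timesteps_per_s': (None, None, None, 'timesteps/s')},
    }

    @performance_function('timesteps/s')
    def timesteps_per_s(self):
        # hypothetical output line: 'performance: 1234.5 timesteps/s'
        return sn.extractsingle(r'performance:\s+(\S+)', self.stdout, 1, float)

    @sanity_function
    def check_ran(self):
        return sn.assert_found(r'performance:', self.stdout)
```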
- [BSC] T1.4 RISC-V (starts M13)
- [SURF] T1.5 Consolidation (starts M25)
- [UGent] WP5 Building, Supporting and Maintaining a Central Shared Stack of Optimized Scientific Software Installations
- [UGent] T5.1 Support portal - due M12
- In progress
- the support portal @ https://gitlab.com/eessi/support
- templates updated
- labels updated - see https://gitlab.com/eessi/support/-/labels + https://gitlab.com/eessi/support/-/issues/3
- wiki updated - see https://gitlab.com/eessi/support/-/wikis/home
- if you find that something is missing, please open an issue
- READY to deploy?
- working on adding a support page to EESSI documentation
- there is something ready to submit as a PR
- working on deliverable D5.2 (due M12)
- adding my draft to the Overleaf template
- still very much work in progress
- TODO
- Define initial level of support
- set up periodic rotation for 1st-line support
- [SURF] T5.2 Monitoring/testing (starts M9)
- [UiB] T5.3 community contributions (bot) - due M12
- working towards release v0.1 (https://github.com/multixscale/planning/issues/42)
- all major features / parts planned for v0.1 have been implemented
- doing one code polishing pass
- targeting mid/end of August for the release
- contribution policy isn't really part of the v0.1 bot release
- bot being used to build software stack for 2023.06
- bot has been working as intended for building/deploying software for EESSI
- next release: add use of the test suite (details of how and when tests are run still to be defined)
- refactoring of the bot code
- infrastructure for running the bot
- maintaining Slurm cluster in AWS (set up with Cluster-in-the-Cloud) is becoming a bit of a pain
- we should set up new Magic Castle Slurm clusters in AWS + Azure to replace it
- may need to set up separate MC instances for x86_64 and aarch64
- [UGent] T5.4 support/maintenance (starts M13)
- [UB] WP6 Community outreach, education, and training
- First pass of "Elevator Pitch" created, another revision under way
- High-level overview of MultiXscale goals, sell it to the user community to get them interested
- HPCNow! is working on a revision
- Ambassador program to be outlined at NCC/CoE Coffee Break, 7th Sept.
- Some NCCs do seem to be interested in the concept
- mostly useful for EESSI (which is generic), more difficult for scientific WPs (due to required domain expertise)
- Magic Castle tutorial at SC'23 accepted
- EESSI will get a mention as it is one of the available stacks in MC
- Second meeting with CVMFS developers regarding the Q4 workshop
- Have offered to do a "Code of the Month" session with CASTIEL2
- Should this wait until the switch to eessi.io is done?
- [HPCNow] WP7 Dissemination, Exploitation & Communication
- MultiXscale poster sent to European corner at the "European Researchers' night" in Barcelona (https://lanitdelarecerca.cat/european-corner/)
- Planned an EESSI talk for October 9th at BSC (WHPC "Southern Europe" chapter annual meeting)
- Supercomputing'23
- Magic Castle tutorial (Alan)
- booth talks on EESSI?
- first MultiXscale newsletter was sent out
- https://www.multixscale.eu/wp-content/uploads/2023/07/Newsletter-Multixscale-Issue-1-2023.pdf
- maybe PDF is not the best format for a newsletter?
- Other updates
- Should we start considering a new EESSI tutorial again, incl. adding software to EESSI?
- Start fixing dates for next MultiXscale GA meeting
- right after EuroHPC Summit in Belgium (Mon-Thu 18-21 March 2024)
- Thu+Fri 21+22 March 2024 in Ghent
- check for opportunities to present at the EuroHPC Summit
- Notes of previous sync meetings:
  - https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-07-11
  - https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-06-13
  - https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-05-09
  - https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-04-11
  - https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-03-14
  - https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-02-14
  - https://github.com/multixscale/meetings/wiki/sync-meeting-2023-01-10
TO COPY-PASTE
- overview of MultiXscale planning
- WP status updates
- [SURF] WP1 Developing a Central Platform for Scientific Software on Emerging Exascale Technologies
- [UGent] T1.1 Stable (EESSI) - due M12+M24
- ...
- [RUG] T1.2 Extending support (starts M9, due M30)
- [SURF] T1.3 Test suite - due M12+M24
- ...
- [BSC] T1.4 RISC-V (starts M13)
- [SURF] T1.5 Consolidation (starts M25)
- [UGent] WP5 Building, Supporting and Maintaining a Central Shared Stack of Optimized Scientific Software Installations
- [UGent] T5.1 Support portal - due M12
- ...
- [SURF] T5.2 Monitoring/testing (starts M9)
- [UiB] T5.3 community contributions (bot) - due M12
- ...
- [UGent] T5.4 support/maintenance (starts M13)
- [UB] WP6 Community outreach, education, and training
- ...
- [HPCNow] WP7 Dissemination, Exploitation & Communication
- ...