Sync meeting 2024 11 12
- Monthly, every 2nd Tuesday of the month at 10:00 CE(S)T
- Notes of previous meetings at https://github.com/multixscale/meetings/wiki
- Next meeting: Tue 10 Dec 2024 10:00 CET (post-mortem of special project review?)
attending:
- Caspar, Casper, Maxim (SURF)
- Kenneth (HPC-UGent)
- Alan (CECAM, UB)
- Neja (NIC)
- Bob (RUG)
- Eli, Helena, Susana (HPCNow!)
- Thomas, Richard (UiB)
- Julián (BSC)
- Special Project Review
- First rehearsal done: Tue 5 November 2024 (9-13)
- Second rehearsal planned: Mon 18 November 2024 (9-13)
- Agenda and presentations at https://ubarcelona-my.sharepoint.com/personal/a_ocais_ub_edu/_layouts/15/onedrive.aspx?ga=1&id=%2Fpersonal%2Fa%5Focais%5Fub%5Fedu%2FDocuments%2FMultiXscale%2FSpecial%20Review%20meeting
- input needed?
- Eli: need to know from scientific side which events are relevant for industry?
- scientific meeting on Thu 14 Nov'24
- demo of dev.eessi.io with ESPResSo by Caspar is prepared
- Upcoming deliverables (M24):
- we should schedule an initial sync meeting to kickstart the effort on these deliverables
- for D1.3: UB lead - Kenneth will take the lead in practice
- for D6.2: UiB lead - Thomas
- for D7.2: HPCNow lead - Eli
- for D8.5: NIC lead - Neja
- D1.3, M24: Report on stable, shared software stack (UB)
- Who: Kenneth (UGent), Caspar (SURF), Pedro (RUG), UiB (Richard/Thomas), UB (Alan)
- Get started right after review? @Alan will you take initiative and divide work (if needed)?
- TODO:
- GPU support -> see tiger team
- monitoring -> see ongoing effort by RUG
- compare the performance against on-premise software stacks to identify potential performance limitations
- mostly on Snellius @ SURF, since that's easier
- Who? When? EESSI test suite?
- stability -> zero reports so far of EESSI "network" being down
- D6.2, M24: Training Activity Technical Support Infrastructure (UiB)
- Lhumos portal
- Magic Castle clusters for training, with EESSI
- HPC Carpentry (?)
- Ambassador program (first event was in October)
- => @Thomas/Richard to pull this
- D7.2, M24: Intermediate report on Dissemination, Communication and Exploitation (HPCNow)
- D8.5, M24: Project Data Management Plan - final (NIC)
- Waiting for review, comments on D8.2
- Upcoming milestones (M24):
- update required on Milestone 2, request by review
- shared software stack w/ 1 accel supported + list of dependencies defined in WP2-4
- Milestone 4, M21: First training event supported by the Ambassador Program. [WP1/WP5/WP6] (UB)
- Oct 4th training in Vienna: https://events.vsc.ac.at/event/141
- 2nd Ambassador event: MultiXscale hackathon in Slovenia (Dec'24)
- Milestone 5, M24: WP4 Pre-exascale workload executed on EuroHPC architecture. [WP2/WP3/WP4] (NIC)
- was done through Tilen, see post on MultiXscale website
- WP status updates
- [SURF] WP1 Developing a Central Platform for Scientific Software on Emerging Exascale Technologies
- [UGent] T1.1 Stable (EESSI) - D1.3 due M24 (Dec'24)
- dev.eessi.io: Tiger team is making very good progress. See meeting notes.
- Caspar: will show Poiseuille demo, based on ESPResSo/4.2.2-foss-2023a-2ba17de6096933275abec0550981d9122e4e5f28 (i.e. test PR #1); a quick availability check for the dev repository is sketched below
- Done:
- (Check TODOs from last month)
- get ingestions for dev.eessi.io working => done
- documentation update => open PR, waiting for Kenneth to review/merge
- dev build of GROMACS was built & deployed => done
- Key TODOs:
- WIP by Pedro: install in project-specific prefixes => should be working now
- TODO: Make sure accelerator builds end up in the right prefix (WIP by Pedro)
- Once docs are merged, get scientific software developer to experiment with it => Jean-Noël?
- What do we REALLY need from this before the project review?
- Nothing, everything we need is there
- docs PR on dev.eessi.io is merged
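As a quick check ahead of the demo mentioned above, a minimal sketch (not part of the tiger team's tooling) that verifies the dev.eessi.io repository is mounted on a client node and contains ESPResSo module files. It only uses the Python standard library; the modules/all/&lt;name&gt;/&lt;version&gt;.lua layout is assumed from the production software.eessi.io repository and may differ for project-specific prefixes in dev.eessi.io.

```python
#!/usr/bin/env python3
# Minimal sanity check (illustration only): is dev.eessi.io mounted, and
# does it contain any ESPResSo module files?
from pathlib import Path

REPO = Path("/cvmfs/dev.eessi.io")

def main() -> None:
    if not REPO.is_dir():
        raise SystemExit(f"{REPO} not found: is the CernVM-FS client configured for dev.eessi.io?")
    # Recursive glob over CernVM-FS can be slow; acceptable for a one-off check.
    # The modules/all/<name>/<version>.lua layout is assumed from software.eessi.io.
    hits = sorted(REPO.glob("**/modules/all/ESPResSo/*.lua"))
    if hits:
        print("ESPResSo module files found:")
        for path in hits:
            print(f"  {path}")
    else:
        print(f"no ESPResSo module files found under {REPO}")

if __name__ == "__main__":
    main()
```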
- NVIDIA GPU support: Tiger team making really good progress. See meeting notes
- Key results:
- cuDNN ingested, properly stripped to redistributable part
- updated local installation in host_injections to include cuDNN
- enhance script(s) in software-layer repo:
- auto-detect GPU model/architecture (enhance archdetect) => DONE; the idea is illustrated by the sketch after this GPU support section
- pick up accel directive from the bot and change software installation prefix accordingly => DONE
- (Check TODOs from last month)
- TODO:
- [WIP] updating SitePackage.lua for proper GPU support (see PR #798) => ready for review
- Actual GPU nodes in build cluster (now cross-compiling, not running GPU tests) or at least on partner clusters
- service account in place for EESSI at HPC-UGent + Deucalion
- WIP @ Snellius
- maybe also to explore at Vega?
- Adapt bot to accept arguments to allocate/build on GPU nodes
- Decide on and expand combinations of CPU & GPU architectures
- will be determined by where we can get service accounts for EESSI?
- should definitely cover EuroHPC systems
- maybe also look into generic CPU + GPU?
- Re-install GPU software in proper location: ESPResSo (?), LAMMPS, MetalWalls (?), TensorFlow, PyTorch, ...
- => DONE for ESPResSo, LAMMPS
- WIP for TensorFlow, PyTorch
- proper NVIDIA GPU support is due by M24 (deliverable D1.3). What do we really want in there?
- We are already there: all of the software from MultiXscale that has GPU support is in software.eessi.io
- Nice to have: an extra AMD Zen4 + CC90 (Hopper) target - can be built on Snellius
- Thomas/Richard are experimenting with builds for NVIDIA Grace on an experimental platform in Snellius
- would be nice to also work together with JSC/JUPITER on this
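The archdetect enhancement listed under the key results above maps the detected GPU to an accelerator subdirectory; the real implementation is a shell script in the software-layer repo. Purely as an illustration of the idea, a minimal Python sketch, assuming nvidia-smi is available, that the driver supports the compute_cap query field, and that the accel/nvidia/cc&lt;NN&gt; naming scheme applies:

```python
#!/usr/bin/env python3
# Illustration only: query the NVIDIA driver for GPU name and compute
# capability, and map the latter to an accelerator subdirectory such as
# 'accel/nvidia/cc90'. Assumes a recent driver that supports the
# 'compute_cap' query field; the real archdetect logic lives in the
# EESSI software-layer repository.
import subprocess

def detect_nvidia_accel() -> list[str]:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    targets = []
    for line in out.strip().splitlines():
        name, cc = (field.strip() for field in line.split(","))
        # compute capability '9.0' -> subdirectory 'accel/nvidia/cc90'
        target = f"accel/nvidia/cc{cc.replace('.', '')}"
        targets.append(target)
        print(f"{name}: {target}")
    return targets

if __name__ == "__main__":
    detect_nvidia_accel()
```

On a Hopper GPU (compute capability 9.0) this would print something like accel/nvidia/cc90, matching the CC90 target mentioned above.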
- we need to plan who will actively contribute, and how [Kenneth,Lara]
- need to review description of Task 1.1, make sure all subtasks are covered
- => need to update project planning (Caspar, Kenneth)
- "we will benchmark software from the shared software stack and compare the performance against on-premise software stacks to identify potential performance limitations, ..."
- ESPResSo + LAMMPS + OpenFOAM + ALL(?) (MultiXscale), GROMACS (BioExcel)
- Who does what, and on which system?
- Satish can do this, has to be done before M24
- "increase stability of the shared software stack ... pro-actively by developing monitoring tools"
- What do we still need before WE consider this box checked?
- Basic functionality is there: if a Stratum 1 dies, it gets reported in a Slack channel (the basic check is sketched below)
- TODO: add e-mails, but not needed for M24
- Others still need to be able to access the monitoring server
- Strictly speaking, everything needed for the deliverable is there
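The basic check referenced above ("if a Stratum 1 dies, it gets reported in a Slack channel") can be summarized with a small sketch; the hostname and webhook URL below are placeholders, and the actual monitoring setup (dashboard + database, see WP5/T5.2) does considerably more than this.

```python
#!/usr/bin/env python3
# Sketch of the basic Stratum 1 availability check: fetch the repository
# manifest (.cvmfspublished) from each Stratum 1 and post to a Slack
# incoming webhook if the request fails. Hostname and webhook URL are
# placeholders, not the project's actual configuration.
import json
import urllib.error
import urllib.request

STRATUM1_URLS = [
    # every CernVM-FS server serves /cvmfs/<repo>/.cvmfspublished
    "http://stratum1.example.org/cvmfs/software.eessi.io/.cvmfspublished",
]
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def alert(message: str) -> None:
    """Post a message to the Slack channel via an incoming webhook."""
    payload = json.dumps({"text": message}).encode()
    req = urllib.request.Request(
        SLACK_WEBHOOK, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req, timeout=10)

def main() -> None:
    for url in STRATUM1_URLS:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                resp.read()
        except (urllib.error.URLError, OSError) as exc:
            alert(f"EESSI Stratum 1 check failed for {url}: {exc}")

if __name__ == "__main__":
    main()
```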
- [RUG] T1.2 Extending support - D1.4 due M30 (June'25)
- zen4 is almost on par with the rest
- Need to symlink everything for 2022b from zen3 (sketched below). See https://gitlab.com/eessi/support/-/issues/37#note_2159031831 (@Lara?)
- Then merge https://github.com/EESSI/software-layer/pull/766
- NVIDIA Grace - @Thomas: any update?
- AMD ROCm (see planning issue #31 + support issue #71)
- effort led by Pedro/Bob (RUG)
- Any progress to mention?
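The zen4 symlinking mentioned above (pointing missing 2022b installations at their zen3 counterparts) could look roughly like the sketch below; it is only meant to make the idea concrete. The /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/... layout is taken from the production repository, but the actual work happens in the software-layer/ingestion workflow (see the linked support issue and PR), not in a client-side script like this.

```python
#!/usr/bin/env python3
# Illustration of the zen3 -> zen4 symlinking idea: find 2022b-based
# installations that exist for zen3 but not for zen4, and plan symlinks
# pointing the zen4 paths at the zen3 ones. The prefix below follows the
# production EESSI layout; write access (e.g. on the Stratum 0) would be
# needed to actually create the links.
from pathlib import Path

AMD_PREFIX = Path("/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd")
ZEN3 = AMD_PREFIX / "zen3" / "software"
ZEN4 = AMD_PREFIX / "zen4" / "software"

def plan_symlinks(toolchain: str = "2022b") -> list[tuple[Path, Path]]:
    """Return (link, target) pairs for installations present in zen3 but missing in zen4."""
    plan = []
    for target in sorted(ZEN3.glob(f"*/*{toolchain}*")):
        link = ZEN4 / target.relative_to(ZEN3)
        if not link.exists():
            plan.append((link, target))
    return plan

if __name__ == "__main__":
    for link, target in plan_symlinks():
        print(f"{link} -> {target}")
        # To actually create the link (requires write access):
        # link.parent.mkdir(parents=True, exist_ok=True)
        # link.symlink_to(target)
```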
- [SURF] T1.3 Test suite - D1.5 due M30 (June'25)
- MetalWalls accepted
- eessi_mixin class finished and merged
- TODO: docs & port existing tests
- [BSC] T1.4 RISC-V (due M48, D1.6)
- Julián is working on getting CernVM-FS deployed natively on the RISC-V hardware they have at BSC => deployed on 3 of the clusters (one runs Ubuntu, another Fedora); WIP: pioneer cluster
- Formal invitation to present at HiPEAC
- Shared account on cluster, can be used for bot builds
- [SURF] T1.5 Consolidation (starts M25 - Jan'25)
- continuation of effort on EESSI (T1.1, etc.) (not started yet)
- [UGent] WP5 Building, Supporting and Maintaining a Central Shared Stack of Optimized Scientific Software Installations
- [SURF] T5.2 Monitoring/testing, D5.3 due M30 (June'25)
- Plan to separate dashboard & database into two separate VMs (security)
- TODO Caspar: contact Vega & Karolina about making test data public. They agreed verbally on EuroHPC User Day 2024
- Short term: Get this running somewhere, e.g. with reverse tunnel, so that we don't have to worry about VM security (but can still open up to our project members at least)
- Richard: can we containerize the dashboard + database, so that a user could deploy this locally?
- Maksim: yes, work in progress; should be a plug-and-play solution once done
- [UGent] T5.4 support/maintenance - D5.4 due M48 (Dec'26)
- ...
- [UB] WP6 Community outreach, education, and training
- deliverables due: D6.2 (M24 - Dec'24), D6.3 (M30 - June'25)
- HPCWire nomination in category "Best HPC Programming Tool or Technology"
- Winner will be made public Mon 18 Nov 18:00 EST (i.e. 19 Nov 00:00 CET) @ SC'24
- Ambassador event (4 Oct 2024)
- recording available on YouTube: https://www.youtube.com/watch?v=Xf6KqSCIAqw
- CECAM webinar (17 Oct 2024)
- https://www.cecam.org/webinar-details/multixscale-cecam-webinar-supporting-development-multiscale-methods-european-environment-scientific-software-installations-eessi
- Only scientific partners
- Playlist on Lhumos portal: https://alpha.lhumos.org/Player/3/5/67165adbe4b08465348653da/6716609de4b0846534865590
- playlist on YouTube: https://www.youtube.com/playlist?list=PLeIlwLynDBkXQ28ALXeSmhwFlh9KTvLsU
- EuroHPC User Days (22-23 Oct 2024, Amsterdam)
- link to agenda
- attending: Kenneth/Lara (UGent), Thomas/Richard (UiB), Bob?/Pedro? (RUG), Caspar (SURF)
- Very successful, EESSI mentioned in many different talks
- Networking with vendors and other CoEs
- Interest from MareNostrum 5 to make EESSI available (offline batch nodes)
- Kenneth sent an e-mail, no response yet
- Successful talk by Kenneth, pretty full room, several questions
- Raspberry Pi giveaway: 5 correct answers, picked one at random
- see also https://www.eessi.io/docs/blog/2024/10/25/eurohpc_user_day_2024
- Netherlands eScience Center (Dutch national center of expertise for research software, ~60 RSEs) got in touch with Bob to give a talk (31 Oct'24, Amsterdam)
- Presented virtually; not a ton of people (8-10), clash with another special interest group
- Audience was internal people from the Netherlands eScience Center
- [Eli/HPCNow!] EESSI Birds-of-a-Feather session accepted at Supercomputing'24 (Atlanta, US)
- can reuse material from BoF session @ ISC'24 in Hamburg
- Do you need anything from anyone here?
- Kenneth/Lara will come as well.
- also Raspberry Pi's as prizes, sponsored by HPCNow!
- one raffle for people joining EESSI Slack
- another raffle for people passing by DoItNow booth for an EESSI demo
- [Pedro] submitted talk for SURF Advanced Computing Days (12 Dec'24, Utrecht)
- Yes
- [Alan] EESSI tutorial at HiPEAC 2025 accepted (20-22 Jan'25)
- we need to start promoting this
- [Lara] Also at HiPEAC: another workshop (about CoEs), where Lara will present
- [HPCNow] WP7 Dissemination, Exploitation & Communication
- podcast interview for EuroHPC podcast
- Sound recording tutorial attended by a few of us
- Plan is to still record, but will be after project review (Dec/Jan)
- Kenneth in touch with NCC Belgium
- Will be added to slides for review meeting that we are working on this
- T7.1 Scientific applications provisioned on demand (lead: HPCNow) (started M13, finished M48)
- PR to AWS for using EESSI with ParallelCluster => merged 7 Nov 2024
- Also WIP for the 'paid layer' on top of ParallelCluster
- Task 7.2 - Dissemination and communication activities (lead: NIC)
- Susana is actively trying to get more followers on Facebook
- Updates ... ?
- Task 7.3 - Sustainability (lead: NIC, started M18, due M42)
- Discussion with CASTIEL2, not really working actively on this task yet
- Task 7.4 - Industry-oriented training activities (lead: HPCNow)
- Updates ... ? => Will be discussed a bit at review in Luxembourg
- [NIC] WP8 (Management and Coordination)
- Mail from project officer for 2nd review meeting
- probably end-of-August 2025
- must be in EU country, doesn't have to be Luxembourg
- Amendment (@Neja / @Alan, can you summarize the key points of what was submitted?)
- @Neja: do we have a response yet? (It's been 60 days, right?) => No response yet, Neja will poll for an update again
- NIC received an interest payment because the payment for the first reporting period was late
- next General Assembly meeting
- 23-24 Jan'25 in Barcelona/Sitges
- venue is BSC
- coupled to HiPEAC'25 (20-22 Jan 2025)
- We need to promote the workshop at HiPEAC more!
- registration is quite pricey, so we'll need to limit who actually attends?
- Try to set up online participation (Kenneth / Caspar would be interested)
- Should be possible in the room. Maybe use Alan's mic.
- 23-24 Jan'25 in Barcelona/Sitges
- Project review
- (Discussed above)
- CI/CD call for EuroHPC
- is 100% funded (not 50/50 EU/countries)
- not published yet
- request for success story by CASTIEL2
- status: rounds of editing going on, should be published soon [Neja,Alan,Caspar]
- @Neja: do you know if this has been published by now?