
Sync meeting 2024 09 10


MultiXscale WP1+WP5 sync meetings


Next meetings

  • Tue 8 Oct 2024 10:00 CEST
  • Tue 12 Nov 2024 10:00 CET (prep for special project review?)
  • Tue 10 Dec 2024 10:00 CET (post-mortem of special project review?)

Agenda/notes 2024-09-10

attending:

  • Neja (NIC)
  • Alan (UB)
  • Kenneth (UGent)
  • ... (SURF)
  • Thomas & Richard (UiB)
  • Bob, Pedro (RUG)
  • Julián (BSC)
  • Helena, Susana (HPCNow!)
  • notes of previous meeting (13 August)?
  • shared drive moved to https://ubarcelona-my.sharepoint.com (same password)
  • 2024Q3 quarterly report
    • due by next sync meeting (early Oct)
    • reported risk of spending more PMs than planned in 2024Q2
      • not a real issue, we're catching up from underspending in 2023
  • Milestone 3 (M18 - June 2024, lead: UStuttgart)
  • amendment to MultiXscale grant agreement
    • convert PMs to travel budget => need to cut PMs from some tasks
      • T5.1 and T5.3 were finished M12, did we underspend PMs there?
  • WP status updates
    • [SURF] WP1 Developing a Central Platform for Scientific Software on Emerging Exascale Technologies

      • [UGent] T1.1 Stable (EESSI) - D1.3 due M24 (Dec'24)
        • more software added, now:
          • ~388 different software projects
          • ~731 software installations per CPU target
          • over 6,500 software installations in total (across all CPU targets)
          • particularly relevant for MultiXscale: pystencils, MetalWalls, pyMBE, Extrae (POP tool), ESPResSo for A64FX (Deucalion)
        • dev.eessi.io => see notes + support issue #61
          • dedicated Magic Castle cluster in Azure [Alan,Kenneth]
              • was created, but is not fully operational yet
              • not deployed correctly because an incorrect Terraform Cloud token was used
          • TODO:
            • install and configure bot on dedicated MC cluster for dev.eessi.io
            • implement bot/build.sh script in dev.eessi.io repo, leveraging scripts from software-layer repo
            • flesh out first use case: building specific commits of ESPResSo
          • showing good progress on this would be good for upcoming special project review
            • would help to tackle the remark that there's too much of a disconnect between technical and scientific WPs
          • we need to plan who will actively contribute, and how [Kenneth,Pedro?]
        • NVIDIA GPU support => see notes + support issue #59
          • ground work has been done end of 2023 (see EESSI docs)
          • bot was enhanced recently to support accel filter
          • TODO:
            • enhance script(s) in software-layer repo
              • auto-detect GPU model/architecture (enhance archdetect; see sketch below)
              • pick up accel directive from the bot and change software installation prefix accordingly
              • install GPU software in proper location: ESPResSo (?), LAMMPS, MetalWalls (?), TensorFlow, PyTorch, ...
          • proper NVIDIA GPU support is due by M24 (deliverable D1.3)
            • => we shouldn't wait for dev.eessi.io to be operational
          • we need to plan who will actively contribute, and how [Kenneth,Lara]
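          A minimal sketch of the auto-detection step from the TODO above, assuming nvidia-smi is available and that GPU-specific installations land in an accel/nvidia/cc<XY> subdirectory; the prefix layout and helper names are illustrative assumptions, not the actual archdetect implementation:

          ```python
          import subprocess

          def detect_nvidia_cc():
              """Return the compute capability of the first NVIDIA GPU (e.g. '8.0'), or None."""
              # the 'compute_cap' query field requires a reasonably recent NVIDIA driver
              try:
                  result = subprocess.run(
                      ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
                      capture_output=True, text=True, check=True,
                  )
              except (FileNotFoundError, subprocess.CalledProcessError):
                  return None
              lines = result.stdout.strip().splitlines()
              return lines[0].strip() if lines else None

          def accel_install_prefix(base_prefix, cpu_subdir, compute_cap):
              """Compose an installation prefix with an accelerator-specific subdirectory (assumed layout)."""
              accel_subdir = "accel/nvidia/cc" + compute_cap.replace(".", "")
              return f"{base_prefix}/{cpu_subdir}/{accel_subdir}"

          if __name__ == "__main__":
              cc = detect_nvidia_cc()
              if cc is None:
                  print("no NVIDIA GPU detected")
              else:
                  # example values; the CPU subdirectory would come from archdetect itself
                  print(accel_install_prefix(
                      "/cvmfs/software.eessi.io/versions/2023.06/software/linux",
                      "x86_64/amd/zen2", cc))
          ```

          The accel directive passed in by the bot could then override the auto-detected value, with the installation prefix adjusted accordingly (cf. the second TODO item above).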
        • need to review description of Task 1.1, make sure all subtasks are covered
        • "we will benchmark software from the shared software stack and compare the performance against on-premise software stacks to identify potential performance limitations, ..."
          • ESPResSo + LAMMPS + OpenFOAM + ALL(?) (MultiXscale), GROMACS (BioExcel)
        • "increase stability of the shared software stack ... pro-actively by developing monitoring tools"
          • proper monitoring for CVMFS network (S0 + S1s)
          • active work-in-progress by RUG, see also meeting notes
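          To illustrate the kind of monitoring meant here, a minimal sketch (an assumed approach, not the actual work-in-progress at RUG) that compares the published revision of the EESSI CVMFS repository on the Stratum-0 with each Stratum-1 replica by fetching the .cvmfspublished manifest; hostnames are placeholders:

          ```python
          import urllib.request

          REPO = "software.eessi.io"
          STRATUM0 = "stratum0.example.org"                     # placeholder hostname
          STRATUM1S = ["s1-a.example.org", "s1-b.example.org"]  # placeholder hostnames

          def published_revision(server, repo=REPO, timeout=10):
              """Return the repository revision ('S' field) from a server's .cvmfspublished manifest."""
              url = f"http://{server}/cvmfs/{repo}/.cvmfspublished"
              with urllib.request.urlopen(url, timeout=timeout) as resp:
                  text = resp.read().decode("utf-8", errors="replace")
              for line in text.splitlines():
                  if line.startswith("--"):  # start of the signature section
                      break
                  if line.startswith("S"):
                      return int(line[1:])
              raise ValueError(f"no revision field in manifest from {server}")

          if __name__ == "__main__":
              ref = published_revision(STRATUM0)
              for s1 in STRATUM1S:
                  rev = published_revision(s1)
                  status = "OK" if rev >= ref else f"LAGGING by {ref - rev} revision(s)"
                  print(f"{s1}: revision {rev} (stratum-0: {ref}) -> {status}")
          ```

          A real monitoring setup would likely also check manifest timestamps and feed the results into an alerting dashboard rather than printing them.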
      • [RUG] T1.2 Extending support - D1.4 due M30 (June'25)
        • optimized installations for AMD Genoa Zen4 (~64% done) + A64FX (~23% done) are still a work-in-progress
          • Intel Sapphire Rapids & NVIDIA Grace (for JUPITER) to start
        • AMD ROCm (see planning issue #31 + support issue #71)
          • effort led by Pedro/Bob (RUG)
      • [SURF] T1.3 Test suite - D1.5 due M30 (June'25)
        • additional tests implemented: CP2K, LAMMPS, mpi4py, PyTorch
        • WIP: test for MetalWalls
        • various small enhancements
        • is another release due?
      • [BSC] T1.4 RISC-V (due M48, D1.6)
      • [SURF] T1.5 Consolidation (starts M25 - Jan'25)
        • continuation of effort on EESSI (T1.1, etc.) (not started yet)
    • [UGent] WP5 Building, Supporting and Maintaining a Central Shared Stack of Optimized Scientific Software Installations

      • [SURF] T5.2 Monitoring/testing, D5.3 due M30 (June'25)
        • dashboard to present test results is work-in-progress @ SURF
      • [UGent] T5.4 support/maintenance - D5.4 due M48 (Dec'26)
        • support portal + bi-weekly rotation working well
        • total: 84 issues (29 open, 55 closed)
    • [UB] WP6 Community outreach, education, and training

      • deliverables due: D6.2 (M24 - Dec'24), D6.3 (M30 - June'25)
      • EESSI as "package manager" backend in Ramble (testing tool by Google)
      • EESSI was used at the EUMASTER4HPC event at IT4I (20 Aug 2024) by lecturer Tomas Martinovic as an easy way to provide a rich installation of R during his course
        • some trouble with MPI due to missing integration with Slurm
        • => could look into a "use cases" section in the EESSI docs to cover fixes/workarounds for known problems
          • Alan: some related work on this by Pedro @ HPCNow!
        • we could do a short interview on this with him, and then report on it on MultiXscale website + EESSI blog
      • [Alan] invited speaker for Nordic Industry Days (early Sept'24, Copenhagen)
        • AI focused event
        • TrustLLM, European project
          • Morris Riedel (Univ. of Iceland) is involved, and is interested in a collaboration on EESSI to evaluate performance (of PyTorch, for example)
      • [Thomas] presentation @ CernVM workshop on EESSI (16-18 Sept 2024, Geneva)
        • we need a section on the website to have a record of upcoming events like this
        • could be done via news item, see https://www.multixscale.eu/blog-posts/
        • also via social media (Twitter, LinkedIn, etc.)
        • contact Susana/Neja via email to get social media posts on this
      • [Richard] public webinar Introduction to EESSI (3 Oct 2024)
        • would be good if Norwegian NCC could be involved/made aware of this
        • => to be promoted via MultiXscale/EESSI channels
      • [Alan] First ambassador event: "Introduction to EESSI" on 4 Oct 2024 (see news post on MultiXscale website)
        • EESSI will be available on Austrian systems as a result of this
        • number of registered participants?
          • unclear; registration was handled by the Austrian partner, who will also run a post-event survey
          • Neja: need to check which questions will be asked in survey
        • do we need to promote this a bit more, via EESSI/EasyBuild mailing list/Slack, etc.?
          • yes
      • EuroHPC User Days (22-23 Oct 2024, Amsterdam)
        • attending: Kenneth/Lara (UGent), Thomas/Richard (UiB), Bob?/Pedro? (RUG)
        • paper submitted to get a talk slot
        • in touch with organisation w.r.t. participation in CoE session
          • "Walk-in networking sessions focusing on specific EuroHPC user needs: provide your feedback and get some advice"
        • bring your MultiXscale T-shirt!
      • Netherlands eScience Center (Dutch national center of expertise for research software, ~60 RSEs) got in touch with Bob, inviting him to give a talk (31 Oct'24, Amsterdam)
        • unclear if that's a public event, but can do a write-up afterwards
      • [Eli/HPCNow!] EESSI Birds-of-a-Feather session accepted at Supercomputing'24 (Atlanta, US)
        • can reuse material from BoF session @ ISC'24 in Hamburg
      • [Pedro] submitted talk for SURF Advanced Computing Days (12 Dec'24, Utrecht)
        • talk not accepted yet
      • [Eli?] EESSI tutorial at HiPEAC 2025 accepted (20-22 Jan'25)
      • [Jean-Noël] ESPResSo summer school
    • [HPCNow] WP7 Dissemination, Exploitation & Communication

      • T7.1 Scientific applications provisioned on demand (lead: HPCNow) (started M13, runs until M48)
        • WIP by Pedro (HPCNow!)
        • working on MPI injection for AWS ParallelCluster (AWS HPC Recipe library); see sketch below
        • no CUDA support in OpenMPI 4.x in AWS, but it's there in OpenMPI 5.x
        • more or less ready, but not yet available at startup
        • has to be tested on various OSs
        • will result in contribution to AWS HPC Recipe library + documentation on this
        • AWS HPC Recipe library will use a script that we control
        • future work: integration of EESSI into Open OnDemand
          • could be a collaboration with them, cf. discussion at the ISC booth
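        A rough sketch of the host MPI injection idea mentioned above: symlink the host's ABI-compatible MPI libraries (e.g. an EFA-enabled Open MPI on AWS ParallelCluster) into a writable host_injections override directory consulted by EESSI. All paths and the directory layout below are assumptions for illustration; the actual script contributed to the AWS HPC Recipe library may differ.

        ```python
        import glob
        import os

        # placeholder paths, for illustration only
        HOST_MPI_LIBDIR = "/opt/amazon/openmpi/lib64"
        INJECTION_LIBDIR = ("/cvmfs/software.eessi.io/host_injections/2023.06/software/linux/"
                            "x86_64/amd/zen3/rpath_overrides/OpenMPI/system/lib")

        def inject_host_mpi(host_libdir=HOST_MPI_LIBDIR, target_libdir=INJECTION_LIBDIR):
            """Symlink host MPI shared libraries into the (assumed) EESSI host_injections override dir."""
            os.makedirs(target_libdir, exist_ok=True)
            for lib in sorted(glob.glob(os.path.join(host_libdir, "libmpi*.so*"))):
                link = os.path.join(target_libdir, os.path.basename(lib))
                if not os.path.lexists(link):
                    os.symlink(lib, link)
                    print(f"linked {link} -> {lib}")

        if __name__ == "__main__":
            inject_host_mpi()
        ```

        Such a step could be run at node startup (e.g. via a custom bootstrap action), so that EESSI-built applications pick up the host MPI automatically.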
      • Task 7.2 - Dissemination and communication activities (lead: NIC)
        • more EESSI stickers?
        • Neja has more stickers: 1k EESSI + 1k MultiXscale
          • send an email to Neja if you need some shipped
        • Susana: improved/new video to be displayed at EuroHPC booth at SC'24 in Atlanta?
          • no budget for something professional, so we'll need to do it ourselves
          • most important thing is that we give them something to display
          • deadline: 11 Oct'24
          • is there any outdated info in current video?
          • would be nice to include interview with Matej (aimed at the general public)
          • no sound for SC video!
            • but we can add subtitles
        • missing on MultiXscale website: organisation info (SC members, WP leads, etc.)
        • Susana is trying to get more followers on Facebook, now 67 followers
      • Task 7.3 - Sustainability (lead: NIC, started M18, due M42)
        • Legal entity for EESSI needs to be looked into
          • Alan has done some homework on this
          • only for EESSI, not for MultiXscale?
          • need to frame it as a way of making results from MultiXscale sustainable through EESSI, by including the software there
        • subcontracting money available for this
        • we should explore options ourselves a bit first
        • review mentioned that there's "no service" from scientific WPs
          • a bit unfair, see trainings (like the upcoming CECAM event), etc.
          • see also EESSI being used in a CI context by pyMBE (on top of ESPResSo)
      • Task 7.4 - Industry-oriented training activities (lead: HPCNow)
        • CECAM flagship event is also in scope here
        • work done on integrating EESSI in ParallelCluster, could do a webinar/tutorial on this
    • [NIC] WP8 (Management and Coordination)

      • amendment in the works
        • Neja will start looking into that after her holiday in July
        • waiting for feedback from IIT on travel budget
        • maybe not every partner will do this, because it requires approval from their co-funding agency
          • unclear what would happen if EU approves the change, but co-funding agency doesn't
          • not a problem for UGent/UB since travel budget is entirely in EU budget
        • Milestone 8 would be moved to KPI for T8.3: report on how many applications are installed on how many systems
        • @Neja: all partners should be informed that the amendment is in the works and will be submitted soon
      • next General Assembly meeting
      • 2 deliverables were due 5 July'24 (in response to project review)
        • one on co-design (by Alan)
          • focus on collaborating with projects like EUPILOT, EPI, EUPEX (rather than contacting vendors directly)
        • one for scientific WPs
        • both submitted
      • next general MultiXscale meeting
        • Mon 23 Sept 2024, 14:00 CEST
        • discussing amendment?
        • prep for special project review?
          • who will prepare/present, and what will be discussed

Other topics

  • CI/CD call for EuroHPC
    • is 100% funded (not 50/50 EU/countries)
    • not published yet
  • request for success story by CASTIEL2
    • ideally by end of June, at the latest by end of August
    • status: rounds of editing going on, should be published soon [Neja,Alan,Caspar]
