
Sync meeting 2024-11-12


MultiXscale WP1+WP5 sync meetings


Next meetings

  • Tue 10 Dec 2024 10:00 CET (post-mortem of special project review?)

Agenda/notes 2024-11-12

attending:

  • Caspar, Casper, Maxim (SURF)

  • Kenneth (HPC-UGent)

  • Alan (CECAM, UB)

  • Neja (NIC)

  • Bob (RUG)

  • Eli, Helena, Susana (HPCNow!)

  • Thomas, Richard (UiB)

  • Julián (BSC)

  • Special Project Review

  • Upcoming deliverables (M24):

    • we should schedule an initial sync meeting to kickstart the effort on these deliverables
      • for D1.3: UB lead - Kenneth will take the lead in practice
      • for D6.2: UiB lead - Thomas
      • for D7.2: HPCNow lead - Eli
      • for D8.5: NIC lead - Neja
    • D1.3, M24: Report on stable, shared software stack (UB)
      • Who: Kenneth (UGent), Caspar (SURF), Pedro (RUG), UiB (Richard/Thomas), UB (Alan)
        • Get started right after review? @Alan, will you take the initiative and divide the work (if needed)?
      • TODO:
        • GPU support -> see tiger team
        • monitoring -> see ongoing effort by RUG
        • compare the performance against on-premise software stacks to identify potential performance limitations
          • mostly on Snellius @ SURF, since that's easier
            • Who? When? EESSI test suite?
        • stability -> zero reports so far of EESSI "network" being down
    • D6.2, M24: Training Activity Technical Support Infrastructure (UiB)
      • Lhumos portal
      • Magic Castle clusters for training, with EESSI
      • HPC Carpentry (?)
      • Ambassador program (first event was in October)
      • => @Thomas/Richard to pull this
    • D7.2, M24: Intermediate report on Dissemination, Communication and Exploitation (HPCNow)
    • D8.5, M24: Project Data Management Plan - final (NIC)
      • Waiting for review comments on D8.2
  • Upcoming milestones (M24):

    • update required on Milestone 2, requested by the reviewers
      • shared software stack w/ 1 accel supported + list of dependencies defined in WP2-4
    • Milestone 4, M21: First training event supported by the Ambassador Program. [WP1/WP5/WP6] (UB)
    • Milestone 5, M24: WP4 Pre-exascale workload executed on EuroHPC architecture. [WP2/WP3/WP4] (NIC)
      • was done through Tilen, see the post on the MultiXscale website
  • WP status updates

    • [SURF] WP1 Developing a Central Platform for Scientific Software on Emerging Exascale Technologies
      • [UGent] T1.1 Stable (EESSI) - D1.3 due M24 (Dec'24)
        • dev.eessi.io: Tiger team is making very good progress. See meeting notes
          • Caspar: Will show Poiseuille demo, based on ESPResSo/4.2.2-foss-2023a-2ba17de6096933275abec0550981d9122e4e5f28 (i.e. test PR #1)
          • Done:
            • (Check TODOs from last month)
            • get ingestions for dev.eessi.io working => done
            • documentation update => open PR, waiting for Kenneth to review/merge
            • dev build of GROMACS was built & deployed => done
          • Key TODOs:
            • WIP by Pedro: install in project-specific prefixes => should be working now
            • TODO: Make sure accelerator builds end up in the right prefix (WIP by Pedro); see the prefix sketch below
            • Once docs are merged, get scientific software developer to experiment with it => Jean-Noël?
          • What do we REALLY need from this before the project review?
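          A minimal sketch of the prefix idea from the TODOs above: composing a
          project- and accelerator-specific installation prefix under dev.eessi.io.
          All path components here (project subdirectory, version, accel layout)
          are assumptions for illustration, not the tiger team's final design.

          ```python
          # Hypothetical sketch only: the dev.eessi.io path layout is assumed.
          import os

          def dev_prefix(project, version="2023.06",
                         cpu_arch="x86_64/amd/zen2", accel=None):
              """Compose a guessed installation prefix for a dev.eessi.io build."""
              parts = ["/cvmfs/dev.eessi.io", project, "versions", version,
                       "software", "linux", cpu_arch]
              if accel:  # accel directive from the bot, e.g. "nvidia/cc80"
                  parts += ["accel", accel]
              return os.path.join(*parts)

          print(dev_prefix("espresso"))                       # CPU-only build
          print(dev_prefix("espresso", accel="nvidia/cc80"))  # accelerator build
          ```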
        • NVIDIA GPU support: Tiger team is making really good progress. See meeting notes
          • Key results:
            • cuDNN ingested, properly stripped to redistributable part
            • updated local installation in host_injections to include cuDNN
            • enhance script(s) in software-layer repo
              • auto-detect GPU model/architecture (enhance archdetect) => DONE (see the detection sketch below)
              • pick up accel directive from the bot and change software installation prefix accordingly => DONE
            • (Check TODOs from last month)
          • TODO:
            • [WIP] updating the SitePackage.lua for proper GPU support (see PR #798) => Ready for review
            • Actual GPU nodes in build cluster (now cross-compiling, not running GPU tests) or at least on partner clusters
              • service account in place for EESSI at HPC-UGent + Deucalion
              • WIP @ Snellius
              • maybe also to explore at Vega?
            • Adapt bot to accept arguments to allocate/build on GPU nodes
            • Decide on and expand combinations of CPU & GPU architectures
              • will be determined by where we can get service accounts for EESSI?
              • should definitely cover EuroHPC systems
              • maybe also look into generic CPU + GPU?
            • Re-install GPU software in proper location: ESPResSo (?), LAMMPS, MetalWalls (?), TensorFlow, PyTorch, ...
              • => DONE for ESPResSo, LAMMPS
              • WIP for TensorFlow, PyTorch
          • proper NVIDIA GPU support is due by M24 (deliverable D1.3). What do we really want in there?
            • We are already there: all of the software from MultiXscale that has GPU support is in software.eessi.io
            • Nice to have: an extra AMD Zen4 + CC90 (Hopper) - can be built on Snellius
            • Thomas/Richard are experimenting with builds for NVIDIA Grace on the experimental platform in Snellius
              • would be nice to also work together with JSC/JUPITER on this
          • we need to plan who will actively contribute, and how [Kenneth,Lara]
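          As a rough illustration of the archdetect enhancement listed under the
          key results above (auto-detecting the GPU and mapping it to an
          accelerator subdirectory): archdetect itself is a shell script in the
          software-layer repo, so the Python analogue below is illustrative only,
          and the compute_cap query plus the nvidia/ccXY naming are assumptions.

          ```python
          # Illustrative analogue of archdetect's GPU detection, not the real code.
          import subprocess

          def nvidia_accel_target():
              """Return an accel target like 'nvidia/cc80', or None without a GPU."""
              try:
                  out = subprocess.run(
                      ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
                      capture_output=True, text=True, check=True,
                  ).stdout
              except (FileNotFoundError, subprocess.CalledProcessError):
                  return None  # no NVIDIA driver/GPU visible on this node
              caps = sorted({line.strip() for line in out.splitlines() if line.strip()})
              return "nvidia/cc" + caps[0].replace(".", "") if caps else None

          print(nvidia_accel_target())
          ```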
        • need to review the description of Task 1.1 and make sure all subtasks are covered
        • "we will benchmark software from the shared software stack and compare the performance against on-premise software stacks to identify potential performance limitations, ..."
          • ESPResSo + LAMMPS + OpenFOAM + ALL(?) (MultiXscale), GROMACS (BioExcel)
          • Who does what, and on which system?
            • Satish can do this; it has to be done before M24 (a sketch of the comparison step follows below)
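          The comparison step itself can stay simple, e.g. fed with wall-clock
          times collected via the EESSI test suite; the numbers below are
          placeholders, not measurements.

          ```python
          # Placeholder timings, purely to illustrate the comparison step.
          eessi_times  = [123.4, 125.1, 124.0]   # seconds, hypothetical
          onprem_times = [121.9, 122.5, 122.1]   # seconds, hypothetical

          def mean(xs):
              return sum(xs) / len(xs)

          overhead = mean(eessi_times) / mean(onprem_times) - 1.0
          print(f"EESSI vs on-premise overhead: {overhead:+.1%}")
          ```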
        • "increase stability of the shared software stack ... pro-actively by developing monitoring tools"
          • What do we still need before WE consider this box checked?
            • Basic functionality is there: if a Stratum 1 server goes down, it gets reported in a Slack channel (see the monitoring sketch below)
              • TODO: add e-mails, but not needed for M24
            • Others still need to be able to access the monitoring server
            • Strictly speaking, everything needed for the deliverable is there
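          A minimal sketch of the kind of Stratum 1 liveness check behind that
          Slack report, assuming a fetch of the CernVM-FS .cvmfspublished manifest
          and a standard Slack incoming webhook; the server URL and webhook are
          placeholders, not the actual monitoring configuration.

          ```python
          import requests

          STRATUM1_URLS = [
              # placeholder host; the real Stratum 1 list differs
              "http://stratum1.example.org/cvmfs/software.eessi.io/.cvmfspublished",
          ]
          SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

          for url in STRATUM1_URLS:
              try:
                  requests.get(url, timeout=10).raise_for_status()
              except requests.RequestException as exc:
                  # report the failing server in the monitoring Slack channel
                  requests.post(SLACK_WEBHOOK,
                                json={"text": f"Stratum 1 check failed: {url} ({exc})"})
          ```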
      • [RUG] T1.2 Extending support - D1.4 due M30 (June'25)
      • [SURF] T1.3 Test suite - D1.5 due M30 (June'25)
        • MetalWalls accepted
        • eessi_mixin class finished and merged (see the sketch below)
          • TODO: docs & port existing tests
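          For reference, a hedged sketch of what a test built on the eessi_mixin
          class might look like; the import path and the exact attributes the
          mixin expects are assumptions here, so check the EESSI test suite
          repository (and the upcoming docs) for the actual interface.

          ```python
          import reframe as rfm
          import reframe.utility.sanity as sn

          from eessi.testsuite.eessi_mixin import EESSI_Mixin  # import path assumed

          @rfm.simple_test
          class ExampleEESSITest(rfm.RunOnlyRegressionTest, EESSI_Mixin):
              # the mixin is meant to take care of valid systems, scales,
              # module loading, etc., so the test itself stays minimal
              executable = "echo"
              executable_opts = ["hello from EESSI"]

              @sanity_function
              def assert_hello(self):
                  return sn.assert_found(r"hello from EESSI", self.stdout)
          ```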
      • [BSC] T1.4 RISC-V (due M48, D1.6)
        • Julián is working on getting CernVM-FS deployed natively on the RISC-V hardware they have at BSC => deployed on 3 of the clusters (one runs Ubuntu, another Fedora); WIP: pioneer cluster. A probe sketch follows below.
        • Formal invitation to present at HiPEAC
        • Shared account on cluster, can be used for bot builds
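        The standard way to verify such a native deployment is cvmfs_config probe;
        a tiny wrapper is sketched below (Python only for consistency with the
        other sketches on this page; normally you would just run the command in a
        shell).

        ```python
        import subprocess

        # `cvmfs_config probe <repo>` is the standard CernVM-FS health check;
        # on success it prints something like "Probing /cvmfs/software.eessi.io... OK"
        result = subprocess.run(
            ["cvmfs_config", "probe", "software.eessi.io"],
            capture_output=True, text=True,
        )
        print(result.stdout.strip() or result.stderr.strip())
        ```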
      • [SURF] T1.5 Consolidation (starts M25 - Jan'25)
        • continuation of effort on EESSI (T1.1, etc.) (not started yet)
    • [UGent] WP5 Building, Supporting and Maintaining a Central Shared Stack of Optimized Scientific Software Installations
      • [SURF] T5.2 Monitoring/testing, D5.3 due M30 (June'25)
        • Plan to split the dashboard & database across two separate VMs (security)
        • TODO Caspar: contact Vega & Karolina about making test data public. They agreed verbally at EuroHPC User Day 2024
        • Short term: Get this running somewhere, e.g. with a reverse tunnel, so that we don't have to worry about VM security (but can still open it up to our project members at least)
        • Richard: Can we containerize the dashboard + database, so that a user could deploy this locally?
          • Maxim: yes, work in progress. Should be a plug-and-play solution once done.
      • [UGent] T5.4 support/maintenance - D5.4 due M48 (Dec'26)
        • ...
    • [UB] WP6 Community outreach, education, and training
    • [HPCNow] WP7 Dissemination, Exploitation & Communication
      • podcast interview for EuroHPC podcast
        • Sound recording tutorial attended by a few of us
        • Plan is to still record, but will be after project review (Dec/Jan)
        • Kenneth in touch with NCC Belgium
        • The fact that we are working on this will be added to the slides for the review meeting
      • T7.1 Scientific applications provisioned on demand (lead: HPCNow) (started M13, runs until M48)
      • Task 7.2 - Dissemination and communication activities (lead: NIC)
        • Susana is actively trying to get more followers on Facebook
        • Updates ... ?
      • Task 7.3 - Sustainability (lead: NIC, started M18, due M42)
        • Discussion with CASTIEL2; not really working actively on this task yet
      • Task 7.4 - Industry-oriented training activities (lead: HPCNow)
        • Updates ... ? => Will be discussed a bit at review in Luxembourg
    • [NIC] WP8 (Management and Coordination)
      • Mail from project officer for 2nd review meeting
        • probably end-of-August 2025
        • must be in EU country, doesn't have to be Luxembourg
      • Amendment (@Neja / @Alan, can you summarize the key points of what was submitted?)
        • @Neja: do we have a response yet? (It's been 60 days, right?) => No response yet, Neja will poll for an update again
      • NIC received an interest payment because the payment for the first reporting period was late
      • next General Assembly meeting
        • 23-24 Jan'25 in Barcelona/Sitges
          • venue is BSC
          • coupled to HiPEAC'25 (20-22 Jan 2025)
          • We need to promote the workshop at HiPEAC more!
          • registration is quite pricey, so we'll need to limit who actually attends?
          • Try to set up online participation (Kenneth / Caspar would be interested)
            • Should be possible in the room. Maybe use Alan's mic.
      • Project review
        • (Discussed above)

Other topics

  • CI/CD call for EuroHPC
    • is 100% funded (not 50/50 EU/countries)
    • not published yet
  • request for success story by CASTIEL2
    • status: rounds of editing going on, should be published soon [Neja,Alan,Caspar]
    • @Neja: do you know if this has been published by now?