
Sync meeting 2023-08-08


MultiXscale WP1+WP5 sync meetings


Agenda/notes 2023-08-08

attending: ...

  • overview of MultiXscale planning

  • quarterly report 2023Q2

    • ...
  • WP status updates

    • [SURF] WP1 Developing a Central Platform for Scientific Software on Emerging Exascale Technologies
      • [UGent] T1.1 Stable (EESSI) - due M12+M24
        • new Stratum-0 hardware @ RUG
          • still work-in-progress
          • will hopefully be in place soon so we can switch to eessi.io domain
          • performance issues with Stratum-1 replica server @ RUG to be figured out (issue #151)
          • we don't actually need to wait until the new Stratum-0 is in place to start building the compat layer for eessi.io
        • EasyBuild v4.8.0 was released on 7 July 2023
          • incl. foss/2023a toolchain, software updates + new software, some bug fixes related to EESSI
        • Building EESSI 2023.06 software layer
          • incl. GROMACS and QuantumESPRESSO with foss/2021a
          • all software installations are built and deployed by the bot (cf. T5.3), working well!
            • contributors with access to the Slurm cluster where the bot is running can find build logs for failing builds in a shared directory (the bot automatically copies them there)
          • several installations blocked by test suite failures that occur only on aarch64/neoverse_v1 (Graviton3)
            • for FFTW in foss/2022a, for OpenBLAS in foss/2022b, for numpy in SciPy-bundle v2021.10 with foss/2021b
            • no problem on aarch64/neoverse_n1 (Graviton2)
            • we need to come up with a workflow to deal with situations like this
              • initial proposal here
        • draft contribution policy for adding software to EESSI - software should only be added under certain conditions, for example with less than 1% of tests failing
          • see PR #108
          • needs some tweaking based on provided feedback - more feedback welcome! (Caspar?, Eli?)
          • should maybe be extended with workflow for dealing with failing tests
        • for ESPResSo, we need a CPU-only easyconfig
        • GPU support
          • https://github.com/multixscale/planning/issues/1
          • dedicated meeting to figure out steps to take, who does what
          • ship CUDA compat libraries: where (compat layer?), structure, etc.
          • changes to the script that launches the build container, to allow access to GPUs
          • etc.
      • [RUG] T1.2 Extending support (starts M9, due M30)
      • [SURF] T1.3 Test suite - due M12+M24
        • TensorFlow test merged (PR #38)
        • Support for skipping/running tests based on the GPU vendor (PR #60)
          • E.g. for applications that only support NVIDIA GPUs (a sketch of this approach follows below)
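          • for illustration, a minimal sketch (not the actual PR #60 implementation) of how a ReFrame test could be skipped on partitions without NVIDIA GPUs; the partition feature name 'nvidia_gpu' and the test itself are assumptions:
            ```python
            import reframe as rfm
            import reframe.utility.sanity as sn


            @rfm.simple_test
            class NvidiaGpuOnlyCheck(rfm.RunOnlyRegressionTest):
                '''Hypothetical check that should only run where NVIDIA GPUs are available.'''
                valid_systems = ['*']
                valid_prog_environs = ['*']
                executable = 'nvidia-smi'

                @run_after('setup')
                def skip_without_nvidia_gpu(self):
                    # 'nvidia_gpu' is an assumed partition feature from the ReFrame config;
                    # skip (rather than fail) on partitions that do not provide it
                    self.skip_if('nvidia_gpu' not in self.current_partition.features,
                                 'no NVIDIA GPU available in this partition')

                @sanity_function
                def assert_gpu_listed(self):
                    return sn.assert_found(r'NVIDIA-SMI', self.stdout)
            ```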
        • Various ReFrame configs (a minimal configuration sketch follows below):
          • Snellius: merged (PR #66)
          • Vega: merged (PR #62), with an update also merged (PR #76)
          • AWS: ready for review, not merged yet (PR #53)
            • CPU autodetection issue resolved on AWS
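          • for reference, a minimal sketch of what such a ReFrame site configuration looks like; all names and numbers below are placeholders, not the actual Snellius/Vega/AWS settings from the PRs above:
            ```python
            # minimal ReFrame site configuration sketch for a Slurm-based system
            site_configuration = {
                'systems': [
                    {
                        'name': 'example-cluster',
                        'descr': 'example cluster for the EESSI test suite',
                        'hostnames': ['login.*', 'node.*'],
                        'modules_system': 'lmod',
                        'partitions': [
                            {
                                'name': 'cpu',
                                'descr': 'CPU partition',
                                'scheduler': 'slurm',
                                'launcher': 'mpirun',
                                'access': ['-p cpu'],
                                'environs': ['default'],
                                'max_jobs': 4,
                                # filling in processor info avoids relying on CPU
                                # autodetection (cf. the AWS issue mentioned above)
                                'processor': {
                                    'num_cpus': 256,
                                    'num_cpus_per_core': 2,
                                    'num_sockets': 2,
                                    'num_cpus_per_socket': 128,
                                },
                            },
                        ],
                    },
                ],
                'environments': [
                    {'name': 'default', 'cc': 'cc', 'cxx': 'CC', 'ftn': 'ftn'},
                ],
            }
            ```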
        • Scripts for daily runs of the test suite on Vega: PR #70 and PR #71 merged, PR #77 not merged yet
        • Investigated hyperthreading
          • E.g. on Vega: 256 hardware threads, 128 physical cores. The idea was to start 128 threads/processes in total instead of 256, and bind them to the 128 physical cores (see the sketch below)
          • This turned out to be difficult to achieve, and performance was actually higher with 256 hardware threads
          • Also note: for testing, it is not a problem if performance is not "the best it can be" on a system, as long as it is consistent
          • Parking this issue for now (#74); might revisit if other applications really should not use hyperthreading
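          • a rough sketch of how this could be expressed in a ReFrame test (illustrative only, not the parked solution from #74); attribute names follow the ReFrame API, the test itself is made up:
            ```python
            import reframe as rfm
            import reframe.utility.sanity as sn


            @rfm.simple_test
            class PhysicalCoresOnlyCheck(rfm.RunOnlyRegressionTest):
                '''Illustrative test launching one task per physical core.'''
                valid_systems = ['*']
                valid_prog_environs = ['*']
                executable = 'hostname'  # placeholder workload

                @run_after('setup')
                def one_task_per_physical_core(self):
                    proc = self.current_partition.processor
                    # e.g. on Vega: 256 hardware threads, 128 physical cores
                    self.num_tasks_per_node = proc.num_cores
                    self.num_cpus_per_task = 1
                    # with the Slurm backend this adds --hint=nomultithread
                    self.use_multithreading = False

                @sanity_function
                def produced_output(self):
                    # trivial sanity check: some output was produced
                    return sn.assert_found(r'\S+', self.stdout)
            ```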
        • OSU benchmarks: PR #54, WIP
        • Meeting on ESPResSo test with Jean-Noël (see meeting notes)
          • Got a few Python scripts that can serve as tests ('p3m', 'lb' and 'lj' cases); these should scale to at least a few nodes. To do:
            • ESPResSo for a newer toolchain: ESPResSo/4.2.1-foss-2022a-CUDA-11.7.0 (done, but still needs to be contributed to EasyBuild via a PR)
            • make a CPU-only version
            • install it (in EESSI, or otherwise locally) to use when developing ReFrame tests
            • create ReFrame tests out of these (see the sketch below)
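          • a rough sketch of what such a ReFrame test could look like (module name taken from the installation above; the script filename, sources directory and sanity check are assumptions):
            ```python
            import reframe as rfm
            import reframe.utility.sanity as sn


            @rfm.simple_test
            class EspressoP3MCheck(rfm.RunOnlyRegressionTest):
                '''Sketch of a test wrapping the 'p3m' ESPResSo script.'''
                valid_systems = ['*']
                valid_prog_environs = ['*']
                modules = ['ESPResSo/4.2.1-foss-2022a-CUDA-11.7.0']  # CPU-only variant still to do
                sourcesdir = 'src'            # assumed location of the Python scripts
                num_tasks = 4
                executable = 'python3'
                executable_opts = ['p3m.py']  # assumed filename for the 'p3m' case

                @sanity_function
                def completed(self):
                    # placeholder: the exact output to check still needs to be agreed on
                    return sn.assert_found(r'\S+', self.stdout)
            ```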
        • Alan: would be good to run tests with injected libraries to improve performance, like we did with AWS
        • performance monitoring with ReFrame
          • are there new ReFrame features to specify reference values for tests in the configuration file? (a sketch of how references are currently set per test follows below)
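          • for context, a sketch of how reference values are currently attached to a test (per-test 'reference' dict plus a performance function); the application, numbers and log patterns are illustrative only:
            ```python
            import reframe as rfm
            import reframe.utility.sanity as sn


            @rfm.simple_test
            class GromacsPerfCheck(rfm.RunOnlyRegressionTest):
                '''Illustrative performance check; values below are made up.'''
                valid_systems = ['*']
                valid_prog_environs = ['*']
                executable = 'gmx_mpi'
                executable_opts = ['mdrun', '-s', 'benchmark.tpr']

                # per 'system:partition' reference: (value, lower_thr, upper_thr, unit)
                reference = {
                    'vega:cpu': {'perf': (40.0, -0.05, None, 'ns/day')},  # made-up values
                    '*': {'perf': (None, None, None, 'ns/day')},          # no bounds elsewhere
                }

                @sanity_function
                def finished(self):
                    return sn.assert_found(r'Finished mdrun', 'md.log')

                @performance_function('ns/day')
                def perf(self):
                    return sn.extractsingle(r'Performance:\s+(?P<perf>\S+)',
                                            'md.log', 'perf', float)
            ```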
      • [BSC] T1.4 RISC-V (starts M13)
      • [SURF] T1.5 Consolidation (starts M25)
    • [UGent] WP5 Building, Supporting and Maintaining a Central Shared Stack of Optimized Scientific Software Installations
      • [UGent] T5.1 Support portal - due M12
      • [SURF] T5.2 Monitoring/testing (starts M9)
      • [UiB] T5.3 community contributions (bot) - due M12
        • working towards release v0.1 (https://github.com/multixscale/planning/issues/42)
          • all major features / parts planned for v0.1 have been implemented
          • doing one code polishing pass
          • targeting mid/end of August for the release
          • contribution policy isn't really part of the v0.1 bot release
        • bot being used to build software stack for 2023.06
          • bot has been working as intended for building/deploying software for EESSI
        • next release: add running the test suite (details of how and when tests are run still to be defined)
        • refactoring of the bot code
        • infrastructure for running the bot
          • maintaining the Slurm cluster in AWS (set up with Cluster-in-the-Cloud) is becoming a bit of a pain
          • we should set up new Magic Castle Slurm clusters in AWS + Azure to replace it
            • may need to set up separate MC instances for x86_64 and aarch64
      • [UGent] T5.4 support/maintenance (starts M13)
    • [UB] WP6 Community outreach, education, and training
      • First pass of "Elevator Pitch" created, another revision under way
        • High-level overview of MultiXscale goals, to sell it to the user community and get them interested
        • HPCNow! is working on a revision
      • Ambassador program to be outlined at NCC/CoE Coffee Break, 7th Sept.
        • Some NCCs do seem to be interested in the concept
        • mostly useful for EESSI (which is generic), more difficult for scientific WPs (due to required domain expertise)
      • Magic Castle tutorial at SC'23 accepted
        • EESSI will get a mention as it is one of the available stacks in MC
      • Second meeting with CVMFS developers regarding the Q4 workshop
      • Have offered to do a "Code of the Month" session with CASTIEL2
        • Should this wait until the switch to eessi.io is done?
    • [HPCNow] WP7 Dissemination, Exploitation & Communication
  • Other updates


Notes of previous meetings


Template for sync meeting notes

TO COPY-PASTE

  • overview of MultiXscale planning
  • WP status updates
    • [SURF] WP1 Developing a Central Platform for Scientific Software on Emerging Exascale Technologies
      • [UGent] T1.1 Stable (EESSI) - due M12+M24
        • ...
      • [RUG] T1.2 Extending support (starts M9, due M30)
      • [SURF] T1.3 Test suite - due M12+M24
        • ...
      • [BSC] T1.4 RISC-V (starts M13)
      • [SURF] T1.5 Consolidation (starts M25)
    • [UGent] WP5 Building, Supporting and Maintaining a Central Shared Stack of Optimized Scientific Software Installations
      • [UGent] T5.1 Support portal - due M12
        • ...
      • [SURF] T5.2 Monitoring/testing (starts M9)
      • [UiB] T5.3 community contributions (bot) - due M12
        • ...
      • [UGent] T5.4 support/maintenance (starts M13)
    • [UB] WP6 Community outreach, education, and training
      • ...
    • [HPCNow] WP7 Dissemination, Exploitation & Communication
      • ...