sync meetings bot T5.3

live notes during meeting: https://hackmd.io/z6hu9gkoR6G1yCichemorA

A snapshot of the meeting notes is included below, copied from the HackMD note.

MXS - T5.3 Facilitating community contributions to the central software stack

See end of document for task admin info

Current target: v0.1 release of build-and-deploy bot, see https://github.com/multixscale/planning/issues/42

meeting notes

Thu 23 Nov 2023 (13:00 CET) - bot sync meeting

present: Thomas, Kenneth, Lara, Pedro, Bob, Caspar
contribution policy in place \o/
- https://www.eessi.io/docs/adding_software/contribution_policy
MultiXscale deliverable
- WIP by Thomas, with feedback from Kenneth
- in very good shape, 90% done
- Thomas: revise document (starting Fri 24 Nov, noon-ish)
- Pedro/Bob: review afterwards, as 3rd reviewer
PRs for test phase in bot
- assumed to be in place in MultiXscale deliverable, so should get merged ASAP
- PR #222 to bot should be done (but is untested)
  - this is the critical PR that should get merged ASAP so we have a new bot release (v0.2.0)
  - logic was copy-paste from build phase, so should be OK
- PR #366 to software-layer is WIP
  - only bot/check-test.sh script is missing
  - can use a dummy bot/check-test.sh script for now (if needed)
cleanup for software.eessi.io PR #229
- pretty trivial PR
next meeting: Wed 13 Dec'23, 10:00-11:00 CET

Thu 09 Nov 2023 (11:00 CET) - bot sync meeting

present: Pedro, Bob, Alan, Kenneth, Thomas, Lara
contribution policy
- recap
  - purpose: enable contributions from anyone by defining objective criteria for what software can be included and how
  - content: set of requirements that are easy to understand, easy to satisfy, and easy to verify (some automatically, some manually); some requirements may not be enforced yet, but we state our intent to enforce them in the future
  - requirements:
    - a) open source software
    - b) built by the bot
    - c) supported by EasyBuild release
    - d) supported compiler toolchain
    - e) CPU microarchitectures
    - f) versions & toolchains
    - g) testing
- personal view (TR): mostly fine, doesn't need to be perfect, but no software in EESSI/2023.06 should violate it; need it published NOW 😃
- things to change in current draft policy:
  - a) open source software
    - should be reworded to "allowed to redistribute" + strongly prefer open source software
  - b) built by the bot
  - c) supported by EasyBuild
    - an EasyBuild release
    - remove 2nd line
  - d) compiler toolchain
    - latest EasyBuild release
  - e) recent software versions & toolchains
    - sufficiently recent software versions & toolchains
    - software versions & toolchains should be separate sections in policy
      - e) recent software versions
        
        We strongly prefer installing sufficiently recent software versions.
      - f) recent toolchain generations
        
        EESSI administrators determine which toolchains are installed. Software should be added to an existing easystack file.
        
        Open a support request if existing toolchains are not sufficient.
    - connection between compat layer & toolchain generations
      - only install latest N toolchain generations in new compat layer
      - limited overlap across compat layer versions
  - g) CPU targets
    - target CPU microarchitectures?
    - "supported by the EESSI version to which the software is contributed"
  - h) testing
    - very loose, good enough for v0.1 of the policy
    - should be refined to test for applications, user-facing, etc.?
    - hard to make it stricter right now, since the bot doesn't yet support actually testing the installations in a different context
    - should mention that tests run by EasyBuild must pass (with exceptions allowed + must be documented in issue)
- "Requirements" section should say "there may be other restrictions, on a case-by-case basis"
- for software.eessi.io => stick to 2022b & newer in 2023.06
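The idea of only installing the latest N toolchain generations in a new compat layer can be illustrated with a small Python sketch. It assumes generation names follow the EasyBuild `YYYYa`/`YYYYb` pattern (e.g. `2022b`); the helper `latest_generations` is hypothetical and not part of the bot or the software-layer scripts.

```python
import re
from typing import Iterable, List

# Assumption: EasyBuild-style toolchain generation names such as "2022a" or "2023b".
_GENERATION_RE = re.compile(r"^(\d{4})([ab])$")


def latest_generations(generations: Iterable[str], n: int) -> List[str]:
    """Return the n most recent toolchain generations, oldest first."""
    if n <= 0:
        return []
    parsed = []
    for gen in set(generations):  # ignore duplicates
        match = _GENERATION_RE.match(gen)
        if match is None:
            raise ValueError(f"unexpected toolchain generation: {gen!r}")
        # sort key (year, half-year): (2022, "a") sorts before (2022, "b")
        parsed.append((int(match.group(1)), match.group(2), gen))
    parsed.sort()
    return [gen for _, _, gen in parsed[-n:]]


print(latest_generations(["2021a", "2022b", "2022a", "2023a", "2021b"], 3))
# ['2022a', '2022b', '2023a']
```

Keeping N small limits how much overlap there is between toolchain generations across compat layer versions.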
release v0.1.1
- merge PR #224
- do quick test of current develop branch before tagging v0.1.1 release
D5.1
- started
- should finish a draft by ??? (latest end of next week, task internal comments, project internal review, ...)
- needs contribution policy
- would like to have v0.1.1 (bug fixes) & v0.2.0 (support for basic tests)
aob
- initial support for testing
  - bot side is done (but untested), see https://github.com/EESSI/eessi-bot-software-layer/pull/222
  - software layer PR needs work (missing script): https://github.com/EESSI/software-layer/pull/366
D5.2 (support portal)
- Lara is close to having a finished draft
D1.1 (EESSI)
- Bob & Pedro have a bullet list to kickstart the deliverable
software.eessi.io
- PRs to merge
  - https://github.com/EESSI/filesystem-layer/pull/160
    - needs to be updated with new pubkey for repo
  - (merged) https://github.com/EESSI/filesystem-layer/pull/161
  - https://github.com/EESSI/cvmfs-servers/pull/1
- (Bob) install & use new key
  - and update PR accordingly
- (Bob) update Stratum-1's with new key
- (Thomas) update bot config @ https://github.com/EESSI/bot-configs/tree/main/mc-aws-rocky88-202310/repos
- get EESSI config in cvmfs-config repo (https://github.com/cvmfs-contrib/config-repo)
- (Thomas,Kenneth) install security updates in 2023.06 compat layer
  - document procedure in EESSI support portal wiki?
  - consider ingesting into a subdirectory + using a symlink
- install 2023.06 software layer for software.eessi.io

Tue 24 Oct 2023 (09:00 CEST) - bot sync meeting

present:
contribution policy (docs PR #108)
- still WIP, need to process feedback (Kenneth)
- give subsections a number so they're easy to refer to
release v0.1.1
- bug fixes / minor improvements or additions
  - merged PRs
    - #217 script to clean up tarballs of jobs given a PR number
    - #220 omit header in squeue output
    - #221 add instructions on setting Permissions for GitHub App
  - open PRs
    - #224 Upgrade bot to be compatible with PyGithub v2.1.1
      - minor code cleanup suggested by Kenneth
status release v0.2
- open PRs
  - bot: #222 add support to run test suite at the end of a build job
  - software: #366 [WIP] add scripts to run test suite after build job has finished
    - be careful with --sanity-check-only; we know it's not perfect in EasyBuild
    - check-test.sh script is missing
- maybe do in v0.3
  - deploy compat layer with bot?
    - avoid having the bot hardcode anything specific to the software layer, like the check for success
    - would a bot/deploy.sh be useful?
prioritized issues
- failing to apply PR patch should be reported in the PR (issue #212)

Tue 26 Sept 2023 (09:00 CEST) - bot sync meeting

Tue 12 Sept 2023 (09:00 CEST) - bot sync meeting

present: Bob, Thomas, Kenneth, Richard, Alan
status release v0.1
- all code polishing PRs merged, ready to tag v0.1?
- Thomas will go through README, check if anything is missing (like new configuration settings)
goals for v0.2
- testing step
- deploy compat layer with bot?
  - avoid having the bot hardcode anything specific to the software layer, like the check for success
new issues
- failing to apply PR patch should be reported in the PR (issue #212)
- fluke download failures issue #213
  - downloading sources first before building anything would already help a lot (early failure on download trouble)
  - just using a shared dir for sources would already help, so first build job is responsible for downloading
  - bot should provide location to a shared directory that can be used

Tue 29 August 2023 (09:00 CEST) - regular bot dev meeting

present: Bob, Kenneth, Caspar, Lara, Thomas
status release v0.1
- last two code polishing PRs #208 and #211
  - #208 merged by KH this morning (yeah)
- clean-up README.md
  - move revised content to EESSI/docs
  - short note about development of bot: main vs develop branches, etc.
next release v0.2
- develop under branch 'develop'
- support for testing (see meeting Aug 14)
- make deployment repository agnostic (change code so it does not analyse result of bot job but rather relies on result file produced by bot/check-build.sh)
  - essential for building compat layer with bot
- merge outstanding PRs [178, 181, 182]
  - these PRs need to be synced with develop branch first (once it's created after 0.1 release)
- provide means for upload script to deposit tarball and metadata file under different directories (use config setting instead of hardcoding it)
other
- SWL PR#317
  - idea: make it easier to investigate build jobs + prevent mistakes/omissions caused by setting up environment manually
  - provides bot/inspect.sh
  - called with path to tarball of last run (temporary storage)
  - could be accompanied with a bot-side script that gets a job id as parameter (side-effect: sensitive information can be removed from PR comments)
  - if merged it would be part of every build job's working directory, hence could be called without any parameters at all
  - for now, focus on the script
    - if someone needs access to a job working dir for running the inspect script, someone with access to bot account can copy it
    - later bot can be enhanced to copy working dir automatically to a specific cluster account via a bot command in PR
  - current implementation of bot/inspect.sh script assumes that tgz is located in job working directory
    - useful so Slurm output file is also available
    - but makes it more complicated to share what's needed to a random contributor (without sharing "sensitive" data)
- add bot/README.md which describes purpose of scripts
- deliverable: ask Alan to set up project on Overleaf

Mon 14 August 2023 (13:00 CEST) - integrate ReFrame tests into build-and-deploy workflow

present: Caspar, Kenneth, Lara, Thomas

stateDiagram
    direction LR

    created_ws: created
    [*] --> created_ws
    created_ws --> invalid
    created_ws --> valid
    valid --> submitted
    submitted --> running
    build_success: succeeded to build
    running --> build_success
    build_failure: failed to build
    running --> build_failure
    invalid --> updated
    invalid --> closed
    updated --> valid
    updated --> invalid
    build_failure --> updated
    build_failure --> closed
    closed --> [*]
    build_success --> build_tested
    build_tested --> uploaded
    uploaded --> ingest_staged
    ingest_staged --> test_staged
    test_staged --> approved
    test_staged --> rejected
    rejected --> closed
    approved --> ingested
    ingested --> added
    added --> [*]
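To make the diagram's transitions explicit (for example to validate status updates shown in a PR), they can be transcribed into a lookup table. The Python sketch below is a hypothetical illustration, not code from the bot: state names are the node ids used in the diagram (`created_ws`, `build_success`, `build_failure`, ...), the `[*]` end markers are modelled as the final states `closed` and `added`, and `is_valid_path` is an assumed helper name.

```python
from typing import Dict, FrozenSet, List

# Transitions transcribed from the stateDiagram; final states have no successors.
TRANSITIONS: Dict[str, FrozenSet[str]] = {
    "created_ws": frozenset({"invalid", "valid"}),
    "valid": frozenset({"submitted"}),
    "submitted": frozenset({"running"}),
    "running": frozenset({"build_success", "build_failure"}),
    "invalid": frozenset({"updated", "closed"}),
    "updated": frozenset({"valid", "invalid"}),
    "build_failure": frozenset({"updated", "closed"}),
    "build_success": frozenset({"build_tested"}),
    "build_tested": frozenset({"uploaded"}),
    "uploaded": frozenset({"ingest_staged"}),
    "ingest_staged": frozenset({"test_staged"}),
    "test_staged": frozenset({"approved", "rejected"}),
    "rejected": frozenset({"closed"}),
    "approved": frozenset({"ingested"}),
    "ingested": frozenset({"added"}),
    "closed": frozenset(),
    "added": frozenset(),
}
FINAL_STATES = frozenset({"closed", "added"})


def is_valid_path(path: List[str]) -> bool:
    """Return True if path starts at 'created_ws', only uses allowed
    transitions and ends in a final state."""
    if not path or path[0] != "created_ws" or path[-1] not in FINAL_STATES:
        return False
    return all(nxt in TRANSITIONS.get(cur, frozenset()) for cur, nxt in zip(path, path[1:]))


print(is_valid_path(["created_ws", "invalid", "closed"]))  # True
```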

when do we do testing?
- ...
what kind of testing?
- single node: during build-test-deploy done by bot
- multi node: after ingest (in a test repository)
- test on different host OS / different container
how to test?
- in a container
- native CVMFS
step by step
- start by letting the bot run tests in the build container (Debian as OS)
- can later also run tests in separate container (totally different OS, like Rocky Linux)
- [Thomas, Kenneth] start with supporting test.sh script that is run after build.sh and check-build.sh in bot-build.slurm job script;
  - test.sh calls test-suite.sh in build container;
- then we can pull it apart, let bot run test.sh script in a separate job (via bot-test.slurm job script), potentially in different contexts (diff. OS, diff. system), ...;
- also check-test.sh script that bot uses to determine result of test(s);
- eventually bot should feed information into test.sh script that can be used to only run certain tests;
  - tests in EESSI test suite should be tagged with software names for easy filtering;
  - "generic" tests like eb --sanity-check-only;
  - EESSI demo scripts;
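Tagging tests with software names, so that the bot can tell test.sh to run only the relevant ones, could work roughly as in the Python sketch below. This is an assumption-based illustration: `SuiteTest` is a hypothetical stand-in for a test (not a ReFrame class), the example test names are made up, and how the bot would obtain the software names for a PR is left open.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class SuiteTest:
    """Hypothetical stand-in for a test in the EESSI test suite."""
    name: str
    tags: FrozenSet[str] = field(default_factory=frozenset)


def select_tests(tests: Iterable[SuiteTest], software_names: Iterable[str]) -> List[str]:
    """Return names of tests tagged with at least one of the given software names.

    Matching is case-insensitive, so tag "GROMACS" matches software name "gromacs".
    """
    wanted = {name.lower() for name in software_names}
    return [test.name for test in tests if wanted & {tag.lower() for tag in test.tags}]


suite = [
    SuiteTest("gromacs_cpu", frozenset({"GROMACS"})),
    SuiteTest("tensorflow_cpu", frozenset({"TensorFlow"})),
    SuiteTest("osu_latency", frozenset({"OSU-Micro-Benchmarks"})),
]
print(select_tests(suite, ["gromacs"]))  # ['gromacs_cpu']
```

"Generic" checks such as `eb --sanity-check-only` would not depend on such tags and could simply always run.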
checking whether generic build is actually generic
- run test in build container on older system (via QEMU?)
setting up the test environment
- writable overlay (like in build container)
  - may be slow due to bookkeeping done by fuse-overlayfs;
  - downside is that added software is writable, could lead to false positives in tests
  - make overlay read-only?
  - we can actually re-use the overlay that was created in the build step (but then we're not checking the tarball that will be ingested);
- another option could be to bind mount the module files + software directories
  - more difficult, unclear if there's benefit;
- yet another option: could do a "local" ingest in a test CVMFS repository in the bot account, and then test with that repo;
  - how do we keep in sync with production EESSI repo?
  - seems slow...
development setup
- devel branch in bot repo to start developing;
  - set up separate "devel" instance of the bot that can be used from software-layer PRs;
  - work on test step here
- production instance of the bot runs main branch (latest release of the bot, soon v0.1);
other stuff for v0.2.x
- canceling build jobs?
  - not easy to find appropriate comment to update in PR
- better rebuild support?
  - short term we can just manually remove on Stratum-0 first
- result of deploy step is still EESSI specific in bot, should be fixed;
  - is important for building compat layer with the bot
other
- setting up an environment to manually replicate a build failure hit by the bot is quite painful now

Mon 14 August 2023 (10:00 CEST)

Ideally: run tests in another container, with a different OS than the build container
- i.e. a container with a different OS: create writable overlay => extract tarball => run test

Tue 1 August 2023 (13:30 CEST)

present: Thomas, Kenneth, Lara, Caspar (excused: Bob)
experiences with bot in recent weeks
- bot is working as expected
  - some issues with FFTW for foss/2022a and foss/2022b, but that's orthogonal to the bot itself
  - event-based aspect is very helpful, since that makes it responsive
- need access to bot account to figure out what went wrong
  - build logs are copied to /mnt/shared for failing build jobs
  - cfr. https://github.com/EESSI/software-layer/pull/302 + https://github.com/EESSI/software-layer/pull/305
  - this worked well enough for Lara to diagnose problems, but not to reproduce the problem manually
- procedure to resume working directory from tarballs kept by bot not well documented
  - only --resume option is documented at http://www.eessi.io/docs/getting_access/eessi_container/#resuming-a-previous-session
- feature request to be able to cancel build jobs
  - cfr. https://github.com/EESSI/eessi-bot-software-layer/issues/190
- problems with re-installing ReFrame
  - cfr. https://github.com/EESSI/software-layer/pull/311 + https://github.com/EESSI/software-layer/issues/312
problems with failing tests for OpenBLAS/FFTW/numpy due to numerical errors
- OpenBLAS for foss/2022b: https://github.com/EESSI/software-layer/pull/309
planning v0.1
- https://github.com/multixscale/planning/issues/42
- only thing left for bot implementation itself is "code polishing"
  - single sweep across each Python file, remove commented out code, add missing docstrings, etc.
  - real refactoring would require more tests first
different types of contributors
- maintainers: full access
- trusted contributors: can trigger builds, get access to build logs, etc.
- external contributors: can only open PRs, need help from others to trigger builds, figure out broken builds, etc.
  - should provide an easy way to allow them to reproduce the build as done by the bot
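One way to encode these contributor types is a mapping from role to allowed bot actions. The Python sketch below is only an illustration: the roles mirror the list above, but the action names (`open_pr`, `trigger_build`, `get_logs`, `deploy`) are hypothetical and not taken from the bot's configuration, and excluding `deploy` for trusted contributors is an assumption.

```python
from enum import Enum
from typing import Dict, FrozenSet


class Role(Enum):
    MAINTAINER = "maintainer"
    TRUSTED_CONTRIBUTOR = "trusted contributor"
    EXTERNAL_CONTRIBUTOR = "external contributor"


# Hypothetical action names; "maintainers: full access" is modelled as all actions.
ALL_ACTIONS: FrozenSet[str] = frozenset({"open_pr", "trigger_build", "get_logs", "deploy"})

ALLOWED_ACTIONS: Dict[Role, FrozenSet[str]] = {
    Role.MAINTAINER: ALL_ACTIONS,
    Role.TRUSTED_CONTRIBUTOR: frozenset({"open_pr", "trigger_build", "get_logs"}),
    Role.EXTERNAL_CONTRIBUTOR: frozenset({"open_pr"}),
}


def is_allowed(role: Role, action: str) -> bool:
    """Return True if a contributor with the given role may perform the action."""
    return action in ALLOWED_ACTIONS[role]
```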
proposal for contribution policy
- https://github.com/EESSI/docs/pull/108
- documented policy should be made more restrictive initially
  - with "internal" exceptions for project members, for now

Tue 18 July 2023 (09:00 CEST)

on leave: Thomas + Kenneth + Lara => skip meeting

Tue 4 July 2023 (09:00 CEST)

present: Kenneth, Lara, Bob (excused: Thomas)
merged PRs:
- PR#174 move check result to target repo (bot/check-result.sh)
- https://github.com/EESSI/eessi-bot-software-layer/pull/187
- https://github.com/EESSI/eessi-bot-software-layer/pull/189 + https://github.com/EESSI/software-layer/pull/275
- https://github.com/EESSI/software-layer/pull/278
docs
- https://www.eessi.io/docs/software_layer/adding_software
- https://www.eessi.io/docs/bot
bot/check-build.sh should not declare success on empty tarballs (issue #294)
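The "no success on empty tarballs" rule can be illustrated with Python's standard `tarfile` module. This sketches the check only (bot/check-build.sh itself is a shell script), and the helper name `tarball_has_files` is hypothetical.

```python
import tarfile


def tarball_has_files(path: str) -> bool:
    """Return True if the tarball at path contains at least one regular file.

    A tarball that is empty, or only contains directories, should not be
    reported as a successful build.
    """
    # "r:*" lets tarfile detect the compression (gzip, bzip2, xz) itself
    with tarfile.open(path, "r:*") as tar:
        return any(member.isfile() for member in tar)
```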
open problems
- job manager crashes every now and then, just restarting should work
  - https://github.com/EESSI/eessi-bot-software-layer/issues/142 (expired token?)
  - https://github.com/EESSI/eessi-bot-software-layer/issues/191 (self-inflicted?)
  - https://github.com/EESSI/eessi-bot-software-layer/issues/193 (temporary problem with GitHub?)
- no way to access logs when build failed
  - bot should copy build output to shared directory
  - bot/build.sh script should copy EasyBuild log file to working dir
- overload on staging PRs
  - see also https://github.com/EESSI/eessi-bot-software-layer/issues/192
planning
- v0.1 release (https://github.com/multixscale/planning/issues/42)
  - only contribution policy is missing
    - must be open source software
      - can eventually be verified automatically via SPDX license identifiers
    - must be built with bot - no human intervention
    - must be supported in EasyBuild release
      - can't use --from-pr to pull in new easyconfigs
    - only toolchains still supported in EasyBuild can be used
      - cfr. https://docs.easybuild.io/deprecated-easyconfigs/#deprecated_easyconfigs_toolchains + future policy https://github.com/easybuilders/easybuild/issues/872
    - should work on all CPU targets (exceptions allowed)
      - can only exclude specific CPU targets if problem can't be fixed with reasonable effort
    - should prefer recent software versions & toolchains
    - should be able to test via EESSI test suite
      - for example via --sanity-check-only

Tue 20 June 2023 (09:00 CEST)

present: Bob, Kenneth, Lara, Thomas
MultiXscale planning
- https://github.com/orgs/multixscale/projects/1/views/14
- targets for v0.1 of bot: https://github.com/multixscale/planning/issues/42
  - should revisit this
  - [June 13] shifted date for v0.1 to ~ September
  - [June 20] no change
bot development progress
- merged PRs:
  - PR#172 bot commands (help, show_config & build)
  - PR#183 don't use non-existent fatal_error function in bot-build.slurm script
  - PR#185 fix for correctly handling PR comment id
- closed PRs:
  - PR#127 providing an overview of a PR's status in its description ... stalled
  - PR#163 first outline of class to represent and manipulate pull request comments ... stalled
- open PRs:
  - PR#174 move check result to target repo (bot/check-result.sh)
    - reviewed
    - requires https://github.com/EESSI/compatibility-layer/pull/179 for compat layer builds
      - needs an update
    - requires https://github.com/EESSI/software-layer/pull/241 for software layer builds
      - reviewed
  - (Bob) PR#182 add PR comment id to metadata file uploaded to S3 ... attempt to make ingestion script running on Stratum-0 more efficient (see https://github.com/EESSI/filesystem-layer/pull/146)
    - turned out that ingestion script may require additional changes for more efficient handling of tarballs found in S3 bucket
      - => https://github.com/EESSI/filesystem-layer/issues/150
    - essentially S3 bucket could be restructured as follows
      - {bucket}/tarballs/... directory tree containing all tarballs as of now
      - {bucket}/STATE/... directory tree with identical hierarchy but only containing metadata files
      - STATE $\in$ {new, staged, pr_opened, approved, rejected, ingested}
      - ingestion script only needs to scan directories new, staged, pr_opened and approved and perform actions already defined, at the end it would move only the metadata file to a different directory (depending on the next state)
    - maybe rather do checkout of staging repo and do local file lookups only
    - related PR for ingestion script: https://github.com/EESSI/filesystem-layer/pull/146
  - (Kenneth) PR#181 default hardcoded comments (if no templates were defined in app.cfg)
  - (Thomas) PR#178 replay GitHub event locally ... clean implementation, not sure yet what we can use it for (maybe to run some tests)
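The proposed S3 bucket layout (tarballs stay under `tarballs/`, metadata files are mirrored under one directory per state) boils down to a simple key mapping. The Python sketch below illustrates it under assumptions: object keys are relative to `{bucket}`, the metadata file name is the tarball name plus a `.meta.txt` suffix (an assumption; adjust to whatever the upload script actually uses), and the states are those listed in the June 20 notes. The helper names are hypothetical.

```python
from pathlib import PurePosixPath

STATES = ("new", "staged", "pr_opened", "approved", "rejected", "ingested")
# States the ingestion script would still have to scan.
STATES_TO_SCAN = ("new", "staged", "pr_opened", "approved")
META_SUFFIX = ".meta.txt"  # assumption


def metadata_key(tarball_key: str, state: str) -> str:
    """Map 'tarballs/<rest>' to '<state>/<rest><META_SUFFIX>'."""
    if state not in STATES:
        raise ValueError(f"unknown state: {state!r}")
    parts = PurePosixPath(tarball_key).parts
    if len(parts) < 2 or parts[0] != "tarballs":
        raise ValueError(f"not a key under tarballs/: {tarball_key!r}")
    return str(PurePosixPath(state, *parts[1:])) + META_SUFFIX


def next_metadata_key(meta_key: str, new_state: str) -> str:
    """Return the key a metadata file gets when it moves to new_state."""
    if new_state not in STATES:
        raise ValueError(f"unknown state: {new_state!r}")
    parts = PurePosixPath(meta_key).parts
    if len(parts) < 2 or parts[0] not in STATES:
        raise ValueError(f"not a metadata key: {meta_key!r}")
    return str(PurePosixPath(new_state, *parts[1:]))
```

Since S3 has no rename operation, "moving" a metadata file would in practice mean copying it to the new key and deleting the old one.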
set up new S3 bucket for 2023.06: eessi-staging-2023.06
- to facilitate comparison when new ingestion procedure is implemented and used in NESSI
- need to update upload + ingestion scripts
- also need to update Lambda function that sends Slack message (s3-staging-notifier-python)
bot/check-result.sh being added in PR #241 is not entirely correct, see https://github.com/EESSI/software-layer/issues/266

Tue 6 June 2023 (09:00 CEST)

skipped due to sickness

Tue 23 May 2023 (09:00 CEST)

present: Bob, Kenneth, Lara, Thomas
discuss bot progress: open PRs, ongoing work
quick discussion of SWL progress and need for follow up meeting
MultiXscale planning
- https://github.com/orgs/multixscale/projects/1/views/14
- targets for v0.1 of bot: https://github.com/multixscale/planning/issues/42
  - should revisit this
  - [May 22] added comment regarding polishing of code
bot development progress
- merged PRs:
  - PR#171 Bot comments when labels are set without permission
  - PR#177 some fixes for letting the job manager handle non bot jobs correctly
  - PR#179 unit tests for determine_new_jobs() and determine_finished_jobs()
- closed PRs:
  - PR#85 tool to resubmit build job locally ... outdated
- open PRs:
  - (Kenneth review + follow-up w/ Thomas) PR#172 support for bot commands
    - needs documentation in README.md or in a better-suited document
    - see for example https://github.com/NorESSI/software-layer/pull/110#issuecomment-1556654869
  - draft PR#174 move check result to target repo (bot/check-result.sh) ... mainly a draft because matching PRs to compat layer and software layer need some more work
    - https://github.com/EESSI/compatibility-layer/pull/179
    - https://github.com/EESSI/software-layer/pull/241
    - check-result script should somehow indicate pass/fail?
  - (Bob review) PR#182 add PR comment id to metadata file uploaded to S3 ... attempt to make ingestion script running on Stratum-0 more efficient (see https://github.com/EESSI/filesystem-layer/pull/146)
    - turned out that ingestion script may require additional changes for more efficient handling of tarballs found in S3 bucket
      - => https://github.com/EESSI/filesystem-layer/issues/150
    - essentially S3 bucket could be restructured as follows
      - {bucket}/tarballs/... directory tree containing all tarballs as of now
      - {bucket}/STATE/... directory tree with identical hierarchy but only containing metadata files
      - STATE $\in$ {new, staged, approved, rejected, ingested, unknown}
      - ingestion script only needs to scan directories new, staged and approved and perform actions already defined; at the end it would move the metadata file to a different directory (depending on the next state)
    - maybe rather do checkout of staging repo and do local file lookups only
  - (Kenneth) PR#181 default hardcoded comments (if no templates were defined in app.cfg)
  - (Thomas) PR#178 replay GitHub event locally ... clean implementation, not sure yet what we can use it for (maybe to run some tests)
SWL progress
- GCC/10.3.0 toolchain (see EB foss toolchains)
  - ingested OpenMPI/4.1.1
  - built FFTW/3.3.9
  - built OpenBLAS/0.3.15 and OpenBLAS/0.3.15; the former uses PR#17 for the easyconfig, the latter uses an easyconfig added to the repo (plus it also handles the build for GENERIC CPU targets)
  - FlexiBLAS/3.0.4: not started (depends on OpenBLAS)
  - ScaLAPACK/2.1.0: not started
- GCC/11.3.0 + GCC/12.x problems
  - failing sanity check due to missing RPATH
  - maybe caused by https://github.com/easybuilders/easybuild-easyblocks/pull/2921 ?
- hope to use EasyBuild 4.7.2 for EESSI software-layer 2023.04
- we need a better way to deal with open EasyBuild PRs

Tue 09 May 2023 (09:00 CEST)

present: Bob, Kenneth (Thomas in full day workshop)

Tue 25 April 2023 (09:00 CEST)

MultiXscale planning
- https://github.com/orgs/multixscale/projects/1/views/14
- targets for v0.1 of bot: https://github.com/multixscale/planning/issues/42
  - should revisit this
discussion
- test & development infrastructure
  - use a different CernVM-FS infrastructure (different S0 & S1)
  - thus we can use the same directory structure and use the exact same builds for testing
- access to logs
  - has become technically more feasible with Terje's work (see https://github.com/terjekv/github-authorized-keys)
  - could maintain a team of builders on GitHub
  - team members have ssh access to a "log" server where log files are deposited
  - server could just be any VM with sufficient storage
  - instead of ssh access, maybe only scp / sftp is allowed?
- could Terje's work be used to let the bot run build jobs under the "account" of someone who opened a PR or someone who sent a bot command to the bot?
  - could work on AWS, not on most NESSI instances
- TODOs for v0.1 release
  - finish/merge open PRs
  - document how to use bot
  - create contribution policy
  - prepare/conduct tutorial/workshop for how to use bot
  - code cleanup (no refactoring or functional improvements such as error handling, just cleaning up code)
  - possibly start writing deliverable (outline structure, write parts that are done)
status
- merged
  - fix check for missing installations (software-layer PR #237)
  - only tar up software directories for which modules were generated (software-layer PR #239)
  - add bot comments in app.cfg.example (bot PR #170)
    - some necessary changes to follow-up on, see bot issue #173 and bot issue #176
- ready to get merged
  - [Kenneth] bot comments when labels are set without permission (bot PR #171)
    - PR was updated to implement requested changes
- reviewed, waiting for requested changes
  - n/a
- being reviewed
  - n/a
- ready to get reviewed
  - [Kenneth] support for sending commands to bot instances via PR comments (bot PR #172)
    - [Thomas] double check if description of PR is up-to-date
  - some fixes for letting the job manager handle non-bot jobs correctly (bot PR #177)
  - improve check_missing_installations.sh software-layer PR #244
- drafts
  - move check result to target repo (bot PR #174)
    - see also compat-layer PR #179
    - see also software-layer PR #241
    - want to improve interface between bot and target repos such that the bot does not do any processing of what should be added to a PR comment $\longrightarrow$ most flexible approach, anything we want to add can be added in scripts provided by a target repository
    - plan to use this also for bot/pre-build-analysis.sh (see below)
- work-in-progress
  - using pr_comment_id in tasks/deploy.py $\longrightarrow$ adding it to metadata uploaded to S3
  - preparing PRs for improvements to EESSI/filesystem-layer PR#90
    - comments added to PR when tarball is ingested
    - improving performance/efficiency of ingestion
  - bot/pre-build-analysis.sh runs inside a job but before the bot/build.sh script
    - could provide a list of easyconfigs to be built (both formatted to be added to a PR comment and as a plain list)
    - the result could be used by a maintainer and also list could then be use
  - capture/show progress of a bot job
    - bot would check/use contents of a file _bot_job.progress to update job progress information
    - considering to use EB hooks to update that file
    - could be useful for large/long running jobs to provide an update of what is currently being built
- next
  - @bot get log/tmp (difficult to manage access properly)
    - can work with permissions just like for build/deploy? (get_log permission)
    - maybe doable with Terje's work
  - start engaging with SURF on integrating testing
  - should deploy step of bot be split from actual uploading of tarball to S3?
    - COPY discussion from Slack
    - best way to make sure the AWS token doesn't get leaked is to make sure the bot doesn't have the AWS token at all
    - bot could copy tarball to a 'deploy' directory
    - totally separate cron job could pick up new tarballs from 'deploy' dir, and upload them to S3 bucket for ingestion
- stalled
  - MAYBE not necessary: @bot enable/disable (relatively easy?)
    - would allow instructing the bot to only build for a subset of targets before triggering a build (only */generic, etc.)
    - would drop this (bot: build is good enough)
    - KH: what about building for compat layer?
      - just one x86_64 and aarch64 is enough? that can be dealt with via a "bot: build" command instead of adding a bot:build label?
      - yes, bot: build arch:SOME_ARCHITECTURE inst:SOME_INSTANCE repo:SOME_REPOSITORY would be enough
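A command like `bot: build arch:... inst:... repo:...` splits naturally into a command word plus `key:value` filters. The Python sketch below is a hypothetical parser, not the bot's implementation: it takes the filter keys from the notes above but does not validate which keys are allowed, and the example values are made up.

```python
from typing import Dict, Optional, Tuple

BOT_PREFIX = "bot:"


def parse_bot_command(line: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Parse 'bot: <command> key:value ...' into (command, filters).

    Returns None if the line is not addressed to the bot; raises ValueError
    for a filter that is not of the form key:value.
    """
    line = line.strip()
    if not line.startswith(BOT_PREFIX):
        return None
    words = line[len(BOT_PREFIX):].split()
    if not words:
        return None
    command, filters = words[0], {}
    for word in words[1:]:
        key, sep, value = word.partition(":")  # split at the first ':' only
        if not (key and sep and value):
            raise ValueError(f"malformed filter {word!r}, expected key:value")
        filters[key] = value
    return command, filters


print(parse_bot_command("bot: build arch:x86_64/generic repo:eessi-2023.06"))
# ('build', {'arch': 'x86_64/generic', 'repo': 'eessi-2023.06'})
```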
longer discussion about securing the bot environment (copied from Slack)
- can we work with permissions just like for build/deploy? (get_log_permission)
- How would it help to define another set of permissions? The problem seems to be to share any data produced by a bot job with the one who created the PR (who wants to get a piece of software added to EESSI). We cannot email gigabytes of data nor can we provide direct access to the bot account on a build system. What we would need is a storage space (ideally to be created by the bot on demand) where the bot can deposit such data and configure it such that the “owner” of the PR can access it. The access should be configurable by the bot eg by using the PR’s owner public SSH key stored on GH.
- It’s mainly about making sure there’s no way that secrets can be leaked, like GitHub token, AWS token, etc. It’s not easy to make 100% sure that that’s impossible, especially with bot/build.sh being there. So, we could restrict things so only "approved" contributors can talk to the bot.
- To prevent a malicious person from trying to extract a token while nobody is looking. That’s what "get_log" permission could help to achieve. The GBs of stuff to share: that could be dealt with by letting the bot upload stuff to an S3 bucket or so?
- With "get_log" permissions, we can at least control who can ask to upload build logs to a public place, which limits what a malicious person can do?
- In a first iteration, we can also let the bot copy the EB log to a public directory on the AWS cluster. If you don’t have an account there, tough luck. That’s very restrictive, but it’s better than not being able to get access to the log at all. And then we can explore alternatives.
- Maybe external contributors should not get access to any bot job data at all? It should be easy to make a contribution (prepare and open a PR), but maybe not a good idea to let anyone control the bot or obtain data of the bot? If the bot cannot build it, the contributor could get instructions on how to reproduce this on her own infrastructure - essentially using eessi_container.sh with write access and then building the software.
- bot/build.sh is actually relatively well secured (by transferring configuration information via cfg/job.cfg). What’s missing is an assessment of how a job is submitted, that is, what environment settings may leak from the bot into a job. If that can be excluded, it would be good. One more thing to check/ensure is that no credential information is stored on a shared filesystem (between nodes where the bot is running and nodes where jobs are running) plus ensuring that code running inside a bot job cannot ssh into a node where the bot is running.
- Even if we secure access to logs, etc. a malicious user could simply deposit credentials in the software being built and later access it when it has been made available via CernVM-FS. So, more secure way forward seems to be to fully isolate a build job from the bot (and the credentials it needs).
- While adding more complexity splitting the deploy step in two phases may be worth the hassle (as sketched: 1. copy tarball to directory, 2. upload by separate process). One might come up with a similar setup for the build step (1. setup job in directory, 2. submit job by separate process). Thus, only information on disk could be leaked and we can probably much better control that.
- We could submit the build jobs with a different account than the one running the bot... That should prevent quite a bit of stuff.
- So, that would mean we i) run the job manager with a different account, ii) let the job manager submit the jobs, rather than the bot itself?
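The two-phase deploy idea (the bot only copies the tarball to a 'deploy' directory; a separate process holding the S3 credentials uploads it) could be sketched as the Python function below, run periodically (e.g. from cron) under a different account than the bot. This is a hypothetical sketch: `process_deploy_dir`, the `uploaded` subdirectory and the injected `upload` callable are assumptions, and the actual S3 upload is deliberately left out.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def process_deploy_dir(
    deploy_dir: Path,
    upload: Callable[[Path], None],
    done_subdir: str = "uploaded",
) -> List[Path]:
    """Upload every *.tar.gz in deploy_dir, then move it into deploy_dir/done_subdir.

    Only this process needs S3 credentials; the bot never sees them.
    If upload() raises, the remaining tarballs stay in place for the next run.
    """
    done_dir = deploy_dir / done_subdir
    done_dir.mkdir(exist_ok=True)
    processed: List[Path] = []
    for tarball in sorted(deploy_dir.glob("*.tar.gz")):
        upload(tarball)
        target = done_dir / tarball.name
        shutil.move(str(tarball), str(target))
        processed.append(target)
    return processed
```

To avoid picking up half-written files, the bot side would ideally write to a temporary name (e.g. ending in `.part`, which the `*.tar.gz` pattern does not match) and rename it once complete.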

Tue 11 April 2023 (09:00 CEST)

MultiXscale planning
- https://github.com/orgs/multixscale/projects/1/views/14
- targets for v0.1 of bot: https://github.com/multixscale/planning/issues/42
  - should revisit this
status
- merged
  - add issue comment id to metadata (bot PR #164)
  - add bot comments in app.cfg.example (bot PR #170)
    - some necessary changes to follow-up on, see bot issue #173
  - restore PATHs only after last run of pip installed eb (software-layer PR #238)
  - update to usage information and examples for using flag terminator (docs PR #103)
  - Script for automated ingestion of tarballs (filesystem-layer PR #90)
- ready to get merged
  - [Kenneth] bot comments when labels are set without permission (bot PR #171)
    - only lacks an update of the PR with main
  - [Kenneth] fix check for missing installations (software-layer PR #237)
  - only tar up software directories for which modules were generated (software-layer PR #239)
- reviewed, waiting for requested changes
- being reviewed
  - ...
- ready to get reviewed
  - [Kenneth] support for sending commands to bot instances via PR comments (bot PR #172)
  - move check result to target repo (bot PR #174)
    - see also compat-layer PR #179
- drafts
  - ...
- work-in-progress
  - using pr_comment_id in tasks/deploy.py --> adding it to metadata uploaded to S3
  - preparing PRs for improvements to EESSI/filesystem-layer PR#90
    - comments added to PR when tarball is ingested
    - improving performance/efficiency of ingestion
- next
  - fix SUCCESS/FAILURE messages when building with EB v4.7.0
    - use a bot/check-result.sh script, called from scripts/bot-build.slurm (job script run by bot)
    - bot expects a file _bot_jobJOB_ID.result in .ini format:
      [RESULT]
      summary = SUCCESS | FAILURE | UNKNOWN
      details =
          line 1 of details
          line 2 of details
          ...
          line n of details
      artefacts =
          line 1 of artefacts (eg tarball)
          ...
    - other information could be job_id, job_runtime, resource_usage, ...
  - @bot rebuild (relatively easy)
    - bot: rebuild
    - implemented via bot PR #172
  - @bot enable/disable (relatively easy?)
    - would allow instructing the bot to only build for a subset of targets before triggering a build (only */generic, etc.)
    - would drop this (bot: build is good enough)
    - KH: what about building for compat layer?
      - is building for just one x86_64 and one aarch64 target enough? that could be dealt with via a "bot: build" command instead of adding a bot:build label?
      - yes, bot: build arch:SOME_ARCHITECTURE inst:SOME_INSTANCE repo:SOME_REPOSITORY would be enough
  - @bot get log/tmp (difficult to manage access properly)
    - can work with permissions just like for build/deploy? (get_log permission)
  - start engaging with SURF on integrating testing
  - should deploy step of bot be split from actual uploading of tarball to S3?
    - the best way to avoid leaking the AWS token is to make sure that the bot doesn't have the AWS token at all
    - bot could copy tarball to a 'deploy' directory
    - totally separate cron job could pick up new tarballs from 'deploy' dir, and upload them to S3 bucket for ingestion
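A minimal Python sketch of how the bot side could read the _bot_jobJOB_ID.result file described above, using the standard-library `configparser` (which accepts indented continuation lines for multi-line values such as `details` and `artefacts`). The function name and the fallback to `UNKNOWN` are assumptions, not the bot's actual implementation.

```python
import configparser

VALID_SUMMARIES = {"SUCCESS", "FAILURE", "UNKNOWN"}


def read_job_result(text: str) -> dict:
    """Parse the [RESULT] section of a job result file (hypothetical helper)."""
    # interpolation=None so '%' characters in log snippets are taken literally
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section("RESULT"):
        return {"summary": "UNKNOWN", "details": [], "artefacts": []}
    section = parser["RESULT"]

    def lines(key: str) -> list:
        # configparser joins indented continuation lines with '\n'
        return [l.strip() for l in section.get(key, "").splitlines() if l.strip()]

    summary = section.get("summary", "UNKNOWN").strip()
    if summary not in VALID_SUMMARIES:
        summary = "UNKNOWN"  # anything unexpected is treated as unknown
    return {"summary": summary, "details": lines("details"), "artefacts": lines("artefacts")}
```

Treating a missing section or an unexpected summary as `UNKNOWN` keeps the bot from reporting success when the job script crashed before writing a proper result.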

Tue 28 Mar 2023 (09:00 CEST)

status
- merged
  - EESSI/eessi-bot-software-layer PR#155 bot side code for bot/build.sh
  - EESSI/eessi-bot-software-layer PR#169 bug fix, see issue 165
  - EESSI/software-layer PR#233 bot/build.sh
  - EESSI/software-layer PR#240 configuration file for bot instance on AWS
- ready to get merged?
  - none
- being reviewed
  - [Jonas] EESSI/eessi-bot-software-layer PR#164 add issue comment id to metadata (incl unit tests for two functions)
- ready to get reviewed
  - [Kenneth] EESSI/docs PR#103 update to usage information and examples for using flag terminator (related to PR#233)
  - [Thomas] EESSI/eessi-bot-software-layer PR#170 make PR comment updates configurable via app.cfg
  - [Kenneth] EESSI/software-layer PR#237 fix for check_missing_installations.sh (required for EB v4.7.0)
  - [Kenneth] EESSI/software-layer PR#238 restore PATHs only after last run of pip installed eb (required when building for a new architecture)
  - [Kenneth] EESSI/software-layer PR#239 only tar up software directories for which modules were generated (fixes issue 225, see particularly comment in #225)
- drafts
  - none
- work-in-progress
  - PR#163 first outline of class to represent and manipulate pull request comments
    - to handle PR comments more efficiently (fewer GitHub API calls)
    - can also be used to support asking bot to rebuild for a particular architecture
  - using pr_comment_id in tasks/deploy.py --> adding it to metadata uploaded to S3
  - preparing PRs for improvements to EESSI/filesystem-layer PR#90
    - comments added to PR when tarball is ingested
- next
  - fix SUCCESS/FAILURE messages when building with EB v4.7.0
    - use a bot/build-check.sh script, call it from the Slurm build job script used by the bot;
    - should touch a job.success or job.fail file (or a job.result file) that could contain some info on the failure, e.g.:
      - result=success
      - msg=...
      - job_id=12345
      - job_time=123456s
  - @bot rebuild (relatively easy)
    - bot: rebuild
  - @bot enable/disable (relatively easy?)
    - would allow instructing the bot to only build for a subset of targets before triggering a build (only */generic, etc.)
  - @bot get log/tmp (difficult to manage access properly)
  - start engaging with SURF on integrating testing
- misc
  - a lot of work using bot for NESSI and ideas for improvements
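Adding pr_comment_id to the metadata uploaded with a tarball means the ingestion side can update the right PR comment directly (e.g. when the tarball is ingested) instead of searching for it through the GitHub API. A hypothetical Python sketch of that round trip; the flat JSON layout and field names are illustrative assumptions, not the format used by tasks/deploy.py.

```python
import json


def make_metadata(tarball: str, repo: str, pr_number: int, pr_comment_id: int) -> str:
    """Deploy side (hypothetical): metadata JSON uploaded next to the tarball."""
    meta = {
        "tarball": tarball,
        "repo": repo,
        "pr": pr_number,
        # id of the bot's PR comment that should be updated later on
        "pr_comment_id": pr_comment_id,
    }
    return json.dumps(meta, indent=2, sort_keys=True)


def comment_to_update(metadata_json: str):
    """Ingestion side (hypothetical): (repo, pr, comment id), or None if no id was recorded."""
    meta = json.loads(metadata_json)
    if meta.get("pr_comment_id") is None:
        return None
    return meta["repo"], meta["pr"], meta["pr_comment_id"]
```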
Bob's progress on compat layer 2023.03
- big PR #163
  - has been split up into smaller PRs, see #166-#169
  - PR #166 needs fix included in PR #167
  - PR #163 can be synced up with main branch after PRs #166-#169 have been merged
  - tasks/deploy.py in bot needs to be tweaked to strip out hardcoding for software layer

Thu/Fri 23/24 Mar 2023 -- MultiXscale kickoff

plan for bot
- GitHub App (bot) to automate workflow to add software to EESSI #41
  - M06 Minimum viable product v0.1: build-and-deploy workflow, enables contributions #42
  - M09 v0.5: integrate testing in workflow + improved status overview in PRs #43
  - M12 v1.0: stable version, incl. dashboard with overview of all bot instances #44
- Set up infrastructure to build/test/deploy software in EESSI using bot #45
  - depends on (initial) set of supported architectures
- Define contribution policy + checklist of requirements for contributions #46
- Document semi-automated workflow to contribute to EESSI #47
- D5.1 Community contribution policy and GitHub App #48

Fri 17 Mar 2023 (13:00 CET)

working on PRs, Kenneth, Thomas
approved EESSI/software-layer PR#233
reviewed EESSI/eessi-bot-software-layer PR#155
other open PRs
- EESSI/docs PR#103
- EESSI/eessi-bot-software-layer PR#164 add issue comment id to metadata
  - assigned to Jonas
- EESSI/software-layer PR#237 fix check for missing installations
  - assigned to Kenneth
- TODO
  - PR for fixes to create_tarball.sh script
  - retrigger bot for a specific CPU arch with a comment
    - @bot please rebuild for graviton2

Tue 14 Mar 2023 (09:00 CET)

status
- merged
  - DONE EESSI/docs PR#100
  - DONE EESSI/software-layer PR#232
  - Added 24h default time limit (PR #156)
  - Fix for bug in determine_running_jobs in job_manager (PR #153)
  - Renamed upload_to_s3_script to tarball_upload_script (PR #154)
- ready to get merged?
  - EESSI/software-layer PR#233
  - ok after review
  - should get tested together with https://github.com/EESSI/eessi-bot-software-layer/pull/155
  - docs PR https://github.com/EESSI/docs/pull/103
- ready to get reviewed
  - PR#164 add issue comment id to metadata (incl unit tests for two functions)
    - to avoid many requests to GitHub REST API
    - can be reviewed by Jonas (or Bob)
  - PR#103 update to usage information and examples for using flag terminator (related to PR#233)
- almost ready for review? EESSI/eessi-bot-software-layer PR#155
  - should be tested in conjunction with https://github.com/EESSI/software-layer/pull/233
- very early draft
  - PR#163 first outline of class to represent and manipulate pull request comments
    - to handle PR comments more efficiently (fewer GitHub API calls)
    - can also be used to support asking bot to rebuild for a particular architecture
- work-in-progress
  - Jonas is working on making text being used in comments configurable
Bob's progress on compat layer 2023.03
- big PR #163
  - has been split up into smaller PRs, see #166-#169
  - PR #166 needs fix included in PR #167
  - PR #163 can be synced up with main branch after PRs #166-#169 have been merged
  - tasks/deploy.py in bot needs to be tweaked to strip out hardcoding for software layer
ideas for future features
- Now, the “most” desired new feature by software builders seems to be the ability to rebuild a specific job.
- We could make the bot listen to a well-formatted comment like "@bot please rebuild for foo"?
- Yes. We need to restructure the PR comments. We would have just one PR comment per target architecture + target repository. Maybe:
  - Intro "Instance XYZ building for arch ARCH and repo REPO/VERSION ..."
  - Instance config (enabled/disabled, builders, deployers, RAM, local disk, CPU cores)
  - PR assessment (n of k easyconfigs missing, estimated)
  - Last three states (human readable)
  - All states (expandable, human readable), includes list of received commands
  - Log of all updates (json, expandable), includes list of received commands
  - Link to docs
  - command "shell" (here is where we could add commands)
- We could begin making a mockup.
- Could be like a control center in a PR comment. Then at the top of the PR we really may only need an overview to easily see what is done, what not.
- Not everything maybe, and hopefully not everything in a single PR. I’ll try at least.
- NESSI team members also asked for the ability to cancel a job. Or get a bit more detailed status on request.
- From the perspective of someone maintaining the bot network, it would be nice to have an overview of running jobs, available resources (disk space, API request rate), bot instance health (running or not), health of other services (smee server, autoingest script on S0, S3 bucket server), ...
- Was also wondering if a script inspect.sh --job-id JOBID could be useful? You could run it from a working directory or supply the job id, then it would automatically launch the container with the correct parameters. It might also check if the job is still running. Maybe we want to update eessi_container.sh such that we can use the overlay of a build session, but with read-only access.
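The "listen to a well-formatted comment" idea could start from a small parser like the Python sketch below. It follows the syntax the notes settle on later ("bot: rebuild", "bot: build arch:SOME_ARCHITECTURE inst:SOME_INSTANCE repo:SOME_REPOSITORY"); the class and function names, and the exact grammar, are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BotCommand:
    name: str                                    # e.g. "build" or "rebuild"
    filters: dict = field(default_factory=dict)  # e.g. {"arch": "aarch64/graviton2"}


def parse_bot_commands(comment: str) -> list:
    """Collect 'bot: <command> key:value ...' lines from a PR comment (hypothetical syntax)."""
    commands = []
    for raw in comment.splitlines():
        line = raw.strip()
        if not line.startswith("bot:"):
            continue
        words = line[len("bot:"):].split()
        if not words:
            continue
        filters = {}
        for arg in words[1:]:
            key, sep, value = arg.partition(":")
            if sep and key and value:
                filters[key] = value
        commands.append(BotCommand(words[0], filters))
    return commands


def applies_to(cmd: BotCommand, arch: str, inst: str, repo: str) -> bool:
    """True if every filter in the command matches this bot instance's target.

    A missing filter matches everything; an unknown filter key matches nothing.
    """
    target = {"arch": arch, "inst": inst, "repo": repo}
    return all(target.get(key) == value for key, value in cmd.filters.items())
```

Each bot instance would parse every new comment and act only on commands for which `applies_to()` is true, which gives "rebuild for a specific target" without extra labels.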

Tue 07 Mar 2023 (09:00 CET)

work on open PRs, Kenneth, Thomas
DONE EESSI/docs PR#100
DONE EESSI/software-layer PR#232
EESSI/software-layer PR#233
- bot/build.sh using eessi_container.sh
- update EESSI/docs (--help) + example of using --
- address requested changes & suggestions
EESSI/eessi-bot-software-layer PR#155
- support for bot/build.sh in software layer
- discuss if/which tasks should be done, which could be postponed
  - more testing should be done (a similar version has been in use for NESSI for ~2 weeks + this exact version has been tested with a few jobs ... more testing is beneficial)
  - documentation for defining repositories needs to be added
  - currently jobs are generated as follows:
    - iterate over arch_target_map entries (a list of linux/CPU_FAMILY/microarchitecture (abbreviated arch below) to slurm_opts mappings)
      - iterate over repo_target_map[arch] entries (repository identifiers)
    - this results in one job for each combination of arch+repo_id
    - should be the other way around? iterating over entries in repo_target_map (which already contains all combinations arch + repo_id) then just getting slurm_opts from arch_target_map?
    - if there is no matching arch key in repo_target_map or arch_target_map an error should be logged and the bot shall continue (currently results in a crash)
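The proposed alternative order (iterate over repo_target_map, look up slurm_opts in arch_target_map, log and skip instead of crashing) can be sketched in Python as below. The dictionary shapes (arch to list of repo ids, arch to Slurm options string) are assumptions based on the description above, not copied from the bot's code.

```python
import logging

log = logging.getLogger("bot.job_generation")  # hypothetical logger name


def generate_jobs(arch_target_map: dict, repo_target_map: dict) -> list:
    """Return (arch, repo_id, slurm_opts) tuples, one per job to submit.

    Iterates over repo_target_map (all arch + repo_id combinations) and looks up
    slurm_opts in arch_target_map; archs without Slurm options are logged and
    skipped instead of crashing the bot.
    """
    jobs = []
    for arch, repo_ids in repo_target_map.items():
        slurm_opts = arch_target_map.get(arch)
        if slurm_opts is None:
            log.error("no entry for %s in arch_target_map, skipping %s", arch, repo_ids)
            continue
        for repo_id in repo_ids:
            jobs.append((arch, repo_id, slurm_opts))
    # archs with Slurm options but no repositories simply get no jobs; warn about them
    for arch in arch_target_map.keys() - repo_target_map.keys():
        log.warning("no entry for %s in repo_target_map, no jobs for it", arch)
    return jobs
```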

Tue 28 Feb 2023 (09:00 CET)

Thomas+Bob on vacation --> drop

Fri 24 Feb 2023 (13:00 CET)

work on open PRs, Kenneth, Thomas
EESSI/docs PR#100
- warning: not for production use
- [TR] read once to give ok to get it merged
EESSI/software-layer PR#232
- final polishing and testing ... almost ready to get merged

Mon 20 Feb 2023 (11:00 CET)

work on open PRs, Kenneth, Thomas

Tue 14 Feb 2023 (09:00 CET)

status/goals
- [Thomas]
  - finish bot/build.sh work, install version on Saga, Fram and eX3/Fox
    - works, however not prepared as PRs yet, suggested procedure
      - working version of SWL in https://github.com/trz42/software-layer/blob/enhancement/improvements_to_job_env/bot/build.sh (currently used for adding software to NESSI)
      - first finish SWL PR#227 fixing tests for '--access rw'
      - then do follow-up PR to sync with what currently runs for NESSI bot network; rework PR#226
      - do separate PR for bot/build.sh from PR#226 (requires yq to parse cfg file)
      - finally update BOT PR#148 (may require some additional clean-up and documentation); bot should create a cfg file to pass info to bot/build.sh script
  - determine good first issue for Jonas
    - #107 rename cfg option upload_to_s3_script
      - first develop unit test, see issue #149
    - #133 fix bug in determining running jobs
      - first develop unit test, see issue #150
    - #146 submit job with default time limit
      - first develop unit test, see issue #151
  - try compat-layer PR #163
    - (Thomas) no time
    - (Bob) can trigger bot to build compat layer, fails due to bootstrap trouble (but bot part works fine)
      - code that picks up tarball needs work (separate bot/deploy.sh script)
      - start from https://github.com/EESSI/compatibility-layer/pull/160 and https://github.com/EESSI/gentoo-overlay/pull/84
      - Bob will look into building new compat layer (2023.02)
  - unplanned
    - adding documentation for eessi_container ... plus reworking "Getting access to EESSI" & "Using EESSI" PR#100
      - in progress
      - see also Bob's PR https://github.com/EESSI/docs/pull/69 + https://github.com/EESSI/docs/pull/85
      - later also a "Contributing to EESSI" page; can use https://github.com/NorESSI/software-layer/wiki/Making-a-pull-request-to-NorESSI-software-layer as starting point
  - misc
    - workshop for NESSI colleagues on using the bot for building software for NESSI/EESSI
      - see https://github.com/NorESSI/software-layer/wiki/Making-a-pull-request-to-NorESSI-software-layer
      - hopefully activates more people to use bot which should lead to improvements
    - idea to extend NESSI bot network to AWS cluster
      - using new nessibot account
- [Kenneth]
  - figure out why OpenFOAM build doesn't work (software-layer PR #195)
  - easily retrigger individual builds from PR
    - see also Thomas' PR #85
  - work with Thomas to get his PRs merged
    - bot
      - (upcoming PR, see https://github.com/EESSI/eessi-bot-software-layer/compare/main...trz42:eessi-bot-software-layer:enhancement/bot-build-with-swl-pr216)
    - software-layer
      - https://github.com/EESSI/software-layer/pull/215 (merged!)
      - https://github.com/EESSI/software-layer/pull/216 (merged!)
      - (upcoming PR, see https://github.com/trz42/software-layer/tree/enhancement/bot-build-with-swl-216)
      - https://github.com/EESSI/eessi-bot-software-layer/pull/85
- [Bob]
  - test compat-layer PR #163 with bot using bot:build label
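Following the "first develop unit test" approach for the good first issues, issue #146 (submit job with default time limit) could look like this Python sketch, with the test written alongside the fix. The function name `sbatch_args` and the option handling are hypothetical; the 24h default mirrors the default time limit later added in PR #156.

```python
import unittest

DEFAULT_TIME_LIMIT = "24:00:00"  # 24h default (assumed format for sbatch --time)


def sbatch_args(slurm_opts: str, job_script: str, time_limit: str = DEFAULT_TIME_LIMIT) -> list:
    """Build an sbatch command line, adding --time unless slurm_opts already sets it."""
    args = ["sbatch"] + slurm_opts.split()
    # match -t, -tVALUE, --time and --time=VALUE, but not e.g. --time-min
    has_time = any(a == "--time" or a.startswith(("-t", "--time=")) for a in args[1:])
    if not has_time:
        args.append(f"--time={time_limit}")
    args.append(job_script)
    return args


class TestSbatchArgs(unittest.TestCase):
    def test_default_time_added(self):
        self.assertEqual(sbatch_args("--nodes=1", "job.slurm"),
                         ["sbatch", "--nodes=1", "--time=24:00:00", "job.slurm"])

    def test_configured_time_kept(self):
        self.assertEqual(sbatch_args("--time=2:00:00", "job.slurm"),
                         ["sbatch", "--time=2:00:00", "job.slurm"])
```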

Fri 10 Feb 2023 (09:00 CET)

status PR#216
- some tests for --access rw fail, not immediately clear why (no obvious reasons from checking VM setup: disks type and size)
- comment out 2 tests, merge PR and make follow-up PR for investigating issue
discussed documentation for "Using EESSI", see PR#100
- see suggestions for improvements in PR
- agreed to restructure documentation as follows (also see https://hackmd.io/4IZZuK2fSQeh2TbZCPm9iA)
EESSI pilot repository [Kenneth]
- keep the warning
- point to other pages
Getting access to EESSI
- Native installation [Kenneth]
- Using the EESSI container [Thomas]
  - incl. save/resume
Using EESSI
- Setting up your environment [Thomas]
  - source
  - R --version
  - module avail output (see /pilot)
  - Useful $EESSI env vars
- Running EESSI demos [Kenneth]
Building software for EESSI (follow-up PR, not included in #100)
- Manual procedure (for testing)
  - Using eb --installpath
  - Using read/write access in build container
- Making a contribution to EESSI
  - Contribution policy
  - Workflow
    - Overview
    - EESSI maintainer tasks
    - Contributor tasks
  - Recommendations

Wed 8 Feb 2023 (09:00 CET)

goal: discuss PR#216
TODOs:
- need to address requested changes
- need tests for parameters

snapshot of older notes available at https://github.com/multixscale/meetings/wiki/sync-meetings-bot-T5.3

Tue 14 Feb 2023 (09:00 CET)

goals
- [Thomas]
  - finish bot/build.sh work, install version on Saga, Fram and eX3/Fox
  - determine good first issue for Jonas
  - try compat-layer PR #163
- [Kenneth]
  - figure out why OpenFOAM build doesn't work (software-layer PR #195)
  - easily retrigger individual builds from PR
    - see also Thomas' PR #85
  - work with Thomas to get his PRs merged
    - bot
      - (upcoming PR, see https://github.com/EESSI/eessi-bot-software-layer/compare/main...trz42:eessi-bot-software-layer:enhancement/bot-build-with-swl-pr216)
    - software-layer
      - https://github.com/EESSI/software-layer/pull/215 => sync call at Wed 1 Feb at 13:30 CET
      - https://github.com/EESSI/software-layer/pull/216
      - (upcoming PR, see https://github.com/trz42/software-layer/tree/enhancement/bot-build-with-swl-216)
      - https://github.com/EESSI/eessi-bot-software-layer/pull/85
- [Bob]
  - test compat-layer PR #163 with bot using bot:build label

Tue 31 Jan 2023 (09:00 CET)

agenda
- status (see below)
- overall progress ... how to move on? ... how to avoid PRs waiting too long before getting reviewed (also: how to avoid diverging development branches)?
  - maybe we need somewhat more realistic goals?
status
- PRs merged
  - #136 (try and except in pr_comments) including several unit tests using fixtures and mocking --> opened issue #145 to revisit tests and improve the code quality
    - cleaning up tests is WIP by Kenneth, needs more work (but not urgent)
  - #144 (run bot/build.sh script in build job script, if it exists)
    - Kenneth started creating bot/build.sh script for software-layer, very WIP still...
  - software-layer #210 (R 4.1.0) - all installs (except ppc64le) built with bot
- PRs open
  - #127 (providing an overview of a PR's status ...)
  - #85 (tool to resubmit a build job locally)
  - software-layer #195 (OpenFOAM v9)
    - first attempt to let bot build it failed (consistently?)
    - looks like a problem with a dependency?
  - compat layer: https://github.com/EESSI/compatibility-layer/pull/163
    - incl. bot/build.sh script
- issues closed
  - none
- issues opened
  - #146 (submit build jobs with default time limit, add configuration option to use different time limit)
  - #145 (improve code quality of unit tests)
- work in progress
  - [Thomas] major rework of jobscript submitted by bot and the interface between the job and the "payload"
    - based on discussion on 19 Jan
    - bot repo:
      - https://github.com/trz42/eessi-bot-software-layer/tree/enhancement/bot-build-with-swl-pr216
    - swl repo:
      - https://github.com/trz42/software-layer/tree/enhancement/bot-build-with-swl-216
      - example PR to be built using the above branch https://github.com/trz42/software-layer/tree/add-CaDiCaL-1.3.0-GCC-9.3.0-NESSI
    - added a little feature preview where one could enable/disable build targets in a pull request, see https://github.com/NorESSI/software-layer/pull/1#issuecomment-1407726812
    - for passing down environment variables into compat layer env, see https://github.com/EESSI/software-layer/pull/220/files
- [Jonas] got bot running on Saga
- other
  - setting up GH project (NorESSI) and bot instances for NESSI to let project members use the bot for building software stack
    - should cover x86_64/generic, x86_64/amd/zen2, x86_64/intel/broadwell, x86_64/intel/skylake_avx512 and x86_64/intel/cascadelake
    - idea is to eventually use HEAD version of EESSI/software-layer and move all repository specific config/settings out of the software-layer repo to the configuration of a bot instance (e.g., a bot instance could build for EESSI or NESSI using the exact same PR)
- discussion
  - need a way to remove files/directories before building (to trigger a rebuild), and also for the ingestion to remove files from the cvmfs repository, see issue #147
planned goals
- [Thomas]
  - Hafsa's open PRs (#136): finish & merge
    - TODO revisit all C* cases (testing update_comment)
  - code & doc cleanup
  - resubmit PR (#85)
    - mark as draft for now (don't expect others to review)
    - try to let Bob review it so he gets introduced to parts of the code
  - resume work on #127 (PR status overview)
  - try EESSI/software-layer #216 with bot (addressing issues #135, #98, #88, #42)
- [Kenneth]
  - implement end-to-end test
    - should get PR #136 merged first since that includes lots of tests with mocking
  - migrate docs in README to mkdocs website
    - incl. adding high-level bot overview to docs
- [Bob]
  - set up current bot and play with it
    - app.cfg.example should be cleaned up, has some duplicate sections
    - maybe try bot for new compat layer build script
- [Jonas]
  - try bot

Thu 19 Jan 2023 (10:30 CET) - make bot job script use bot/build.sh from repo

present: Bob, Kenneth, Thomas
reason
- can be used for software layer, compat layer, automating build of cluster-specific software stack, ...
- maybe also for testing EB easyconfig PRs?
status w.r.t hardcoded stuff in current bot implementation
- bot side: scripts/eessi-bot-build.slurm
- software-layer: install_software_layer.sh, EESSI-pilot-install-software.sh, run_in_compat_layer.sh
  - ./build_container.sh run /tmp/$USER/EESSI $PWD/install_software_layer.sh
- how to check whether a build worked is currently hardcoded in the job manager
  - see "EESSIBotSoftwareLayerJobManager.process_finished_job"
  - maybe job manager should call a bot-check-build.sh script?
  - or bot-build.sh script should touch a SUCCESS/FAIL file?
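The second option (the build script touches a SUCCESS or FAIL file and the job manager only inspects the job directory) keeps the job manager free of build-specific checks. A hypothetical Python sketch; only the SUCCESS/FAIL file names come from the notes, the function and the handling of missing or conflicting markers are assumptions.

```python
from pathlib import Path


def build_status(job_dir: Path) -> str:
    """Derive a build result from marker files in a job directory (hypothetical helper)."""
    success = (job_dir / "SUCCESS").exists()
    fail = (job_dir / "FAIL").exists()
    if success and not fail:
        return "SUCCESS"
    if fail and not success:
        return "FAILURE"
    # neither marker (job killed before writing one) or both (inconsistent state)
    return "UNKNOWN"
```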
use bot/build.sh script
- we'll eventually also have other scripts that the bot should use, like test/deploy/create_tarball/...
stuff to communicate from bot to bot/build.sh script
- via a JSON file that bot/build.sh script can pick up?
- fields
  - $CPU_TARGET
  - HTTP proxy to use (if any)
  - for which repo software should be built
    - EESSI, NESSI, test/develop repo, ...
  - for which (EESSI) version software should be built
  - cvmfs_customizations
  - tmpdir to use?
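The notes consider a JSON file here; elsewhere they mention passing this information via cfg/job.cfg instead. A hypothetical Python sketch of the bot rendering such an INI-style file with `configparser`; the section and key names are made up for illustration, only the kinds of information come from the list above.

```python
import configparser
import io
import json
from typing import Optional


def job_cfg_text(cpu_target: str, repo_id: str, version: str,
                 http_proxy: str = "", cvmfs_customizations: Optional[dict] = None,
                 tmpdir: str = "") -> str:
    """Render a hypothetical job config file for bot/build.sh."""
    cfg = configparser.ConfigParser(interpolation=None)
    cfg["architecture"] = {"cpu_target": cpu_target}
    cfg["repository"] = {"repo_id": repo_id, "repo_version": version}
    site = {}
    if http_proxy:
        site["http_proxy"] = http_proxy
    if tmpdir:
        site["tmpdir"] = tmpdir
    if cvmfs_customizations:
        # nested settings are stored as a JSON string in a single value
        site["cvmfs_customizations"] = json.dumps(cvmfs_customizations)
    cfg["site"] = site  # optional settings are simply omitted when not set
    out = io.StringIO()
    cfg.write(out)
    return out.getvalue()
```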
what should bot still do?
- prepare working directory for job
- run some checks (in job script)
  - verify CPU target (since we bypass archspec)
    - to verify whether bot is correctly configured
    - requires latest archspec (or archdetect, or both)
steps
- [KH] update eessi-bot-build.slurm job script to run bot/build.sh if it's there
  - fall back to what it does now if not
- make different bot/build.sh scripts
  - [KH] for current EESSI pilot (software-layer)
  - [TR] for PR#216 (eessi_container.sh), Thomas tries this on local system
  - [BD] Bob has already some script for using bot @ RUG
  - [BD] Bob has one for the compat layer
  - [KH] bot/build.sh for VSC/UGent + for testing easyconfig PRs
other wild ideas
- bot:build label could be made more fine-grained
- examples:
  - bot:build:nessi
  - bot:build:eessi:aarch64/graviton2
    - or maybe better via comments that give specific instructions to bot?
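If the label route were taken, the bot would need to decide per instance whether a fine-grained label like bot:build:eessi:aarch64/graviton2 concerns it. A hypothetical Python sketch, assuming labels of the form bot:build[:REPO[:ARCH]] where a missing part means "all"; the functions and this exact label grammar are illustrative only.

```python
def parse_build_label(label: str):
    """Return (repo, arch) from 'bot:build[:REPO[:ARCH]]', or None for other labels."""
    parts = label.split(":", 3)
    if parts[:2] != ["bot", "build"]:
        return None
    repo = parts[2] if len(parts) > 2 and parts[2] else None
    arch = parts[3] if len(parts) > 3 and parts[3] else None
    return repo, arch


def should_build(label: str, repo: str, arch: str) -> bool:
    """True if a bot instance building for (repo, arch) should act on this label."""
    parsed = parse_build_label(label)
    if parsed is None:
        return False
    want_repo, want_arch = parsed
    # None acts as a wildcard: plain 'bot:build' matches every instance
    return want_repo in (None, repo) and want_arch in (None, arch)
```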

Tue 17 Jan 2023 (09:00 CET)

present: Bob, Kenneth, Thomas
status
- PRs merged
  - #131 (dedicated log)
  - #132 (retry after exception when getting GitHub token)
  - #137 (expose error when read_config fails)
  - #138 (use .diff patch file to set up build job ...)
  - #141 (let bot pass CPU target to ... build job ...)
- PRs open
  - #136 (try and except in pr_comments)
    - using retry library, allows to use a @retry decorator for specific functions
      - see https://pypi.org/project/retry (not actively maintained, last update: May 2016)
      - switch to retry2? or pyretry (https://pypi.org/project/py-retry)? or https://pypi.org/project/the-retry, ...
    - sleep mocked
    - some code cleanup done
    - TODO revisit all C* cases (testing update_comment)
  - #127 (providing an overview of a PR's status ...)
  - #85 (tool to resubmit a build job locally)
- issues closed
  - #140 (bot build script should make sure correct CPU target is used)
  - #125 (dedicated log method in event handler + job manager)
- issues opened
  - #139 (find alternative for retry package)
  - #142 (job manager crash due to "Bad credentials")
    - may be fixed by retry in PR #136
- practical problems hit when build R 4.1.0 with bot
  - crashing job manager
    - see issue #142, should be fixed by PR #136
  - crashing installation due to fluke (GitHub --from-pr trouble)
    - would be nice to be able to instruct bot to retry for a specific CPU target
  - overview of current status across all CPU targets is difficult => see ongoing work on letting bot create overview in PR description (issue #93 + PR #127)
- other (see goals below + any other work related to bot)
  - related work on EESSI/software-layer
    - merged PR #220 (update script to take into account $EESSI_SOFTWARE_SUBDIR_OVERRIDE ...)
    - merged PR #218 (... script for checking on missing installations ...)
  - lots of work towards more (unit) testing, reading up on mocking, fixtures, etc.
  - Jonas: intro to GH apps, setting up demo GH app, intro to HPC, setting up bot
  - making bot "agnostic" of EESSI
    - let build job script call a "bot-build.sh" script available in the repository
    - will help both for using bot to manage own install stack, but also for letting bot build compat layer
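Since the `retry` package is unmaintained (issue #139), one option is a small hand-rolled decorator. The Python sketch below is hypothetical (not the code from PR #136) and shows both the retry logic and how a unit test can mock `time.sleep`, as mentioned for PR #136, so retries don't slow the test suite down.

```python
import functools
import time
from unittest import mock


def retry(exceptions, tries=3, delay=1.0, backoff=2.0):
    """Retry on `exceptions`, sleeping delay, delay*backoff, ... between attempts."""
    if tries < 1:
        raise ValueError("tries must be at least 1")

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == tries:
                        raise  # give up: re-raise the last exception
                    time.sleep(wait)
                    wait *= backoff
        return wrapper
    return decorator


def test_retry_with_mocked_sleep():
    calls = []

    @retry(ConnectionError, tries=3, delay=5)
    def flaky():
        calls.append(1)
        if len(calls) < 3:
            raise ConnectionError("simulated transient failure")
        return "ok"

    # patch time.sleep so the test runs instantly and the waits can be inspected
    with mock.patch("time.sleep") as fake_sleep:
        assert flaky() == "ok"
    assert len(calls) == 3
    assert [c.args[0] for c in fake_sleep.call_args_list] == [5, 10.0]
```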
discussion
- adopt strategy to have test implemented before change?
- analyse result of build job
  - number of ec built
  - number of files in tarball
- retry in build script
- bot could retry to build for a target
- general overview of what is being built by all bot instances
- tune number of cpu cores used, amount of memory/disk space needed
  - what components need to be updated, only bot, only eb, both ...
  - collect information about resource usage of bot jobs
- define interface between bot and "software-layer" or "compat-layer" or ...
  - bot-build.sh in target repo
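Analysing a build job's result could start with simple counts over the produced tarball. A hypothetical Python sketch using the standard-library `tarfile` module; it assumes EasyBuild's usual layout, where a copy of each easyconfig is kept in <installdir>/easybuild/ (copies under easybuild/reprod/ are skipped so each installation counts once).

```python
import tarfile
from pathlib import PurePosixPath


def tarball_summary(path: str) -> dict:
    """Count regular files and easyconfig copies in a build tarball (hypothetical helper)."""
    with tarfile.open(path, "r:*") as tar:
        files = [m.name for m in tar.getmembers() if m.isfile()]
    # assumption: one <name>.eb per installation directly under .../easybuild/
    easyconfigs = [
        name for name in files
        if name.endswith(".eb") and PurePosixPath(name).parent.name == "easybuild"
    ]
    return {"files": len(files), "easyconfigs": len(easyconfigs)}
```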
goals
- [Thomas]
  - Hafsa's open PRs (#131, #132, #136): finish & merge
  - code & doc cleanup
  - resubmit PR (#85)
    - mark as draft for now (don't expect others to review)
    - try to let Bob review it so he gets introduced to parts of the code
- [Kenneth]
  - implement end-to-end test
    - should get PR #136 merged first since that includes lots of tests with mocking
  - migrate docs in README to mkdocs website
    - incl. adding high-level bot overview to docs
- [Bob]
  - set up current bot and play with it
    - app.cfg.example should be cleaned up, has some duplicate sections
- [Jonas]
  - intro to bot
building new compat layer
- one script to build compat layer in container with Ansible playbook
- Ansible playbooks don't require root permission anymore, they just assume that /cvmfs is writable

task kickoff meeting (Thu 5 Jan 2023)

present: Bob, Kenneth, Thomas
status bot
- working prototype
  - running bot instances get notified of GitHub events
  - bot submits build jobs when bot:build label is added
  - bot runs upload script to S3 bucket when bot:deploy label is added
  - example: https://github.com/trz42/software-layer/pull/46
  - maybe script to ingest tarballs should be integrated with the bot?
- issues https://github.com/EESSI/eessi-bot-software-layer/issues
- open PRs https://github.com/EESSI/eessi-bot-software-layer/pulls
- actually using current prototype helps a lot to identify things that can be improved
  - ~1000 jobs launched by bot in scope of NESSI project (6 bot instances)
first steps (not fully ordered)
- finish open PRs -- [Thomas]
  - except #129 "pull request overview"
- end-to-end test (code coverage) -- [Kenneth takes another look]
- code+doc cleanup sprint -- [Thomas has a look in parallel]
  - docstring
  - string formatting
  - code consistency
  - README -> mkdocs conversion => out of scope for this sprint
- agree on development practices
  - 2-pairs of eyes rule: never merge your own PR
  - keep docs up-to-date along with changes being made
  - code style: honor the Hound CI
  - keep PRs smallish - relatively easy to review
  - aim for short lifetime of PRs - get them reviewed/merged quickly
  - make sure that changes being made are covered by the tests
    - once we have an end-to-end test
    - do add unit tests for new/tweaked functions
- prepare a first release 0.1.0
  - label issues accordingly
  - should include docs (mkdocs)
  - package and publish on PyPI?
  - most bot instances should be only running released version
  - one bot running "develop" to have live testing of recent changes
- document how to set up/use bot for different environments: production, development, testing
  - purpose
  - scope (EESSI, MXS, personal)
  - setting them up
  - maintaining / operating them
- agree on roadmap for bot
  - what should be done by v0.1.0
    - features that make our life easier
    - end-to-end test
  - also for next releases
    - what makes life of other people easier
  - which features should be implemented by end of 2023 (MXS milestone)
midterm goals
- policy for community contributions
  - LICENSE clear (automatically checkable/verifiable?)
  - easyconfig available -- builds on generic target (maybe guidelines for testing this with eessi_container?)
  - "compatible" with current EESSI version
    - not too many missing installations per PR
  - (ReFrame tests available)
  - (documentation available)
  - same for all dependencies
- build software for next EESSI pilot version via bot
  - do this ourselves to check that bot is working well
- get others to "use" the bot via community contributions (PRs to EESSI/software-layer)
  - set up hackathon to get people to open PRs that bot acts on
long term goals
- make bot usable for EESSI (and beyond)
- let bot build entire software stacks quickly and using resources carefully
  - building in parallel, splitting up builds across multiple jobs (via EB)?
- document bot in a (research) paper
  - maybe for JOSS (Journal of Open Source Software - https://joss.theoj.org)
  - or talk/presentation (automation or HPC devroom at FOSDEM?)
next meeting(s)
- every two weeks
- not on Mon/Fri (Thomas), not Wed (Bob)
- Tue 17 Jan'23 - 09:00 CET
- Tue 31 Jan'23 - 09:00 CET
- Tue 14 Feb'23 - 09:00 CET
internal milestones (for organising work) -- maybe rather think in releases
- production & development environment described/set up
- mapping involved components and their interfaces
  - documentation => to be written by Thomas/Kenneth, reviewed by Bob?
dependencies
- access to (HPC systems) (Slurm) resources for running bot instances
  - should cover all CPU (later also GPU) targets used in EESSI
  - running bot instances with personal accounts or robot accounts or build mostly on AWS/Azure
  - also depending on who can set the labels
- (new) compatibility layer
- (new) Stratum 0 - dedicated server at RUG + under eessi.io domain
