sync meetings bot T5.3
live notes during meeting: https://hackmd.io/z6hu9gkoR6G1yCichemorA
Snapshot of meeting notes is included below, cut-and-paste from HackMD note
See end of document for task admin info
Current target: v0.1 release of build-and-deploy bot, see https://github.com/multixscale/planning/issues/42
- present: Thomas, Kenneth, Lara, Pedro, Bob, Caspar
- contribution policy in place \o/
- MultiXscale deliverable
- WIP by Thomas, with feedback from Kenneth
- in very good shape, 90% done
- Thomas: revise document (starting Fri 24 nov noon-ish)
- Pedro/Bob: review afterwards, as 3rd reviewer
- PRs for test phase in bot
- assumed to be in place in MultiXscale deliverable, so should get merged ASAP
- PR #222 to bot should be done (but is untested)
- this is the critical PR that should get merged ASAP so we have a new bot release (v0.2.0)
- logic was copy-paste from build phase, so should be OK
- PR #366 to software-layer is WIP
- only `bot/check-test.sh` script is missing - can use a dummy `bot/check-test.sh` script for now (if needed)
- cleanup for `software.eessi.io` PR #229
- pretty trivial PR
- next meeting: Wed 13 Dec'23, 10:00-11:00 CET
- present: Pedro, Bob, Alan, Kenneth, Thomas, Lara
- contribution policy
- recap
- purpose: enable contributions from anyone by defining objective criteria for what software can be included and how
- content: set of requirements that are easy to understand, easy to satisfy, easy to verify (some automatically, some manually) // some requirements may not be enforced, but we inform about our intent to do so in the future
- requirements:
- a) open source software
- b) built by the bot
- c) supported by EasyBuild release
- d) supported compiler toolchain
- e) CPU microarchitectures
- f) versions & toolchains
- g) testing
- personal view (TR): mostly fine, doesn't need to be perfect, but no software in EESSI/2023.06 should violate it; need it published NOW 😃
- things to change in current draft policy:
- a) open source software
- should be reworded to "allowed to redistribute" + strongly prefer open source software
- b) built by the bot
- c) supported by EasyBuild
- an EasyBuild release
- remove 2nd line
- d) compiler toolchain
- latest EasyBuild release
- e) recent software versions & toolchains
- sufficiently recent software versions & toolchains
- software versions & toolchains should be separate sections in policy
- e) recent software versions
- We strongly prefer installing sufficiently recent software versions.
- f) recent toolchain generations
- EESSI administrators determine which toolchains are installed. Software should be added to an existing easystack file.
- Open a support request if existing toolchains are not sufficient.
- connection between compat layer & toolchain generations
- only install latest N toolchain generations in new compat layer
- limited overlap across compat layer versions
- g) CPU targets
- target CPU microarchitectures?
- "supported by the EESSI version to which the software is contributed"
- h) testing
- very loose, good enough for v0.1 of the policy
- should be refined to test for applications, user-facing, etc.?
- hard to make it stricter right now, since bot doesn't yet support testing the installations in different contexts
- should mention that tests run by EasyBuild must pass (with exceptions allowed + must be documented in issue)
- "Requirements" section should say "there may be other restrictions, on a case by case"
- for software.eessi.io => stick to 2022b & newer in 2023.06
- release v0.1.1
- merge PR #224
- do quick test of current `develop` branch before tagging v0.1.1 release
- D5.1
- started
- should finish a draft by ??? (latest end of next week, task internal comments, project internal review, ...)
- needs contribution policy
- would like to have v0.1.1 (bug fixes) & v0.2.0 (support for basic tests)
- aob
- initial support for testing
- bot side is done (but untested), see https://github.com/EESSI/eessi-bot-software-layer/pull/222
- software layer PR needs work (missing script): https://github.com/EESSI/software-layer/pull/366
- D5.2 (support portal)
- Lara is close to having a finished draft
- D1.1 (EESSI)
- Bob & Pedro have a bullet list to kickstart the deliverable
- software.eessi.io
- PRs to merge
- https://github.com/EESSI/filesystem-layer/pull/160
- needs to be updated with new pubkey for repo
- (merged) https://github.com/EESSI/filesystem-layer/pull/161
- https://github.com/EESSI/cvmfs-servers/pull/1
- (Bob) install & use new key
- and update PR accordingly
- (Bob) update Stratum-1's with new key
- (Thomas) update bot config @ https://github.com/EESSI/bot-configs/tree/main/mc-aws-rocky88-202310/repos
- get EESSI config in cvmfs-config repo (https://github.com/cvmfs-contrib/config-repo)
- (Thomas,Kenneth) install security updates in 2023.06 compat layer
- document procedure in EESSI support portal wiki?
- consider ingesting into a subdir + creating a symlink
- install 2023.06 software layer for software.eessi.io
- present:
- contribution policy (docs PR #108)
- still WIP, need to process feedback (Kenneth)
- give subsection number so they're easy to refer to
- release v0.1.1
- bug fixes / minor improvements or additions
- status release v0.2
- open PRs
- maybe do in v0.3
- deploy compat layer with bot?
- avoid that bot hardcodes anything specific to software layer, like check for success
- would a `bot/deploy.sh` be useful?
- prioritized issues
- failing to apply PR patch should be reported in PR (issue #212)
- present: Bob, Thomas, Kenneth, Richard, Alan
- status release v0.1
- all code polishing PRs merged, ready to tag v0.1?
- Thomas will go through README, check if anything is missing (like new configuration settings)
- goals for v0.2
- testing step
- deploy compat layer with bot?
- avoid that bot hardcodes anything specific to software layer, like check for success
- new issues
- failing to apply PR patch should be reported in PR (issue #212)
- fluke download failures (issue #213)
- downloading sources first before building anything would already help a lot (early failure on download trouble)
- just using a shared dir for sources would already help, so first build job is responsible for downloading
- bot should provide location to a shared directory that can be used
- present: Bob, Kenneth, Caspar, Lara, Thomas
- status release v0.1
- next release v0.2
- develop under branch 'develop'
- support for testing (see meeting Aug 14)
- make deployment repository agnostic (change code so it does not analyse result of bot job but rather relies on result file produced by bot/check-build.sh)
- essential for building compat layer with bot
- merge outstanding PRs [178, 181, 182]
- these PRs need to be synced with `develop` branch first (once it's created after 0.1 release)
- provide means for upload script to deposit tarball and metadata file under different directories (use config setting instead of hardcoding it)
- other
- SWL PR#317
- idea: make it easier to investigate build jobs + prevent mistakes/omissions caused by setting up environment manually
- provides `bot/inspect.sh`
- called with path to tarball of last run (temporary storage)
- could be accompanied with a bot-side script that gets a job id as parameter (side-effect: sensitive information can be removed from PR comments)
- if merged it would be part of every build job's working directory, hence could be called without any parameters at all
- for now, focus on the script
- if someone needs access to a job working dir for running the inspect script, someone with access to bot account can copy it
- later bot can be enhanced to copy working dir automatically to a specific cluster account via a bot command in PR
- current implementation of bot/inspect.sh script assumes that tgz is located in job working directory
- useful so Slurm output file is also available
- but makes it more complicated to share what's needed to a random contributor (without sharing "sensitive" data)
- add bot/README.md which describes purpose of scripts
- deliverable, ask Alan to set up project on Overleaf
- present: Caspar, Kenneth, Lara, Thomas
stateDiagram
direction LR
created_ws: created
[*] --> created_ws
created_ws --> invalid
created_ws --> valid
valid --> submitted
submitted --> running
build_success: succeeded to build
running --> build_success
build_failure: failed to build
running --> build_failure
invalid --> updated
invalid --> closed
updated --> valid
updated --> invalid
build_failure --> updated
build_failure --> closed
closed --> [*]
build_success --> build_tested
build_tested --> uploaded
uploaded --> ingest_staged
ingest_staged --> test_staged
test_staged --> approved
test_staged --> rejected
rejected --> closed
approved --> ingested
ingested --> added
added --> [*]
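
The state diagram above can also be written down as a transition table, which makes it easy to check whether a recorded sequence of states is allowed. This is only a minimal sketch: the state names are taken from the diagram, while `START`/`END` (standing in for the `[*]` pseudo-states) and the helper names `can_transition` and `is_valid_history` are made up for illustration and are not part of the bot.

```python
# Transition table transcribed from the state diagram.
# "START" and "END" are hypothetical stand-ins for the diagram's [*] pseudo-states.
TRANSITIONS = {
    "START": ("created_ws",),
    "created_ws": ("invalid", "valid"),
    "valid": ("submitted",),
    "submitted": ("running",),
    "running": ("build_success", "build_failure"),
    "invalid": ("updated", "closed"),
    "updated": ("valid", "invalid"),
    "build_failure": ("updated", "closed"),
    "closed": ("END",),
    "build_success": ("build_tested",),
    "build_tested": ("uploaded",),
    "uploaded": ("ingest_staged",),
    "ingest_staged": ("test_staged",),
    "test_staged": ("approved", "rejected"),
    "rejected": ("closed",),
    "approved": ("ingested",),
    "ingested": ("added",),
    "added": ("END",),
}


def can_transition(current, target):
    """Return True if the diagram has an arrow from current to target."""
    return target in TRANSITIONS.get(current, ())


def is_valid_history(states):
    """Return True if every consecutive pair of states is an allowed transition."""
    return all(can_transition(a, b) for a, b in zip(states, states[1:]))


# A contribution that builds, passes tests and ends up in the repository.
HAPPY_PATH = [
    "START", "created_ws", "valid", "submitted", "running", "build_success",
    "build_tested", "uploaded", "ingest_staged", "test_staged", "approved",
    "ingested", "added", "END",
]

if __name__ == "__main__":
    print(is_valid_history(HAPPY_PATH))
    print(is_valid_history(["valid", "running"]))  # skips "submitted"
```
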
- when do we do testing?
- ...
- what kind of testing?
- single node: during build-test-deploy done by bot
- multi node: after ingest (in a test repository)
- test on different host OS / different container
- how to test?
- in a container
- native CVMFS
- step by step
- start with letting bot run tests in build container (Debian as OS)
- can later also run tests in separate container (totally different OS, like Rocky Linux)
- [Thomas, Kenneth] start with supporting `test.sh` script that is run after `build.sh` and `check-build.sh` in `bot-build.slurm` job script;
- `test.sh` calls `test-suite.sh` in build container;
- then we can pull it apart, let bot run `test.sh` script in a separate job (via `bot-test.slurm` job script), potentially in different contexts (diff. OS, diff. system), ...;
- also `check-test.sh` script that bot uses to determine result of test(s);
- eventually bot should feed information into `test.sh` script that can be used to only run certain tests;
- tests in EESSI test suite should be tagged with software names for easy filtering;
- "generic" tests like `eb --sanity-check-only`;
- EESSI demo scripts;
- checking whether generic build is actually generic
- run test in build container on older system (via QEMU?)
- setting up the test environment
- writable overlay (like in build container)
- may be slow due to bookkeeping done by fuse-overlayfs;
- downside is that added software is writable, could lead to false positives in tests
- make overlay read-only?
- we can actually re-use the overlay that was created in the build step (but then we're not checking the tarball that will be ingested);
- other option could be a bind mount of module files + software directories
- more difficult, unclear if there's benefit;
- yet another option: could do a "local" ingest in a test CVMFS repository in the bot account, and then test with that repo;
- how do we keep in sync with production EESSI repo?
- seems slow...
- development setup
- `devel` branch in bot repo to start developing;
- set up separate "devel" instance of the bot that can be used from software-layer PRs;
- work on test step here
- production instance of the bot runs `main` branch (latest release of the bot, soon v0.1);
- other stuff for v0.2.x
- canceling build jobs?
- not easy to find appropriate comment to update in PR
- better rebuild support?
- short term we can just manually remove on Stratum-0 first
- result of deploy step is still EESSI specific in bot, should be fixed;
- is important for building compat layer with the bot
- other
- setting up an environment to manually replicate a build failure that the bot hits is quite painful now
- Ideally: run tests with other container, with other OS than in build container
- I.e. container with different OS, create writeable overlay => extract tarball => run test
- present: Thomas, Kenneth, Lara, Caspar (excused: Bob)
- experiences with bot in recent weeks
- bot is working as expected
- some issues with FFTW for foss/2022a and foss/2022b, but that's orthogonal to the bot itself
- event-based aspect is very helpful, since that makes it responsive
- need access to bot account to figure out what went wrong
- build logs are copied to /mnt/shared for failing build jobs
- cfr. https://github.com/EESSI/software-layer/pull/302 + https://github.com/EESSI/software-layer/pull/305
- this worked well enough for Lara to diagnose problems, but not to reproduce the problem manually
- procedure to resume working directory from tarballs kept by bot not well documented
- only `--resume` option is documented at http://www.eessi.io/docs/getting_access/eessi_container/#resuming-a-previous-session
- feature request to be able to cancel build jobs
- problems with re-installing ReFrame
- problems with failing tests for OpenBLAS/FFTW/numpy due to numerical errors
- OpenBLAS for foss/2022b: https://github.com/EESSI/software-layer/pull/309
- planning v0.1
- https://github.com/multixscale/planning/issues/42
- only thing left for bot implementation itself is "code polishing"
- single sweep across each Python file, remove commented out code, add missing docstrings, etc.
- real refactoring would require more tests first
- different types of contributors
- maintainers: full access
- trusted contributors: can trigger builds, get access to build logs, etc.
- external contributors: can only open PRs, need help from others to trigger builds, figure out broken builds, etc.
- should provide an easy way to allow them to reproduce the build as done by the bot
- proposal for contribution policy
- https://github.com/EESSI/docs/pull/108
- documented policy should be made more restrictive initially
- with "internal" exceptions for project members, for now
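
The three contributor types listed above could be captured as a role-to-permission table. The sketch below is purely illustrative: the action names (`open_pr`, `trigger_build`, `read_build_logs`, `deploy`) are hypothetical labels for what the notes describe, "full access" for maintainers is modelled as "all listed actions", and none of this reflects how the bot actually stores permissions.

```python
from enum import Enum


class Role(Enum):
    MAINTAINER = "maintainer"
    TRUSTED = "trusted contributor"
    EXTERNAL = "external contributor"


# Hypothetical action names; the notes only describe the roles informally.
ALL_ACTIONS = frozenset({"open_pr", "trigger_build", "read_build_logs", "deploy"})

PERMISSIONS = {
    # maintainers: full access
    Role.MAINTAINER: ALL_ACTIONS,
    # trusted contributors: can trigger builds, get access to build logs, etc.
    Role.TRUSTED: frozenset({"open_pr", "trigger_build", "read_build_logs"}),
    # external contributors: can only open PRs
    Role.EXTERNAL: frozenset({"open_pr"}),
}


def is_allowed(role, action):
    """Return True if the given role may perform the given action."""
    return action in PERMISSIONS[role]


if __name__ == "__main__":
    for role in Role:
        allowed = sorted(a for a in ALL_ACTIONS if is_allowed(role, a))
        print(f"{role.value}: {', '.join(allowed)}")
```
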
- on leave: Thomas + Kenneth + Lara => skip meeting
- present: Kenneth, Lara, Bob (excused: Thomas)
- merged PRs:
- PR#174 move check result to target repo (`bot/check-result.sh`)
- https://github.com/EESSI/eessi-bot-software-layer/pull/187
- https://github.com/EESSI/eessi-bot-software-layer/pull/189 + https://github.com/EESSI/software-layer/pull/275
- https://github.com/EESSI/software-layer/pull/278
- docs
- bot/check-build.sh should not declare success on empty tarballs (issue #294)
- open problems
- job manager crashes every now and then, just restarting should work
- https://github.com/EESSI/eessi-bot-software-layer/issues/142 (expired token?)
- https://github.com/EESSI/eessi-bot-software-layer/issues/191 (self-inflicted?)
- https://github.com/EESSI/eessi-bot-software-layer/issues/193 (temporary problem with GitHub?)
- no way to access logs when build failed
- bot should copy build output to shared directory
- bot/build.sh script should copy EasyBuild log file to working dir
- overload on staging PRs
- planning
- v0.1 release (https://github.com/multixscale/planning/issues/42)
- only contribution policy is missing
- must be open source software
- can eventually be verified automatically via SPDX license identifiers
- must be built with bot - no human intervention
- must be supported in EasyBuild release
- can't use --from-pr to pull in new easyconfigs
- only toolchains still supported in EasyBuild can be used
- should work on all CPU targets (exceptions allowed)
- can only exclude specific CPU targets if problem can't be fixed with reasonable effort
- should prefer recent software versions & toolchains
- should be able to test via EESSI test suite
- for example via --sanity-check-only
- present: Bob, Kenneth, Lara, Thomas
- MultiXscale planning
- https://github.com/orgs/multixscale/projects/1/views/14
- targets for v0.1 of bot: https://github.com/multixscale/planning/issues/42
- should revisit this
- [June 13] shifted date for v0.1 to ~ September
- [June 20] no change
- bot development progress
- merged PRs:
- closed PRs:
- open PRs:
- PR#174 move check result to target repo (`bot/check-result.sh`)
- reviewed
- requires https://github.com/EESSI/compatibility-layer/pull/179 for compat layer builds
- needs an update
- requires https://github.com/EESSI/software-layer/pull/241 for software layer builds
- (Bob) PR#182 add PR comment id to metadata file uploaded to S3 ... attempt to make ingestion script running on Stratum-0 more efficient (see https://github.com/EESSI/filesystem-layer/pull/146)
- turned out that ingestion script may require additional changes for more efficient handling of tarballs found in S3 bucket
- essentially S3 bucket could be restructured as follows
- {bucket}/tarballs/... directory tree containing all tarballs as of now
- {bucket}/STATE/... directory tree with identical hierarchy but only containing metadata files
- STATE $\in \{\text{new}, \text{staged}, \text{pr\_opened}, \text{approved}, \text{rejected}, \text{ingested}\}$
- ingestion script only needs to scan directories new, staged, pr_opened and approved and perform actions already defined, at the end it would move only the metadata file to a different directory (depending on the next state)
- maybe rather do checkout of staging repo and do local file lookups only
- related PR for ingestion script: https://github.com/EESSI/filesystem-layer/pull/146
- (Kenneth) PR#181 default hardcoded comments (if no templates were defined in `app.cfg`)
- (Thomas) PR#178 replay GitHub event locally ... clean implementation, not sure yet what we can use it for (maybe to run some tests)
- set up new S3 bucket for 2023.06: `eessi-staging-2023.06`
- to facilitate comparison when new ingestion procedure is implemented and used in NESSI
- need to update upload + ingestion scripts
- also need to update Lambda function that sends Slack message (`s3-staging-notifier-python`)
- bot/check-result.sh being added in PR #241 is not entirely correct, see https://github.com/EESSI/software-layer/issues/266
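
The proposed bucket layout (one directory tree per state, holding only metadata files) can be tried out on a local directory before touching S3. The sketch below is an illustration under assumptions: it uses a plain directory as a stand-in for the bucket, the `.meta.txt` suffix and the helper names `pending` and `advance` are hypothetical, and it is not the ingestion script itself.

```python
import shutil
from pathlib import Path

# States as listed in the notes; only the first four need to be scanned.
STATES = ("new", "staged", "pr_opened", "approved", "rejected", "ingested")
SCANNED = ("new", "staged", "pr_opened", "approved")

META_SUFFIX = ".meta.txt"  # assumed naming convention for metadata files


def pending(root):
    """Yield (state, relative path) for metadata files that still need action."""
    for state in SCANNED:
        state_dir = Path(root) / state
        if not state_dir.is_dir():
            continue
        for meta in sorted(state_dir.rglob("*" + META_SUFFIX)):
            yield state, meta.relative_to(state_dir)


def advance(root, state, rel_path, next_state):
    """Move only the metadata file to the tree of the next state."""
    if next_state not in STATES:
        raise ValueError(f"unknown state: {next_state}")
    source = Path(root) / state / rel_path
    target = Path(root) / next_state / rel_path
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(source), str(target))
    return target


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        meta = Path(tmp) / "new" / "2023.06" / ("example.tar.gz" + META_SUFFIX)
        meta.parent.mkdir(parents=True)
        meta.write_text("{}")
        for state, rel in list(pending(tmp)):
            print(state, rel, "->", advance(tmp, state, rel, "staged"))
```

Because tarballs stay under their own tree and only small metadata files move, a scan touches far fewer objects than walking every tarball on each run.
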
- present: Bob, Kenneth, Lara, Thomas
- discuss bot progress open PRs, ongoing work
- quick discussion of SWL progress and need for follow up meeting
- MultiXscale planning
- https://github.com/orgs/multixscale/projects/1/views/14
- targets for v0.1 of bot: https://github.com/multixscale/planning/issues/42
- should revisit this
- [May 22] added comment regarding polishing of code
- bot development progress
- merged PRs:
- closed PRs:
- PR#85 tool to resubmit build job locally ... outdated
- open PRs:
- (Kenneth review + follow-up w/ Thomas) PR#172 support for bot commands
- need documentation in README.md or better document
- see for example https://github.com/NorESSI/software-layer/pull/110#issuecomment-1556654869
- draft PR#174 move check result to target repo (`bot/check-result.sh`) ... mainly a draft because matching PRs to compat layer and software layer need some more work
- https://github.com/EESSI/compatibility-layer/pull/179
- https://github.com/EESSI/software-layer/pull/241
- check-result script should somehow indicate pass/fail?
- (Bob review) PR#182 add PR comment id to metadata file uploaded to S3 ... attempt to make ingestion script running on Stratum-0 more efficient (see https://github.com/EESSI/filesystem-layer/pull/146)
- turned out that ingestion script may require additional changes for more efficient handling of tarballs found in S3 bucket
- essentially S3 bucket could be restructured as follows
- {bucket}/tarballs/... directory tree containing all tarballs as of now
- {bucket}/STATE/... directory tree with identical hierarchy but only containing metadata files
- STATE $\in \{\text{new}, \text{staged}, \text{approved}, \text{rejected}, \text{ingested}, \text{unknown}\}$
- ingestion script only needs to scan directories new, staged and approved and perform actions already defined, at the end it would move a metadata file to a different directory (depending on the next state)
- maybe rather do checkout of staging repo and do local file lookups only
- (Kenneth) PR#181 default hardcoded comments (if no templates were defined in `app.cfg`)
- (Thomas) PR#178 replay GitHub event locally ... clean implementation, not sure yet what we can use it for (maybe to run some tests)
- SWL progress
- GCC/10.3.0 tool chain (see EB foss toolchains)
- ingested OpenMPI/4.1.1
- built FFTW/3.3.9
- built OpenBLAS/0.3.15 and OpenBLAS/0.3.15; the former uses PR#17 for the easyconfig, the latter uses an easyconfig added to the repo (plus it also handles the build for `GENERIC` CPU targets)
- not started/depending on OpenBLAS FlexiBLAS/3.0.4
- not started ScaLAPACK/2.1.0
- GCC/11.3.0 + GCC/12.x problems
- failing sanity check due to missing RPATH
- maybe caused by https://github.com/easybuilders/easybuild-easyblocks/pull/2921 ?
- hope to use EasyBuild 4.7.2 for EESSI software-layer 2023.04
- we need a better way to deal with open EasyBuild PRs
- present: Bob, Kenneth (Thomas in full day workshop)
- MultiXscale planning
- https://github.com/orgs/multixscale/projects/1/views/14
- targets for v0.1 of bot: https://github.com/multixscale/planning/issues/42
- should revisit this
- discussion
- test & development infrastructure
- use a different CernVM-FS infrastructure (different S0 & S1)
- thus we can use the same directory structure and use the exact same builds for testing
- access to logs
- has become technically more feasible with Terje's work (see https://github.com/terjekv/github-authorized-keys)
- could maintain a team of builders on GitHub
- team members have ssh access to a "log" server where log files are deposited
- server could just be any VM with sufficient storage
- instead of ssh access, maybe only scp / sftp is allowed?
- could Terje's work be used to let the bot run build jobs under the "account" of someone who opened a PR or someone who sent a bot command to the bot?
- could work on AWS, not on most NESSI instances
- TODOs for v0.1 release
- finish/merge open PRs
- document how to use bot
- create contribution policy
- prepare/conduct tutorial/workshop for how to use bot
- code cleanup (no refactoring or functional improvements such as error handling, just cleaning up code)
- possibly start writing deliverable (outline structure, write parts that are done)
- status
- merged
- fix check for missing installations (software-layer PR #237)
- only tar up software directories for which modules were generated (software-layer PR #239)
- add bot comments in app.cfg.example (bot PR #170)
- some necessary changes to follow-up on, see bot issue #173 and bot issue #176
- ready to get merged
- [Kenneth] bot comments when labels are set without permission (bot PR #171)
- PR was updated to implement requested changes
- reviewed, waiting for requested changes
- n/a
- being reviewed
- n/a
- ready to get reviewed
- [Kenneth] support for sending commands to bot instances via PR comments (bot PR #172)
- [Thomas] double check if description of PR is up-to-date
- some fixes for letting the job manager handle non bot jobs correctly (bot PR #177)
- improve `check_missing_installations.sh` (software-layer PR #244)
- drafts
- move check result to target repo (bot PR #174)
- see also compat-layer PR #179
- see also software-layer PR #241
- want to improve interface between bot and target repos such that the bot does not do any processing of what should be added to a PR comment $\longrightarrow$ most flexible approach, anything we want to add can be added in scripts provided by a target repository
- plan to use this also for `bot/pre-build-analysis.sh` (see below)
- work-in-progress
- using `pr_comment_id` in `tasks/deploy.py` $\longrightarrow$ adding it to metadata uploaded to S3
- preparing PRs for improvements to EESSI/filesystem-layer PR#90
- comments added to PR when tarball is ingested
- improving performance/efficiency of ingestion
- `bot/pre-build-analysis.sh` runs inside a job but before the `bot/build.sh` script
- could provide a list of easyconfigs to be built (both formatted to be added to a PR comment and as a plain list)
- the result could be used by a maintainer and also list could then be use
- capture/show progress of a bot job
- bot would check/use contents of a file `_bot_job.progress` to update job progress information
- considering using EB hooks to update that file
- could be useful for large/long running jobs to provide an update of what is currently being built
- next
- `@bot get log/tmp` (difficult to manage access properly)
- can work with permissions just like for build/deploy? (`get_log` permission)
- maybe doable with Terje's work
- start engaging with SURF on integrating testing
- should deploy step of bot be split from actual uploading of tarball to S3?
- COPY discussion from Slack
- best way to avoid that the AWS token gets leaked is to make sure that bot doesn't have AWS token at all
- bot could copy tarball to a 'deploy' directory
- totally separate cron job could pick up new tarballs from 'deploy' dir, and upload them to S3 bucket for ingestion
- stalled
- MAYBE not necessary: `@bot enable/disable` (relatively easy?)
- would allow to instruct the bot to only build for a subset of targets before triggering a build (only */generic, etc.)
- would drop this (`bot: build` is good enough)
- KH: what about building for compat layer?
- just one `x86_64` and `aarch64` is enough? that can be dealt with a "bot: build" command instead of adding a `bot:build` label?
- yes, `bot: build arch:SOME_ARCHITECTURE inst:SOME_INSTANCE repo:SOME_REPOSITORY` would be enough
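
To make the `bot: build arch:... inst:... repo:...` form concrete, here is a minimal sketch of how such a PR comment line could be split into a command and its filters. The function name `parse_bot_command` and the returned structure are assumptions for illustration; the actual implementation in bot PR #172 may well differ.

```python
def parse_bot_command(line):
    """Parse a 'bot: COMMAND key:value ...' line.

    Returns (command, filters) or None if the line is not a bot command.
    Raises ValueError for arguments that are not of the form key:value.
    """
    prefix = "bot:"
    stripped = line.strip()
    if not stripped.startswith(prefix):
        return None
    tokens = stripped[len(prefix):].split()
    if not tokens:
        return None
    command = tokens[0]
    filters = {}
    for token in tokens[1:]:
        # Split on the first ':' only, so values may themselves contain ':'.
        key, sep, value = token.partition(":")
        if not sep or not key or not value:
            raise ValueError(f"malformed argument: {token!r}")
        filters[key] = value
    return command, filters


if __name__ == "__main__":
    example = "bot: build arch:SOME_ARCHITECTURE inst:SOME_INSTANCE repo:SOME_REPOSITORY"
    print(parse_bot_command(example))
    print(parse_bot_command("just a regular comment"))
```
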
- longer discussion about securing the bot environment (copied from Slack)
- can we work with permissions just like for build/deploy? (`get_log_permission`)
- How would it help to define another set of permissions? The problem seems to be to share any data produced by a bot job with the one who created the PR (who wants to get a piece of software added to EESSI). We cannot email gigabytes of data nor can we provide direct access to the bot account on a build system. What we would need is a storage space (ideally to be created by the bot on demand) where the bot can deposit such data and configure it such that the “owner” of the PR can access it. The access should be configurable by the bot, e.g. by using the PR owner’s public SSH key stored on GH.
- It’s mainly about making sure there’s no way that secrets can be leaked, like GitHub token, AWS token, etc. It’s not easy to make 100% sure that that’s impossible, especially with `bot/build.sh` being there. So, we could restrict things so only "approved" contributors can talk to the bot.
- To avoid a malicious person trying to extract a token while nobody is looking. That’s what "get_log" permission could help to achieve. The GBs of stuff to share: that could be dealt with by letting the bot upload stuff to an S3 bucket or so?
- With "get_log" permissions, we can at least control who can ask to upload build logs to a public place, which limits what a malicious person can do?
- In a first iteration, we can also let the bot copy the EB log to a public directory on the AWS cluster. If you don’t have an account there, tough luck. That’s very restrictive, but it’s better than not being able to get access to the log at all. And then we can explore alternatives.
- Maybe external contributors should not get access to any bot job data at all? It should be easy to make a contribution (prepare and open a PR), but maybe not a good idea to let anyone control the bot or obtain data of the bot? If the bot cannot build it, the contributor could get instructions on how to reproduce this on her own infrastructure - essentially using `eessi_container.sh` with write access and then building the software.
- `bot/build.sh` is actually relatively well secured (by transferring configuration information via `cfg/job.cfg`). What’s missing is an assessment of how a job is submitted, that is, what environment settings may leak from the bot into a job. If that can be excluded, it would be good. One more thing to check/ensure is that no credential information is stored on a shared filesystem (between nodes where the bot is running and nodes where jobs are running) plus ensuring that code running inside a bot job cannot ssh into a node where the bot is running.
- Even if we secure access to logs, etc. a malicious user could simply deposit credentials in the software being built and later access it when it has been made available via CernVM-FS. So, the more secure way forward seems to be to fully isolate a build job from the bot (and the credentials it needs).
- While adding more complexity, splitting the deploy step in two phases may be worth the hassle (as sketched: 1. copy tarball to directory, 2. upload by separate process). One might come up with a similar setup for the build step (1. setup job in directory, 2. submit job by separate process). Thus, only information on disk could be leaked and we can probably much better control that.
- We could submit the build jobs with a different account than the one running the bot... That should prevent quite a bit of stuff.
- So, that would mean we i) run the job manager with a different account, ii) let the job manager submit the jobs, rather than the bot itself?
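
The two-phase deploy idea from the discussion above (bot only copies the tarball to a 'deploy' directory, a separate process uploads it) could look roughly as follows. This is a sketch under assumptions: `stage_for_deploy`, `upload_pending`, the `done` directory and the `*.tar.gz` pattern are hypothetical, and the actual upload is injected as a callable so that the bot-side code never needs to see an AWS token.

```python
import shutil
from pathlib import Path


def stage_for_deploy(tarball, deploy_dir):
    """Bot side: copy a tarball into the deploy directory, no credentials needed."""
    tarball, deploy_dir = Path(tarball), Path(deploy_dir)
    deploy_dir.mkdir(parents=True, exist_ok=True)
    # Copy under a temporary name first so the uploader never sees a partial file.
    partial = deploy_dir / (tarball.name + ".part")
    shutil.copy2(tarball, partial)
    final = deploy_dir / tarball.name
    partial.replace(final)
    return final


def upload_pending(deploy_dir, upload, done_dir):
    """Uploader side (e.g. run from cron under another account).

    'upload' is any callable taking a Path; it is the only place that
    would need access to the S3 credentials.
    """
    deploy_dir, done_dir = Path(deploy_dir), Path(done_dir)
    done_dir.mkdir(parents=True, exist_ok=True)
    uploaded = []
    for tarball in sorted(deploy_dir.glob("*.tar.gz")):
        upload(tarball)
        # Move the tarball out of the way so it is not uploaded twice.
        shutil.move(str(tarball), str(done_dir / tarball.name))
        uploaded.append(tarball.name)
    return uploaded


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        src = base / "example.tar.gz"
        src.write_bytes(b"dummy")
        stage_for_deploy(src, base / "deploy")
        print(upload_pending(base / "deploy", lambda p: print("uploading", p.name), base / "done"))
```
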
- MultiXscale planning
- https://github.com/orgs/multixscale/projects/1/views/14
- targets for v0.1 of bot: https://github.com/multixscale/planning/issues/42
- should revisit this
- status
- merged
- add issue comment id to metadata (bot PR #164)
- add bot comments in app.cfg.example (bot PR #170)
- some necessary changes to follow-up on, see bot issue #173
- restore PATHs only after last run of pip installed eb (software-layer PR #238)
- update to usage information and examples for using flag terminator (docs PR #103)
- Script for automated ingestion of tarballs (filesystem-layer PR #90)
- ready to get merged
- [Kenneth] bot comments when labels are set without permission (bot PR #171)
- only lacks an update of the PR with main
- [Kenneth] fix check for missing installations (software-layer PR #237)
- only tar up software directories for which modules were generated (software-layer PR #239)
- reviewed, waiting for requested changes
- being reviewed
- ...
- ready to get reviewed
- [Kenneth] support for sending commands to bot instances via PR comments (bot PR #172)
- move check result to target repo (bot PR #174)
- see also compat-layer PR #179
- drafts
- ...
- work-in-progress
- using
pr_comment_id
intasks/deploy.py
$\longrightarrow$ adding it to metadata uploaded to S3 - preparing PRs for improvements to EESSI/filesystem-layer PR#90
- comments added to PR when tarball is ingested
- improving performance/efficiency of ingestion
- using
- next
- fix
SUCCESS
/FAILURE
messages when building with EB v4.7.0- use a
bot/check-result.sh
script, called fromscripts/bot-build.slurm
(job script run by bot) - bot expects a file
_bot_jobJOB_ID.result
with.ini
format[RESULT] summary = SUCCESS | FAILURE | UNKNOWN details = line 1 of details line 2 of details ... line n of details artefacts = line 1 of artefacts (eg tarball) ...
- other information could be
job_id
,job_runtime
,resource_usage
, ...
- use a
- `@bot rebuild` (relatively easy)
  - bot: rebuild
  - implemented via bot PR #172
- `@bot enable/disable` (relatively easy?)
  - would allow to instruct the bot to only build for a subset of targets before triggering a build (only */generic, etc.)
  - would drop this (`bot: build` is good enough)
  - KH: what about building for compat layer?
    - just one `x86_64` and `aarch64` is enough? that can be dealt with a "bot: build" command instead of adding a `bot:build` label?
    - yes, `bot: build arch:SOME_ARCHITECTURE inst:SOME_INSTANCE repo:SOME_REPOSITORY` would be enough
- `@bot get log/tmp` (difficult to manage access properly)
  - can work with permissions just like for build/deploy? (`get_log` permission)
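To make the proposed command syntax concrete, the sketch below parses a `bot: build arch:... inst:... repo:...` line from a PR comment. The function name `parse_bot_command` and the decision to reject unknown keys are assumptions for illustration; the parsing in bot PR #172 may differ.

```python
ALLOWED_KEYS = ("arch", "inst", "repo")  # filter keys named in the notes above


def parse_bot_command(comment_line):
    """Return (command, filters) for a 'bot: ...' line, or None if not a bot command."""
    prefix = "bot:"
    line = comment_line.strip()
    if not line.startswith(prefix):
        return None
    words = line[len(prefix):].split()
    if not words:
        return None

    command, filters = words[0], {}
    for word in words[1:]:
        # each argument is expected to look like key:value
        key, sep, value = word.partition(":")
        if not sep or not value or key not in ALLOWED_KEYS:
            raise ValueError(f"unsupported argument: {word!r}")
        filters[key] = value
    return command, filters


print(parse_bot_command("bot: build arch:SOME_ARCHITECTURE repo:SOME_REPOSITORY"))
print(parse_bot_command("just a regular comment"))
```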
- start engaging with SURF on integrating testing
- should deploy step of bot be split from actual uploading of tarball to S3?
  - best way to prevent the AWS token from being leaked is to make sure that the bot doesn't have an AWS token at all
  - bot could copy tarball to a 'deploy' directory
  - totally separate cron job could pick up new tarballs from 'deploy' dir, and upload them to S3 bucket for ingestion
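The split between the bot and a separate uploader discussed above could look roughly like the following sketch. Everything here is hypothetical: the function names, the `.part` suffix, the `done` directory, and the `upload` callable (which stands in for whatever S3 client the cron job would use) are assumptions for illustration only.

```python
import shutil
from pathlib import Path


def stage_tarball(tarball, deploy_dir):
    """Bot side: copy a finished tarball into the deploy directory (no AWS token needed)."""
    deploy_dir = Path(deploy_dir)
    deploy_dir.mkdir(parents=True, exist_ok=True)
    target = deploy_dir / Path(tarball).name
    partial = target.with_name(target.name + ".part")
    shutil.copy2(tarball, partial)
    partial.rename(target)  # rename last so the uploader never picks up a half-copied file
    return target


def upload_pending(deploy_dir, upload, done_dir):
    """Cron side: upload every staged tarball, then move it out of the deploy directory."""
    done_dir = Path(done_dir)
    done_dir.mkdir(parents=True, exist_ok=True)
    uploaded = []
    for tarball in sorted(Path(deploy_dir).glob("*.tar.gz")):
        upload(tarball)  # only this process would hold the credentials
        shutil.move(str(tarball), str(done_dir / tarball.name))
        uploaded.append(tarball.name)
    return uploaded
```

With this layout only the account running `upload_pending` needs access to the bucket; the bot account only needs write access to the deploy directory.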
-
status
- merged
- ready to get merged?
- none
- being reviewed
- [Jonas] EESSI/eessi-bot-software-layer PR#164 add issue comment id to metadata (incl unit tests for two functions)
- ready to get reviewed
- [Kenneth] EESSI/docs PR#103 update to usage information and examples for using flag terminator (related to PR#233)
- [Thomas] EESSI/eessi-bot-software-layer PR#170 make PR comment updates configurable via `app.cfg`
- [Kenneth] EESSI/software-layer PR#237 fix for `check_missing_installations.sh` (required for EB v4.7.0)
- [Kenneth] EESSI/software-layer PR#238 restore PATHs only after last run of pip installed eb (required when building for a new architecture)
- [Kenneth] EESSI/software-layer PR#239 only tar up software directories for which modules were generated (fixes issue 225, see particularly comment in #225)
- drafts
- none
- work-in-progress
- PR#163 first outline of class to represent and manipulate pull request comments
  - to handle PR comments more efficiently (fewer GitHub API calls)
  - can also be used to support asking bot to rebuild for a particular architecture
- using `pr_comment_id` in `tasks/deploy.py` $\longrightarrow$ adding it to metadata uploaded to S3
- preparing PRs for improvements to EESSI/filesystem-layer PR#90
- comments added to PR when tarball is ingested
- next
- fix `SUCCESS`/`FAILURE` messages when building with EB v4.7.0
  - use a `bot/build-check.sh` script, call it from Slurm build job script used by bot;
  - should touch a `job.success` or `job.fail` file (or a `job.result` file), that could have some info on failure;
    - result=success
    - msg=...
    - job_id=12345
    - job_time=123456s
- `@bot rebuild` (relatively easy)
  - bot: rebuild
- `@bot enable/disable` (relatively easy?)
  - would allow to instruct the bot to only build for a subset of targets before triggering a build (only */generic, etc.)
- `@bot get log/tmp` (difficult to manage access properly)
- start engaging with SURF on integrating testing
- misc
- a lot of work using bot for NESSI and ideas for improvements
- Bob's progress on compat layer 2023.03
  - big PR #163
    - has been split up into smaller PRs, see #166-#169
    - PR #166 needs fix included in PR #167
    - PR #163 can be synced up with `main` branch after PRs #166-#169 have been merged
  - tasks/deploy.py in bot needs to be tweaked to strip out hardcoding for software layer
- plan for bot
- GitHub App (bot) to automate workflow to add software to EESSI #41
- Set up infrastructure to build/test/deploy software in EESSI using bot #45
- depends on (initial) set of supported architectures
- Define contribution policy + checklist of requirements for contributions #46
- Document semi-automated workflow to contributed to EESSI #47
- D5.1 Community contribution policy and GitHub App #48
- working on PRs, Kenneth, Thomas
- approved EESSI/software-layer PR#233
- reviewed EESSI/eessi-bot-software-layer PR#155
- other open PRs
- EESSI/docs PR#103
- EESSI/eessi-bot-software-layer PR#164 add issue comment id to metadata
- assigned to Jonas
- EESSI/software-layer PR#237 fix check for missing installations
- assigned to Kenneth
- TODO
- PR for fixes to create_tarball.sh script
- retrigger bot for a specific CPU arch with a comment `@bot please rebuild for graviton2`
-
status
- merged
- ready to get merged?
- EESSI/software-layer PR#233
- ok after review
- should get tested together with https://github.com/EESSI/eessi-bot-software-layer/pull/155
- docs PR https://github.com/EESSI/docs/pull/103
- ready to get reviewed
- almost ready for review? EESSI/eessi-bot-software-layer PR#155
- should be tested in conjunction with https://github.com/EESSI/software-layer/pull/233
- very early draft
- PR#163 first outline of class to represent and manipulate pull request comments
  - to handle PR comments more efficiently (fewer GitHub API calls)
  - can also be used to support asking bot to rebuild for a particular architecture
- work-in-progress
- Jonas is working on making text being used in comments configurable
- Bob's progress on compat layer 2023.03
  - big PR #163
    - has been split up into smaller PRs, see #166-#169
    - PR #166 needs fix included in PR #167
    - PR #163 can be synced up with `main` branch after PRs #166-#169 have been merged
  - tasks/deploy.py in bot needs to be tweaked to strip out hardcoding for software layer
-
ideas for future features
- Now, the “most” desired new feature by software builders seems to be the ability to rebuild a specific job.
- We could make the bot listen to a well-formatted comment like `@bot please rebuild for foo`?
- Yes. We need to restructure the PR comments. Would have just one PR per target architecture+target repository. Maybe:
- Intro "Instance XYZ building for arch ARCH and repo REPO/VERSION ..."
- Instance config (enabled/disabled, builders, deployers, RAM, local disk, CPU cores)
- PR assessment (n of k easyconfigs missing, estimated)
- Last three states (human readable)
- All states (expandable, human readable), includes list of received commands
- Log of all updates (json, expandable), includes list of received commands
- Link to docs
- command "shell" (here is where we could add commands)
- We could begin making a mockup.
- Could be like a control center in a PR comment. Then at the top of the PR we really may only need an overview to easily see what is done, what not.
- Not everything maybe, and hopefully not everything in a single PR. I’ll try at least.
- NESSI team members also asked for the ability to cancel a job. Or get a bit more detailed status on request.
- From the perspective of someone maintaining the bot network, it would be nice to have an overview of running jobs, available resources (disk space, API request rate), bot instance health (running or not), health of other services (smee server, autoingest script on S0, S3 bucket server), ...
- Was also wondering if a script `inspect.sh --job-id JOBID` could be useful? You could run it from a working directory or supply the job id, then it would automatically launch the container with the correct parameters. It might also check if the job is still running. Maybe we want to update `eessi_container.sh` such that we can use the overlay of a build session, but with read-only access.
- work on open PRs, Kenneth, Thomas
- DONE EESSI/docs PR#100
- DONE EESSI/software-layer PR#232
- EESSI/software-layer PR#233
- `bot/build.sh` using `eessi_container.sh`
  - update EESSI/docs (--help) + example in using `--`
  - address requested changes & suggestions
- EESSI/eessi-bot-software-layer PR#155
- support for `bot/build.sh` in software layer
- discuss if/which tasks should be done, which could be postponed
- more testing should be done (while a similar version is in use for NESSI for ~2 weeks + this exact version has been tested with a few jobs ... more testing is beneficial)
- documentation for defining repositories needs to be added
- currently jobs are generated as follows:
  - iterate over `arch_target_map` entries (a list of `linux/CPU_FAMILY/microarchitecture` (abbreviated `arch` below) to `slurm_opts` mappings)
    - iterate over `repo_target_map[arch]` entries (repository identifiers)
  - this results in one job for each combination of `arch` + `repo_id`
  - should be the other way around? iterating over entries in `repo_target_map` (which already contains all combinations `arch` + `repo_id`) then just getting `slurm_opts` from `arch_target_map`?
  - if there is no matching `arch` key in `repo_target_map` or `arch_target_map` an error should be logged and the bot shall continue (currently results in a crash)
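The alternative iteration order suggested above (loop over `repo_target_map`, look up `slurm_opts` in `arch_target_map`, log and continue on a missing key) can be sketched as follows. The function name `plan_jobs` and the sample values are assumptions for illustration; the real maps come from the bot configuration.

```python
import logging


def plan_jobs(arch_target_map, repo_target_map):
    """Return one (arch, repo_id, slurm_opts) tuple per build job to submit."""
    jobs = []
    for arch, repo_ids in repo_target_map.items():
        slurm_opts = arch_target_map.get(arch)
        if slurm_opts is None:
            # log and carry on instead of crashing on a missing key
            logging.error("no entry for '%s' in arch_target_map, skipping", arch)
            continue
        for repo_id in repo_ids:
            jobs.append((arch, repo_id, slurm_opts))
    return jobs


# hypothetical configuration values
arch_target_map = {"linux/x86_64/generic": "--constraint generic"}
repo_target_map = {
    "linux/x86_64/generic": ["repo-a", "repo-b"],
    "linux/aarch64/generic": ["repo-a"],  # no Slurm options defined: skipped with an error
}
for job in plan_jobs(arch_target_map, repo_target_map):
    print(job)
```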
- Thomas+Bob on vacation --> drop
- work on open PRs, Kenneth, Thomas
- EESSI/docs PR#100
- warning, not use for production
- [TR] read once to give ok to get it merged
- EESSI/software-layer PR#232
- final polishing and testing ... almost ready to get merged
- work on open PRs, Kenneth, Thomas
- status/goals
- [Thomas]
- finish bot/build.sh work, install version on Saga, Fram and eX3/Fox
- works, however not prepared as PRs yet, suggested procedure
  - working version of SWL in https://github.com/trz42/software-layer/blob/enhancement/improvements_to_job_env/bot/build.sh
  - currently used for adding software to NESSI
  - first finish SWL PR#227 fixing tests for '`--access rw`'
  - then do follow-up PR to sync with what currently runs for NESSI bot network; rework PR#226
  - do separate PR for bot/build.sh from PR#226
    - requires `yq` to parse cfg file
  - finally update BOT PR#148 (may require some additional clean-up and documentation)
  - bot should create a cfg file to pass info to bot/build.sh script
- determine good first issue for Jonas
- try compat-layer PR #163
- (Thomas) no time
- (Bob) can trigger bot to build compat layer, fails due to bootstrap trouble (but bot part works fine)
- code that picks up tarball needs work (separate bot/deploy.sh script)
- start from https://github.com/EESSI/compatibility-layer/pull/160 and https://github.com/EESSI/gentoo-overlay/pull/84
- Bob will look into building new compat layer (2023.02)
- unplanned
- adding documentation for eessi_container ... plus reworking "Getting access to EESSI" & "Using EESSI" PR#100
- in progress
- see also Bob's PR https://github.com/EESSI/docs/pull/69 + https://github.com/EESSI/docs/pull/85
- later also a "Contributing to EESSI" page
- can use https://github.com/NorESSI/software-layer/wiki/Making-a-pull-request-to-NorESSI-software-layer as starting point
- misc
- workshop for NESSI colleagues on using the bot for building software for NESSI/EESSI
- see https://github.com/NorESSI/software-layer/wiki/Making-a-pull-request-to-NorESSI-software-layer
- hopefully activates more people to use bot which should lead to improvements
- idea to extend NESSI bot network to AWS cluster
  - using new `nessibot` account
- [Kenneth]
- figure out why OpenFOAM build doesn't work (software-layer PR #195)
- easily retrigger individual builds from PR
- see also Thomas' PR #85
- work with Thomas to get his PRs merged
- bot
- software-layer
- [Bob]
- test compat-layer PR #163 with bot using bot:build label
- [Thomas]
- status PR#216
  - some tests for `--access rw` fail, not immediately clear why (no obvious reasons from checking VM setup: disks type and size)
  - uncomment 2 tests, merge PR and make follow-up PR for investigating issue
-
discussed documentation for "Using EESSI", see PR#100
- see suggestions for improvements in PR
- agreed to restructure documentation as follows (also see https://hackmd.io/4IZZuK2fSQeh2TbZCPm9iA)
-
EESSI pilot repository [Kenneth]
- keep the warning
- point to other pages
-
Getting access to EESSI
- Native installation [Kenneth]
- Using the EESSI container [Thomas]
- incl. save/resume
-
Using EESSI
- Setting up your environment [Thomas]
- source
- R --version
- module avail output (see /pilot)
- Useful $EESSI env vars
- Running EESSI demos [Kenneth]
-
Building software for EESSI (follow-up PR, not included in #100)
- Manual procedure (for testing)
- Using eb --installpath
- Using read/write access in build container
- Making a contribution to EESSI
- Contribution policy
- Workflow
- Overview
- EESSI maintainer tasks
- Contributor tasks
- Recommendations
- goal: discuss PR#216
- TODOs:
- need to address requested changes
- need tests for parameters
snapshot of older notes available at https://github.com/multixscale/meetings/wiki/sync-meetings-bot-T5.3
- goals
- [Thomas]
- finish bot/build.sh work, install version on Saga, Fram and eX3/Fox
- determine good first issue for Jonas
- try compat-layer PR #163
- [Kenneth]
- figure out why OpenFOAM build doesn't work (software-layer PR #195)
- easily retrigger individual builds from PR
- see also Thomas' PR #85
- work with Thomas to get his PRs merged
- bot
- software-layer
-
https://github.com/EESSI/software-layer/pull/215
- => sync call at Wed 1 Feb at 13:30 CET
- https://github.com/EESSI/software-layer/pull/216
- (upcoming PR, see https://github.com/trz42/software-layer/tree/enhancement/bot-build-with-swl-216)
- https://github.com/EESSI/eessi-bot-software-layer/pull/85
-
https://github.com/EESSI/software-layer/pull/215
- [Bob]
- test compat-layer PR #163 with bot using bot:build label
- [Thomas]
-
agenda
- status (see below)
- overall progress ... how to move on? ... how to avoid that PRs wait too long before getting reviewed (also: how to avoid diverging development branches)?
- maybe we need a bit more (realistic) goals?
-
status
- PRs merged
- #136 (try and except in pr_comments) including several unit tests using fixtures and mocking --> opened issue #145 to revisit tests and improve the code quality
- cleaning up tests is WIP by Kenneth, needs more work (but not urgent)
- #144 (run bot/build.sh script in build job script, if it exists)
- Kenneth started creating bot/build.sh script for software-layer, very WIP still...
- software-layer #210 (R 4.1.0) - all installs (except ppc64le) built with bot
- PRs open
- #127 (providing an overview of a PR's status ...)
- #85 (tool to resubmit a build job locally)
- software-layer #195 (OpenFOAM v9)
- first attempt to let bot build it failed (consistently?)
- looks like a problem with a dependency?
- compat layer: https://github.com/EESSI/compatibility-layer/pull/163
- incl. bot/build.sh script
- issues closed
- none
- issues opened
- #146 (submit build jobs with default time limit, add configuration option to use different time limit)
- #145 (improve code quality of unit tests)
- work in progress
- [Thomas] major rework of jobscript submitted by bot and the interface between the job and the "payload"
- based on discussion on 19 Jan
- bot repo:
- swl repo:
- https://github.com/trz42/software-layer/tree/enhancement/bot-build-with-swl-216
- example PR to be built using the above branch https://github.com/trz42/software-layer/tree/add-CaDiCaL-1.3.0-GCC-9.3.0-NESSI
- added a little feature preview where one could enable/disable build targets in a pull request, see https://github.com/NorESSI/software-layer/pull/1#issuecomment-1407726812
- for passing down environment variable into compat layer env, see https://github.com/EESSI/software-layer/pull/220/files
- [Jonas] got bot running on Saga
- other
- setting up GH project (NorESSI) and bot instances for NESSI to let project members use the bot for building software stack
- should cover `x86_64/generic`, `x86_64/amd/zen2`, `x86_64/intel/broadwell`, `x86_64/intel/skylake_avx512` and `x86_64/intel/cascadelake`
- idea is to eventually use HEAD version of EESSI/software-layer and move all repository specific config/settings out of the software-layer repo to the configuration of a bot instance (e.g., a bot instance could build for EESSI or NESSI using the exact same PR)
- discussion
- need a way to remove files/directories before building (to trigger rebuild), and also for the ingestion to remove files from the CVMFS repository, see issue #147
-
planned goals
- [Thomas]
- Hafsa's open PRs (#136): finish & merge
- TODO revisit all C* cases (testing `update_comment`)
- code & doc cleanup
- resubmit PR (#85)
- mark as draft for now (don't expect others to review)
- try to let Bob review it so he gets introduced to parts of the code
- resume work on #127 (PR status overview)
- try EESSI/software-layer #216 with bot (addressing issues #135, #98, #88, #42)
- [Kenneth]
- implement end-to-end test
- should get PR #136 merged first since that includes lots of tests with mocking
- migrate docs in README to mkdocs website
- incl. adding high-level bot overview to docs
- [Bob]
- set up current bot and play with it
- app.cfg.example should be cleaned up, has some duplicate sections
- maybe try bot for new compat layer build script
- [Jonas]
- try bot
- [Thomas]
- present: Bob, Kenneth, Thomas
- reason
- can be used for software layer, compat layer, automating build of cluster-specific software stack, ...
- maybe also for testing EB easyconfig PRs?
- status w.r.t hardcoded stuff in current bot implementation
- bot side: scripts/eessi-bot-build.slurm
- software-layer: install_software_layer.sh, EESSI-pilot-install-software.sh, run_in_compat_layer.sh
./build_container.sh run /tmp/$USER/EESSI $PWD/install_software_layer.sh
- how to check whether build worked is now hardcoded in job manager
- see "EESSIBotSoftwareLayerJobManager.process_finished_job"
- maybe job manager should call a bot-check-build.sh script?
- or bot-build.sh script should touch a SUCCESS/FAIL file?
- use bot/build.sh script
- we'll eventually also have other scripts that the bot should use, like test/deploy/create_tarball/...
- stuff to communicate from bot to bot/build.sh script
- via a JSON file that bot/build.sh script can pick up?
- fields
- $CPU_TARGET
- HTTP proxy to use (if any)
- for which repo software should be built
- EESSI, NESSI, test/develop repo, ...
- for which (EESSI) version software should be built
- `cvmfs_customizations`
- tmpdir to use?
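A minimal sketch of the JSON hand-over idea above: the bot writes the listed fields to a file in the job's working directory, and `bot/build.sh` picks it up. The file name `job.json`, the key names, and the example values are all assumptions for illustration; the notes only list which pieces of information should be passed.

```python
import json
from pathlib import Path


def write_job_config(job_dir, cpu_target, repo_id, repo_version,
                     http_proxy=None, cvmfs_customizations=None, tmpdir=None):
    """Write the settings for one build job to <job_dir>/job.json and return the path."""
    config = {
        "cpu_target": cpu_target,            # e.g. x86_64/generic
        "http_proxy": http_proxy,            # None if no proxy is needed
        "repository": repo_id,               # EESSI, NESSI, test/develop repo, ...
        "version": repo_version,             # (EESSI) version to build for
        "cvmfs_customizations": cvmfs_customizations or {},
        "tmpdir": tmpdir,
    }
    path = Path(job_dir) / "job.json"
    path.write_text(json.dumps(config, indent=2) + "\n")
    return path
```

A plain data file like this keeps the interface between bot and build script explicit, and the same job directory can be inspected later to see exactly what the job was asked to do.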
- what should bot still do?
- prepare working directory for job
- run some checks (in job script)
- verify CPU target (since we bypass archspec)
- to verify whether bot is correctly configured
- requires latest archspec (or archdetect, or both)
- steps
- [KH] update eessi-bot-build.slurm job script to run bot/build.sh if it's there
- fall back to what it does now if not
- make different bot/build.sh scripts
- [KH] for current EESSI pilot (software-layer)
- [TR] for PR#216 (eessi_container.sh), Thomas tries this on local system
- [BD] Bob has already some script for using bot @ RUG
- [BD] Bob has one for the compat layer
- [KH] bot/build.sh for VSC/UGent + for testing easyconfig PRs
- other wild ideas
- bot:build label could be made more fine-grained
- examples:
- bot:build:nessi
- bot:build:eessi:aarch64/graviton2
- or maybe better via comment that give specific instructions to bot?
-
present: Bob, Kenneth, Thomas
-
status
- PRs merged
- #131 (dedicated log)
- #132 (retry after exception when getting GitHub token)
- #137 (expose error when read_config fails)
- #138 (use .diff patch file to set up build job ...)
- #141 (let bot pass CPU target to ... build job ...)
- PRs open
- #136 (try and except in pr_comments)
- using retry library, allows to use a @retry decorator for specific functions
  - see https://pypi.org/project/retry
  - this one is not actively maintained (last update: May 2016)
  - switch to retry2?, or [pyretry](https://pypi.org/project/py-retry), or https://pypi.org/project/the-retry, ...
- sleep mocked
- some code cleanup done
- TODO revisit all C* cases (testing `update_comment`)
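For reference while an alternative package is being chosen, a retry decorator is small enough to sketch in plain Python. The version below is a hand-rolled stand-in, not the API of the `retry` package or of any of the alternatives listed above; the `sleep` parameter is an assumption added so that tests can replace the real sleep without mocking.

```python
import functools
import time


def retry(exceptions, tries=3, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Retry the decorated function on the given exception(s), waiting longer each time."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == tries:
                        raise  # out of attempts: let the caller see the error
                    sleep(wait)
                    wait *= backoff
        return wrapper
    return decorator


# usage: record the waits instead of sleeping, so the example runs instantly
waits = []
calls = {"n": 0}


@retry(ConnectionError, tries=3, delay=1.0, backoff=2.0, sleep=waits.append)
def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(flaky_request(), waits)
```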
- #127 (providing an overview of a PR's status ...)
- #85 (tool to resubmit a build job locally)
- issues closed
- #140 (bot build script should make sure correct CPU target is used)
- #125 (dedicated log method in event handler + job manager)
- issues opened
- #139 (find alternative for retry package)
- #142 (job manager crash due to "Bad credentials")
- may be fixed by retry in PR #136
- practical problems hit when building R 4.1.0 with bot
- crashing job manager
- see issue #142, should be fixed by PR #136
- crashing installation due to fluke (GitHub --from-pr trouble)
- would be nice to be able to instruct bot to retry for a specific CPU target
- overview of current status across all CPU targets is difficult => see ongoing work on letting bot create overview in PR description (issue #93 + PR #127)
- other (see goals below + any other work related to bot)
- related work on EESSI/software-layer
- merged PR #220 (update script to take into account `$EESSI_SOFTWARE_SUBDIR_OVERRIDE` ...)
- merged PR #218 (... script for checking on missing installations ...)
- lots of work towards more (unit) testing, reading up on mocking, fixtures, etc.
- Jonas: intro to GH apps, setting up demo GH app, intro to HPC, setting up bot
- making bot "agnostic" of EESSI
- let build job script call a "bot-build.sh" script available in the repository
- will help both for using bot to manage own install stack, but also for letting bot build compat layer
-
discussion
- adopt strategy to have test implemented before change?
- analyse result of build job
- number of ec built
- number of files in tarball
- retry in build script
- bot could retry to build for a target
- general overview of what is being built by all bot instances
- tune number of cpu cores used, amount of memory/disk space needed
- what components need to be updated, only bot, only eb, both ...
- collect information about resource usage of bot jobs
- define interface between bot and "software-layer" or "compat-layer" or ...
- bot-build.sh in target repo
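One of the checks mentioned in this discussion, counting the number of files in the tarball produced by a build job, needs only the standard `tarfile` module. The function name `count_tarball_files` is an assumption for this sketch, and what counts as a suspicious number of files is left open.

```python
import tarfile


def count_tarball_files(tarball_path):
    """Return the number of regular files in a (possibly compressed) tarball."""
    with tarfile.open(tarball_path) as tar:
        # iterate lazily; directories, symlinks and other member types are not counted
        return sum(1 for member in tar if member.isfile())
```

A result of zero would be a strong hint that the build produced nothing, even if the job itself exited cleanly.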
-
goals
- [Thomas]
- Hafsa's open PRs (#131, #132, #136): finish & merge
- code & doc cleanup
- resubmit PR (#85)
- mark as draft for now (don't expect others to review)
- try to let Bob review it so he gets introduced to parts of the code
- [Kenneth]
- implement end-to-end test
- should get PR #136 merged first since that includes lots of tests with mocking
- migrate docs in README to mkdocs website
- incl. adding high-level bot overview to docs
- implement end-to-end test
- [Bob]
- set up current bot and play with it
- app.cfg.example should be cleaned up, has some duplicate sections
- set up current bot and play with it
- [Jonas]
- intro to bot
- [Thomas]
-
building new compat layer
- one script to build compat layer in container with Ansible playbook
- Ansible playbooks don't require root permission anymore, they just assume that /cvmfs is writable
-
present: Bob, Kenneth, Thomas
-
status bot
- working prototype
- running bot instances get notified of GitHub events
- bot submits build jobs when bot:build label is added
- bot runs upload script to S3 bucket when bot:deploy label is added
- example: https://github.com/trz42/software-layer/pull/46
- maybe script to ingest tarballs should be integrated with the bot?
- issues https://github.com/EESSI/eessi-bot-software-layer/issues
- open PRs https://github.com/EESSI/eessi-bot-software-layer/pulls
- actually using current prototype helps a lot to identify things that can be improved
- ~1000 jobs launched by bot in scope of NESSI project (6 bot instances)
-
first steps (not fully ordered)
- finish open PRs -- [Thomas]
- except #129 "pull request overview"
- end-to-end test (code coverage) -- [Kenneth takes another look]
- code+doc cleanup sprint -- [Thomas has a look in parallel]
- docstring
- string formatting
- code consistency
- README -> mkdocs conversion => out of scope for this sprint
- agree on development practices
- 2-pairs of eyes rule: never merge your own PR
- keep docs up-to-date along with changes being made
- code style: honor the Hound CI
- keep PRs smallish - relatively easy to review
- aim for short lifetime of PRs - get them reviewed/merged quickly
- make sure that changes being made are covered by the tests
- once we have an end-to-end test
- do add unit tests for new/tweaked functions
- prepare a first release 0.1.0
- label issues accordingly
- should include docs (mkdocs)
- package and publish on PyPI?
- most bot instances should be only running released version
- one bot running "develop" to have live testing of recent changes
- document how to set up/use bot for different environments: production, development, testing
- purpose
- scope (EESSI, MXS, personal)
- setting them up
- maintaining / operating them
- agree on roadmap for bot
- what should be done by v0.1.0
- features that make our life easier
- end-to-end test
- also for next releases
- what makes life of other people easier
- which features should be implemented by end of 2023 (MXS milestone)
-
midterm goals
- policy for community contributions
- LICENSE clear (automatically checkable/verifiable?)
- easyconfig available -- builds on generic target (maybe guidelines for testing this with eessi_container?)
- "compatible" with current EESSI version
- not too many missing installations per PR
- (ReFrame tests available)
- (documentation available)
- same for all dependencies
- build software for next EESSI pilot version via bot
- do this ourselves to check that bot is working well
- get others to "use" the bot via community contributions (PRs to EESSI/software-layer)
- set up hackathon to get people to open PRs that bot acts on
-
long term goals
- make bot usable for EESSI (and beyond)
- let bot build entire software stacks quickly and using resources carefully
- building in parallel, splitting up builds across multiple jobs (via EB)?
- document bot in a (research) paper
- maybe for JOSS (Journal of Open Source Software - https://joss.theoj.org)
- or talk/presentation (automation or HPC devroom at FOSDEM?)
-
next meeting(s)
- every two weeks
- not on Mon/Fri (Thomas), not Wed (Bob)
- Tue 17 Jan'23 - 09:00 CET
- Tue 31 Jan'23 - 09:00 CET
- Tue 14 Feb'23 - 09:00 CET
-
internal milestones (for organising work) -- maybe rather think in releases
- production & development environment described/set up
- mapping involved components and their interfaces
- documentation => to be written by Thomas/Kenneth, reviewed by Bob?
-
dependencies
- access to (HPC systems) (Slurm) resources for running bot instances
- should cover all CPU (later also GPU) targets used in EESSI
- running bot instances with personal accounts or robot accounts or build mostly on AWS/Azure
- also depending who can set the labels
- (new) compatibility layer
- (new) Stratum 0 - dedicated server at RUG + under eessi.io domain
- none
- D5.1 Community contribution policy and GitHub App
- due: M12
- partners: Uib (6PM), Ugent (2PM), RIJKSUNIGRON (2PM)
- People involved:
- Bob (RIJKSUNIGRON)
- Kenneth (Ugent)
- Jonas, Thomas (Uib)