Balsam with Flux Framework? #343
Comments
Hi @vsoch, sorry, the tutorial materials should have been clearer. Balsam uses a remote server that hosts a database for the user, which stores aspects of the workflow (such as Applications, Jobs, etc.). Currently, there is only one Balsam server, hosted at ALCF. If you just want to test out Balsam, we can look into getting you an account to access the server. We do have instructions for setting up a server, if that is something you'd be interested in trying.

To answer some of your other questions: I'm not familiar with Flux Framework, but if it's a scheduler like SLURM or PBS Pro, one could implement a FluxFrameworkClass (e.g. like the slurm example here). We'd also have to know what launcher you use and implement an AppRun class (like mpiexec here).

The Applications and Jobs are stored within the database hosted on the Balsam server. In the tutorial example, the jobs created needed to know what application they were running (

If Balsam is still something you'd like to test out, let us know.
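To make the launcher half of that suggestion concrete, here is a minimal sketch of how a Flux launch line might be assembled from node and task counts. It is not Balsam's AppRun interface: `FluxRunSpec` and `build_flux_run_command` are hypothetical names, and the sketch assumes a flux-core recent enough to provide `flux run` (older releases used `flux mini run`) with the `-N`, `-n`, and `-c` options.

```python
import shlex
from dataclasses import dataclass
from typing import List


@dataclass
class FluxRunSpec:
    """Hypothetical launch parameters; not part of Balsam's API."""

    command: str          # application command line, e.g. "lmp -in in.lj"
    num_nodes: int = 1    # passed to `flux run -N`
    num_tasks: int = 1    # passed to `flux run -n`
    cores_per_task: int = 1  # passed to `flux run -c`


def build_flux_run_command(spec: FluxRunSpec) -> List[str]:
    """Return an argv list that launches spec.command under `flux run`."""
    if min(spec.num_nodes, spec.num_tasks, spec.cores_per_task) < 1:
        raise ValueError("node, task, and core counts must all be >= 1")
    return [
        "flux", "run",
        "-N", str(spec.num_nodes),
        "-n", str(spec.num_tasks),
        "-c", str(spec.cores_per_task),
        # Split the application command the way a POSIX shell would.
        *shlex.split(spec.command),
    ]
```

A real integration would subclass whatever base class Balsam's mpiexec launcher uses and return this argv from it; the sketch only shows the command construction, which is the Flux-specific part.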
@cms21 no worries - actually that link to the balsam docs is great, maybe it would be good to add to the repository top right URL (alongside the description?) E.g., here: I think if it was in the README somewhere I missed it! And I think adding a FluxFramework class is a great idea - I won't have time today but I'll add this to my TODO and we can use this issue for tracking and discussion. For some context - I'm wanting to test this out in the Flux Operator and it would make sense to set up the same hosted server, just in Kubernetes! I got pretty far today until I realized we need to do additional work to add the flux class. But I'm having some problems with the container build. Here is the
The log error (it seems to be choking on the path):

Error: class uri 'balsam.server.gunicorn_logger.RotatingGunicornLogger' invalid or not found:
[Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/gunicorn/util.py", line 99, in load_class
mod = importlib.import_module('.'.join(components))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
File "<frozen importlib._bootstrap>", line 1128, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 940, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/balsam/balsam/server/__init__.py", line 1, in <module>
from fastapi import HTTPException, status
File "/usr/local/lib/python3.11/site-packages/fastapi/__init__.py", line 7, in <module>
from .applications import FastAPI as FastAPI
File "/usr/local/lib/python3.11/site-packages/fastapi/applications.py", line 15, in <module>
from fastapi import routing
File "/usr/local/lib/python3.11/site-packages/fastapi/routing.py", line 23, in <module>
from fastapi.dependencies.models import Dependant
File "/usr/local/lib/python3.11/site-packages/fastapi/dependencies/models.py", line 3, in <module>
from fastapi.security.base import SecurityBase
File "/usr/local/lib/python3.11/site-packages/fastapi/security/__init__.py", line 1, in <module>
from .api_key import APIKeyCookie as APIKeyCookie
File "/usr/local/lib/python3.11/site-packages/fastapi/security/api_key.py", line 3, in <module>
from fastapi.openapi.models import APIKey, APIKeyIn
File "/usr/local/lib/python3.11/site-packages/fastapi/openapi/models.py", line 103, in <module>
class Schema(BaseModel):
File "/usr/local/lib/python3.11/site-packages/pydantic/main.py", line 292, in __new__
cls.__signature__ = ClassAttribute('__signature__', generate_model_signature(cls.__init__, fields, config))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pydantic/utils.py", line 258, in generate_model_signature
merged_params[param_name] = Parameter(
^^^^^^^^^^
File "/usr/local/lib/python3.11/inspect.py", line 2722, in __init__
raise ValueError('{!r} is not a valid parameter name'.format(name))
ValueError: 'not' is not a valid parameter name

And my tweaked entrypoint.sh - I wanted to run the migrate command too:

#!/bin/bash
export BALSAM_LOG_DIR="/balsam/log"
mkdir -p $BALSAM_LOG_DIR
gunicorn balsam server migrate || echo "gunicorn balsam server migrate not successful"
gunicorn --print-config -c /balsam/gunicorn.conf.py balsam.server.main:app
exec gunicorn -c /balsam/gunicorn.conf.py balsam.server.main:app

And I have some of the envars (that I saw for the docker-compose setup) defined by the flux operator:

services:
  - image: postgres
    name: postgres
    ports:
      - 5432
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: balsam
  - image: redis
    name: redis
    ports:
      - 6379
  - image: ghcr.io/rse-ops/balsam-base:tag-latest
    name: balsam
    workingDir: /balsam
    ports:
      - 8000
    environment:
      BALSAM_DATABASE_URL: "postgresql://postgres:postgres@flux-sample-services.flux-service.flux-operator.svc.cluster.local:5432/balsam"
      BALSAM_REDIS_PARAMS: '{"host": "flux-sample-services.flux-service.flux-operator.svc.cluster.local", "port": "6379"}'
      BALSAM_AUTH_SECRET_KEY: "SOME_SECRET_KEY"
      BALSAM_OAUTH_CLIENT_ID: "SOME_CLIENT_ID"
      BALSAM_OAUTH_CLIENT_SECRET: "SOME_CLIENT_SECRET"

Let me know if you want to see anything else, or if anything sticks out to you. I've added it to my TODO to look into adding Flux to balsam, and likely I won't need to resolve the above issues until after that!
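When debugging a setup like this, it can help to confirm that the two connection settings decode to the hosts and ports you expect before starting gunicorn. The following is a rough sketch using only the standard library; `summarize_service_endpoints` is a hypothetical helper, and it makes no claim about how Balsam itself parses these variables.

```python
import json
from typing import Mapping
from urllib.parse import urlsplit


def summarize_service_endpoints(env: Mapping[str, str]) -> dict:
    """Decode the database URL and Redis params into (host, port, ...) tuples."""
    db = urlsplit(env["BALSAM_DATABASE_URL"])
    # BALSAM_REDIS_PARAMS is a JSON object stored as a string.
    redis = json.loads(env["BALSAM_REDIS_PARAMS"])
    return {
        # hostname, port, and database name taken from the URL
        "postgres": (db.hostname, db.port, db.path.lstrip("/")),
        # the port is quoted in the config above, so convert it explicitly
        "redis": (redis["host"], int(redis["port"])),
    }
```

Inside the container this could be called as `summarize_service_endpoints(os.environ)` (after `import os`) and the result printed next to the `gunicorn --print-config` output.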
hey @cms21 that sounds like a great idea and I might take you up on it! Let me futz around a little bit with setting up a development environment, and then see if I'm able to add Flux. My goals are fairly simple - I'm testing out every workflow tool / simulation that I can with the Flux Operator, the goal being to get a nice survey of the landscape and start paving a direction for what I'd personally like for workflows at my institution. It's not uncharted territory because there are a ton of tools, but it's certainly not a paved path because I haven't really identified a leader in the space yet. I will do some work and learning and follow up here! Thank you for the kind offer!
okay I'm following your docker-compose setup in your docker.yaml, and hitting the same issue:
Does this look familiar? I just need to set up a development environment.
I also have the above error with the
We get all kinds of errors (see argonne-lcf#343), so I pinned it to `FROM python:3.10-slim`
@basvandervlies this is super helpful! It looks like I was using python 3.11 too:
I'll try downgrading.
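For anyone wondering why the Python version matters here: the last frames of the traceback above end in `inspect.Parameter`, which, as far as I can tell, began rejecting Python keywords such as `not` as parameter names in Python 3.11. That would explain why the same packages import cleanly on a 3.10 image. The sketch below is only a way to check that behaviour on a given interpreter; `parameter_name_is_accepted` is a hypothetical helper and not part of pydantic, FastAPI, or Balsam.

```python
import inspect
import keyword
import sys


def parameter_name_is_accepted(name: str) -> bool:
    """Return True if this interpreter allows `name` as a keyword-only parameter."""
    try:
        inspect.Parameter(name, inspect.Parameter.KEYWORD_ONLY)
    except ValueError:
        # Same exception type that ends the traceback in this thread.
        return False
    return True


if __name__ == "__main__":
    print("python", sys.version_info[:2])
    print("'not' is a keyword:", keyword.iskeyword("not"))
    print("'not' accepted as a parameter name:", parameter_name_is_accepted("not"))
```

Running this in both the `python:3.11` and `python:3.10-slim` images should show whether the interpreter, rather than the Balsam code, is the difference.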
Fix for the image: #363
Hi! I'm looking at the tutorial here: https://github.com/CrossFacilityWorkflows/DOE-HPC-workflow-training/tree/main/Balsam and trying to imagine how this works with a job manager like Flux Framework. Here are some of my early guesses so far:

- this application script is creating an application to run lammps, and we would need to tweak the lmp command itself to work for flux. Is this assumed to be given to the launcher (e.g., you would normally do like mpirun with a certain number of nodes tasks)? I'm wondering where this "app" gets saved, since the instruction to the user is to run the script one off?
- preparing the jobs references the app by name. I'm guessing there is some context that Balsam is saving to know that the result of step 0 (the bullet above) is this same "Lammps"?
- submit: should this be given to a launcher (e.g., flux start would give the script and run it under a flux instance for which there might be the resources needed)?
- And the plot results seems straightforward.

Thanks for the help / advice - I will likely just start playing around with it and figure out some of these concepts, but I wanted to start a conversation here to anticipate getting greater insights first.