
revert migration of htcondor release role #855

Merged

Conversation

sanjaysrikakulam
Member

@sanjaysrikakulam commented Jul 29, 2023

condor_rm on the maintenance node does not work because the jobs were submitted by a different scheduler (sn06.galaxyproject.eu). At the time of writing there were more than 150 jobs in the held state due to memory issues that were supposed to be removed. To fix this for now, I have manually added the script and enabled the cronjob (galaxy user) on sn06 and disabled it on the maintenance node.
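
For reference, held jobs can also be removed in bulk with a constraint when run directly on the submit node; this is only a minimal sketch, and the owner/reason filters are assumptions rather than the exact expression the cron script uses:

# Sketch: remove all held jobs (JobStatus == 5) owned by galaxy in one go,
# run on the submit node itself so condor_rm talks to the local schedd.
condor_rm -constraint 'JobStatus == 5 && Owner == "galaxy"' \
    -reason "Removed held jobs (ran out of memory)."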

I tried:

  1. adding the name of the scheduler to the condor_rm command, but encountered authentication errors:
galaxy@maintenance:~$ condor_rm -name sn06.galaxyproject.eu -reason "This job was resubmitted $RESUBMIT_CAP times. Most likely because of running out of memory." 45013273
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5003:Failed to authenticate.  Globus is reporting error (851968:50).  There is probably a problem with your credentials.  (Did you run grid-proxy-init?)
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using FS
Couldn't find/remove all jobs in cluster 45013273

Added more verbosity

root@maintenance:~$ condor_rm -debug -name sn06.galaxyproject.eu -reason "This job was resubmitted $RESUBMIT_CAP times. Most likely because of running out of memory." 45013273
07/29/23 13:32:46 AUTH_ERROR: Cannot resolve network address for KDC in requested realm
07/29/23 13:32:46 authenticate_self_gss: acquiring self credentials failed. Please check your Condor configuration file if this is a server process. Or the user environment variable if this is a user process.

GSS Major Status: General failure
GSS Minor Status Error Chain:
globus_gsi_gssapi: Error with GSI credential
globus_gsi_gssapi: Error with gss credential handle
globus_credential: Valid credentials could not be found in any of the possible locations specified by the credential search order.
Valid credentials could not be found in any of the possible locations specified by the credential search order.
Attempt 1
globus_credential: Error reading host credential
globus_sysconfig: Could not find a valid certificate file: The host cert could not be found in:
1) env. var. X509_USER_CERT
2) /etc/grid-security/hostcert.pem
3) $GLOBUS_LOCATION/etc/hostcert.pem
4) $HOME/.globus/hostcert.pem

The host key could not be found in:
1) env. var. X509_USER_KEY
2) /etc/grid-security/hostkey.pem
3) $GLOBUS_LOCATION/etc/hostkey.pem
4) $HOME/.globus/hostkey.pem


Attempt 2
globus_credential: Error reading proxy credential
globus_sysconfig: Could not find a valid proxy certificate file location
globus_sysconfig: Error with key filename
globus_sysconfig: File does not exist: /tmp/x509up_u0 is not a valid file
Attempt 3
globus_credential: Error reading user credential
globus_sysconfig: Error with certificate filename: The user cert could not be found in:
1) env. var. X509_USER_CERT
2) $HOME/.globus/usercert.pem
3) $HOME/.globus/usercred.p12

07/29/23 13:32:46 SECMAN: required authentication with <132.230.223.239:9618> failed, so aborting command ACT_ON_JOBS.
07/29/23 13:32:46 DCSchedd::actOnJobs: Failed to send command (ACT_ON_JOBS) to the schedd
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5003:Failed to authenticate.  Globus is reporting error (851968:50).  There is probably a problem with your credentials.  (Did you run grid-proxy-init?)
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using FS
Couldn't find/remove all jobs in cluster 45013273

I will investigate this further next week, see how to achieve the expected outcome, and migrate this role with a proper fix.

condor_rm on the maintenance node does not work. Needs further investigation.
@sanjaysrikakulam
Member Author

@mira-miracoli, @sj213 Do you already have any experience with this?

@mira-miracoli
Contributor

From the SchedLog I can see that your condor tries to access sn05; I think the old IP for condor-cm.galaxyproject.eu is still cached. Maybe you can try to flush the DNS cache.

@sanjaysrikakulam
Member Author

> From the SchedLog I can see that your condor tries to access sn05; I think the old IP for condor-cm.galaxyproject.eu is still cached. Maybe you can try to flush the DNS cache.

Thank you! I once tried to run it with condor-cm and then tried sn06 as the sched name, which is why you might be seeing condor-cm in the log. Also, DNS is fine because pinging the name resolves to sn06's IP, so it's resolving properly. I suspect that I need to ALLOW the maintenance node to WRITE or to ADMIN other schedulers. This needs to be updated in the sn06 condor_config. I am looking for the right macro/setting to modify.

@bgruening
Member

@sj213 should know I think. He has one other submit node configured already.

@mira-miracoli
Contributor

The permissions on the headnode are fine, I think. But as long as the Sched is not actually connecting to it but to sn05, we cannot know for sure.

@kysrpex left a comment
Contributor


Let's keep this on sn06 then for the time being?

@sanjaysrikakulam
Member Author

> Let's keep this on sn06 then for the time being?

It's currently on sn06 (deployed manually).

@kysrpex merged commit 48a2ba6 into usegalaxy-eu:master Aug 1, 2023
2 checks passed
@sj213
Contributor

sj213 commented Sep 8, 2023

> Thank you! I once tried to run it with condor-cm and then tried sn06 as the sched name, which is why you might be seeing condor-cm in the log. Also, DNS is fine because pinging the name resolves to sn06's IP, so it's resolving properly. I suspect that I need to ALLOW the maintenance node to WRITE or to ADMIN other schedulers. This needs to be updated in the sn06 condor_config. I am looking for the right macro/setting to modify.

Using -name sn06.galaxyproject.eu is probably the right thing to do in this case, but as you correctly stated, the maintenance node apparently lacks the requisite access privileges. The relevant macros are ALLOW_WRITE and ALLOW_ADMINISTRATOR, as defined in sn06:/etc/condor/condor_config.local (not sure OTTOMH whether or not this file is maintained by Jenkins).
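
A minimal sketch of what that could look like in sn06:/etc/condor/condor_config.local; the maintenance node's FQDN and the existing values of these macros are assumptions here:

# Hypothetical excerpt of /etc/condor/condor_config.local on sn06.
# Append the maintenance node to the existing write/administrator lists;
# run condor_reconfig afterwards for the change to take effect.
ALLOW_WRITE = $(ALLOW_WRITE) maintenance.galaxyproject.eu
ALLOW_ADMINISTRATOR = $(ALLOW_ADMINISTRATOR) maintenance.galaxyproject.eu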

I don't know if giving the host write privilege is sufficient or if admin access is needed; in the spirit of the principle of least privilege I'd try it with write access only first.

Also note that according to condor_rm(1)

> For any given job, only the owner of the job or one of the queue super users (defined by the QUEUE_SUPER_USERS macro) can remove the job.

so make sure that the command is issued by user #999.

@sj213
Contributor

sj213 commented Sep 8, 2023

> I don't know if giving the host write privilege is sufficient or if admin access is needed; in the spirit of the principle of least privilege I'd try it with write access only first.

Actually, if admin access were generally required for removal of jobs from the queue, users would not be able to remove their own jobs, so admin access should only be required in the case of an owner ID mismatch. So maybe just becoming user galaxy on the host before running condor_rm is sufficient, given that the network 10.5.68.0/24 is already allowed write access.
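
A quick sketch of that approach, assuming the galaxy account exists on the maintenance node (the job ID is just the example from above):

# On the maintenance node, run condor_rm as the job owner against the remote schedd.
sudo -u galaxy condor_rm -name sn06.galaxyproject.eu \
    -reason "Removed after repeated resubmission (out of memory)." 45013273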

@sj213
Contributor

sj213 commented Sep 8, 2023

It would appear that Condor does not trust remote user IDs w/o strong authentication. Which is the right thing to do, of course, but it means we have to dig deeper into the issue of authentication. (Which we'll have to do anyway when transitioning to Condor >= v9.x)

@sanjaysrikakulam
Member Author

@sj213 since we have migrated to the latest version, do you think this might be possible now?

@sj213
Contributor

sj213 commented Feb 7, 2024

> @sj213 since we have migrated to the latest version, do you think this might be possible now?

Token authentication for admin-level access should work now in principle, but I'm not sure about the gory details. AFAIK we're ATM using two levels of privilege, one associated with the user condor (used for daemon-to-daemon auth) and another with the user galaxy (for submitter-to-daemon). Giving the relevant processes on the maintenance host access to a condor-user token should IMO do the trick of authenticating as an admin-level user. Whether additional macros need to be configured on the CM I don't know OTTOMH, although I'd expect that the admin user is authorized to remove any job by default.
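
For reference, a rough sketch of how such a token could be minted and installed; the identity, trust domain and target directory are assumptions and depend on our actual configuration:

# On the central manager (or any host holding the signing key), mint a token
# for the condor identity; the trust-domain suffix here is an assumption.
condor_token_create -identity condor@galaxyproject.eu > condor_admin.token

# On the maintenance node, install it where the tools/daemons look for tokens,
# e.g. /etc/condor/tokens.d/ for daemons or ~/.condor/tokens.d/ for a user.
install -m 0600 condor_admin.token /etc/condor/tokens.d/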

@sanjaysrikakulam
Member Author

It works! After fixing the firewall issues (PR1 and PR2), I created a test job on the maintenance node and was then able to remove that job from the headnode (container) by running condor_rm -name maintenance.galaxyproject.eu 563 (format: <command> -name <scheduler name> <condor job id>).

@bgruening
Member

Can -name be a wildcard? Or can we specify two? At some point we will hopefully have two submit/head nodes.

@sanjaysrikakulam
Member Author

> Can -name be a wildcard? Or can we specify two? At some point we will hopefully have two submit/head nodes.

I am testing exactly those things and trying to see what would work. Currently I am trying to filter by the scheduler name htcondor, which seems to be the name of the scheduler on the sn06 container, but filtering jobs based on that name does not seem to work, as it is not recognized. I will update soon (will figure something out).
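
One way to discover the schedd names that -name actually accepts is to ask the collector; a small sketch (the attribute selection is just an example):

# List all schedds known to the collector, together with their advertised names.
condor_status -schedd

# The same, but only print the Name attribute that condor_rm/condor_q expect after -name.
condor_status -schedd -af Name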

@mira-miracoli
Contributor

As I understand it, we will then also have two separate Condor ID counters for the schedulers sn06 and sn07.
Maybe we should think about whether that could break something when different Galaxy jobs get the same Condor/runner ID.

@bgruening
Member

I hope this is not true. It's the same cluster, so why would the same cluster have multiple identical IDs? Would the JWDs, logs, etc. potentially clash?

@sj213
Contributor

sj213 commented Feb 16, 2024

> I hope this is not true. It's the same cluster, so why would the same cluster have multiple identical IDs?

This is by design in Condor: each submit host has its very own queue along with its own set of job IDs; the CM tells them apart by internally qualifying job IDs with the name of the submitter host. (Actually, job IDs are "cluster IDs" in Condor parlance, as each job is potentially a whole "cluster" of subjobs. "Cluster" here does not refer to the compute hardware but to a software abstraction that groups a collection of (sub-)jobs into one scheduling unit, go figure... To minimize confusion, I'll continue to refer to Condor "cluster IDs" as "job IDs".)

You can check this for yourself with condor_q -global on sn06: this command currently lists two queues, one for the Galaxy instance on sn06 and another one for the RNA server. When put into service, sn07 will add another queue with job IDs initially starting at 1.0 and counting upwards.

Whether or not this is a problem for Galaxy in a setup with multiple frontend nodes, each using a Condor backend, depends on how Galaxy uses the job scheduler's IDs internally. Does it somehow embed the job IDs in workflow IDs or other internal monikers? That would be potentially problematic, but I don't think Galaxy does this. In any case, this should be looked at in detail before putting a second headnode into production.
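
To make the ID qualification concrete, a small sketch (the schedd name is taken from earlier in this thread; the attribute selection is just an example):

# One queue (and one independent cluster-ID counter) per schedd:
condor_q -global

# GlobalJobId embeds the schedd name, so it stays unique across the pool
# even when two schedds happen to hand out the same ClusterId:
condor_q -name sn06.galaxyproject.eu -af GlobalJobId ClusterId ProcId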

> Would the JWDs, logs, etc. potentially clash?

This question actually goes beyond the Condor job ID issue. It is of course of vital importance that the dataset and workflow IDs generated by both headnodes do not clash, regardless of which compute backend is used; similar concerns arise wrt the data store and the DB. The real question is in fact whether two instances of Galaxy can actually use the same PG database R/W (and also the JWDs / data store) without wreaking havoc on data integrity, i.e. whether Galaxy is designed for concurrent usage of resources by multiple instances. For if it isn't, separate instances of the PG database and JWDs would be required for each frontend node, making the setup more involved.
