
revert migration of htcondor release role #855

Merged

Conversation

sanjaysrikakulam
Member

@sanjaysrikakulam commented Jul 29, 2023

condor_rm on the maintenance node does not work because the jobs were submitted by a different scheduler (sn06.galaxyproject.eu). At the time of writing there were more than 150 jobs in the held state due to memory issues that were supposed to be removed. To fix this for now, I have manually added the script and enabled the cronjob (galaxy user) on sn06 and disabled it on the maintenance node.
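
For reference, held jobs can also be removed in bulk with a constraint when run directly on the submit node; this is only a minimal sketch, and the owner/reason filters are assumptions rather than the exact expression the cron script uses:

# Sketch: remove all held jobs (JobStatus == 5) owned by galaxy in one go,
# run on the submit node itself so condor_rm talks to the local schedd.
condor_rm -constraint 'JobStatus == 5 && Owner == "galaxy"' \
    -reason "Removed held jobs (ran out of memory)."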

I tried:

  1. adding the name of the scheduler to the condor_rm command, but encountered authentication errors:
galaxy@maintenance:~$ condor_rm -name sn06.galaxyproject.eu -reason "This job was resubmitted $RESUBMIT_CAP times. Most likely because of running out of memory." 45013273
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5003:Failed to authenticate.  Globus is reporting error (851968:50).  There is probably a problem with your credentials.  (Did you run grid-proxy-init?)
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using FS
Couldn't find/remove all jobs in cluster 45013273

Added more verbosity

root@maintenance:~$ condor_rm -debug -name sn06.galaxyproject.eu -reason "This job was resubmitted $RESUBMIT_CAP times. Most likely because of running out of memory." 45013273
07/29/23 13:32:46 AUTH_ERROR: Cannot resolve network address for KDC in requested realm
07/29/23 13:32:46 authenticate_self_gss: acquiring self credentials failed. Please check your Condor configuration file if this is a server process. Or the user environment variable if this is a user process.

GSS Major Status: General failure
GSS Minor Status Error Chain:
globus_gsi_gssapi: Error with GSI credential
globus_gsi_gssapi: Error with gss credential handle
globus_credential: Valid credentials could not be found in any of the possible locations specified by the credential search order.
Valid credentials could not be found in any of the possible locations specified by the credential search order.
Attempt 1
globus_credential: Error reading host credential
globus_sysconfig: Could not find a valid certificate file: The host cert could not be found in:
1) env. var. X509_USER_CERT
2) /etc/grid-security/hostcert.pem
3) $GLOBUS_LOCATION/etc/hostcert.pem
4) $HOME/.globus/hostcert.pem

The host key could not be found in:
1) env. var. X509_USER_KEY
2) /etc/grid-security/hostkey.pem
3) $GLOBUS_LOCATION/etc/hostkey.pem
4) $HOME/.globus/hostkey.pem


Attempt 2
globus_credential: Error reading proxy credential
globus_sysconfig: Could not find a valid proxy certificate file location
globus_sysconfig: Error with key filename
globus_sysconfig: File does not exist: /tmp/x509up_u0 is not a valid file
Attempt 3
globus_credential: Error reading user credential
globus_sysconfig: Error with certificate filename: The user cert could not be found in:
1) env. var. X509_USER_CERT
2) $HOME/.globus/usercert.pem
3) $HOME/.globus/usercred.p12

07/29/23 13:32:46 SECMAN: required authentication with <132.230.223.239:9618> failed, so aborting command ACT_ON_JOBS.
07/29/23 13:32:46 DCSchedd::actOnJobs: Failed to send command (ACT_ON_JOBS) to the schedd
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5003:Failed to authenticate.  Globus is reporting error (851968:50).  There is probably a problem with your credentials.  (Did you run grid-proxy-init?)
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using FS
Couldn't find/remove all jobs in cluster 45013273

I will investigate this further next week, see how to achieve the expected outcome, and migrate this role with a proper fix.

condor_rm on the maintenance node does not work. Needs further investigation.
@sanjaysrikakulam
Member Author

@mira-miracoli, @sj213 Do you already have any experience with this?

@mira-miracoli
Contributor

From the SchedLog I can see that your condor tries to access sn05; I think the old IP for condor-cm.galaxyproject.eu is still cached. Maybe you can try to flush the DNS cache.

@sanjaysrikakulam
Member Author

> From the SchedLog I can see that your condor tries to access sn05; I think the old IP for condor-cm.galaxyproject.eu is still cached. Maybe you can try to flush the DNS cache.

Thank you! I once tried to run it with condor-cm and then tried sn06 as the sched name, which is why you might be seeing condor-cm in the log. Also, DNS is fine because pinging the name resolves to sn06's IP, so it's resolving properly. I suspect that I need to ALLOW the maintenance node to WRITE or to ADMIN other schedulers. This needs to be updated in the sn06 condor_config. I am looking for the right macro/setting to modify.

@bgruening
Member

@sj213 should know I think. He has one other submit node configured already.

@mira-miracoli
Contributor

The permissions on the headnode are fine, I think. But as long as the Sched is not actually connecting to it but to sn05, we cannot know for sure.

@kysrpex left a comment
Contributor


Let's keep this on sn06 then for the time being?

@sanjaysrikakulam
Member Author

> Let's keep this on sn06 then for the time being?

It's currently on sn06 (deployed manually).

@kysrpex merged commit 48a2ba6 into usegalaxy-eu:master Aug 1, 2023
2 checks passed
@sj213
Contributor

sj213 commented Sep 8, 2023

> Thank you! I once tried to run it with condor-cm and then tried sn06 as the sched name, which is why you might be seeing condor-cm in the log. Also, DNS is fine because pinging the name resolves to sn06's IP, so it's resolving properly. I suspect that I need to ALLOW the maintenance node to WRITE or to ADMIN other schedulers. This needs to be updated in the sn06 condor_config. I am looking for the right macro/setting to modify.

Using -name sn06.galaxyproject.eu is probably the right thing to do in this case, but as you correctly stated, the maintenance node apparently lacks the requisite access privileges. The relevant macros are ALLOW_WRITE and ALLOW_ADMINISTRATOR, as defined in sn06:/etc/condor/condor_config.local (not sure OTTOMH whether or not this file is maintained by Jenkins).
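
A minimal sketch of what that could look like in sn06:/etc/condor/condor_config.local; the maintenance node's FQDN and the existing values of these macros are assumptions here:

# Hypothetical excerpt of /etc/condor/condor_config.local on sn06.
# Append the maintenance node to the existing write/administrator lists;
# run condor_reconfig afterwards for the change to take effect.
ALLOW_WRITE = $(ALLOW_WRITE) maintenance.galaxyproject.eu
ALLOW_ADMINISTRATOR = $(ALLOW_ADMINISTRATOR) maintenance.galaxyproject.eu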

I don't know if giving the host write privilege is sufficient or if admin access is needed; in the spirit of the principle of least privilege I'd try it with write access only first.

Also note that according to condor_rm(1)

> For any given job, only the owner of the job or one of the queue super users (defined by the QUEUE_SUPER_USERS macro) can remove the job.

so make sure that the command is issued by user #999.

@sj213
Contributor

sj213 commented Sep 8, 2023

> I don't know if giving the host write privilege is sufficient or if admin access is needed; in the spirit of the principle of least privilege I'd try it with write access only first.

Actually, if admin access were generally required for removal of jobs from the queue, users would not be able to remove their own jobs, so admin access should only be required in the case of an owner ID mismatch. So maybe just becoming user galaxy on the host before running condor_rm is sufficient, given that the network 10.5.68.0/24 is already allowed write access.
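
A quick sketch of that approach, assuming the galaxy account exists on the maintenance node (the job ID is just the example from above):

# On the maintenance node, run condor_rm as the job owner against the remote schedd.
sudo -u galaxy condor_rm -name sn06.galaxyproject.eu \
    -reason "Removed after repeated resubmission (out of memory)." 45013273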

@sj213
Contributor

sj213 commented Sep 8, 2023

It would appear that Condor does not trust remote user IDs w/o strong authentication. Which is the right thing to do, of course, but it means we have to dig deeper into the issue of authentication. (Which we'll have to do anyway when transitioning to Condor >= v9.x)

@sanjaysrikakulam
Member Author

@sj213 since we have migrated to the latest version, do you think this might be possible now?

@sj213
Contributor

sj213 commented Feb 7, 2024

> @sj213 since we have migrated to the latest version, do you think this might be possible now?

Token authentication for admin-level access should work now in principle, but I'm not sure about the gory details. AFAIK we're ATM using two levels of privilege, one associated with the user condor (used for daemon-to-daemon auth) and another with the user galaxy (for submitter-to-daemon). Giving the relevant processes on the maintenance host access to a condor-user token should IMO do the trick of authenticating as an admin-level user. Whether additional macros need to be configured on the CM I don't know OTTOMH, although I'd expect that the admin user is authorized to remove any job by default.
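
For reference, a rough sketch of how such a token could be minted and installed; the identity, trust domain and target directory are assumptions and depend on our actual configuration:

# On the central manager (or any host holding the signing key), mint a token
# for the condor identity; the trust-domain suffix here is an assumption.
condor_token_create -identity condor@galaxyproject.eu > condor_admin.token

# On the maintenance node, install it where the tools/daemons look for tokens,
# e.g. /etc/condor/tokens.d/ for daemons or ~/.condor/tokens.d/ for a user.
install -m 0600 condor_admin.token /etc/condor/tokens.d/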

@sanjaysrikakulam
Member Author

It works! After fixing the firewall issues (PR1 and PR2), I created a test job on the maintenance node and was then able to remove that job from the headnode (container) by running condor_rm -name maintenance.galaxyproject.eu 563 (format: <command> -name <scheduler name> <condor job id>).

@bgruening
Member

Can -name be a wildcard? Or can we specify two? At some point we will hopefully have two submit/head nodes.

@sanjaysrikakulam
Member Author

> Can -name be a wildcard? Or can we specify two? At some point we will hopefully have two submit/head nodes.

I am testing exactly those things and trying to see what would work. Currently I am trying to filter by the scheduler name htcondor, which seems to be the name of the scheduler on the sn06 container, but filtering jobs based on that name does not seem to work, as it is not recognized. I will update soon (will figure something out).
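
One way to discover the schedd names that -name actually accepts is to ask the collector; a small sketch (the attribute selection is just an example):

# List all schedds known to the collector, together with their advertised names.
condor_status -schedd

# The same, but only print the Name attribute that condor_rm/condor_q expect after -name.
condor_status -schedd -af Name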

@mira-miracoli
Contributor

As I understand it, we will then also have two separate Condor ID counters for the schedulers sn06 and sn07.
Maybe we should think about whether that could break something when different Galaxy jobs get the same Condor/runner ID.

@bgruening
Member

I hope this is not true. It's the same cluster, so why would the same cluster have multiple identical IDs? Would the JWDs, logs, etc. potentially clash?

@sj213
Contributor

sj213 commented Feb 16, 2024

> I hope this is not true. It's the same cluster, so why would the same cluster have multiple identical IDs?

This is by design in Condor: each submit host has its very own queue along with its own set of job IDs; the CM tells them apart by internally qualifying job IDs with the name of the submitter host. (Actually, job IDs are "cluster IDs" in Condor parlance, as each job is potentially a whole "cluster" of subjobs. "Cluster" here does not refer to the compute hardware but to a software abstraction that groups a collection of (sub-)jobs into one scheduling unit, go figure... To minimize confusion, I'll continue to refer to Condor "cluster IDs" as "job IDs".)

You can check this for yourself with condor_q -global on sn06: this command currently lists two queues, one for the Galaxy instance on sn06 and another one for the RNA server. When put into service, sn07 will add another queue with job IDs initially starting at 1.0 and counting upwards.

Whether or not this is a problem for Galaxy in a setup with multiple frontend nodes, each using a Condor backend, depends on how Galaxy uses the job scheduler's IDs internally. Does it somehow embed the job IDs in workflow IDs or other internal monikers? That would be potentially problematic, but I don't think Galaxy does this. In any case, this should be looked at in detail before putting a second headnode into production.
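
To make the ID qualification concrete, a small sketch (the schedd name is taken from earlier in this thread; the attribute selection is just an example):

# One queue (and one independent cluster-ID counter) per schedd:
condor_q -global

# GlobalJobId embeds the schedd name, so it stays unique across the pool
# even when two schedds happen to hand out the same ClusterId:
condor_q -name sn06.galaxyproject.eu -af GlobalJobId ClusterId ProcId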

> Would the JWDs, logs, etc. potentially clash?

This question actually goes beyond the Condor job ID issue. It is of course of vital importance that the dataset and workflow IDs generated by both headnodes do not clash, regardless of which compute backend is used; similar concerns arise wrt the data store and the DB. The real question is in fact whether two instances of Galaxy can actually use the same PG database R/W (and also the JWDs / data store) without wreaking havoc on data integrity, i.e. whether Galaxy is designed for concurrent usage of resources by multiple instances. For if it isn't, separate instances of the PG database and JWDs would be required for each frontend node, making the setup more involved.
