JobRequirements RFC
RFC #7
Authors: S.Poss, A.Tsaregorodtsev
Last Modified: 9.03.2013
User jobs can specify various requirements that resources must satisfy to be eligible for executing the job. In most cases these requirements are specified as reserved keywords in the job description ( JDL ). The keywords are:
- Site
- BannedSite
- Platform
- CPUTime
If these parameters are specified in the job description, they are added to the definition of the corresponding Task Queue ( TQ ). Pilots provide a resource description, which the Matcher service matches against the TQs.
The standard matching mechanism described above is very efficient but also rather limited. Not all requirements can be expressed in terms of predefined job parameters. New activities can require new resource specifications that users' jobs can request. Examples are preinstalled software tags, specific services available at a site ( e.g. databases ), memory available for jobs, CPU models, etc. It is therefore necessary to be able to add further, non-predefined characteristics to the resources ( sites, CEs, queues ) that can be used in job requirements without changing the code or the schema of the TQ database. This RFC presents a proposal for such a mechanism.
As described in RFC #5, the new Resources description schema in the CS allows arbitrary parameters to be specified at each level ( Site, Computing Element, Queue ), either valid for all the VOs that can use the resources or valid for a given VO only. These parameters can be used to select a set of resources matching a given specification with the Resources Configuration helper. For example:
result = Resources( vo = vo ).getEligibleNodes( 'Queue', ceSelectDict, queueSelectDict )
The result is a tree structure ( nested dictionaries ) starting from the Site level. From this structure one can construct the lists of Sites, CEs and Queues that are eligible with respect to the given selection criteria. It is important to note that queues inherit parameters from their computing elements and sites, and computing elements inherit parameters from their sites. A queue can override a value specified at the computing element or site level if the queue value is more specific than that of its parent.
In the job description, which is usually written as a JDL but can also be a new CFG-format-based structure, the specific job requirements are described in a special section called Requirements. For example:
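The inheritance rule can be illustrated with a minimal sketch. It is not the Resources helper itself: the function name `effective_queue_parameters` and the sample parameter values are hypothetical, and the sketch simply assumes that a value set at a lower level always counts as "more specific" and therefore wins.

```python
def effective_queue_parameters(site_params, ce_params, queue_params):
    """Merge parameters so that more specific levels override less specific ones.

    Assumption for this sketch: any value defined at a lower level
    (CE, then Queue) replaces the value inherited from the level above.
    """
    merged = dict(site_params)    # start from the site-level parameters
    merged.update(ce_params)      # CE values override site values
    merged.update(queue_params)   # queue values override CE and site values
    return merged


# Hypothetical parameter sets for one site, one CE and one queue
site = {"Platform": "Linux_x86_64", "Memory": 2000}
ce = {"CPUModel": "Intel Xeon"}
queue = {"Memory": 4000}

print(effective_queue_parameters(site, ce, queue))
```

Here the queue keeps the site's Platform and the CE's CPUModel, while its own Memory value replaces the one defined at the site level.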
[
Executable = "my_executable";
...
Requirements = [
SoftwareTag = { "AppVersion1","AppVersion2" };
CPUModel = "Intel Xeon";
Memory = 4000;
]
]
The requirements are interpreted according to the following rules:
- if a parameter is specified in the job Requirements section and is not present in the resource description, the resource is not eligible for the job;
- if the parameter value is a list, resources having one of these values in their description are eligible;
- parameters can have numerical values that should be compared to the resource capacity, e.g. Memory in the example above. The parameter type can be deduced from the parameter description in the JobDescription section of the <Operations> configuration. In this case, the resource capacity should be greater than the job requirement.
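The three rules above can be sketched as a small predicate. This is an illustration only: `is_eligible`, `as_list` and the `numeric_params` argument are hypothetical names, and the set of numerical parameters is passed in by hand here, whereas the RFC deduces the type from the JobDescription section of the configuration.

```python
def as_list(value):
    """Wrap a scalar in a list so that scalars and lists are handled alike."""
    return list(value) if isinstance(value, (list, tuple, set)) else [value]


def is_eligible(requirements, resource, numeric_params=("Memory",)):
    """Return True if the resource description satisfies all job requirements."""
    for name, wanted in requirements.items():
        # Rule 1: a required parameter missing from the resource disqualifies it
        if name not in resource:
            return False
        offered = resource[name]
        if name in numeric_params:
            # Rule 3: numerical capacity must be greater than the requirement
            # (strict comparison, following the wording of the text)
            if not float(offered) > float(wanted):
                return False
            continue
        # Rule 2: one common value between job and resource is enough
        if not set(as_list(wanted)) & set(as_list(offered)):
            return False
    return True


requirements = {
    "SoftwareTag": ["AppVersion1", "AppVersion2"],
    "CPUModel": "Intel Xeon",
    "Memory": 4000,
}
queue_a = {"SoftwareTag": ["AppVersion2", "AppVersion3"], "CPUModel": "Intel Xeon", "Memory": 8000}
queue_b = {"CPUModel": "Intel Xeon", "Memory": 8000}  # no SoftwareTag at all

print(is_eligible(requirements, queue_a))  # True
print(is_eligible(requirements, queue_b))  # False
```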
The job requirements are applied in two steps. First, a list of eligible queues is evaluated and stored in the database as a SubmitPool object. Second, each pilot is instrumented with a method to determine the list of SubmitPools that it is eligible for. The matching itself is then done by the standard mechanism, using the SubmitPool TQ parameter.
When a new job arrives at the JobManager, it goes through a chain of Optimizer executors which check the job validity and prepare its entry into the Task Queue. If the job specifies Requirements as described above, a special executor performs the following actions:
- extracts the job requirements, orders them alphabetically and creates a unique hash of the given set of requirements. This hash value will be referred to as the SubmitPool identifier;
- the job requirements are stored in a special SubmitPools table ( see below ) with the hash as a primary key;
- eligible queues and sites are obtained for the given requirements using the Resources helper as described above;
- the queues eligible to this set of job requirements are stored in a special table which defines the SubmitPool to queues correspondence, where the queue is defined by its Site/CEName/QueueName set of names;
- the list of eligible sites can be added to the Site job description parameter;
- the SubmitPool parameter of the job description is set to the identifier corresponding to the job requirements.
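The first step of this list, deriving an identifier from the ordered requirements, might look as follows. This is a sketch under stated assumptions: the RFC does not name a hash function or a serialization, so the use of SHA-256 over a JSON string with sorted keys, and the function name `submit_pool_id`, are choices made here for illustration.

```python
import hashlib
import json


def submit_pool_id(requirements):
    """Build a deterministic identifier for a set of job requirements."""
    # Sort list values too, so that {"A", "B"} and {"B", "A"} give the same hash
    normalized = {
        name: sorted(value) if isinstance(value, (list, tuple, set)) else value
        for name, value in requirements.items()
    }
    # sort_keys orders the requirement names alphabetically
    canonical = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first = {"SoftwareTag": ["AppVersion1", "AppVersion2"], "CPUModel": "Intel Xeon", "Memory": 4000}
second = {"Memory": 4000, "CPUModel": "Intel Xeon", "SoftwareTag": ["AppVersion2", "AppVersion1"]}

# Both descriptions express the same requirements, so they share one SubmitPool
print(submit_pool_id(first) == submit_pool_id(second))
```

Because jobs with identical requirements map to the same identifier, they end up in the same SubmitPool regardless of the order in which the user wrote the parameters.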
There are two cases: pilots submitted by SiteDirectors and pilots submitted by TaskQueueDirectors. In both cases the Matcher service interface getSubmitPoolsForQueue() is used. However, the SiteDirector case puts less load on the Matcher service, because a single call is made for a whole batch of submitted pilots.
SiteDirectors submit pilots to a specific queue. At submission time they query the Matcher service with getSubmitPoolsForQueue(), which returns all the SubmitPools ( if any ) defined for the queue. The resulting list of SubmitPools is included in the pilot arguments at submission, and the pilot passes it directly to the Matcher when asking for jobs.
When pilots are submitted by TaskQueueDirectors, there is no way to know in advance which queue they will run in. Therefore, such pilots make a getSubmitPoolsForQueue() query to the Matcher service themselves, presenting the queue that they happen to run in. Once the list of SubmitPools is obtained, it can be used in the job requests.
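The difference between the two cases can be sketched with an in-memory stand-in for the Matcher. Only the method name getSubmitPoolsForQueue() comes from this RFC; its argument list, the `ToyMatcher` class, and the site, CE, queue and pool names below are all hypothetical.

```python
class ToyMatcher:
    """In-memory stand-in for the Matcher service (illustration only)."""

    def __init__(self, pool_to_queues):
        # SubmitPool identifier -> set of (site, ce, queue) name tuples
        self.pool_to_queues = pool_to_queues
        self.calls = 0

    def getSubmitPoolsForQueue(self, site, ce, queue):
        """Return the identifiers of all SubmitPools defined for the queue."""
        self.calls += 1
        key = (site, ce, queue)
        return sorted(p for p, queues in self.pool_to_queues.items() if key in queues)


long_queue = ("LCG.Example.org", "ce01.example.org", "long")
short_queue = ("LCG.Example.org", "ce01.example.org", "short")
matcher = ToyMatcher({"pool-a": {long_queue}, "pool-b": {long_queue, short_queue}})

# SiteDirector case: one query per queue, reused for the whole batch of pilots
pools = matcher.getSubmitPoolsForQueue(*long_queue)
pilot_args = [{"SubmitPools": pools} for _ in range(3)]

# TaskQueueDirector case: every pilot queries once it knows where it runs
for _ in range(3):
    matcher.getSubmitPoolsForQueue(*short_queue)

print(pools, matcher.calls)
```

The call counter makes the load argument concrete: three pilots cost one query in the first case and three queries in the second.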
The proposed solution assumes that the properties of the computing resources do not change often, so only static parameters can be used. Site parameters can nevertheless change after jobs are submitted but before they are picked up by pilots. In this case, either a resource that is no longer eligible can pick up a job, which will likely fail, or an eligible resource will not be able to pick up a job. To minimize failures of this kind, the SubmitPool to queue mapping, which is stored in the TaskQueueDB, should be refreshed at regular intervals ( for example every 15 minutes ). The SubmitPool definition in the database should carry a time stamp of its last update.
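A periodic refresh driven by the time stamp could be sketched as below. The dictionary layout ( "Requirements", "Queues", "LastUpdate" ), the function name `refresh_stale_pools` and the `resolve_queues` callback are assumptions made for this example; in the proposal the mapping lives in the TaskQueueDB and the eligible queues come from the Resources helper.

```python
import time

REFRESH_INTERVAL = 15 * 60  # seconds, the example interval given in the text


def refresh_stale_pools(pools, resolve_queues, now=None):
    """Recompute the queue list of every SubmitPool whose time stamp is too old.

    pools maps a SubmitPool identifier to a dict holding its requirements,
    its eligible queues and the time of its last update.
    """
    now = time.time() if now is None else now
    refreshed = []
    for pool_id, pool in pools.items():
        if now - pool["LastUpdate"] >= REFRESH_INTERVAL:
            # Re-evaluate eligibility against the current resource description
            pool["Queues"] = resolve_queues(pool["Requirements"])
            pool["LastUpdate"] = now
            refreshed.append(pool_id)
    return refreshed
```

Passing `now` explicitly keeps the function easy to test; a real agent would simply call it on each cycle with the current time.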
When no more jobs corresponding to a given SubmitPool are available in the TaskQueue, the SubmitPool should be removed from the database. This need not be done as soon as the last job defining the SubmitPool is picked up ( it can be rescheduled soon after ). Rather, the SubmitPools can be cleaned up by a dedicated agent running at regular intervals.
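One cycle of such a cleaning agent might look like the sketch below. The grace period, the "LastUsed" field and the function name `clean_submit_pools` are assumptions introduced here to express "not necessarily as soon as the last job is picked up"; the RFC itself only asks for a periodic cleanup.

```python
def clean_submit_pools(pools, waiting_jobs, now, grace_period=3600):
    """Remove SubmitPools that have had no waiting jobs for a while.

    pools        : SubmitPool identifier -> dict with a "LastUsed" time stamp
    waiting_jobs : SubmitPool identifier -> number of jobs waiting in the TQ
    grace_period : seconds a pool may stay empty before removal (assumed value)
    """
    removed = []
    for pool_id in list(pools):  # copy the keys, the dict is modified below
        if waiting_jobs.get(pool_id, 0) > 0:
            # The pool is still in use, so remember when it was last needed
            pools[pool_id]["LastUsed"] = now
        elif now - pools[pool_id]["LastUsed"] > grace_period:
            # Empty for longer than the grace period: safe to drop
            del pools[pool_id]
            removed.append(pool_id)
    return removed
```

A pool whose last job was just picked up survives until the grace period expires, so a rescheduled job finds its SubmitPool still defined.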