Splitter Executor
RFC #15
Author: A.Casajús
Last Modified: 2013-05-06
Up until now there have been two ways to split jobs in DIRAC:
- Client-side splitting: the client generates and submits n jobs to DIRAC
  - Pro: Client can do whatever is needed
  - Con: Takes a lot of time
  - Con: Needs n job submissions
  - Con: Can't cache results
- JobManager splitting: as soon as DIRAC's JobManager gets the job, it divides the job as required and returns a list of job ids
  - Pro: Client doesn't have to do the splitting
  - Pro: Client receives all the job ids at submission time
  - Con: Takes a lot of time and slows other submissions
  - Con: Very restrictive on what can be done
  - Con: Difficult to extend the functionality
We'd like to get the best of both worlds: fast job submission, extensible splitting mechanisms, and the ability to use knowledge that DIRAC already has to speed up the splitting.
Users should be able to send a job, define how it has to be divided and let DIRAC do the splitting. The job manifest should include all the necessary information for DIRAC to know how to split it. But instead of the JobManager dividing the job, it should be stored in the JobDB and divided asynchronously.
DIRAC will not know a priori how many jobs will be generated. That means that users will not know at submission time all the job ids. But they will be able to request that information once the job has been divided.
Job splitting will be done by dedicated modules, each of which divides the job in a different way. Users will define which module they want DIRAC to use when dividing their job. For instance, one module can generate parametric jobs just like the JobManager does, while another can divide Input Data based on where it is stored. Any DIRAC extension can provide its own modules if needed.
When a job is received by the JobManager, it will do minimal checks, store it into the JobDB and return the resulting job id. It will not do any splitting and will always return one job id if the submission has been successful.
Jobs will define which module has to be used when splitting them by defining a Splitter option in the manifest. The value of this option is the name of the module to use. Splitter=Parametric will use the ParametricSplitter module to divide it.
Since each module can require the job to be in a different step of the optimization chain, they have to define after which optimizer they can work. For instance, the ParametricSplitter can divide the job right after the JobPath since it does not require any extra information. On the other hand a splitter module that requires Input Data information will have to run after the InputDataResolution optimizer.
Once the JobPath optimizer receives a job, it will check whether the Splitter option is defined in the manifest. If it is not, the job will proceed as usual. If it is, the JobPath will load the requested module (ParametricSplitter in the example), check after which optimizer it can work, and insert the new Splitter optimizer at that position. As an example:
For Jobs with Splitter=Parametric, since ParametricSplitter can work after the JobPath optimizer:
Before:
OptimizationChain: JobPath -> JobSanity -> JobScheduling
After:
OptimizationChain: JobPath -> Splitter -> JobSanity -> JobScheduling
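The chain rewrite above can be sketched in a few lines of Python. This is only an illustration, not DIRAC code: the `SPLITTER_RUNS_AFTER` table, the `insert_splitter` function, and the `InputDataBySe` option value are hypothetical names, and in the real design each module would declare its own position rather than being listed in a central table.

```python
# Hypothetical table: Splitter option value -> optimizer the module must run after.
# In the RFC each splitter module declares this itself; a dict keeps the sketch short.
SPLITTER_RUNS_AFTER = {
    "Parametric": "JobPath",
    "InputDataBySe": "InputDataResolution",  # assumed option name
}


def insert_splitter(chain, manifest):
    """Return a new optimization chain with Splitter inserted if requested."""
    splitter = manifest.get("Splitter")
    if splitter is None:
        # No Splitter option: the job proceeds as usual.
        return list(chain)
    after = SPLITTER_RUNS_AFTER[splitter]
    position = chain.index(after) + 1
    return chain[:position] + ["Splitter"] + chain[position:]


chain = ["JobPath", "JobSanity", "JobScheduling"]
print(" -> ".join(insert_splitter(chain, {"Splitter": "Parametric"})))
# JobPath -> Splitter -> JobSanity -> JobScheduling
```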
Once the job reaches the Splitter optimizer, the requested splitting module will be loaded and used to split the job manifest into as many manifests as needed. Then all the manifests will be sent to the Optimization Mind that will generate all the required jobs and start them into the optimization chain.
We need a way to track down all the jobs in a herd (all the jobs that come from a split). To do so we will use the MasterJobID and JobSplitType attributes. By default all jobs have their own job ids as MasterJobID and have JobSplitType=Single. If the Splitter parameter is defined, the JobManager will set the JobSplitType to WillSplit.
Once a list of manifests comes back from the Splitter optimizer, the Optimization Mind will generate all the required jobs. All the new jobs will have the generating job id as their MasterJobID. The generating manifest will be stored as a MasterJDL in the JobDB. Once all jobs have been stored, the originating job's JobSplitType will be changed to Splitted.
This way users can know whether their job has already been split (WillSplit vs Splitted in JobSplitType), and they can retrieve all the jobs in the herd just by knowing any job id, because all the jobs that belong to the same herd have the same MasterJobID.
An important note: the originating job will become the first of the new set of jobs. It will be replaced completely. For instance, if the splitter has returned three manifests, the first one will be assigned to the originating job and two new jobs will be generated. This prevents the same herd of jobs from being regenerated if the originating job gets rescheduled. It also means we don't need to invent a new set of states for this job, because there are no special cases.
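The bookkeeping described above can be sketched as follows. This is a toy in-memory model, not the JobDB schema: the dictionary layout and the `submit`, `apply_split`, and `herd_of` functions are hypothetical, and giving the newly created jobs `JobSplitType=Single` is an assumption, since the text only specifies the type of the originating job.

```python
def submit(job_db, job_id, manifest):
    """Store one job; it is its own master until it is split."""
    split_type = "WillSplit" if "Splitter" in manifest else "Single"
    job_db[job_id] = {
        "Manifest": manifest,
        "MasterJobID": job_id,
        "JobSplitType": split_type,
    }
    return job_id


def apply_split(job_db, origin_id, manifests):
    """Replace the originating job with the first manifest and create the rest."""
    origin = job_db[origin_id]
    origin["MasterJDL"] = origin["Manifest"]  # keep the generating manifest
    origin["Manifest"] = manifests[0]         # originating job becomes job #1
    next_id = max(job_db) + 1
    herd = [origin_id]
    for manifest in manifests[1:]:
        job_db[next_id] = {
            "Manifest": manifest,
            "MasterJobID": origin_id,
            "JobSplitType": "Single",  # assumption: not specified in the text
        }
        herd.append(next_id)
        next_id += 1
    # Flip the type only after every job of the herd has been stored.
    origin["JobSplitType"] = "Splitted"
    return herd


def herd_of(job_db, job_id):
    """Return every job sharing the MasterJobID of the given job."""
    master = job_db[job_id]["MasterJobID"]
    return sorted(j for j, rec in job_db.items() if rec["MasterJobID"] == master)


job_db = {}
submit(job_db, 41, {"Splitter": "Parametric"})
print(apply_split(job_db, 41, [{"n": 0}, {"n": 1}, {"n": 2}]))  # [41, 42, 43]
print(herd_of(job_db, 43))                                      # [41, 42, 43]
```

Because the originating job keeps its id and simply receives the first manifest, rescheduling it later finds no Splitter option to act on, so the herd is not generated twice.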
Each splitter module can do different things. For instance, the ParametricSplitter will just make a list of jobs based on a list of parameters, while the InputDataBySeSplitter will take the list of Input Data and generate jobs with a maximum number of Input Data files in the same SE. These changes will be reflected in the manifest, and users need to define where the modifications are applied. To do so, the manifest has to include minimal variable substitution functionality. Splitters will define new options in the manifest, and users will define where these options are used. As an example:
- Originating manifest
Executable = "/bin/echo";
JobName = "parametric_${SplitSourceJob}:${SplitID}";
Arguments = "$Parameter";
Parameters = 10;
ParameterStep = 1;
ParameterFactor = 1.3;
StdOutput = "StdOut_$SplitID";
StdError = "StdErr_$SplitID";
OutputSandbox = { "StdOut_$SplitID","StdErr_$SplitID" };
Splitter = Parametric;
- Generated manifest after ParametricSplitter and variable expansion
Executable = "/bin/echo";
Arguments = 3.99;
JobName = "parametric_41:02";
StdOutput = "StdOut_02";
StdError = "StdErr_02";
OutputSandbox = { "StdOut_02", "StdErr_02" };
JobID = 43;
Parameter = 3.99;
SplitID = 02;
SplitSourceJob = 41;
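The variable expansion shown in the two manifests can be sketched with Python's `string.Template`, which happens to accept both the `$Name` and `${Name}` forms used above. Everything here is illustrative: `parametric_values` and `expand` are hypothetical helpers, manifests are modelled as plain dictionaries, and the recurrence `next = current * ParameterFactor + ParameterStep` with a starting value of 1 is an assumption chosen only because it reproduces the 3.99 of the example for the third job.

```python
from string import Template


def parametric_values(count, step, factor, start=1.0):
    """Assumed recurrence: next = current * factor + step (not taken from the RFC)."""
    values, current = [], start
    for _ in range(count):
        values.append(current)
        current = current * factor + step
    return values


def expand(manifest, variables):
    """Substitute $Name / ${Name} in every string value, then add the variables."""
    def substitute(value):
        if isinstance(value, str):
            # safe_substitute leaves unknown placeholders untouched.
            return Template(value).safe_substitute(variables)
        if isinstance(value, list):
            return [substitute(item) for item in value]
        return value

    expanded = {key: substitute(value) for key, value in manifest.items()}
    expanded.update(variables)
    return expanded


template = {
    "Executable": "/bin/echo",
    "JobName": "parametric_${SplitSourceJob}:${SplitID}",
    "Arguments": "$Parameter",
    "StdOutput": "StdOut_$SplitID",
    "OutputSandbox": ["StdOut_$SplitID", "StdErr_$SplitID"],
}

values = parametric_values(count=10, step=1, factor=1.3)
index = 2  # third job of the herd, SplitID 02
variables = {
    "Parameter": f"{values[index]:.2f}",
    "SplitID": f"{index:02d}",
    "SplitSourceJob": "41",
}
job = expand(template, variables)
print(job["JobName"])        # parametric_41:02
print(job["Arguments"])      # 3.99
print(job["OutputSandbox"])  # ['StdOut_02', 'StdErr_02']
```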