## Section 11: Fault Tolerance

In compute environments such as the ones used in this work, fault tolerance needs to be considered at several levels. At the infrastructure level, network issues affecting jobs launched on the cloud must ultimately be managed by the cloud provider. In addition, avoiding preemptible VMs (GCP), spot instances (AWS), and low-priority VMs (Azure), although they are substantially cheaper, reduces the risk of failures in the middle of job execution.

The next level of fault tolerance is handled by the workflow manager or job execution engine, where a job failure can be attributed to a multitude of causes, including sudden network issues, invalid input files, or bugs in the software tool. On GCP, we use dsub as the job scheduler, which can rerun a failed job up to the number of times given by the user in the optional "retries" parameter. On AWS Batch, reruns of failed jobs are governed by the "retryStrategy" specified in the job definition, implemented either with "attempts" alone (default: one attempt) or with "evaluateOnExit", which works in tandem with "attempts" but retries the job only under the specified conditions. Similarly, Azure Batch distinguishes several categories of error handling, including task errors such as non-zero task exit codes and file upload errors, which can be combined with "maxTaskRetryCount" to rerun a task up to the user-specified limit in addition to the default initial attempt. Illustrative sketches of these settings are given at the end of this section.

Application failures can be tackled using the standard output, standard error, or other log files generated by prior job executions, together with the level of granularity that the cloud provider offers in its job definitions.

Lastly, for network failures that occur during execution of the bioinformatics pipeline, job control mechanisms can prevent steps that already completed successfully from being rerun. These mechanisms include job dependency parameters that operate at different levels of detail depending on the cloud provider; one example is the "skip" parameter of GCP's dsub, which skips a job if its outputs already exist at the specified location (also shown in the dsub sketch below).
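
As a rough, non-authoritative sketch of the dsub options mentioned above, the invocation below asks dsub to retry a failed job and to skip resubmission when its outputs already exist; the project, bucket paths, Docker image, and command are placeholders rather than the actual pipeline configuration.

```python
import subprocess

# Minimal dsub submission sketch (GCP). "--retries 3" asks dsub to rerun a failed
# job up to three times (dsub expects "--wait" when retries are enabled), and
# "--skip" tells dsub not to resubmit if the declared --output files already exist.
# All project, bucket, image, and command values are placeholders.
subprocess.run(
    [
        "dsub",
        "--provider", "google-cls-v2",
        "--project", "my-gcp-project",
        "--regions", "us-central1",
        "--logging", "gs://my-bucket/logs/",
        "--input", "INPUT_BAM=gs://my-bucket/sample.bam",
        "--output", "OUTPUT_VCF=gs://my-bucket/sample.vcf",
        "--image", "quay.io/example/variant-caller:latest",
        "--command", "call_variants ${INPUT_BAM} > ${OUTPUT_VCF}",
        "--retries", "3",
        "--wait",
        "--skip",
    ],
    check=True,
)
```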
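
For AWS Batch, a hedged sketch of a job definition with a retry strategy might look as follows; the job definition name, container image, resources, and the particular evaluateOnExit conditions (retry on host failure, exit otherwise) are illustrative only.

```python
import boto3

batch = boto3.client("batch")

# Sketch of an AWS Batch job definition with a retry strategy: up to 3 attempts,
# where "evaluateOnExit" retries only when the status reason indicates a host
# failure (e.g. a reclaimed spot instance) and stops on any other failure.
# Job name, image, and resources are placeholders.
batch.register_job_definition(
    jobDefinitionName="variant-calling-step",
    type="container",
    containerProperties={
        "image": "quay.io/example/variant-caller:latest",
        "vcpus": 4,
        "memory": 16384,
        "command": ["call_variants", "Ref::input_bam", "Ref::output_vcf"],
    },
    retryStrategy={
        "attempts": 3,  # default would be a single attempt
        "evaluateOnExit": [
            {"onStatusReason": "Host EC2*", "action": "RETRY"},
            {"onReason": "*", "action": "EXIT"},
        ],
    },
)
```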
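
On Azure Batch, task-level retries are configured through task constraints. The sketch below, using the azure-batch Python SDK, assumes a hypothetical task id and command line and simply sets maxTaskRetryCount so the service retries the task up to three times after the initial attempt.

```python
from azure.batch import models as batchmodels

# Sketch of an Azure Batch task whose constraints let the service retry it up to
# three times in addition to the default initial attempt. Task id and command
# line are placeholders; the task would still need to be added to a job through
# a BatchServiceClient, e.g. batch_client.task.add(job_id, task).
task = batchmodels.TaskAddParameter(
    id="variant-calling-step",
    command_line="/bin/bash -c 'call_variants input.bam > output.vcf'",
    constraints=batchmodels.TaskConstraints(max_task_retry_count=3),
)
```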
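
Where the application itself fails rather than the infrastructure, the standard error and log files written by the previous attempt are the main diagnostic. As a minimal sketch, assuming dsub was configured with --logging pointing at a bucket, the stderr log of a failed task could be pulled down with the google-cloud-storage client; the bucket and object names are placeholders.

```python
from google.cloud import storage

# Minimal sketch: fetch the stderr log written by a previous (failed) task from
# the logging bucket for inspection. Bucket and object names are placeholders.
client = storage.Client()
blob = client.bucket("my-bucket").blob("logs/variant-calling-step.stderr.log")
print(blob.download_as_text())
```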