Can Pypiper generate a DAG to guide the execution of commands that comprise a pipeline? #189
Comments
Hey @zhangzhen, thanks for this question and idea. I'm not currently developing on pypiper. I could add, though, that I'd love this feature (DAG-like declaration of the relationships among the pipeline's steps, and then automatic concurrent execution where possible, based on that structure) and would definitely use it! If you're interested in prototyping, I think a PR would be welcome, certainly by me and I think probably by the maintainers, though it's a question for @nsheff.
You are right that pypiper is really intended to run sequentially. Our mode of operating is to parallelize by sample, rather than by task within a pipeline. This has lots of advantages and a few disadvantages, but for most of the analysis we're doing it makes a lot of sense, and you won't gain much (if any) efficiency by parallelizing by task if you're already parallelizing effectively by sample. Making your pipeline parallel by task can also add complexity to the pipeline, so it isn't always worth it. That said, you can still make a pipeline parallelize tasks in pypiper if you need to; it's just not a built-in, recommended thing to do. If you want some guidance on how to do it, let me know and I can show you.
And to directly answer your question: I am not planning to add parallelizing by task like this. But if you want to add it, I would consider a PR, as long as it was a simple solution that didn't complicate the codebase too much.
I've built bioinformatics pipelines for NGS testing in clinical oncology for more than 5 years. Pyflow and Nextflow are the pipeline frameworks I use most of the time. Pyflow is lightweight and does well at sample-level analysis, while Nextflow is heavyweight and does well at batch-level analysis. However, they both adopt a monolithic approach that makes them do more than they should. The modular approach you came up with is the better way to build pipeline frameworks. I love the philosophy behind tools such as looper, pypiper, and bulker, and it inspires me. Moreover, one of your posts helped me form a clearer picture of parallelism in bioinformatics. It's a bit of a pity that I only learned of the work you and your lab have done a few days ago.
Pipelines in clinical oncology do indeed have such needs. After read mapping, variant calling (SNV/INDEL calling, CNV calling, SV calling, etc.) and QC are often performed simultaneously. Hey @nsheff, could you please show me how to parallelize tasks within a pipeline? Thanks a lot!
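For illustration only, here is a minimal sketch of the DAG-guided scheduling discussed in this thread, applied to the mapping-then-calling case above. It does not use pypiper's API and is not how pypiper works; it relies only on the Python standard library (`graphlib`, Python 3.9+). The function `run_dag`, the step names, and the `echo` placeholder commands are all hypothetical.

```python
import subprocess
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_dag(commands, deps, max_workers=4):
    """Run shell commands concurrently while honouring their dependencies.

    commands: dict mapping step name -> shell command string
    deps:     dict mapping step name -> set of step names it depends on
    Returns the step names in the order they finished.
    (Hypothetical helper, not part of pypiper.)
    """
    sorter = TopologicalSorter({name: deps.get(name, set()) for name in commands})
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    finished = []
    running = {}  # future -> step name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every step whose dependencies have all completed.
            for name in sorter.get_ready():
                future = pool.submit(
                    subprocess.run, commands[name], shell=True, check=True
                )
                running[future] = name
            # Block until at least one running step finishes.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                future.result()  # re-raises if the command failed
                sorter.done(name)  # unlocks steps that depend on this one
                finished.append(name)
    return finished


# Placeholder commands; real pipelines would call the actual tools here.
commands = {
    "map_reads": "echo map_reads",
    "call_snv_indel": "echo call_snv_indel",
    "call_cnv": "echo call_cnv",
    "call_sv": "echo call_sv",
    "qc": "echo qc",
}
# Every downstream step depends only on read mapping.
deps = {name: {"map_reads"} for name in commands if name != "map_reads"}

order = run_dag(commands, deps)
```

In this sketch `map_reads` always finishes first, and the four downstream steps are then submitted together, so they can run at the same time. Threads are sufficient because the real work happens in the child processes started by `subprocess.run`.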
As far as I know, Pypiper runs the commands of a pipeline sequentially, even if some of them could be run concurrently. Do you plan to support concurrent execution in the near future?
Cheers,
Zhen Zhang