Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Samtools Customview Fails with Unavoidable Segmentation Faults #199

Open
ieres-amgen opened this issue May 31, 2023 · 4 comments
Open

Samtools Customview Fails with Unavoidable Segmentation Faults #199

ieres-amgen opened this issue May 31, 2023 · 4 comments
Assignees
Labels
bug Something isn't working enhancement Improvement for existing functionality WIP Work in progress

Comments

@ieres-amgen
Copy link

Description of the bug

The "SAMTOOLS_CUSTOMVIEW" process fails on some files due to a segmentation fault, with a core dump, suggesting a memory issue. On my most recent run I allocated 768 GB RAM with 96 CPUs to the task, and it still failed. The input file is fairly large but nothing absurd for a FASTQ, ~15 GB. On slack Harshil suggested that the command may crash due to a lot of piping and offered some ideas to resolve:

  • Splitting up that command to not use as many pipes
  • Picard CollectMultipleMetrics to get the fragment length distribution
  • Converting from BED -> BAM and then scraping, or using a simple pysam parsing script.

Thanks!

Command used and terminal output

No response

Relevant files

No response

System information

No response

@ieres-amgen ieres-amgen added the bug Something isn't working label May 31, 2023
@chris-cheshire chris-cheshire added the enhancement Improvement for existing functionality label Jun 15, 2023
@chris-cheshire
Copy link
Contributor

Hi @ieres-amgen, how technically minded are you? Are you able to test/implement any of these as a feature?

@ieres-amgen
Copy link
Author

Hi @chris-cheshire, I am not particularly fluent with nextflow, but I did try splitting the pipe up to ascertain the cause of the break, and found that the samtools view bit completes successfully, whereas a core dump/seg fault is encountered that breaks the pipe when passed to sort -T '.'. I will try splitting this command up further (to essentially use no piping) and see if that can resolve. If it still doesn't, I think I would feel more confident trying to implement an nf module of Picard CollectMultipleMetrics to get what's needed here, but would certainly be willing to try the third option if you think it'd be easier to implement or can provide some guidance.

@ieres-amgen
Copy link
Author

ieres-amgen commented Jun 30, 2023

I found that I was able to resolve by replacing the current process for collecting fragment lengths with the below implementation of Picard CollectInsertSizeMetrics. I note a custom process label for more memory, the setting of avail_mem, and the use of a ${prefix}.tmp file and tail -n+12 on that file to ensure the formatting of this module's ${prefix}.txt file will be correct for the FRAG_LEN_HIST downstream task (otherwise calc_frag_hist.py will encounter a ParserError):

process SAMTOOLS_CUSTOMVIEW {
    tag "$meta.id"
    label 'process_samtoolscustomview'

    conda (params.enable_conda ? "bioconda::picard=2.27.4" : null)
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/picard:2.27.4--hdfd78af_0' :
        'quay.io/biocontainers/picard:2.27.4--hdfd78af_0' }"

    input:
    tuple val(meta), path(bam), path(bai)

    output:
    tuple val(meta), path("*.txt"), emit: tsv
    tuple val(meta), path("*.pdf"), emit: histogram
    path  "versions.yml"          , emit: versions

    when:
    task.ext.when == null || task.ext.when

    script:
    def prefix   = task.ext.suffix ? "${meta.id}${task.ext.suffix}" : "${meta.id}"
	
	def avail_mem = 500
    """
	picard \\
        -Xmx${avail_mem}g \\
        CollectInsertSizeMetrics \\
        --INPUT $bam \\
        --OUTPUT ${prefix}.tmp \\
        --Histogram_FILE ${prefix}.pdf
		
	tail -n+12 ${prefix}.tmp > ${prefix}.txt
		
    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        picard: \$(echo \$(picard CollectInsertSizeMetrics --version 2>&1) | grep -o 'Version:.*' | cut -f2- -d:)
    END_VERSIONS
    """
	
	
	stub:
    def prefix = task.ext.prefix ?: "${meta.id}"
    """
    touch ${prefix}.pdf
    touch ${prefix}.txt
    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        picard: \$(echo \$(picard CollectInsertSizeMetrics --version 2>&1) | grep -o 'Version:.*' | cut -f2- -d:)
    END_VERSIONS
    """
}

I am not sure if there is an option in Picard to exclude the various header outputs in lines 1-11 of the OUTPUT file, or if a newer version of the software may provide a different format. But this resolved my issues. Thanks!

-Ittai

@chris-cheshire
Copy link
Contributor

thank you for your investigation! I will try and roll this in as a feature for the next release. It won't be 3.2 though I am afraid, as this is already about to ship.

@chris-cheshire chris-cheshire reopened this Jul 4, 2023
@chris-cheshire chris-cheshire self-assigned this Jul 4, 2023
@chris-cheshire chris-cheshire added the WIP Work in progress label Jul 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement Improvement for existing functionality WIP Work in progress
Projects
None yet
Development

No branches or pull requests

2 participants