GraphSeq: Accelerating String Graph Construction for De Novo Assembly on Spark

Abstract

De novo genome assembly is important both for assembling uncharacterized genomes and for identifying variants without reference bias. Unlike a de Bruijn graph, a string graph is a lossless data representation for de novo assembly. However, string graph construction is computationally intensive. We propose GraphSeq, which accelerates string graph construction by leveraging the Spark distributed computing framework.
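To make the idea of a string graph concrete, here is a minimal Python sketch of its first step: finding suffix-prefix overlaps between reads and recording them as directed edges. It is an illustration only, not GraphSeq's Spark implementation. The function names, the toy reads, and the `min_overlap` threshold are all assumptions made for this example, and the quadratic all-pairs comparison is exactly the cost that a distributed approach is meant to avoid. A full string graph would additionally drop contained reads and transitively implied edges, which this sketch omits.

```python
from itertools import permutations


def suffix_prefix_overlap(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_overlap` bases exists.
    """
    longest_possible = min(len(a), len(b))
    # Try the longest candidate first so the first hit is the longest overlap.
    for length in range(longest_possible, max(min_overlap, 1) - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def build_overlap_edges(reads, min_overlap):
    """Naive all-pairs overlap detection: O(n^2) read comparisons."""
    edges = {}
    for a, b in permutations(reads, 2):
        length = suffix_prefix_overlap(a, b, min_overlap)
        if length:
            edges[(a, b)] = length  # directed edge a -> b, labeled by overlap length
    return edges


# Hypothetical toy reads; real reads are far longer (e.g. 151 bp).
reads = ["ACGTAC", "GTACGG", "ACGGTT"]
edges = build_overlap_edges(reads, min_overlap=3)
for (a, b), length in edges.items():
    print(f"{a} -> {b} (overlap {length})")
```

With these three reads, the sketch finds two edges that chain the reads into a single path: `ACGTAC -> GTACGG` and `GTACGG -> ACGGTT`, each with a 4-base overlap.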

Workflow

Usage

$ /usr/local/spark/bin/spark-submit \
    --master spark://XXX:7077 \
    --class com.atgenomix.seqslab.cli.SparkSTMain \
    ./target/graphseq-1.0.0.jar overlap
INPUT                  : Input path (generated by ADAM transform)
OUTPUT                 : Output path
-cache                 : Cache the reads in memory to speed up data processing
-h (-help, --help, -?) : Print help
-max_edges N           : Maximum number of edges per read [default = Integer.MAX_VALUE]
-max_read_length N     : Maximum read length [default = 151]
-mlcp N                : Minimum longest common prefix [default = 45]
-packing_size N        : Number of reads packed together [default = 100]
-pl_batch N            : Prefix length that determines the number of batches [default = 1]
-pl_partition N        : Prefix length that determines the number of partitions [default = 7]
-print_metrics         : Print metrics to the log on completion
-profiling             : Enable performance profiling and output to $OUTPUT/STATS
-rmdup                 : Remove duplicate reads
-stats                 : Output statistics of the string graph to $OUTPUT/STATS
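The option list does not spell out how a prefix length translates into a number of partitions. One plausible reading, stated here as an assumption rather than as GraphSeq's documented behavior, is that sequences are grouped by their first N bases, so that the four-letter DNA alphabet yields at most 4^N groups. The Python sketch below illustrates that reading; `max_buckets` and `bucket_by_prefix` are hypothetical helpers, not part of GraphSeq.

```python
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: only the four unambiguous DNA bases occur


def max_buckets(prefix_length: int) -> int:
    """Upper bound on distinct prefixes of the given length over ALPHABET."""
    return len(ALPHABET) ** prefix_length


def bucket_by_prefix(sequences, prefix_length):
    """Group sequences by their first `prefix_length` characters."""
    buckets = defaultdict(list)
    for seq in sequences:
        buckets[seq[:prefix_length]].append(seq)
    return dict(buckets)


# Under this assumption, the default -pl_partition value of 7 allows
# up to 4**7 = 16384 prefix buckets.
print(max_buckets(7))

# Hypothetical toy input, grouped by a 2-base prefix.
print(bucket_by_prefix(["ACGT", "ACTT", "GGTA"], 2))
```

Under the same assumption, a longer prefix produces more, smaller groups, which would trade per-partition memory against scheduling overhead.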

Citing GraphSeq

GraphSeq is published on bioRxiv as an open-access preprint.

@techreport{Su18,
    title={{GraphSeq}: Accelerating String Graph Construction for De Novo Assembly on Spark},
    author={Su, Chung-Tsai and Chang, Ming-Tai and Cheng, Yun-Chian and Li, Yun-Lung and Wang, Yao-Ting},
    year={2018},
    institution={Atgenomix}
}
