gsufsort [1] is a fast, portable and lightweight tool for constructing the suffix array and related data structures for string collections.
gsufsort runs in internal memory and data structures are written to disk.
For a string collection, gsufsort can compute the following data structures:
- Suffix array (SA)
- Inverse suffix array (ISA)
- LCP-array (LCP)
- k-truncated LCP-array (k-LCP)
- Document array (DA)
- Burrows-Wheeler transform (BWT)
- Inverse BWT
- Generalized suffix array (GSA)
- Generalized enhanced suffix array (GESA)
gsufsort will compile in systems with a standard C compiler (like gcc) and make.
git clone https://github.com/felipelouza/gsufsort.git
cd gsufsort
make
Issuing these commands will build executables gsufsort and gsufsort-64.
For inputs larger than 2GB, gsufsort-64 must be used.
To enable support to compressed files, zlib is required. If zlib is
not installed in you system, build with option make GZ=0
.
./gsufsort INPUT [options]
where INPUT is a single file or directory with a string collection.
--build (default)
--sa [w] compute the SA using w bytes (default 4), write to INPUT.w.sa
--isa [w] compute the ISA, write to INPUT.w.isa
--lcp [w] compute the LCP, write to INPUT.w.lcp
--trlcp k compute the k-truncated LCP array, write to INPUT.w.lcp
--da [w] compute the DA, write to INPUT.w.da
--gsa [w1][w2] compute the GSA=(string,suffix) as pairs of w1+w2 bytes, write to INPUT.w1.w2.gsa
--gesa [w1][w2][w3] compute the GESA=(GSA,LCP,BWT), write to INPUT.w1.w2.w3.1.gesa
--bwt compute the BWT using 1 byte per symbol, write to INPUT.bwt
--docs d process only the first d strings in the collection
--light run the lightweight algorithm to compute DA, GSA and GESA
--output DIR/NAME write output files to DIR and use NAME as a prefix for file names
--load load data-structures from disk INPUT[.sa][.da][.lcp][.gsa][.str]
--ibwt invert the BWT, given INPUT.bwt, write output to INPUT.bwt.ibwt
--txt handle input as raw text files (one string per line)
--fasta handle input as fasta files
--fastq handle input as fastq files
--dir include all files at the input directory
--lower convert input to lowercase before data structures construction
--upper convert input to uppercase before data structures construction
--str write the collection concatenation (T^{cat}) to INPUT.1.str
--print [p] print the first p elements of arrays to stdout, defaults to the collection length
--qs write QS sequences in fastq permuted according to the BWT to INPUT.bwt.qs
--lcp_max write the maximum LCP value, write to INPUT.w.lcp_max
--lcp_max_text write the maximum LCP value text, write to INPUT.lcp_max.txt
--lcp_avg write the average LCP value, write to INPUT.w.lcp_avg
--time print the running time in seconds
--verbose verbose output
--help this help message
-
File types (text, fasta or fastq) will be selected by extensions: .txt, .fasta (or .fa, .fna.) and .fastq (or .fq).
-
Options
--txt
,--fasta
and--fastq
enable loading file disregarding extensions. -
In txt files, each line is taken as a strings in the collection. In fasta and fastq files, each sequence is taken as a string in the collection.
-
gsufsort supports the ASCII alphabet, but values 0 and 255 are reserved and must not occur in the input.
-
IUPAC symbols and 'N' are not handled as special symbols in fasta or fastq files.
-
gzipped input files (with .gz extension) are supported using zlib and kseq libraries. If zlib is not installed in your system, build gsufsort with the option
make GZ=0
. If zlib is not available and a gzipped file is given as input, a runtime error will occur. -
A directory may be given as input, selecting option
--dir
. Every file with expected extensions in the directory will be processed to compose the collection, and the default output file prefix will be all. See also Wiki.
-
Output files are written by default in the current directory.
-
If option
--output DIR/
is set, files are written to directoryDIR
. Setting--output DIR/NAME
will make files be written to directoryDIR
with suffixNAME
. -
Output files format is discussed below.
To run a test with all strings from dataset/example.txt
, type:
./gsufsort dataset/example.txt --sa --bwt
## gsufsort ##
## store_to_disk ##
example.txt.4.sa 76 bytes (n = 19)
example.txt.bwt 19 bytes (n = 19)
To see the result (option --print
) stored in disk INPUT.4.sa
and INPUT.bwt
, use --load
option:
./gsufsort example.txt --sa --bwt --load --print
## load_from_disk ##
example.txt.4.sa 76 bytes (n = 19)
example.txt.bwt 19 bytes (n = 19)
i SA BWT suffixes
0 18 $ #
1 6 a $
2 12 a $
3 17 n $
4 5 n a$
5 11 b a$
6 9 n aba$
7 15 n an$
8 3 n ana$
9 7 $ anaba$
10 13 $ anan$
11 1 b anana$
12 10 a ba$
13 0 # banana$
14 16 a n$
15 4 a na$
16 8 a naba$
17 14 a nan$
18 2 a nana$
-
SA, ISA, LCP, k-LCP and DA are each written sequentially to a binary file. The file has no header and every integer takes
w
bytes. The default value ofw
is 4. -
BWT and iBWT are written in ASCII format, using 1 byte per input symbol.
-
The GSA is written sequentially to a binary file, with no headers. The values of DA and SA are intercalated throughout the file with w1 and w2 bytes respectively: DA[0],SA[0],DA[1],SA[1],...
-
The GESA is written sequentially to a binary file, with no headers. The values of DA and SA, LCP and BWT are intercalated throughout the file with w1, w2, w3 and 1 bytes respectively: DA[0],SA[0],LCP[0],BWT[0],DA[1],SA[1],LCP[1],BWT[1]...
See more details and additional features in Wiki.
We thank to Antonis Maronikolakis for his helpful suggestions.
[1] Louza, F.A., Telles, G.P., Gog, S., Prezza, N., Rosone, G.. gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections. Algorithms Mol Biol 15, 18 (2020). https://doi.org/10.1186/s13015-020-00177-y
Supported by the project Italian MIUR-SIR CMACBioSeq ("Combinatorial methods for analysis and compression of biological sequences") grant n.~RBSI146R5L. P.I. Giovanna Rosone ↩