Skip to content

Latest commit

 

History

History
400 lines (314 loc) · 11.9 KB

README.md

File metadata and controls

400 lines (314 loc) · 11.9 KB

Parallel Construction of Suffix Arrays in Rust

This code is inspired by Cache-friendly, Parallel, and Samplesort-based Constructor for Suffix Arrays and LCP Arrays. We copied many ideas from the original C++ implementation CaPS-SA, most notably the mergesort that constructs the longest common prefix (LCP).

Setup

  • Install Rust (this is easy)
  • Run cargo install sufr to install the CLI
  • Alternately, execute cargo run in the source code directory to build a debug version of the program from source
$ cargo run
Parallel Construction of Suffix Arrays in Rust

Usage: sufr [OPTIONS] [COMMAND]

Commands:
  create     Create sufr file
  extract    Extract suffixes from a sufr file
  list       List the suffix array from a sufr file
  count      Count occurrences of sequences in a sufr file
  locate     Locate sequences in a sufr file
  summarize  Summarize sufr file
  help       Print this message or the help of the given subcommand(s)

Options:
  -t, --threads <THREADS>    Number of threads
  -l, --log <LOG>            Log level [possible values: info, debug]
      --log-file <LOG_FILE>  Log file
  -h, --help                 Print help
  -V, --version              Print version
  • Execute cargo build --release to build an optimized binary in ./target/release/sufr, which you can execute directly and copy into your $PATH:
$ cargo build --release

$ ./target/release/sufr -h
Parallel Construction of Suffix Arrays in Rust

Usage: sufr [OPTIONS] [COMMAND]

Commands:
  create     Create sufr file
  extract    Extract suffixes from a sufr file
  list       List the suffix array from a sufr file
  count      Count occurrences of sequences in a sufr file
  locate     Locate sequences in a sufr file
  summarize  Summarize sufr file
  help       Print this message or the help of the given subcommand(s)

Options:
  -t, --threads <THREADS>    Number of threads
  -l, --log <LOG>            Log level [possible values: info, debug]
      --log-file <LOG_FILE>  Log file
  -h, --help                 Print help
  -V, --version              Print version

Some of the commands may produce debug/info messages. Use the -l|--log option to view these on STDOUT or optionally write to a given --log-file.

Create a suffix array

To begin, you must create a .sufr using the create (cr) action:

$ sufr create -h
Create sufr file

Usage: sufr create [OPTIONS] <INPUT>

Arguments:
  <INPUT>  Input file

Options:
  -n, --num-partitions <NUM_PARTS>  Subproblem count [default: 16]
  -m, --max-query-len <CONTEXT>     Max context
  -o, --output <OUTPUT>             Output file
  -d, --dna                         Input is DNA
  -a, --allow-ambiguity             Allow suffixes starting with ambiguity codes
  -i, --ignore-softmask             Ignore suffixes in soft-mask/lowercase regions
  -D, --sequence-delimiter <DELIM>  Character to separate sequences [default: %]
  -s, --seed-mask <MASK>            Spaced seeds mask
  -r, --random-seed <RANDSEED>      Random seed [default: 42]
  -h, --help                        Print help

The resulting binary-encoded output file will contain:

  • metadata about the input
  • the entire input text encoded as u8 (bytes)
  • a fully sorted suffix array (SA)
  • an array of the LCP (longest common prefix) for the SA

The input file is a required positional argument and should be a FASTA/Q-formatted file with one or more sequences.

$ cat data/inputs/2.fa
>ABC
acgtacgt
>DEF
acgtacgt

First, read the input sequences into a string of u8 bytes and uppercase the characters. Each sequence is separated by a specified character (% by default). A final dollar sign ($) is appended to the end of the input text. If the --dna flag is present, suffixes are skipped if they begin with any character other than A, C, G, or T unless the --allow-ambiguity flag is present. Additionally, soft-masked (lowercase) input is converted to uppercase unless the --ignore-softmask flag is present, in which case it is converted to N and ignored.

Next, we partition the suffixes into some number N partitions by randomly selecting --num-partitions - 1 pivot suffixes, sorting them, and using the pivots to place each suffix into the highest bounded partition. The partitions are sorted using a merge sort algorithm that also generates an LCP (longest common prefix) array. The sorted suffix/LCP arrays are then concatenated to produce the final output.

The sufr CLI will create an output file containing a binary-encoded representation of the sorted suffix/LCP arrays along with the original sequence data and other metadata used to generate the arrays. For instance, with the 1.fa file, the default output file will be 1.sufr:

$ sufr --log debug create data/inputs/1.fa -n 2 --dna
[2025-01-06T20:28:54Z INFO  sufr] Using 8 threads
[2025-01-06T20:28:54Z INFO  sufr] Read input of len 11 in 982.625µs
[2025-01-06T20:28:54Z INFO  libsufr::sufr_builder] Selected 1 pivot in 74.291µs
[2025-01-06T20:28:54Z INFO  libsufr::sufr_builder] Wrote 9 unsorted suffixes to partition in 469.833µs
[2025-01-06T20:28:54Z INFO  libsufr::sufr_builder] Sorted 9 suffixes in 2 partitions (avg 4) in 356.833µs
[2025-01-06T20:28:54Z INFO  sufr] Wrote 172 bytes to '1.sufr' in 422.042µs
[2025-01-06T20:28:54Z INFO  sufr] Total time: 2.836917ms

Summarize a sufr file

Use the summarize (su) action to view metadata about a .sufr file:

$ sufr su -h
Summarize sufr file

Usage: sufr summarize <SUFR>

Arguments:
  <SUFR>  Sufr file

Options:
  -h, --help  Print help

For instance:

$ sufr su 1.sufr
+-----------------+------------------+
| Filename        | 1.sufr           |
+-----------------+------------------+
| Modified        | 2025-01-06 13:28 |
+-----------------+------------------+
| File Size       | 172 bytes        |
+-----------------+------------------+
| File Version    | 6                |
+-----------------+------------------+
| DNA             | true             |
+-----------------+------------------+
| Allow Ambiguity | false            |
+-----------------+------------------+
| Ignore Softmask | false            |
+-----------------+------------------+
| Text Length     | 11               |
+-----------------+------------------+
| Len Suffixes    | 9                |
+-----------------+------------------+
| Max query len   | 0                |
+-----------------+------------------+
| Num sequences   | 1                |
+-----------------+------------------+
| Sequence starts | 0                |
+-----------------+------------------+
| Headers         | 1                |
+-----------------+------------------+

Listing suffixes in a sufr file

You can use the list (ls) action to view the sorted arrays by their rank:

$ sufr ls -h
List the suffix array from a sufr file

Usage: sufr list [OPTIONS] <SUFR> [RANK]...

Arguments:
  <SUFR>     Sufr file
  [RANK]...  Ranks of suffixes to show

Options:
  -r, --show-rank        Show rank column
  -s, --show-suffix      Show suffix position column
  -p, --show-lcp         Show LCP column
  -v, --very-low-memory  Very low memory
  -n, --len <LEN>        Length of suffixes to show
  -o, --output <OUT>     Output
  -h, --help             Print help

For example, to view the 1.sufr file including rank, suffix, and LCP:

$ sufr ls -rsp 1.sufr
 0 10  0 $
 1  6  0 ACGT$
 2  0  4 ACGTNNACGT$
 3  7  0 CGT$
 4  1  3 CGTNNACGT$
 5  8  0 GT$
 6  2  2 GTNNACGT$
 7  9  0 T$
 8  3  1 TNNACGT$

An optional positional argument for the ranks allows you to select only a portion:

$ sufr ls 1.sufr 0-5
$
ACGT$
ACGTNNACGT$
CGT$
CGTNNACGT$
GT$

As suffixes can get quite long, use the -l|--len option to restrict the length of the shown suffix:

$ sufr ls 1.sufr 2-3 -n 3
ACG
CGT

Count occurrences of suffixes

Use the count (co) command to find the number of occurrences of suffixes:

$ sufr co -h
Count occurrences of sequences in a sufr file

Usage: sufr count [OPTIONS] <SUFR> <QUERY>...

Arguments:
  <SUFR>      Sufr file
  <QUERY>...  Query

Options:
  -m, --max-query-len <LEN>  Maximum query length
  -o, --output <OUT>         Output
  -l, --low-memory           Low-memory
  -v, --very-low-memory      Very low memory
  -h, --help                 Print help

The -l|--low-memory option will force the suffixes to be read from disk rather than loaded into memory. The -v|--very-low-memory option will also force the text to be read from disk rather than loaded into memory. The time to load a large index (e.g., human genome) into memory may take longer than the actual search, so you may find this is faster for only a few queries but slower for a large number.

For example:

$ sufr co 1.sufr AC GT X
AC 2
GT 2
X not found

Suffixes that do not occur are printed to STDERR such as the "X" in the preceding output.

Locate suffixes

Use the locate (lo) command to find the positions of a given suffix:

$ sufr lo -h
Locate sequences in a sufr file

Usage: sufr locate [OPTIONS] <SUFR> <QUERY>...

Arguments:
  <SUFR>      Sufr file
  <QUERY>...  Query

Options:
  -o, --output <OUT>         Output
  -m, --max-query-len <LEN>  Maximum query length
  -l, --low-memory           Low memory
  -v, --very-low-memory      Very low memory
  -a, --abs                  Show absolute position in text
  -h, --help                 Print help

For instance, given the 3.fa input:

$ cat data/inputs/3.fa
>1
AGCTTTTCATTCTGACTGCAACGG
GCAATANNNNNNNNNNTGTCTC
>2
TGTGTGGATTAAAAAAAGAGTGTC
>3
TGATAGCAGCTTCTGAACTGGTTA
CCTGCCGTGAGTAAAT

The suffix "ACT" is found at position 14 in sequence 1 and 16 in sequence 3, "GT" is found at position 20 in sequence 3, and "X" is not found and is printed to STDERR:

$ sufr lo data/expected/3.sufr ACT GTT X
ACT
1 14
3 16
//
GTT
3 20
//
X not found

Use the -a|--absolute flag if you prefer to see the raw suffix position:

$ sufr lo 3.sufr ACT GTT -a
ACT 14 88
GTT 92

By default, the entire suffix array will be loaded into memory. If your machine lacks suffixient resources, you can use the -m|--max-query-len option to create a down-sampled suffix array. Or use the -l|--low-memory option to search the suffix array on disk, which is slower but requires only enough memory to hold the original text/sequences.

Extract suffixes

Use the extract (ex) to print suffixes in FASTA format:

$ sufr ex -h
Extract suffixes from a sufr file

Usage: sufr extract [OPTIONS] <SUFR> <QUERY>...

Arguments:
  <SUFR>      Sufr file
  <QUERY>...  Query

Options:
  -m, --max-query-len <LEN>      Maximum query length
  -l, --low-memory               Low memory
  -v, --very-low-memory          Very low memory
  -p, --prefix-len <PREFIX_LEN>  Prefix length
  -s, --suffix-len <SUFFIX_LEN>  Suffix length
  -o, --output <OUT>             Output
  -h, --help                     Print help

Using the preceding locate results, I can extract the suffixes at those positions. As the suffixes can get quite long, I will use the -s|--suffix-len option to limit them to 15bp:

$ sufr ex data/expected/3.sufr ACT GTT -s 15
>1:14-29 ACT 0
ACTGCAACGGGCAAT
>3:16-31 ACT 0
ACTGGTTACCTGCCG
>3:20-35 GTT 0
GTTACCTGCCGTGAG

Combine this with the -p|--prefix-len to control the amount of preceding text, which might be useful when identifying alignment seeds:

$ sufr ex data/expected/3.sufr ACT GTT -s 15 -p 3
>1:11-29 ACT 3
CTGACTGCAACGGGCAAT
>3:13-31 ACT 3
TGAACTGGTTACCTGCCG
>3:17-35 GTT 3
CTGGTTACCTGCCGTGAG

The FASTA header contains the matching sequence ID, a colon, the start/stop position of the extracted sequence, followed by the query, and finally the location of the query in the extracted sequence, which is only relevant if you included a prefix.

Testing

Run cargo test.

Authors