pla-index

pla-index builds a lightweight index from a FASTA file and its corresponding suffix array. The index supports k-mer membership queries and suffix array searches. It is substantially faster than a binary search on the suffix array.

Requirements

Linux (64 bit)
C++17
SDSL (>=2.1.1)
cmake (>= 3.16)

Quick Start

Installation

git clone --recursive https://github.com/medvedevgroup/pla-index.git
cd pla-index
./update_cmake_lists.sh SDSL_INCLUDE_PATH SDSL_LIB_PATH
mkdir build
cd build
cmake ..
make -j 4

The two parameters for update_cmake_lists.sh are

Parameter Name	Description
SDSL_INCLUDE_PATH	Path to the SDSL include folder [typically ~/include/]
SDSL_LIB_PATH	Path to the SDSL library folder [typically ~/lib/]

In case you want to construct your own suffix array, you may use libdivsufsort tool. We modify it slightly to allow 64 bit integers.

To create the suffix array builder executable mksary from the current build folder:

cd ../libdivsufsort
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE="Release" \
-DCMAKE_INSTALL_PREFIX="/usr/local" ..
sed -i 's/int32_t/int64_t/g' include/divsufsort.h
make
cp examples/mksary ../../build/

Usage

Now, all the executables are inside the pla-index/build folder.

./build_pla_index -h

would display the command line interface:

build_pla_index {OPTIONS}

    build_pla_index

OPTIONS:

    -h, --help    Print help and exit
    -g [STRING], --genome_fasta=[STRING]
                Fasta file with one entry and only ACGT characters. [Required]
    -s [STRING], --suffix_array=[STRING]
                Suffix array of the genome in a binary file. [Required]
    -t [STRING], --index_type=[STRING]
                Whether to build "basic-pla" or "repeat-pla" index. "repeat-pla" is
                generally the better option but can slow down rank queries unless the
                -r option is also used. [default: basic-pla]
    -k [INT], --kmer_size=[INT]
                Kmer size to be used to construct the index. [default: 21]
    -e [INT], --eps=[INT]
                Epsilon value to be used for constructing the pla-index. [default: 15]
    -o [STRING], --index=[STRING]
                File name where to save the index. [default: genome_fasta.index]
    -l [INT], --lookup=[INT]
                The average number of elements that will be searched on the X array. It is used to determine the size of the prefix lookup table. [default: 16]
    -r            Construct an extra bitvector with the same length as the suffix array.
                Use this to speed up rank queries. [default: disabled]

Similarly, query command line interface can be seen by:

./query_pla_index -h

and the interface:

query_pla_index {OPTIONS}

    query_pla_index

OPTIONS:

    -h, --help    Print help and exit
    -g [STRING], --genome_fasta=[STRING]
                Fasta file with one entry and only ACGT characters [Required]
    -s [STRING], --suffix_array=[STRING]
                Suffix array of the genome in a binary file [Required]
    -i [STRING], --index=[STRING]
                Name of the stored index file [Required]
    -q [STRING], --query=[STRING]
                Name of a text file containing queries. Each line of the file should
                contain a single k-mer. The length of the kmer should be the same as
                the one provided while building the index. [Required]
    --query_type=[STRING]
                Whether to do "search" or "rank" query. "rank" gives you the value of
                the first position in the suffix array where the query is found.
                "search" on the other hand, also returns the value where the query
                appears, but it might not be the very first spot. [default: search]

To build the suffix array:

./mksary INFILE OUTFILE

Parameter description:

Parameter Name	Description
INFILE	Fasta file with one entry and only ACGT characters
OUTFILE	Name of the file where the suffix array for INFILE will be stored

Small Example

We will use the formatted fasta file: tests/ecoli/ecoli.processed.fasta for the example. From now on, we will be using the executables that are inside the pla-index/build folder. To build suffix array:

cd build/
./mksary ../tests/ecoli/ecoli.processed.fasta ../tests/ecoli/ecoli.sa.bin

The built suffix array is written on tests/ecoli/ecoli.sa.bin.

To build a basic-pla index with default parameters (k=21, eps = 15), we will use the formatted fasta file and the built suffix array:

./build_pla_index -g ../tests/ecoli/ecoli.processed.fasta -s ../tests/ecoli/ecoli.sa.bin -o ../tests/ecoli/ecoli.index

The built index will be saved on ../tests/ecoli/ecoli.index file.

To do a search query with the provided tests/ecoli/ecoli.1.query.txt on the built tests/ecoli/ecoli.index one can do the following:

./query_pla_index -g ../tests/ecoli/ecoli.processed.fasta -s ../tests/ecoli/ecoli.sa.bin -i ../tests/ecoli/ecoli.index -q ../tests/ecoli/ecoli.1.query.txt

The query time will be shown on console.

Reproducibility

To reproduce the results related to the evaluation of the PLA-index, please follow the README file inside the Reproducibility folder. To reproduce the results related to the PLA complexity of various datasets, please follow the README file inside the Reproducibility-PLA-Complexity folder.

Citation

If you use pla-index, please cite:

@article{Abrar2024.02.08.579510,
    author = {Md Hasin Abrar and Paul Medvedev},
    title = {PLA-complexity of k-mer multisets},
    elocation-id = {2024.02.08.579510},
    year = {2024},
    doi = {10.1101/2024.02.08.579510}
}

Name		Name	Last commit message	Last commit date
Latest commit History 56 Commits
Reproducibility-PLA-Complexity		Reproducibility-PLA-Complexity
Reproducibility		Reproducibility
libdivsufsort @ c0e8c69		libdivsufsort @ c0e8c69
sdsl		sdsl
src		src
tests/ecoli		tests/ecoli
.gitignore		.gitignore
.gitmodules		.gitmodules
CMakeLists.txt		CMakeLists.txt
CreateConfigFile.py		CreateConfigFile.py
LICENSE		LICENSE
README.md		README.md
Snakefile		Snakefile
update_cmake_lists.sh		update_cmake_lists.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

pla-index

Requirements

Quick Start

Installation

Usage

Small Example

Reproducibility

Citation

About

Releases

Packages

Contributors 2

Languages

License

medvedevgroup/pla-index

Folders and files

Latest commit

History

Repository files navigation

pla-index

Requirements

Quick Start

Installation

Usage

Small Example

Reproducibility

Citation

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages