document-array

This repository contains the implementation of algorithm [1] that computes the document-array (DA) for a string collection in linear time using constant additional space.

Introduction

Our algorithm receives the suffix array (SA) of the concatenated string [2] and reuses its space to store the LF-array. Then SA is reconstructed during the computation of DA.

Build requirements

An ANSI C Compiler (e.g. GNU GCC)

API

/** @brief computes the document array of string s[0, n-1] given its SA
 *
 *  @param T  input concatenated string, using separators s[i]=1 and with s[n-1]=0
 *  @param SA Suffix array 
 *  @param DA Document array
 *  @param n  string length
 *  @param d  number of documents
 *  @return -1 if an error occured, otherwise returns 0
 */
int document_array_9n(unsigned char* T, int_t* SA, int_da* DA, uint_t n, uint_t d);

Example

Compilation:

make

Available options:

-d D	use the first D documents of the INPUT
-v	verbose output
-o	output computed arrays to disk (INPUT.da and INPUT.sa)
-c	check output (for debug)
-p P	print the output arrays LA[1,P] and SA[1,P] (for debug)
-h	this help message

Notes:

Supported extensions are .txt, .fasta and .fastq.

Run a test:

/main dataset/input.txt -p 10 -c

Output:

d = 3
N = 32 bytes
sizeof(int) = 4 bytes
TOTAL:
CLOCK = 0.000059 TIME = 0.000000
0.000059
SA: isSorted!!
DA: isDA!!
########
i) DA	SA	suffixes
0) 3	31	#
1) 0	6	$abacabad$...
2) 1	15	$banaanana...
3) 2	30	$#
4) 0	5	a$abacabad...
5) 2	29	a$#
6) 2	24	aanana$#
7) 2	19	aananaanan...
8) 1	7	abacabad$b...
9) 1	11	abad$banaa...
########
malloc_count ### exiting, total: 25,032, peak: 21,512, current: 1,024

References

[1] Louza, F. A., A Simple Algorithm for Computing the Document Array. CoRR abs/1812.09094 (2018)

[2] Louza, F. A., Gog, S., Telles, G. P., Inducing enhanced suffix arrays for string collections. Theor. Comput. Sci., vol. 678, pp. 22-39, 2017, Elsevier.

Thanks

Thanks to Giovanni Manzini, Travis Gagie and Nicola Prezza for helpful discussions

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
dataset		dataset
experiments/bitvector		experiments/bitvector
external		external
lib		lib
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
document-array.c		document-array.c
document-array.h		document-array.h
main.c		main.c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

document-array

Introduction

Build requirements

API

Example

References

Thanks

About

Releases

Packages

Languages

felipelouza/document-array

Folders and files

Latest commit

History

Repository files navigation

document-array

Introduction

Build requirements

API

Example

References

Thanks

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages