Skip to content

Approximate Shortest Superstring Generator

License

Notifications You must be signed in to change notification settings

eloj/superstrings

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

35 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Approximate Shortest Superstring Generator

WARNING: While this note persist, I may force-push to master

A Python library and tool for generating approximate shortest common superstrings.

All code is provided under the MIT License.

Introduction

A superstring is simply a string that contains some set of two or more strings as substrings. Though one can trivially be generated by just concatenating the input strings, we're usually interested in finding a more optimal solution, one which exploits the fact that strings in the input set may overlap:

"Given a finite set of strings, we would like to find their shortest common superstring. That is, we want the shortest possible string s such that every string in the set is a substring of s."1

Finding the shortest such superstring is NP-hard, but there are algorithms to approximate it, the simplest of which is called 'Greedy'2

Greedy is conjectured to have an approximation factor of two, though the proven upper bound34 is much worse.

Tool Discussion

The concept of this tool is simple; an input file with the string set to optimize is provided, and the tool generates an approximate superstring, plus some auxiliary information and tables that may or may not be useful.

The data directory contains some example files to play with.

It's perfectly valid to supply a file containing duplicated terms. This doesn't change anything for the superstring being generated -- these duplicates are removed in a pre-processing step -- but it's useful when you require a mapping between the term order in the input, and their corresponding locations in the superstring. This mapping can be generated using the --index-table option, and with it you can iterate over the input.

This tool is meant for small problems. It is written in Python, after all.

Usage

usage: superstring [-h] [-q | -v] [-s | -S | -L LOOPS | -B] [-C STR] [-F STR] [-j] [-i] [-l] [-R] [-G] [-V] [infile]

Approximate Shortest Superstring Generator -- https://github.com/eloj/superstrings

positional arguments:
  infile                   File containing set of strings, one per line

options:
  -h, --help               show this help message and exit
  -q, --quiet              Least verbose
  -v, --verbose            Increase output verbosity
  -s, --shuffle            Shuffle the input
  -S, --sort               Sort input by entry frequency
  -L LOOPS, --loops LOOPS  Shuffle and regenerate until min-length doesn't improve
  -B, --brute              Use brute-force. Warning: Only for tiny inputs!
  -C STR, --comment STR    String(s) that start a comment in the input
  -F STR, --mtf STR        Input element(s) to move-to-front
  -j, --join-only          Only join input, don't generate superstring
  -i, --index-table        Always output offset/index table
  -l, --length-table       Always output lengths table
  -R, --reduce-lengths     Reduce lengths based on minimum entry length/GCD
  -G, --reduce-offsets     Reduce offsets based on their GCD (gen. indeces)
  -V, --version            Display program version and exit

You can also supply arguments from a file using the '@argsfile' syntax.

Usage examples

Basic example using a commonly referenced DNA fragment string set:

$ ./superstring -q data/dna-example.txt
GCTAAGTTCATGCATC

Generating a superstring and offset table for 6502 opcodes, which are all three characters long:

$ ./superstring -v --index-table --mtf JAM data/MOS6510-mnem-basic.txt
Removing duplicates from input.
Applying move to front on input set ['JAM']
Moved JAM at index 2 to front.
256 strings (total len=768, min/max klen=3/3) in input, 57 unique strings (total len=171) remain.
Final pre-processed input:
['JAM', 'BRK', 'ORA', 'ASL', 'PHP', 'BPL', 'CLC', 'JSR', 'AND', 'BIT', 'ROL', 'PLP', 'BMI', 'SEC', 'RTI', 'EOR', 'LSR', 'PHA', 'JMP', 'BVC', 'CLI', 'RTS', 'ADC', 'ROR', 'PLA', 'BVS', 'SEI', 'STA', 'STY', 'STX', 'DEY', 'TXA', 'BCC', 'TYA', 'TXS', 'LDY', 'LDA', 'LDX', 'TAY', 'TAX', 'BCS', 'CLV', 'TSX', 'CPY', 'CMP', 'DEC', 'INY', 'DEX', 'BNE', 'CLD', 'CPX', 'SBC', 'INC', 'INX', 'NOP', 'BEQ', 'SED']
Output alphabet size=21:
ABCDEHIJKLMNOPQRSTVXY
Generated Superstring is 127 characters, saving 641 on original, 44 on unique:
JAMTAXBVCLIJMPSTAYRORTSXSTXADEYSECMPLDXBEQSEINYBVSBCCPXBCSTYAJSRTINCLDYBITXSEDECLCPYBPLPLABMINXNOPHPHADCLVBNEORASLSROLDANDEXBRK
The 256 verified ok offsets (~1057/224*8 bits, min unit=7 bits) are:
[124, 109, 0, 0, 0, 109, 111, 0, 97, 109, 111, 0, 0, 109, 111, 0, 84, 109, 0, 0, 0, 109, 111, 0, 79, 109, 0, 0, 0, 109, 111, 0, 61, 119, 0, 0, 71, 119, 115, 0, 85, 119, 115, 0, 71, 119, 115, 0, 90, 119, 0, 0, 0, 119, 115, 0, 31, 119, 0, 0, 0, 119, 115, 0, 63, 108, 0, 0, 0, 108, 113, 0, 99, 108, 113, 0, 11, 108, 113, 0, 6, 108, 0, 0, 0, 108, 113, 0, 8, 108, 0, 0, 0, 108, 113, 0, 20, 101, 0, 0, 0, 101, 18, 0, 87, 101, 18, 0, 11, 101, 18, 0, 47, 101, 0, 0, 0, 101, 18, 0, 42, 101, 0, 0, 0, 101, 18, 0, 0, 14, 0, 0, 57, 14, 24, 0, 28, 0, 25, 0, 57, 14, 24, 0, 50, 14, 0, 0, 57, 14, 24, 0, 58, 14, 73, 0, 0, 14, 0, 0, 68, 117, 36, 0, 68, 117, 36, 0, 15, 117, 3, 0, 68, 117, 36, 0, 55, 117, 0, 0, 68, 117, 36, 0, 103, 117, 21, 0, 68, 117, 36, 0, 81, 33, 0, 0, 81, 33, 77, 0, 44, 33, 121, 0, 81, 33, 77, 0, 106, 33, 0, 0, 0, 33, 77, 0, 67, 33, 0, 0, 0, 33, 77, 0, 52, 49, 0, 0, 52, 49, 65, 0, 92, 49, 95, 0, 52, 49, 65, 0, 39, 49, 0, 0, 0, 49, 65, 0, 75, 49, 0, 0, 0, 49, 65, 0]

The offsets provide a mapping from the index of a string in the input set, to the start of that string within the superstring.

In the case that the input strings are not all of equal length, a length table will be generated too.

The order of the input to the algorithm matters, and some orderings will generate a shorter superstring than others. To this end it can be useful to use the --shuffle option and re-running the tool a couple of times, or use --loops with a small iteration count (10-100'ish).

The --sort option will sort the input by frequency, which can reduce the textual size of the offset table. The --mtf options has a similar use-case.

The --brute option can be used to find the true optimal shortest superstring, but it is intractable for any but the smallest of inputs (N <~ 30).

$ ./superstring -q data/greedy-hard.txt
cabababcbababa
$ ./superstring -q --brute data/greedy-hard.txt
cababababc

Python Superstring Library API

generate_superstring(list) -> str : Given a list of strings, returns an approximate superstring using the default algorithm ('greedy').

greedy(list) -> str : Given a substring-free list of strings, returns the approximate superstring as generated by algorithm Greedy.

brute(list) -> str : Given a substring-free list of strings, returns an optimal superstring as generated by brute-force enumeration.

brutedp(list) -> str : Given a substring-free list of strings, returns an optimal superstring as generated by brute-force dynamic-programming approach.

make_substring_free(list) -> list : For Greedy to work as originally specified, its input must be substring-free (aka factor-free), i.e contain no elements that are substrings of one another. This function will process a list to ensure this pre-condition is true.

See the ssp.py source code for details.

Example API Usage

import ssp

arr = ["CATGC", "CTAAGT", "GCTA", "TTCA", "ATGCATC"]

res = ssp.generate_superstring(arr)
# res=GCTAAGTTCATGCATC

TODO

  • Verify that the type-spec in ssp.py is actually correct.
  • Algorithm selection.
  • Implement encoding into bitstrings.
  • Built-in benchmark.

Footnotes

  1. "Linear Approximation of Shortest Superstrings", Blum, Jiang, Li, et al., 1994.

  2. So named for the obvious reason

  3. "The Greedy Algorithm for Shortest Superstrings", Kaplan & Shafrir, 2004.

  4. "Improved Approximation Guarantees for Shortest Superstrings...", Englert, Matsakis, Veselý, 2021. (arXiv:2111.03968)