init
mattjeffryes committed Nov 26, 2018
0 parents commit 77ccf75
Showing 27 changed files with 4,415 additions and 0 deletions.
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
searchsifter/_version.py export-subst
2 changes: 2 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,2 @@
include versioneer.py
include searchsifter/_version.py
103 changes: 103 additions & 0 deletions README.md
@@ -0,0 +1,103 @@
# Search-Sifter
Search-Sifter uses hashes to rapidly compare protein families, in order to
analyse their relationship to each other.

## Installation

Search-Sifter requires Python 3.3 or greater. It's recommended that Search-Sifter
is installed into a virtual environment.

To install Search-Sifter, clone this repository:

    git clone URLHERE

Install the Search-Sifter package:

    pip install [path to Search-Sifter]

## Usage

### Generating Pfam hashes
First download a copy of the Pfam database in Stockholm file format. For example:

    wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam32.0/Pfam-A.full.uniprot.gz

Run Search-Sifter with appropriate arguments depending on desired hash length
and window size:

    python -m searchsifter.scripts.generate_residue_hashes -n [hash length] \
        -w [window size(s)] -o [output directory] -p [path to Pfam file] \
        -t stockholm

For each window size specified, a file `rhashes_[w].json` will be created in the
output directory. This will contain a hash for each of the families in the
input file. The hashes are stored in a JSON dictionary keyed by family
accession. Each hash is a JSON list. The elements of the list are in the
following format:

    [hash, [protein accession, chunk number]]
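
As an illustration of this layout, the following minimal sketch walks over a hash file once it has been parsed. The accessions and hash values in `sample` are made up, and integer hash values are an assumption; in practice the text would come from a file such as `rhashes_[w].json`.

```python
import json

# Hypothetical miniature of an rhashes_[w].json file.
# Accessions and hash values are invented for illustration only.
sample = """
{
  "PF00001": [[12, ["P12345", 0]], [57, ["Q67890", 3]]],
  "PF00002": [[8, ["P12345", 1]]]
}
"""


def iter_entries(hashes):
    """Yield (family, hash value, protein accession, chunk number) tuples."""
    for family, entries in hashes.items():
        for hash_value, (protein, chunk) in entries:
            yield family, hash_value, protein, chunk


hashes = json.loads(sample)
rows = list(iter_entries(hashes))
for row in rows:
    print(*row, sep="\t")
```

To read a real file, replace `json.loads(sample)` with `json.load(open(path))` for the path of interest.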

### Running analysis

Two scripts are provided.

#### Accuracy

To analyse the accuracy of generated hashes:

    python -m searchsifter.scripts.performance -s [test families] \
        -p [path to Pfam file] -t stockholm -a [hash files] [-n [hash lengths]]

Here `hash files` should be paths to one or more of the hash files generated by
`generate_residue_hashes`.

In `test families`, supply a file containing a JSON list of accessions.

In `hash lengths`, specify one or more lengths of hashes to test. The script
can shorten longer hashes, so if you have generated hashes of length 800,
you could use `-n 200 400 800` to test hashes of length 200, 400 and 800.
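
Shortening can be pictured with the sketch below. It assumes the hashes are bottom-n MinHash sketches, so that a shorter hash is simply the entries with the smallest hash values; this README does not state how the script shortens hashes, and `shorten` is a hypothetical helper, not part of the package.

```python
def shorten(sketch, n):
    """Keep the n entries with the smallest hash values.

    Assumes a bottom-n MinHash sketch whose entries have the form
    [hash, [protein accession, chunk number]].
    """
    return sorted(sketch, key=lambda entry: entry[0])[:n]


# Made-up entries for illustration only.
long_hash = [
    [57, ["Q67890", 3]],
    [12, ["P12345", 0]],
    [93, ["P12345", 2]],
    [30, ["A11111", 1]],
]
short_hash = shorten(long_hash, 2)
print(short_hash)
```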

If the `-n` flag is not used, the script does not use hashes to estimate the
Jaccard index and containment, but instead computes their true values.
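
For reference, the two exact quantities can be written with plain Python sets as below. This is a sketch of the standard definitions, not the code in `searchsifter/relationships/jaccard.py`; the direction of containment (the fraction of the first set found in the second) is an assumption.

```python
def jaccard_index(a, b):
    """|A ∩ B| / |A ∪ B|, or 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_containment(a, b):
    """|A ∩ B| / |A|: the fraction of a that is also in b (assumed direction)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a) if a else 0.0


family_a = {1, 2, 3, 4}
family_b = {3, 4, 5}
index = jaccard_index(family_a, family_b)              # 2 shared of 5 total
containment = jaccard_containment(family_a, family_b)  # 2 shared of 4 in a
print(index, containment)
```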

The script will write to standard output a TSV file with the following columns:

    family_A             Accession of the family being compared to
    family_B             Accession from test families
    jaccard_index        Estimated (or true if -n is not used) Jaccard Index
    jaccard_containment  Estimated (or true if -n is not used) Jaccard Containment
    n                    Length of hash (or 1 if -n is not used)
    w                    Window size (or 1 if -n is not used)
    type                 "estimated" if -n is used, "exact" otherwise
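
The output can be post-processed with the standard `csv` module. The sketch below assumes the TSV starts with a header row using the column names above, which this README does not confirm; the rows in `sample_tsv` are invented.

```python
import csv
import io

# Made-up output for illustration; a header row is assumed.
sample_tsv = (
    "family_A\tfamily_B\tjaccard_index\tjaccard_containment\tn\tw\ttype\n"
    "PF00001\tPF00002\t0.25\t0.5\t200\t5\testimated\n"
    "PF00003\tPF00002\t0.75\t0.9\t200\t5\testimated\n"
)


def best_match(tsv_text):
    """Return the row with the highest Jaccard index."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    return max(rows, key=lambda row: float(row["jaccard_index"]))


top = best_match(sample_tsv)
print(top["family_A"], top["jaccard_index"])
```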

#### Time

To analyse running time:

    python -m searchsifter.scripts.time -s [test families] \
        -p [path to Pfam file] -t stockholm -a [hash files] -n [hash lengths]

Arguments are as for `searchsifter.scripts.performance`.

The script will write to standard output a TSV with the following columns:

    test_acc  Accession from the test families
    ji_time   Time to estimate Jaccard index
    jc_time   Time to estimate Jaccard containment
    n         Length of hash
    w         Window size
    size      Family size in number of proteins

### Further usage

The file `searchsifter/Family.py` provides functions for creating objects to
represent protein families, and comparing them to each other.

The file `searchsifter/relationships/hmmer.py` provides the ability to create
Family objects from HMMER searches, given a Stockholm format output file.

The file `searchsifter/relationships/jaccard.py` provides functions for
calculating Jaccard index and containment.
`searchsifter/relationships/minhash.py` provides functions for estimating these
using MinHash.
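
To show the idea behind the estimate, here is a minimal bottom-k MinHash sketch. It is a generic illustration, not the implementation in `searchsifter/relationships/minhash.py`; the hash function, the helper names and the example items are all assumptions.

```python
import hashlib


def hash_item(item):
    """Deterministic 64-bit hash of a string (SHA-1 is an arbitrary choice)."""
    return int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")


def sketch(items, k):
    """Bottom-k sketch: the k smallest distinct hash values of the items."""
    return sorted({hash_item(item) for item in items})[:k]


def estimate_jaccard(sketch_a, sketch_b, k):
    """Estimate the Jaccard index from two bottom-k sketches.

    The k smallest values of the merged sketches form a sketch of the
    union; the fraction of them present in both inputs is the estimate.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:k]
    shared = set(sketch_a) & set(sketch_b)
    if not merged:
        return 0.0
    return sum(1 for value in merged if value in shared) / len(merged)


# Made-up items: 10 in each set, 5 shared, so the true index is 5/15.
set_a = ["item%d" % i for i in range(0, 10)]
set_b = ["item%d" % i for i in range(5, 15)]
k = 100  # larger than either set, so the estimate is exact here
estimate = estimate_jaccard(sketch(set_a, k), sketch(set_b, k), k)
print(estimate)
```

With `k` smaller than the sets, the sketches are truncated and the result becomes an approximation whose accuracy improves as `k` grows.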

Further documentation and usage notes are given in the docstrings.