-
Notifications
You must be signed in to change notification settings - Fork 4
/
README
208 lines (137 loc) · 7.15 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
MIRAGE: Multiple-sequence Isoform Alignment Tool Guided by Exon Boundaries
--------------------------------------------------------------------------
CONTENTS
--------
1. Setup (Short)
2. Setup (Detailed)
3. Basic Usage
4. More Usage
5. About
6. Source Files
7. Included Directories
- Using the BLAT binary instead of compiling from source
SETUP (Short)
-------------
1. Run the SETUP.pl script ('./SETUP.pl')
2. Add the Mirage directory to the PATH
3. Check setup by running './mirage --check'
SETUP (Detailed)
----------------
To setup the mirage software package, first run the SETUP.pl script as follows:
$ perl SETUP.pl
Once the setup script has reported successful completion, it is necessary to
add the program 'spaln' to the PATH environment variable. This can be done by
first running the command "pwd" to see the present working directory. It
should look like this:
$ pwd
/Users/yourname/a/file/path/mirage
Once you have the current directory, you can add it to your PATH variable.
In BASH, you can add this line to the file ~/.bashrc:
export PATH=(pwd_results):$PATH
Refresh your path variable by setting the file that you edited as your source.
Depending on whether PATH is set in .bash_profile or .bashrc, this will be done
with one of the following two commands:
$ source ~/.bash_profile
- OR -
$ source ~/.bashrc
To test whether everything is setup, you can run the following command from the
directory containing mirage:
$ mirage --check
If setup was successful, this command will report success. Otherwise, it will
inform you of what is missing.
BASIC USAGE
-----------
Mirage requires 2 arguments: a FASTA-formatted protein database and a simple
guide file directing the program to genomes and gtf indicies for each species
being searched on.
$ perl mirage2.pl <Protein DB> <Species Guide>
It is expected that the names of the sequences in the protein database will
consist of three fields, in order and separated by bar (|) characters:
1. Species
2. Gene family (or multiple families, separated by '/')
3. A unique identifier within the species/family pair
This can be followed by a '#', after which any additional comments can be
written (and ignored by Mirage).
For example: '>human|znf143/zn143|2 # Accession:P52747, zinc finger protein'
The guide file is a simple text file where each line corresponds to a species
and has the following fields (separated by whitespace):
1. Species name, as spelled in the species fields of the protein database
2. Path to genome, from the directory containing mirage.pl
3. Path to gtf index file, from the directory containing mirage.pl
For example: 'human ../genomes/Human.DNA.fa ../indices/human.gtf'
The guide file can also include a line with a Newick-formatted tree to set the
order in which inter-species alignment occurs.
For example: '(dolphin,((mouse,rat),human))'
For more information, use 'mirage --help'.
ABOUT
-----
Mirage is a tool for generating exon-aware multiple sequence alignments (MSAs)
of protein isoforms within the same gene family and across species. This is
accomplished through 3 major phases: protein-to-genome alignment, intra-species
protein alignment, and cross-species alignment of MSAs.
During the first phase (protein-to-genome alignment), each protein in the database
is searched against its species' genome, identifying all "hits" (places where some
section of the protein is identical to one of its purported exons). The program
analyzes each protein's hit set and identifies an optimal way of "stitching" the
hits together to achieve full coverage of the protein. This is the work done
by the 'Quilter2.pl' script, which relies on 'FastMap2' and SPALN (if installed)
to determine the individual hits.
The second phase consists of isolating each set of "stitched" hits for each
isoform within the same species and gene family and joining these hits together
to form an intra-species multiple-sequence alignment. Because the hits can be
compared to one another by way of their positions in the genome, this alignment
preserves the exon boundaries indicated by the hits identified in phase 1 and
places special markers in the MSAs to mark where splice sites are observed.
This work is done with the 'MapsToMSAs.pl' script.
The third phase aligns the MSAs generated during the second phase across species
(staying within gene families). This is done with profile-to-profile alignment
using an slightly modified version of the Needleman-Wunsch algorithm, using the
'MultiSeqNW' program.
SOURCE FILES (by order of progression through mirage pipeline)
--------------------------------------------------------------
+ CleanMirageDB.pl
A script used to verify that a FASTA-formatted database meets the formatting
requirements expected by mirage. If a database does not meet the requirements,
mirage will direct the user towards this program.
+ src/run_mirage2.sh
A shell script used to run Mirage.
+ src/Mirage2.pl
The top-level script used to generate MSAs. This and CleanMirageDB.pl are
the only programs that the user is expected to engage with.
+ src/Quilter2.pl
A script used to align proteins within a particular species to their
genome (stitching together small hits for many proteins, hence the name).
+ src/FastMap2.c
A program used to identify nearly-identical alignments of protein sequence
to DNA (based on the Needleman-Wunsch dynamic programming algorithm, but only
allowing match states).
+ src/BasicBio.c
A collection of helper functions used by FastMap2 and MultiSeqNW.
+ src/MapsToMSAs.pl
A script used to generate MSAs for all gene families within a species based
on the results of Quilter.pl.
+ src/MultiSeqNW.c
A program used to align two MSAs to one another by constructing profiles
for each of the MSAs and then performing profile-to-profile alignment based
on the Needleman-Wunsch dynamic programming algorithm.
+ src/FinalMSA.pl
A script used to remove splice site markers and perform some common-sense
MSA cleanup of the final MSAs.
INCLUDED DIRECTORIES
--------------------
+ inc/hsi
A set of tools used mainly to extract sequences or sequence metadata from
fasta-formatted files.
+ inc/spaln2.2.2
A program used to generate intron-aware alignments of proteins to genomes.
Slower than FastMap2 but allows for search without guidance from a GTF index.
+ inc/blat
A program used to perform extremely fast translated local mapping.
NOTE: BLAT's library requirements can be annoying to work around if
you're in a compute environment where you don't have permission
to change /usr/include. To get around BLAT make issues, you
can use the binary located at 'dependencies/x86_64-blat-binary'
Simply perform the following (from the 'mirage' directory):
% mkdir build/blat
% mkdir build/blat/bin
% cp dependencies/x86_64-blat-binary build/blat/bin/blat