--
IN_PROGRESS:
-----------
-//-
reprobate, agglomerate
Add extra stop codon penalty ?
Share subopt between fwd/rev strands for translating models ?
optimisation: stop codegen from testing match more than once per cell.
optimisation: prevent subopt for certain optimals (eg. global).
Fix configure_extra clashes from intron models
(remove configure_extra altogether in intron?)
./exonerate -m p2g pep.fa human.genomic -E --forcegtag no
(score check problem)
- add dependency in codegen.c:188
replace when: codegen/model.o is older than model/model.o
(Make model expiry depend on parent model expiry)
- Expand man page intro, mention main features.
-------------------------------------------------------------------
--
o Separate code for DP scheduler from c4.[ch]
o Option for --refine (###check with region)
(needs to work with suboptimal alignments)
Refine strategies:
- none: default
- full: redo full alignment
- region: redo alignment region + edge boundary
- cobs: do alignment between all cobs points
in the final alignment. (or just 'clean' cobs points)
( Also terminal cobs->corner,
terminal cobs->align_end + boundary )
- hsp : redo alignment hsp_set bound box + boundary
- align: bad splice sites
low quality regions
missing regions at ends.
o Split RedSpace from viterbi ?
o Move region finding, redspace etc outside of viterbi
o Allow hidden states / compound transitions for optimiser
(store chained transition for each transition, also state map,
must translate alignment afterwards)
o Fix u:t type models to check revcomp (analysis.c:133)
(Should translate target for trans2dna models)
o Add c2g model (c2c + tg intron): check working for reversed ?
(./exonerate c.f e.f -m c2g)
o Add phase 1->2 transition for phase model
o Add joint intron type to intron model (joint cis only)
o Add joint phase type, (9-state model)
o Split alignment: alignment->sequence/gene,ryo,label
combine Alignment output methods
add Alignment_traverse
have query_gene target_gene
o Add --ryo %[qt]f for frame (%[qt]b %3)
- need gene object ? can map all onto gene object ?
gene = Gene_create(alignment, on_query);
Gene_write_gff(gene);
o Separate GFF code from alignment
o Should GFF show all coordinates on the +ve strand? (jason_p2g eg)
o Report frameshifts (and in-frame stops) in GFF
o GAM_Type_Data
o Rearrange C4/optimal:
1 RedSpace: subalignment,checkpoint,recursion,traceback
problem with two different cell sizes
2 Viterbi: interpreted,codegen,optimal_data,optimal_mode
3 SubOpt: store,region,macros
4 OptModel: dp model optimiser eg. [-O->] => [->]
5 OPair: create,destroy,next_score,next_path
(make HPair same interface)
6 Optimal: optimal_type,find_score,find_path
o Refactor Heuristic (and HPair) to clarify heuristic model
representation (use (shared?) simple model object ?)
o Make interpreted and codegen implementations more similar.
o Need Alignment_traverse() to handle shadow setting etc.
( Adapt from Alignment_has_valid_alignment() )
o Implement --verbose <level> or --verbose <typemask>
(or have module-specific verbose options ?)
o Introduce chain type/chain object ? [dna|protein|other]
- replace C4 user_data / terminal_data system
- allow multiple chain types to be used with some models
o Make all Optimal_Mode find_{region,score,path,checkpoints}
inherit from basic model definitions
o Command line meta-options
o Separation of C4 runtime and compile time requirements
(eg. upper bounds for calcs not required at compile time
- fix this with model-specific params ?)
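The --refine strategies listed earlier in this section (none, full, region, cobs, hsp, align) could be carried around as a small enum. This is only a sketch: the type name RefineStrategy, the REFINE_* constants and refine_strategy_parse() are hypothetical and are not taken from the existing sources.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical enum; order matches the strategy list in the TODO item. */
typedef enum {
    REFINE_NONE,    /* default: no refinement                     */
    REFINE_FULL,    /* redo full alignment                        */
    REFINE_REGION,  /* redo alignment region + edge boundary      */
    REFINE_COBS,    /* realign between cobs points                */
    REFINE_HSP,     /* redo hsp_set bounding box + boundary       */
    REFINE_ALIGN,   /* bad splice sites, low quality, missing ends */
    REFINE_UNKNOWN  /* name not recognised                        */
} RefineStrategy;

/* Translate a command line value into a strategy (assumed spelling). */
RefineStrategy refine_strategy_parse(const char *name) {
    static const char *names[] = {
        "none", "full", "region", "cobs", "hsp", "align"
    };
    assert(name != NULL);
    for (int i = 0; i < 6; i++)
        if (strcmp(name, names[i]) == 0)
            return (RefineStrategy)i;
    return REFINE_UNKNOWN;
}
```

A caller would treat REFINE_UNKNOWN as a usage error rather than silently falling back to the default.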
-//-
sequence/gene: create(dna_seq), destroy, add_exon
sequence/alphabet: stuff from Sequence_{Type,Filter} also masking
SubOpt: create(alignment), allow exclusion in viterbi
Alignment: Alignment_build_gene
-//-
BUGS:
-----
o Genome2Genome model:
- C4_Label_SPLIT_CODON display fixes for dna2dna
- Get test working with joint Phase and Intron models
o Add checks to catch duplicate C4 names.
(necessary as duplicates are removed on model copying)
(also need namespace management for C4 codegen)
o Bug with --saturatethreshold memory on large analyses
(maybe not taken into account by --fsmmemory ?)
o Increase default SAR ranges to span wordlength
(check titin example (intron:146) - bug ?).
o Bug with memory allocation on genome exhaustive alignments
o Fix alignment drawing scheme to handle silent dna:dna mutations
o Clean up unnecessary C4 inheritance macros
o Suboptimal exhaustive alignments
(can still do reuse lookups in constant time,
for linear space alignment as we know the DP computation order)
(also required to prevent SARs containing paths
from other HSPs or higher scoring alignments)
- requires adjoining HSP component retraction?
OPTIMISATIONS:
-------------
o Only copy designated shadows in codegen.
o Defer/join SAR/Region memory allocation and/or use memchunks.
(just return bound initially, as most bounds are not confirmed ?)
o Profiling (and memory profiling)
FEATURES:
--------
o Change --ryo transition per line format to allow
() : all
{} : non-silent only (qy_adv || tg_adv)
<> : match only (qy_adv && tg_adv)
(or add --ryog for gene output ?)
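The three bracket styles proposed above reduce to a predicate on how far a transition advances in the query and the target. A minimal sketch, assuming advances are non-negative integers; RyoFilter, ryo_filter_from_bracket() and ryo_filter_accepts() are hypothetical names, not existing functions.

```c
#include <assert.h>

/* Hypothetical filter kinds, one per proposed bracket style. */
typedef enum {
    RYO_FILTER_ALL,        /* ()  every transition           */
    RYO_FILTER_NON_SILENT, /* {}  qy_adv || tg_adv           */
    RYO_FILTER_MATCH       /* <>  qy_adv && tg_adv           */
} RyoFilter;

/* Map an opening bracket to a filter; returns -1 for anything else. */
int ryo_filter_from_bracket(char open) {
    switch (open) {
        case '(': return RYO_FILTER_ALL;
        case '{': return RYO_FILTER_NON_SILENT;
        case '<': return RYO_FILTER_MATCH;
        default:  return -1;
    }
}

/* Return 1 if a transition with these advances should be printed. */
int ryo_filter_accepts(RyoFilter filter, int qy_adv, int tg_adv) {
    assert(qy_adv >= 0 && tg_adv >= 0);
    switch (filter) {
        case RYO_FILTER_ALL:        return 1;
        case RYO_FILTER_NON_SILENT: return (qy_adv > 0) || (tg_adv > 0);
        case RYO_FILTER_MATCH:      return (qy_adv > 0) && (tg_adv > 0);
    }
    return 0;
}
```

With this split, an insertion (one side advancing) passes {} but not <>, and a silent transition passes only ().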
-//-
o Prevent from generating > --bestn suboptimal alignments
for any pairwise comparison.
(ie. update --bestn/BSDP threshold after every new alignment)
o Allow multiple span states in a single span ?
(ie. do all dp in a single sweep)
o Add support for large bounds and joins for missing exon finding
and for worst cases when two close HSPs fail to join.
(or just detect why this occurs ...)
o Write DP optimiser
- skip single input/output states O->O->O => O->O
- remove start/end states when single transition
or simple input/output transitions can be duplicated
(ie. make multiple start/end states)
- shadows for single seq-independent loop states
(eg. simple ins/del)
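The first DP optimiser step above (O->O->O => O->O) can be sketched on a bare state graph. This deliberately ignores scores, advances and shadows, which a real optimiser would have to combine when merging two transitions; Edge and collapse_pass_through() are hypothetical and do not correspond to existing C4 structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal model: states are integers, transitions are edges. */
typedef struct {
    int src;
    int dst;
    int live;   /* set to 0 once the transition has been merged away */
} Edge;

/* Count live edges entering (incoming != 0) or leaving state s. */
static int degree(const Edge *e, size_t n, int s, int incoming) {
    int count = 0;
    for (size_t i = 0; i < n; i++)
        if (e[i].live && (incoming ? e[i].dst : e[i].src) == s)
            count++;
    return count;
}

/* Remove every state, other than start and end, that has exactly one
 * input and one output transition. Returns the number of states removed. */
int collapse_pass_through(Edge *e, size_t n, int n_states,
                          int start, int end) {
    int removed = 0;
    assert(e != NULL);
    for (int s = 0; s < n_states; s++) {
        Edge *in = NULL, *out = NULL;
        if (s == start || s == end)
            continue;
        if (degree(e, n, s, 1) != 1 || degree(e, n, s, 0) != 1)
            continue;
        for (size_t i = 0; i < n; i++) {
            if (!e[i].live) continue;
            if (e[i].dst == s) in = &e[i];
            if (e[i].src == s) out = &e[i];
        }
        if (in == out)          /* a lone self loop is not a pass-through */
            continue;
        in->dst = out->dst;     /* a->s->b becomes a->b */
        out->live = 0;
        removed++;
    }
    return removed;
}
```

Keeping the start and end states out of the pass means the separate "remove start/end states" step can be decided independently.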
DOCUMENTATION:
-------------
o Add note about interpretation of exonerate alignments
(split introns, translating equivalence notation etc.)
o Add note about --score and --hspthreshold for ungapped models
- or add a warning/error when different
- or make hsp_threshold = MAX(score, hsp_threshold)
o Add new examples to the man page.
eg. multiple files etc.
o Add info about each type of model.
---------------------------------------------------------------------
--< RELEASE POINT >--
--< NEXT RELEASE POINT >--
o Add macros for Span start/end reporting functions
(Heuristic_Span_src_report_end_func,
Heuristic_Span_dst_init_start_func).
o Add option to show single-char AAs in alignments.
o Option for no revcomp of query or target (useful for exhaustive)
o Division of calcs into [independent, qy dep, tg dep, qy/tg dep].
(this will allow further optimisations)
o Add full alignment dump. One of:
o each transition
o each non-silent transition
o each differently scoring label.
or just add match:mismatch base-pair level detail to --ryo
eg. vulgar: M 7 7 (%m) -> 5 5 5 -4 5 5 5
o Alignment dumping/regeneration facility
o Add FastaDB sequence cache ?
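The base-pair level detail suggested for --ryo above (M 7 7 expanding to 5 5 5 -4 5 5 5) amounts to scoring each column of an ungapped match block. A minimal sketch, assuming a flat match/mismatch score rather than a full substitution matrix; match_block_scores() is a hypothetical helper and the +5/-4 values are simply those in the example.

```c
#include <assert.h>
#include <stddef.h>

/* Fill out[0..len-1] with one score per column of an ungapped block.
 * query and target must each hold at least len characters. */
void match_block_scores(const char *query, const char *target, size_t len,
                        int match, int mismatch, int *out) {
    assert(query != NULL && target != NULL && out != NULL);
    for (size_t i = 0; i < len; i++)
        out[i] = (query[i] == target[i]) ? match : mismatch;
}
```

Called on two 7-base strings that differ only in the fourth column, with match 5 and mismatch -4, this fills the array with the values shown in the vulgar example.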
--< SUBSEQUENT RELEASE POINT >--
o Add simple model specification (like SMILES strings ?)
- describe model structure using predefined states
and transition types
- should be able to describe all existing models
o Change default nucleic matrix to +5/-1 ?
o Optimise DP row_matrix access with shadows ?
o Tidy codegen
o Removal of unnecessary codegen (with Optimal_Type)
o Check function naming to ensure reuse of global model codegen
o Add --use-config <name> option (auto update implementation)
o Option to dump parameter set used ?
o GFF3 support ? http://song.sf.net/
http://sourceforge.net/mailarchive/forum.php?thread_id=2591765&forum_id=27223
--
o Training of heuristic params using exhaustive alignments,
training of wordhood params using ungapped suboptimal alignments
etc
o Installed headers for libxnr8/libc4 (as required for C4 codegen)
- requires addition of opaque types
o Add special tests directory including:
* tests with data
* all tests with valgrind
* all tests with MALLOC_CHECK_
o Automation of model:[global,bestfit,local,overlap] system ?
o Test speed of simplified DP implementations:
- by removing initialisations (on PatternMatrix:normal)
- or a faster interpreted implementation ?
o Sublinear time / non-word-based HSP-generation
o Stats
o Heuristics for flips / reversals ?
o Heuristics for SCFGs
o Blast-compatible output format (sigh)
--