-
Notifications
You must be signed in to change notification settings - Fork 1
/
README
572 lines (453 loc) · 21.9 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
README file for Leucippus software distribution.
Software Goal
=============
Leucippus is a Java program aimed at determining mosaic SNVs and estimating
their allelic frequency at specified sites of interest from amplicon and
capture sequencing data. Mosaic SNV is one that is not uniformly present in
the DNA of all cells sequenced. To make a decision about an SNV being mosaic,
Leucippus estimates background sequencing noise from the other sites
surrounding specified sites of interest. Confidence in conclusion for each
site is reflected in a p-value, which is calculated based on the background
noise. A dedicated functionality displays noise in the data graphically, to
enable data visual inspection and quality control. The software also allows
constructing long read-fragments from pairs of reads with overlapping 3'-ends.
Leucippus Five Tasks
=====================
a. Constructs long reads
-construction of long read by sliding paired reads and then choosing
the long read with the best matching window.
(the sliding is performed from left to right and
from right to left(the paired reads are in fastq format, thus
one read must be reversed and complemented before the sliding
process).
b. Creates noise tables that provide the following information
for each reference position around the sites of interest:
- expected base counts
- specific mismatch counts
- next position(s) deletion (number counts)
- next position(s) insertion (sequence-number counts)
- total position counts
c. Creates graphs (single or comparison):
- p-value and mutation rate graphs by reading the above table(s).
d. Making decision about provided sites, whether they are mosaics.
e. Create pvalue or mutation rate graphs for a particular site, by reading
noise table files
1. Compilation
==============
Leucippus is a Java application. It runs on UNIX operating system with java
installed on it. First the Java extension files (programs) must be compiled
into byte-code using the following directions.
To compile Leucippus users must do the following:
$ cd src
$ javac Leucippus.java
To test "Leucippus.class" while they are in Leucippus root directory, users
must run:
$ java Leucippus
It must print on the screen the following general info about Leucippus
commands:
Program: Leucippus (Mosaics: Site Noise-Estimation, Validation, Identification)
Version: 0.1.0
Usage: Leucippus <command> [options]
Command: frag create long reads
noisetab create noise table
graph create graph
decide decide [somatic | germline | unknown | omitted]
extract extract reads that present alternative allele
config -cf+suffix <value> [none]
2. Leucippus Commands
=====================
Leucippus has 4 commands: frag, noisetab, graph, and decide. Each command is
associated with a particular task:
a. 'frag' => Constructs long reads from short paired reads.
b. 'noisetab' => Creates noise tables for sites of interest.
c. 'graph' => Creates graphs from noise tables.
d. 'decide' => Makes decision whether sites of interest are mosaics.
e. 'posgraph' => Creates position graphs from noise tables.
Cut offs at each step:
a. 'frag'
min [50] bp overalp
min 75% of matching bases in the overlap (hardcoded)
b. 'noisetab'
min read mapping quality 30 (hardcoded)
min base quality [20]
distance from interval boundaries [-1]
c. 'graph'
min coverage [100]
max range [0.005]
d. 'decide'
min germline AF [0.35]
min coverage [100]
max p-value [0.05]
e. 'posgraph'
min coverage [100]
max range [0.005]
A. Command 'frag'
-----------------
Leucippus constructs long fragments from paired reads which overlap. To find
overlap, reads are sled against each other by one base pair at a time.
Beginning of read sliding
read1 5'----------3' slide to the right >
read2 3'----------5' slide to the left <
End of read sliding
read1 5'----------3'
read2 3'----------5'
Matched and mismatched nucleotides in the overlap are counted. Then a
statistical test, using binomial coefficients, is performed to calculate
the probability that such or a larger count of matches for the overlap length
can occur by chance. Here a random chance for a base match is set to 0.25. The
best overlap is the one with the smallest such probability. The pair of reads
with the best overlap of at least 50 nucleotides in length and at least 75% of
matched nucleotides is used to construct a long fragment. Nucleotides in the
overlapping part of both reads are constructed in the following way:
* if nucleotides match, then this nucleotide is assigned at the current
position, and its quality is equal to the sum of the nucleotide qualities at
this position in each read.
* if nucleotides mismatch, then the one with the higher quality is assigned at
the current position, while its quality is reduced by the quality of the
other nucleotide. If the qualities are equal, one of the two nucleotides is
chosen randomly and its quality is assigned zero.
Leucippus finds pairs of reads from their names. It recognizes the following
naming conventions for read pairs.
@name type
@name:type
@name/type
@name
where 'type' is one character (1-9 or a letter) to specify a read in a pair.
For example,
First read @HWI-7001311 1
Second read @HWI-7001311 2
First read @read_pair_1:1
Second read @read_pair_1:3
First read @HWI-7001311:50:H9V9KADXX:1:2116:15471:92324/1
Second read @HWI-7001311:50:H9V9KADXX:1:2116:15471:92324/2
First read @HWI-7001311:50:H9V9KADXX:1:2116:15471:92324
Second read @HWI-7001311:50:H9V9KADXX:1:2116:15471:92324
If other separator (other than 'space', ':', or '/') is used then this can be
stated using a configuration argument -cfseparator.
The usage of the command is as follows:
java Leucippus frag [options] [file1.fq.gz file2.fq.gz]
Options: -o FILE output file in gzip (e.g., longreads.fq.gz)
-o1 FILE output file in gzip (e.g., remainedreads1.fq.gz)
-o2 FILE output file in gzip (e.g., remainedreads2.fq.gz)
-l maximum long reads length [500]
Input: file1.fq.gz input gzipped file with reads in fastq format
(if not provided, input is taken from standard input)
file2.fq.gz input gzipped file with reads in fastq format
First fastq file may contain both reads. In such a case there is no need to
provide second fastq file. If no input files are provided, Leucippus expects
reads in fastq format from standard input (e.g., STDIN). This feature allows
to generate long fragments from aligned .bam file by piping the output
of 'samtools bam2fq' command to Leucippus (see example below).
Examples:
java Leucippus frag -o long.fq.gz \
pathtoreads/reads1.fastq.gz \
pathtoreads/reads2.fastq.gz
java Leucippus frag -o long.fq.gz -o1 remained1.fq.gz -o2 remained2.fq.gz \
pathtoreads/reads1.fastq.gz \
pathtoreads/reads2.fastq.gz
java Leucippus frag -o out.fq.gz -l 300 pathtoreads/both_reads.fastq.gz
gzip -cd pathtoreads/both_reads.fastq.gz | java Leucippus frag -o out.fq.gz
samtools bam2fq pathtobam/file.bam | java Leucippus frag -o out.fq.gz
To skip secondary aligments in BAM use the following
samtools view -b -F 0x900 pathtobam/file.bam | samtools bam2fq - | \
java Leucippus frag -o out.fq.gz
B. Command 'noisetab'
---------------------
Leucippus tabulates sequencing error (e.g., "noise") for sites targeted by
amplicon or capture sequencing. This information is used later to call mosaic
sites. In essence, the information tabulated is a comprehensive count of
nucleotides and small "indels" in each read, covering a particular position.
Therefore, the required inputs are a list of targeted sites, and .bam files
with aligned reads. The table includes all necessary information that is used
to visualize noise and make a decision about sites of interest being mosaic.
The tab-delimited interval file specifying the sites of interest is of the
following format:
CHROM POS REF ALT LEFT RIGHT
1 11242672 T A 11242499 11242922
1 72605472 C T 72605349 72605579
2 15636892 G C 15636830 15637227
where
CHROM is chromosome
POS is site of interest
REF is expected nucleotide (same as reference nucleotide)
ALT is alternative nucleotide (different from REF)
LEFT is start of a targeted interval
RIGHT is end of a targeted interval
All of these columns are required, but additional columns (e.g., with other
information relevant to a particular experiment) are allowed. The distance cut
off is applied to reads covering a particular site of interest and specifies:
i) the maximum distance from the first aligned position in a long read to the
LEFT position of the corresponding interval;
ii) the maximum distance from the last aligned position in a long read to the
RIGHT position of the corresponding interval.
If distance cut off is not specified or negative, then all reads will be
considered as data source (the default value is -1). The output table is
of the following format:
chr pos x y Nuc_ref Nuc_exp tot As Cs Ts Gs
2 5627738 261 184 A A 8003 7987 9 3 4
2 5627739 262 183 C C 8003 10 7988 2 3
2 5627740 263 182 C G 8003 14 7980 3 6
2 5627741 264 181 A A 8004 7994 3 1 6
2 5627742 265 180 C C 8004 13 7980 6 5
where
chr is chromosome
pos is position
x is the distance from the LEFT of the targeted interval
y is the distance from the RIGHT of the targeted interval
Nuc_ref is the reference nucleotide for this position
Nuc_exp is the expected nucleotide (only different from Nuc_ref at
position for the sites of interest)
tot is the total number of by good bases (i.e., passign base quality
and padding cutoffs) at this position
tot = #A + #C + #T + #G + #Ds, where D represents deletion
As is the coverage by As passing all cut-offs
Cs is the coverage by Cs passing all cut-offs
Ts is the coverage by Ts passing all cut-offs
Gs is the coverage by Gs passing all cut-offs
Noise tables also contain detailed information about mutations,
deletions (start, length), and insertions (start, length, and sequence)
of all positions in intervals of targeted site.
Example of additional columns:
Ds Ps cov D1 D2... D10 I1 I2... I10 DelInfo InsInfo
0 0 8053 2 0 ... 0 0 1 ... 0 C,C TA,
2 0 8053 0 0 ... 0 2 0 ... 0 0 A,A
0 0 8053 0 0 ... 0 0 0 ... 0 0 0
0 0 8053 0 0 ... 0 0 0 ... 0 0 0
0 0 8053 0 0 ... 0 0 0 ... 0 0 0
where
Ds is the count of deletions found for this position
Ps is the count of Ps when padding has been applied
cov is the total number of reads covering the base; this is used for
calculation of indel allele frequency
D1 is the count of reads with one base deletion starting at next position
D2 is the count of reads with two base deletion starting at next position
...
I1 is the count of reads with one base insertion at the next position
I2 is the count of reads with two base insertion at the next position
...
DelInfo is detailed deletion(s) information, for example, "0_1D_5627738_1" is
read with one base deletion after position 5627738 (at 5627739)
InsInfo is detailed insertion(s) information including the insertion sequence
The usage of the "noisetab" command is as follows:
java Leucippus noisetab [options] file1.bam [file2.bam, ...]
Options:
-interval FILE file with intervals of targeted site
-ref FILE FASTA file with reference sequence (could be .gz file)
-q base quality cut off 0-100 [20]
-d distance cut off [-1]
-o FILE output file
Input:
file1.bam a file with aligned reads
file2.bam a file with aligned reads
Examples:
java Leucippus noisetab \
-interval pathtosites/sites.txt \
-ref pathtoreference/ref.fa.gz \
-o pathtooutput/noisetable.tsv \
pathtobams/file01_LONG.bam pathtobams/file02_LONG.bam
(example with custom 'samtools' directory)
java Leucippus noisetab \
-interval pathtosites/sites.txt \
-ref pathtoreference/ref.fa.gz \
-o pathtooutput/noisetable.tsv \
-cfsam pathtosamtools/bin \
pathtobams/file01_LONG.bam pathtobams/file02_LONG.bam
(example with custom quality cut off, and distance cut off)
java Leucippus noisetab \
-interval pathtosites/sites.txt \
-ref pathtoreference/ref.fa.gz \
-o pathtooutput/noisetable.tsv \
-q 20 -d 10 \
pathtobams/file01_LONG.bam pathtobams/file02_LONG.bam
C. Command 'graph'
------------------
Leucippus can create graphs for substitution rates and p-values of random
chance, for each nucleotide substitution type (i.e., C to T, etc.). The
graphs are created from the noise table generated at the previous step and
exclude the sites of interest (where Nuc_exp and Nuc_alt are not the same).
Mismatches to the reference base in aligned long reads are assumed to be
noise, either from sample preparation (e.g., amplification) or sequencing. A
mutation rate graph shows the frequency of sites versus rate of a particular
substitution type (e.g., C to T). A p-value graph shows the proportion of
sites having substitution rate larger than a value. For example, for the
expected nucleotide C, substitution type C to T, and substitution frequency
of 0.01 the p-value is the fraction of C-sites having T substitution frequency
larger than 0.01, i.e., #C_sites_with_CtoT_more_0.01/#C_sites. To make the
graphs Leucippus calls an R script passing the necessary parameters to it.
The usage of the "graph" command is as follows:
Leucippus graph [options] -o <prefix> table1.file [table2.file]
Options: -type graph type: pvalue|mutrate [pvalue]
-coverage minimum site/position coverage [100]
-range DOUBLE maximum range for error [0.005]
-overlap INT use positions in overlapping 3'-ends of reads.
The number specifies read length. Only useful
for analysis of amplicon-seq data
-o prefix for output files: prefix.pdf,
prefix.fpvtb[1,2].tsv
Second file with noise table is optional. If it is given, graphs from the two
files will be superimposed, with colours representing substitution type and
line style (solid and dashed) representing data from different files. Such
visualization allows for instant comparison of noise level in different data.
Examples:
java Leucippus graph \
-o pathtooutput/graphortable \
-type pvalue \
pathtoinputnoisetable/MosaicNoiseTable.tsv
Example of Leucippus command that provides a path for R (graph):
java Leucippus graph \
-o pathtooutput/graphortable \
-type pvalue \
-cfR ~/pathtoR \
pathtoinputnoisetable/MosaicNoiseTable.tsv
Example of Leucippus command that provides range and coverage arguments:
java Leucippus graph \
-o pathtooutput/graphortable \
-coverage 1000 \
-range 0.03 \
-type pvalue \
-cfR pathtoR \
pathtoinputnoisetable/MosaicNoiseTable.tsv
Example of Leucippus command for comparison (graph):
java Leucippus graph \
-o pathtooutput/graphortable \
-type pvalue \
pathtoinputnoisetable/MosaicNoiseTable1.tsv \
pathtoinputnoisetable/MosaicNoiseTable2.tsv
D. Command 'decide'
-------------------
Leucippus can decide whether sites of interest are germline or somatic. By
default sites with at least 35% non reference nucleotide coverage are termed
germline. The decision to call a variant as somatic is done by comparing
the frequency of expected non reference nucleotide with the background
substitution rate of reference to expected nucleotide. The background rate is
derived from all sites in the noise table, excluding sites of interest and
likely germline variants that have more than 10% of non reference nucleotide
coverage. By default, sites with p-value of less than 0.05 are deemed somatic.
Also, by default, sites with read coverage less than 100 reads are deemed as
having insufficient data and omitted from decision process. The remaining sites
are called undetermined, i.e., they are not germline, but frequency of
expected nucleotide is consistent with background.
The input is a noise table described above for the section for 'noisetab'
command. The output is a tab delimited file in the following format:
CHR POS REF EXP COV FREQ(#EXP/COV) P-VALUE CONCLUSION
2 5627740 C G 8107 0.00333045516 0.00131 somatic
2 6912245 T C 3305 0.00121028744 0.08378 unknown
2 16266496 G T 19691 0.47036 0.0 germline
2 40177413 C T 52788 0.05804 0.0 somatic
2 47883415 A G 152619 0.47365 0.0 germline
2 60074304 A T 27167 0.07026 0.0 somatic
where:
CHR is chromosome
POS is position
REF is the reference nucleotide
EXP is the expected nucleotide (different than reference)
COV is the site coverage(number of reads found, which include the site)
FREQ is the number of expected nucleotides found, divided by site coverage
P-VALUE is the p-value of the site being somatic
CONCLUSION is the decision made for the site:
- germline close to 50% allele frequency
- omitted insufficient data (low coverage)
- somatic p-value lower than threshold
- undetermined not germline and with p-value higher than threshold.
The usage of the "decide" command is as follows:
Leucippus decide [options] table.file
Options: -o FILE results
-coverage minimum site/position coverage [100]
-germlineAF DOUBLE minimum AF to call variant as germline [0.35]
-pvalue DOUBLE p-value for somatic call[0.05]
Examples:
Example of complete decide command (all parameters have been stated)
java Leucippus decide \
-coverage 200 \
-germlineAF 0.35 \
-o pathtoresults/resultstable.tsv \
-pvalue 0.03
pathtoinput/noisetable.tsv
Example of simplest Leucippus decide command with default values for
"-coverage[100], -germlineAF[0.35], -pvalue[0.05]":
java Leucippus decide \
-o pathtoresults/resultstable.tsv \
pathtoinput/noisetable.tsv
E. Command 'posgraph'
--------------------
Leucippus can create graphs for visualizing nucleotide substitution rates
and p-values for a site-position or any other position present in multiple noise
tables. The position graph depicts the behaviour of Allele Frequency of the site
alternative nucleotide found in noise tables. Sinse the site alternative
nucleotide can be more than one, the posgraph command generates three lines for
all possible substitutions. For example if the reference in site-position under
investigation is A, then posgraph (with -type p-value command) will generate
three pvalue lines for substitutions A to T, A to C, and A to G. Each one of
the three lines in the plot has different color. The color information is
provided with the x-title.
Usage: Leucippus posgraph [options] -o <prefix> [table1 table2 .... table(n)]
Options: -type graph type: pvalue|mutrate [pvalue]
-pos chr(x):position (1-22, X, Y) [chr1:1000000]
-coverage INT minimum site/position coverage [100]
-range DOUBLE maximum range for error [0.05]
-overlap INT use positions in overlapping 3'-ends of reads
(the number specifies read length and it is
only useful for analysis of amplicon-seq data)
-o prefix for output files: prefix.pdf
Example:
Example of posgraph command
java Leucippus posgraph \
-type pvalue \
-pos 4:153253817 \
-cfR /pathtoRscript/Rscript \
-coverage 200 \
-range 0.05 \
-o pathtooutput/outposgraph \
pathtoinput/noisetable1.tsv pathtoinput/noisetable2.tsv ...
F. Command 'extract'
--------------------
Leucippus provides 'extract' command for read retrieval from bam files. User
have to specify position and nucleotide in question. The returned reads will
be those that present the nucleotide in question.
The usage of the "extract" command is as follows:
java Leucippus extract [options] file.bam
Options: -o FILE results
-a alelle_X [X could be A,C,T,G]
-p chr:position in format according to bam
(for examples, chr3:234234, 1:5454635)
example:
java Leucippus extract \
-o outfile.tsv \
-a T \
-p chr15:34563 \
/pathtobam/file.bam
3. Configuration
================
Leucippus uses softwares samtools and R at various steps of calculations.
To run these softwares it assumes that path to their binaries exists in the
system PATH variable and that bash shell is run by /bin/bash. Additionally,
it generates temporary folder 'Temp' in the current directory, to store
temporary files during execution of 'noisetab' command. Path to the software,
bash shell, and temporary folder can be redefined through configuration
parameters. These parameters can be used with any command.
Configuration Options:
-cfsam <path to samtools executable [samtools]>
-cfR <path to R executable [R]>
-cftmp <path to temporary folder [./Temp]>
-cfbash <path to bash shell [/bin/bash]>
-cfseparator character.
Examples:
java Leucippus noisetab -interval pathtosites/intervalsites.txt \
-ref /pathtoref/GenomeRef.tsv \
-o pathtooutput/Table1.tsv \
-q 20 -d 0 \
-cfsam pathtosamtools/my_samtools \
-cfbash pathtobash/my_bash \
-cftmp pathtotmpfolder/my_tmp \
pathtoinputbam/longreads.bam
java Leucippus graph \
-o pathtooutput/graphortable \
-type pvalue \
-cfR pathtoR/my_R \
-cfbash pathtobash/my_bash \
pathtoinputnoisetable/MosaicNoiseTable.tsv
java Leucippus frag -o long.fq.gz \
-cfseparator #
pathtoreads/reads1.fastq.gz \
pathtoreads/reads2.fastq.gz
Please send your comments and suggestions to Abyzov.Alexej@mayo.edu