This is an experimental, untested prototype!! Do not use this for production analyses.
This will be ported to the biom-format
project at some point. You can track progress on this on biocore/biom-format#627. Feedback is welcome.
This code should work with biom-format >= 2.1.4, < 2.2.0
and python >= 2.7, < 3.0
.
conda create -n scaling-octo-robot python=2.7 scikit-bio
source activate scaling-octo-robot
pip install biom-format
wget https://github.com/gregcaporaso/scaling-octo-robot/archive/master.zip
unzip master.zip
cd scaling-octo-robot-master/
For now, you can run commands from that directory. That will be improved soon (but the priority is for code unit testing). The following test commands should all work from your scaling-octo-robot-master/
directory.
# example input file included in the repository, generated by uclust (with open reference clustering)
$ head -15 test.uc
# uclust --input seqs1.fna --lib refseqs.fna --uc results.uc --id 0.97 --usersort
# version=1.2.22
# Tab-separated fields:
# 1=Type, 2=ClusterNr, 3=SeqLength or ClusterSize, 4=PctId, 5=Strand, 6=QueryStart, 7=SeedStart, 8=Alignment, 9=QueryLabel, 10=TargetLabel
# Record types (field 1): L=LibSeed, S=NewSeed, H=Hit, R=Reject, D=LibCluster, C=NewCluster, N=NoHit
# For C and D types, PctId is average id with seed.
# QueryStart and SeedStart are zero-based relative to start of sequence.
# If minus strand, SeedStart is relative to reverse-complemented seed.
L 3 1389 * * * * * 295053 *
H 3 128 100.0 + 0 0 519I128M742I f2_1271 HWI-EAS440_0386:1:30:4487:20156#0/1 orig_bc=ACCAGACGATGC new_bc=ACCAGACGATGC bc_diffs=0 295053
H 3 133 100.0 + 0 0 519I133M737I f2_1539 HWI-EAS440_0386:1:31:12039:10494#0/1 orig_bc=ACCAGACGATGC new_bc=ACCAGACGATGC bc_diffs=0 295053
H 3 133 100.0 + 0 0 519I133M737I f1_2278 HWI-EAS440_0386:1:32:3943:19113#0/1 orig_bc=ACACTGTTCATG new_bc=ACACTGTTCATG bc_diffs=0 295053
H 3 133 100.0 + 0 0 519I133M737I f2_2349 HWI-EAS440_0386:1:33:11754:2337#0/1 orig_bc=ACCAGACGATGC new_bc=ACCAGACGATGC bc_diffs=0 295053
H 3 133 100.0 + 0 0 519I133M737I f1_2727 HWI-EAS440_0386:1:33:17206:16370#0/1 orig_bc=ACACTGTTCATG new_bc=ACACTGTTCATG bc_diffs=0 295053
H 3 133 100.0 + 0 0 519I133M737I f2_2750 HWI-EAS440_0386:1:33:4173:17707#0/1 orig_bc=ACCAGACGATGC new_bc=ACCAGACGATGC bc_diffs=0 295053
# call uc2biom with the input file and the output file
$ ./uc2biom test.uc test.biom
# summarize the resulting biom table
$ biom summarize-table -i test.biom
Num samples: 9
Num observations: 10
Total count: 189
Table density (fraction of non-zero values): 0.256
Counts/sample summary:
Min: 17.0
Max: 22.0
Median: 22.000
Mean: 21.000
Std. dev.: 1.563
Sample Metadata Categories: None provided
Observation Metadata Categories: None provided
Counts/sample detail:
p2: 17.0
not16S.1: 20.0
f1: 21.0
f2: 21.0
p1: 22.0
f4: 22.0
t2: 22.0
t1: 22.0
f3: 22.0
# example input file, generated by vsearch (with de novo clustering)
$ head seqs1.vsearch.uc
S 0 152 * * * * * f1_19304 *
H 0 152 100.0 + 0 0 152M f1_4866 f1_19304
H 0 152 98.7 + 0 0 152M f1_8307 f1_19304
H 0 152 100.0 + 0 0 152M f1_8746 f1_19304
S 1 152 * * * * * f2_4913 *
H 0 152 100.0 + 0 0 152M f3_136 f1_19304
H 0 152 98.7 + 0 0 152M f3_140 f1_19304
H 0 152 100.0 + 0 0 152M f3_141 f1_19304
H 0 152 100.0 + 0 0 152M f3_150 f1_19304
S 2 152 * * * * * p1_38017 *
# call uc2biom with the input file and the output file
$ ./uc2biom test.vsearch.uc test.biom
# summarize the resulting biom table
$ biom summarize-table -i test.biom
Num samples: 9
Num observations: 9
Total count: 187
Table density (fraction of non-zero values): 0.272
Counts/sample summary:
Min: 18.0
Max: 22.0
Median: 21.000
Mean: 20.778
Std. dev.: 1.227
Sample Metadata Categories: None provided
Observation Metadata Categories: None provided
Counts/sample detail:
p2: 18.0
f1: 20.0
not16S.1: 20.0
f2: 21.0
t1: 21.0
p1: 21.0
f3: 22.0
f4: 22.0
t2: 22.0
This requires vsearch 1.7.0
or later. Observation ids will be sha1 hashs of the sequences (since it's dereplication, all sequences in an OTU are identical). This means that the resulting BIOM tables can be safely merged across runs of these commands (provided that your sample ids don't overlap unless they should - you should be using CualID for your sample identifiers to help with that).
$ vsearch --derep_fulllength seqs.fna --output vsearch-derep-rep-set.fna --uc vsearch-derep.uc --relabel_sha1 --relabel_keep
$ ./uc2biom vsearch-derep.uc vsearch-derep.biom vsearch-derep-rep-set.fna
$ biom summarize-table -i vsearch-derep.biom
Num samples: 9
Num observations: 43
Total count: 200
Table density (fraction of non-zero values): 0.147
Counts/sample summary:
Min: 22.0
Max: 23.0
Median: 22.000
Mean: 22.222
Std. dev.: 0.416
Sample Metadata Categories: None provided
Observation Metadata Categories: None provided
Counts/sample detail:
p2: 22.0
f1: 22.0
f2: 22.0
f3: 22.0
f4: 22.0
t2: 22.0
not16S.1: 22.0
t1: 23.0
p1: 23.0
# example input file included in the repository, generated by Diamond.
$ head test.bl9
1_1bx_ATCACG_L001_R1_001_1 637159676 65.62 32 11 0 1 96 26 57 7e-05 51.6
1_1bx_ATCACG_L001_R1_001_6 648914362 96.67 30 1 0 92 3 90 119 9e-09 64.3
1_1bx_ATCACG_L001_R1_001_6 647247189 90.00 30 3 0 92 3 90 119 6e-08 61.6
1_1bx_ATCACG_L001_R1_001_6 651125185 90.00 30 3 0 92 3 90 119 1e-07 60.8
...
# call bl2biom with the input file and the output file
$ ./bl2biom test.bl9 test.biom
# summarize the resulting biom table
$ biom summarize-table -i test.biom
Num samples: 3
Num observations: 934
Total count: 1000
Table density (fraction of non-zero values): 0.335
Counts/sample summary:
Min: 43.0
Max: 882.0
Median: 75.000
Mean: 333.333
Std. dev.: 388.186
Sample Metadata Categories: None provided
Observation Metadata Categories: None provided
Counts/sample detail:
3: 43.0
2: 75.0
1: 882.0