Skip to content

caporaso-lab-graveyard/scaling-octo-robot

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

This is an experimental, untested prototype!! Do not use this for production analyses.

This will be ported to the biom-format project at some point. You can track progress on this on biocore/biom-format#627. Feedback is welcome.

This code should work with biom-format >= 2.1.4, < 2.2.0 and python >= 2.7, < 3.0.

Installing with conda

conda create -n scaling-octo-robot python=2.7 scikit-bio
source activate scaling-octo-robot
pip install biom-format
wget https://github.com/gregcaporaso/scaling-octo-robot/archive/master.zip
unzip master.zip
cd scaling-octo-robot-master/

For now, you can run commands from that directory. That will be improved soon (but the priority is for code unit testing). The following test commands should all work from your scaling-octo-robot-master/ directory.

uclust/usearch/vsearch uc to BIOM

# example input file included in the repository, generated by uclust  (with open reference clustering)
$ head -15 test.uc
# uclust --input seqs1.fna --lib refseqs.fna --uc results.uc --id 0.97 --usersort
# version=1.2.22
# Tab-separated fields:
# 1=Type, 2=ClusterNr, 3=SeqLength or ClusterSize, 4=PctId, 5=Strand, 6=QueryStart, 7=SeedStart, 8=Alignment, 9=QueryLabel, 10=TargetLabel
# Record types (field 1): L=LibSeed, S=NewSeed, H=Hit, R=Reject, D=LibCluster, C=NewCluster, N=NoHit
# For C and D types, PctId is average id with seed.
# QueryStart and SeedStart are zero-based relative to start of sequence.
# If minus strand, SeedStart is relative to reverse-complemented seed.
L	3	1389	*	*	*	*	*	295053	*
H	3	128	100.0	+	0	0	519I128M742I	f2_1271 HWI-EAS440_0386:1:30:4487:20156#0/1 orig_bc=ACCAGACGATGC new_bc=ACCAGACGATGC bc_diffs=0	295053
H	3	133	100.0	+	0	0	519I133M737I	f2_1539 HWI-EAS440_0386:1:31:12039:10494#0/1 orig_bc=ACCAGACGATGC new_bc=ACCAGACGATGC bc_diffs=0	295053
H	3	133	100.0	+	0	0	519I133M737I	f1_2278 HWI-EAS440_0386:1:32:3943:19113#0/1 orig_bc=ACACTGTTCATG new_bc=ACACTGTTCATG bc_diffs=0	295053
H	3	133	100.0	+	0	0	519I133M737I	f2_2349 HWI-EAS440_0386:1:33:11754:2337#0/1 orig_bc=ACCAGACGATGC new_bc=ACCAGACGATGC bc_diffs=0	295053
H	3	133	100.0	+	0	0	519I133M737I	f1_2727 HWI-EAS440_0386:1:33:17206:16370#0/1 orig_bc=ACACTGTTCATG new_bc=ACACTGTTCATG bc_diffs=0	295053
H	3	133	100.0	+	0	0	519I133M737I	f2_2750 HWI-EAS440_0386:1:33:4173:17707#0/1 orig_bc=ACCAGACGATGC new_bc=ACCAGACGATGC bc_diffs=0	295053

# call uc2biom with the input file and the output file
$ ./uc2biom test.uc test.biom

# summarize the resulting biom table
$ biom summarize-table -i test.biom
Num samples: 9
Num observations: 10
Total count: 189
Table density (fraction of non-zero values): 0.256

Counts/sample summary:
 Min: 17.0
 Max: 22.0
 Median: 22.000
 Mean: 21.000
 Std. dev.: 1.563
 Sample Metadata Categories: None provided
 Observation Metadata Categories: None provided

Counts/sample detail:
 p2: 17.0
 not16S.1: 20.0
 f1: 21.0
 f2: 21.0
 p1: 22.0
 f4: 22.0
 t2: 22.0
 t1: 22.0
 f3: 22.0
# example input file, generated by vsearch (with de novo clustering)
$ head seqs1.vsearch.uc
S	0	152	*	*	*	*	*	f1_19304	*
H	0	152	100.0	+	0	0	152M	f1_4866	f1_19304
H	0	152	98.7	+	0	0	152M	f1_8307	f1_19304
H	0	152	100.0	+	0	0	152M	f1_8746	f1_19304
S	1	152	*	*	*	*	*	f2_4913	*
H	0	152	100.0	+	0	0	152M	f3_136	f1_19304
H	0	152	98.7	+	0	0	152M	f3_140	f1_19304
H	0	152	100.0	+	0	0	152M	f3_141	f1_19304
H	0	152	100.0	+	0	0	152M	f3_150	f1_19304
S	2	152	*	*	*	*	*	p1_38017	*

# call uc2biom with the input file and the output file
$ ./uc2biom test.vsearch.uc test.biom

# summarize the resulting biom table
$ biom summarize-table -i test.biom
Num samples: 9
Num observations: 9
Total count: 187
Table density (fraction of non-zero values): 0.272

Counts/sample summary:
 Min: 18.0
 Max: 22.0
 Median: 21.000
 Mean: 20.778
 Std. dev.: 1.227
 Sample Metadata Categories: None provided
 Observation Metadata Categories: None provided

Counts/sample detail:
 p2: 18.0
 f1: 20.0
 not16S.1: 20.0
 f2: 21.0
 t1: 21.0
 p1: 21.0
 f3: 22.0
 f4: 22.0
 t2: 22.0

vsearch dereplication to BIOM and representative sequences

This requires vsearch 1.7.0 or later. Observation ids will be sha1 hashs of the sequences (since it's dereplication, all sequences in an OTU are identical). This means that the resulting BIOM tables can be safely merged across runs of these commands (provided that your sample ids don't overlap unless they should - you should be using CualID for your sample identifiers to help with that).

$ vsearch --derep_fulllength seqs.fna --output vsearch-derep-rep-set.fna --uc vsearch-derep.uc --relabel_sha1 --relabel_keep
$ ./uc2biom vsearch-derep.uc vsearch-derep.biom vsearch-derep-rep-set.fna

$ biom summarize-table -i vsearch-derep.biom
Num samples: 9
Num observations: 43
Total count: 200
Table density (fraction of non-zero values): 0.147

Counts/sample summary:
 Min: 22.0
 Max: 23.0
 Median: 22.000
 Mean: 22.222
 Std. dev.: 0.416
 Sample Metadata Categories: None provided
 Observation Metadata Categories: None provided

Counts/sample detail:
 p2: 22.0
 f1: 22.0
 f2: 22.0
 f3: 22.0
 f4: 22.0
 t2: 22.0
 not16S.1: 22.0
 t1: 23.0
 p1: 23.0

BLAST-tablular output to BIOM

# example input file included in the repository, generated by Diamond.
$ head test.bl9
1_1bx_ATCACG_L001_R1_001_1	637159676	65.62	32	11	0	1	96	26	57	7e-05	51.6
1_1bx_ATCACG_L001_R1_001_6	648914362	96.67	30	1	0	92	3	90	119	9e-09	64.3
1_1bx_ATCACG_L001_R1_001_6	647247189	90.00	30	3	0	92	3	90	119	6e-08	61.6
1_1bx_ATCACG_L001_R1_001_6	651125185	90.00	30	3	0	92	3	90	119	1e-07	60.8
...

# call bl2biom with the input file and the output file
$ ./bl2biom test.bl9 test.biom

# summarize the resulting biom table
$ biom summarize-table -i test.biom
Num samples: 3
Num observations: 934
Total count: 1000
Table density (fraction of non-zero values): 0.335

Counts/sample summary:
 Min: 43.0
 Max: 882.0
 Median: 75.000
 Mean: 333.333
 Std. dev.: 388.186
 Sample Metadata Categories: None provided
 Observation Metadata Categories: None provided

Counts/sample detail:
 3: 43.0
 2: 75.0
 1: 882.0

Releases

No releases published

Packages

No packages published