Skip to content

Commit

Permalink
BaitsTools version 0.7
Browse files Browse the repository at this point in the history
  • Loading branch information
campanam authored Dec 29, 2017
1 parent 1e3f585 commit 9f357ba
Show file tree
Hide file tree
Showing 10 changed files with 131 additions and 28 deletions.
24 changes: 23 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,10 @@ Contact: campanam@si.edu
[Deprecated](#deprecated)

## aln2baits
### Version 0.5
Upcasing removed since now read as upcased
RNA output handling added

### Version 0.4
Version constant added to header
Outputs bed files rather than coordinates tables
Expand All @@ -35,13 +39,25 @@ Corrected double-tabbed output in parameters output
Preliminary script to generate weighted bait set from a fasta alignment

## annot2baits
### Version 0.3
Upcasing removed

### Version 0.2
Version constant added to header

### Version 0.1
Preliminary script to generate baits from an annotation file and a reference sequence

## baitstools
### Version 0.7
-H defaults to haplotype
Ambiguity collapsing added to bait filtration
Fa_Seq method 'make_dna' converts uracil residues to thymines for internal consistency
make_dna upcases all sequences for later analyses
read_fasta calls make_dna after each sequence read in
frontend handling for RNA bait output
RNA output handling for snp_to_baits and collapse_ambiguity

### Version 0.6
Incorporated baitslib into baitstools main script
require statements for subcommands now require_relative
Expand Down Expand Up @@ -111,6 +127,9 @@ Popvar @alleles replaces @pnuc and @qnuc, which were otherwise unused
Preliminary script to turn a coordinates table and a reference sequence into baits

## checkbaits
### Version 0.3
RNA output handling added

### Version 0.2
Version constant added to header

Expand All @@ -133,7 +152,10 @@ Fixed bug in that reread reference sequence every time snps_to_baits run in stac
### Version 0.1
Preliminary script to turn a Stacks summary tsv file and a reference sequence into baits

## tilebaits
## tilebaits
### Version 0.6
RNA output handling added

### Version 0.5
Returns BED format information rather than coordinates tables
Version constant added to header
Expand Down
6 changes: 4 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,7 @@ A tutorial and example data are available in the example_data subdirectory of th
## Common Options
`-B, --bed`: Output BED file listing the coordinates of baits in comparison to the reference sequences. *Recommended*.
`-D, --ncbi`: Denotes that FASTA/FASTQ file headers have NCBI-style descriptors after the sample name separated by spaces (e.g. `>{sample_name} {descriptor1} {descriptor2}`).
`-Y, --rna`: Output bait sequences as RNA rather than DNA.
`-h, --help`: Print subcommand-specific help to the screen. Use without other arguments (e.g. `ruby baitstools.rb vcf2baits -h`).
`-v, --version`: Print subcommand version to the screen (which may not correspond with the BaitsTools release version). Use without other arguments (e.g. `ruby baitstools.rb vcf2baits -v`).

Expand All @@ -61,11 +62,12 @@ A tutorial and example data are available in the example_data subdirectory of th
`-c, --complete`: Remove candidate baits that are shorter than the requested bait length.
`-G, --nogaps`: Exclude candidate baits that include gap characters (-).
`-N, --noNs`: Exclude candidate baits that include unknown bases (Ns).
`-C, --collapse`: Collapse ambiguity codes to a single nucleotide. One nucleotide will be chosen randomly from the possible nucleotides (e.g. either A or G will be selected for a R site). If `-N` is specified, N sites will be treated as missing data and filtered out as specified. Otherwise, N sites will be treated as matching any nucleotide (A, G, C, T(U)).
`-n, --mingc [VALUE]`: Exclude candidate baits with a GC% below the specified value. Default is 30.0%.
`-x, --maxgc [VALUE]`: Exclude candidate baits with a GC% above the specified value. Default is 50.0%.
`-q, --mint [VALUE]`: Exclude candidate baits with melting temperature below the specified value. Default is 0.0°C.
`-z, --maxt [VALUE]`: Exclude candidate baits with melting temperature above the specified value. Default is 120.0°C.
`-T, --type [VALUE]`: Hybridization chemistry (one of `DNA-DNA`, `RNA-RNA`, or `RNA-DNA`) for melting temperature estimation. Default is `DNA-RNA`.
`-T, --type [VALUE]`: Hybridization chemistry (one of `DNA-DNA`, `RNA-RNA`, or `RNA-DNA`) for melting temperature estimation. Default is DNA-RNA.
`-s, --na [VALUE]`: Sodium concentration for melting temperature estimation. Default is 0.9M.
`-f, --formamide [VALUE]`: Formamide concentration for melting temperature estimation. Default is 0.0M.
`-Q, --meanqual [VALUE]`: Exclude baits with a mean Phred-like base quality below the specified value. Default is 20.0.
Expand All @@ -89,7 +91,7 @@ aln2baits generates baits from a DNA alignment in FASTA or FASTQ format. Bait se
`-i, --input [FILE]`: Input alignment file name. Include the path to the file if not in the current directory.
`-L, --length [VALUE]`: Requested bait length. Default is 120 bp.
`-O, --offset [VALUE]`: Offset (in bp) between tiled baits. Default is 20 bp.
`-H, --haplo [VALUE]`: Haplotype definition for bait generation. Entering `haplotype` will cause the program to identify all unique haplotypes within each bait tiling window observed in the data. Alternatively, entering `variant` will cause the program to generate all possible permutations of single nucleotide variants observed within the window.
`-H, --haplo [VALUE]`: Haplotype definition for bait generation. Entering `haplotype` will cause the program to identify all unique haplotypes within each bait tiling window observed in the data. Alternatively, entering `variant` will cause the program to generate all possible permutations of single nucleotide variants observed within the window. Default is haplotype.

### annot2baits
annot2baits generates baits from an annotation file in GTF or GFF and a corresponding DNA sequence in FASTA or FASTQ format.
Expand Down
19 changes: 11 additions & 8 deletions aln2baits.rb
Original file line number Diff line number Diff line change
@@ -1,11 +1,12 @@
#!/usr/bin/ruby
#-----------------------------------------------------------------------------------------------
# aln2baits
ALN2BAITSVER = "0.4"
ALN2BAITSVER = "0.5"
# Michael G. Campana, 2017
# Smithsonian Conservation Biology Institute
#-----------------------------------------------------------------------------------------------


class Hap_Window # Object defining a haplotype window
attr_accessor :header, :seqstart, :seqend, :haplotypes
def initialize(header, seqstart, seqend, haplotypes = [])
Expand All @@ -22,10 +23,10 @@ def var_permutations # Get possible variant permutations
end
for i in 0...@haplotypes[0].length
for seq in @haplotypes
if !variants[i].include?(seq[i].upcase)
case seq[i].upcase
if !variants[i].include?(seq[i])
case seq[i]
when "A","T","G","C","-"
variants[i].push(seq[i].upcase)
variants[i].push(seq[i])
when "R"
variants[i].push("A","G")
when "Y"
Expand Down Expand Up @@ -100,11 +101,12 @@ def aln2baits
case $options.haplodef
when "haplotype"
for hapno in 1..window.haplotypes.size
rng = (window.seqstart+1).to_s+"-"+(window.seqend+1).to_s #Adjust for 1-based indexing
rng = (window.seqstart+1).to_s+"-"+(window.seqend+1).to_s # Adjust for 1-based indexing
window.haplotypes[hapno-1].gsub!("T","U") if $options.rna # Will correct both raw and filtered sequences
baitsout += ">" + window.header[hapno-1] + "_" + rng + "_haplotype" + hapno.to_s + "\n" + window.haplotypes[hapno-1] + "\n"
coordline += window.header[hapno-1] + "\t" + (window.seqstart+1).to_s + "\t" + (window.seqend+1).to_s + "\n"
if $options.filter
flt = filter_baits(window.haplotypes[hapno-1])
flt = filter_baits(window.haplotypes[hapno-1]) # U won't affect filtration
if flt[0]
outfilter += ">" + window.header[hapno-1] + "_" + rng + "_haplotype" + hapno.to_s+ "\n" + window.haplotypes[hapno-1] + "\n"
filtercoordline += window.header[hapno-1] + "\t" + (window.seqstart+1).to_s + "\t" + (window.seqend+1).to_s + "\n"
Expand All @@ -117,12 +119,13 @@ def aln2baits
when "variant"
nvars = window.var_permutations # Get window_permutations
for hapno in 1..nvars.size
haplo = nvars[hapno-1]
haplo = nvars[hapno-1]
haplo.gsub!("T","U") if $options.rna # Will correct both raw and filtered sequences
rng = (window.seqstart+1).to_s + "-" + (window.seqend+1).to_s #Adjust for 1-based indexing
baitsout += ">Alignment_" + rng + "_haplotype" + hapno.to_s + "\n" + haplo + "\n" # Original window headers are meaningless
coordline += "Alignment\t" + (window.seqstart+1).to_s + "\t" + (window.seqend+1).to_s + "\n" # Original window headers are meaningless
if $options.filter
flt = filter_baits(haplo)
flt = filter_baits(haplo) # U won't affect filtration
if flt[0]
outfilter += ">Alignment_" + rng + "_haplotype" + hapno.to_s+ "\n" + haplo + "\n"
filtercoordline += "Alignment\t" + (window.seqstart+1).to_s + "\t" + (window.seqend+1).to_s + "\n"
Expand Down
1 change: 1 addition & 0 deletions annot2baits.rb
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@
# Smithsonian Conservation Biology Institute
#-----------------------------------------------------------------------------------------------


def annot2baits
#Import reference sequence
print "** Reading reference sequence **\n"
Expand Down
81 changes: 69 additions & 12 deletions baitstools.rb
Original file line number Diff line number Diff line change
@@ -1,11 +1,12 @@
#!/usr/bin/ruby
#-----------------------------------------------------------------------------------------------
# baitstools
BAITSTOOLSVER = "0.6"
BAITSTOOLSVER = "0.7"
# Michael G. Campana, 2017
# Smithsonian Conservation Biology Institute
#-----------------------------------------------------------------------------------------------


require 'optparse'
require 'ostruct'
require_relative 'vcf2baits'
Expand All @@ -27,6 +28,10 @@ def initialize(header = "", circular = false, fasta = false, seq = "", qual = ""
@qual_array = [] # Array of numeric quality scores
@fasta = fasta # FASTA format flag
end
def make_dna # Replace uracils with thymines for internal consistency and upcase sequences
@seq.upcase!
@seq.gsub!("U", "T")
end
def calc_quality # Convert quality scores to numeric values so only needed once
for i in 0...@qual.length
@qual_array.push($fq_hash[@qual[i]])
Expand Down Expand Up @@ -88,19 +93,47 @@ def mean(val = [])
return mean
end
#-----------------------------------------------------------------------------------------------
def collapse_ambiguity(bait) # Ambiguity handling
r = ["A","G"]
y = ["C","T"]
m = ["A","C"]
k = ["G","T"]
s = ["C","G"]
w = ["A","T"]
h = ["A","C","T"]
b = ["C","G","T"]
v = ["A","C","G"]
d = ["A","G","T"]
n = ["A","C","G","T"]
bait.gsub!("R",r[rand(2)])
bait.gsub!("Y",y[rand(2)])
bait.gsub!("M",m[rand(2)])
bait.gsub!("K",k[rand(2)])
bait.gsub!("S",s[rand(2)])
bait.gsub!("W",w[rand(2)])
bait.gsub!("H",h[rand(3)])
bait.gsub!("B",b[rand(3)])
bait.gsub!("V",v[rand(3)])
bait.gsub!("D",d[rand(3)])
bait.gsub!("N",n[rand(4)]) unless $options.no_Ns
bait.gsub!("T","U") if $options.rna # Remove any residual Ts after ambiguity collapsing
return bait
end
#-----------------------------------------------------------------------------------------------
def filter_baits(bait, qual = [0])

# To be implemented:
# Sequence complexity filter
# Self-complementarity filter


bait = collapse_ambiguity(bait) if $options.collapse_ambiguities # Ambiguity handling
keep = true
keep = false if (bait.length < $options.baitlength && $options.completebait)
keep = false if (bait.upcase.include?("N") and $options.no_Ns)
keep = false if (bait.upcase.include?("-") and $options.no_gaps)
keep = false if (bait.include?("N") and $options.no_Ns)
keep = false if (bait.include?("-") and $options.no_gaps)
gc = 0.0
for i in 0 ... bait.length
if bait[i].chr.upcase == "G" or bait[i].chr.upcase == "C"
if bait[i].chr == "G" or bait[i].chr == "C"
gc += 1.0
end
end
Expand Down Expand Up @@ -162,6 +195,7 @@ def read_fasta(file) # Read fasta and fastq files
if line[0].chr == ">" and qual == false
if !faseq.nil? # push previously completed sequence into array
faseq.calc_quality unless faseq.fasta # Calculate quality scores
faseq.make_dna # Convert RNA sequences to DNA for internal algorithms and upcase sequences
seq_array.push(faseq)
end
header = line[1...-1] # Remove final line break and beginning >
Expand All @@ -182,6 +216,7 @@ def read_fasta(file) # Read fasta and fastq files
qual = false
if !faseq.nil? # push previously completed sequence into array
faseq.calc_quality unless faseq.fasta # Calculate quality scores
faseq.make_dna # Convert RNA sequences to DNA for internal algorithms and upcase sequences
seq_array.push(faseq)
end
header = line[1...-1] # Remove final line break and beginning >
Expand All @@ -205,6 +240,7 @@ def read_fasta(file) # Read fasta and fastq files
end
end
faseq.calc_quality unless faseq.fasta # Calculate quality scores
faseq.make_dna # Convert RNA sequences to DNA for internal algorithms and upcase sequences
seq_array.push(faseq) # Put last sequence into fasta array
return seq_array
end
Expand Down Expand Up @@ -325,13 +361,14 @@ def snp_to_baits(selectedsnps, refseq)
qual = rseq_var.numeric_quality[be4-1..after-1] unless rseq_var.fasta
end
end
prb.gsub!("T","U") if $options.rna # RNA output handling
seq = ">" + rseq_var.header + "\t" + snp.snp.to_s + "\n" + prb + "\n"
baitsout += seq
be4 = rseq_var.seq.length + be4 if be4 < 1
coord = rseq_var.header + "\t" + be4.to_s + "\t" + after.to_s + "\n"
coordline += coord
if $options.filter
parameters = filter_baits(prb, qual)
parameters = filter_baits(prb, qual) # U should not affect filtration
if parameters[0]
outfilter += seq
filtercoordline += coord
Expand Down Expand Up @@ -402,13 +439,15 @@ def get_command_line # Get command line for summary output
end
end
cmdline += " -B" if $options.coords
cmdline += " -D" if $options.ncbi
cmdline += " -D" if $options.ncbi
cmdline += " -Y" if $options.rna
# Generate filtration options
fltline = ""
fltline += " -w" if $options.params
fltline += " -c" if $options.completebait
fltline += " -G" if $options.no_gaps
fltline += " -N" if $options.no_Ns
fltline += " -C" if $options.collapse_ambiguities
fltline += " -n " + $options.mingc.to_s if $options.mingc_filter
fltline += " -x " + $options.maxgc.to_s if $options.maxgc_filter
fltline += " -q " + $options.mint.to_s if $options.mint_filter
Expand Down Expand Up @@ -467,7 +506,8 @@ def self.parse(options)
args.coords = false # Flag to output coordinates table of baits
args.no_gaps = false # Flag to omit bait sequences with gaps
args.no_Ns = false # Flag to omit bait sequences with Ns
args.haplodef = "" # Haplotype definition for aln2baits
args.collapse_ambiguities = false # Flag to collapse ambiguities to a single nucleotide
args.haplodef = "haplotype" # Haplotype definition for aln2baits
args.sort = false # Flag to sort stack2baits SNPs by between/within population variation
args.hwe = false # Flag to sort stacks2baits SNPs by Hardy-Weinberg Equilibrium
args.alpha = 0.05 # HWE test alpha value
Expand All @@ -478,6 +518,7 @@ def self.parse(options)
args.fasta_score = 0 # Asssumed base quality score for FASTA sequences
args.features = [] # Array holding desired features
args.ncbi = false # Flag whether FASTA/FASTQ headers include NCBI-style descriptors
args.rna = false # Flag whether baits are output as RNA
args.alt_alleles = false # Flag to apply alternate alleles
args.varqual_filter = false # Flag to determine whether to filter vcf variants by QUAL scores
args.varqual = 30 # Minimum vcf variant QUAL score
Expand Down Expand Up @@ -546,8 +587,8 @@ def self.parse(options)
args.features = feat.upcase.split(",")
end
elsif args.algorithm == "aln2baits"
opts.on("-H","--haplo [VALUE]", String, "Window haplotype definition (haplotype or variant)") do |fa|
args.haplodef = fa
opts.on("-H","--haplo [VALUE]", String, "Window haplotype definition (haplotype or variant) (Default = haplotype)") do |fa|
args.haplodef = fa if fa != nil
end
end
end
Expand Down Expand Up @@ -608,7 +649,10 @@ def self.parse(options)
end
opts.on("-N", "--noNs", "Exclude bait sequences with Ns") do
args.no_Ns = true
end
end
opts.on("-C", "--collapse", "Collapse ambiguities to a single nucleotide") do
args.collapse_ambiguities = true
end
opts.on("-n","--mingc [VALUE]", Float, "Minimum GC content (Default = 30.0)") do |mgc|
args.mingc = mgc if mgc != nil
args.mingc_filter = true
Expand Down Expand Up @@ -658,6 +702,9 @@ def self.parse(options)
opts.on("-D", "--ncbi", "FASTA/FASTQ file headers have NCBI-style descriptions") do
args.ncbi = true
end
opts.on("-Y","--rna", "Output baits as RNA rather than DNA") do
args.rna = true
end
opts.on_tail("-h","--help", "Show help") do
print "Welcome to baitstools " + args.algorithm + '.' +"\n\n"
print "To use the interactive interface, enter <ruby baitstools.rb " + args.algorithm + "> without command-line options.\n"
Expand Down Expand Up @@ -933,6 +980,16 @@ def self.parse(options)
if t == "Y" or t == "YES"
$options.no_Ns = true
end
print "Collapse ambiguity codes to a single nucleotide? (y/n)\n"
t = gets.chomp.upcase
if t == "Y" or t == "YES"
$options.collapse_ambiguities = true
end
print "Output baits as RNA rather than DNA? (y/n)\n"
t = gets.chomp.upcase
if t == "Y" or t == "YES"
$options.rna = true
end
print "Filter by minimum GC content? (y/n)\n"
t = gets.chomp.upcase
if t == "Y" or t == "YES"
Expand Down Expand Up @@ -1014,7 +1071,7 @@ def self.parse(options)
$options.fasta_score = 0 if $options.fasta_score < 0 # Correct fasta_score parameter
$options.fasta_score = 93 if $options.fasta_score > 93 # Limit to currently possible scores
$options.no_baits = false if ($options.every or $options.alt_alleles) # Override -p when needed
$options.filter = true if ($options.completebait or $options.params or $options.algorithm == "checkbaits" or $options.mingc_filter or $options.maxgc_filter or $options.mint_filter or $options.maxt_filter or $options.meanqual_filter or $options.minqual_filter or $options.no_gaps or $options.no_Ns) # Force filtration as necessary
$options.filter = true if ($options.completebait or $options.params or $options.algorithm == "checkbaits" or $options.mingc_filter or $options.maxgc_filter or $options.mint_filter or $options.maxt_filter or $options.meanqual_filter or $options.minqual_filter or $options.no_gaps or $options.no_Ns or $options.collapse_ambiguities) # Force filtration as necessary
cmdline = get_command_line
print "** Starting program with the following options: **\n"
print "** Basic command: " + cmdline[0] + " **\n"
Expand Down
1 change: 1 addition & 0 deletions bed2baits.rb
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@
# Smithsonian Conservation Biology Institute
#-----------------------------------------------------------------------------------------------


def bed2baits
#Import reference sequence
print "** Reading reference sequence **\n"
Expand Down
Loading

0 comments on commit 9f357ba

Please sign in to comment.