- About Badger
  1.1. Supported data types
- Installation
- Running Badger
  3.1. Badger input
  3.2. Command line options
  3.3. Badger output
Badger is a tool for long read barcode calling. For a given set of single cell long reads, it extracts the barcodes, identifies cell-associated barcodes and corrects extracted barcodes containing errors. The correction stage is based on an edit distance graph.
Badger works in two stages. First, barcodes are extracted, which results in a file containing readIDs, barcodes and further information about the reads. This is then the input for the next step, in which the graph is constructed and the barcodes corrected.
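The idea of correcting barcodes through an edit distance graph can be illustrated with a minimal sketch. This is not Badger's implementation: the function names and the rule for resolving neighbours (accept a correction only when exactly one cell-associated barcode lies within the threshold) are assumptions made for illustration. The `*` placeholder mirrors the value Badger reports when no barcode could be assigned.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def build_edges(extracted, true_barcodes, threshold=1):
    """Map each extracted barcode to the cell-associated barcodes within `threshold` edits."""
    true_set = set(true_barcodes)
    return {
        barcode: {c for c in true_set if edit_distance(barcode, c) <= threshold}
        for barcode in set(extracted)
    }


def correct_barcodes(extracted, true_barcodes, threshold=1):
    """Hypothetical resolution rule: exact matches win, otherwise a unique neighbour."""
    true_set = set(true_barcodes)
    corrected = {}
    for barcode, neighbours in build_edges(extracted, true_set, threshold).items():
        if barcode in true_set:
            corrected[barcode] = barcode
        elif len(neighbours) == 1:
            corrected[barcode] = next(iter(neighbours))
        else:
            corrected[barcode] = "*"  # no neighbour, or an ambiguous one
    return corrected
```

Comparing every extracted barcode against every cell-associated barcode is quadratic and only practical for small examples; it is used here to keep the sketch short.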
Currently supported protocols are 10x single cell and 10x Visium.
Badger supports all kinds of long-read single cell RNA data:
- PacBio CCS
- ONT dRNA / ONT cDNA
- Assembled / corrected transcript sequences
Reads must be provided either unaligned in FASTQ or FASTA format (can be gzipped), or aligned as a BAM or SAM file.
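To check an input file before running Badger, the read IDs of a plain or gzipped FASTA/FASTQ file can be listed with a few lines of Python. This is a sketch, not part of Badger: the helper names are hypothetical, and FASTQ records are assumed to span exactly four lines each.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file, detected by the gzip magic bytes."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def iter_read_ids(path):
    """Yield read IDs from a FASTA file or a four-line-per-record FASTQ file."""
    with open_text(path) as handle:
        first = handle.readline()
        if not first:
            return  # empty file
        if first.startswith(">"):
            yield first[1:].split()[0]
            for line in handle:
                if line.startswith(">"):
                    yield line[1:].split()[0]
        elif first.startswith("@"):
            yield first[1:].split()[0]
            # Count lines instead of looking for "@", which may also start a quality line.
            for index, line in enumerate(handle, start=1):
                if index % 4 == 0 and line.strip():
                    yield line[1:].split()[0]
        else:
            raise ValueError(f"{path} does not look like FASTA or FASTQ")
```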
To obtain Badger, download the repository and install its requirements. Clone the Badger repository and switch to the latest release:
git clone https://github.com/algbio/Badger.git
cd Badger
git checkout latest
To run Badger, you should provide:
- Long single cell RNA reads (PacBio or Oxford Nanopore) in one of the following formats:
  - FASTA/FASTQ (can be gzipped);
  - Sorted and indexed BAM;
- Optionally, a list of cell-associated barcodes
- Barcode whitelist for the single cell sequencing protocol used
Options for barcode extraction (detect_barcodes.py):

--output (or -o)
Prefix for output files, relative to the current folder.

--help (or -h)
Prints help message.

--barcodes (or -b)
Barcode whitelist for the protocol used.

--input (or -i)
Reads in FASTA, FASTQ, BAM or SAM format.

--mode
Extraction method to use; currently only tenX is supported.

--threads (or -t)
Number of threads to use, default 16.

Options for barcode correction (barcodes.py):

--output (or -o)
Prefix for output files, relative to the current folder.

--help (or -h)
Prints help message.

--barcodes (or -b)
Barcodes extracted from the long reads, in TSV format.

--reads (or -r)
Output file of the extraction step, used to obtain the read IDs.

--barcode_list (or -l)
Barcode whitelist for the protocol used.

--true_barcodes
List of the cell-associated barcodes, optional.

--data_type (or -d)
Type of data to process; supported values are 10x and visium.

--threshold (or -t)
Maximal edit distance between barcodes to be connected in the graph, default 1.

--n_cells (or -c)
Expected number of cell-associated barcodes.

- Extracting 10x single cell barcodes from reads
detect_barcodes.py --barcodes whitelist.txt --input scRNAseq_reads.fasta \
    --mode tenX --output barcode_file
- Correcting extracted 10x single cell barcodes
barcodes.py --barcodes barcodes.tsv --reads barcode_file.tsv --barcode_list whitelist.txt \
    -d 10x --output corrected_barcodes --n_cells 5000
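When both stages are run repeatedly, the two commands above can be assembled from Python. The sketch below uses only the options documented in this README; the helper names are hypothetical, and it assumes that `detect_barcodes.py` and `barcodes.py` are located in the current working directory and are run with the same Python interpreter.

```python
import subprocess
import sys


def extraction_command(whitelist, reads, output_prefix, mode="tenX"):
    """Argument list for the extraction stage (detect_barcodes.py)."""
    return [
        sys.executable, "detect_barcodes.py",  # assumed script location
        "--barcodes", whitelist,
        "--input", reads,
        "--mode", mode,
        "--output", output_prefix,
    ]


def correction_command(barcodes_tsv, reads_tsv, whitelist, output_prefix,
                       data_type="10x", n_cells=None):
    """Argument list for the correction stage (barcodes.py)."""
    command = [
        sys.executable, "barcodes.py",  # assumed script location
        "--barcodes", barcodes_tsv,
        "--reads", reads_tsv,
        "--barcode_list", whitelist,
        "--data_type", data_type,
        "--output", output_prefix,
    ]
    if n_cells is not None:
        command += ["--n_cells", str(n_cells)]
    return command


def run_stage(command):
    """Run one stage; raises subprocess.CalledProcessError on a non-zero exit code."""
    subprocess.run(command, check=True)
```

For instance, `run_stage(extraction_command("whitelist.txt", "scRNAseq_reads.fasta", "barcode_file"))` would launch the first example above. Passing arguments as a list avoids shell quoting problems with unusual file names.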
The extraction and correction steps each produce one output file. Both are briefly described here.
OUTPUT_PREFIX.tsv - TSV file containing readID, barcodes and additional information
Columns are:
- #read_id - readID
- barcode - extracted barcode for the read, * if no barcode could be extracted
- BC_score - 0 if a barcode could be extracted, -1 if not
- valid_UMI - TRUE if a valid UMI was found, FALSE if not
- strand - read direction in which a barcode was found as + or -, . if no barcode could be extracted
- polyT_start - position in the read where the polyT sequence starts, -1 if no polyT sequence was found
- R1_end - position in the read where the adapter ends, -1 if no adapter was found
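The extraction output can be summarised with the standard `csv` module. The sketch below is not part of Badger. It assumes that the columns appear in the order listed above and that the header line starts with `#`; the function names and dictionary keys are hypothetical.

```python
import csv

# Column order assumed from the description of OUTPUT_PREFIX.tsv.
EXTRACTION_COLUMNS = [
    "read_id", "barcode", "BC_score", "valid_UMI",
    "strand", "polyT_start", "R1_end",
]


def read_extraction_table(path):
    """Yield one dict per read, skipping empty lines and the '#'-prefixed header."""
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue
            yield dict(zip(EXTRACTION_COLUMNS, row))


def summarize_extraction(rows):
    """Count reads, reads with an extracted barcode, and reads with a valid UMI."""
    summary = {"reads": 0, "with_barcode": 0, "valid_umi": 0}
    for row in rows:
        summary["reads"] += 1
        if row["barcode"] != "*":        # '*' marks reads without a barcode
            summary["with_barcode"] += 1
        if row["valid_UMI"] == "TRUE":
            summary["valid_umi"] += 1
    return summary
```

Calling `summarize_extraction(read_extraction_table("barcode_file.tsv"))` gives a quick check of how many reads yielded a barcode before running the correction step.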
OUTPUT_PREFIX_output_file.tsv - TSV file containing readID and assigned cell-associated barcode
Columns are:
- readID - readID
- barcode - assigned barcode, * if no barcode could be assigned
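A common follow-up is counting how many reads were assigned to each cell-associated barcode. The sketch below is not part of Badger. It assumes a two-column, tab-separated file whose optional header line starts with `readID`; the function names are hypothetical.

```python
import csv
from collections import Counter


def reads_per_barcode(path):
    """Count reads per assigned barcode; unassigned reads are counted under '*'."""
    counts = Counter()
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip empty or malformed lines
            if row[0].lstrip("#") in ("readID", "read_id"):
                continue  # skip the header line, if present
            counts[row[1]] += 1
    return counts


def assigned_fraction(counts):
    """Fraction of reads that received a barcode (0.0 for an empty table)."""
    total = sum(counts.values())
    return (total - counts.get("*", 0)) / total if total else 0.0
```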