GitHub - agillen/UMI-tools: Tools for handling Unique Molecular Identifiers in NGS data sets

UMI-tools was published in Genome Research on 18 January 2017 (open access).

Tools for dealing with Unique Molecular Identifiers

This repository contains tools for dealing with Unique Molecular Identifiers (UMIs), also known as Random Molecular Tags (RMTs), and single cell RNA-Seq cell barcodes. Currently there are 6 commands.

The extract and whitelist commands are used to prepare a fastq containing UMIs, with or without cell barcodes, for alignment.

whitelist:

Builds a whitelist of the 'real' cell barcodes

This is useful for droplet-based single cell RNA-Seq where the identity of the true cell barcodes is unknown. The whitelist can then be used to filter reads with extract (see below).

extract:

Flexible removal of UMI sequences from fastq reads.

UMIs are removed and appended to the read name. Any other barcode, for example a library barcode, is left on the read. extract can also filter reads by quality or against a whitelist (see above).
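To illustrate what "removed and appended to the read name" means, here is a minimal Python sketch. It is not the umi_tools implementation: the `FastqRecord` type and `extract_umi` helper are hypothetical, and it assumes a fixed-length UMI at the 5' end of the read joined to the read identifier with an underscore.

```python
from typing import NamedTuple


class FastqRecord(NamedTuple):
    name: str       # header line without the leading "@"
    sequence: str
    quality: str


def extract_umi(record: FastqRecord, umi_length: int = 6) -> FastqRecord:
    """Move the first `umi_length` bases into the read name (hypothetical helper)."""
    if len(record.sequence) < umi_length:
        raise ValueError("read is shorter than the UMI")
    umi = record.sequence[:umi_length]
    # Append the UMI to the read identifier (the first whitespace-delimited
    # token) so that any trailing description on the header is preserved.
    identifier, _, description = record.name.partition(" ")
    new_name = f"{identifier}_{umi}"
    if description:
        new_name += f" {description}"
    # Trim the UMI bases and their quality scores from the read itself.
    return FastqRecord(
        new_name,
        record.sequence[umi_length:],
        record.quality[umi_length:],
    )


read = FastqRecord("read1 1:N:0", "ACGTTAGGCATCGA", "IIIIIIHHHHHHHH")
trimmed = extract_umi(read)
print(trimmed.name)      # read1_ACGTTA 1:N:0
print(trimmed.sequence)  # GGCATCGA
```

Keeping the UMI in the read name means it survives alignment, so the later commands can recover it from the aligned reads.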

The remaining commands, group, dedup and count/count_tab, are used to identify PCR duplicates using the UMIs and perform different levels of analysis depending on the needs of the user. A number of different UMI deduplication schemes are available. The recommended method is directional.
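As a rough picture of what identifying PCR duplicates involves, the sketch below groups reads that share an alignment position and an identical UMI, then keeps one read per group. This is the simplest exact-match scheme, not the recommended directional method. The `Read` type and `dedup_exact` function are hypothetical, and keeping the read with the highest mapping quality is an assumption made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Read(NamedTuple):
    name: str
    contig: str
    position: int
    umi: str
    mapq: int


def dedup_exact(reads: List[Read]) -> List[Read]:
    """Keep one read per (contig, position, UMI) group (hypothetical helper)."""
    groups: Dict[Tuple[str, int, str], List[Read]] = defaultdict(list)
    for read in reads:
        groups[(read.contig, read.position, read.umi)].append(read)
    # Assumption: the representative read is the one with the highest MAPQ.
    return [max(group, key=lambda r: r.mapq) for group in groups.values()]


reads = [
    Read("r1", "chr1", 100, "ACGT", 30),
    Read("r2", "chr1", 100, "ACGT", 40),  # duplicate of r1, better MAPQ
    Read("r3", "chr1", 100, "TTGA", 20),  # same position, different UMI
    Read("r4", "chr1", 250, "ACGT", 10),  # same UMI, different position
]
kept = dedup_exact(reads)
print([r.name for r in kept])  # ['r2', 'r3', 'r4']
```

Exact matching treats a UMI carrying a single sequencing error as a separate molecule, which is the problem the network-based methods are designed to address.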

group:

Groups PCR duplicates using the same methods available through `dedup`.

This is useful when you want to manually interrogate the PCR duplicates.

dedup:

Groups PCR duplicates and deduplicates reads to yield one read per group

Use this when you want to remove the PCR duplicates prior to any downstream analysis.

count:

Groups and deduplicates PCR duplicates and counts the unique molecules per gene

Use this when you want to obtain a matrix with unique molecules per gene. Can also perform per-cell counting for scRNA-Seq.

count_tab:

As per count, except that the input is a flatfile.
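The idea of counting unique molecules per gene from a flatfile can be sketched in a few lines of Python. The layout used here is an assumption for illustration: tab-separated lines of read identifier and gene, with the UMI as the final underscore-delimited part of the read identifier. The `count_molecules` helper is hypothetical and counts exactly distinct UMIs only, without any error correction.

```python
import io
from collections import defaultdict
from typing import Dict, Set, TextIO


def count_molecules(handle: TextIO) -> Dict[str, int]:
    """Count distinct UMIs per gene from tab-separated lines (hypothetical helper)."""
    umis_per_gene: Dict[str, Set[str]] = defaultdict(set)
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        read_id, gene = line.split("\t")
        # Assumption: the UMI is the last underscore-delimited field.
        umi = read_id.rsplit("_", 1)[-1]
        umis_per_gene[gene].add(umi)
    return {gene: len(umis) for gene, umis in umis_per_gene.items()}


flatfile = io.StringIO(
    "read1_ACGT\tGeneA\n"
    "read2_ACGT\tGeneA\n"
    "read3_TTGA\tGeneA\n"
    "read4_ACGT\tGeneB\n"
)
counts = count_molecules(flatfile)
print(counts)  # {'GeneA': 2, 'GeneB': 1}
```

For per-cell counting, the same approach would apply with the cell barcode added to the grouping key.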

See QUICK_START.md for a quick tutorial on the most common usage pattern.

If you want to use UMI-tools in single-cell RNA-Seq data processing, see Single_cell_tutorial.md

The dedup, group, and count/count_tab commands make use of network-based methods to resolve similar UMIs with the same alignment coordinates. For background on these methods, see:

Genome Research Publication

Blog post discussing network-based methods.
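The following is a minimal Python sketch of a directional-style network method for the UMIs observed at a single alignment position. It is an illustration rather than the umi_tools code. It assumes equal-length UMIs and the rule, as described in the publication, that UMI a is linked to UMI b when they differ by one base and count(a) >= 2 * count(b) - 1. The function names are hypothetical.

```python
from collections import deque
from typing import Dict, List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def directional_groups(counts: Dict[str, int]) -> List[List[str]]:
    """Group UMIs by following directed edges from abundant to rare UMIs."""
    # Visit UMIs from most to least abundant so that each group is seeded
    # by its most likely "true" UMI. Ties are broken alphabetically.
    ordered = sorted(counts, key=lambda u: (-counts[u], u))
    assigned = set()
    groups: List[List[str]] = []
    for seed in ordered:
        if seed in assigned:
            continue
        assigned.add(seed)
        group = [seed]
        queue = deque([seed])
        while queue:
            parent = queue.popleft()
            for child in ordered:
                if child in assigned:
                    continue
                # Directed edge: one mismatch, and the parent is abundant
                # enough for the child to be explained as an error copy.
                if (hamming(parent, child) <= 1
                        and counts[parent] >= 2 * counts[child] - 1):
                    assigned.add(child)
                    group.append(child)
                    queue.append(child)
        groups.append(group)
    return groups


umi_counts = {"ACGT": 456, "AAAT": 90, "ACAT": 72, "CCGT": 2, "TCGT": 2, "ACAG": 1}
groups = directional_groups(umi_counts)
print(groups)  # [['ACGT', 'ACAT', 'CCGT', 'TCGT', 'ACAG'], ['AAAT']]
```

In this example AAAT is one mismatch from ACAT but is more abundant than it, so no edge points to AAAT and it is kept as a separate molecule. Each resulting group is treated as one original molecule.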

Installation

If you're using Conda, you can use:

$ conda install -c https://conda.anaconda.org/toms umi_tools

Or pip:

$ pip install umi_tools

Or if you'd like to work directly from the git repository:

$ git clone https://github.com/CGATOxford/UMI-tools.git

Enter the repository and run:

$ python setup.py install

For more detail, see INSTALL.rst.

Help

See QUICK_START.md and Single_cell_tutorial.md for tutorials on the most common usage patterns.

To get detailed help on umi_tools, run:

$ umi_tools --help

To get help on a specific [COMMAND], run:

$ umi_tools [COMMAND] --help

Dependencies

umi_tools depends on numpy, pandas, scipy, cython, pysam, future, regex and matplotlib.
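If an installation fails, it can help to check which of these packages Python can already find. The short script below is one possible way to do that. It assumes each package is importable under the name listed above, and the `check_dependencies` helper is hypothetical, not part of umi_tools.

```python
import importlib.util
from typing import Dict, List

# Assumption: each dependency is importable under the name given in the README.
DEPENDENCIES: List[str] = [
    "numpy", "pandas", "scipy", "cython",
    "pysam", "future", "regex", "matplotlib",
]


def check_dependencies(names: List[str]) -> Dict[str, bool]:
    """Report whether each top-level package can be located (not imported)."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_dependencies(DEPENDENCIES)
for name, found in status.items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```

This only checks that each package can be located. It does not verify versions or that compiled extensions load correctly.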
