GitHub - antigenomics/segment-parser: A parser for immune receptor gene data, mainly IMGT references

Parsing T- and B-cell receptor segments from IMGT database to a flexible plain-text format

Getting raw sequences

Instructions for downloading raw IMGT files:

Go to the genedb page and click the first submit button.
Scroll down to the end of the resulting page (loading can take a while) and tick Select all genes.
Select the F+ORF+in-frame P nucleotide sequences with IMGT gaps format and click submit.
Copy and paste the resulting FASTA records into a text file and use it as the input file.
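
Because the input is assembled by copy-pasting, it can be worth checking that the file is well-formed FASTA before running the parser. The following is a minimal Python sketch and is not part of segment-parser: the function name read_fasta and the example records are hypothetical, and the only assumption made about the IMGT format is that gaps appear as '.' characters within the sequence lines.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA stream.

    Sequence lines are concatenated as-is, so gap characters are preserved.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines left over from copy-pasting
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy input with made-up records (not real IMGT data).
example = io.StringIO(
    ">rec1|TRBV1*01|Homo sapiens\n"
    "atg...ccc\n"
    "ggg\n"
    "\n"
    ">rec2|TRBJ1-1*01|Homo sapiens\n"
    "ttt\n"
)
records = list(read_fasta(example))
print(len(records), "records read")
```

For a real file, pass an open file handle (for example open("imgt_raw.fasta"), where the file name is whatever you saved the records as) instead of the in-memory example.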

Running the software

Get the compiled binaries and run the software as java -jar segmentparser.jar [options] imgt_raw_file output_prefix.

The following options can be selected:

-n include non-functional segments (pseudogenes, etc.) in the output
-m include minor alleles (segments with a *02, *03, etc. suffix)
-s toggle detailed species reporting (e.g. BALBc and C57Bl6 for MusMusculus)
-b report IMGT records that cannot be parsed properly (missing conserved residues, etc.)
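
As an illustration of how the options combine with the two positional arguments, here is a small Python sketch that assembles the command line described above. The helper names build_command and run, the keyword argument names, and the file names imgt_raw.fasta and segments are all hypothetical; only the flags and the argument order are taken from this README.

```python
import subprocess
from typing import List


def build_command(
    imgt_raw_file: str,
    output_prefix: str,
    non_functional: bool = False,   # -n
    minor_alleles: bool = False,    # -m
    species_detail: bool = False,   # -s
    report_bad: bool = False,       # -b
    jar: str = "segmentparser.jar",
) -> List[str]:
    """Build the argument list: java -jar <jar> [options] <input> <prefix>."""
    cmd = ["java", "-jar", jar]
    if non_functional:
        cmd.append("-n")
    if minor_alleles:
        cmd.append("-m")
    if species_detail:
        cmd.append("-s")
    if report_bad:
        cmd.append("-b")
    cmd += [imgt_raw_file, output_prefix]
    return cmd


def run(cmd: List[str]) -> None:
    """Run the command; needs Java on PATH and the jar at the given location."""
    subprocess.run(cmd, check=True)


cmd = build_command("imgt_raw.fasta", "segments", non_functional=True, report_bad=True)
print(" ".join(cmd))
# prints: java -jar segmentparser.jar -n -b imgt_raw.fasta segments
```

Calling run(cmd) would then execute the parser, assuming the jar and the input file are present in the working directory.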

Output files include:

A $output_prefix$.metadata.txt file with summary statistics.
Files with erroneous/bad records: $output_prefix$.nojrefpoint.txt, $output_prefix$.novrefpoint.txt, $output_prefix$.othersegm.txt.
Output file containing sequences, CDR3 reference points and CDR1,2,2.5 coordinates: $output_prefix$.txt.
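
The file names listed above follow directly from the output prefix, so a script that post-processes the results can derive them rather than hard-coding each path. A minimal Python sketch, where the helper names expected_outputs and existing_outputs and the prefix segments are hypothetical; whether every file is written on a given run is not stated here, so the second helper simply reports which ones are present.

```python
from pathlib import Path
from typing import List

# Suffixes taken from the list of output files in this README.
OUTPUT_SUFFIXES = [
    ".txt",              # sequences, CDR3 reference points, CDR coordinates
    ".metadata.txt",     # summary statistics
    ".nojrefpoint.txt",  # erroneous/bad records
    ".novrefpoint.txt",  # erroneous/bad records
    ".othersegm.txt",    # erroneous/bad records
]


def expected_outputs(output_prefix: str) -> List[Path]:
    """Return the output paths implied by a given prefix."""
    return [Path(output_prefix + suffix) for suffix in OUTPUT_SUFFIXES]


def existing_outputs(output_prefix: str) -> List[Path]:
    """Return only those expected outputs that are present on disk."""
    return [p for p in expected_outputs(output_prefix) if p.is_file()]


for path in expected_outputs("segments"):
    print(path)
```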

SegmentParser generates a tab-delimited table with the species name, gene and segment id, nucleotide sequence and the reference point position. The reference point is a 0-based coordinate: for Variable segments it is the first nucleotide after the conserved Cys, and for Joining segments it is the nucleotide immediately before the conserved Phe/Trp. The metadata table provided with the results lists all species and genes and indicates whether any V/D/J segments are associated with them (0 or 1 in the corresponding row).
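
To make the Variable-segment reference point concrete: if the coordinate is the 0-based index of the first nucleotide after the conserved Cys, then the three nucleotides ending just before it should form a Cys codon (TGT or TGC). The Python sketch below checks this on a made-up sequence. The function names are hypothetical, and it assumes the sequence column holds the ungapped nucleotide sequence that the coordinate refers to, which this README does not state explicitly.

```python
CYS_CODONS = {"TGT", "TGC"}


def v_cys_codon(sequence: str, ref_point: int) -> str:
    """Return the codon that ends right before the V reference point."""
    if ref_point < 3 or ref_point > len(sequence):
        raise ValueError("reference point leaves no room for a full codon")
    return sequence[ref_point - 3:ref_point].upper()


def has_conserved_cys(sequence: str, ref_point: int) -> bool:
    """True if the codon preceding the reference point encodes Cys."""
    return v_cys_codon(sequence, ref_point) in CYS_CODONS


# Toy example (not a real segment): the Cys codon TGT occupies
# indices 3-5, so the reference point is 6.
toy_sequence = "GAATGTGCCAGC"
toy_ref_point = 6
print(v_cys_codon(toy_sequence, toy_ref_point))        # TGT
print(has_conserved_cys(toy_sequence, toy_ref_point))  # True
```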

A file with CDR1,2,2.5 nucleotide and amino acid sequences: $output_prefix$.txt (only includes V segments).

Note that CDR2.5 is a putative MHC-binding region of the TCR V segment, defined by the Paul Thomas lab (Dash et al., Nature 2017).
