Skip to content

A parser for immune receptor gene data, mainly IMGT references

License

Notifications You must be signed in to change notification settings

antigenomics/segment-parser

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Parsing T- and B-cell receptor segments from IMGT database to a flexible plain-text format

Getting raw sequences

Instructions for downloading raw IMGT files:

  • Go to the genedb page and click the first submit button
  • Next, scroll down to the end of resulting page (loading can take a while) and mark Select all genes
  • Select F+ORF+in-frame P nucleotide sequences with IMGT gaps format and click submit
  • Copy-paste resulting FASTA records and use them as an input file

Running the software

Get the compiled binaries and run the software as java -jar segmentparser.jar [options] imgt_raw_file output_prefix.

The following options can be selected:

  • -n include non-functional segments into output (pseudogenes, etc)
  • -m include minor alleles (segments with *02, *03, etc suffix)
  • -s toggle species detalisation (e.g. BALBc and C57Bl6 for MusMusculus)
  • -b report IMGT records that cannot be parsed properly (missing conserved residues, etc)

Output files include:

  • A $output_prefix$.metadata.txt file with summary statistics.
  • Files with erroneous/bad records: $output_prefix$.nojrefpoint.txt, $output_prefix$.novrefpoint.txt, $output_prefix$.othersegm.txt.
  • Output file containing sequences, CDR3 reference points and CDR1,2,2.5 coordinates: $output_prefix$.txt.

SegmentParser generates a tab-delimited table with species name, gene and segment id, nucleotide sequence and the reference point position: 0-based coordinate of first nucleotide after conserved Cys for Variable segments and before first nucleotide before conserved Phe/Trp for Joining segments. The metadata table provided with results lists all species and genes and tells if there are any V/D/J segments associated with them (0 or 1 in corresponding row).

  • A file with CDR1,2,2.5 nucleotide and amino acid sequences: $output_prefix$.txt (only includes V segments).

Note that CDR2.5 is a putative MHC-binding region of TCR V segment, defined in a recent work of Paul Thomas lab (Dash et al. Nature 2017).