Establishing and visualizing a reference dataset of paired-chain T cell receptor (TCR) sequences with TCRPaired and TCRView

There have been many recent advances in TCR repertoire analysis and binding prediction, an area of research that has expanded rapidly with the advent of single-cell RNA sequencing technology. Many algorithms have been developed to predict TCR:peptide-major histocompatibility complex (pMHC) binding occurrences and patterns, with varying levels of success.

These methods leverage differing strategies in their predictions. Some focus solely on identifying TCR footprints in an epitope-specific manner (e.g., ERGO), while others encode TCR, peptide, and MHC components in their predictions (e.g., TITAN). Many utilize complementarity-determining region (CDR) sequences, especially the highly variable CDR3 region, along with variable (V) and joining (J) region genes; an increasing number incorporate or rely entirely on structural information, including electrostatic and proximity data. Notably, many sequence-based prediction methods accept both single- and paired-chain data, while certain models consider TCRβ data exclusively (e.g., TITAN). Structural models are, by nature, constrained to paired-chain data.

Single-chain data obtained from bulk sequencing approaches make up a large majority of currently available TCR binding data, including data housed in the public repositories VDJdb, McPAS-TCR, and IEDB. Other sources, such as the Adaptive Biotech ImmuneCODE database, contain exclusively TCRβ sequences. Recent studies have shown that this single-chain focus leads to bias in prediction results, especially for certain antigen epitopes which may rely more on TCRα features. Further, even reliable sequence-based prediction methods still depend strongly on their training data, leading to poor performance on the 'unseen epitope' problem, which is considered an ultimate goal of these predictions. This is likely due to the differing input parameters accepted by each method: for instance, ERGO-II takes a bare minimum input of V/J alleles and CDR sequences from one or both TCR chains, and TITAN only considers CDR3β sequences. Other methods have more stringent data requirements, including SwarmTCR, which requires all CDR sequences (CDR1, CDR2, CDR2.5, and CDR3) for one or both chains, and structure-based methods, which require full TCRα and TCRβ sequences. As a result of these differences, the data used to train these models can vary greatly, causing incompatibilities both when comparing performance and when applying these methods to experimental data.
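
To make the incompatibility concrete, the following minimal sketch checks which of the methods named above a given receptor record could be passed to. The record layout, the field names (`v`, `j`, `cdr1`, `full_seq`, and so on), and the `usable_methods` helper are hypothetical and are not part of this repository or of any of the tools mentioned. The requirements are simplified from the description above, and treating "CDR sequences" for ERGO-II as CDR3 only is an assumption.

```python
# Hypothetical, simplified input requirements taken from the paragraph above.
# Each entry: (fields needed per chain, which chains must supply them).
CDR_FIELDS = ("cdr1", "cdr2", "cdr2.5", "cdr3")

REQUIREMENTS = {
    "TITAN": (("cdr3",), "beta"),              # CDR3-beta only
    "ERGO-II": (("v", "j", "cdr3"), "any"),    # assumption: CDR3 stands in for "CDR sequences"
    "SwarmTCR": (CDR_FIELDS, "any"),           # all CDRs for one or both chains
    "structure-based": (("full_seq",), "both"),  # full alpha and beta sequences
}


def chain_has(record, chain, fields):
    """Return True if `chain` ('alpha' or 'beta') has a non-empty value for every field."""
    data = record.get(chain, {})
    return all(data.get(field) for field in fields)


def usable_methods(record):
    """List the methods whose simplified input requirements this record satisfies."""
    usable = []
    for name, (fields, mode) in REQUIREMENTS.items():
        alpha_ok = chain_has(record, "alpha", fields)
        beta_ok = chain_has(record, "beta", fields)
        if mode == "beta":
            ok = beta_ok
        elif mode == "any":
            ok = alpha_ok or beta_ok
        else:  # "both"
            ok = alpha_ok and beta_ok
        if ok:
            usable.append(name)
    return usable


# A bulk-sequencing style record: beta chain only, no CDR1/CDR2/CDR2.5, no full sequence.
bulk_record = {"beta": {"v": "TRBV19", "j": "TRBJ2-7", "cdr3": "CASSIRSSYEQYF"}}
print(usable_methods(bulk_record))
```

Under these assumptions, a β-only record like `bulk_record` satisfies only the two least demanding methods, which illustrates why training sets assembled for different tools rarely overlap.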

As such, this repository provides a unified paired-chain TCR dataset intended to be used as a consistent, comprehensive reference for TCR repertoire analysis and TCR specificity prediction. This dataset meets the data input requirements for all common TCR binding prediction methods: namely, all receptors include valid V/J genes; CDR1, CDR2, CDR2.5, and CDR3 sequences for each chain; aligned full sequences for each chain; and identified epitope specificity with epitope species and gene information. Additional features, including source confidence scores and numerous metadata columns, are also included so that the data can be subset according to specific requirements. Datasets with reduced redundancy (less than 75% and less than 90% CDR3α + CDR3β sequence similarity) and datasets restricted to high-confidence receptor-epitope pairs are provided as well. Data can be prepared using the internally defined methods to add privately obtained data or updated repository releases, facilitating increased model efficacy as new paired-chain TCR data become available.
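
As an illustration of what redundancy reduction at a similarity threshold can look like, here is a minimal greedy sketch. It is not the method used to build the reduced datasets in this repository: the similarity measure used for those datasets is not described here, so Python's `difflib.SequenceMatcher` ratio on the concatenated CDR3α + CDR3β string is used purely as a stand-in, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity in [0, 1]; an assumption, not the repository's metric."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def reduce_redundancy(receptors, threshold):
    """Greedily keep receptors whose CDR3a + CDR3b similarity to every
    previously kept receptor is below `threshold` (e.g. 0.75 or 0.90).

    `receptors` is an iterable of (cdr3_alpha, cdr3_beta) string pairs.
    """
    kept = []
    for cdr3a, cdr3b in receptors:
        key = cdr3a + cdr3b
        if all(similarity(key, ka + kb) < threshold for ka, kb in kept):
            kept.append((cdr3a, cdr3b))
    return kept


receptors = [
    ("CAVRDSNYQLIW", "CASSIRSSYEQYF"),
    ("CAVRDSNYQLIW", "CASSIRSTYEQYF"),    # one substitution in CDR3-beta
    ("CAGGGSQGNLIF", "CASSLAPGATNEKLFF"),  # unrelated receptor
]
for threshold in (0.75, 0.90):
    print(threshold, len(reduce_redundancy(receptors, threshold)))
```

The greedy pass depends on input order, since the first receptor seen in each group of similar sequences is the one retained. Sorting by a confidence score beforehand would be one way to make that choice deliberate.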