Methods overview.
a) catELMo is an ELMo-based bidirectional amino acid sequence representation model trained on TCR sequences. Given a sequence of amino acid tokens, it predicts the next token in the forward (or backward) direction. catELMo consists of a charCNN layer and four bidirectional LSTM layers followed by a softmax activation. For a TCR sequence of length L, each layer returns L vectors of length 1,024, so an embedded TCR sequence has size [5, L, 1024]. Global average pooling over the TCR length L is then applied to obtain a representation vector of size 1,024 (see the pooling sketch below).

b) TCR-epitope binding affinity prediction task. An embedding method (e.g., catELMo) is applied to both TCR and epitope sequences. The embedding vectors are then fed into a neural network of three linear layers to train a binding affinity prediction model, which predicts whether a given TCR-epitope pair binds or not (see the prediction-head sketch below).

c) Epitope-specific TCR sequence clustering. Hierarchical clustering is applied to the TCR embedding vectors to group TCR sequences by their epitope specificity (see the clustering sketch below).
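A minimal PyTorch sketch of the pooling step in a), using a random tensor in place of the real charCNN/biLSTM outputs. The reduction over the five layers is an assumption added here for illustration, since the text specifies pooling only over the length L; ELMo-style models often combine layers with a learned weighted sum instead.

```python
import torch

# Toy stand-in for catELMo's stacked outputs: one charCNN layer plus four
# biLSTM layers, each returning L vectors of length 1,024 for a TCR of length L.
L = 15                                   # example TCR length
layer_outputs = torch.randn(5, L, 1024)  # shape [5, L, 1024]

# Global average pooling over the length dimension L collapses each layer's
# output to a single 1,024-dim vector per layer.
pooled_per_layer = layer_outputs.mean(dim=1)   # shape [5, 1024]

# To reach a single 1,024-dim TCR representation, the five layer vectors must
# also be combined; a plain average is assumed here (not specified in the text).
tcr_embedding = pooled_per_layer.mean(dim=0)   # shape [1024]
print(tcr_embedding.shape)  # torch.Size([1024])
```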
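A sketch of the binding affinity prediction head in b): three linear layers taking a TCR embedding and an epitope embedding and returning a binding probability. The hidden sizes, ReLU activations, and concatenation of the two embeddings are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BindingPredictor(nn.Module):
    """Three linear layers mapping a (TCR, epitope) embedding pair to a
    binding probability. Layer widths and activations are assumptions."""

    def __init__(self, embed_dim: int = 1024, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden),  # concatenated TCR + epitope embeddings
            nn.ReLU(),
            nn.Linear(hidden, hidden // 2),
            nn.ReLU(),
            nn.Linear(hidden // 2, 1),         # single logit: bind vs. not bind
        )

    def forward(self, tcr_emb: torch.Tensor, epi_emb: torch.Tensor) -> torch.Tensor:
        x = torch.cat([tcr_emb, epi_emb], dim=-1)
        return torch.sigmoid(self.net(x)).squeeze(-1)

# Usage on a toy batch of 8 TCR-epitope pairs.
model = BindingPredictor()
tcr = torch.randn(8, 1024)
epi = torch.randn(8, 1024)
print(model(tcr, epi).shape)  # torch.Size([8]), binding probabilities
```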
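A clustering sketch for c) using SciPy's hierarchical (agglomerative) clustering on TCR embedding vectors. The linkage method, distance metric, and number of clusters are assumptions made for illustration; the text specifies only that hierarchical clustering is applied to the embeddings.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy TCR embeddings: 100 sequences, each a 1,024-dim catELMo-style vector.
embeddings = np.random.rand(100, 1024)

# Build the hierarchical clustering tree on the embedding vectors.
# Average linkage and cosine distance are assumed choices.
Z = linkage(embeddings, method="average", metric="cosine")

# Cut the dendrogram into a fixed number of clusters (8 here, arbitrarily),
# grouping TCRs with similar embeddings, ideally sharing epitope specificity.
labels = fcluster(Z, t=8, criterion="maxclust")
print(labels[:10])
```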