The bcRep package helps to analyze IMGT/HighV-QUEST* output in more detail. It functions well with B cells, but can also be used for T cell data, in some cases.
Using this package IMGT/HighV-QUEST output files can be analysed, f.e. functionality, junction frames, gene usage and mutations. The package provides functions to analyze clones out of the IMGT/HighV-QUEST output, but also methods to compare sequences and clones, are provided. Further a function to do principal coordinate analysis on dissimilarity/distance measurements is implemented. Almost all results can be visualized via implemented plotting functions.
*: Alamyar, E. et al., IMGT/HighV-QUEST: A High-Throughput System and Web Portal for the Analysis of Rearranged Nucleotide Sequences of Antigen Receptors, JOBIM 2010, Paper 63 (2010). Website: http://www.imgt.org/
bcRep depends on following packages: vegan, gplots, ineq, parallel, doParallel, foreach, ape, stringdist.
Parallel processing is possible for functions clones(), clones.shared(), sequences.geneComb(), compare.aaDistribution(), compare.geneUsage() and compare.trueDiversity(). By default, 1 core is used.
There are several datasets provided in the package. Most of them are examples of IMGT/HighV-QUEST output; two additional files represent clonotype files and one gene usage data:
library(bcRep)
data(summarytab) # An extract from IMGT/HighV-QUEST output file 1_Summary(...).txt
data(ntseqtab) # An extract from IMGT/HighV-QUEST output file 3_Nt-sequences(...).txt
data(aaseqtab) # An extract from IMGT/HighV-QUEST output file 5_AA-sequences(...).txt
data(mutationtab) # An extract from IMGT/HighV-QUEST output file 
                  # 7_V-REGION-mutation-and-AA-change-table(...).txt
data(clones.ind) # Clonotypes of one individual
data(clones.allind) # Clonotypes of eight individuals
data(vgenes) # V gene usage of 10 individuals (rows) and 30 genes (columns)IMGT/HighV-QUEST can process datasets with up to 500.000 sequences. For larger data sets, FASTA files have to be splitted into smaller datasets and be uploaded individually to IMGT. Afterwards the function combineIMGT() can be used to combine several folders.
# Example: combine folders IMGT1a, IMGT1b and IMGT1c to a new datset NewProject
combineIMGT(folders = c("pathTo/IMGT1a", "pathTo/IMGT1b", "pathTo/IMGT1c"), 
            name = "NewProject")To read IMGT/HighV-QUEST output files, use readIMGT("PathTo/file.txt",filterNoResults=TRUE). You can choose between including or excluding sequences without any information (lines with a sequence ID, but “No results”). Spaces and “-” in headers will be replaced by “_“. If there is a particular IMGT table required as input for a function, this is mentioned in the corresponding help file.
Amino acid distribution can be analyzed using aaDistribution() and visualized with plotAADistribution().
This function returns a list containing proportions of all amino acids (including stop codons “*“) for each analyzed sequence length. Optionally, also the number used for analysis can be given.
Figures can be saved as PDF in the working directory.
## 
## Attaching package: 'bcRep'## The following objects are masked _by_ '.GlobalEnv':
## 
##     clones, clones.IDlist, compare.geneUsage, geneUsage## The following object is masked from 'package:tcR':
## 
##     geneUsage# Example: 
data(aaseqtab)
aadistr<-aaDistribution(sequences = aaseqtab$CDR3_IMGT, numberSeq = TRUE)
# First 4 columns of Amino acid distribution table: 
        # for a sequence length of 13 AA (* = stop codon)
        # (aadistr$Amino_acid_distribution$`sequence length = 13`)| Position1 | Position2 | Position3 | Position4 | |
|---|---|---|---|---|
| F | 0 | 0 | 0.007018 | 0.01053 | 
| L | 0 | 0.003509 | 0.04561 | 0.08772 | 
| I | 0 | 0.02105 | 0.02456 | 0.03158 | 
| M | 0 | 0 | 0.01404 | 0.01754 | 
| V | 0.003509 | 0.007018 | 0.09123 | 0.04561 | 
| S | 0 | 0.07719 | 0.08421 | 0.07368 | 
| P | 0.003509 | 0 | 0.02105 | 0.07018 | 
| T | 0.02105 | 0.05614 | 0 | 0.03158 | 
| A | 0.9614 | 0.003509 | 0.05965 | 0.05614 | 
| Y | 0 | 0 | 0.01754 | 0.05263 | 
| H | 0 | 0.01404 | 0.01053 | 0.007018 | 
| C | 0 | 0 | 0.02105 | 0.007018 | 
| W | 0 | 0 | 0.007018 | 0.02807 | 
| N | 0 | 0.01404 | 0.007018 | 0.04211 | 
| D | 0 | 0 | 0.2281 | 0.06316 | 
| G | 0.003509 | 0.01053 | 0.1825 | 0.207 | 
| Q | 0 | 0 | 0.01053 | 0.01404 | 
| R | 0.007018 | 0.607 | 0.06316 | 0.0807 | 
| K | 0 | 0.186 | 0.007018 | 0.03509 | 
| E | 0 | 0 | 0.09825 | 0.0386 | 
| ***** | 0 | 0 | 0 | 0 | 
# Plot example for sequence lengths of 14-17 amino acids:
aadistr.part<-list(aadistr$Amino_acid_distribution[13:16], 
                   data.frame(aadistr$Number_of_sequences_per_length[13:16,]))
names(aadistr.part)<-names(aadistr)
plotAADistribution(aaDistribution.tab=aadistr.part, plotAADistr=TRUE, 
                   plotSeqN=TRUE, PDF=NULL) Diversity of sequences and clones can be analyzed using trueDiversity() or clones.giniIndex().
Using trueDiversity() richness or diversity of sequences with the same length can be analyzed. Basically diversity of amino acids at each position is calculated. True diversity can be measured for orders q = 0, 1 or 2. plotTrueDiversity() can be used for visualization. If mean.plot = F, a figure for each sequence length including diversity per position will be returned. If mean.plot = T only one figure with mean diversity values for all sequences with the same length will be returned.
Order 0: Richness (in this case it represents number of different amino acids per position).
Order 1: Exponential function of Shannon entropy using the natural logarithm (weights all amino acids by their frequency).
Order 2: Inverse Simpson entropy (weights all amino acids by their frequency, but weights are given more to abundant amino acids).
These indices are very similar (Hill, 1973). For example the exponential function of Shannon index is linearly related to inverse Simpson.
Amino acid sequences can be found in IMGT output table 5_AA-sequences(...).txt.
# Example:
data(aaseqtab)
trueDiv<-trueDiversity(sequences = aaseqtab$CDR3_IMGT, order = 1) 
        # using exponent of Shannon entropy
# True diversity of order 1 for amino acid length of 5 AA 
        # (trueDiv$True_diversity$'sequence length = 5')| Position1 | Position2 | Position3 | Position4 | Position5 | 
|---|---|---|---|---|
| 1 | 3 | 3 | 1.89 | 3 | 
# True diversity for sequences of amino acid length 14-17: 
trueDiv.part<-list(trueDiv$True_diversity_order, trueDiv$True_diversity[13:16])
names(trueDiv.part)<-names(trueDiv)
plotTrueDiversity(trueDiversity.tab=trueDiv.part, mean.plot = F,color="darkblue", PDF=NULL)plotTrueDiversity(trueDiversity.tab=trueDiv, mean.plot = T,color="darkblue", PDF=NULL)‘clones.giniIndex()’ calculates the Gini index of clones. Input is a vector containing clone sizes (copy number). The Gini index measures the inequality of clone size distribution. It ranges between 0 and 1. An index of 0 represents a polyclonal distribution, where all clones have same size. An index of 1 represents a perfect monoclonal distribution.
If a PDF project name is given, the Lorenz curve will be returned, as well. Lorenz curve can be interpreted as p*100 percent have L(p)*100 percent of clone size.
# Example:
data(clones.ind)
clones.giniIndex(clone.size=clones.ind$total_number_of_sequences, PDF = NULL)
# [1] 0.7473557Gene usage can be analyzed using functions geneUsage() and sequences.geneComb().
It can be differentiated between subgroup (e.g. IGHV1), gene (e.g. IGHV1-1) or allele (e.g. IGHV1-1*2), as well as between relative or absolute values.
When using function geneUsage(), functionality and junction frame usage, dependent on gene usage can be studied. Please note, that input vectors have to have the same order. Single genes per line, as well as several genes per line, can be processed. IMGT nomenclature has to be used (e.g. “Homsap IGHV4-34*01“)!
# Example:
data(summarytab)
Vgu<-geneUsage(genes = summarytab$V_GENE_and_allele, level = "subgroup", 
                functionality = summarytab$Functionality)
plotGeneUsage(geneUsage.tab = Vgu, plotFunctionality = TRUE, PDF = NULL, 
              title = "IGHV usage")Using function sequences.geneComb(), gene/gene ratios of two gene families will be analyzed. IMGT nomenclature has to be used (e.g. “Homsap IGHV4-34*01“). Results can be plotted with and without gene/gene combinations, where one or both genes are unknown.
# Example: 
data(summarytab)
VDcomb.tab<-sequences.geneComb(family1 = summarytab$V_GENE_and_allele, 
                family2 = summarytab$D_GENE_and_allele, 
                level = "subgroup", abundance = "relative")
plotGeneComb(geneComb.tab = VDcomb.tab, withNA = FALSE, PDF = NULL)Function to filter and to summarize IMGT/HighV-QUEST are provided:
Sequence files can be filtered for functionality or junction frame usage:
sequences.getAnyFunctionality(): excludes sequences without any functionality information
sequences.getAnyJunctionFrame(): excludes sequences without any junction frame information
The function looks for a column containing functionality or junction frame information and filters the whole table. Columns will remain, rows will be filtered. Junction frame information is only available in the IMGT summary table (1_Summary(...).txt). Functionality information is available in every table.
# Example: filter for productive sequences
data(summarytab)
ProductiveSequences<-sequences.getProductives(summarytab)
# dimension of the summary table [rows, columns]:
dim(summarytab) 
## [1] 3000   29
# dimension of the summary table, filtered for productive sequences [rows, columns]:
dim(ProductiveSequences) 
## [1] 2645   29Some basic statistics about functionality and junction frames can be given by sequences.functionality() and sequences.junctionFrame(). Both functions return absolute or relative values for each kind of sequences.
# Example:
data(summarytab)
sequences.functionality(data = summarytab$Functionality, relativeValues=TRUE)
# Proportion of productive and unproductive sequences:| productive | unproductive | unknown | 
|---|---|---|
| 0.8817 | 0.1127 | 0.005667 | 
Some basic mutation analysis can be done with sequences.mutation(). Input is the IMGT output table 7_V-REGION-mutation-and-AA-change-table(...).txt and optionally 1_Summary(...).txt. Mutation analysis can be done for V-region, FR1-3 and CDR1-2 sequences. Functionality, as well as junction frame analysis can be included. Further the R/S ratio (ratio of replacement and silent mutations) can be returned.
data(mutationtab)
V.mutation<-sequences.mutation(mutationtab = mutationtab, sequence = "V", 
                               junctionFr = TRUE, rsRatio=TRUE)
# V.mutation$Number_of_mutations, first 6 lines:| number_mutations | number_replacement | number_silent | RS_ratio | |
|---|---|---|---|---|
| Sequence1 | 1 | 1 | 0 | 0 | 
| Sequence2 | 0 | 0 | 0 | 0 | 
| Sequence3 | 1 | 0 | 1 | 0 | 
| Sequence4 | 7 | 4 | 3 | 1.333 | 
| Sequence5 | 29 | 21 | 8 | 2.625 | 
| Sequence6 | 3 | 1 | 2 | 0.5 | 
# V.mutation$Junction frame:| proportion | |
|---|---|
| mutation_in_in-frame_sequences | 0 | 
| mutations_in_out-of-frame_sequences | 0 | 
| mutations_in_sequences_with_unknown_junction_frame | 0.7387 | 
| no_mutation_in_in-frame_sequences | 0 | 
| no_mutations_in_out-of-frame_sequences | 0 | 
| no_mutations_in_sequences_with_unknown_junction_frame | 0.2613 | 
Proportions of nucleotide distributions at and nearby mutation sites can be calculated using sequences.mutation.base(). This function calculates the porportion of a, c, g or t at the mutated positions and/or next to mutations from position -3 to +3. Parallel processing can be used for large datasets.
plotSequencesMutationAA() visualizes base proportions nearby mutations. Barplots are given for each base.
# Example:
data(mutationtab)
data(summarytab)
V.BaseMut<-sequences.mutation.base(mutationtab = mutationtab, summarytab = summarytab, 
                                   analyseEnvironment = TRUE, analyseMutation = TRUE, 
                                   sequence = "V", nrCores=1)
plotSequencesMutationBase(mutationBaseTab = V.BaseMut, plotEnvironment = TRUE, 
     plotMutation = TRUE, colHeatmap = c("white","darkblue"))Proportions of amino acid changes can be calculated using sequences.mutation.AA(). This function calculates the porportion of mutations from amino acid x to amino acid y. The original amino acids are in rows, the mutated one in columns.
plotSequencesMutationAA() visualizes the 20 x 20 data frame in some kind of heatmap, showing groups of proportions. Furthermore changes of amino acid characteristics, like hydropathy, volume or chemical can be shown (orange dots). Figures can be saved as PDF.
# Example:
data(mutationtab)
V.AAmut<-sequences.mutation.AA(mutationtab = mutationtab, sequence = "V")
plotSequencesMutationAA(mutationAAtab = V.AAmut, showChange = "hydropathy")bcRep provides some functions to study clone features. Some columns of the clonotype file can simply be plotted using boxplot() or barplot().
Using clones.filterSize() you can filter a set of clonotypes by their size. Three methods for filtering are available.
You can set tresholds for the parameters…
number: filters the table for a given number of sequences; e.g. 20 smallest clonespropOfClones: filters the table for the proportion of the total number of clones, e.g. the 10% biggest clonespropOfSequences: filters the table for the proportion of all sequences, belonging to all clones; e.g. clones, that include 1% of all sequencesAdditionally, filtering can be done for both ends (biggest and smallest clones; two.tailed), or only for biggest (upper.tail) or smallest (lower.tail) clones. Biggest clones refers to clones with the biggest sequence number (or largest size/copy number) (same for smallest clones).
# Example 1: Getting the 3 smallest clones (clones with 85% CDR3 identity)
clones.filtered1<-clones.filterSize(clones.tab=clones.ind, 
                                    column="total_number_of_sequences", 
                                    number=3, method="lower.tail")
    # Output of clones.filtered1[,c("total_number_of_sequences",
            # "unique_CDR3_sequences_AA", "CDR3_length_AA", "V_gene")]:| total_number_of_sequences | unique_CDR3_sequences_AA | CDR3_length_AA | V_gene | 
|---|---|---|---|
| 2 | GALITMVRGVISWRFDP, GALIAMVRGVISWRFDP | 17 | IGHV4-4 | 
| 2 | ARGLQRLVL, VRGLQRLVL | 10 | IGHV4-4 | 
| 2 | AKDRYGDYALY, ARDRYGDYALY | 11 | IGHV4-4 | 
# Example 2: Getting 10% biggest and smallest clones 
# (clones with 85% CDR3 identity; column 4 = "total_number_of_sequences")
clones.filtered2<-clones.filterSize(clones.tab=clones.ind, column=4, propOfClones=0.1,
                                    method="two.tailed")
names(clones.filtered2) # a list with biggest and smallest 10% clones
## [1] "upper.tail" "lower.tail"
dim(clones.filtered2$upper.tail) # dimension of table with biggest clones [rows, columns]
## [1] 100  14
dim(clones.filtered2$lower.tail) # dimension of table with smallest clones [rows, columns]
## [1] 100  14
# Example 3: Getting clones, that include 0.5% of all sequences 
# (clones with 85% CDR3 identity; column 4 = "total_number_of_sequences"")
clones.filtered3<-clones.filterSize(clones.tab=clones.ind, column=4,
                                    propOfSequences=0.005, method="two.tailed")
names(clones.filtered3) # a list with biggest and smallest 10% clones
## [1] "upper.tail" "lower.tail"
dim(clones.ind) ## dimension of clones.ind table [rows, columns]
## [1] 1000   14
## >> number of rows is equal to number of rows of biggest and smallest clones
dim(clones.filtered3$upper.tail) # dimension of table with biggest clones [rows, columns]
## [1] 48 14
dim(clones.filtered3$lower.tail) # dimension of table with smallest clones [rows, columns]
## [1] 952  14clones.filterFunctionality() and clones.filterJunctionFrame() filter set of clones for functionality or junction frame usage.
# Example
data(clones.ind)
productiveClones<-clones.filterFunctionality(clones.tab = clones.ind, 
                                             filter = "productive")
# dimension of clones.ind [rows, columns]:
dim(clones.ind) 
## [1] 1000   14
# dimension of clones.ind, filtered for productive sequences [rows, columns]:
dim(productiveClones) 
## [1] 1000   14# Example
data(clones.ind)
inFrameClones<-clones.filterJunctionFrame(clones.tab = clones.ind, filter = "in-frame")
# dimension of clones.ind [rows, columns]:
dim(clones.ind) 
## [1] 1000   14
# dimension of clones.ind, filtered for in-frame sequences [rows, columns]:
dim(inFrameClones) 
## [1] 873  14Functions to analyse clone features, like CDR3 length distribution or clone copy number, are provided.
Using clones.CDR3Length() CDR3 length distribution of clones can be estimated. The corresponding visualization method is plotClonesCDR3Length(). Input for both functions are vectors containing nucleotide or amino acid sequences. Further functionality and junction frame usage, depending on CDR3 length can be studied. Absolute or relative values can be returned.
# Example:
data(clones.ind)
CDR3length<-clones.CDR3Length(CDR3Length = clones.ind$CDR3_length_AA, 
                  functionality = clones.ind$Functionality_all_sequences)
# output of some CDR3 length proportions (CDR3length$CDR3_length[1:3]): | CDR3_length_7 | CDR3_length_8 | CDR3_length_9 | 
|---|---|---|
| 0.018 | 0.013 | 0.007 | 
# output of functionality ratios for some CDR3 lengths 
    # (CDR3length$CDR3_length_vs_functionality[,1:3)])| CDR3_length_7 | CDR3_length_8 | CDR3_length_9 | |
|---|---|---|---|
| productive | 0.8889 | 0.8462 | 1 | 
| unproductive | 0.1111 | 0.1538 | 0 | 
| unknown | 0 | 0 | 0 | 
plotClonesCDR3Length(CDR3Length = clones.ind$CDR3_length_AA, 
                     functionality = clones.ind$Functionality_all_sequences, 
                     title = "CDR3 length distribution", PDF = NULL)plotClonesCopyNumber() can be used to plot clone sizes as an boxplot.
# Example: Clone copy number distribution with and without outliers
plotClonesCopyNumber(copyNumber = clones.ind$total_number_of_sequences, 
                     withOutliers=TRUE, color = "darkblue", 
                     title = "Copy number distribution", PDF = NULL)plotClonesCopyNumber(copyNumber = clones.ind$total_number_of_sequences, 
                     withOutliers=FALSE, color = "darkblue", 
                     title = "Copy number distribution", PDF = NULL)Further functions like geneUsage()or trueDiversity() can be used for further analysis of clone sets.
If more than one sample should be analyzed, bcRep provides some functions to compare different samples. Gene usage, amino acid distribution, diversity and shared clones can be compared and visualized.
Gene usage can be compared using compare.geneUsage(). Input is a list including sequence vectors of each individual. If no names for the samples are specified, samples will be called Sample1, Sample2, … in input order. Analysis of subgroup (e.g. IGHV1), gene (e.g. IGHV1-1) or allele (e.g. IGHV1-1*2), as well as of relative or absolute values is possible.
Results can be visualized using plotCompareGeneUsage() (optional saved as PDF in working directory). A heatmap will be returned, with samples as rows and genes as columns. Proportions of genes are color coded.
data(aaseqtab)
data(aaseqtab2)
V.comp<-compare.geneUsage(gene.list = list(aaseqtab$V_GENE_and_allele, 
                                           aaseqtab2$V_GENE_and_allele), 
                               level = "subgroup", abundance = "relative", 
                          names = c("Individual1", "Individual2"), 
                          nrCores = 1)| IGHV1 | IGHV2 | IGHV3 | IGHV4 | IGHV5 | IGHV6 | IGHV7 | |
|---|---|---|---|---|---|---|---|
| Individual1 | 0.1675 | 0.002404 | 0.6305 | 0.1687 | 0.02164 | 0.007213 | 0.002004 | 
| Individual2 | 0.1242 | 0.004302 | 0.6834 | 0.154 | 0.0222 | 0.007228 | 0.004646 | 
plotCompareGeneUsage(comp.tab = V.comp, color = c("gray97", "darkblue"), PDF = NULL)The amino acid distribution between two or more individuals can be compared using compare.aaDistribution(). Input is a list including sequence vectors for each individual. Sequences of the same length will be analyzed together. Optional the number of sequences used for analysis can be returned. If no names for the samples are specified, samples will be called Sample1, Sample2, … in input order.
The output is a list, containing amino acid distributions for each sequence length and each individual.
plotCompareAADistribution() returns a plot (optional saved as PDF in working directory), where columns represent the samples, and rows represent different sequence lengths. In each field, one bar represents one position of the sequence.
data(aaseqtab)
data(aaseqtab2)
AAdistr.comp<-compare.aaDistribution(sequence.list = list(aaseqtab$CDR3_IMGT,
                                                          aaseqtab2$CDR3_IMGT), 
                                     names = c("Individual1", "Individual2"), 
                                     numberSeq = TRUE, nrCores = 1)
# Comparison of sequence length of 14-16 amino acids: 
## Individual 1: grep("14|15|16",names(AAdistr.comp$Amino_acid_distribution$Individual1)) 
        ## >> 13:15
## Individual 2: grep("14|15|16",names(AAdistr.comp$Amino_acid_distribution$Individual2)) 
        ## >> 12:14
AAdistr.comp.part<-list(list(AAdistr.comp$Amino_acid_distribution$Individual1[13:15], 
                             AAdistr.comp$Amino_acid_distribution$Individual2[12:14]),
                        list(AAdistr.comp$Number_of_sequences_per_length$Individual1[13:15],
                             AAdistr.comp$Number_of_sequences_per_length$Individual2[12:14]))
names(AAdistr.comp.part)<-names(AAdistr.comp)
names(AAdistr.comp.part$Amino_acid_distribution)<-
  names(AAdistr.comp$Amino_acid_distribution)
names(AAdistr.comp.part$Number_of_sequences_per_length)<-
  names(AAdistr.comp$Number_of_sequences_per_length)
plotCompareAADistribution(comp.tab = AAdistr.comp.part, plotSeqN = TRUE, 
                          colors=c("darkblue","darkred"), PDF = NULL)Richness and diversity can be compared using compare.trueDiversity(). Input can be a list of sequences or the output of compare.aaDistribution(). Diversity can be analyzed for order 0 (richness), 1 (exponent of Shannon entropy) or 2 (inverse Simpson) (see diversity analysis above). If no names for the samples are specified, samples will be called Sample1, Sample2, … in the input order.
The output is a list, containing richness or diversity indices for each sequence length and each individual.
Results can be visualized using plotCompareTrueDiversity() (optional saved as PDF in working directory). A plot will be returned, with one figure for each sequence length, if mean.plot = F. If mean.plot = T only one figure with mean diversity values for all sequences with the same length will be returned. Each figure contains the sequence position on the x-axis and the richness/diversity index on the y-axis. If no colors are specified, rainbow() will be used.
data(aaseqtab)
data(aaseqtab2)
trueDiv.comp<-compare.trueDiversity(sequence.list = list(aaseqtab$CDR3_IMGT, 
                                                         aaseqtab2$CDR3_IMGT), 
                                    names = c("Individual1", "Individual2"), 
                                    order = 1, nrCores = 1)
# Comparison of sequence length of 14-16 amino acids: 
grepindex1<-grep("14|15|16",names(trueDiv.comp$Individual1))
grepindex2<-grep("14|15|16",names(trueDiv.comp$Individual2))
trueDiv.comp.part<-list(trueDiv.comp$True_diversity_order, 
                        trueDiv.comp$Individual1[grepindex1],
                        trueDiv.comp$Individual2[grepindex2])
names(trueDiv.comp.part)<-names(trueDiv.comp)
plotCompareTrueDiversity(comp.tab = trueDiv.comp.part, mean.plot = F, colors=c("darkblue","darkred"), 
                         PDF = NULL)
plotCompareTrueDiversity(comp.tab = trueDiv.comp, mean.plot = T, colors = c("darkblue","darkred"), 
                         PDF = NULL)The function ‘geneUsage.distance()’ provides several dissimilarity and distance measurements for gene usage data. Bray-Curtis, Jaccard and cosine indices can be calculated on data (taken from vegan::vegdist and proxy::dist).frames with genes in columns and samples in rows, see Example below:
# Example:
data(vgenes) 
vgenes[1:5,1:5]| IGHV4-34 | IGHV1-8 | IGHV1-3 | IGHV3-23 | IGHV3-23D | |
|---|---|---|---|---|---|
| Sample1 | 0.06276 | 0.03591 | 0.01268 | 0.08919 | 0.08858 | 
| Sample2 | 0.05292 | 0.04631 | 0.01998 | 0.08522 | 0.08389 | 
| Sample3 | 0.02476 | 0.05528 | 0.008978 | 0.04073 | 0.03966 | 
| Sample4 | 0.05124 | 2.6e-05 | 0.02297 | 0.01556 | 0.01522 | 
| Sample5 | 0.01844 | 0.02616 | 0.03021 | 0.04346 | 0.04256 | 
‘geneUsage.distance()’ then calculates a distance matrix:
# Example for 5 samples and 5 genes:
data(vgenes) 
geneUsage.distance(geneUsage.tab = vgenes[1:5,1:5], method = "bc")| Sample1 | Sample2 | Sample3 | Sample4 | Sample5 | |
|---|---|---|---|---|---|
| Sample1 | 0 | 0.06271 | 0.3456 | 0.5193 | 0.363 | 
| Sample2 | 0.06271 | 0 | 0.2989 | 0.4812 | 0.3294 | 
| Sample3 | 0.3456 | 0.2989 | 0 | 0.5297 | 0.1886 | 
| Sample4 | 0.5193 | 0.4812 | 0.5297 | 0 | 0.4568 | 
| Sample5 | 0.363 | 0.3294 | 0.1886 | 0.4568 | 0 | 
Using ‘sequences.distance()’ distance matrices, based on sequence data, e.g. CDR3 sequences, can be calculated. Following dissmilarity/distance measurements can be used (taken from stringdist::stringdist-metric):
Further it can be distinguished wheather all sequences or subsets of sequences of the same length shall be used for distance estimation.
# Example 1:
## Divide sequences into subsets of same length and then apply Levenshtein distance:
data(clones.ind)
clones.ind<-subset(clones.ind, CDR3_length_AA==7) #take only a subset of clones.ind (CDR3 sequences of 7 AA)
dist1<-sequences.distance(sequences = clones.ind$unique_CDR3_sequences_AA, 
     method = "levenshtein", divLength=TRUE)
# Subset of output for 7AA sequences:
dist1[[1]][1:5,1:5]| ARERADY | AKERADY | AGERADY | ARERADY | AREQADY | |
|---|---|---|---|---|---|
| ARERADY | 0 | 1 | 1 | 0 | 1 | 
| AKERADY | 1 | 0 | 1 | 1 | 2 | 
| AGERADY | 1 | 1 | 0 | 1 | 2 | 
| ARERADY | 0 | 1 | 1 | 0 | 1 | 
| AREQADY | 1 | 2 | 2 | 1 | 0 | 
# Example 2: 
## Use all sequences with a length of 7 or 8 AA for cosine distance matrix:
data(clones.allind)
clones.allind<-subset(clones.allind, CDR3_length_AA %in% c(9,31)) #take only a subset of clones.ind (CDR3 sequences of 9 or 31 AA)
dist2<-sequences.distance(sequences = clones.allind$unique_CDR3_sequences_AA, 
     groups = clones.allind$individuals, method = "cosine", divLength=FALSE)
# Subset of output:
dist2[[1]][1:4,1:4]| AGGGSYYHA | AGGGSYYHS | ARGPYLFDY | ARGPYLFDH | |
|---|---|---|---|---|
| AGGGSYYHA | 0 | 0.05263 | 0.3775 | 0.3882 | 
| AGGGSYYHS | 0.05263 | 0 | 0.4466 | 0.4647 | 
| ARGPYLFDY | 0.3775 | 0.4466 | 0 | 0.09547 | 
| ARGPYLFDH | 0.3882 | 0.4647 | 0.09547 | 0 | 
Principal coordinate analysis can be performed on distance matrices. The function dist.PCoA() provides principal coordinate objects, like in ‘ape::pcoa()’, with an output as follows (taken from ape::pcoa):
The results can also be plotted, using ‘plotDistPCoA()’. Input has to be the PCoA table from ‘dist.PCoA()’, a ‘groups’ data.frame and a pair of axes to be plotted. Optional arguments are names, plotCorrection (default: FALSE), title, plotLegend (default: FALSE) and PDF.
# Example for gene usage data:
data(vgenes) 
vgenes<-vgenes[,1:10] # include only genes 1-10
gu.dist<-geneUsage.distance(geneUsage.tab = vgenes, method = "bc")
gu.pcoa<-dist.PCoA(dist.tab = gu.dist, correction = "cailliez")str(gu.pcoa) # a PCoA object## List of 1
##  $ :List of 7
##   ..$ correction : chr [1:2] "cailliez" "3"
##   ..$ note       : chr "Cailliez correction applied to negative eigenvalues: D' = -0.5*(D + 0.0607817555282937 )^2, except diagonal elements"
##   ..$ values     :'data.frame':  10 obs. of  6 variables:
##   .. ..$ Eigenvalues : num [1:10] 0.1688 0.0608 0.026 0.0196 0.0126 ...
##   .. ..$ Corr_eig    : num [1:10] 0.2148 0.0877 0.0417 0.0349 0.0253 ...
##   .. ..$ Rel_corr_eig: num [1:10] 0.4908 0.2003 0.0953 0.0798 0.0578 ...
##   .. ..$ Broken_stick: num [1:10] 0.3397 0.2147 0.1522 0.1106 0.0793 ...
##   .. ..$ Cum_corr_eig: num [1:10] 0.491 0.691 0.786 0.866 0.924 ...
##   .. ..$ Cum_br_stick: num [1:10] 0.34 0.554 0.707 0.817 0.897 ...
##   ..$ vectors    : num [1:10, 1:7] -0.1721 -0.1274 0.0216 0.2608 0.1213 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:10] "Sample1" "Sample2" "Sample3" "Sample4" ...
##   .. .. ..$ : chr [1:7] "Axis.1" "Axis.2" "Axis.3" "Axis.4" ...
##   ..$ trace      : num 0.29
##   ..$ vectors.cor: num [1:10, 1:8] -0.1945 -0.1477 0.0271 0.2896 0.1399 ...
##   ..$ trace.cor  : num 0.438
##   ..- attr(*, "class")= chr "pcoa"# Example of a 'groups' data.frame: in the case, there are two groups:
groups.vec<-cbind(rownames(vgenes),c(rep("group1",5),rep("group2",5)))
colnames(groups.vec)<-c("Sample","Group")
pandoc.table(groups.vec)| Sample | Group | 
|---|---|
| Sample1 | group1 | 
| Sample2 | group1 | 
| Sample3 | group1 | 
| Sample4 | group1 | 
| Sample5 | group1 | 
| Sample6 | group2 | 
| Sample7 | group2 | 
| Sample8 | group2 | 
| Sample9 | group2 | 
| Sample10 | group2 | 
plotDistPCoA(pcoa.tab = gu.pcoa, groups= groups.vec, axes = c(1,2), 
             names = rownames(vgenes), plotLegend = TRUE)# Example for sequence data, divided into subsets of the same length:
data(clones.ind)
clones.ind<-subset(clones.ind, CDR3_length_AA %in% c(7:10)) # include only CDR3 sequences of 7-10 AA
seq.dist<-sequences.distance(sequences = clones.ind$unique_CDR3_sequences_AA, 
     method = "levenshtein", divLength=T)
seq.pcoa<-dist.PCoA(dist.tab = seq.dist, correction = "none")
# Example of a 'groups' data.frame: 
## in the case, there are no groups (all samples have the same group):
groups.vec<-unlist(apply(data.frame(clones.ind$unique_CDR3_sequences_AA),1,
                         function(x){strsplit(x,split=", ")[[1]]}))
groups.vec<-cbind(groups.vec, 1)
plotDistPCoA(pcoa.tab = seq.pcoa, groups=groups.vec, axes = c(1,2), 
             plotCorrection = FALSE, 
             title = 'PCoA for sequences of length 7-10 AA', plotLegend=FALSE)