Determining haplotypes from HLA alleles
Introduction
Over the last few years, while doing my PhD, I’ve worked extensively with genomic data (mostly from the amazing UK Biobank!) that comes formatted as a giant matrix with:
- Individuals on the rows;
- SNPs (single-nucleotide polymorphisms) on the columns; and
- Minor allele counts in each cell.
This means that, for a particular person, I have information on millions of single base pair positions along their genome, and this information consists of the number of copies of the minor allele that they have at each location (they may have 2 copies of the major allele — i.e., the most common in the population — and 0 copies of the minor allele, or instead 1 or 2 copies of the minor allele).
Data in this form is easy to understand, explain and analyse: by fitting some kind of generalised linear model to relate a phenotype of interest to a given SNP, we can test whether this single-nucleotide variant is associated with that phenotype. For example, if our phenotype is Alzheimer’s disease, we would find several variants in the apolipoprotein E (APOE) gene that either increase or decrease the risk of developing the disease.1
More recently, I’ve started a new project looking at genetic data from the human leukocyte antigen (HLA) system on chromosome 6. This region of the genome is fascinating for multiple reasons:
- It contains a large number of genes (more than 250) which are crucial for regulating human immune response;
- It has a larger number of disease associations than any other locus in the genome (Sakaue et al., 2022);
- It’s extremely diverse genetically, a diversity which presumably evolved to maximise the ability of the human population to resist novel pathogens;
- It’s in very high linkage disequilibrium, which means that long stretches of DNA tend to be inherited jointly and so complicate the analysis in isolation of individual loci within the region.
Genomic landscape of the MHC (Trowsdale and Knight, 2013)
Notice the myriad of genes located in this short region of the genome and the relative sparsity of the recombination signals, which head to high linkage.
Beyond the specific biological characteristics of the HLA, there are some quirks to the way in which genetic data for this region is normally formatted. The study of the HLA began more than 60 years ago (Trowsdale and Knight, 2013) and because of this the conventions for how to organise and present data are rather arcane and distinct from the typical setup used for genotype data.
HLA alleles and haplotypes
HLA genes are divided into three classes, I, II and III. My understanding of HLA biology is very limited so for the purposes of this post I’ll simply point out that Class I contains genes A, B and C (among others) and Class II contains DQA1, DQB1, DRB1, DPA1, DPB1, etc.
Genetic information in the HLA is typically presented as strings denoting particular alleles for a given gene beginning with HLA_{gene}*{allele}, where {gene} takes values A, B, C, etc. and {allele} is a four-digit or longer number corresponding to a particular allele. For example, HLA_A*33:05 is one of many observed alleles at four-digit resolution for HLA gene A, and this allele will be determined based on mutations observed at multiple bases.
As if all this weren’t enough to confuse a computational student with no formal training in biology like me, HLA alleles are often grouped into haplotypes before further analysis! I mentioned previously that the HLA is a region of very high linkage where long stretches of the genome are inherited jointly from one parent. This leads to particular combinations of alleles being common and so these have historically been the main unit of genetic variation analysed in the HLA.
Let’s be more concrete: a person might have, on one of their two copies of chromosome 6, the following HLA alleles: A*01:01, C*07:01, B*08:01, DRB1*03:01, DQA1*05:01, DQB1*02:01. Since these six alleles are on the same chromosome, they were likely inherited together from a single chromosome from one parent (no recombination) and so constitute a haplotype. We might write it as a long hyphenated string: A*01:01-C*07:01-B*08:01-DRB1*03:01-DQA1*05:01-DQB1*02:01. This haplotype happens to be commonly observed in European populations.
Determining haplotypes from HLA alleles
This long introduction brings us to the purpose of this blog post. HLA data is often supplied as a series of (phased) HLA alleles. However, the unit of genetic variation for subsequent analysis is usually expected to be HLA haplotypes. How do we go from alleles to haplotypes?
Since this operation has to be commonly performed as a pre-processing step in genetic studies of the HLA, when I began working with this region I expected to find multiple software packages and online tutorials guiding me through the process of transforming alleles into haplotypes. However, to my surprise I wasn’t able to find any! I was lucky to have a senior researcher with lots of experience with the HLA to point me in the right direction and so I thought I would share the R code I wrote to do this in case it comes in handy to others in the future.
Building a toy dataset
Let’s start by making a simple dataset of HLA alleles that we will transform into haplotypes. It will be a data frame formatted in VCF style, with genetic variants on the rows and individuals on the columns:
## make toy HLA dataset
hla_i <- data.frame(ID = c("HLA_A*01:01", "HLA_A*02:01", "HLA_A*03:01",
"HLA_A*11:01",
"HLA_C*01:02", "HLA_C*02:02", "HLA_C*03:03",
"HLA_C*07:01",
"HLA_B*07:02", "HLA_B*08:01", "HLA_B*13:02",
"HLA_B*14:01"),
i1 = c("1|1", "0|0", "0|0", "0|0",
"0|0", "1|0", "0|1", "0|0",
"0|0", "0|0", "0|0", "1|1"),
i2 = c("0|1", "1|0", "0|0", "0|0",
"1|0", "0|0", "0|1", "0|0",
"1|0", "0|0", "0|0", "0|1"),
i3 = c("0|0", "0|0", "1|0", "0|1",
"0|0", "1|0", "0|1", "0|0",
"0|0", "1|0", "0|0", "0|1"),
i4 = c("1|1", "0|0", "0|0", "0|0",
"0|0", "0|0", "0|0", "1|1",
"0|0", "0|0", "1|1", "0|0"))
> hla_i
ID i1 i2 i3 i4
1 HLA_A*01:01 1|1 0|1 0|0 1|1
2 HLA_A*02:01 0|0 1|0 0|0 0|0
3 HLA_A*03:01 0|0 0|0 1|0 0|0
4 HLA_A*11:01 0|0 0|0 0|1 0|0
5 HLA_C*01:02 0|0 1|0 0|0 0|0
6 HLA_C*02:02 1|0 0|0 1|0 0|0
7 HLA_C*03:03 0|1 0|1 0|1 0|0
8 HLA_C*07:01 0|0 0|0 0|0 1|1
9 HLA_B*07:02 0|0 1|0 0|0 0|0
10 HLA_B*08:01 0|0 0|0 1|0 0|0
11 HLA_B*13:02 0|0 0|0 0|0 1|1
12 HLA_B*14:01 1|1 0|1 0|1 0|0
On each row we have a particular HLA allele. I’ve included four alleles each for genes A, B and C (in reality there are many more). On each column, we have one of four individuals and, for each, a phased allele count. Looking at cell [1,2], for example, we see that individual i1 has the HLA_A*01:01 allele on both copies of chromosome 6.
Note that, save for errors in the HLA data (which is usually imputed rather than directly genotyped), each individual should have exactly two alleles for each HLA gene. These may be identical, as in the example above, or different, as for individual i2 and gene A.
Separating the phased genotypes
Since each copy of a chromosome is inherited from a different parent and we want to define haplotypes which correspond to chunks of the genome which are jointly inherited, we will begin by decoupling the phased genotypes and create two separate datasets that we will then process separately. Let’s divide each cell by the |
separator:
## get haplotypes: first genotype
hla_i1 <- cbind(hla_i[, "ID", drop = FALSE],
as.data.frame(sapply(hla_i[, -1],
function(x)
as.numeric(sub("\\|\\d+$", "", x)))))
## get haplotypes: second genotype
hla_i2 <- cbind(hla_i[, "ID", drop = FALSE],
as.data.frame(sapply(hla_i[, -1],
function(x)
as.numeric(sub("^\\d+\\|", "", x)))))
We get two datasets that look like the following:
> hla_i1
ID i1 i2 i3 i4
1 HLA_A*01:01 1 0 0 1
2 HLA_A*02:01 0 1 0 0
3 HLA_A*03:01 0 0 1 0
4 HLA_A*11:01 0 0 0 0
5 HLA_C*01:02 0 1 0 0
6 HLA_C*02:02 1 0 1 0
7 HLA_C*03:03 0 0 0 0
8 HLA_C*07:01 0 0 0 1
9 HLA_B*07:02 0 1 0 0
10 HLA_B*08:01 0 0 1 0
11 HLA_B*13:02 0 0 0 1
12 HLA_B*14:01 1 0 0 0
> hla_i2
ID i1 i2 i3 i4
1 HLA_A*01:01 1 1 0 1
2 HLA_A*02:01 0 0 0 0
3 HLA_A*03:01 0 0 0 0
4 HLA_A*11:01 0 0 1 0
5 HLA_C*01:02 0 0 0 0
6 HLA_C*02:02 0 0 0 0
7 HLA_C*03:03 1 1 1 0
8 HLA_C*07:01 0 0 0 1
9 HLA_B*07:02 0 0 0 0
10 HLA_B*08:01 0 0 0 0
11 HLA_B*13:02 0 0 0 1
12 HLA_B*14:01 1 1 1 0
Determining haplotypes for each copy of chromosome 6
Let’s now focus on the first genotype (first data frame above) and think about how we can determine the haplotypes. We see that individual i1 has alleles HLA_A*01:01
, HLA_C*02:02
and HLA_B*14:01
. Their haplotype is therefore HLA_A*01:01-HLA_C*02:02-HLA_B*14:01
. Similarly, individual i2 has alleles HLA_A*02:01
, HLA_C*01:02
and HLA_B*07:02
and haplotype HLA_A*02:01-HLA_C*01:02-HLA_B*07:02
, and so on.
We will use the following R function (which requires the dplyr
and tidyr
packages) to transform the data frame from a count of alleles to a count of haplotypes:
library(dplyr)
library(tidyr)
## function to make data.frame with HLA haplotypes as rows
get_haps <- function(df, iids) {
## replace genotype with allele string
## the code below is much faster than dplyr::case_when()
df <- cbind(df[, "ID", drop = FALSE],
as.data.frame(sapply(df[, -1],
function(x) ifelse(x == 0, "", df[, 1]))))
## build haplotypes
haps <- lapply(df[, -1], function(x) paste(x, collapse = "-"))
## remove string of `-` at the beginning
haps <- lapply(haps, function(x) gsub("^(-)+", "", x))
## remove string of `-` at the end
haps <- lapply(haps, function(x) gsub("(-)+$", "", x))
## remove strings of `-` in the middle
haps <- lapply(haps, function(x) gsub("(-)+", "-", x))
haps <- unlist(haps)
## make 2-col df with each individuals haplotype as categorical variable
haps_df <- data.frame(IID = colnames(df)[-1],
haplotype = haps)
## transform haplotype categorical variable into series of binary variables
haps_df <- haps_df %>%
mutate(n = 1) %>%
pivot_wider(id_cols = IID,
names_from = haplotype,
values_from = n,
values_fill = 0) %>%
as.data.frame()
haps_df <- cbind(colnames(haps_df)[-1],
as.data.frame(t(unname(as.matrix(haps_df[, -1])))))
colnames(haps_df) <- c("ID", iids)
## sort haplotypes alphabetically
haps_df <- haps_df %>%
arrange(ID)
return(haps_df)
}
Calling this function, we get the following:
## make data.frame with haplotypes
hla_i1_haps <- get_haps(df = hla_i1,
iids = colnames(hla_i1)[-1])
> hla_i1_haps
ID i1 i2 i3 i4
1 HLA_A*01:01-HLA_C*02:02-HLA_B*14:01 1 0 0 0
2 HLA_A*01:01-HLA_C*07:01-HLA_B*13:02 0 0 0 1
3 HLA_A*02:01-HLA_C*01:02-HLA_B*07:02 0 1 0 0
4 HLA_A*03:01-HLA_C*02:02-HLA_B*08:01 0 0 1 0
We see that we have four different haplotypes (no two individuals have the same haplotype). Notice also that we got HLA_A*01:01-HLA_C*07:01-HLA_B*08:01
, part of the common haplotype for Europeans which we saw above.
Repeating this procedure for the second genotype, we get:
## make data.frame with haplotypes
hla_i2_haps <- get_haps(df = hla_i2,
iids = colnames(hla_i2)[-1])
> hla_i2_haps
ID i1 i2 i3 i4
1 HLA_A*01:01-HLA_C*03:03-HLA_B*14:01 1 1 0 0
2 HLA_A*01:01-HLA_C*07:01-HLA_B*13:02 0 0 0 1
3 HLA_A*11:01-HLA_C*03:03-HLA_B*14:01 0 0 1 0
Here we do get a repeated haplotype, HLA_A*01:01-HLA_C*03:03-HLA_B*14:01
, which is shared by both i1 and i2. The HLA_A*01:01-HLA_C*07:01-HLA_B*08:01
haplotype had already appeared in the data from the first genotype, while the other two are new. In total, we therefore have six unique haplotypes.
Combining the haplotypes from the two copies of chromosome 6
We must now make sure that the two data frames have the same haplotypes (the union of the two sets) in the same order and we can then simply sum them to obtain a final data frame of haplotypes. (In the data frames from one genotype only, each individual has a single haplotype; in the final data frame they will have two — these may or may not be identical in their two copies of chromosome 6.)
We start with a function which adds rows of zeros to a data frame with variant IDs that we must manually specify. We will use it to add to the first data frame rows of zeros with IDs corresponding to the haplotypes which are present only in the second data frame and vice versa.
## function to add rows of zeros to df
add_zeros <- function(df, row_names) {
## make data.frame of 0s with same colnames to rbind to df
add_df <- cbind(
row_names,
as.data.frame(
unname(
matrix(
0,
nrow = length(row_names),
ncol = ncol(df) - 1
)
)
)
)
colnames(add_df) <- colnames(df)
df <- rbind(df,
add_df)
return(df)
}
Now let’s find the union of the two sets of haplotypes:
all_haps_i <- sort(unique(c(hla_i1_haps$ID, hla_i2_haps$ID)))
> all_haps_i
[1] "HLA_A*01:01-HLA_C*02:02-HLA_B*14:01" "HLA_A*01:01-HLA_C*03:03-HLA_B*14:01"
[3] "HLA_A*01:01-HLA_C*07:01-HLA_B*13:02" "HLA_A*02:01-HLA_C*01:02-HLA_B*07:02"
[5] "HLA_A*03:01-HLA_C*02:02-HLA_B*08:01" "HLA_A*11:01-HLA_C*03:03-HLA_B*14:01"
Let’s add rows of zeros to the first data frame:
## add to i1
hla_i1_haps <- add_zeros(hla_i1_haps,
all_haps_i[!(all_haps_i %in% hla_i1_haps$ID)])
## sort haplotypes alphabetically
hla_i1_haps <- hla_i1_haps %>%
arrange(ID)
We get:
> hla_i1_haps
ID i1 i2 i3 i4
1 HLA_A*01:01-HLA_C*02:02-HLA_B*14:01 1 0 0 0
2 HLA_A*01:01-HLA_C*03:03-HLA_B*14:01 0 0 0 0
3 HLA_A*01:01-HLA_C*07:01-HLA_B*13:02 0 0 0 1
4 HLA_A*02:01-HLA_C*01:02-HLA_B*07:02 0 1 0 0
5 HLA_A*03:01-HLA_C*02:02-HLA_B*08:01 0 0 1 0
6 HLA_A*11:01-HLA_C*03:03-HLA_B*14:01 0 0 0 0
Notice that all haplotypes now appear, though the two that were only present in the second genotype have the value 0 on the full row.
We proceed similarly for the second data frame:
## add to i2
hla_i2_haps <- add_zeros(hla_i2_haps,
all_haps_i[!(all_haps_i %in% hla_i2_haps$ID)])
## sort haplotypes alphabetically
hla_i2_haps <- hla_i2_haps %>%
arrange(ID)
Finally, we confirm that both the sets and the order of the haplotypes and individuals is the same in both data frames so we can sum them:
## confirm both haplotypes and samples are in same order
if (!all.equal(hla_i1_haps$ID, hla_i2_haps$ID)) {
stop("Haplotype set or order is different.")
}
if (!all.equal(colnames(hla_i1_haps)[-1], colnames(hla_i2_haps)[-1])) {
stop("Sample IID set or order is different.")
}
## sum haplotype counts
hla_i_haps <- cbind(
hla_i1_haps[, "ID", drop = FALSE],
as.data.frame(
as.matrix(hla_i1_haps[, -1]) + as.matrix(hla_i2_haps[, -1])
)
)
And here is our final dataset! Each cell is now the total (i.e., non-phased) count of a particular haplotype for each individual in our data.
> hla_i_haps
ID i1 i2 i3 i4
1 HLA_A*01:01-HLA_C*02:02-HLA_B*14:01 1 0 0 0
2 HLA_A*01:01-HLA_C*03:03-HLA_B*14:01 1 1 0 0
3 HLA_A*01:01-HLA_C*07:01-HLA_B*13:02 0 0 0 2
4 HLA_A*02:01-HLA_C*01:02-HLA_B*07:02 0 1 0 0
5 HLA_A*03:01-HLA_C*02:02-HLA_B*08:01 0 0 1 0
6 HLA_A*11:01-HLA_C*03:03-HLA_B*14:01 0 0 1 0
Conclusion
In this post we’ve seen how to use R to transform a data frame with phased counts of HLA alleles into one with total counts of HLA haplotypes. We are now ready to take this dataset and use it directly for genetic association testing to find out if any of our haplotypes are associated with a particular phenotype.
Did I get something wrong? Do you have any other thoughts on my approach? I’d love to hear from you! Feel free to send me an email or reach out on Twitter.
-
This finding actually predates the genomic era and goes back to a landmark paper by Pericak-Vance et al. published in 1991 in the AJHG. ↩