HAPMIXMAP

a program to model HapMap haplotypes using tag SNP genotype data

Introduction

HAPMIXMAP is a program for modelling extended haplotypes in genetic association studies, similar to the FASTPHASE program developed by Scheet and Stephens (2006). The program models unphased genotype data on unrelated individuals, and fits a model in which linkage disequilibrium is generated by K independent Poisson arrival processes corresponding to K modal haplotype states. This corresponds to the observation that typically 2-4 common haplotypes account for most of the allelic diversity in any haplotype block, and that rarer haplotypes are typically slight variants of these modal haplotypes. The block-like structure of haplotypes in the genome, corresponding to ancestral recombination hotspots, is modelled by allowing the arrival rate to vary across the genome. This model is similar to that used in ADMIXMAP to model admixture between populations, and most of the code in HAPMIXMAP is derived from ADMIXMAP.
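As a rough illustration of the arrival-process idea, the sketch below simulates the hidden modal state along one chromosome. It is not taken from the HAPMIXMAP source: the function names, the per-interval rates and the rule that each arrival draws a new state uniformly from the K states are assumptions made for this example.

```python
import math
import random

def transition_prob(i, j, distance, rate, k):
    """P(state j at the next locus | state i at this locus).

    Assumes arrivals form a Poisson process with the given rate per unit
    distance, and that each arrival picks one of the k states uniformly.
    """
    stay = math.exp(-rate * distance)  # probability of no arrival in the interval
    return stay * (1.0 if i == j else 0.0) + (1.0 - stay) / k

def simulate_states(distances, rates, k, rng):
    """Simulate the hidden state at each locus of one haplotype.

    distances[t] and rates[t] describe the interval between locus t and t+1;
    letting the rate vary by interval mimics hotspots between blocks.
    """
    state = rng.randrange(k)
    path = [state]
    for d, r in zip(distances, rates):
        if rng.random() >= math.exp(-r * d):  # at least one arrival occurred
            state = rng.randrange(k)          # redraw the state (may be unchanged)
        path.append(state)
    return path

if __name__ == "__main__":
    rng = random.Random(42)
    # Low rate within blocks, one high-rate interval standing in for a hotspot.
    rates = [0.1, 0.1, 0.1, 50.0, 0.1, 0.1]
    print(simulate_states([0.05] * 6, rates, k=4, rng=rng))
```

Under these assumptions, runs of loci separated by low-rate intervals tend to share a state, which is how a block-like pattern arises.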

The program generates the posterior distribution of haplotypes across each chromosome, given the observed unphased genotype data. Score tests for association with an outcome variable are constructed by averaging over this posterior distribution. For a binary outcome variable (as in a case-control study) the program fits a logistic regression model. For a quantitative trait, the program fits a linear regression model, and for survival-time data the program fits a Cox regression model.

The program is intended to be used with HapMap genotype data: a dataset is available for a panel of 60 unrelated individuals in each of three continental groups. These genotype data can be combined with genotype data on the individuals under study (typically a case-control collection) at a subset of the HapMap loci. Usually this subset of HapMap loci will be tag SNPs (such as those on the Affymetrix or Illumina arrays). The program then models the haplotype structure of the population, using data from both the HapMap panel and the individuals under study, and generates the posterior distribution of genotypes at all untyped HapMap loci in the individuals under study.

Using HAPMIXMAP, any genetic association study using a panel of tag SNPs can be analysed as if all loci in the HapMap had been typed. The score test for association allows correctly for uncertainty in inference of the genotypes at untyped loci. It also yields, as a by-product, a measure of the efficiency of the study design in testing each locus (in comparison with a study in which the locus is typed directly). This measure is the ratio of observed to complete information in the score test. It can be used to evaluate the adequacy of a tag SNP panel, and to decide whether additional genotyping of untyped loci is likely to yield any extra information.
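The ratio of observed to complete information can be illustrated with a small sketch. It assumes a binary outcome, a null model with an intercept only, and a set of posterior draws of the genotype at one locus; it takes the missing information to be the variance of the score over those draws, which is the usual construction for a score test averaged over a posterior but is stated here as an assumption rather than as HAPMIXMAP's exact computation. The function name and data layout are hypothetical.

```python
from statistics import mean, pvariance

def score_test(y, genotype_draws):
    """Score test at one locus, averaged over posterior genotype draws.

    y: list of 0/1 outcomes, one per individual.
    genotype_draws: list of draws; each draw is a list of allele counts
    (0, 1 or 2), one per individual.
    """
    n = len(y)
    p = sum(y) / n  # fitted probability under the intercept-only null model
    scores, complete = [], []
    for g in genotype_draws:
        g_bar = sum(g) / n
        # Score and information for the genotype effect, adjusted for the intercept.
        scores.append(sum((yi - p) * (gi - g_bar) for yi, gi in zip(y, g)))
        complete.append(p * (1 - p) * sum((gi - g_bar) ** 2 for gi in g))
    score = mean(scores)
    complete_info = mean(complete)
    missing_info = pvariance(scores)  # assumption: variance of the score over draws
    observed_info = complete_info - missing_info
    ratio = observed_info / complete_info if complete_info > 0 else float("nan")
    chi2 = score ** 2 / observed_info if observed_info > 0 else float("nan")
    return {"score": score, "observed": observed_info,
            "complete": complete_info, "ratio": ratio, "chi2": chi2}

if __name__ == "__main__":
    y = [1, 1, 0, 0]
    typed = [[2, 1, 1, 0]] * 2               # genotype known: every draw identical
    untyped = [[2, 1, 1, 0], [2, 2, 1, 0]]   # genotype uncertain: draws differ
    print(score_test(y, typed)["ratio"])
    print(score_test(y, untyped)["ratio"])
```

When every draw is identical, as for a directly typed locus, the score has no variance and the ratio is 1; the more the draws disagree, the further the ratio falls below 1.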

The main differences between HAPMIXMAP and FASTPHASE are:

1. A score test for association, and evaluation of the proportion of information extracted at each locus, is built into HAPMIXMAP. HAPMIXMAP includes a diagnostic test for residual linkage disequilibrium not accounted for by the model: this can be used to evaluate whether the number K of modal haplotype states is adequate.

2. HAPMIXMAP uses a more fully Bayesian approach, sampling the posterior distribution of model parameters, where FASTPHASE sets these parameters to their maximum likelihood values. HAPMIXMAP specifies a hierarchical model for the arrival rates (which determine the decay of LD over distance). The variance of the prior on the state-specific allele frequencies is allowed to vary across loci: this allows for some loci to be less informative for haplotype structure (possibly because of a higher "mutation rate") than others.

3. HAPMIXMAP specifies equal frequencies of the modal haplotype states and relies on specifying a value of K large enough for the observed haplotypes to be distributed uniformly across K modal states, where FASTPHASE allows the state frequencies to vary over the genome.

Computational requirements

The computation time scales linearly with the number of loci and the number of individuals. The memory requirement scales linearly with the number of loci. We have found that the best performance is obtained on AMD x86_64 processors, using a commercial compiler (Pathscale) that exploits this architecture.

We estimate that to model all 6 million loci in the HapMap, HAPMIXMAP requires at least 20 min of CPU time per individual genotyped. Analysing a case-control collection of a few thousand individuals is therefore feasible only if you have access to a computing cluster with hundreds of CPUs and at least 4 GB of memory per CPU. Using the serial version of the program, the workload can be spread across processors by splitting the genome into a few hundred chunks. A parallel version of the program, which will allow all loci to be modelled simultaneously, is under development.
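The arithmetic behind this estimate, and one way of cutting the loci into chunks, can be sketched as follows. The helper names are hypothetical, the 20 min figure is the lower bound quoted above, and the chunks here are simple contiguous index ranges; whether chunks should overlap or follow chromosome boundaries is not addressed.

```python
def cpu_hours(n_individuals, minutes_per_individual=20):
    """Lower bound on total CPU time, in hours, to model all HapMap loci."""
    return n_individuals * minutes_per_individual / 60

def chunk_bounds(n_loci, n_chunks):
    """Split loci 0..n_loci-1 into contiguous half-open ranges of near-equal size."""
    base, extra = divmod(n_loci, n_chunks)
    bounds, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first chunks
        bounds.append((start, start + size))
        start += size
    return bounds

if __name__ == "__main__":
    total = cpu_hours(3000)             # 3000 individuals at 20 min each
    print(total, "CPU hours in total")
    print(total / 200, "hours of wall time on 200 CPUs, if perfectly balanced")
    chunks = chunk_bounds(6_000_000, 300)
    print(len(chunks), "chunks; first:", chunks[0], "last:", chunks[-1])
```

For 3000 individuals this gives at least 1000 CPU hours, which is about 5 hours of wall time if the work is spread evenly over 200 CPUs.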

Downloading, installation and running the program

This section is in preparation. For enquiries, contact Paul McKeigue or David O'Donnell.