## HAPMIXMAP: a program to model HapMap haplotypes using tag SNP genotype data

HAPMIXMAP is a program for
modelling extended haplotypes in genetic association studies, similar to the
FASTPHASE program developed by Scheet
and Stephens (2006). The program models unphased genotype data on
unrelated individuals, and fits a model in which linkage disequilibrium is
generated by *K* independent Poisson arrival processes corresponding to *K*
modal haplotype states. This corresponds to the observation that typically
2-4 common haplotypes account for most of the allelic diversity in any haplotype
block, and that rarer haplotypes are typically slight variants of these modal
haplotypes. The block-like structure of haplotypes in the genome,
corresponding to ancestral recombination hotspots, is modelled by allowing the
arrival rate to vary across the genome. This model is similar to that used
in ADMIXMAP to model admixture between
populations, and most of the code in HAPMIXMAP is derived from ADMIXMAP.
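As a rough illustration of how Poisson arrivals generate linkage disequilibrium that decays with distance, the sketch below builds the transition matrix between modal haplotype states across one interval between adjacent loci. It assumes that arrivals occur at a total rate `rho` per unit distance and that each arrival redraws the state uniformly from the *K* states; the function name and parameterisation are hypothetical and are not taken from the HAPMIXMAP source code.

```python
import numpy as np

def transition_matrix(rho, distance, K):
    """K x K transition matrix between modal haplotype states over one interval.

    Assumption (not from the HAPMIXMAP code): arrivals occur at total rate
    `rho` per unit distance, and each arrival draws the new state uniformly
    from the K states.
    """
    p_no_arrival = np.exp(-rho * distance)
    # No arrival: the state is unchanged. Otherwise: redrawn uniformly.
    return p_no_arrival * np.eye(K) + (1.0 - p_no_arrival) / K * np.ones((K, K))

# A larger arrival rate (for example at a recombination hotspot) moves the
# matrix towards uniform, so haplotype state is less correlated across it.
T = transition_matrix(rho=2.0, distance=0.5, K=4)
```

Letting `rho` differ from one interval to the next is what would allow such a model to reproduce block-like haplotype structure.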

The program generates the posterior distribution of haplotypes across each chromosome, given the observed unphased genotype data. Score tests for association with an outcome variable are constructed by averaging over this posterior distribution. For a binary outcome variable (as in a case-control study) the program fits a logistic regression model. For a quantitative trait, the program fits a linear regression model, and for survival-time data the program fits a Cox regression model.

The program is intended to be used with HapMap genotype data: a dataset is available for a panel of 60 unrelated individuals in each of three continental groups. These genotype data can be combined with genotype data on the individuals under study (typically a case-control collection) at a subset of the HapMap loci. Usually this subset of HapMap loci will be tag SNPs (such as those on the Affymetrix or Illumina arrays). The program then models the haplotype structure of the population, using data from both the HapMap panel and the individuals under study, and generates the posterior distribution of genotypes at all untyped HapMap loci in the individuals under study.

Using HAPMIXMAP, any genetic association study using a panel of tag SNPs can be analysed as if all loci in the HapMap had been typed. The score test for association allows correctly for uncertainty in inference of the genotypes at untyped loci. It also yields, as a by-product, a measure of the efficiency of the study design in testing each locus (in comparison with a study in which the locus is typed directly). This measure is the ratio of observed to complete information in the score test. It can be used to evaluate the adequacy of a tag SNP panel, and to decide whether additional genotyping of untyped loci is likely to yield any extra information.
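The ratio of observed to complete information can be illustrated with the standard missing-data identity: the observed information equals the expected complete-data information minus the variance of the complete-data score over the posterior distribution of the missing genotypes. The sketch below applies this to a single locus with a binary outcome and an intercept-only null model. It is a minimal illustration under those assumptions, not the HAPMIXMAP implementation, and the function and argument names are hypothetical.

```python
import numpy as np

def score_test_from_samples(y, g_samples):
    """Score test at one locus, averaged over posterior genotype draws.

    y         : (n,) binary outcomes
    g_samples : (S, n) sampled genotypes (0/1/2), one row per posterior draw
    Returns (score, observed_info, ratio of observed to complete information).
    """
    y = np.asarray(y, dtype=float)
    g = np.asarray(g_samples, dtype=float)
    resid = y - y.mean()                      # residuals under the null model
    null_var = y.mean() * (1.0 - y.mean())    # Bernoulli variance under the null
    scores = g @ resid                        # complete-data score for each draw
    centred = g - g.mean(axis=1, keepdims=True)
    infos = null_var * (centred ** 2).sum(axis=1)  # complete-data information
    score = scores.mean()
    complete_info = infos.mean()
    missing_info = scores.var()               # variance of the score over draws
    observed_info = complete_info - missing_info
    return score, observed_info, observed_info / complete_info

y = [0, 1, 0, 1]
# Genotypes known exactly: every draw is identical, so no information is lost.
_, _, ratio_typed = score_test_from_samples(y, [[0, 2, 1, 2]] * 3)
# Genotypes uncertain: draws disagree, so the ratio falls below 1.
_, _, ratio_untyped = score_test_from_samples(y, [[0, 2, 1, 2], [0, 1, 1, 2]])
```

The test statistic would then be the squared score divided by the observed information. A ratio close to 1 suggests that typing the locus directly would add little.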

The main differences between HAPMIXMAP and FASTPHASE are:

1. A score test for association, and evaluation of the proportion of information extracted at each locus, is built into HAPMIXMAP. HAPMIXMAP includes a diagnostic test for residual linkage disequilibrium not accounted for by the model: this can be used to evaluate whether the number *K* of modal haplotype states is adequate.

2. HAPMIXMAP uses a more fully Bayesian approach, sampling the posterior distribution of model parameters, whereas FASTPHASE sets these parameters to their maximum likelihood values. HAPMIXMAP specifies a hierarchical model for the arrival rates (which determine the decay of LD over distance). The variance of the prior on the state-specific allele frequencies is allowed to vary across loci: this allows for some loci to be less informative for haplotype structure (possibly because of a higher "mutation rate") than others.

3. HAPMIXMAP specifies equal frequencies of the modal haplotype states and relies on specifying a value of *K* large enough for the observed haplotypes to be distributed uniformly across *K* modal states, whereas FASTPHASE allows the state frequencies to vary over the genome.

The computation time scales linearly with the number of loci and the number of individuals. The memory requirement scales linearly with the number of loci. We have found that the best performance is obtained on AMD x86_64 processors, using a commercial compiler (Pathscale) that exploits this architecture.

We estimate that to model all 6 million loci in the HapMap, HAPMIXMAP requires at least 20 min of CPU time per individual genotyped. Analysing a case-control collection of a few thousand individuals is therefore feasible only if you have access to a computing cluster with hundreds of CPUs, and at least 4 GB of memory per CPU. Using the serial version of the program, the workload can be spread across processors by splitting the genome into a few hundred chunks. A parallel version of the program, which will allow all loci to be modelled simultaneously, is under development.
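To make these figures concrete, the following sketch works through the arithmetic for a hypothetical study and shows one simple way to divide the loci into contiguous chunks. The study size (3000 individuals), cluster size (200 CPUs) and chunk count (300) are illustrative assumptions, and the helper functions are not part of HAPMIXMAP. Because 20 min per individual is a lower bound, the result should be read as a best case.

```python
def wall_clock_hours(n_individuals, cpu_min_per_individual=20.0, n_cpus=200):
    """Approximate wall-clock hours if the work divides evenly across CPUs."""
    total_cpu_minutes = n_individuals * cpu_min_per_individual
    return total_cpu_minutes / n_cpus / 60.0

def split_into_chunks(n_loci, n_chunks):
    """Return (start, end) locus index pairs, end exclusive, covering all loci."""
    base, extra = divmod(n_loci, n_chunks)
    bounds, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional locus each.
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds

# 3000 individuals x 20 min = 60000 CPU-minutes = 1000 CPU-hours;
# spread evenly over 200 CPUs this is 5 hours of wall-clock time.
hours = wall_clock_hours(3000)

# 6 million loci in 300 chunks gives 20000 loci per chunk.
chunks = split_into_chunks(6_000_000, 300)
```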

This section is in preparation. For enquiries, contact Paul McKeigue or David O'Donnell.