HAPMIXMAP

a program to model HapMap haplotypes using tag SNP genotype data

Using HAPMIXMAP to test all loci in a candidate gene in which tag SNPs have been typed in cases and controls

back to HAPMIXMAP main page

Data files

The dataset consists of a case-control study in which 1000 individuals of European ancestry were typed at 9 SNPs in a candidate gene on chromosome 7. These 9 SNPs were chosen as tag SNPs, using the program TAGGER, based on an early version of the HapMap. There are four files in the 'data' directory:

  1. dn_loci.txt - a locusfile containing loci around the region typed.
  2. dn_genotypes.txt - a genotypesfile containing CEU (European) HapMap genotypes for the loci in the locusfile.
  3. CaseControlGenotypes - a ccgenotypesfile with genotypes for the case-control individuals.
  4. outcome.txt - a binary outcomevarfile containing case/control indicators for the typed individuals.

These files are already prepared for analysis by HAPMIXMAP. To produce them, the following steps are required. See here for details on file formats. 

All these steps, except the last, can be automated using the conversion tools supplied as follows:

Data Preparation

This is only a brief guide; more detailed instructions are given in the hapmixmapFormatter/README.txt file. Also supplied in the data directory is rawgenotypes.txt, which is the same as CaseControlGenotypes.txt but with genotypes coded as bases (e.g. A:G instead of "1,2"). It is necessary to supply them like this to ensure consistency of encoding with the HapMap genotypes. You will need an internet connection for the download step to work and you will need gzip available to unzip the downloaded files. The typed loci are on chromosome 7 so we need '-c=7' and the individuals are Europeans so we need the population code -p=CEU. We are downloading, unxipping and formatting so wee need the '-d', '-u' and '-f' flags. We will gve the output files different names so they don't overwrite the originals. You should find they are identical.

perl hapmixmapFormatter/getdata.pl -c=7 -p=CEU -d -f -u -genotypesfile=data/copyGenotypes.txt -locusfile=data/copyLoci.txt -gdata/rawgenotypes.txt -ccgfile=data/copyCCGenotypes.txt

 A full list of options for the script can be produced by typing:

perl hapmixmapFormatter/getdata.pl -h

You can also invoke the conversion program FPHD directly once you have downloaded and unpacked the HapMap data. A full list of options for this program can be produced by typing:

hapmixmapFormatter/FPHD -h

Running the analysis

To reduce the time taken for the program to converge, we undertake an initial run using only the phased HAPMAP haplotypes and save a single draw from the posterior distribution of model parameters. The saved values are used as initial values for the subsequent run in which the posterior distribution is generated given both HAPMAP haplotypes and diploid case-control genotypes at tag SNPs.  The instructions below assume that the hapmixmap executable and the R script, AdmixmapOutput.R, are in the same directory you are working in (as they will be if you have just unpacked the binary package or built from the source package).

For the first run, type

./hapmixmap training-initial.conf

This short run starts the sampler and stores a final draw from the sampled distribution of model parameters for use as initial values in the next run

When this short initial run has completed, type 

./hapmixmap training-resume.conf 

then type

R CMD BATCH --vanilla AdmixmapOutput.R results/Rlog.txt

to run the R script that processes the output files.  

Note the value of the posterior mean of the  "energy" (minus the log-likelihood of the model parameters given the data), which is written to the log and to the console.  

Repeat the training run by typing hapmixmap training-resume.conf
analysis until the energy is no longer falling with each new run.

then type

R CMD BATCH --vanilla AdmixmapOutput.R results/Rlog.txt

to run the R script that processes the output files.  

The results of each run are written to the folder results. Results of previous runs will be overwritten unless you rename the results directory or use the option 'deleteolddata=0'. 

To determine whether the program has been run with a long enough burn-in, examine the plot of the realized values of "energy" that has been written to the file EnergyTracePlot.ps.  If there is a downward trend, the burn-in was too short. Another check is to look at the Geweke diagnostics for the global model parameters in the file ConvergenceDiagnostics.txt:  For each parameter, the test compares the mean over the first 10% of iterations (after the burn-in) with the mean over the last 50% of iterations.

For the case-control analysis, type

./hapmixmap testing.conf

When this has completed, run the R script again

Interpretation of results of score tests

To view the output you will need a postscript viewer such as Ghostview.  

To determine whether the program has been run long enough for the score tests to be computed accurately, open the file TestAllelicAssociations.ps with a postscript viewer and examine the plot of log p-values based on all iterations so far.  If the program has been run long enough, the p-values will have stabilized: that is, the plot of cumulative results should be horizontal lines by the end of the run.

The file PosteriorQuantiles.txt contains the posterior means, medians and 95% credible intervals for the global model parameters. The number of arrivals per megabase (Mb) is inferred as about 60.  This is equivalent to an average haplotype block length of about 19 Kb (1000 / (60 x 7/8) ).  

To evaluate the extent to which our tag SNP panel has extracted information about associations with HapMap loci, we look at the file InfoExtractedMap.ps

At each locus, the proportion of information extracted is the ratio of observed information to complete information in the score test. This can be interpreted as a measure of the efficiency of the study design, compared with a design in which the locus has been typed directly.  The "information" here is Fisher information, defined as the expected value of minus the second derivative of the log-likelihood with respect to the parameter under test, which is the regression parameter b for the effect of allele 2 (coded as 0, 1, 2 copies).  The log-likelihood function is evaluated at the null value, b = 0, of the parameter under test, which is the effect of the locus in an additive regression model).  The peaks are the 9 typed loci, at which nearly 100% of information has been extraced, as one would expect.  For some of the untyped loci, some useful information has been extracted, but there is a large gap in coverage of the candidate gene region between about 0.11 and 0.13 Mb.   The information extracted falls off to zero beyond about 200 kb from the first and last typed loci. 

Next we can examine the haplotype structure of the region, shown in the file ArrivalRateMap.ps

The spikes of high arrival rates represent boundaries between haplotype blocks (sometimes interpreted as recombination hotspots).  There is a block boundary at about 0.11 Mb, and the tag SNPs do not adequately cover the region to the right of this boundary.  

The results of the tests for associations are contained in the file AllelicAssociationTestsFinal.txt.  This is most easily viewed using a spreadsheet program.  The table has one row per locus. The column labels are explained below

This table shows 180 lines of this file: the flanking regions with no information extracted have been omitted.
  

The test statistic and p-value are not evaluated where the proportion of information extracted is less than 10% - below this information level the asymptotic properties of the score test may not hold, and computational difficulties arise in calculating the test statistic accurately.  

There's some weak evidence of association in the region containing the first three typed loci, but it's clear that we need to type more SNPs to achieve adequate coverage of this candidate gene.  In the next section, we look at the results with an augmented dataset that has more typed SNPs.  

back to top