Considering haplotypes of candidate genes is the basis for an effective association study of complex diseases. A problem arises, however, because the observed data are composite genotypes. Haplotypes are not directly observed; they are estimated from genotype data. Generally, haplotypes are maximum likelihood estimates (MLE) obtained using, for example, an expectation-maximization (EM) algorithm. We implement a haplotype estimation method that considers pedigree structure and uses a partition-ligation procedure. If such estimated haplotypes are analyzed as if they were directly observed data, without considering the phase uncertainty, the type I error rate may be inflated and tests become anti-conservative. One solution is to use a likelihood approach that considers all possible haplotypes with their corresponding probabilities. Here we discuss an alternative approach using Monte Carlo testing, and introduce hapMC, a Java program that performs such analyses.

The key to the Monte Carlo procedure is to appropriately match the observed statistic and the simulated statistics that form the null distribution, so that a valid test of the correct size is maintained. This is achieved as follows. MLE haplotypes are established for the observed data, and the statistic of interest is calculated, ignoring the phase uncertainty. Haplotype frequencies are also estimated from the observed data, and these frequencies are used to assign haplotypes to individuals (independent of case/control status). This generates a null configuration of haplotype data. The known phase of the null configuration is then ignored: the data are treated as genotypes, MLE haplotypes are re-estimated from them, and the statistic of interest is calculated on the null data. Repeating this process produces a null distribution against which the observed statistic is tested.
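The step that generates a null configuration can be illustrated with a minimal Java sketch. It is not hapMC's code: it assumes unrelated individuals, biallelic loci coded 0/1, and independent draws of two haplotypes per individual, and it ignores pedigree structure entirely. The class name NullConfiguration and its methods are hypothetical, chosen only for illustration.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative sketch only; class and method names are hypothetical, not hapMC's API. */
final class NullConfiguration {

    /** Draws one haplotype index according to the given frequencies (assumed to sum to 1). */
    static int drawHaplotype(double[] freqs, Random rng) {
        double u = rng.nextDouble();
        double cumulative = 0.0;
        for (int h = 0; h < freqs.length; h++) {
            cumulative += freqs[h];
            if (u < cumulative) {
                return h;
            }
        }
        return freqs.length - 1; // guard against floating-point rounding
    }

    /**
     * Discards phase: collapses two haplotypes (strings of '0'/'1' alleles)
     * into an unphased genotype, stored as the count of '1' alleles per locus.
     */
    static int[] toGenotype(String hapA, String hapB) {
        int[] genotype = new int[hapA.length()];
        for (int locus = 0; locus < genotype.length; locus++) {
            genotype[locus] = (hapA.charAt(locus) - '0') + (hapB.charAt(locus) - '0');
        }
        return genotype;
    }

    /** Assigns two haplotypes to each individual, then keeps only the unphased genotypes. */
    static int[][] simulateGenotypes(String[] haplotypes, double[] freqs,
                                     int nIndividuals, Random rng) {
        int[][] genotypes = new int[nIndividuals][];
        for (int i = 0; i < nIndividuals; i++) {
            String a = haplotypes[drawHaplotype(freqs, rng)];
            String b = haplotypes[drawHaplotype(freqs, rng)];
            genotypes[i] = toGenotype(a, b);
        }
        return genotypes;
    }

    public static void main(String[] args) {
        // Made-up two-locus haplotypes and frequencies, for demonstration only.
        String[] haplotypes = {"00", "01", "10", "11"};
        double[] freqs = {0.4, 0.1, 0.1, 0.4};
        int[][] nullGenotypes = simulateGenotypes(haplotypes, freqs, 5, new Random(1));
        for (int[] g : nullGenotypes) {
            System.out.println(Arrays.toString(g));
        }
    }
}
```

In the full procedure, MLE haplotypes would then be re-estimated from these simulated genotypes before the statistic is calculated, so that the null statistics carry the same phase uncertainty as the observed one.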
Our program, hapMC, performs the necessary Monte Carlo procedure and EM estimation of haplotypes to provide valid haplotype tests for various standard association statistics, appropriately accounting for phase uncertainty. We also implement the use of pseudocontrols for family-based data.
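As a rough illustration of how a Monte Carlo test turns simulated null statistics into a p-value, the following Java sketch (written for Java 8 or later) counts how often a null statistic is at least as large as the observed one. It is an assumption-laden sketch, not hapMC's implementation: the class name MonteCarloTest is hypothetical, the null statistic is a placeholder supplier, and the "+1" correction is a common convention that hapMC is not confirmed to use.

```java
import java.util.Random;
import java.util.function.DoubleSupplier;

/** Illustrative sketch only; names are hypothetical, not hapMC's API. */
final class MonteCarloTest {

    /**
     * Empirical p-value: the proportion of null statistics at least as large
     * as the observed statistic. Adding 1 to the numerator and denominator
     * counts the observed statistic as one member of the null distribution,
     * so the p-value is never exactly zero.
     */
    static double empiricalPValue(double observed, DoubleSupplier nullStatistic, int nSims) {
        if (nSims <= 0) {
            throw new IllegalArgumentException("nSims must be positive");
        }
        int atLeastAsExtreme = 0;
        for (int i = 0; i < nSims; i++) {
            if (nullStatistic.getAsDouble() >= observed) {
                atLeastAsExtreme++;
            }
        }
        return (atLeastAsExtreme + 1.0) / (nSims + 1.0);
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        // Placeholder null statistic: a squared standard normal draw.
        // In the procedure described above, each draw would instead come from
        // simulating a null configuration, re-estimating MLE haplotypes, and
        // recomputing the association statistic.
        DoubleSupplier placeholder = () -> {
            double z = rng.nextGaussian();
            return z * z;
        };
        double p = empiricalPValue(3.84, placeholder, 10000);
        System.out.printf("empirical p-value = %.4f%n", p);
    }
}
```

Passing the null statistic as a supplier keeps the counting logic separate from however the null data are generated, which is the part that must mirror the treatment of the observed data.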
Instructions to run hapMC
1. Java 1.6 JRE must be installed on your system (Download here)
- To check if Java is installed go to a command prompt and type java.
- To check the Java version installed go to a command prompt and type java -version.
2. Download hapmc.jar
3. Create .rgen and .dat files. Examples can be found here.
- Note that the .rgen and .dat files can be placed anywhere on your system, but care must be taken to specify their locations correctly when you execute the program.
- In the simplest situation, the .rgen and .dat files are in the same directory as the .jar file. In this scenario, the .rgen file would have to specify the .dat file as being in the same directory (i.e. genotypedata="GenotypeData.dat").
- The .dat file should not contain any extra lines at the bottom. Extra lines will cause an error while the program is reading the data.
4. Go to the directory containing hapmc.jar and type java -jar hapmc.jar hapMC <.rgen file name>.
- If the .rgen file is in another directory, then it is necessary to specify that location in the command line.
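Because extra lines at the bottom of the .dat file cause a read error, it can help to check the file before running hapMC. The Java sketch below (Java 7 or later) counts trailing blank lines in a file. It assumes that "extra lines" means blank or whitespace-only lines after the last data record, which the instructions above do not spell out; the class name TrailingLineCheck is hypothetical and not part of hapMC.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

/** Illustrative helper only; not part of hapMC. */
final class TrailingLineCheck {

    /** Returns the number of blank (empty or whitespace-only) lines at the end of the list. */
    static int countTrailingBlankLines(List<String> lines) {
        int count = 0;
        for (int i = lines.size() - 1; i >= 0 && lines.get(i).trim().isEmpty(); i--) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java TrailingLineCheck <.dat file name>");
            return;
        }
        // Assumes the file is plain text readable as UTF-8.
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        int extra = countTrailingBlankLines(lines);
        if (extra == 0) {
            System.out.println("No trailing blank lines found.");
        } else {
            System.out.println(extra + " trailing blank line(s) found; remove them before running hapMC.");
        }
    }
}
```

The sketch only reports what it finds and leaves the file untouched, so the data can be corrected by hand.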
1. Pedigree phase configurations output with posterior probabilities and haplotype estimates.
- See examples for example output
- See rgen for explanation of parameters.
2. Mendelian inheritance error checking.
- Simple error checking for Mendelian inheritance discrepancies.
Running out of heap space?
- For larger datasets or a large number of Monte Carlo simulations (e.g. 80,000 - 100,000), the default Java Virtual Machine (JVM) memory allocation may not be sufficient. In this case, provided the system has enough memory, more can be allocated to the JVM by passing -Xms and -Xmx when executing the program. Example: java -Xms1024m -Xmx1536m -jar hapmc.jar hapMC <.rgen file name>. This example sets an initial (minimum) heap size of 1 GB and a maximum of 1.5 GB for the JVM to use while executing the program. The maximum memory allocation on 32-bit systems is 2 GB.
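To confirm that a heap setting is accepted on a given system before starting a long run, a small stand-alone class can print the maximum heap the JVM reports. This is a generic Java sketch, unrelated to hapMC's own code; the class name HeapCheck is hypothetical.

```java
/** Illustrative helper only; not part of hapMC. */
final class HeapCheck {

    /** Maximum heap the running JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMegabytes() + " MB");
    }
}
```

After compiling, running it as java -Xms1024m -Xmx1536m HeapCheck should print a value close to 1536 MB; the reported figure can be slightly lower than the -Xmx value, depending on the JVM. If the JVM refuses to start with the requested sizes, the system cannot provide that much memory and smaller values are needed.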