PCAtag Software for Selecting Tagging-SNPs using Principal Component Analysis

Welcome to PCAtag 2.1

To test the role of candidate genes comprehensively in association studies, the selection of informative SNPs is paramount.
Specifically, it is important to select tagging-SNPs (tSNPs) that represent a large portion (>90%) of the genetic variation of a gene.
Here we describe a new software tool, PCAtag, that performs tSNP selection using principal component analysis (PCA) as described in Horne and Camp (2004). The advantage of PCA for tSNP selection is that LD groups do not need to be contiguous and can overlap. This flexible framework does not impose over-simplified assumptions on the genetic architecture and likely fits reality better.

Algorithms used by PCAtag

  • Haplotypes are reconstructed with a Bayesian method by interfacing with the software fastPHASE (Stephens et al. 2006).
  • Principal Component Analysis (PCA) with a varimax rotation is performed by interfacing with the FactoMineR add-on package available in R. R is a language and environment for statistical computing (R Foundation for Statistical Computing, Vienna, Austria; ISBN 3-900051-07-0).
  • The procedure for determining LD groups and selecting tSNPs follows the two-step PCA method outlined in Horne and Camp (2004), extended to a multi-step PCA.
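
To illustrate how LD groups derived from PCA can overlap and be non-contiguous, here is a minimal Python sketch. It is not PCAtag's implementation: the function names, the loading threshold of 0.4, and the loading values are assumptions made for illustration, and picking the highest-loading SNP in each group is a simplified stand-in for the later PCA steps.

```python
def ld_groups_from_loadings(loadings, snp_names, threshold=0.4):
    """Form one LD group per rotated factor.

    loadings[i][k] is the loading of SNP i on factor k. A SNP joins every
    group whose factor it loads on with |loading| >= threshold, so groups
    may overlap and need not be contiguous along the gene.
    The threshold value is an assumption for this sketch.
    """
    n_factors = len(loadings[0])
    groups = []
    for k in range(n_factors):
        members = [
            (snp_names[i], abs(row[k]))
            for i, row in enumerate(loadings)
            if abs(row[k]) >= threshold
        ]
        groups.append(members)
    return groups


def pick_tsnps(groups):
    """Pick, per non-empty group, the SNP with the largest absolute loading.

    This is a simplification, not the published selection procedure.
    """
    return [max(members, key=lambda m: m[1])[0] for members in groups if members]


# Hypothetical rotated loadings for five SNPs on two factors (made-up values).
snps = ["snp1", "snp2", "snp3", "snp4", "snp5"]
loadings = [
    [0.92, 0.05],
    [0.10, 0.88],
    [0.85, 0.12],
    [0.55, 0.60],  # loads on both factors, so it belongs to both groups
    [0.08, 0.95],
]

groups = ld_groups_from_loadings(loadings, snps)
group_members = [[name for name, _ in g] for g in groups]
tsnps = pick_tsnps(groups)
print(group_members)
print(tsnps)
```

In this toy input, snp4 falls into both groups, and the first group (snp1, snp3, snp4) skips over snp2, which is the kind of structure a contiguous block definition could not represent.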

Novel Features

   Genotype Data:

  • Most tagging methods take genotype data as input and phase the data as part of selecting tagging SNPs. For example, many methods are based on pairwise allelic r². This r² measures allelic correlation, i.e. the co-occurrence of alleles on a haplotype. Calculating it amounts to phasing the pairwise genotype data into haplotype data and then computing the allelic correlations, which are used to identify tSNPs.
  • Our genotype option omits the phasing stage entirely and instead uses the correlations between the genotype calls themselves, i.e. the PCA is performed directly on the genotype calls.
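
The idea of correlating genotype calls without phasing can be sketched as follows. This is an illustration under stated assumptions rather than PCAtag's code: genotypes are assumed to be coded as allele counts (0, 1, 2), and the function names and sample data are hypothetical. The resulting correlation matrix is the kind of input a PCA would operate on.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (a monomorphic SNP would give a
    zero denominator and should be filtered out beforehand).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def genotype_correlation_matrix(genotypes):
    """Correlate SNPs directly on unphased genotype calls.

    genotypes[s][i] is the allele count (0, 1 or 2) of sample s at SNP i.
    """
    n_snps = len(genotypes[0])
    columns = [[row[i] for row in genotypes] for i in range(n_snps)]
    return [
        [pearson(columns[i], columns[j]) for j in range(n_snps)]
        for i in range(n_snps)
    ]


# Hypothetical data: six samples, three SNPs; SNPs 0 and 1 carry identical calls.
genotypes = [
    [0, 0, 2],
    [1, 1, 0],
    [2, 2, 1],
    [0, 0, 1],
    [1, 1, 2],
    [2, 2, 0],
]

corr = genotype_correlation_matrix(genotypes)
for row in corr:
    print([round(v, 3) for v in row])
```

No haplotypes are inferred at any point; the only inputs are the per-sample genotype calls.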

   Phenotype Data:

  • Allele frequencies, haplotype frequencies and LD structure may differ between cases and controls.
  • If phenotype data (or any dichotomous subset criterion) is entered, tagging is performed in the cases and the controls separately, as well as in the combined sample.
  • Knowledge of such differences at the tSNP selection stage allows for more powerful subsequent association analyses.
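
The stratified analysis described above can be sketched as a split of the samples into three sets, each of which would then be tagged independently. The coding of phenotypes (1 for cases, 0 for controls), the function names, and the data are assumptions for this example; allele frequencies stand in for the full tagging step to keep the sketch short.

```python
def split_by_phenotype(genotypes, phenotypes):
    """Return the three sample sets to analyse: cases, controls, combined.

    phenotypes[s] is assumed to be 1 for a case and 0 for a control.
    """
    cases = [g for g, p in zip(genotypes, phenotypes) if p == 1]
    controls = [g for g, p in zip(genotypes, phenotypes) if p == 0]
    return {"cases": cases, "controls": controls, "combined": list(genotypes)}


def allele_frequencies(genotypes):
    """Frequency of the counted allele at each SNP (genotypes coded 0/1/2)."""
    n_samples = len(genotypes)
    n_snps = len(genotypes[0])
    return [
        sum(row[i] for row in genotypes) / (2 * n_samples)
        for i in range(n_snps)
    ]


# Hypothetical data: six samples, two SNPs, three cases and three controls.
genotypes = [[0, 1], [1, 1], [2, 0], [0, 2], [0, 1], [1, 0]]
phenotypes = [1, 1, 1, 0, 0, 0]

strata = split_by_phenotype(genotypes, phenotypes)
freqs = {name: allele_frequencies(rows) for name, rows in strata.items()}
for name, values in freqs.items():
    print(name, [round(v, 3) for v in values])
```

In a real analysis, the same tagging routine would be run on each of the three sets, and differences between the resulting tSNP sets would point to differing LD structure in cases and controls.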

Version 2.1, Last update May 25, 2010