GWAS: STATISTICS AND BIOINFORMATICS TO ANALYSE GENOTYPE TO PHENOTYPE RELATIONSHIPS FROM MOLECULAR TRAITS TO DISEASE PHENOTYPES.
CHRISTOPH LIPPERT & OLIVER STEGLE
Large-scale genotyping and phenotyping initiatives open exciting avenues to unravel the genotype-to-phenotype map across a wide range of biological systems and species. In this tutorial, we provide a hands-on guide to carrying out GWAS and association mapping in practice, and also cover the necessary background in statistics and genetics.
First, we give a brief overview of related application domains, such as the genetic mapping of molecular traits, of disease susceptibility and of global-level traits in model systems. The main focus of the tutorial will be on methods that tackle the pressing statistical and computational challenges these studies pose, such as accounting for known and unknown confounding by population structure or environmental influences, and coping with the ever-increasing scale of the genomic datasets that need to be processed. Finally, practical components of the tutorial will cover software and workflows for major analysis tasks, motivated and demonstrated in the context of real-world examples.
Motivation
With the advent of cheap high-throughput profiling of large numbers of genomes, the statistical mapping of genetically regulated phenotypes has gained considerable importance and relevance. Genome-wide association studies (GWAS) can now be carried out even by smaller labs, which has led to an explosion in the number of studies conducted. Examples of GWAS successes include studies of rare and common human diseases (WTCCC 2007), the genetic mapping of gene expression profiles in human (Stranger et al. 2007) and the analysis of fitness traits in model systems such as yeast and Arabidopsis (Fournier-Level et al. 2011).
While data are now abundant, accurate and scalable statistical analysis has turned into a major bottleneck. For example, a GWAS generates millions of hypotheses, so the burden of multiple testing requires special consideration (Storey et al. 2003): the rate of false discoveries must be controlled while retaining sufficient statistical power to detect true genetic associations with single nucleotide polymorphisms (SNPs). One can begin to tackle the issue of power by incorporating prior information (e.g. Lee et al. 2009, Parts et al. 2011) or by using multivariate modelling (Yang et al. 2010). Power to detect SNPs with a true regulatory effect is also affected by confounding factors other than genotype. Both population structure between samples and technical or environmental influences can severely distort analysis results if they are not appropriately taken into consideration (Stegle et al. 2010, Listgarten et al. 2010). While the models used to understand genotype-to-phenotype relationships grow more complex, the size of the available data is steadily increasing, which demands fast and scalable algorithms (Kang et al. 2010, Zhang et al. 2010, Lippert et al. 2011).
This tutorial will focus on practical strategies to tackle these challenges, allowing for accurate and scalable GWAS analyses. We will introduce key applications of genotype-to-phenotype mapping, including the analysis of global-level phenotypes, molecular traits and disease phenotypes. Based on real-world examples, we then give a practical guide on how to use state-of-the-art methods and software to carry out meaningful analyses. At the same time, we aim to convey the theoretical background behind these methods.
Overall goals
This tutorial will provide hands-on instructions and examples of how to carry out genotype-to-phenotype analyses in the context of real-world examples. The specific outcomes and areas of instruction will be:
- Introduction to GWAS and analysis techniques for non-experts. The breadth of applications and the number of labs that are conducting GWAS motivate a general-audience tutorial that covers the key issues and hurdles in genome-wide analyses of genotype-phenotype relationships. The tutorial will cover the basic background of statistical testing using linear models, multiple testing challenges and the estimation of genome-wide significance levels.
- Accounting for major confounding effects. Confounding variation is a major challenge in almost any analysis of genomic data and will therefore be given particular emphasis. Participants will learn about the pitfalls and typical sources of confounding that lead to false positive results or loss of power, and about ways to circumvent them. The methodological focus will be on linear mixed models, which provide robust and flexible ways to deal with these matters and are a research focus of both presenters.
- Software protocols and practical hands-on guides on example studies. The theoretical discussion and tutorials will be interleaved with hands-on examples and use cases. We will use publicly available software and data to go through the major steps of GWAS analyses, including data preprocessing, imputation of genome data, quality filtering and demonstration of pitfalls. All materials and protocols will be made available so that participants can review them later and use them as a guideline for their own work.
An overall goal of the tutorial will be to foster knowledge of how to carry out meaningful genotype-to-phenotype analyses. At the same time, we aim to build links between ongoing methodological research and applied domains.
Prerequisites and intended audience
The tutorial is aimed at bioinformaticians and computational biologists who need to carry out genotype-to-phenotype analyses or would like to develop methods in this area in the future.
Participants will be required to have a broad bioinformatics background; however, no prior knowledge of statistical genetics or genotype-to-phenotype mapping is required. Basic knowledge of maths is advantageous, and users with R or Python knowledge will be able to use the supplied material directly during the tutorial to carry out their own analyses.
Tutorial Outline
09:00-10:30 Welcome, Introduction to GWAS
The first session will cover the basic background of GWAS for non-experts.
- Study designs: GWAS versus linkage mapping.
- Preprocessing: imputation, quality control
- Introduction to basic statistics, p-values, multiple-testing correction
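As a rough illustration of the multiple-testing topic above, the following minimal sketch computes a Bonferroni genome-wide threshold and Benjamini-Hochberg adjusted p-values with NumPy. It is an assumption-laden example rather than tutorial material: the function names are hypothetical, and Benjamini-Hochberg is used here as a simple stand-in for the q-value approach of Storey et al. 2003, which additionally estimates the proportion of true null hypotheses.

```python
import numpy as np


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted
```

SNPs whose adjusted value falls below a chosen false discovery rate (for example 0.05) would be reported; with one million tests, `bonferroni_threshold(0.05, 1_000_000)` gives the familiar order of magnitude of 5e-8.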
Practical component:
- Data loading in R / Python / PLINK
- Exploration of genotype data, biases, quality filtering
- First example GWAS results: human & Arabidopsis
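The quality-filtering step in the practical component could look roughly like the sketch below. It assumes a hypothetical in-memory genotype matrix (samples in rows, SNPs in columns, minor-allele counts 0/1/2, missing calls stored as NaN); the function name and the default thresholds are illustrative assumptions, not values prescribed by the tutorial.

```python
import numpy as np


def qc_filter(genotypes, min_maf=0.05, max_missing=0.05):
    """Return a boolean mask of SNPs (columns) that pass basic quality filters.

    genotypes: (n_samples, n_snps) float array of allele counts 0/1/2,
               with np.nan marking missing calls (assumed encoding).
    """
    G = np.asarray(genotypes, dtype=float)
    called = ~np.isnan(G)
    missing_rate = 1.0 - called.mean(axis=0)
    # Allele frequency among called genotypes; guard against all-missing columns.
    n_called = np.maximum(called.sum(axis=0), 1)
    freq = np.nansum(G, axis=0) / (2.0 * n_called)
    maf = np.minimum(freq, 1.0 - freq)
    return (missing_rate <= max_missing) & (maf >= min_maf)
```

The mask can then be applied as `genotypes[:, qc_filter(genotypes)]` before any association test is run.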
10:30-11:00 coffee
11:00-12:30 Introduction to linear models I
- Parameter inference in linear models
- Statistical testing with linear models
- Linear mixed models to correct for confounding variation
- Population structure
Practical component:
- Example GWAS using linear models. Comparison of F-test and alternatives.
- Evaluation of linear fits, pitfalls due to overfitting.
- Empirical evaluation of the importance of population structure correction.
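To make the linear mixed model topic of this session concrete, here is a minimal sketch of a single-SNP scan that corrects for population structure through a kinship matrix K. It is not the presenters' software. It only borrows two widely used ideas, in the spirit of Kang et al. 2010 and Lippert et al. 2011: rotate the data by the eigenvectors of K so that the model becomes a weighted linear regression, and estimate the variance ratio delta once under the null model. The grid search, the maximum-likelihood ratio test and all function names are simplifying assumptions.

```python
import math
import numpy as np


def _neg_log_lik(Uy, UX, S, delta):
    """Profile negative log-likelihood of y ~ N(X b, s2 (K + delta I)) after rotation."""
    n = Uy.shape[0]
    sw = 1.0 / np.sqrt(S + delta)  # square roots of the regression weights
    beta, *_ = np.linalg.lstsq(UX * sw[:, None], Uy * sw, rcond=None)
    resid = (Uy - UX @ beta) * sw
    sigma2 = resid @ resid / n  # maximum-likelihood variance estimate
    return 0.5 * (n * math.log(2.0 * math.pi * sigma2)
                  + np.sum(np.log(S + delta)) + n)


def lmm_scan(y, snps, K, deltas=np.logspace(-3, 3, 25)):
    """Test each SNP (column of snps) with delta fixed at its null-model estimate."""
    n = y.shape[0]
    S, U = np.linalg.eigh(K)
    S = np.clip(S, 0.0, None)  # remove tiny negative eigenvalues
    Uy = U.T @ y
    U1 = U.T @ np.ones((n, 1))  # rotated intercept
    Usnps = U.T @ snps
    # Null model (intercept only): pick delta by grid search.
    null_nll = [_neg_log_lik(Uy, U1, S, d) for d in deltas]
    best = int(np.argmin(null_nll))
    delta, nll0 = float(deltas[best]), null_nll[best]
    pvalues = np.empty(snps.shape[1])
    for j in range(snps.shape[1]):
        UX = np.column_stack([U1, Usnps[:, j]])
        lrt = max(2.0 * (nll0 - _neg_log_lik(Uy, UX, S, delta)), 0.0)
        # Survival function of a chi-square distribution with 1 degree of freedom.
        pvalues[j] = math.erfc(math.sqrt(lrt / 2.0))
    return pvalues, delta
```

Comparing these p-values with those from a plain linear regression on the same data is one simple way to evaluate empirically how much population structure correction matters.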
12:30-13:30 lunch
13:30-15:00 Introduction to linear models II
- Multi-locus models
- LASSO penalized models
Practical component:
- Illustration of complex genetic models, involving multiple SNPs.
- LASSO versus single-locus linear models
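The contrast between LASSO and single-locus models can be illustrated with a small coordinate-descent solver. This is a sketch under stated assumptions: the objective is (1/2n)·||y − Xb||² + lam·||b||₁, the phenotype is centred, SNP columns are centred (and ideally standardised), and the function names are hypothetical. A production analysis would more likely rely on an established package.

```python
import numpy as np


def soft_threshold(z, t):
    """Shrink z towards zero by t; values within [-t, t] become exactly zero."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()  # equals y - X @ b throughout
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # monomorphic SNP carries no information
            # Correlation of SNP j with the residual, with its own effect added back.
            rho = X[:, j] @ resid / n + col_sq[j] * b[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (b[j] - new_bj)
            b[j] = new_bj
    return b


def marginal_effects(X, y):
    """Single-locus least-squares effect of each SNP, ignoring all other SNPs."""
    return (X.T @ y) / (X ** 2).sum(axis=0)
```

With correlated SNPs, `marginal_effects` tends to assign non-zero effects to every marker in linkage with a causal variant, whereas `lasso_cd` sets many of them exactly to zero; the penalty `lam` controls how sparse the multi-locus model becomes.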
15:00-15:30 coffee
15:30-17:00 GWAS for high-dimensional phenotypes
- Accounting for hidden confounding in high-dimensional studies.
- Joint testing of multiple related traits at once.
Practical component:
- Example GWAS on gene expression, stress/non-stress condition.
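One simple way to approximate hidden confounders in a gene expression matrix is to estimate the leading principal components and regress them out before mapping. The sketch below uses plain PCA as a stand-in; it is not the probabilistic factor models of Stegle et al. 2010 or Parts et al. 2011, and the function names and the choice of the number of factors are assumptions for illustration.

```python
import numpy as np


def estimate_hidden_factors(expression, n_factors):
    """Leading principal components of a (samples x genes) expression matrix."""
    centred = expression - expression.mean(axis=0)
    U, s, _ = np.linalg.svd(centred, full_matrices=False)
    return U[:, :n_factors] * s[:n_factors]


def regress_out(expression, factors):
    """Residual expression after removing an intercept and the given factors."""
    n = expression.shape[0]
    X = np.column_stack([np.ones(n), factors])
    beta, *_ = np.linalg.lstsq(X, expression, rcond=None)
    return expression - X @ beta
```

The residuals returned by `regress_out` would then serve as phenotypes in the association scan. Note that removing too many components can also remove genuine genetic signal, in particular broad trans effects, so the number of factors is a judgment call.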
Christoph Lippert is a Researcher in the eScience Group at Microsoft Research, Los Angeles. His research interests lie at the interface between computational biology, statistics and machine learning, where he has worked on models for complex non-iid data with a focus on computational genomics and genetics. He developed highly efficient algorithms that enabled accurate correction for population structure and other types of confounding in GWAS on large cohorts. He has contributed open-source software for GWAS that is being applied in a considerable number of ongoing studies.
Oliver Stegle is currently a postdoctoral fellow at the Max Planck Institutes in Tuebingen, Germany; in autumn 2012 he will start his own lab at the EMBL-European Bioinformatics Institute in Hinxton, near Cambridge, UK. He previously received his PhD from the University of Cambridge for work on probabilistic machine learning and computational biology with David MacKay. He has broad interests in computational biology and genomics, including RNA-Seq processing, structure prediction from sequence and two-sample tests for high-dimensional phenotypes. Currently, his main research focus is statistical methods to analyse the genetics of molecular traits. Important contributions in this domain include models to account for hidden confounding factors in GWAS and approaches to link inference of regulatory networks with systems genetics, both of which have found applications in several high-profile studies including the 1000 Genomes Project.
Web page: www.stegle.info
References
- The Wellcome Trust Case Control Consortium, Nature 447, 661-678 (2007)
- B.E. Stranger, A.C. Nica et al., Nat Genet 39, 1217 (2007).
- A. Fournier-Level, A. Korte et al., Science (2011).
- J.D. Storey, R. Tibshirani, Proc Nat Acad Sci 100, 9440 (2003).
- S.-I. Lee, A.M. Dudley et al., PLoS Genet 5, e1000358 (2009).
- L. Parts, O. Stegle et al., PLoS Genet 7, e1001276 (2011).
- J. Yang et al., Nat Genet 42, 565–569 (2010)
- O. Stegle, L. Parts et al., PLoS Comput Biol 6, e1000770 (2010).
- J. Listgarten et al., Proc Nat Acad Sci 107, 16465 (2010).
- H.M. Kang, J.H. Sul et al., Nat Genet 42, 348–354 (2010).
- Z. Zhang, E. Ersoz, C.-Q. Lai et al., Nat Genet 42, 355–360 (2010).
- C. Lippert, J. Listgarten et al., Nat Meth 8, 833-835 (2011).