(Circulation. 2008;118:1593-1597.)
© 2008 American Heart Association, Inc.
Statistical Primer for Cardiovascular Research
From the Division of Biostatistics, Mayo Clinic, Rochester, Minn.
Correspondence to Karla V. Ballman, PhD, Division of Biostatistics, Mayo Clinic, Harwick 8, 200 First St SW, Rochester, MN 55905. E-mail ballman{at}mayo.edu
Key Words: gene expression; microarrays; genes; statistics
| Introduction |
|---|
The intent of the present article is to present an overview of the statistical components related to the design and analysis of a microarray experiment. In general, the usual statistical standards of scientific research apply to microarray studies. This article reviews common study objectives and designs, describes the typical analytical techniques, and concludes with a discussion of threats to the validity of a microarray study. Because articles in this statistics series have word-count restrictions that necessarily limit their breadth and depth, readers wanting more detail are directed to more comprehensive reviews by Allison et al,1 Dobbin and Simon,2 and Dupuy and Simon.3
| Study Objectives |
|---|
Class Comparison
Class comparison is also commonly known as differential analysis or analysis of differential expression. It involves the comparison of gene expression profiles of samples from distinct, predefined groups to identify the genes that are differentially expressed among the groups. Examples include comparing treated and untreated cell lines to ascertain the effect of a new drug on gene expression levels or a comparison of healthy tissue with diseased tissue to identify genes with altered expression. Such studies yield lists of genes that are significantly altered among the groups. The purpose of the lists is to provide insight into the underlying biological mechanisms and perhaps to identify potential therapeutic targets.
Class Prediction
A class prediction study develops a classifier, based on the expression levels of multiple genes, that can be applied to the expression profile of a newly acquired biospecimen to predict its (unknown) class. As in class comparison studies, the focus is to identify genes that differ across predefined classes. However, in class prediction studies, the gene expression values are explanatory variables rather than outcome variables. Furthermore, the goal of the analysis of class prediction studies is to identify a small set of genes able to accurately discern among the distinct classes; it is not to identify all genes that differ. There are many different possibilities for classes of interest. Class prediction studies can yield classifiers that are (1) diagnostic, eg, a classifier that distinguishes between 2 different disease states; (2) prognostic, eg, a classifier that distinguishes short-term survivors from long-term survivors; or (3) predictive, eg, a classifier that predicts whether a patient will respond to a particular drug.
Class Discovery
Class discovery differs from class comparison and class prediction studies in that the classes are not predefined. Typically, the objective of these studies is to determine whether subsets of samples with seemingly homogeneous phenotypes can be discerned on the basis of differences in their gene expression profiles. For example, there are many diseases for which individuals with seemingly similar phenotypes have a large variability in outcomes such as survival. The hypothesis is that differences at a molecular level are the cause of the observed variability in outcome. Class discovery studies also may uncover biological features of the disease that provide insight into the disease pathogenesis; in particular, identifying molecular differences that define disease subgroups may lead to improved therapeutic agents.
| Study Design |
|---|
Study Design Issues
Good study design minimizes potential bias, defined here as a systematic erroneous association of a variable with a group in a way that distorts the comparison between groups. As an example of bias, suppose, in a study designed to assess the ability of a multigene expression test to discern between groups of patients with and without left ventricular dysfunction, that the samples from individuals in the left ventricular dysfunction group have been stored an average of 7 years and those in the comparison group have an average storage time of 2 years. If the test is affected by tissue storage time, the bias could account for the ability of the test to distinguish between the groups.
To avoid bias, study groups should be created that differ only with respect to the variable of interest; they should be "equal" with respect to all other variables. An experimental method such as a randomized controlled trial provides the most effective way to create equal groups and to minimize bias. Often, the experimental method cannot be used to address research questions of interest in microarray studies such as diagnosis and prognosis. However, the principles of the experimental method can still be applied to these observational studies to minimize some common biases that arise in microarray studies.8 Note that the issue of bias is a relatively new challenge for laboratory researchers who are used to the tightly controlled conditions of experimental research. Designs for microarray studies should strive to create homogeneity across groups on all variables but the one of interest. In particular, consideration should be given to the equality of characteristics of individuals, specimen collection, handling, and storage. A mechanism to facilitate obtaining subject (biospecimen) homogeneity is the use of inclusion/exclusion criteria. Likewise, similar protocols should be used for the collection, storage, and handling of specimens for all groups. Applying similar protocols across groups poses a challenge because disease tissues often undergo different collection, handling, and storage procedures compared with normal tissues.
Sample Size
Microarray studies tend to be underpowered, meaning they have too few samples for detecting features of interest. The reason cited is often the large cost of the arrays. However, from a scientific perspective, it is essential to have adequate sample sizes. One simple, common approach in the setting of class comparison is to perform power and sample size calculations based on a 2-sample t test. Appropriately large samples are required to avoid missing important expression differentials. Other approaches to sample size estimation would be used for prediction or discovery settings.
The approach to sample size planning for a microarray class comparison study is the same as for any other medical research study and is based on 4 parameters: α, the level of significance or probability of a type I error; 1–β, the statistical power (β is the probability of a type II error); δ, the minimum detectable difference in gene expression levels (usually on a log base 2 scale); and σ, the SD of a gene expression level (usually on a log base 2 scale). The investigator sets the values of the first 3 parameters. Recommended values for α and β for gene expression microarray studies are α=0.001 and β=0.05. The small α value is to protect against multiple comparisons (more below); the small β value is to provide good statistical power for identifying genes that really are differentially expressed. The value of δ differs for each gene, and investigators often set δ=1, which corresponds to a 2-fold difference on a log base 2 scale. Finally, estimates are needed for the variation of expression values among samples. This variability depends on the heterogeneity of the biospecimens, with cell line samples being the least heterogeneous, human samples being the most heterogeneous, and animal model samples falling somewhere in between. Cell line samples are derived from multiple cultures of the same cell line, 1 per culture. Furthermore, the amount of variability depends on a particular gene.
Simon et al9 report that the average within-class expression level variability (log 2 scale) ranges from 0.10 for cell lines to 0.25 for animal model tissue to 0.5 to 0.75 for human tissues. The Table gives general guidelines for minimum sample sizes according to the numbers above and using a 2-sided t test. As microarray data become more abundant, estimates of σ should be based on previous experiments using similar biospecimens from similar populations.
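As a rough illustration of how the 4 parameters combine, the following Python sketch uses the standard normal approximation for a 2-sided, 2-sample comparison, n = 2[(z₁₋α/₂ + z₁₋β)σ/δ]² per group. This approximation is an assumption made here for simplicity; it is not necessarily the calculation behind the Table, and an exact t-test calculation would give somewhat larger values when n is small. The function name `samples_per_group` is hypothetical.

```python
import math
from statistics import NormalDist


def samples_per_group(alpha, beta, delta, sigma):
    """Approximate per-group sample size for a 2-sided, 2-sample comparison.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{1-beta}) * sigma / delta)^2.
    It tends to understate the exact t-test requirement when n is small.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the 2-sided test
    z_beta = z.inv_cdf(1 - beta)        # quantile matching power 1 - beta
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)


# Values quoted in the text: alpha = 0.001, beta = 0.05, delta = 1 (log2 scale).
print(samples_per_group(0.001, 0.05, 1.0, 0.50))  # 13 under this approximation
print(samples_per_group(0.001, 0.05, 1.0, 0.75))  # 28 under this approximation
```

Because n grows with the square of σ/δ, halving the detectable difference or doubling the SD roughly quadruples the required number of samples per group.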
Objectives of microarray experiments also include class prediction and class discovery. At the present time, there are no generally accepted methods for determining sample sizes for class prediction models; however, methods do exist in the literature such as that of Dobbin and Simon.10 The development of sample size determination for class discovery is currently an active area of research. It is generally acknowledged that these types of studies require larger sample sizes than a class comparison or class prediction study.
| Analysis |
|---|
Low-Level Analysis: Data Preprocessing
Data preprocessing (low-level analysis) consists of background subtraction and normalization. The purpose of background subtraction is to make the foreground intensity more properly proportional to the abundance of bound material (ie, cDNA). Whether to perform background subtraction involves a tradeoff between variance and bias. If no background subtraction is done, the variance in expression levels is the smallest, but the bias in terms of the measured intensity reflecting the amount of bound material is greatest. Performing background subtraction reduces the bias but increases the amount of variability in gene expression level. Some investigators choose not to perform any background correction because currently little is known of the biological effects of a given level of gene expression. Hence, having bias in the estimate of the absolute level of gene expression is of little consequence, and the gain is less variability, which translates to greater power. Global background correction techniques11,12 use the same constant to represent the background for all spots on the array. Global correction is easy to perform but does not take into account background variation that exists across the array. Regional background correction13–15 provides greater flexibility than global correction in terms of accounting for variation in background across the array.
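The difference between the 2 correction strategies can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm of any cited software: it assumes each spot already has a foreground intensity and a local background estimate, takes the median background as the global constant, and uses hypothetical function names and made-up intensities.

```python
from statistics import median


def global_correction(foreground, background):
    """Subtract one constant (here, the median background) from every spot."""
    constant = median(background)
    return [f - constant for f in foreground]


def regional_correction(foreground, background):
    """Subtract each spot's own local background estimate."""
    return [f - b for f, b in zip(foreground, background)]


# Hypothetical intensities for 5 spots; the background rises across the array.
fg = [520.0, 610.0, 480.0, 900.0, 750.0]
bg = [100.0, 120.0, 140.0, 160.0, 180.0]
print(global_correction(fg, bg))    # [380.0, 470.0, 340.0, 760.0, 610.0]
print(regional_correction(fg, bg))  # [420.0, 490.0, 340.0, 740.0, 570.0]
```

In this toy example the global constant undercorrects spots on the high-background side of the array and overcorrects those on the low-background side, which is the limitation that regional correction is meant to address.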
In addition to the biological variability among samples, other sources of variability or noise are introduced through the processes of biospecimen handling, RNA extraction, fluorescent labeling, hybridization to the array, and laser scanning to measure the intensities. Before the gene expression values between arrays are compared, the arrays typically are normalized to remove unwanted/systematic variability in expression levels for a gene across arrays induced by the process of RNA extraction, hybridization, and scanning while preserving the true biological variability. Many normalization procedures are based on the assumption that the expression levels of most genes do not change among the groups being compared. This is a reasonable assumption for most microarray experiments that use general-purpose arrays designed to interrogate all genes. There are 2 general types of normalization techniques: linear normalization16 and nonlinear normalization.17–19 Linear normalization uses the same normalization factor for all genes on the array. Nonlinear normalization determines the normalization factor for a spot on the basis of its level of expression. It is generally agreed that nonlinear normalization techniques are superior to linear normalization techniques.
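To make the distinction concrete, the sketch below contrasts a linear normalization (one constant per array, which on the log scale is a single shift equivalent to one multiplicative factor on the raw scale) with quantile normalization, used here as one example of a nonlinear method in which the adjustment depends on a gene's rank within the array. Both functions are minimal illustrations with hypothetical names and toy data; they assume log2 intensities, arrays of equal length, and no special handling of ties.

```python
from statistics import median


def linear_normalize(arrays):
    """Shift each array of log2 intensities by one constant so medians agree."""
    target = median(median(a) for a in arrays)
    return [[x - median(a) + target for x in a] for a in arrays]


def quantile_normalize(arrays):
    """Give every array the same distribution: the mean of the sorted arrays."""
    n_genes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: average of the k-th smallest value across arrays.
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays)
                 for k in range(n_genes)]
    result = []
    for a in arrays:
        order = sorted(range(n_genes), key=lambda i: a[i])  # gene indices by rank
        normalized = [0.0] * n_genes
        for rank, gene in enumerate(order):
            normalized[gene] = reference[rank]
        result.append(normalized)
    return result


# Toy data: 2 arrays, 3 genes each (log2 scale).
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
print(linear_normalize(arrays))    # [[5.5, 2.5, 3.5], [3.5, 0.5, 5.5]]
print(quantile_normalize(arrays))  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After linear normalization the arrays share a median but can still differ in spread; after quantile normalization they share the entire distribution, which relies on the assumption stated above that most genes do not change among the groups.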
Class Comparison
The goal of a class comparison study is to identify genes that are differentially expressed between prespecified classes. Tests used to identify genes as differentially expressed between groups need to account for variability. In early microarray studies, investigators selected genes with mean values that differed between groups by a predefined magnitude such as a 2-fold difference. This completely ignored the amount of variability in gene expression levels, greatly increasing the probability of declaring a gene differentially expressed when it truly is not.20 Methods that account for the magnitude of the difference in mean expression between groups and level of variability of gene expression within a group include the t test, permutation test, and Wilcoxon rank-sum test. The last 2 are used when there is concern that the gene expression levels are not normally distributed. If there are >2 groups, there are natural extensions of these tests: ANOVA, permutation F tests, and the Kruskal-Wallis test, respectively.
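The following Python sketch illustrates why a fold-change cutoff alone can mislead. Two hypothetical genes have the same 2-fold mean difference (1 unit on the log2 scale) but very different within-group variability. The data, function names, and the choice of an unequal-variance (Welch) t statistic with 2000 random permutations are all assumptions made for illustration.

```python
import random
from statistics import mean, variance


def welch_t(a, b):
    """t statistic: mean difference divided by its standard error."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se


def permutation_p_value(a, b, n_perm=2000, seed=0):
    """2-sided permutation P value based on the absolute t statistic."""
    rng = random.Random(seed)
    observed = abs(welch_t(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        if abs(welch_t(pooled[:len(a)], pooled[len(a):])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids P = 0


# Hypothetical log2 expression values; both genes differ by 1.0 on average.
gene_a_control = [7.9, 8.0, 8.1, 8.0, 7.9, 8.1]
gene_a_treated = [8.9, 9.0, 9.1, 9.0, 8.9, 9.1]
gene_b_control = [6.0, 10.0, 7.5, 8.5, 9.0, 7.0]
gene_b_treated = [7.0, 11.0, 8.5, 9.5, 10.0, 8.0]

for name, c, t in [("gene A", gene_a_control, gene_a_treated),
                   ("gene B", gene_b_control, gene_b_treated)]:
    print(name, round(welch_t(c, t), 2), permutation_p_value(c, t))
```

A 2-fold rule would select both genes, whereas the t statistic and the permutation P value should single out only the low-variability gene A as convincingly different.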
Class Prediction
There are numerous methods for class prediction analyses, and the number keeps increasing. Methods include logistic regression,21 linear and quadratic discriminant analysis,22 nearest neighbor classifiers,23 decision trees,24 shrunken centroids,25 neural networks,26 random forests,27 support vector machines,28 and many more. It has been shown that simpler methods tend to work better,29 likely because many of the more sophisticated methods, which can uncover even slight structure in the data, do not work well when the number of cases is small compared with the number of variables; they then tend to model the noise in the data in addition to, or rather than, the true structure.
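As an example of one of the simpler methods listed, the sketch below implements a k-nearest neighbor classifier in Python. It is a minimal illustration under stated assumptions: Euclidean distance, majority vote, k=3, and a made-up training set with 2 genes per profile and hypothetical class labels.

```python
import math
from collections import Counter


def knn_predict(train_profiles, train_labels, new_profile, k=3):
    """Predict a class by majority vote among the k nearest training profiles."""
    distances = sorted(
        (math.dist(profile, new_profile), label)
        for profile, label in zip(train_profiles, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical expression profiles (2 genes each) with known classes.
train = [[1.0, 5.0], [1.2, 4.8], [0.9, 5.1],
         [4.0, 1.0], [4.2, 0.8], [3.9, 1.1]]
labels = ["responder"] * 3 + ["nonresponder"] * 3

print(knn_predict(train, labels, [1.1, 4.9]))  # responder
print(knn_predict(train, labels, [4.1, 0.9]))  # nonresponder
```

The classifier has a single tuning parameter (k) and fits no model coefficients, which is one reason methods of this kind are less prone to modeling noise when samples are few.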
Class Discovery
The idea of class discovery is to discover patterns in microarray data. There are 2 types of problems in this category: identifying groups of coexpressed genes and finding patterns in the expression profiles of specimens, without predefined classes, drawn from a set of samples with similar phenotypes. The techniques used for class discovery are called unsupervised methods. Clustering algorithms are a primary tool for class discovery; their goal is to form subgroups so that the objects (specimens or genes) within a subgroup are more similar to one another than objects in different groups. There are 2 main types of clustering algorithms: hierarchical methods and partitional methods. Hierarchical algorithms derive a nested series of partitions of data points either by merging 2 clusters at each step (agglomerative hierarchical clustering) or by splitting a cluster into 2 clusters at each step (divisive hierarchical clustering). Partitional methods aim to produce a single partition of the items; there is no nesting. K-means clustering30 and self-organizing maps31 are both partitional methods.
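The merging step of agglomerative hierarchical clustering can be sketched as follows. This Python illustration assumes Euclidean distance and average linkage (the text does not specify either), stops when a requested number of clusters remains rather than recording the full nested series, and uses a hypothetical function name and toy profiles.

```python
import math


def agglomerative(points, n_clusters):
    """Merge the 2 closest clusters (average linkage) until n_clusters remain.

    Returns the final clusters as sorted lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]  # start: every point alone

    def linkage(c1, c2):
        # Average Euclidean distance over all cross-cluster pairs.
        total = sum(math.dist(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            ((x, y) for x in range(len(clusters))
             for y in range(x + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (a, b)] + [merged]
    return sorted(clusters)


# Hypothetical 2-gene expression profiles for 5 specimens.
profiles = [[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9], [1.1, 0.9]]
print(agglomerative(profiles, 2))  # [[0, 1, 4], [2, 3]]
```

Recording the order and height of each merge, instead of stopping early, would give the nested partitions that a dendrogram displays.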
The results of clustering algorithms are typically displayed graphically. In the case of hierarchical clustering, a tree structure or dendrogram is generated (see the Figure). Another popular display for hierarchical clustering results is a heat map (also called a color image plot). A heat map is a rectangular array of colored blocks with the block color representing the expression level of 1 gene on 1 array (specimen). Each column of blocks represents a specimen, and each row of blocks represents a gene. The columns are ordered according to how they are ordered in a hierarchical clustering dendrogram. Typically, the rows also are ordered according to the dendrogram that resulted from a cluster analysis of the genes (see the Figure). The results are patches of 2 different colors (typically red and green) indicating combinations of genes and specimens that exhibit high or low expression.
| Discussion |
|---|
A problem caused by chance that is often less familiar to investigators performing gene expression microarray studies is overfitting.34 Overfitting occurs when a multivariable model designed to generate a classifier that discriminates between groups of patients is made to fit the data perfectly. A problem occurs when a pattern is found that fits perfectly by chance. Given that thousands of genes are typically considered when developing a multigene classifier, there is a high likelihood that this will occur. Overfitting is not specific to microarray studies; it can occur whenever a multivariable analysis is done to assess associations between a large number of explanatory variables and an outcome with a relatively small sample size (especially when the number of explanatory variables exceeds the sample size). Overfitting can easily be checked for by assessing reproducibility in a completely independent group of individuals. Overfitting remains a problem in microarray studies mainly because of failure to adequately carry out such checking.
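A small simulation can make this concrete. In the Python sketch below, every expression value is random noise, so no gene truly distinguishes the 2 classes. The 20 genes with the largest apparent difference among 2000 are selected from 20 training samples, a nearest-centroid classifier is built on them, and the classifier is scored on both the training samples and 100 independent samples. All settings (numbers of genes and samples, the classifier, the random seed) are assumptions chosen for illustration.

```python
import random


def simulate_overfitting(n_genes=2000, n_per_class=10, n_selected=20,
                         n_test=100, seed=0):
    """Fit a nearest-centroid classifier to pure noise and score it twice.

    Returns (training accuracy, accuracy on independent samples).
    """
    rng = random.Random(seed)

    def noise_samples(n):
        # Every value is N(0, 1): no gene truly differs between classes.
        return [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n)]

    train = noise_samples(2 * n_per_class)
    train_labels = [0] * n_per_class + [1] * n_per_class
    test = noise_samples(n_test)
    test_labels = [i % 2 for i in range(n_test)]

    def class_mean(gene, label):
        values = [s[gene] for s, y in zip(train, train_labels) if y == label]
        return sum(values) / len(values)

    # Per-gene class centroids, then keep the genes that look most different.
    centroids = [(class_mean(g, 0), class_mean(g, 1)) for g in range(n_genes)]
    selected = sorted(range(n_genes),
                      key=lambda g: abs(centroids[g][1] - centroids[g][0]),
                      reverse=True)[:n_selected]

    def predict(sample):
        d0 = sum((sample[g] - centroids[g][0]) ** 2 for g in selected)
        d1 = sum((sample[g] - centroids[g][1]) ** 2 for g in selected)
        return 0 if d0 <= d1 else 1

    def accuracy(samples, labels):
        return sum(predict(s) == y for s, y in zip(samples, labels)) / len(labels)

    return accuracy(train, train_labels), accuracy(test, test_labels)


train_acc, test_acc = simulate_overfitting()
print(f"training accuracy: {train_acc:.2f}, independent accuracy: {test_acc:.2f}")
```

Under these settings the training accuracy should be close to 100% even though the data contain no signal, while the accuracy in the independent samples should be near the 50% expected by chance, which is the discrepancy that independent validation is designed to reveal.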
| Conclusions |
|---|
| Acknowledgments |
|---|
None.
| References |
|---|
2. Dobbin K, Simon R. Experimental design of DNA microarray studies. In: Jorde L, Little P, Dunn M, Subramaniam S, eds. Encyclopedia of Genetics, Genomics, Proteomics, and Bioinformatics. New York, NY: Wiley; 2005.
3. Dupuy A, Simon R. Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. J Natl Cancer Inst. 2007; 99: 147–157.
4. Ransohoff DF. Gene-expression signatures in breast cancer. N Engl J Med. 2003; 348: 1715–1717.
5. Mehta T, Tanik M, Allison DB. Towards sound epistemological foundations of statistical methods for high-dimensional biology. Nat Genet. 2004; 36: 943–947.
6. Marshall E. Getting the noise out of gene arrays. Science. 2004; 306: 630–631.
7. Ambroise C, McLachlan GJ. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci U S A. 2002; 99: 6562–6566.
8. Ransohoff DF. Bias as a threat to the validity of cancer molecular-marker research. Nat Rev Cancer. 2005; 5: 142–149.
9. Simon R, Radmacher MD, Dobbin K. Design of studies using DNA microarrays. Genet Epidemiol. 2002; 23: 21–36.
10. Dobbin K, Simon R. Sample size planning for developing classifiers using high dimension DNA microarray data. Biostatistics. 2007; 8: 101–117.
11. Brown CS, Goodwin PC, Sorger PK. Image metrics in the statistical analysis of DNA microarray data. Proc Natl Acad Sci U S A. 2001; 98: 8944–8949.
12. Statistical Algorithms Reference Guide. Santa Clara, Calif: Affymetrix; 2001.
13. Yang YH, Buckley MJ, Dudoit S, Speed TP. Analysis of cDNA microarray images. Brief Bioinform. 2001; 2: 341–349.
14. Jain AK, Tokuyasu TA, Snijders AM, Segraves R, Albertson DG, Pinkel D. Fully automatic quantification of microarray image data. Genome Res. 2002; 12: 325–332.
15. GenePix Pro 4.0 Users Guide. Foster City, Calif: Axon Instruments Inc; 2001.
16. Affymetrix Microarray Suite User Guide, Version 5. Santa Clara, Calif: Affymetrix; 2001.
17. Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 2002; 30: e15.
18. Bolstad BM, Irizarry RA, Astrand M, Speed TP. A comparison of normalization methods of high density array data based on variance and bias. Bioinformatics. 2003; 19: 185–193.
19. Ballman KV, Grill DE, Oberg AL, Therneau TM. Faster cyclic loess: normalizing RNA arrays via linear models. Bioinformatics. 2004; 20: 2778–2786.
20. Miller RA, Galecki A, Shmookler-Reis RJ. Interpretation, design, and analysis of gene array expression experiments. J Gerontol. 2001; 56A: B52–B57.
21. Cox DR. Analysis of Binary Data. London, UK: Methuen; 1970.
22. Johnson RA, Wichern DW. Applied Multivariate Statistical Analysis. 4th ed. Upper Saddle River, NJ: Prentice-Hall; 1999.
23. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, NY: Springer; 2001.
24. Breiman L, Friedman J, Stone C, Olshen R. Classification and Regression Trees. Belmont, Calif: Wadsworth; 1984.
25. Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci U S A. 2002; 99: 6567–6572.
26. Ripley BD. Pattern Recognition and Neural Networks. Cambridge, UK: Cambridge University Press; 1996.
27. Breiman L. Random forests. Machine Learning. 2001; 45: 5–32.
28. Vapnik V. Statistical Learning Theory. New York, NY: Wiley; 1998.
29. Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for classification of tumors using DNA microarrays. J Am Stat Assoc. 2002; 97: 77–87.
30. MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of Fifth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley, Calif: University of California Press; 1967: 281–297.
31. Kohonen T. Self-Organizing Maps. Berlin, Germany: Springer; 1997.
32. Chen MM, Ashley EA, Deng DXF, Tsalenko A, Deng A, Tabibiazar R, Ben-Dor A, Fenster B, Yang E, King JY, Fowler M, Robbins R, Johnson FL, Bruhn L, McDonagh T. Novel role for the potent endogenous inotrope apelin in human cardiac dysfunction. Circulation. 2003; 108: 1432–1439.
33. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc Ser B. 1995; 57: 289–300.
34. Ransohoff DF. Rules of evidence for cancer molecular-marker discovery and validation. Nat Rev Cancer. 2004; 4: 309–314.
35. Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the analysis of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst. 2003; 95: 14–18.