A repeated-measures design is one in which multiple, or repeated, measurements are made on each experimental unit. The experimental unit could be a person or an animal, and repeated measurements might be taken serially in time, such as in weekly systolic blood pressures or monthly weights. The repeated assessments might be measured under different experimental conditions. Repeated measurements on the same experimental unit can also be taken at a point in time. For example, it might be of interest to measure the diameter of each of several lesions within each person or animal in a study. The dependency, or correlation, among responses measured in the same individual is the defining feature of a repeated-measures design. This correlation necessitates a statistical analysis that appropriately accounts for the dependency among measurements within the same experimental unit, which results in a more precise and powerful statistical analysis.
Repeated-measures analysis encompasses a spectrum of applications, which in the simplest case is a generalization of the paired t test.1 A repeated-measures within-subjects design can be thought of as an extension of the paired t test that involves ≥3 assessments in the same experimental unit. Repeated-measures analysis can also handle more complex, higher-order designs with within-subject components and multifactor between-subjects components. The focus here is on within-subjects designs.
A completely randomized design is one in which each experimental unit (eg, person or animal) is assigned randomly to 1 of several competing treatments. For example, a study is proposed to compare 4 treatments (eg, a control and 3 distinct active treatments or a control and 3 different doses of the same treatment), and a sample of 20 animals is randomized to the 4 treatments. The randomization could be implemented by 1 of a number of possible techniques ranging from a simple randomization (in which a single sequence of the numbers 1 through 4 is produced, and animals are assigned according to the sequence) to a more involved randomization that uses stratification or permuted blocks.2 The permuted blocks strategy is used to ensure balance in the randomization process such that equal numbers of animals are assigned to each treatment. This strategy can be designed to ensure balance at specified enrollment points, for example, balance among the 4 treatments after randomization of 8 (2 per treatment) or 12 (3 per treatment) experimental units. This strategy is generally used when enrollment into a study occurs over time. With the permuted blocks strategy, 5 animals would be randomly assigned to each treatment in the present example.
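The permuted blocks strategy can be sketched in a few lines of Python. This is a minimal illustration rather than a validated randomization system; the function name, the treatment labels, and the seed are assumptions introduced here for the example.

```python
import random

def permuted_block_schedule(treatments, n_units, block_multiple=1, seed=None):
    """Build an assignment list from randomly permuted blocks.

    Each block holds every treatment `block_multiple` times, so the
    allocation is balanced after every complete block.
    """
    rng = random.Random(seed)
    block_size = len(treatments) * block_multiple
    if n_units % block_size != 0:
        raise ValueError("n_units must be a multiple of the block size")
    schedule = []
    for _ in range(n_units // block_size):
        block = list(treatments) * block_multiple
        rng.shuffle(block)  # random order within the block
        schedule.extend(block)
    return schedule

# Hypothetical labels: 1 control and 3 active treatments, 20 animals.
labels = ["control", "A", "B", "C"]
schedule = permuted_block_schedule(labels, n_units=20, seed=2024)
counts = {label: schedule.count(label) for label in labels}
print(counts)  # each treatment appears 5 times
```

With `block_multiple=2`, balance would instead be guaranteed after every 8 units (2 per treatment), which corresponds to one of the enrollment points mentioned above.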
The goal of the analysis is to compare responses among the 4 treatments. If the dependent or outcome variable is continuous, this test is performed with ANOVA.3 If the outcome variable is categorical, this test is performed with a χ2 test.4 These tests are based on the assumption that the measurements within and across treatments are independent or unrelated. If the experimental units are unrelated (ie, not family members or littermates), and 1 measurement has been made per unit, then this assumption is reasonable.
In contrast, in a repeated-measures design, multiple measurements are taken on each experimental unit. Consider again the application described above in which the goal of the analysis is to compare the 4 competing treatments. A repeated-measures design could involve 5 animals, each measured 4 times, once under each experimental condition. The repeated-measures design involves a smaller number of animals, which is both efficient and ethically appealing.
If 5 animals are measured under each of 4 different experimental conditions, a total of 20 measurements will again be available for analysis. The 20 measurements, however, are not independent but are related within the subjects. Because the measurements might be affected by within-subject characteristics (eg, age or genetic factors), statistical tests that properly account for within-subject correlation are needed. If we assume that measurements taken in the same individual are correlated, the test for a difference in treatments will involve a smaller residual or error variance than that based on a completely randomized design, thereby increasing precision in the analysis.
A randomized block design is one in which a set of experimental units is organized into homogeneous groups or blocks on the basis of a characteristic assumed to affect the outcome. The goal is to have r replicates of each of k treatments in each of b blocks, with the total sample size n=kbr. Consider again the study comparing 4 competing treatments (k=4). Suppose the outcome of interest is known to be affected by age. With n=20 independent experimental units, these might be organized into 5 age groups (eg, quintiles of age) with 1 replication per group (k=4, b=5, r=1). In a randomized block design, experimental units within each block are randomly assigned to treatments, and this technique reduces variation due to differences in age. The design can be thought of as replications of a completely randomized experiment in which there are as many replications as there are blocks.
Some repeated-measures designs can be viewed as a special case of the randomized block design in which the block is the individual experimental unit (eg, person or animal). The randomized block design is often used with siblings or littermates. The family unit is the block, and assessments are repeated on each member of the family. The assessments within a family or litter are related. Accounting for the dependencies within the block results in a more precise test of treatment differences.
Repeated-measures analysis can be used to assess changes over time in an outcome measured serially or to test for differences in 1 or more treatments based on repeated assessments in the same subjects. The simplest application has 1 within-subjects factor (eg, each of n subjects is measured under k distinct experimental treatments), and the goal of the analysis is to test for a difference in experimental treatments. This is achieved by constructing a test statistic as the ratio of the variance due to the treatments to the residual or error variance. In repeated-measures analysis, the total variance can be partitioned into variance between subjects and variance within subjects. Variance between subjects reflects individual subject differences. Variance within subjects consists of 2 components, differences between treatments and error or residual variation. The test statistic for testing the null hypothesis of equality of means is the ratio of the variation due to treatments to the residual variation, after between-subject variation has been removed. The components of variance and the test statistic are illustrated in Figure 1. The details of the computations are illustrated in example 1.
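The partition described above can be written out directly. The following Python sketch computes the sums of squares and the F statistic for a balanced design with 1 within-subjects factor; the function name is an assumption made for illustration, and the scores are hypothetical values, not the data of Table 1.

```python
def rm_anova_oneway(data):
    """One-way repeated-measures ANOVA for balanced data.

    data: list of subjects, each a list of k measurements
    (one per treatment or time point).
    """
    n = len(data)        # number of subjects
    k = len(data[0])     # number of repeated measurements
    grand = sum(sum(row) for row in data) / (n * k)
    subject_means = [sum(row) / k for row in data]
    condition_means = [sum(row[j] for row in data) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    # Between-subjects variation: individual subject differences.
    ss_subjects = k * sum((m - grand) ** 2 for m in subject_means)
    # Within-subjects variation splits into treatments and residual error.
    ss_treatments = n * sum((m - grand) ** 2 for m in condition_means)
    ss_error = ss_total - ss_subjects - ss_treatments

    df_treatments = k - 1
    df_error = (n - 1) * (k - 1)
    ms_treatments = ss_treatments / df_treatments
    ms_error = ss_error / df_error
    return {
        "ss_total": ss_total,
        "ss_subjects": ss_subjects,
        "ss_treatments": ss_treatments,
        "ss_error": ss_error,
        "df_treatments": df_treatments,
        "df_error": df_error,
        "ms_error": ms_error,
        "F": ms_treatments / ms_error,
    }

# Hypothetical scores: 3 subjects, 5 time points (NOT the article's Table 1).
scores = [
    [30.0, 28.0, 25.0, 22.0, 24.0],
    [34.0, 31.0, 27.0, 24.0, 27.0],
    [32.0, 29.0, 27.0, 23.0, 25.0],
]
result = rm_anova_oneway(scores)
print(result["df_treatments"], result["df_error"], round(result["F"], 1))
```

With only 2 repeated measurements per subject, the F statistic from this function equals the square of the paired t statistic, which is the sense in which the method generalizes the paired t test.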
An animal study is performed to assess transgene activation in the heart. The primary outcome is percent fractional shortening, which is measured at baseline and again after 2, 4, and 6 weeks of treatment. A final assessment is made after 6 weeks of treatment and 2 weeks off treatment. At baseline, the mice were 12 weeks of age; thus, at subsequent assessments, they were 14, 16, 18, and 20 weeks of age. Three mice completed the protocol, and the data on percent fractional shortening measured at each time point are shown in Table 1. The research hypothesis is that the mean percent fractional shortening scores are different over time. The means at each time point are shown in the bottom row of Table 1 and decrease over time. To test for a significant difference in means over time, a repeated-measures ANOVA is used. The results of the repeated-measures ANOVA are contained in Table 2. The test statistic for equality of means over time is F=95.4 (df=4,8), which is highly statistically significant at P<0.0001. Thus, a highly statistically significant difference exists in the mean percent fractional shortening over time.
Suppose that these same data were incorrectly analyzed as if they were derived from a completely randomized design. The results of the ANOVA, testing for a difference in means over time, are contained in Table 3. If the data are treated incorrectly as 15 independent observations and analyzed with ANOVA, the F statistic is F=45.1 (df=4,10), which is still highly statistically significant. Notice the difference in the error or residual variation between methods. The denominator of the F statistic for testing differences in means over time is the mean square error. In the repeated-measures ANOVA, the mean square error is 6.1 compared with 12.9 in the ANOVA that assumed independence. In this particular example, the incorrect analysis still produced a significant result. In other applications, failure to account for the dependencies among observations could result in a nonsignificant finding. The repeated-measures ANOVA that appropriately accounts for dependencies in the data produces a more precise test.
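The relationship between the 2 analyses can be checked approximately from the rounded values quoted above. In a balanced design, the error sum of squares from the analysis that assumes independence equals the between-subjects sum of squares plus the repeated-measures error sum of squares. The figures below are therefore approximate reconstructions from the rounded F statistics and mean squares, not values taken from Tables 2 and 3.

```latex
\begin{align*}
MS_{\text{time}} &= F \times MS_{\text{error}}:
  \quad 95.4 \times 6.1 \approx 582, \qquad 45.1 \times 12.9 \approx 582 \\
SS_{\text{error (independence)}} &= 12.9 \times 10 = 129 \\
SS_{\text{error (repeated measures)}} &= 6.1 \times 8 = 48.8 \\
SS_{\text{subjects}} &\approx 129 - 48.8 = 80.2 \qquad (df = 10 - 8 = 2)
\end{align*}
```

The numerator of the F statistic is the same in both analyses; only the denominator changes, because roughly 80 units of between-subject variation are removed from the error term at a cost of 2 degrees of freedom.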
If a significant difference is found, it may be of interest to test for differences between pairs of treatments or, in example 1, between specific time points. These tests should be handled with a multiple comparison procedure that again appropriately handles the correlation in the data and also controls the type I error rate (see Larson3 and Cabral5 for details).
Repeated-Measures Analysis With Repeated Measures on 1 Factor
A popular extension of the one-way repeated-measures ANOVA is the two-factor ANOVA with repeated measures on 1 factor. In this application, a treatment group (eg, medical versus surgical treatment, treatment versus placebo, or challenged versus unchallenged) is often used, and different subjects are assigned to each treatment group, but the outcome is again measured repeatedly over time. The goal is to compare the treatments with respect to differences in the outcome. The treatment factor is a between-subjects factor and has no repeated measures. However, repeated assessments are taken on each subject within each treatment over time, and thus, the time factor must be handled appropriately in the analysis. The procedure again partitions the variation to produce F statistics to test the hypotheses of equality of outcomes between treatments and equality of outcomes over time. The variance is partitioned as shown in Figure 2, and the following tests of hypothesis are performed. The first test is a test for treatment effect. This is done by constructing an F statistic as the ratio of the treatment variation to the error variation due to subjects within treatments (Figure 2). The second test is for differences in outcomes over time, the repeated factor. This is again performed by constructing an F statistic. The F statistic for differences over time is based on the ratio of time variation to error or residual variation. Because this is a two-factor design, an interaction between the treatment and time factors (ie, a different effect of treatment over time) may also exist, and this is tested by constructing an F statistic as the ratio of the treatment-by-time variation to the error or residual variation (Figure 2). Some investigators first test the treatment and time effects and then perform a test for interaction, whereas others first test for an interaction and then test for treatment and time effects.
If a statistically significant interaction exists, the treatment effect is different over time, and therefore, the tests for an overall treatment effect and an overall time effect do not completely explain differences in outcome (see Kleinbaum et al6 for more details).
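The 3 F statistics described above can be computed from the partitioned sums of squares. The sketch below assumes a balanced design (the same number of subjects in each treatment group and the same number of time points for every subject); the function name is an assumption, and the blood pressure values are hypothetical, not the data of Table 4.

```python
def split_plot_anova(groups):
    """Two-factor ANOVA with repeated measures on 1 factor (balanced data).

    groups: list of treatment groups; each group is a list of subjects;
    each subject is a list of t repeated measurements.
    """
    a = len(groups)          # treatment groups
    n = len(groups[0])       # subjects per group
    t = len(groups[0][0])    # time points
    values = [x for g in groups for s in g for x in s]
    grand = sum(values) / len(values)

    group_means = [sum(x for s in g for x in s) / (n * t) for g in groups]
    time_means = [sum(s[j] for g in groups for s in g) / (a * n) for j in range(t)]
    cell_means = [[sum(s[j] for s in g) / n for j in range(t)] for g in groups]
    subject_means = [[sum(s) / t for s in g] for g in groups]

    ss_total = sum((x - grand) ** 2 for x in values)
    ss_treat = n * t * sum((m - grand) ** 2 for m in group_means)
    ss_subj = t * sum((subject_means[i][s] - group_means[i]) ** 2
                      for i in range(a) for s in range(n))
    ss_time = a * n * sum((m - grand) ** 2 for m in time_means)
    ss_inter = n * sum((cell_means[i][j] - group_means[i] - time_means[j] + grand) ** 2
                       for i in range(a) for j in range(t))
    ss_error = ss_total - ss_treat - ss_subj - ss_time - ss_inter

    df_treat, df_subj = a - 1, a * (n - 1)
    df_time, df_inter = t - 1, (a - 1) * (t - 1)
    df_error = a * (n - 1) * (t - 1)
    ms_subj = ss_subj / df_subj      # error term for the treatment test
    ms_error = ss_error / df_error   # error term for time and interaction
    return {
        "F_treatment": (ss_treat / df_treat) / ms_subj,
        "df_treatment": (df_treat, df_subj),
        "F_time": (ss_time / df_time) / ms_error,
        "df_time": (df_time, df_error),
        "F_interaction": (ss_inter / df_inter) / ms_error,
        "df_interaction": (df_inter, df_error),
    }

# Hypothetical systolic pressures: 3 subjects per arm, 4 time points.
treated = [[150.0, 142.0, 136.0, 130.0],
           [148.0, 141.0, 133.0, 128.0],
           [155.0, 146.0, 140.0, 132.0]]
placebo = [[152.0, 150.0, 147.0, 146.0],
           [149.0, 148.0, 146.0, 143.0],
           [156.0, 153.0, 151.0, 150.0]]
anova = split_plot_anova([treated, placebo])
print(anova["df_treatment"], anova["df_time"], anova["df_interaction"])
```

Note that the treatment effect is tested against the variation among subjects within treatments, whereas the time effect and the interaction are tested against the residual variation.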
These data can be analyzed in several different ways. An important issue is the appropriate specification of the nature of the correlations between measurements in the same person, called the covariance structure. Most statistical computing packages offer a variety of covariance structures for these types of analysis, and the covariances must be modeled correctly. Three structures are very popular and fit many applications. The first is called “compound symmetry” and assumes that the correlations between all pairs of measures are the same. This may be reasonable for a repeated-measures study in which each subject is measured under k different experimental conditions. The second is called “autoregressive of order 1,” or AR(1), and assumes that the correlations between adjacent pairs are greater than the correlations between more distant pairs. This may be reasonable for data measured serially in time, whereby more proximal measures are more highly correlated than measures taken more distantly in time. For this structure, the time points should be approximately equally spaced in time. A third popular structure is called “unstructured,” and as the name implies, it assumes that each pair of measurements has its own correlation. Although the latter might seem appealing, it actually produces a less powerful analysis because the data first must be used to assess the correlation structure and then to perform the primary analyses. Some statistical computing packages (eg, SAS, SAS Institute, Cary, NC) offer metrics to determine which structure best fits the data. One such measure is the Akaike information criterion, with which smaller values indicate a better fit. As in all statistical analyses, it is important to plan and implement parsimonious models that are biologically sensible. An example of a two-factor ANOVA with repeated measures on 1 factor is contained in example 2.
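To make the 3 covariance structures concrete, the following Python sketch builds the implied correlation matrices for compound symmetry and AR(1) and counts the variance and covariance parameters each structure requires when variances are assumed equal over time. The function names and the value of the correlation are assumptions chosen for illustration.

```python
def compound_symmetry(k, rho):
    """Correlation matrix with the same correlation for every pair."""
    return [[1.0 if i == j else rho for j in range(k)] for i in range(k)]

def ar1(k, rho):
    """AR(1) correlation matrix: correlation decays as rho**lag."""
    return [[rho ** abs(i - j) for j in range(k)] for i in range(k)]

def n_covariance_parameters(structure, k):
    """Number of variance/covariance parameters to be estimated."""
    if structure == "compound symmetry":
        return 2                   # 1 variance, 1 common covariance
    if structure == "AR(1)":
        return 2                   # 1 variance, 1 correlation
    if structure == "unstructured":
        return k * (k + 1) // 2    # every variance and covariance
    raise ValueError("unknown structure")

k, rho = 4, 0.5  # 4 time points; rho = 0.5 is an arbitrary illustrative value
for row in compound_symmetry(k, rho):
    print(row)
for row in ar1(k, rho):
    print(row)
for name in ("compound symmetry", "AR(1)", "unstructured"):
    print(name, n_covariance_parameters(name, k))
```

With 4 time points, the unstructured form requires 10 parameters compared with 2 for either of the other structures, which is why it uses more of the information in the data before the primary analysis is performed.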
A randomized, placebo-controlled study is performed to estimate the short-term effects of an antihypertensive medication on systolic blood pressure. Subjects are randomly assigned to receive either the treatment or a placebo. Systolic blood pressures are measured before the first dose of treatment is administered (baseline) and again at 2, 4, and 6 weeks after the initiation of treatment (or placebo). The study involves 6 participants, 3 of whom are randomly assigned to each treatment arm; the data on systolic blood pressure measured at each time point are shown in Table 4. The research hypothesis is that the mean systolic blood pressures are different between treatments.
Figure 3 displays the mean systolic blood pressures over time for participants undergoing treatment and given a placebo. The mean systolic blood pressures decreased over time in both groups, with a sharper decrease in the treatment group. The results of the two-factor ANOVA with repeated measures are contained in Table 5. The test statistic for equality of treatment means (averaged over time) is F=36.1 (df=1,4), which is highly statistically significant at P=0.0039. Thus, a highly statistically significant difference is present in mean systolic blood pressures between patients given the antihypertensive medication and those given placebo. The test for a difference in mean systolic blood pressures over time is also highly statistically significant [F=27.1 (df=3,12), P=0.0001]. The test for the interaction between treatment and time is marginally significant [F=3.2 (df=3,12), P=0.0626]. This test assesses the homogeneity of the difference in mean blood pressures between the treatment and placebo groups over time. Figure 3 shows that the difference in means is widening over time, which is driving the test for interaction to approach statistical significance.
The analyses reported in Table 5 assume equal correlations between measurements (ie, compound symmetry). An alternative analysis for these data would be an autoregressive correlation structure in which correlations between measures taken closer together in time are higher than those measured more distantly. If we assume an AR(1) covariance structure, the test statistic for equality of treatment means (averaged over time) is F=12.7, which is significant at P=0.0235. The test for a difference in mean systolic blood pressures over time is highly statistically significant (F=30.4, P=0.0001), and the test for the interaction between treatment and time is significant (F=3.7, P=0.0423). The Akaike information criterion is 102.7 for the compound symmetry model and 98.0 for the AR(1) model. Because smaller values indicate better fit, the AR(1) model is a better choice for these data.
Estimates of treatment effect are provided in Table 6 for both the model that assumes compound symmetry and the model that assumes an AR(1) covariance structure. Notice that the estimates of the treatment effect are the same; however, the standard errors are different, which affects the significance of the difference.
Alternative Approaches to Analysis of Repeated-Measures Data
When repeated measures have been taken on each experimental unit, several approaches to the statistical analysis are possible. Thinking again of the two-factor ANOVA with repeated measures on 1 factor, a simple approach to handling the correlation among repeated measures in the same person involves computing mean scores for each person over time. In example 2, this would reduce the sample sizes to n1=3 and n2=3, and the test for treatment differences could be performed with the unpaired t test. Using the data in example 2, this would produce t=6.0, P=0.0039, which indicates that a significant difference is present in mean systolic blood pressures between groups. This t test is based on only 3 observations per group and 1 observation per participant (the mean systolic blood pressure over time). This approach is analytically correct but does not take full advantage of the data and is much less powerful than the repeated-measures approach. A second alternative is to assess treatment differences at each time point. In example 2, this translates to conducting 4 unpaired t tests, 1 at each observation point. This approach is also inefficient because it does not allow for any assessment of trend over time. In addition, this approach increases the likelihood of a false-positive result due to multiple statistical testing.5,7 The most efficient approach is to explicitly account for the dependency in the data by use of repeated-measures techniques, and this can be done in many different ways.
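The first alternative, reducing each participant to a single mean and comparing groups with an unpaired t test, can be sketched as follows. The function name and the small data set are hypothetical and chosen only so that the arithmetic is easy to follow; they are not the data of example 2.

```python
from math import sqrt
from statistics import mean, variance

def summary_measure_t(group1, group2):
    """Unpaired (pooled-variance) t test on per-subject means.

    group1, group2: lists of subjects, each a list of repeated measurements.
    Returns the t statistic and its degrees of freedom.
    """
    means1 = [mean(subject) for subject in group1]  # 1 value per subject
    means2 = [mean(subject) for subject in group2]
    n1, n2 = len(means1), len(means2)
    pooled = ((n1 - 1) * variance(means1) + (n2 - 1) * variance(means2)) / (n1 + n2 - 2)
    t_stat = (mean(means1) - mean(means2)) / sqrt(pooled * (1 / n1 + 1 / n2))
    return t_stat, n1 + n2 - 2

# Hypothetical data: 3 subjects per group, 2 measurements each.
group_a = [[1, 3], [3, 5], [5, 7]]   # subject means 2, 4, 6
group_b = [[0, 0], [1, 1], [2, 2]]   # subject means 0, 1, 2
t_stat, df = summary_measure_t(group_a, group_b)
print(round(t_stat, 3), df)
```

Because each participant contributes only 1 value, the degrees of freedom depend on the number of participants and not on the number of repeated measurements, and nothing can be learned about the time course of the outcome.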
Assumptions and Analytic Details
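An important assumption in repeated-measures analysis is sphericity, which requires that the variances of the differences between all pairs of repeated measurements be equal; it is often described more loosely as homogeneity of variances over time. Most statistical computing packages offer tests for sphericity. If the assumption is violated, then mixed models can be used to explicitly address differences.8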
A number of statistical computing packages are available that offer procedures for repeated-measures ANOVA. Within these packages, several options are available for conducting the tests. SAS, for example, offers several procedures that can handle repeated-measures data. Careful attention must be paid to the data layout, the specification of factors (eg, as fixed or repeated), the appropriate error terms for test statistics, and the nature of the correlations between observations measured in the same individual (ie, the covariance structure). Littell et al7,8 provide a detailed approach to using SAS for repeated-measures analysis.
Davis RB, Mukamal KJ. Hypothesis testing: means: statistical primer for cardiovascular research. Circulation. 2006; 114: 1078–1082.
Larson MG. Analysis of variance. Circulation. 2008; 117: 115–121.
D’Agostino RB, Sullivan LM, Beiser AS. Introductory Applied Biostatistics. Belmont, Calif: Brooks/Cole; 2004.
Cabral HJ. Multiple comparisons procedures. Circulation. 2008; 117: 698–705.
Kleinbaum DG, Kupper LL, Muller KE. Applied Regression Analysis and Other Multivariable Methods. 2nd ed. Boston, Mass: PWS-Kent; 1988.
Littell RC, Milliken GA, Stroup WW, Wolfinger RD. SAS System for Mixed Models. Cary, NC: SAS Institute Inc; 1996.