Rank Score Tests
Nonparametric statistical methods are useful tools for data analysis when there is reason to believe that the outcome variables of interest may fail certain distributional assumptions required for parametric methods. Variables may be ordered categories in nature and thereby not suitable for analysis methods that assume normally distributed variables, such as t tests or analyses of variance and covariance. Variables may also be metric or continuous but subject to excessive variability or the presence of outliers. When the research hypothesis involves comparing a sample of subjects under 2 conditions or at 2 time points or comparing 2 samples of subjects with respect to an outcome variable of interest, then univariate nonparametric methods based on rank score tests can be invoked. A study design feature such as random assignment of conditions or treatments is typically all that is required for these methods to be valid. Furthermore, the methods can be quite powerful under a number of alternatives, particularly those involving shifts in the median.
The Wilcoxon signed rank test, the Spearman rank correlation coefficient, and the Wilcoxon rank sum test are among the most commonly used nonparametric tests and cover a variety of research questions. These tests are described here. Although the focus is on hypothesis testing, related methods for estimation of confidence intervals are also presented. Extensions of nonparametric methods to handle stratification and covariate adjustment are also described. Scenarios in which nonparametric methods may be most useful and the power they can be expected to yield are discussed. The methods are illustrated with data from a clinical trial assessing the impact of exposure to low levels of carbon monoxide on exercise capacity in patients with ischemic heart disease.
Wilcoxon Signed Rank Test
When the response variable of interest is a metric measurement that follows a symmetrical distribution with substantial variability or outliers, then the Wilcoxon signed rank test^{1,2} is a useful test for differences between paired samples of subjects or paired conditions on a sample of subjects. For a sample of size n, let i=1,…,n identify the subjects in the sample, let X_{i} and Y_{i} denote the observed values of the outcome variable under the 2 conditions or at the 2 time points of measurement for the ith subject, and let d_{i} denote the difference (X_{i}−Y_{i}). If Δ represents the median of the distribution of difference values {d_{i}}, then the null hypothesis is that Δ=0. That is, the null hypothesis is the hypothesis that the true, underlying difference between the 2 conditions is zero. The following steps describe the calculation of the test statistic to assess the null hypothesis against the alternative of a shift in location associated with one of the conditions or time points:
Differences {d_{i}} are formed for each pair of observations on each subject.
The ranks of the absolute values of the nonzero differences are computed, with ties assigned the average of the applicable ranks (called midranks); zero differences are ignored.
The signs of the differences are computed as −1 or 1.
Signed ranks are then computed by multiplying the sign of the difference by the corresponding rank.
The test statistic is calculated by dividing the sum of the signed ranks U by the square root of the sum of squares of the signed ranks S to form U/S; S is the standard deviation of U.
The null hypothesis that the median difference is zero is assessed by comparing the test statistic U/S to critical values of the standard normal distribution for large sample sizes (eg, n≥20) or by tabulating the exact critical region for small sample sizes. Computation of P values from critical values for both the normal approximation and the exact distribution is available through commercial statistical software packages (eg, SAS Proc Univariate^{3} and StatXact^{4}).
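The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure implemented in SAS or StatXact: the function names (`midranks`, `signed_rank_statistic`, `two_sided_normal_p`) are hypothetical, and only the large-sample normal approximation is shown, not the exact test.

```python
import math

def midranks(values):
    """Rank values 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1  # mean of the 1-based ranks start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks

def signed_rank_statistic(x, y):
    """Return (U, S, U/S) for paired observations x and y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are ignored
    ranks = midranks([abs(d) for d in diffs])        # rank the absolute differences
    signed = [r if d > 0 else -r for d, r in zip(diffs, ranks)]
    u = sum(signed)                                  # sum of signed ranks
    s = math.sqrt(sum(v * v for v in signed))        # root of sum of squared signed ranks
    return u, s, u / s

def two_sided_normal_p(z):
    """Two-sided P value from the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# Toy data (not from the article): one pair has a zero difference.
u, s, z = signed_rank_statistic([10, 12, 9, 15, 11], [8, 12, 10, 11, 6])
print(u, round(s, 3), round(two_sided_normal_p(z), 4))
```

The normal approximation is used here only to show the arithmetic; with a sample this small, the exact distribution would be the appropriate reference.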
When there are no ties among the observed differences and no differences equal to zero, then the signed rank test statistic simplifies to a commonly used form. If T denotes the sum of the positive ranks, then U=2T−n(n+1)/2 and
$$S=\sqrt{\frac{n(n+1)(2n+1)}{6}}.$$
The significance of U/S is determined by comparison with a standard normal distribution or by computation of the exact critical region, as above.
If it is assumed that the distribution of the {d_{i}} is continuous and symmetrical, a point estimate of the median difference Δ is given by
$$\hat{\Delta}=\operatorname{median}\left\{\frac{d_{i}+d_{j}}{2},\ 1\le i\le j\le n\right\}.$$
The n(n+1)/2 quantities involved here are the n differences and their n(n−1)/2 pairwise averages. Furthermore, a confidence interval can be constructed about Δ via the methods of Hodges and Lehmann on the basis of the exact distribution of T.^{1,5,6}
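A minimal Python sketch of this point estimate follows. The function name `hodges_lehmann_paired` is hypothetical, and the sketch covers only the point estimate, not the confidence interval, which requires the exact distribution of T.

```python
from statistics import median

def hodges_lehmann_paired(diffs):
    """Median of the n(n+1)/2 averages (d_i + d_j)/2 with i <= j.

    The case i == j contributes the n differences themselves; the case
    i < j contributes the n(n-1)/2 pairwise averages.
    """
    n = len(diffs)
    averages = [(diffs[i] + diffs[j]) / 2 for i in range(n) for j in range(i, n)]
    return median(averages)

# Toy differences (not from the article)
print(hodges_lehmann_paired([2, -1, 4, 5]))
```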
The asymptotic relative efficiency (ARE) is a useful way to compare a nonparametric test with its parametric counterpart. Briefly, the ARE can be defined as the ratio of sample sizes required by the 2 statistics to achieve the same power under a certain distributional assumption.^{7} For a test of paired samples or paired conditions on a sample of subjects, the paired t test would be the parametric test of choice. The ARE of the Wilcoxon signed rank test relative to the paired t test is at least 0.864 in the entire class of continuous symmetrical distributions and equals 3/π≈0.955 when the differences {d_{i}} follow a normal distribution.^{5}
Example Dataset
The methods described above are illustrated with an example from cardiovascular research. Example data are from a clinical trial designed to assess the impact of low levels of exposure to carbon monoxide (CO) on exercise tolerance in patients with ischemic heart disease.^{8} A total of 42 nonsmoking patients with documented obstructive coronary artery disease and a history of exercise-induced ischemia were enrolled in this 2-period crossover study. The study period consisted of 3 days, a training day and 2 exposure days during which patients were exposed to either air or CO in an environmentally controlled chamber. On all 3 days, patients followed a bicycle exercise protocol in which exercise was conducted at increasing work loads until angina, fatigue, or hypertension occurred. On the first or training day of the study, patients became familiar with the environmentally controlled chamber and conducted a training exercise on the bicycle. Patients demonstrating exercise-induced ischemia during training were then randomly assigned to one of 2 exposure sequences, exposure to air followed by carbon monoxide (Air:CO) and the reverse (CO:Air). Cardiac function and exercise capacity were measured on each day after exposure in the chamber.
A total of 30 patients (8 women and 22 men) successfully completed training and were randomized to exposure sequence. The outcome variable used for illustration here is duration of exercise (seconds) after the exposure condition, provided in Table 1 for each patient. This outcome variable typically has a somewhat skewed distribution and can be subject to outliers, and therefore the use of nonparametric methods for hypothesis testing is particularly appealing. The order variable groups the patients according to order of exposure (1=CO first and 2=Air first). Therefore, 16 patients were exposed to CO on the first day and exposed to Air on the second day, and 14 patients were exposed to Air on the first day, followed by CO on the second day. The baseline measure corresponds to the duration of exercise recorded on the training day before randomization.
The Wilcoxon signed rank test can be used to test for differences between duration of exercise under the 2 exposures. First, differences are formed between the 2 exercise times, and the absolute values of the nonzero differences are ranked across the 30 patients. Differences of zero are ignored, and midranks are used in the case of ties. Signs are then applied to indicate which differences are <0 or >0, corresponding to a decrease and an increase in exercise time, respectively. Table 2 provides the rank matrix for the example data. In this example, the sum of the positive ranks is T=166.5, the sum of the signed ranks is U=123, the square root of the sum of squares of the signed ranks (the standard deviation of U) is S=53.56, and the test statistic is U/S=2.296. The exact P=0.0198, and the approximate P=0.0217, both indicating that the null hypothesis of a zero median for the difference between exposure conditions in exercise times is rejected. Subjects were able to exercise for significantly longer periods of time after exposure to Air than after exposure to CO. The Hodges-Lehmann point estimate for the median difference is 54.0 seconds with a 95% confidence interval of 15.5 to 110.0.
Spearman Rank Correlation Coefficient
The Spearman rank correlation coefficient is a measure of association between 2 variables that is particularly useful when 1 or both variables are either ordered categorical or are continuous but from a highly skewed distribution. The coefficient can be used to test the null hypothesis of no association between the 2 variables versus the alternative hypothesis of an association. To compute the Spearman rank correlation coefficient, the values of each of the 2 variables are first ranked, with the use of midranks in the case of ties, and the standard Pearson correlation coefficient is then computed with the use of these ranks. Let i=1,…,n identify the n subjects in the sample, and let X_{i} and Y_{i} denote the variable values for the ith subject. Let R_{i} and S_{i} denote the ranks of X_{i} and Y_{i}, respectively, with means \bar{R} and \bar{S}. Then the Spearman rank correlation coefficient is given by the following equation:
$$r_{S}=\frac{\sum_{i=1}^{n}(R_{i}-\bar{R})(S_{i}-\bar{S})}{\sqrt{\sum_{i=1}^{n}(R_{i}-\bar{R})^{2}\,\sum_{i=1}^{n}(S_{i}-\bar{S})^{2}}}.$$
A test of significance for the association between the 2 variables of interest (X and Y) is given by (n−1)r_{S}^{2}, which is approximately χ^{2} distributed with 1 degree of freedom, when the 2 variables are independent and thereby have no association (ie, the null hypothesis is true).^{9}
The Spearman rank correlation coefficient is appropriate for both ordered categorical and continuous variables. The computations are valid with the use of midranks, and therefore ties with respect to either variable can be accommodated. Critical values of the test statistic can be computed with the large-sample χ^{2} approximation when sample sizes are large (eg, n≥40) and through tabulation of the critical regions of the exact distribution, when sample sizes are small. The statistical procedures SAS Proc FREQ^{3} and StatXact^{4} both provide exact probability levels for the Spearman rank correlation test.
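The computation described in this section can be sketched in Python as the Pearson correlation of midranks, followed by the large-sample test based on (n−1)r_{S}^{2}. The function names (`midranks`, `pearson`, `spearman`, `chi_square_1df_p`) are hypothetical, and the sketch covers only the χ^{2} approximation, not the exact test.

```python
import math

def midranks(values):
    """Rank values 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1  # mean of the 1-based ranks start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks

def pearson(a, b):
    """Standard Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    abar, bbar = sum(a) / n, sum(b) / n
    num = sum((u - abar) * (v - bbar) for u, v in zip(a, b))
    den = math.sqrt(sum((u - abar) ** 2 for u in a) * sum((v - bbar) ** 2 for v in b))
    return num / den

def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the midranks."""
    return pearson(midranks(x), midranks(y))

def chi_square_1df_p(stat):
    """Upper-tail P value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

# Toy data (not from the article)
x, y = [1, 2, 3, 4, 5], [2, 1, 4, 3, 5]
r_s = spearman(x, y)
print(r_s, chi_square_1df_p((len(x) - 1) * r_s ** 2))
```

The P value helper relies on the fact that a χ^{2} variable with 1 degree of freedom is the square of a standard normal variable.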
Example Dataset, Continued
In the example study, it may be of interest to determine whether the differences in exercise times under the 2 exposure conditions vary with baseline values. If baseline appears to be correlated with differences in exercise times, then an analysis that adjusts for baseline differences among subjects may be warranted. The Spearman rank correlation coefficient can be applied to assess this correlation. To compute the correlation coefficient, the differences in exercise times between the 2 conditions require ranking, irrespective of zero values. Midranks are again used in the case of ties. The ranked values are provided in Table 3.
The Spearman rank correlation coefficient for baseline by differences in exercise times is −0.1093 with P=0.5564, indicating no significant association between these 2 variables. A logical follow-up question is whether baseline is correlated with exercise times after either exposure condition. The Spearman rank correlation coefficient between the baseline value and exercise time after exposure to Air is 0.7843 (P<0.0001) and between baseline and exercise time after exposure to CO is 0.8234 (P<0.0001). Baseline values are therefore strongly associated with postexposure exercise times, regardless of the condition. The difference between conditions with respect to exercise duration does not, however, appear to vary with baseline.
Wilcoxon Rank Sum Test
When the research question of interest involves comparing 2 samples or groups of subjects with respect to a response variable, such as comparing disease outcomes among patients randomized to receive a test versus control treatment, the Wilcoxon rank sum test is useful. The null hypothesis in this setting is that of no association between the 2 groups of subjects and the response variable, and the alternative hypothesis is that of a location shift for the population represented by one group versus the other, eg, a tendency toward higher values of the response variable as a result of the test treatment. The Wilcoxon rank sum test can be applied regardless of whether the response variable is metric or ordered categorical, and, as is the case for the signed rank test, the methods of Hodges and Lehmann can be applied to compute point estimates and confidence intervals for the difference in medians between the 2 samples.
Let n_{1} denote the sample size in the first group and n_{2} denote the sample size in the second group. The total sample size is n=n_{1}+n_{2}. The Wilcoxon rank sum test is computed as follows:
Ranks are assigned to all observations of the response variable, pooling across groups of subjects and using midranks for ties.
The test statistic can most easily be expressed in the manner previously described for the Spearman rank correlation coefficient by letting S have the value 1 for the n_{1} subjects in group 1, and the value 0 for the n_{2} subjects in group 2 (where n=n_{1}+n_{2}), and letting R correspond to the ranks of the response variable.
The significance level is calculated by comparing the test statistic to the χ^{2} distribution with 1 degree of freedom, when sample sizes are large (eg, ≥20 per group). When sample sizes are small and subjects are randomly allocated to groups (either by the design of the study or as implied by the null hypothesis), the significance level is calculated by comparing the test statistic to the critical region of the exact distribution.^{10}
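If there are no ties among the ranks, then the test statistic simplifies to a commonly used form. Let T be the sum of the ranks in group 1. Then the rank sum test statistic is given by
$$\frac{\left[T-n_{1}(n+1)/2\right]^{2}}{n_{1}n_{2}(n+1)/12},$$
which is compared with the χ^{2} distribution with 1 degree of freedom.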
The statistical procedures SAS Proc NPAR1WAY^{3} and StatXact^{4} both provide exact probability levels for the Wilcoxon rank sum test.
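The rank sum computation described above can be sketched in Python by correlating the pooled midranks with a 0/1 group indicator and forming (n−1)r^{2}. The function names (`midranks`, `pearson`, `rank_sum_chi_square`) are hypothetical; this is an illustration of the large-sample statistic only, not of the exact test or of any SAS or StatXact routine.

```python
import math

def midranks(values):
    """Rank values 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1  # mean of the 1-based ranks start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks

def pearson(a, b):
    """Standard Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    abar, bbar = sum(a) / n, sum(b) / n
    num = sum((u - abar) * (v - bbar) for u, v in zip(a, b))
    den = math.sqrt(sum((u - abar) ** 2 for u in a) * sum((v - bbar) ** 2 for v in b))
    return num / den

def rank_sum_chi_square(group1, group2):
    """Return (T, statistic): rank sum of group 1 and (n - 1) * r**2."""
    n1, n2 = len(group1), len(group2)
    ranks = midranks(list(group1) + list(group2))  # pooled ranking across groups
    indicator = [1] * n1 + [0] * n2                # S = 1 for group 1, 0 for group 2
    r = pearson(ranks, indicator)
    return sum(ranks[:n1]), (n1 + n2 - 1) * r ** 2

# Toy data without ties (not from the article)
g1, g2 = [1.1, 2.5, 3.7], [2.0, 4.2, 5.9, 6.3]
t, stat = rank_sum_chi_square(g1, g2)
n1, n2 = len(g1), len(g2)
n = n1 + n2
no_ties_form = (t - n1 * (n + 1) / 2) ** 2 / (n1 * n2 * (n + 1) / 12)
print(t, stat, no_ties_form)  # the two forms agree when there are no ties
```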
If it is assumed that metric distributions for the 2 groups have the same shape and scale, Hodges-Lehmann estimates for the difference in medians between the 2 groups of patients, Δ, and confidence limits about Δ are available. The point estimate corresponds to the median of all pairwise differences between observations in one group versus those in the other group. There are n_{1}n_{2} such differences.
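A minimal Python sketch of this 2-sample point estimate follows. The function name `hodges_lehmann_shift` is hypothetical, and confidence limits are not computed.

```python
from statistics import median

def hodges_lehmann_shift(group1, group2):
    """Median of the n1*n2 pairwise differences (group1 value - group2 value)."""
    differences = [a - b for a in group1 for b in group2]
    return median(differences)

# Toy data (not from the article)
print(hodges_lehmann_shift([5, 7, 9], [1, 2, 6]))
```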
The ARE for the Wilcoxon rank sum test relative to the t test for comparing 2 independent samples is at least 0.864 when the alternative hypothesis is a location shift in the distributions of the 2 samples and all continuous distributions are considered. When the distributions are normal, the ARE is 3/π≈0.955. Note that when the distributions of the response variables are highly skewed, with long tails at either end, then the ARE can exceed 1.0, indicating that the Wilcoxon rank sum test will be more powerful than a t test in this instance.^{11}
Example Dataset, Continued
The association between order of exposure and differences in exercise times can be assessed by applying the Wilcoxon rank sum test because order of exposure defines 2 groups of patients for comparison. The sum of the ranks for subjects exposed to CO first is 245.5, and the sum of the ranks for patients exposed to Air first is 219.5. Expected values of these 2 quantities under the null hypothesis of no association are 248.0 and 217.0, respectively, where expected value is defined as the average value of the sum of the ranks across all possible randomizations of n_{1} subjects to the first group and n_{2} subjects to the second group. The Wilcoxon rank sum test statistic=219.5, and the exact P=0.9250, compatible with no difference between the 2 groups with respect to differences in exercise duration. The approximate P=0.9178. The Spearman rank correlation coefficient for order of exposure by differences in exercise times is −0.0197, and P=0.9178, identical to that associated with the Wilcoxon rank sum test, which is compatible with no correlation between these 2 variables.
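The expected rank sums quoted above follow from the formula n_{g}(n+1)/2 for a group of size n_{g} in a total sample of size n. The Python sketch below checks the 2 reported values and, on a small hypothetical case, confirms that the formula matches the average over all possible allocations. The function names are illustrative only.

```python
from itertools import combinations

def expected_rank_sum(n_group, n_total):
    """Expected sum of ranks for a group of size n_group under the null hypothesis."""
    return n_group * (n_total + 1) / 2

def average_rank_sum_by_enumeration(n_group, n_total):
    """Average rank sum over every way of choosing n_group of the ranks 1..n_total."""
    sums = [sum(c) for c in combinations(range(1, n_total + 1), n_group)]
    return sum(sums) / len(sums)

print(expected_rank_sum(16, 30), expected_rank_sum(14, 30))  # 16 CO-first, 14 Air-first
print(average_rank_sum_by_enumeration(2, 5), expected_rank_sum(2, 5))
```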
For this example dataset, the fact that the order of exposure conditions did not appear to be related to the response variable validates the use of a signed ranks analysis of exercise times under the 2 conditions, ignoring the order of exposure. Had order been related to response, then a proper crossover analysis of the study data would be required that accounted for the order of exposure in assessing the impact of CO versus air on exercise times.^{12}
Extensions of the Rank Sum Test
The Wilcoxon rank sum test can be extended to allow for covariate adjustment in a nonparametric analog to ANCOVA. Rank ANCOVA^{13} can be performed through the following steps:
Ranks are computed for the response variable, ignoring groups and using midranks in the case of ties.
Ranks are then computed for the covariate, ignoring groups and using midranks in the case of ties.
A linear regression model is fit, regressing the ranked response variable onto the ranked covariates (with groups ignored), and the residuals are output.
A test of the association between group and the response variable, adjusting for the covariate, is provided by applying computations like those described for the Spearman rank correlation coefficient (the formula for r_{s}), but with R equal to the residuals, S=1 for subjects in group 1, and S=0 for subjects in group 2.^{9}
The method in step 4 for comparing the residuals in the 2 groups is an extension of the Wilcoxon rank sum test to provide covariate adjustment. Rank ANCOVA can provide additional power through the variance reduction typically associated with a baseline covariate adjustment, even when the response variable does not follow a normal distribution.^{14}
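The 4 steps of rank ANCOVA can be sketched in Python for the case of a single covariate. This is an illustration under stated assumptions rather than the authors' implementation (the example analysis in this article used SAS Proc GLM for the regression step): the function names are hypothetical, the regression is an ordinary least-squares fit with an intercept, and the test statistic is taken to be (n−1)r^{2} with r computed between the residuals and a 0/1 group indicator.

```python
import math

def midranks(values):
    """Rank values 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1  # mean of the 1-based ranks start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks

def pearson(a, b):
    """Standard Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    abar, bbar = sum(a) / n, sum(b) / n
    num = sum((u - abar) * (v - bbar) for u, v in zip(a, b))
    den = math.sqrt(sum((u - abar) ** 2 for u in a) * sum((v - bbar) ** 2 for v in b))
    return num / den

def rank_ancova(response, covariate, group):
    """Return (r, (n - 1) * r**2) for a 0/1 group indicator and one covariate."""
    ry = midranks(response)    # step 1: rank the response, ignoring groups
    rx = midranks(covariate)   # step 2: rank the covariate, ignoring groups
    n = len(ry)
    mx, my = sum(rx) / n, sum(ry) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(rx, ry))
             / sum((x - mx) ** 2 for x in rx))
    # step 3: residuals from regressing ranked response on ranked covariate
    residuals = [(y - my) - slope * (x - mx) for x, y in zip(rx, ry)]
    r = pearson(residuals, group)  # step 4: correlate residuals with group
    return r, (n - 1) * r ** 2

# Toy data (not from the article)
r, stat = rank_ancova([3, 1, 4, 2, 6, 5], [1, 2, 3, 4, 5, 6], [1, 0, 1, 0, 1, 0])
print(round(r, 4), round(stat, 4))
```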
Stratification may also be an important aspect of the study design resulting from patients being sorted into subsets before the conduct of the study (eg, male and female strata or strata consisting of patients from different clinical centers in a multicenter study). Extensions for the Wilcoxon rank sum test and the test for the Spearman rank correlation coefficient that account for stratification are available.^{9,10} The stratified extension for the Wilcoxon rank sum test is referred to as the van Elteren statistic.^{15} Computations for this method are shown in the technical appendix in the online-only Data Supplement and are available through SAS Proc FREQ.^{3}
Example Dataset, Continued
Rank ANCOVA can be applied to the example dataset to assess the impact of order of exposure on the differences in exercise times between exposure conditions while controlling for baseline values. Following the steps outlined above, the difference scores between air and CO are ranked as for the Wilcoxon rank sum test (Table 3). Baseline values are also ranked, and a linear regression model is fit with the use of SAS Proc GLM.^{3} The residuals from the model are output, and the Pearson correlation coefficient between the residuals and the order variable is computed as −0.0372. The probability value from the exact test is 0.8437 and from the asymptotic test is 0.8412, both showing little association between order of exposure and difference scores, after adjustment for baseline values. These results are similar to the unadjusted test of association computed above. Because the Spearman rank correlation coefficient showed no evidence of association between baseline exercise times and the difference in exercise times, the fact that the covariate adjustment had little impact on the results of the unadjusted Wilcoxon rank sum test is not surprising.
To illustrate the extension of these methods for stratification, subjects were stratified according to whether their baseline value was below the median of 540 seconds versus equal to or above the median. A stratified Wilcoxon rank sum test (ie, van Elteren test) was then performed. The probability value for testing the null hypothesis of no association in all strata is 0.8810 and therefore is compatible with no association between order of exposure and differences in exercise duration after stratifying on baseline value (below versus above the median).
Discussion
Nonparametric methods such as those described here are most useful in situations in which their use is prespecified before data analysis. When response variables are ordered categories or when it is known in advance that assumptions about metric response variables will likely fail, then such prespecification is possible at the time the statistical analysis plan is prepared. In these situations, nonparametric methods will have good power properties, with power as high as 93% of that expected for standard parametric methods (eg, t tests) applied under ideal circumstances.
The methods described here all address a shift in location as the alternative hypothesis, in which location corresponds to the median of the response variable distribution. If transformed data are expected to be normally distributed (eg, the response variable follows a lognormal distribution), then nonparametric methods will be at least 95% efficient, and their use removes the need to identify the optimal transformation.
When the response variable is so highly skewed that the distribution appears to have an “L” or “J” shape, the Wilcoxon tests will not have good power, and Savage (or logrank) tests will be better.^{16} With highly skewed distributions, both groups of subjects will tend to have ranked values on one side of the median, but only one group will have ranked values on the other or tail side of the distribution. In this case, only the rank values on the tail side are informative, and against this alternative, the Wilcoxon rank sum test will not be the most appropriate test. Because the Wilcoxon tests address shifts in location only, ranked values from both groups are expected to occur to the left and to the right of the median, and both are informative. Under the alternative hypothesis of a shift in location, one group will tend to have ranks on one side of the overall median, whereas the other will tend to have ranks on the opposite side. This is precisely the setting in which the Wilcoxon tests are most useful and nearly as powerful as parametric methods applied when all assumptions hold.
Acknowledgments
Disclosures
None.
Footnotes

The online-only Data Supplement, consisting of a technical appendix, is available with this article at http://circ.ahajournals.org/cgi/content/full/114/23/2528/DC1.
References
Hollander M, Wolfe DA. Nonparametric Statistical Methods. New York, NY: John Wiley & Sons; 1973.
Conover WJ. Practical Nonparametric Statistics. New York, NY: John Wiley & Sons; 1971.
SAS Institute Inc. SAS/STAT User’s Guide, Version 9. Cary, NC: SAS Institute Inc; 2004.
Cytel Software Corporation. StatXact 7 Online User Manual. Cambridge, Mass: Cytel Software Corporation; 2005.
Woolson RF. Wilcoxon signed-rank test. In: Armitage P, Colton T, eds. Encyclopedia of Biostatistics. Vol 6. West Wessex, England: John Wiley & Sons Ltd; 1998: 4739–4740.
DasGupta A. Encyclopedia of Biostatistics. Vol 1. Armitage P, Colton T, eds. West Wessex, England: John Wiley & Sons Ltd; 1998: 210–215.
Adams KF, Koch GG, Chaterjee B, Goldstein GM, O’Neil JJ, Bromberg PA, Sheps DS, McAllister S, Price CJ, Bissette J. Acute elevation of blood carboxyhemoglobin to 6% impairs exercise performance and aggravates symptoms in patients with ischemic heart disease. J Am Coll Cardiol. 1988; 12: 900–909.
Stokes ME, Davis CS, Koch GG. Categorical Data Analysis Using the SAS System. 2nd ed. Cary, NC: SAS Institute Inc; 2000.
Landis RJ, Sharp TJ, Kuritz SJ, Koch GG. Mantel-Haenszel methods. In: Armitage P, Colton T, eds. Encyclopedia of Biostatistics. Vol 3. West Essex, England: John Wiley & Sons Ltd; 1998: 2378–2391.
Moses L. Wilcoxon-Mann-Whitney test. In: Armitage P, Colton T, eds. Encyclopedia of Biostatistics. Vol 6. West Wessex, England: John Wiley & Sons Ltd; 1998: 4742–4745.
Tudor G, Koch GG. Review of nonparametric methods for the analysis of crossover studies. Stat Methods Med Res. 1994; 3: 345–381.
Koch GG, Carr GJ, Amara IA, Stokes ME, Uryniak TJ. Categorical data analysis. In: Berry DA, ed. Statistical Methodology in the Pharmaceutical Sciences. New York, NY: Marcel Dekker; 1990: 389–473.
LaVange LM, Durham TA, Koch GG. Randomization-based nonparametric methods for the analysis of multicentre trials. Stat Methods Med Res. 2005; 14: 281–301.
Lehmann EL. Nonparametrics: Statistical Methods Based on Ranks. San Francisco, Calif: Holden-Day; 1975.
Koch GG, Sen PK, Amara I. Log-rank scores, statistics, and tests. In: Kotz S, Johnson NL, eds. Encyclopedia of Statistical Sciences. Vol 5. New York, NY: John Wiley & Sons; 1985: 136–142.
Rank Score Tests. Lisa M. LaVange and Gary G. Koch. Circulation. 2006;114:2528-2533. Originally published December 4, 2006. https://doi.org/10.1161/CIRCULATIONAHA.106.613638