Use and Misuse of the Receiver Operating Characteristic Curve in Risk Prediction
The c statistic, or area under the receiver operating characteristic (ROC) curve, achieved popularity in diagnostic testing, in which the test characteristics of sensitivity and specificity are relevant to discriminating diseased versus nondiseased patients. The c statistic, however, may not be optimal in assessing models that predict future risk or stratify individuals into risk categories. In this setting, calibration is as important as discrimination to the accurate assessment of risk. For example, a biomarker with an odds ratio of 3 may have little effect on the c statistic, yet an increased level could shift estimated 10-year cardiovascular risk for an individual patient from 8% to 24%, which would lead to different treatment recommendations under current Adult Treatment Panel III guidelines. Accepted risk factors such as lipids, hypertension, and smoking have only marginal impact on the c statistic individually yet lead to more accurate reclassification of large proportions of patients into higher-risk or lower-risk categories. Perfectly calibrated models for complex disease can, in fact, only achieve values for the c statistic well below the theoretical maximum of 1. Use of the c statistic for model selection could thus naively eliminate established risk factors from cardiovascular risk prediction scores. As novel risk factors are discovered, sole reliance on the c statistic to evaluate their utility as risk predictors thus seems ill-advised.
Models for prognostic risk prediction have been widely used in the cardiovascular field to predict risk of future events or to stratify apparently healthy individuals into risk categories.1 Appropriate model assessment is critical to the determination of clinical impact and to guideline development. In particular, as novel risk factors, which include blood-based biomarkers and those derived from genomics or proteomics, are discovered, whether these factors can contribute to overall risk prediction becomes an important question.
The accuracy of models can be assessed in several ways. Two major components are calibration and discrimination.2 Calibration is a measure of how well predicted probabilities agree with actual observed risk. When the average predicted risk within subgroups of a prospective cohort, for example, matches the proportion that actually develops disease, we say a model is well calibrated. The Hosmer-Lemeshow statistic3 compares these proportions directly and is a popular, though imperfect,4 means to assess model calibration.
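As an illustration of the calibration check described above, the following is a minimal sketch that groups subjects by predicted risk, compares mean predicted risk with the observed event proportion in each group, and sums a Hosmer-Lemeshow-type chi-square. The function names, the use of 10 equal-sized groups, and the simulated cohort are assumptions made for the example and are not taken from the paper.

```python
import random

def calibration_table(pred, outcome, n_groups=10):
    """Sort subjects by predicted risk, split into equal-sized groups, and
    return (n, mean predicted risk, observed event proportion) per group."""
    order = sorted(range(len(pred)), key=lambda i: pred[i])
    rows = []
    for g in range(n_groups):
        idx = order[g * len(order) // n_groups:(g + 1) * len(order) // n_groups]
        n = len(idx)
        mean_pred = sum(pred[i] for i in idx) / n
        obs_rate = sum(outcome[i] for i in idx) / n
        rows.append((n, mean_pred, obs_rate))
    return rows

def hosmer_lemeshow(pred, outcome, n_groups=10):
    """Sum over groups of (observed - expected)^2 / (n * pbar * (1 - pbar)),
    written in terms of proportions."""
    stat = 0.0
    for n, mean_pred, obs_rate in calibration_table(pred, outcome, n_groups):
        stat += n * (obs_rate - mean_pred) ** 2 / (mean_pred * (1.0 - mean_pred))
    return stat

# Hypothetical cohort: true risks drawn from a skewed distribution,
# outcomes generated from those risks.
random.seed(0)
true_risk = [random.betavariate(1, 9) for _ in range(5000)]
events = [1 if random.random() < r else 0 for r in true_risk]

well_calibrated = hosmer_lemeshow(true_risk, events)
# Doubling every predicted risk keeps the ranking but breaks calibration.
overestimated = hosmer_lemeshow([min(2 * r, 0.99) for r in true_risk], events)
print(well_calibrated, overestimated)
```

Because the doubled predictions preserve the rank order of subjects, discrimination is unchanged in this example while the calibration statistic deteriorates sharply.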
Discrimination is a measure of how well the model can separate those who do and do not have the disease of interest. If the predicted values for cases are all higher than for non-cases, we say the model can discriminate perfectly, even if the predicted risk does not match the proportion with disease. Discrimination is of most interest when classification into groups with or without prevalent disease is the goal, such as in diagnostic testing.5 Discrimination is most often measured by the area under the receiver operating characteristic (ROC) curve, or c statistic, as described below.
In the diagnostic setting the outcome is already determined but unknown to the investigator, and the estimated classification can often be compared with a more expensive or invasive gold standard. In prognostic modeling or risk stratification, however, the outcome has not yet developed at the time that predictors are assessed. Future disease status remains to be determined by stochastic process, and can only be estimated as a probability or risk.6 Measures of discrimination are nonetheless commonly emphasized in such settings, which ignores the random nature of the outcome. Calibration, as well as discrimination, is important in accurate risk prediction. More global measures of fit that combine calibration and discrimination exist, such as likelihood statistics, R2, and the Brier score.2,7 The performance of risk prediction models in the cardiovascular literature, however, is often judged solely on the basis of the c statistic,8–14 despite the existence of large prospective cohort studies from which risk can be estimated directly.
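Of the global measures named above, the Brier score is the simplest to write down: it is the mean squared difference between the predicted probability and the 0/1 outcome. The sketch below uses made-up predictions for 4 subjects to show that a model can rank cases above noncases perfectly and still score poorly when its probabilities are uninformative.

```python
def brier_score(pred, outcome):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(pred, outcome)) / len(pred)

outcome = [1, 1, 0, 0]                 # hypothetical: 2 cases, 2 noncases
compressed = [0.52, 0.52, 0.51, 0.51]  # perfect ranking, uninformative probabilities
spread = [0.90, 0.80, 0.20, 0.10]      # perfect ranking, informative probabilities

print(brier_score(compressed, outcome))
print(brier_score(spread, outcome))
```

Lower values indicate better fit. Both hypothetical models discriminate perfectly, so the difference between their scores reflects only the predicted probabilities themselves.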
The ROC Curve and the c Statistic
The most popular measure of model fit in the cardiovascular literature has been the c statistic, a measure of discrimination also known as the area under the ROC curve,15 or the c index, its generalization for survival data.2,16 The ROC curve and its associated c statistic are functions of the sensitivity and specificity for each value of the measure or model. The sensitivity of a test is the probability of a positive test result, or of a value above a threshold, among those with disease (cases). The specificity is the probability of a negative test result, or a value below a threshold, among those without disease (noncases). It is commonly believed that sensitivity and specificity are properties of a test and are not subject to alteration by prevalence of disease, as are the positive and negative predictive values. This has been shown to be false, however, both theoretically and clinically.17,18 Both sensitivity and specificity can be influenced by case mix, disease severity, or risk factors for disease. For example, a test is likely to be more sensitive among more severe than among milder cases of disease. Similarly, specificity can depend on characteristics of noncases, such as gender, age, or prevalence of concomitant risk factors.18–20
The ROC curve is a plot of sensitivity versus 1−specificity (often called the false-positive rate) that offers a summary of sensitivity and specificity across a range of cut points for a continuous predictor. The area under the curve, or c statistic, ranges from 0.5 (no discrimination) to a theoretical maximum of 1. Perfect discrimination corresponds to a c statistic of 1 and is achieved if the scores for all the cases are higher than those for all the noncases, with no overlap. The c statistic is equivalent to the probability that the measure or predicted risk is higher for a randomly chosen case than for a randomly chosen noncase.15 Note that the c statistic is not the probability that individuals are classified correctly or that a person with a high test score will eventually become a case. The latter is closer in meaning to the predictive value, or the probability of disease given the test result.
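The construction of the ROC curve can be sketched in a few lines. The example below computes sensitivity and 1−specificity at every observed cut point and integrates the resulting curve with the trapezoidal rule. The scores and outcome labels are hypothetical, and the function names are chosen for this illustration only.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs, one per distinct cut point,
    treating a score at or above the cut point as a positive test."""
    n_case = sum(labels)
    n_noncase = len(labels) - n_case
    points = [(0.0, 0.0)]
    for cut in sorted(set(scores), reverse=True):
        true_pos = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 0)
        points.append((false_pos / n_noncase, true_pos / n_case))
    return points

def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points ordered by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical test scores for 3 cases (label 1) and 3 noncases (label 0).
scores = [0.90, 0.80, 0.70, 0.60, 0.55, 0.40]
labels = [1, 1, 0, 1, 0, 0]
print(area_under_curve(roc_points(scores, labels)))
```

In this toy data set, 8 of the 9 possible case-noncase pairs have the higher score in the case, and the trapezoidal area equals the same fraction, 8/9.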
The c statistic also describes how well models can rank order cases and noncases, but is not a function of the actual predicted probabilities. For example, a model that assigns all cases a value of 0.52 and all noncases a value of 0.51 would have perfect discrimination, although the probabilities it assigns may not be helpful. The actual predicted probabilities do matter, however, in clinical risk prediction models such as those commonly used for the assessment of global cardiovascular risk.
In a prospective cohort that is considered generally low-risk, such as many population-based cohorts, there may be a small proportion of individuals who are at high risk, with a preponderance of those at low or very low risk. Rank-based measures such as the c statistic do not take this distribution into account. A difference between 2 individuals who are at very low risk (eg, 1.0% versus 1.1%) has the same impact on the c statistic as a difference between 2 individuals who are at moderate versus high risk (eg, 5% versus 20%) if their differences in rank are the same. Clinically, however, it may be more important to separate the latter 2 individuals, particularly if treatment decisions are based on predicted probabilities, such as those used by the Adult Treatment Panel III.1
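The dependence of the c statistic on ranks alone can be checked directly by counting concordant case-noncase pairs. The sketch below uses hypothetical predicted risks; the function name and data are illustrative assumptions.

```python
def c_statistic(scores, labels):
    """Proportion of case-noncase pairs in which the case has the higher
    score, counting ties as one half."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    noncases = [s for s, y in zip(scores, labels) if y == 0]
    concordant = 0.0
    for case_score in cases:
        for noncase_score in noncases:
            if case_score > noncase_score:
                concordant += 1.0
            elif case_score == noncase_score:
                concordant += 0.5
    return concordant / (len(cases) * len(noncases))

# All cases at 0.52 and all noncases at 0.51: perfect discrimination.
print(c_statistic([0.52, 0.52, 0.51, 0.51, 0.51], [1, 1, 0, 0, 0]))

# Same ordering of subjects, very different spacing of predicted risks.
labels = [0, 1, 0, 1]
uneven = [0.010, 0.011, 0.05, 0.20]
even = [0.01, 0.02, 0.03, 0.04]
print(c_statistic(uneven, labels), c_statistic(even, labels))
```

The two sets of predicted risks in the second comparison yield the same value because only the ordering of subjects enters the calculation, although a clinician would treat a risk of 20% quite differently from a risk of 4%.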
c Statistics and Model Selection
Because the c statistic is based solely on ranks, it is less sensitive than measures based on the likelihood or other global measures of model fit.2 This characteristic may make it a poor choice for the selection of variables to be used in a predictive model. As an example, consider data from the Women’s Health Study,20a a prospective cohort including 26 901 initially healthy nondiabetic women followed for the development of future vascular events over an average period of 10 years. Table 1 (top) shows the results from traditional Framingham risk factors for future cardiovascular disease (CVD) that were modeled with a Cox proportional hazards model. The likelihood ratio statistic tests the significance of the addition of each variable separately to a predictive model that included age only (Table 1, top). Each variable is highly significant statistically, and the χ2 statistics indicate that, after age, systolic blood pressure (SBP) is the strongest predictor of risk, followed by smoking and lipids. To directly compare the magnitude of effects in this population, the rate (hazard) ratios per 2 SD units are shown, roughly comparable to a comparison of risks for extreme tertiles. The rate ratios lie in the same order as the likelihood ratio statistics for the continuous variables.
The c statistic in the model that included only age is 0.70 (Table 1), based on the generalized c index for censored data. This means that the probability is 70% that a case is older than a noncase. When SBP is added to the model, the c statistic improves to 0.74, which means that the probability that the predicted risk is higher in cases than noncases is 74%. Although the likelihood ratio statistics and the rate ratios suggest that SBP is the strongest predictor after age, the c statistic is 0.73 to 0.74 for models with SBP, smoking, and high-density lipoprotein cholesterol (HDL-C), and is unable to distinguish among these 3 factors. The c statistic is only 0.71 for the model that includes age and low-density lipoprotein cholesterol (LDL-C). This failure to improve the c statistic would suggest that LDL-C is not predictive of CVD, even though it is highly statistically significant in these data, the effect size is moderate, and we know from many studies and trials that LDL-C is an important modifiable risk factor for CVD. An example of improved predictive accuracy despite little change in the c statistic is given below.
In a similar manner, with age, SBP, and smoking together in the model for future CVD risk, the c statistic was 0.76, and improved only slightly to 0.77 when any of the lipids were added individually (Table 1, middle). Despite this, the likelihood statistics and the rate ratios indicate that HDL-C is a strong cardiovascular risk predictor, followed by total and LDL cholesterol. Finally, when each variable was in turn removed from the full model (Table 1, bottom), the c statistic dropped from 0.78 to only 0.77 or 0.76 for all variables except age. Thus, in this example, the likelihood-based measures of model fit were able to distinguish the importance of several established risk factors, whereas the c statistic could not. Indeed, if improvement in the c statistic were used as the criterion for model inclusion, then none of LDL-C, HDL-C, or total cholesterol would have been included in risk models after accounting for age, blood pressure, and smoking. In an example from the Framingham Heart Study, family history of premature atherosclerosis was found to be an independent predictor of cardiovascular events, with a relative risk of 2.0 for men and 1.7 for women.21 However, the c statistic increased only from 0.80 to 0.81 in men and 0.81 to 0.82 in women, which may inappropriately limit enthusiasm for this variable.
In these examples, sole reliance on the c statistic would seem ill-advised because discrimination is only one aspect of model performance. Likelihood-based measures, such as the likelihood ratio statistic or the Bayes information criterion, which adjusts for the number of variables in the model, are alternatives that are more sensitive and more global measures of model fit.2 Use of these criteria would have selected age, SBP, smoking, total cholesterol, and HDL-C from the variables considered in Table 1, and reached a final model similar to that developed from the Framingham data.1,22
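The likelihood-based comparisons described in this section reduce to simple arithmetic once the (partial) log-likelihoods of the fitted models are available. The sketch below is illustrative only: the log-likelihood values are invented, and the choice of sample size in the Bayes information criterion (subjects versus events for a Cox model) is left to the analyst and is an assumption here.

```python
import math

def likelihood_ratio_statistic(loglik_reduced, loglik_full):
    """Chi-square statistic comparing two nested models."""
    return 2.0 * (loglik_full - loglik_reduced)

def chi2_pvalue_1df(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of
    freedom: P(Z^2 > stat) = erfc(sqrt(stat / 2))."""
    return math.erfc(math.sqrt(stat / 2.0))

def bic(loglik, n_params, n_obs):
    """Bayes information criterion; smaller values indicate better fit
    after the penalty for the number of variables."""
    return -2.0 * loglik + n_params * math.log(n_obs)

# Hypothetical log-likelihoods: age only versus age plus one added variable.
ll_age, ll_age_plus = -1000.0, -990.0
lr = likelihood_ratio_statistic(ll_age, ll_age_plus)
print(lr, chi2_pvalue_1df(lr))
print(bic(ll_age, 1, 1000), bic(ll_age_plus, 2, 1000))
```

Because the criterion is computed separately for each model, it can also be used to compare models that are not nested, whereas the likelihood ratio test cannot.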
Odds Ratios and Predictive Values
In epidemiological studies, the most common choices of effect measures are ratios, expressed either as a relative risk (risk ratio), an odds ratio (OR), a rate ratio, or a hazard ratio. Pepe et al23 describe the relation between the OR and the c statistic, and show that an OR as large as 3.0 may have little impact on the ROC curve or the c statistic. This may be pertinent when classification into 2 groups is the objective, such as in diagnostic testing. A relative risk of this size, however, could lead to a clinically important improvement in risk prediction for future disease.
Consider, for example, a novel biomarker that has a relative risk of 3.0, but leads to little or no improvement in the c statistic given traditional risk factors. For some patients, a high level of the biomarker would shift estimated 10-year risk from 1% to only 3%, a clinically unimportant difference. For others, the same high biomarker level could alter the estimated 10-year risk of a cardiovascular event from 8% to 24%, and lead to different treatment recommendations under current Adult Treatment Panel III guidelines. Thus, for risk prediction, the actual or absolute predicted risk, which is not captured by the c statistic, is of primary clinical interest.
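The arithmetic behind this example is simple enough to write out. In the sketch below, the baseline risk is multiplied by the relative risk and the result is assigned to a 10-year risk category. The cut points of 5%, 10%, and 20% are an assumption made for illustration and should be replaced by whatever categories the applicable guideline uses.

```python
def risk_category(risk, cut_points=(0.05, 0.10, 0.20)):
    """Index of the risk category: 0 is the lowest category and
    len(cut_points) the highest. Cut points are assumed, not prescribed."""
    for k, cut in enumerate(cut_points):
        if risk < cut:
            return k
    return len(cut_points)

def shifted_risk(baseline_risk, relative_risk):
    """Absolute risk after applying a relative risk, capped at 1."""
    return min(1.0, baseline_risk * relative_risk)

for baseline in (0.01, 0.08):
    new_risk = shifted_risk(baseline, 3.0)
    print(baseline, new_risk, risk_category(baseline), risk_category(new_risk))
```

With these assumed cut points, the patient who moves from 1% to 3% stays in the lowest category, whereas the patient who moves from 8% to 24% crosses two category boundaries.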
Figure 1 shows the distribution of a hypothetical normally distributed risk factor X among cases and controls. Among controls the mean is 0, with a SD of 0.5. Suppose that the OR per 2 SD units equals 3.0, which corresponds to a c statistic of 0.65. Despite this moderately large OR, there is a great deal of overlap between the distributions for cases and noncases. This extent of overlap often occurs in practice, as evidenced by the distributions of total cholesterol among cases and noncases of coronary heart disease in the Framingham study.24 Thus, total cholesterol by itself would be considered a poor classifier for cardiovascular risk, even though it is known to be a pathophysiological determinant of disease.
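The correspondence between the OR and the c statistic quoted here can be reproduced under one explicit assumption: that the risk factor is normally distributed with the same SD among cases and noncases. In that case the log OR per SD equals the difference in means expressed in SD units, d, and the c statistic equals Φ(d/√2), where Φ is the standard normal distribution function. The sketch below applies this relation; the function names are illustrative.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def c_from_odds_ratio(or_per_2sd):
    """c statistic implied by an OR per 2 SD units, assuming normal
    distributions with equal SD among cases and noncases."""
    d = math.log(or_per_2sd) / 2.0  # case-noncase mean difference in SD units
    return normal_cdf(d / math.sqrt(2.0))

print(round(c_from_odds_ratio(3.0), 2))
```

For an OR of 3.0 per 2 SD units the implied mean difference is about 0.55 SD, which gives a c statistic of approximately 0.65, in agreement with the value stated in the text.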
The OR does, however, relate to the predictive value, or the probability of disease given a positive test, or a value above a threshold. This is a direct function of the OR. Figure 1 also plots the probability of disease given the value of risk factor X in a population with an overall disease probability of 10%. Despite the overlap in the distributions, the predicted probabilities range from <5% to >25%, a difference that may be clinically important. Although the actual numbers may depend on overall disease incidence, the revised risk could cross risk strata in treatment guidelines and lead to different treatment decisions.
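The predicted probabilities described for Figure 1 follow from Bayes' theorem. The sketch below assumes, as in the hypothetical example, that X is normal with SD 0.5 in both groups, that noncases have mean 0, that the OR per 2 SD units is 3.0, and that the overall disease probability is 10%. The function name and default arguments are chosen for this illustration.

```python
import math

def prob_disease_given_x(x, prevalence=0.10, sd=0.5, or_per_2sd=3.0,
                         mean_noncase=0.0):
    """Probability of disease at a given value of X, assuming normal
    distributions with a common SD among cases and noncases."""
    beta = math.log(or_per_2sd) / (2.0 * sd)    # log OR per unit of X
    mean_case = mean_noncase + beta * sd ** 2   # case mean implied by the OR
    # Log likelihood ratio of case versus noncase densities at x.
    log_lr = beta * (x - (mean_case + mean_noncase) / 2.0)
    logit = math.log(prevalence / (1.0 - prevalence)) + log_lr
    return 1.0 / (1.0 + math.exp(-logit))

for x in (-1.0, -0.5, 0.0, 0.5, 1.0, 1.25):
    print(x, round(prob_disease_given_x(x), 3))
```

Under these assumptions, values of X about 2 SD below the noncase mean give a probability below 5%, and values about 2.5 SD above it give a probability above 25%, despite the heavy overlap of the two distributions.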
SBP provides a more concrete example. Among men in the second National Health and Nutrition Examination Survey (NHANES II), the estimated mean SBP is 129 mm Hg (SD 17.7).25 An OR of 3.0 per 2 SD units would correspond to a mean of 139 mm Hg among cases, or a difference of 10 mm Hg between cases and noncases (Table 2). Corresponding differences in other measures would be 6 mm Hg for diastolic blood pressure, 24 mg/dL for total cholesterol, 21 mg/dL for LDL-C, and 8 mg/dL for HDL-C,25,26 all of which would appear to be clinically important differences. None of these by itself, however, would lead to substantial improvement in the area under the ROC curve.
Pepe23 suggests that an OR of about 16, which corresponds to a c statistic of about 0.84, may be needed to achieve reasonable discrimination, or classification into cases and noncases. As shown in Table 2, this would require much larger differences in means between cases and noncases, as large as 24 mm Hg in SBP and 62 mg/dL for total cholesterol. It is unlikely that a single marker could achieve such levels of separation and discrimination. It is, however, possible for a score composed of several of these predictors together to achieve this target. Thus, the Framingham score, which includes several traditional risk factors, has been shown to discriminate reasonably well.22
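The SBP differences quoted for ORs of 3.0 and 16 can be checked with the same equal-variance normal assumption, under which the difference in means equals the log OR per SD multiplied by the SD. The sketch below uses the SD of 17.7 mm Hg given in the text; SDs for the other measures are not stated in the text and are therefore not reproduced here.

```python
import math

def mean_difference(or_per_2sd, sd):
    """Case-noncase difference in means implied by an OR per 2 SD units,
    assuming normal distributions with a common SD."""
    return math.log(or_per_2sd) / 2.0 * sd

SBP_SD = 17.7  # mm Hg, from the text
for odds_ratio in (3.0, 16.0):
    print(odds_ratio, round(mean_difference(odds_ratio, SBP_SD), 1))
```

The results, roughly 10 and 24 mm Hg, agree with the rounded values cited from Table 2.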
Thus, if a relative risk >3.0 were required as a strict criterion for inclusion of each additional biomarker in risk prediction, then, except for age, few of the components of the Framingham risk score would be eligible for inclusion. In the Framingham model,22 none of the risk factors other than age (blood pressure, smoking, or lipids) individually achieves a rate ratio higher than 1.9 for men or 2.2 for women. Although the Framingham score as a whole, including age, increased the c statistic from 0.5 (ie, from chance alone with no predictors used) to 0.74 in men and 0.77 in women, the individual clinical risk factors could not do so based on their conditional relative risks. Use of an improvement in the ROC curve for each individual biomarker as a criterion, then, would eliminate most risk factors currently in use for cardiovascular risk prediction, which would include lipids, blood pressure, and smoking.
Predictive Values and Calibration
Calibration has largely been overlooked in discussions of model fit in the cardiovascular field.8,9,11,13,14 Although a model can be recalibrated to the overall risk in a new population,2 we prefer that predicted risk match observed risk within all subgroups. There is, in fact, a trade-off between discrimination and calibration, and a model typically cannot be perfect in both. Diamond27 showed that a perfectly calibrated model, in which the predicted risk equals the observed risk for all subgroups, cannot achieve a c statistic equal to 1 in usual settings. With the assumption of a uniform distribution of risk in the population, the maximum c statistic is 0.83. Gail and Pfeiffer demonstrated that this upper limit varies with the distribution of risk in the population.28 Figure 2 shows the maximum c statistic that can be achieved with perfectly calibrated models with various distributions of population risk. The 3 distributions in Figure 2 (left) have an average 10-year risk of 10%. If there is relatively little spread, then risk is centered around 10%, and the maximum c statistic with a perfectly calibrated model is only 0.63. If the average risk is the same, but there is more spread, the limit for the c statistic increases, and could reach 0.76 or even closer to 1.0. The same is true for the distributions in Figure 2 (right), which all have an average risk of 50%.
In a population similar to the Women’s Health Study cohort, with a low average 10-year risk of 2.5% and a 99th percentile of 23%, the maximum c statistic is 0.89. In a population with a higher overall risk of 10% and a 99th percentile of 40%, closer to that for coronary heart disease (inclusive of angina) in the Framingham cohort,22 the maximum c statistic is 0.76. Both perfect calibration and perfect discrimination could be achieved only if the true risk as well as the estimated risk were either 0 or 1, similar to the U-shaped distribution in Figure 2 (right). This may be true in the diagnostic setting where the outcome is determined, and the individual truly does or does not have disease. In the prediction of 10-year CVD risk in population-based cohorts, however, the maximum c statistic for perfectly calibrated models appears to be ≈0.75 to 0.90.
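The upper limit on the c statistic for a perfectly calibrated model can be computed numerically for any assumed distribution of true risk. If true risk p has a given distribution in the population, cases are distributed in proportion to p and noncases in proportion to 1−p, and the maximum c statistic is the probability that a random case has higher risk than a random noncase. The sketch below discretizes a beta distribution of risk on a grid; the use of beta distributions and the specific shape parameters are assumptions for illustration, not the distributions plotted in Figure 2.

```python
import math

def beta_weights(a, b, n_grid=2000):
    """Discretize a Beta(a, b) distribution of risk on a midpoint grid.
    Returns the grid and probability weights that sum to 1."""
    grid = [(i + 0.5) / n_grid for i in range(n_grid)]
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    dens = [math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))
            for p in grid]
    total = sum(dens)
    return grid, [d / total for d in dens]

def max_c_statistic(grid, weights):
    """c statistic of a perfectly calibrated model: P(risk of a random case
    exceeds risk of a random noncase), with ties counted as one half."""
    mean_risk = sum(w * p for p, w in zip(grid, weights))
    concordant = 0.0
    noncase_mass_below = 0.0
    for p, w in zip(grid, weights):  # grid is in increasing order of risk
        noncase_here = w * (1.0 - p)
        concordant += w * p * (noncase_mass_below + 0.5 * noncase_here)
        noncase_mass_below += noncase_here
    return concordant / (mean_risk * (1.0 - mean_risk))

# Uniform risk is Beta(1, 1); the exact limit is 5/6.
print(max_c_statistic(*beta_weights(1, 1)))
# Two assumed distributions with the same mean risk of 10% but different spread.
print(max_c_statistic(*beta_weights(10, 90)), max_c_statistic(*beta_weights(1, 9)))
```

For uniformly distributed risk the calculation returns approximately 0.83, the limit attributed to Diamond in the text. For the two assumed distributions with mean risk 10%, the one with less spread yields the lower limit, consistent with the pattern described for Figure 2.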
A related way to compare models is to examine curves of predicted values or estimated risk (as opposed to true risk, which is unknown).29 In theory, a stronger model should lead to a wider spread of predicted values and, consequently, stronger discrimination. A plot of the predicted risk versus the risk percentile has been used to compare models.29,30 Such distributions may, however, also be insensitive in distinguishing between models. An example is shown in Figure 3, which plots the predicted risk from models in the Women’s Health Study that include age, smoking, total and HDL cholesterol, but with and without SBP in the model. Transposition of the x and y axes yields a plot of the cumulative probability distribution functions. As shown, there is little difference in these curves even though SBP is the strongest risk predictor after age. The populations in the plot also do not tell us whether 1 model estimates risk more accurately or how the predicted risks differ for individuals with the 2 models. A look at conditional or joint distributions of predicted risk, as described below, may give more insight.
Clinical Risk Reclassification
Most important for clinical risk prediction is whether a new model can more accurately stratify individuals into higher or lower risk categories of clinical importance. The Adult Treatment Panel III, for example, uses estimated 10-year risk categories in its treatment guidelines.1 If such risk stratification can be made more accurate, the model will be improved.
Table 3 presents the results of risk stratification with models that include age, SBP, smoking, and total cholesterol, but with and without HDL-C. The c statistic only changed from 0.77 to 0.78 when HDL-C was included with the other variables (Table 1, bottom), despite a relative risk of 0.54. However, of women classified at 5% to 20% 10-year risk in the model without HDL-C, >34% changed risk category when HDL-C was included in the model. More important, the new risk estimate that used HDL-C was a more accurate representation of actual risk for all but 6 of 1920 women reclassified. A similar result has been shown for the addition of C-reactive protein to models that include traditional risk factors.31 If HDL-C were not useful in risk prediction, this reclassification would occur randomly. When a completely random variable was added to the full model above, <1% were reclassified in each risk category, and roughly half of these were more accurate. When all the traditional risk factors were added to a model with age only, 7% of those at <5% risk and >60% of those at 5% to 20% risk were reclassified more accurately.
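A reclassification table of the kind described here cross-classifies each individual by risk category under the 2 models and records the number of subjects and observed events in each cell. The sketch below shows one way to build such a table. The cut points of 5%, 10%, and 20%, the function names, and the 6 subjects are all hypothetical; in a real cohort the observed event rate in each off-diagonal cell would be compared with the old and new category ranges to judge which model classified more accurately.

```python
from collections import defaultdict

def risk_category(risk, cut_points=(0.05, 0.10, 0.20)):
    """Index of the risk category (0 = lowest); cut points are assumed."""
    for k, cut in enumerate(cut_points):
        if risk < cut:
            return k
    return len(cut_points)

def reclassification_table(risk_old, risk_new, outcome):
    """Map (old category, new category) to [number of subjects, events]."""
    cells = defaultdict(lambda: [0, 0])
    for r_old, r_new, y in zip(risk_old, risk_new, outcome):
        cell = cells[(risk_category(r_old), risk_category(r_new))]
        cell[0] += 1
        cell[1] += y
    return dict(cells)

def percent_reclassified(table, old_category):
    """Percentage of subjects in an old category who change category."""
    total = sum(n for (old, _new), (n, _e) in table.items() if old == old_category)
    moved = sum(n for (old, new), (n, _e) in table.items()
                if old == old_category and new != old)
    return 100.0 * moved / total

# Hypothetical predicted risks from models without and with a new marker.
risk_without = [0.03, 0.07, 0.08, 0.12, 0.15, 0.25]
risk_with = [0.02, 0.04, 0.09, 0.22, 0.16, 0.30]
events = [0, 0, 0, 1, 0, 1]

table = reclassification_table(risk_without, risk_with, events)
for cell in sorted(table):
    n, e = table[cell]
    print(cell, n, e, e / n)
print(percent_reclassified(table, 1), percent_reclassified(table, 2))
```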
To assess the potential for reclassification, risk could be estimated over a range of values of a new biomarker to determine whether it may be important to measure in an individual. Suppose that a woman’s age, SBP, smoking status, and total cholesterol are known, but that her HDL-C is not. Suppose that, with the assumption of a reference value for HDL-C of 50 mg/dL, her estimated 10-year risk is 8%. This could vary from about 5% for an HDL-C of 80 mg/dL to about 14% for an HDL-C of 30 mg/dL. Figure 4 shows how a woman’s absolute risk estimate may vary with changes in HDL-C compared with risk at the reference HDL-C of 50 mg/dL given her other risk factors. For those at low risk, the additional information is minimal, whereas for those at higher risk the impact on risk of disease is more substantial. If the difference in risk over the range of HDL-C is clinically important, then a test could be ordered to obtain the woman’s actual posttest probability. Although other factors such as age32 must be considered, such focus on intermediate categories of risk is an option for novel predictors or biomarkers, which are difficult or expensive to obtain.
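The calculation described in this paragraph can be sketched under a proportional hazards model, in which 10-year risk at a given HDL-C equals 1 − (1 − reference risk)^HR, where HR is the hazard ratio relative to the reference HDL-C. The coefficient of −1.16 per unit of log HDL-C used below is a hypothetical value chosen only because it roughly reproduces the figures quoted in the text; it is not an estimate from the Women’s Health Study, and the log transformation of HDL-C is likewise an assumption.

```python
import math

def risk_at_hdl(hdl, reference_risk=0.08, reference_hdl=50.0,
                beta_log_hdl=-1.16):
    """10-year risk at a given HDL-C (mg/dL) under proportional hazards,
    given the risk at a reference HDL-C. The coefficient is hypothetical."""
    hazard_ratio = math.exp(beta_log_hdl * math.log(hdl / reference_hdl))
    return 1.0 - (1.0 - reference_risk) ** hazard_ratio

for hdl in (30, 40, 50, 60, 80):
    print(hdl, round(risk_at_hdl(hdl), 3))
```

With these assumed inputs, the estimated risk is 8% at the reference value, about 14% at an HDL-C of 30 mg/dL, and about 5% at 80 mg/dL. Repeating the calculation with a lower reference risk shows a much narrower range, which is the basis for focusing measurement on those at intermediate risk.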
The estimated risk or predicted values, and how well these predict actual risk, may be a more important aspect of a prognostic model than sensitivity and specificity, on which the ROC curve is based. Even in diagnostic testing, patients (and examining physicians) are interested in whether they have the disease given a test result rather than their probability of having a positive test given the presence or absence of disease, as expressed by the sensitivity and specificity.33,34 If a patient has hypertension, he or she may not be interested in whether everyone with a myocardial infarction has hypertension, but rather his/her chances of having a myocardial infarction. The predictive value, or posttest probability, can thus be more relevant for patient care. It may be especially important for prognostic models in which the clinical question is the chance of disease development in the future given current risk factors.
When the goal of a predictive model is to categorize individuals into risk strata, the assessment of such models should be based on how well they achieve this aim. Inclusion of novel risk factors in risk prediction equations could lead to more accurate risk classification, despite little change in the c statistic. The use of a single, somewhat insensitive, measure of model fit such as the c statistic can erroneously eliminate important clinical risk predictors from consideration in scoring algorithms.
A full discussion of model fitting and validation is beyond the scope of this paper (see Harrell2), but some simple suggestions for comparison of predictive models are shown in Table 4. First, a sensitive measure, such as the likelihood ratio test, or the Bayes information criterion, should be used to determine global model fit. The Bayes information criterion applies a penalty for the number of variables and can compare models that are not nested. It is related to the posterior probability that the model is correct, and is a conservative criterion for model selection. Second, measures of calibration and discrimination, such as the Hosmer-Lemeshow statistic and the c statistic, can be informative and should also be assessed. When these statistics give different answers, it may be that fit is better for a subset of individuals, such as those at higher risk, and predicted risks for individuals should be compared. One can determine the extent of reclassification in clinically important risk categories, and which model classifies more accurately. Finally, an important criterion for a new marker’s usefulness in practice is whether its measurement could lead to different treatment decisions.
Ultimately the decision whether to include a new risk factor in prediction models depends on relative costs, both in terms of dollars and in the potential for illness prevented and lives saved.28 In the setting of prospective risk prediction, the proportion of patients reclassified correctly, rather than the c statistic, would seem to have more relevance for such calculations. Currently, several potential biomarkers for cardiovascular risk have been proposed by various investigators. Although individual predictors may add incremental value to risk prediction, the possibilities for model improvement are greater for combinations of markers. The most promising of these novel risk factors should thus be examined rigorously and simultaneously to evaluate their potential role in improved models for clinical risk prediction.
The author wishes to thank Fran Cook for helpful comments and suggestions.
Sources of Funding
This work was supported by a grant from the Donald W. Reynolds Foundation (Las Vegas, Nevada). The Women’s Health Study cohort is supported by grants (HL-43851 and CA-47988) from the National Heart, Lung, and Blood Institute and the National Cancer Institute, both in Bethesda, Md.
Harrell FE Jr. Regression Modeling Strategies. New York: Springer; 2001.
Hosmer DW, Lemeshow S. A goodness-of-fit test for the multiple logistic regression model. Comm Stat. 1980; A10: 1043–1069.
D’Agostino RB, Griffith JL, Schmidt CH, Terrin N. Measures for evaluating model performance. In: Proceedings of the Biometrics Section. Alexandria, VA: American Statistical Association, Biometrics Section; 1997: 253–258.
Blankenberg S, McQueen MJ, Smieja M, Pogue J, Balion C, Lonn E, Rupprecht HJ, Bickel C, Tiret L, Cambien F, Gerstein H, Münzel T, Yusuf S, for the HOPE Study Investigators. Comparative impact of multiple biomarkers and N-terminal pro-brain natriuretic peptide in the context of conventional risk factors for the prediction of recurrent cardiovascular events in the Heart Outcomes Prevention Evaluation (HOPE) study. Circulation. 2006; 114: 201–208.
Levy D, Labib SB, Anderson KM, Christiansen JC, Kannel WB, Castelli WP. Determinants of sensitivity and specificity of electrocardiographic criteria for left ventricular hypertrophy. Circulation. 1990; 81: 1144–1146.
Wilson PW, D’Agostino RB, Levy D, Belanger AM, Silbershatz H, Kannel WB. Prediction of coronary heart disease using risk factor categories. Circulation. 1998; 97: 1837–1847.
Pepe MS, Janes H, Longton G, Leisenring W, Newcomb P. Limitations of the odds ratio in gauging the performance of a diagnostic, prognostic, or screening marker. Am J Epidemiol. 2004; 159: 882–890.
Drizd T, Dannenberg AL, Engel A. Blood pressure levels in persons 18–74 years of age in 1976–80, and trends in blood pressure from 1960 to 1980 in the United States. Vital Health Stat 11. 1986; 234: 1–68.
Carroll M, Sempos C, Briefel R, Gray S, Johnson C. Serum lipids of adults 20–74 years: United States, 1976–80. Vital Health Stat 11. 1993; 242: 1–107.
Gail MH, Pfeiffer RM. On criteria for evaluating models of absolute risk. Biostatistics. 2005; 6: 227–239.
Pepe MS, Feng Z, Huang Y, Longton GM, Prentice R, Thompson IM, Zheng Y. Integrating the predictiveness of a marker with its performance as a classifier. UW Biostatistics Working Paper Series, #289. 2006. Available at: http://www.bepress.com/uwbiostat/paper289. Accessed October 20, 2006.
Folsom AR, Chambless LE, Ballantyne CM, Coresh J, Heiss G, Wu KK, Boerwinkle E, Mosley TH Jr, Sorlie P, Diao G, Sharrett R. An assessment of incremental coronary risk prediction using C-reactive protein and other novel risk markers: the Atherosclerosis Risk in Communities study. Arch Intern Med. 2006; 166: 1368–1373.
Ridker PM, Cook NR. Should age and time be eliminated from cardiovascular risk prediction models? Rationale for the creation of a new national risk detection program. Circulation. 2005; 111: 657–658.