Ready-Made, Recalibrated, or Remodeled?
Issues in the Use of Risk Indexes for Assessing Mortality After Coronary Artery Bypass Graft Surgery
Background—Risk indexes for operative mortality after cardiac surgery are used for comparative profiling of surgeons or centers. We examined whether clinicians and managers should use an existing index without modification, recalibrate it for their populations, or derive a new model altogether.
Methods and Results—Drawing on 7491 consecutive patients who underwent isolated CABG at 2 Toronto teaching hospitals between 1993 and 1996, we compared 3 strategies: (1) using a ready-made model originally derived and validated in our jurisdiction; (2) recalibrating the ready-made model to better fit the population; and (3) deriving a new model with additional risk factors. We assessed statistical accuracy, ie, area under a receiver-operator characteristic curve (ROC); precision, ie, statistical goodness-of-fit; and actual impact on both risk-adjusted operative mortalities (RAOM) and performance rankings for 14 surgeons. The new model was slightly more accurate than the ready-made model (ROC, 0.78 versus 0.76; P<0.05), albeit not different from the recalibrated model (ROC, 0.77). The ready-made model showed poor fit between the predicted and observed results (P<0.001), leading to significant underestimation of RAOM (1.6±0.2%) compared with the other strategies (2.5±0.2%; P=0.048). Remodeling also changed the performance rankings among half the surgeons with higher RAOM.
Conclusions—Poorly calibrated risk algorithms can bias the calculation of RAOM and alter the results of surgeon-specific profiles. Any existing index used for risk assessment in cardiac surgery should be episodically recalibrated or compared with new models derived from local subjects to ensure that its performance remains optimal.
Along with the proliferation of public “report cards” on cardiac surgery in the United States, researchers have published many predictive rules or risk-adjustment algorithms for mortality, morbidity, and length of hospital stay after surgery.1–15 These risk indexes aim at identifying and weighting the patient characteristics that affect the probability of specific adverse outcomes. Indexes are then used retrospectively to adjust for case-mix differences among surgeons and centers when performance profiles are compiled. They are also used prospectively for patient counseling and for the identification of high-risk patient subgroups for special care or research.
Patient populations and the risk factors associated with adverse outcomes do change over time and also differ between centers. Thus, risk models derived and validated in 1 locale usually perform less well when applied in other settings or even to more contemporary patients in the same setting.6,11,12,15,16 Clinicians and managers considering the use of a risk index accordingly have 3 basic options:
They can use an existing, external index, knowing that the identified risk factors or at least the weights assigned to them may not be ideal for their patient populations (“ready-made”).
They can accept the risk factors in a published index but recalibrate the weights assigned to those factors so that the index better fits their own patient population (“recalibrate”).
They can derive an internal index from their own data (“remodel”).
In this article, we explore the implications of these options for assessment of operative mortality using a detailed data set with information on consecutive patients undergoing isolated CABG at 2 Toronto teaching hospitals. Specifically, we have compared the analytical strategies of recalibrating and remodeling against a ready-made rule developed by the authors for the Ontario Cardiac Care Network (CCN).1
We examined clinical risk factors and operative mortality (OM) for all 7491 patients undergoing isolated CABG under the care of 14 surgeons at The Toronto Hospital (n=5343) and Sunnybrook Health Science Center (n=2148) between April 1, 1993, and December 31, 1996. Details of this database have been published previously.17 Data from earlier years were deliberately excluded because they were included in the multicenter data set used to derive and validate the CCN index. Twenty patients were missing ≥1 data elements used in the analyses.
Data were collected and managed in dBASEIV data sets. The SAS for PC18 and BMDP/DYN LR19 programs were used for statistical analysis. We were not seeking to recalibrate or rederive an index for external and general application. Thus, we forwent split-sample methods (ie, separate derivation and validation steps) with both the recalibrating and remodeling strategies described below.
Generating the Models
Ready-made: We used a 6-variable risk index developed for the Ontario CCN, drawing on all patients undergoing cardiac surgery in the province between April 1991 and March 1993. For isolated CABG outcomes, 1 variable (type of surgery) was set to its null value, leaving only 5 variables and their associated risk scores. The original regression coefficients for each variable were used to calculate the patient-specific predicted probability (P) of OM from the logistic formula

P = e^(β0 + Σβ) / [1 + e^(β0 + Σβ)]

where β0 is the constant and β is the regression coefficient for each level of a risk factor in the model that characterizes the patient.
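As a concrete illustration, the logistic calculation above can be sketched as follows; the intercept and coefficient values shown are hypothetical and purely for illustration, not the published CCN values.

```python
import math

def predicted_probability(beta0, betas):
    """Predicted probability of operative mortality from a logistic model:
    P = exp(z) / (1 + exp(z)), where z is the constant plus the sum of the
    coefficients for the risk-factor levels that characterize the patient."""
    z = beta0 + sum(betas)
    return math.exp(z) / (1.0 + math.exp(z))

# Hypothetical values (not the published CCN coefficients): an intercept
# and two risk-factor levels present in this patient.
p = predicted_probability(-4.5, [0.8, 0.6])
```

By construction, a patient with no active risk factors and z = 0 has P = 0.5, and each additional positive coefficient increases the predicted probability.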
Recalibrated: The 5 explanatory variables from the CCN index were included in logistic regression analyses of the 1993 to 1996 data set to reestimate mortality-specific regression coefficients and related risk scores. The predicted probability of OM, as well as the total risk score, was calculated for each patient as for the CCN model.1,20
Remodeled: The University of Toronto cardiac surgery registry covers a wide variety of potential risk factors.17 All explanatory variables with a univariate P value <0.25, as well as those found commonly in other major risk indexes but failing to meet the critical α-level, were submitted to logistic regression analyses that used forward selection combined with backward elimination.21,22 The best logistic regression model was determined by 2 diagnostic criteria: the Hosmer-Lemeshow goodness-of-fit statistic21 and the area under the receiver-operator characteristic (ROC) curve.15,23,24 As in the CCN index, odds ratios were rounded to the nearest integer, and an additive risk index was created. Because of small numbers, risk scores of 18 or more were collapsed into a single category (score ≥18 for OM).
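The score construction described above (rounding each odds ratio to the nearest integer, summing the rounded values for the risk factors a patient carries, and collapsing high scores) can be sketched as a small helper. This is an illustrative sketch, not the registry's actual code; the cap of 18 follows the text.

```python
def risk_score(odds_ratios, cap=18):
    """Additive risk index: round each odds ratio to the nearest integer
    and sum; scores at or above the cap are collapsed into one category.
    int(x + 0.5) gives conventional rounding (Python's round() rounds
    halves to the nearest even integer)."""
    score = sum(int(or_ + 0.5) for or_ in odds_ratios)
    return min(score, cap)
```

For example, a patient whose active risk factors carry odds ratios of 2.4 and 3.6 receives a score of 2 + 4 = 6.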
Statistical Comparisons of Index Performance
Statistical precision, or model calibration, was evaluated by the Hosmer-Lemeshow goodness-of-fit statistic.21 We also plotted the mean predicted probability of OM against observed OM for each total risk score15,22 and performed a weighted linear regression to evaluate whether predicted risk systematically overestimated or underestimated observed risk. A slope of 1 and an intercept of 0 would indicate a perfect fit of predicted to observed outcomes.22 Differences in slopes and intercepts among the 3 regressions were evaluated by ANCOVA with pairwise comparisons as appropriate.
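A minimal sketch of the Hosmer-Lemeshow computation may clarify what "goodness of fit" means here: patients are sorted by predicted risk, split into roughly equal groups (conventionally deciles), and observed deaths are compared with expected deaths in each group. The study used standard statistical packages for this; the code below is an assumed, simplified illustration.

```python
def hosmer_lemeshow(predicted, observed, groups=10):
    """Hosmer-Lemeshow chi-square statistic. `predicted` holds each
    patient's predicted probability of death; `observed` holds 0/1
    outcomes. The result is compared with a chi-square distribution
    on (groups - 2) degrees of freedom; a large value signals poor
    calibration."""
    pairs = sorted(zip(predicted, observed))  # sort patients by risk
    n = len(pairs)
    chi2 = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        expected = sum(p for p, _ in chunk)   # expected deaths in group
        deaths = sum(o for _, o in chunk)     # observed deaths in group
        pbar = expected / len(chunk)
        denom = len(chunk) * pbar * (1.0 - pbar)
        if denom > 0:
            chi2 += (deaths - expected) ** 2 / denom
    return chi2
```

A perfectly calibrated model, where observed deaths match expected deaths in every group, yields a statistic of 0.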
Clinically Salient Comparisons of Index Performance
Expected mortality for each surgeon for each model was calculated as the mean predicted probability of OM based on the prevalence of risk factors in his or her caseload. We calculated risk-adjusted OM (RAOM) by dividing the observed mortality by the expected mortality and then multiplying that ratio by the overall mortality rate in the study population (0.0226). This result can be interpreted as the mortality rate a surgeon would have if his or her case mix were similar to the average case mix in the study.25
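The RAOM calculation above reduces to a one-line formula. The sketch below uses the study's overall mortality rate of 0.0226 as the default; the observed and expected rates in the example are hypothetical.

```python
def raom(observed_rate, expected_rate, overall_rate=0.0226):
    """Risk-adjusted operative mortality: (observed / expected) times the
    overall study mortality rate -- the rate a surgeon would have had if
    his or her case mix matched the study average."""
    return observed_rate / expected_rate * overall_rate

# If a model overestimates expected mortality, RAOM is biased downward:
# raom(0.020, 0.020) returns 0.0226, but raom(0.020, 0.030) returns ~0.0151.
```

This makes the calibration issue concrete: inflating the denominator (expected mortality) flatters every surgeon's risk-adjusted result.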
Within each model, the difference between mean observed and mean expected mortality was evaluated by paired t test against the null hypothesis (H0) that the difference equaled zero.15 The differences in RAOM across 14 surgeons and 3 models were evaluated by ANOVA.
For each model, the surgeons were ranked from 1 (lowest RAOM) to 14 (highest RAOM); surgeon numbers were assigned according to each surgeon’s rank under the original CCN model. We examined qualitatively whether the ranking of surgeons changed and also calculated Spearman rank correlation coefficients (rs) across models.
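For tie-free rankings such as these surgeon ranks, the Spearman coefficient has a simple closed form, sketched below for illustration.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation for two rankings without ties:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
    where d_i is the difference in an item's rank between the two lists."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

Identical rankings give rho = 1; completely reversed rankings give rho = −1, so values near 1 (as observed here across models) indicate that most surgeons kept their relative positions.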
Risk Factors and Risk Groups for the 3 Models
There were 169 deaths (2.26%). The prevalence of risk factors and their univariate association with OM are shown in Table 1⇓.
The independent predictors of operative mortality for the new internal model were left ventricular grade, age group, previous bypass surgery, the timing of surgery, sex, triple-vessel disease, left main coronary artery disease, peripheral vascular disease, recent myocardial infarction, acute coronary insufficiency, and a history of hypertension. Given its recurrence as a risk factor in other published indexes, we forced preoperative renal insufficiency into the model, but its inclusion unfavorably affected both model discrimination and precision, possibly because of its low prevalence in our database or collinearity with other important predictors.
Table 2⇓ contains the original ORs for OM and risk weights for the ready-made CCN model as well as the ORs, their 95% CIs, and risk weights from the recalibrated and remodeled indexes. When the CCN and remodeled indexes are compared, all 5 original risk factors do recur, but there are differences in weights, most notably for repeat operation and grade IV ventricular function.
Table 3⇓ shows the number of patients defined by each risk score for each model and the observed OM for that model’s score. The addition of 4 explanatory variables to the remodeled index redefined “risk” in 87% of patients who bore at least 1 of the 4 conditions and resulted in a lower prevalence of both patients and OM at the lower risk scores.
Statistical Parameters of Index Performance
As shown in Table 4⇓, the longer, remodeled index showed a small increment in accuracy over the original CCN index (ROC, 0.78 versus 0.76; P<0.05), with the recalibrated index between them (ROC, 0.77). The ready-made CCN model showed significantly poor fit between predicted and observed results (P<0.001), whereas the other models had acceptable calibration.
Another method of assessing model fit is shown in Figure 1⇓, which depicts the mean predicted probability of OM versus observed OM for each cumulative risk score. The original CCN model overestimated predicted OM (top panel). In contrast, for the recalibrated and new models, slopes appeared closer to unity and intercepts closer to zero (data available on request). ANCOVA confirmed that there was a significant difference in intercepts among models. Pairwise comparisons showed that, as with the ROC curve area, the significant difference arose only when the original CCN model and the newly remodeled index were compared (P<0.0001).
OM: Observed, Expected, and Adjusted
We compared observed and expected mortality for each model. The observed minus expected mortality rate was significantly different from zero for the CCN model (−1.04±0.4%; P=0.011) but not with recalibration (0.17±0.3%; P=0.62) or remodeling (0.21±0.3%; P=0.55). Intermodel comparisons by ANOVA confirmed that the CCN model had a significantly greater disparity between observed and predicted results compared with both the recalibrated (P=0.017 versus CCN) and remodeled (P=0.014 versus CCN) indexes.
As shown in Figure 2⇓, this higher expected OM in the ready-made CCN model resulted in an underestimation of risk-adjusted OM (P=0.048 compared with the remodeled index). RAOM was similar to unadjusted OM for the recalibrated and remodeled indexes.
Table 5⇓ depicts the relative ranking of surgeons for each of the 3 models from lowest RAOM (rank=1) to highest (rank=14). Recalibration resulted only in surgeons 8 and 9 exchanging positions. Remodeling, however, resulted in surgeon 13 shifting up by 3 positions and surgeons 10 and 11 each moving down 2 ranks. Despite these latter shifts, the overall Spearman correlation coefficient showed a significant association between ranks for the CCN and new models (rs=0.982, P=0.012) because the positions of the first 7 surgeons were stable across models. However, examining the 7 higher-ranking surgeons revealed a diminished correlation (rs=0.857, P=0.07).
We compared 3 possible strategies for assessing risk-adjusted outcomes of cardiac surgery: off-the-shelf use of a simple, multipurpose risk index; recalibrating that published index to ensure a better fit to the available data; and deriving a new, internal model with additional risk factors. The comparisons were performed in a clinical database of all isolated CABG patients undergoing operation between April 1, 1993, and December 31, 1996, at 2 large teaching hospitals in Toronto, Canada.
Our rationale was to offer some guidance to providers who must respond to the burgeoning literature on risk indexes. We accordingly discuss the implications of our findings in 3 areas.
Implications for Benchmarking Improvements Over Time
The original CCN model was derived and validated for earlier years. As such, it tended to overestimate the chances of postoperative death for high-risk patients, with the result that risk-adjusted outcomes appeared to improve. Experience with the Society of Thoracic Surgeons5,26 risk-adjustment algorithm has been similar. Their original 23-variable model was derived with Bayesian methods from data on tens of thousands of subjects and scores of centers. Nonetheless, in recent years, the model has predicted a rising probability of operative mortality owing to an increasing prevalence of high-risk patients, even as observed operative mortality has decreased.
If the goal of an outcomes analysis is to determine trends in mortality over time, then arguably a risk model derived and validated from an earlier period can be used, because it anchors practice historically and controls for the evolution of case mix. One limitation is that improved reporting of risk factors (“upcoding”) in contemporary groups of patients may lead to a spurious impression that risk-adjusted outcomes are improving. For example, critics have charged that this phenomenon, rather than the impact of public report cards, explains the improved outcomes of cardiac surgery in New York State.27
Implications for Contemporaneous Quality Management or Patient Counseling
Setting aside temporal benchmarking, the usual goals of an outcomes analysis are contemporaneous quality management (comparing the risk-adjusted outcomes of surgeons or centers) or risk prediction (identifying patients in high-risk subgroups). For these purposes, our findings suggest that practitioners and managers should consider recalibrating an existing index or developing a new model with the data at hand.
Recall first that our ready-made CCN index was originally derived and validated in Ontario with data from 9 centers, including the 2 hospitals that contributed subsequent patients to the present study. Thus, it is perhaps not surprising that the CCN index still showed good performance in this study, with an ROC area of 0.76. However, the CCN index showed poor calibration associated with overestimation of expected operative mortality, a feature of model performance that is undesirable for both prospective risk prediction and post hoc risk adjustment. Poor calibration presumably occurred not only because of temporal shifts in case mix but also because the CCN index was developed in a data set that combined ischemic and valvular heart disease, and we were applying it to an isolated CABG series. Parsonnet et al,11 for example, observed a deterioration in model performance when their predictive rule, developed in a data set that combined CABG with valve surgery, was subsequently tested for CABG and valve procedures separately. Similarly, the CCN index was derived to cover both OM and length-of-stay outcomes, and the recalibration here was outcome specific.
Recalibration was therefore a promising strategy in this context. More generally, it allows a group of practitioners to remain efficient in data collection, restricting their efforts to careful documentation of a limited number of prespecified variables. By reweighting these variables and fine-tuning the risk index, analysts may sometimes mitigate shifts in case mix and outcomes that occur either over time or as the index is applied to centers other than those from which it was derived.
Indeed, recalibration did lead to some improvements in model performance in this test case. Whereas the original CCN index demonstrated significantly poor fit with data from this new series of almost 7500 patients, the recalibrated index fit the data well, and we avoided overestimating operative risk. The recalibrated model also showed discrimination similar to that of a new and more complex model and yielded similar relative ranks for most surgeons. However, for the higher RAOM surgeons, the new model did lead to some alterations in surgeon outcome rankings, an observation that underscores the potential practical importance of even small marginal improvements in model accuracy from a statistical standpoint.
In sum, for clinicians and managers who have developed their own index in the past or found an index that shows acceptable performance in their patient populations, episodic recalibration of that index may suffice. However, in those instances in which there are profound differences in case mix or event rates, it will be prudent to derive a new model with the data at hand.
How Many Risk Factors Are Enough?
Recently, the Society of Thoracic Surgeons published its updated risk model for 1995,14 developed from a database of >138 000 patients who underwent surgery at 374 hospitals throughout the United States and Canada. The model shows excellent accuracy but now requires 33 predictor variables. Apart from increased costs of data collection and increased risks of data “gaming” or random errors, the large numbers of explanatory variables also increase the chances of statistical overfitting and model instability when applied to specific centers.
In contrast, the original CCN model was designed to be parsimonious and robust for multicenter comparisons.1 The new model adds only 4 variables, bringing the total to 9 for isolated CABG. These factors are similar to those reported previously by our group17 and others1–13 and include those highlighted in recent guidelines from the Working Group Panel on the Cooperative CABG Database Project.28 Despite minor differences in surgeon rankings, this new model had performance characteristics similar to those of the recalibrated CCN model with only 5 variables. This latter result is consistent with our earlier findings on the limited marginal improvements in model performance with increasing numbers of predictor variables.25 Accurate and complete data collection on a constrained set of important variables appears to be a prudent strategy.
Our findings illustrate that temporal and intercenter differences in case mix make it difficult to achieve optimal predictive performance with ready-made risk indexes. This observation argues against the proliferation of published risk indexes in the clinical literature that either affirm well-known prognostic factors or add new variables with minimal marginal impact. We have also demonstrated that recalibration of existing indexes may sometimes be sufficient to ensure adequate risk prediction even when models are parsimonious. As a precaution, however, we suggest that centers collect data fastidiously on a modest-sized set of key variables such as those suggested by the Working Group Panel28 and undertake intermittent remodeling to ensure that emerging risk factors are not inadvertently overlooked.
J. Ivanov is supported by a fellowship from the Heart and Stroke Foundation of Canada. Dr Tu is a Scholar and Dr Naylor a Senior Scientist of the Medical Research Council of Canada. Drs Tu and Naylor also receive personnel support from the Institute for Clinical Evaluative Sciences, which is funded by the Ontario Ministry of Health. The authors are deeply indebted to the cardiac surgeons at The Toronto Hospital and Sunnybrook Health Science Center who contributed patients to this study and allowed us access to their databases. The Toronto Hospital: Dr Tirone E. David (chief), Dr Ronald J. Baird, Dr R.J. Cusimano, Dr Christopher M. Feindel, Dr Irvin H. Lipton, Dr Lynda L. Mickleborough, Dr Charles M. Peniston, Dr Hugh E. Scully, and Dr Richard D. Weisel. Sunnybrook Health Science Center: Dr Bernard S. Goldman (chief), Dr George T. Christakis, and Dr Stephen E. Fremes. The authors also wish to acknowledge the superb contribution to data collection and management by Jeri Severs at the Sunnybrook Health Science Center and Susan Collins at The Toronto Hospital. We also wish to thank Dr Donald Redelmeier, Dr Antoni S.H. Basinski, Keyi Wu, and Kathy Sykora at the Clinical Epidemiology Unit of the Sunnybrook Health Science Center and the Institute for Clinical Evaluative Sciences of Ontario for statistical consultation.
The findings and views in this study are those of the authors; no endorsement by the supporting agencies is implied.
Presented in part at the 70th Scientific Sessions of the American Heart Association, November 9–12, 1997, Orlando, Fla, and published in abstract form (Circulation. 1997;96[suppl I]:I-506).
- Received July 31, 1998.
- Revision received January 22, 1999.
- Accepted January 26, 1999.
- Copyright © 1999 by American Heart Association
Tu JV, Jaglal SB, Naylor CD, the Steering Committee of the Provincial Adult Cardiac Care Network of Ontario. Multicenter validation of a risk index for mortality, intensive care unit stay, and overall hospital length of stay after cardiac surgery. Circulation. 1995;91:677–684.
Tu JV, Mazer CD, Levinton C, Armstrong PW, Naylor CD. A predictive index for length of stay in the intensive care unit following cardiac surgery. Can Med Assoc J. 1994;151:177–185.
O’Connor GT, Plume SK, Olmstead EM, Coffin LH, Morton JR, Maloney CT, Nowicki ER, Tryzelaar JF, Hernandez F, Adrian L, Casey KJ, Soule DN, Marrin CAS, Nugent WC, Charlesworth DC, Clough R, Katz S, Leavitt BJ, Wennberg D. A regional prospective study of in-hospital mortality associated with coronary artery bypass grafting. JAMA. 1991;266:803–809.
O’Connor GT, Plume SK, Olmstead EM, Coffin LH, Morton JR, Maloney CT, Nowicki ER, Levy DG, Tryzelaar JF, Hernandez F, Adrian L, Casey KJ, Bundy D, Soule DN, Marrin CAS, Nugent WC, Charlesworth DC, Clough R, Katz S, Leavitt BJ, Wennberg JE. Multivariate prediction of in-hospital mortality associated with coronary artery bypass graft surgery. Circulation. 1992;85:2110–2118.
Parsonnet V, Dean D, Bernstein AD. A method of uniform stratification of risk factors for evaluating the results of surgery in acquired adult heart disease. Circulation. 1989;79(suppl I):I-3–I-12.
SAS Language Guide for Personal Computers. Cary, NC: SAS Institute Inc; 1988.
Dixon WJ, ed. BMDP Statistical Software Manual. Berkeley, Calif: University of California Press; 1992.
Hosmer DW, Lemeshow S. Applied Logistic Regression. New York, NY: John Wiley & Sons; 1989.
Jones RH, Hannan EL, Hammermeister KE, DeLong ER, O’Connor GT, Luepker RV, Parsonnet V, Pryor DB. Identification of preoperative variables needed for risk adjustment of short-term mortality after coronary artery bypass graft surgery: the Working Group Panel on the Cooperative CABG Database Project. J Am Coll Cardiol. 1996;28:1478–1487.