Comparison of “Risk-Adjusted” Hospital Outcomes
Abstract
Background— A frequent challenge in outcomes research is the comparison of rates from different populations. One common example with substantial health policy implications involves the determination and comparison of hospital outcomes. The concept of “risk-adjusted” outcomes is frequently misunderstood, particularly when it is used to justify the direct comparison of performance at 2 specific institutions.
Methods and Results— Data from 14 Massachusetts hospitals were analyzed for 4393 adults undergoing isolated coronary artery bypass graft surgery in 2003. Mortality estimates were adjusted using clinical data prospectively collected by hospital personnel and submitted to a data coordinating center designated by the state. The primary outcome was hospital-specific, risk-standardized, 30-day all-cause mortality after surgery. Propensity scores were used to assess the comparability of case mix (covariate balance) for each Massachusetts hospital relative to the pool of patients undergoing coronary artery bypass grafting surgery at the remaining hospitals and for selected pairwise comparisons. Using hierarchical logistic regression, we indirectly standardized the mortality rate of each hospital using its expected rate. Predictive cross-validation was used to avoid underidentification of true outlying hospitals. Overall, there was sufficient overlap between the case mix of each hospital and that of all other Massachusetts hospitals to justify comparison of individual hospital performance with that of the remaining hospitals. As expected, some pairwise hospital comparisons indicated lack of comparability. This finding illustrates the fallacy of assuming that risk adjustment per se is sufficient to permit direct side-by-side comparison of healthcare providers. In some instances, such analyses may be facilitated by the use of propensity scores to improve covariate balance between institutions and to justify such comparisons.
Conclusions— Risk-adjusted outcomes, commonly the focus of public report cards, have a specific interpretation. Using indirect standardization, these outcomes reflect a provider’s performance for its specific case mix relative to the expected performance of an average provider for that same case mix. Unless study design or post hoc adjustments have resulted in reasonable overlap of case-mix distributions, such risk-adjusted outcomes should not be used to directly compare one institution with another.
Received November 9, 2007; accepted February 13, 2008.
Outcomes research “seeks to understand the end results of particular health care practices and interventions.”^{1} This may involve investigation of a new drug or procedure compared with standard therapy through the use of either a randomized trial or an observational study. Because of the current health policy emphasis on measuring and improving provider performance,^{2,3} interest has also been increasing in another type of outcomes research referred to as provider profiling.^{4,5} This research focuses on the collection and analysis of outcomes data to evaluate the performance of a physician or a hospital.
Clinical Perspective p 1963
Provider profiling has a number of features that distinguish it from other types of outcomes research. First, unlike trials of new medications or treatment regimens, randomization of patients to hospitals or physicians would often be both impractical and unethical. Thus, profiling studies are almost always observational in nature, relying on data from usual practice settings. In further contrast to drug trials that involve direct comparisons of outcomes for only a few treatments, profiling studies typically assess outcomes for many providers, usually with regard to some population reference standard. Finally, when profiling is based on outcomes measures such as mortality or morbidity, risk adjustment is necessary to account for preexisting conditions that may confound their assessment.
Despite their increasingly widespread use, considerable confusion exists among consumers, the media, payers, and providers as to the correct meaning and interpretation of risk-adjusted outcomes. For example, many incorrectly interpret such outcomes as having “leveled the playing field” to permit direct comparison of one provider with another. Direct comparability may sometimes be justified in an observational study, but this would be fortuitous and is not an inherent characteristic of the study design.
Correct interpretation of the concept of risk-adjusted outcomes is neither a trivial nor a strictly academic concern. Such outcomes are used to designate centers of excellence, to determine reimbursement levels in pay-for-performance programs, to rank institutions, and to classify providers as “outliers.” These determinations may have profound effects on patient access, hospital reputation, referrals, and financial survival.
The goal of this article is to systematically review the fundamental concepts from which the deceptively simple term “risk-adjusted outcome” is derived. We develop the concept of risk-adjusted outcomes in the context of causal inference theory and illustrate the derivation of indirectly standardized mortality ratios, often referred to as O/E (observed/expected) ratios. Key methodological concepts (eg, outlier determination and direct comparison of hospitals) are illustrated through the example of coronary artery bypass grafting surgery (CABG) mortality profiling, in which the difference in outcomes of a hospital compared with the reference standard is generally regarded as a reflection of quality of care.^{5}
Methods
Background
It is useful to consider risk adjustment and standardization as specific applications of causal inference theory, a broad discipline with historical roots in philosophy, mathematical logic, and statistics.^{6–19} This is the foundation for understanding causal effects in health care,^{16,20–24} which can be thought of as the difference between the outcome for a patient when exposed to one treatment (or provider) and the outcome when exposed to another.
A fundamental precept of causality is that only one of a series of potential outcomes can be experienced at any one time.^{7,17,20,23,24} In CABG hospital profiling, a patient can undergo CABG at only one hospital on a given day. Therefore, some method must be used to estimate what would hypothetically have occurred to that patient had he or she undergone surgery at a different hospital. The observed result is referred to as the actual outcome, and the unobservable estimated outcome is the counterfactual.^{7,17,20,23,24} Estimation of this counterfactual outcome, the hypothetical result if treated under a different set of circumstances, is the primary motivator for risk model development. Several approaches have been developed to estimate these potential outcomes for individual patients and subsequently to assess the overall performance of a hospital.
Estimation of Counterfactuals for Risk Adjustment and Standardization
The simplest estimator of a counterfactual would be the average result of treating a similar condition (eg, a CABG procedure) in the overall population or at another specific institution. However, this estimator is likely to be both inaccurate and misleading. Patients are nonrandomly allocated among institutions, and use of crude mortality rates from other hospitals as the counterfactual outcomes would ignore systematic differences among patients such as acuity status. At the other end of the spectrum, the counterfactual outcomes could be determined through randomization,^{15,18,19,25} the most internally valid design. Both measured and unmeasured confounders would be balanced, so the mortality experience of patients undergoing CABG at one hospital could serve as the counterfactual outcome for patients treated at another hospital. However, it is implausible to think that most patients would consent to randomization for anything but truly experimental care; for this reason, almost all profiling studies are conducted with observational data. Matching and stratification are other methods sometimes used to derive counterfactuals, but they quickly become impractical when more than a few predictor variables are considered, the typical case in mortality profiling.
Most profiling studies have relied on regression modeling to derive counterfactual outcomes, and it is the method used here. Risk adjustment, the term commonly used for this approach, refers to the results of statistical regression models that relate the outcome for a specific patient to his or her observed characteristics.^{4,26–29} Then, because the main focus of profiling is to determine how the overall experience of a particular hospital compares to what would be “expected,” the next step is to standardize the results of an institution to the reference population.
Indirect standardization is used for almost all profiling and public report cards. With this method, the expected rate represents what the mortality rate would have been at a hospital given its actual distribution of patients but replacing its observed mortality rates with rates estimated from the entire group of providers. The indirectly standardized mortality ratio, often referred to as the ratio of observed to expected outcomes (O/E ratio), compares the outcomes for the specific distribution of patients at a hospital with their expected results had they been treated by an average provider in the reference population.
Indirect standardization is accomplished by first summing the individual risk probabilities for each patient within a given hospital, computed using the coefficients estimated from the regression model and the patient’s specific distribution of confounders. This yields the expected total number of deaths for that hospital. This counterfactual hospital mortality often is used as the denominator of the ratio of observed to expected mortality (O/E ratio), a form of causal estimand. This O/E ratio is favorable if <1 and unfavorable if >1. As a final step, the O/E ratio may be multiplied by the unadjusted population mortality rate for the procedure to obtain what is often called the risk-adjusted mortality rate but which is more correctly designated the risk-standardized mortality rate (RSMR) or standardized mortality incidence rate (SMIR).^{30–34}
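The arithmetic above can be sketched in a few lines of code. The per-patient probabilities and death counts below are hypothetical stand-ins, not values from the Massachusetts data; in a real analysis the probabilities would come from the fitted risk model.

```python
# Sketch of indirect standardization for a single hospital.
# The per-patient probabilities are hypothetical stand-ins for the
# predictions of the risk-adjustment regression model.

def oe_ratio(observed_deaths, predicted_probs):
    """Observed/expected mortality ratio: observed deaths divided by the
    sum of model-predicted death probabilities (the expected deaths)."""
    expected = sum(predicted_probs)
    return observed_deaths / expected

def risk_standardized_rate(oe, crude_population_rate):
    """RSMR: the O/E ratio scaled by the crude population mortality rate."""
    return oe * crude_population_rate

# 200 hypothetical patients with 5 observed deaths.
probs = [0.02, 0.05, 0.10, 0.01, 0.03] * 40   # expected deaths = 8.4
oe = oe_ratio(5, probs)                        # < 1, ie, better than expected
rsmr = risk_standardized_rate(oe, 0.0225)      # scaled to a 2.25% crude rate
```

An O/E ratio below 1 (here, 5 observed vs 8.4 expected deaths) is favorable, matching the interpretation given above.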
Outlier Determination and the Direct Comparison of Hospitals
Outliers
The main goal of outcomes profiling is to identify differences in hospital quality. Because the risk-standardized rates for each hospital are derived from the reference population, it is most appropriate to determine whether these rates are statistically different from the population average. If so, the hospital is regarded as a statistical outlier. Most commonly, this is achieved by determining whether the 95% interval for a hospital’s risk-standardized mortality estimate includes the overall state average mortality (or, alternatively, whether the interval around its O/E ratio includes 1). If the interval excludes the reference value, the hospital typically is classified as an outlier. An important but overlooked aspect of outlier determination is the effect on expected outcomes when true outlying programs are included in the development of the statistical model. This problem and a potential solution (cross-validated P values) are described further in the Illustration.
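The decision rule described above amounts to a simple interval check; the interval endpoints below are made up for illustration.

```python
def is_outlier(interval_low, interval_high, reference):
    """Flag a hospital as a statistical outlier when its 95% interval
    excludes the reference value (the population average mortality,
    or 1.0 when working on the O/E-ratio scale)."""
    return not (interval_low <= reference <= interval_high)

# An O/E interval straddling 1 is unremarkable; one entirely above 1 is not.
```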
Risk Factor Distribution and Direct Comparability
In addition to comparing individual hospitals with the reference population to determine outlier status, some consumers also seek to directly compare individual hospitals with one another. A problem with direct comparisons that has been widely recognized by statisticians, and that was the motivation for the development of balancing methods such as propensity scores,^{14–16,18,19,35–41} is that of covariate imbalance. Absent randomization, the patient cohorts from 2 hospitals may be unbalanced with regard to the frequency of confounders. The implications of such imbalance have received little attention in the context of risk-adjusted outcomes profiling, which in turn has led to both misunderstanding and misuse.
In general, only the results for those patients with comparable risk profiles (eg, that overlap the risk distributions of the 2 providers) should be directly compared. Consider the extreme but not uncommon example of a state or region with many small community hospitals and 1 or 2 tertiary/quaternary hospitals. As a general principle, direct comparison of a community to a tertiary hospital would be appropriate only for the relatively small proportion of patients who overlap between the 2 hospitals. Although the results for the overlap group can be used to estimate expected outcomes for patients not in common between the 2 institutions, this form of extrapolation depends heavily on assumptions that are typically unverifiable. For example, the indirectly risk-standardized results at a community hospital apply to its specific type of patients, who might be relatively low risk compared with a tertiary center. It cannot be assumed that a favorable risk-standardized mortality at the community hospital, based on its lower-risk case mix, could necessarily be achieved if it were confronted with the higher-risk case mix of the tertiary center, including some types of patients that it rarely, if ever, encounters.
Propensity scores are a useful method to construct treatment and control groups that may differ in number of subjects but are similar to randomized studies in their balanced distribution of all measured confounders.^{14–16,18,19,35–41} The propensity score is the likelihood of receiving treatment of one type compared with another (or in the case of profiling, exposure to one or another specific provider) on the basis of a patient’s set of observed characteristics. It provides a convenient scalar (1-number) summary of the information contained in all the patient’s measured covariates. The propensity score may then be used for matching, stratification, blocking, or weighting in regression modeling.
The problem of covariate imbalance has received little attention in provider profiling studies.^{42–45} If the propensity score provides a convenient summary estimate of individual patient risk, then each provider will have a specific distribution of propensity scores that characterizes its “case mix.” For 2 providers to be comparable, the area of overlap in their respective propensity score distributions should be identified. As shown in Figure 1A, 2 hypothetical hospitals (hospitals 1 and 2) might by chance (or as a result of randomization) have substantial overlap in their propensity score distributions. The area of shaded overlap in Figure 1A indicates that a majority of patients treated at hospital 2 have a similar propensity to have been treated at hospital 1. For almost every patient who underwent CABG at hospital 1, we can find a “similar” patient from among those having CABG at hospital 2.
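A crude version of this overlap check can be scripted directly. The log-odds scores below are hypothetical, and a production analysis would compare full density estimates (as in Figure 1) rather than simple ranges.

```python
def nonoverlap_fraction(scores_a, scores_b):
    """Fraction of hospital A's patients whose propensity score (on the
    log-odds scale) falls outside the range observed at hospital B.
    A large fraction warns against direct A-versus-B comparison."""
    lo, hi = min(scores_b), max(scores_b)
    outside = [s for s in scores_a if s < lo or s > hi]
    return len(outside) / len(scores_a)

# Hypothetical scores: 2 of A's 10 patients lie beyond B's observed range.
a = [-2.0, -1.0, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 5.5, 6.0]
b = [-3.0, -1.5, 0.0, 1.0, 2.0, 4.0]
frac = nonoverlap_fraction(a, b)   # 0.2
```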
Figure 1B depicts a different set of 2 hospitals with significant imbalance in their average patient risk as measured by their propensity score distributions. Only a small percentage of patients at the 2 institutions have comparable risk profiles. It is only the group of patients who overlap from which relative performance inferences should be drawn.
Illustration
Study Population
We examined data from all adults (≥18 years of age) undergoing isolated CABG at all acute-care, nonfederal hospitals in Massachusetts between January 1, 2003, and December 31, 2003. Data collection is mandated by the Massachusetts Department of Public Health.
Data Sources
We used clinical data submitted to a data coordinating center (MassDAC) located in the Harvard Medical School Department of Health Care Policy. Data are collected by trained hospital personnel using the Society of Thoracic Surgeons National Adult Cardiac Database instrument.^{46} Supplemental patient and surgeon identifying information also is collected using additional data forms developed by MassDAC. The data are sent electronically to MassDAC, where they are cleaned, audited, and verified using internal and external procedures.
End Points
The primary end point is the hospital-specific, risk-standardized, all-cause, 30-day mortality rate. Mortality data are obtained in 2 ways. First, hospital personnel are responsible for collecting 30-day mortality for all patients undergoing cardiac surgery. Second, patient identifying information is linked to this registry from the Massachusetts Registry of Vital Records and Statistics to verify date of death. The registry includes mortality information for Massachusetts residents and all records of deaths that occur within the Commonwealth regardless of the state of residence. Because MassDAC has access to Social Security numbers, the Social Security Index Web site^{47} also is searched to identify deaths, including those reported to the Social Security Administration by funeral homes or by relatives.
Statistical Analyses
Distributions of clinical and demographic variables are computed and stratified by hospital to identify unusual or extreme values. Because of data collection protocols and auditing procedures, no data are missing in the clinical variables or outcomes for the mortality models.
Risk Adjustment
We first estimated a propensity score model in which the dependent variable was multinomial, assuming 14 distinct values corresponding to the 14 hospitals (with 1 hospital serving as the reference group). The specific clinical variables included in the model were selected from a literature review of existing models and expert opinion from a panel of senior cardiac surgeons. A multinomial logistic regression model was estimated, and predictions for each patient in the sample were subsequently obtained. Thus, each patient had 14 estimated probabilities, each reflecting the likelihood that the patient would undergo CABG at 1 specific hospital rather than 1 of the remaining 13 hospitals. For this reason, the sum of the 14 estimated probabilities for each patient was 1.
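The sum-to-1 property follows from the multinomial-logit (softmax) form of such a model. A minimal sketch, with hypothetical linear predictors for one patient:

```python
import math

def softmax(linear_predictors):
    """Multinomial-logit probabilities: one linear predictor per hospital
    (the reference hospital's predictor is fixed at 0). The resulting
    propensity scores necessarily sum to 1 for each patient."""
    exps = [math.exp(x) for x in linear_predictors]
    total = sum(exps)
    return [e / total for e in exps]

# One patient's hypothetical predictors for 14 hospitals (reference last, at 0).
probs = softmax([0.2, -1.1, 0.5, 0.9, -0.3, 0.6, 1.2, -0.8, 0.4, 0.1,
                 -0.5, 0.7, -1.4, 0.0])
```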
To compare the performance of each hospital with that of its peers, it is necessary to assess whether the population of patients undergoing surgery at a particular hospital is comparable to that of all other Massachusetts hospitals on the basis of their observed characteristics. To accomplish this, we examined the overlap between the distribution of the propensity scores for patients undergoing surgery at each hospital and the distribution of the propensity scores for patients not undergoing surgery at that hospital. Ideally, the estimated propensity scores of the latter group would cover the entire range of estimated propensity scores at the particular hospital being studied. This finding would provide support for the assumption that the 2 groups of patients (those treated at a particular hospital versus all others) were similar in terms of observable demographic characteristics and other comorbidities.
We next estimated a regression model for the mortality outcomes. The dependent variable was binary, assuming a value of 1 if the patient died of any cause within 30 days of surgery and 0 otherwise. We included the same set of confounders used in the propensity score model. We included a random hospital-specific intercept that represented the underlying quality of the hospital and accounted for within-hospital correlation of patients. We calculated odds ratios (ORs) conditional on the hospital random effects that apply to comparisons of patients belonging to the same hospital (see Larsen and Merlo^{48} for a discussion of differences between conditional and unconditional ORs).
The size of between-hospital variation was summarized by the median OR (MOR).^{49} The MOR considers 2 CABG patients with the same set of observed risk factors but selected randomly from 2 different hospitals. The MOR is the OR between the patient with a higher probability of dying and the patient with a lower probability of dying. A MOR value >1 supports the hypothesis that between-hospital variation in mortality exists after adjustment for patient characteristics. If the between-hospital variation were 0, this would imply that differences in hospital outcomes, after adjustment for patient characteristics, are due only to random sampling variability. Although between-hospital variation will always be >0 in practice, some have suggested that small values can be effectively ignored by essentially setting the between-hospital variation component to 0. We see no reason to assume that between-hospital variation is 0 given that this value can be estimated.
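As defined in reference 49, the MOR is computed from the estimated between-hospital variance σ² of the random intercepts as MOR = exp(√(2σ²)·Φ⁻¹(0.75)). A short sketch of that formula, plugging in the variance estimates reported in the Results:

```python
import math
from statistics import NormalDist

def median_odds_ratio(between_hospital_variance):
    """MOR from the random-intercept variance sigma^2:
    MOR = exp( sqrt(2 * sigma^2) * Phi^{-1}(0.75) )."""
    z75 = NormalDist().inv_cdf(0.75)   # 75th percentile of the standard normal
    return math.exp(math.sqrt(2.0 * between_hospital_variance) * z75)

# Variance estimates from the Results reproduce the reported MORs.
mor_full = median_odds_ratio(0.0939)   # ~1.34, all 14 hospitals
mor_excl = median_odds_ratio(0.048)    # ~1.23, excluding hospital D
```

With zero between-hospital variance the MOR is exactly 1, matching the interpretation in the text.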
We calculated the mortality risk for each patient using the observed values of his or her confounding variables. The individual risk factors were multiplied by the estimated coefficients from the regression model, the resulting linear predictors were transformed onto the probability scale, and these probabilities were summed to obtain the expected number of deaths at each hospital.
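This per-patient computation is the standard inverse-logit transform. The intercept and coefficients below are hypothetical, not the fitted Massachusetts model:

```python
import math

def patient_risk(covariates, coefficients, intercept):
    """Predicted 30-day death probability for one patient:
    the inverse logit of the linear predictor."""
    linear = intercept + sum(x * b for x, b in zip(covariates, coefficients))
    return 1.0 / (1.0 + math.exp(-linear))

def expected_deaths(patients, coefficients, intercept):
    """Expected deaths at a hospital: the sum of its patients' risks."""
    return sum(patient_risk(p, coefficients, intercept) for p in patients)

# Hypothetical model with two binary risk factors (eg, shock, emergent status).
coefs = [1.2, 0.8]
pts = [[0, 0], [1, 0], [0, 1], [1, 1]]
e = expected_deaths(pts, coefs, intercept=-4.0)
```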
Hospital RSMRs
We next estimated a risk-standardized mortality ratio for each hospital by computing the ratio of the “observed” number of deaths to the expected number of deaths. However, rather than use the actual number of deaths at a hospital, we used an adjusted number (called a shrinkage estimate) that avoids several statistical problems associated with the observed number, including small sample sizes and clustering.^{28,34,50,51} We then multiplied the standardized mortality ratio by the crude state mortality rate to obtain hospital-specific RSMRs. Ninety-five percent posterior intervals for each RSMR were computed.
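The effect of shrinkage can be illustrated with a simple weighted average toward the pooled rate. This is only a stand-in for the hierarchical model's posterior estimates; prior_strength is a hypothetical tuning constant, not a quantity fitted in the study.

```python
def shrunken_rate(observed_deaths, n, pooled_rate, prior_strength=50):
    """Illustrative empirical-Bayes-style shrinkage: a hospital's crude
    rate is pulled toward the pooled rate, more strongly when its case
    volume n is small. prior_strength is a hypothetical prior "sample
    size", not the value estimated by the hierarchical model."""
    w = n / (n + prior_strength)
    return w * (observed_deaths / n) + (1 - w) * pooled_rate

# A 44-case hospital with zero deaths is not credited with a 0% rate...
small = shrunken_rate(0, 44, 0.0225)
# ...while a 650-case hospital's estimate stays close to its crude rate.
large = shrunken_rate(13, 650, 0.0225)
```

The case volumes 44 and 650 mirror the smallest and largest hospitals in the study; higher volume means less pull toward the pooled rate.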
Cross-Validation
Because all hospitals contribute to the model used to estimate the expected number of deaths, each hospital helps to define its own expected behavior.^{50,51} If one hospital is truly “outlying,” with an unusually high or low mortality rate, it may “inflate” the estimated between-hospital variance component because the regression model adapts to incorporate the results of the unusual hospital. Consequently, this hospital will be less likely to be identified as an outlier. With a very large number of hospitals, the results of one institution are unlikely to distort the model substantially. However, with a smaller number of cardiac surgery hospitals, as in Massachusetts or other individual states, one aberrant hospital could substantially influence the counterfactual outcome and make the performance of that hospital less likely to be identified as an outlier.
We addressed this problem through cross-validation. In a second set of analyses, the data from each hospital were sequentially deleted from the determination of the counterfactual distribution for its particular patients. With this approach, the expected number of deaths for a hospital represents how well the rest of the hospitals in the state would fare with the patients from that specific hospital. We computed the difference between the observed numbers of deaths in each hospital and the number of deaths predicted using its case mix and the regression coefficients from a model based on all other hospitals. Posterior predictive probability values, which reflect the similarity of the mortality experience of a particular hospital to that of its peers, also were computed.^{50} Extreme predictive P values (P≤0.01 or P≥0.99) indicate a discrepancy between the observed data and what is predicted by the model developed from the remaining hospitals.
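The leave-one-hospital-out logic can be sketched with stub modeling functions; here a pooled crude rate stands in for the hierarchical regression, and the death/case counts are hypothetical.

```python
def cross_validated_expected(hospitals, fit, predict):
    """Expected deaths for each hospital, computed from a model fitted
    only to the *other* hospitals, so that no hospital helps to define
    its own benchmark."""
    out = {}
    for name in hospitals:
        training = {k: v for k, v in hospitals.items() if k != name}
        model = fit(training)
        out[name] = predict(model, hospitals[name])
    return out

# Stubs: "fit" returns the pooled crude rate; "predict" scales it by volume.
def fit(data):
    deaths = sum(d for d, n in data.values())
    cases = sum(n for d, n in data.values())
    return deaths / cases

def predict(rate, hospital):
    return rate * hospital[1]

# Hypothetical (deaths, cases): hospital "B" is aberrant.
hospitals = {"A": (2, 100), "B": (10, 100), "C": (3, 100)}
expected = cross_validated_expected(hospitals, fit, predict)
# With B excluded from its own benchmark, its expected deaths are only 2.5,
# far below its 10 observed deaths, so its aberrance is not masked.
```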
The authors had full access to and take full responsibility for the integrity of the data. All authors have read and agree to the manuscript as written.
Results
The crude 30-day mortality rate is 2.25%, corresponding to 99 deaths out of 4393 isolated CABG admissions. The number of isolated CABG admissions ranged from a low of 44 to a high of 650. Not surprisingly, substantial differences were found in patient risk factors among hospitals (Table 1). For example, the percentage of admissions in which ejection fraction was <30% ranged from 1.8% to 15.0%, renal failure ranged from 1.8% to 13.0%, preoperative intra-aortic balloon pump use varied from 2.3% to 29.0%, and emergent or salvage procedures ranged from 0% to 7.2%. Visual inspection of the covariate frequencies for hospitals B and F suggests that they represent, on average, quite different populations. For example, 7.2% of the patients at hospital B were emergent or salvage, the highest-acuity group, whereas only 0.9% of patients at hospital F were in that category. This imbalance is illustrated more formally in Figure 2B, a graphic depiction of the density of estimated propensity scores from hospital B compared with those of hospital F. This analysis is restricted to those patients who underwent surgery in those 2 hospitals. The propensity scores in Figure 2B were obtained by estimating a (binary) logistic regression model in which the response was an indicator assuming a value of 1 if the patient underwent CABG at hospital B and 0 if the patient underwent surgery at hospital F. The density estimates indicate that for 13% of the patients who underwent CABG at hospital B (solid line), no “similar” patient underwent the procedure in hospital F (dashed line). This percentage was calculated by identifying the fraction of hospital B patients with estimated log-odds of propensity scores >5 because this defined the area of nonoverlap (eg, no hospital F patient had an estimated log-odds of propensity score >5). The lack of overlap implies that a direct comparison of all patients treated at hospital B with those at hospital F may not be statistically valid.
Table 2 illustrates the prevalence of the individual covariates from which these propensity score density distributions were derived. Column 1 shows the characteristics of the subset of patients at hospital B who do not overlap with hospital F (ie, for whom the log-odds of their propensity scores are >5). The prevalence of individual high-risk characteristics is quite elevated in this patient subset (eg, 24% renal failure, 17% reoperation, 10% cardiogenic shock, 52% emergent or salvage), and hospital F has no experience with patients having this overall level of acuity. The last 2 columns demonstrate the balancing properties of propensity scores in the area of overlap, in which patients are found from both hospitals with comparable log-odds of propensity score. For many of the most important covariates (eg, prior CABG, cardiogenic shock, recent myocardial infarction, urgent or emergent/salvage status), the prevalence was comparable for hospital B and F patients in the overlap region.
Although direct hospital-to-hospital covariate balance was poor, the overlap of estimated propensity score distributions for each hospital compared with the propensity score distribution for patients at most of the remaining hospitals was excellent. For example, Figure 2A displays the overlap for hospital B and all remaining hospitals based on the predictions obtained from the multinomial logistic regression model. This suggests that a comparison of the performance of hospital B relative to the overall group of other Massachusetts CABG providers is statistically valid.
The prevalence of the confounders and their relationship to 30-day mortality are presented in Table 3. Between-hospital variation measured by the MOR, after accounting for patient risk factors, is 1.34. This implies that for 2 patients with the same observed risk factors, the patient treated in the hospital with higher mortality risk is 1.34 times as likely to die within 30 days of isolated CABG as the patient treated in the hospital with lower mortality risk.
The last column of Table 4 depicts the typical profiling results that would be obtained with the entire state experience (all 14 hospitals) as the counterfactual. The 95% posterior interval for each hospital’s RSMR includes the state crude rate of 2.25%. This would imply that no hospital had a higher- or lower-than-expected mortality rate given its case mix. In most public report cards, this finding would be regarded as sufficient evidence for the absence of statistical outliers, but as noted previously, this conclusion may be misleading. The 3 columns on the left demonstrate the results of analyses performed with cross-validation, sequentially deleting the results of each hospital from the determination of its own counterfactual. The result of this cross-validation predictive P value analysis was highly significant (P=0.01) for hospital D on the left side of Table 4. Supporting this concern is the fact that the between-hospital variation in risk-adjusted mortality is reduced by approximately 50% when hospital D is excluded from the model (from 0.0939 to 0.048; data not shown), and the MOR decreases from 1.34 to 1.23. Finally, a 2.26% excess mortality rate results when hospital D is compared with its peers. These findings all suggest that hospital D is in fact a statistical outlier.
Discussion
The study of variations in the provision of healthcare services has been a central activity of outcomes research for more than 2 decades. This variability has included both utilization of services and outcomes. Initial publication of hospital mortality rates in 1986 by the Health Care Financing Administration (now known as the Centers for Medicare and Medicaid Services, or CMS) was widely criticized for failing to adjust for patient risk.^{52} This motivated the development of numerous statistical risk models, particularly in cardiac surgery, to account for preoperative patient characteristics. It also stimulated CMS to look more closely at its risk models. It has now released new mortality models for acute myocardial infarction and heart failure that address many risk-adjustment issues and statistical deficiencies identified in their earlier releases.^{32,33} Nevertheless, although risk adjustment corrects for the case severity at a given institution using risk estimates derived from the entire population, it does not guarantee statistically valid direct hospital-to-hospital comparisons. When analyzing outcomes data, interested stakeholders should always consider these additional questions: To what type of patients can inferences about risk-standardized hospital outcomes be applied? What reference population was used to determine the counterfactual? If direct hospital-to-hospital comparison is the goal, is there sufficient covariate balance (overlap) to justify such comparison? A widely held view is that risk adjustment levels the playing field so that hospitals can be compared directly with one another over the broad spectrum of patient risk. We argue that this assumption often is invalid and that this common misinterpretation has profound health policy implications in today’s performance-centric environment.
Are current report cards useful? Yes, they are useful when interpreted in the correct context. Most outcomes report cards use indirect standardization. In this context, the RSMR of a hospital may be interpreted as a measure of quality for the type of patient it treats. Properly constructed and interpreted, report cards facilitate comparisons of hospitals with the entire experience of a larger population of providers (eg, a state or region). Such a comparison group typically will be rich enough to support a valid assessment of each hospital’s quality of care, and it provides meaningful information to payers, regulators, and healthcare consumers.
Conclusions
Outcomes research typically involves nonrandomized studies to assess the results of patient experience with the healthcare system. Virtually always, some form of adjustment is required. Although risk-standardized outcomes have been an important advance in adjusting provider results for differences in case mix, such results often have been misapplied. Assessing the performance of a hospital for its case mix compared with the expected performance of a reference group of providers for a similar case mix usually is justified. However, because of substantial differences in the distribution of risk factors, it may often be inappropriate to directly compare 2 hospitals using the results available in most public report cards.
Acknowledgments
Sources of Funding
Dr Normand is contracted by the Massachusetts Department of Public Health to monitor hospital cardiac quality and also receives funding from Yale University to develop risk models for CMS.
Disclosures
None.
References
Agency for Healthcare Research and Quality. Outcomes Research: Fact Sheet. Available at: http://www.ahrq.gov/clinic/outfact.htm. Accessed September 5, 2007.
Institute of Medicine. Crossing the Quality Chasm: A New Health System for the 21st Century. Washington, DC: National Academies Press; 2001.
Institute of Medicine. Performance Measurement: Accelerating Improvement. Washington, DC: National Academies Press; 2006.
Gatsonis CA. Profiling providers of medical care. In: Armitage P, Colton T, eds. Encyclopedia of Biostatistics. Vol 6. 2nd ed. Chichester, UK: John Wiley & Sons Ltd; 2005:4252–4254.
Normand SLT. Quality of care. In: Armitage P, Colton T, eds. Encyclopedia of Biostatistics. Vol 6. 2nd ed. Chichester, UK: John Wiley & Sons Ltd; 2005:4348–4352.
Rubin DB. Comment: Neyman (1923) and causal inference in experiments and observational studies. Stat Sci. 1990;5:472–480.
Holland PW, Rubin DB. Causal inference in retrospective studies. Eval Rev. 1988;12:203–231.
Rothman KJ, Greenland S. Modern Epidemiology. Philadelphia, Pa: Lippincott-Raven; 1998.
Pearl J. Causality: Models, Reasoning, and Inference. Cambridge, UK: Cambridge University Press; 2000.
Robins JM, Greenland S. The role of model selection in causal inference from nonexperimental data. Am J Epidemiol. 1986;123:392–402.
Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41–55.
Rosenbaum PR. Observational Studies. New York, NY: Springer; 2002.
Gelman A, Meng XL, eds. Applied Bayesian Modeling and Causal Inference From Incomplete-Data Perspectives. Chichester, UK: Wiley; 2004.
Gelman A, Hill J. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge, UK: Cambridge University Press; 2007.
Maldonado G, Greenland S. Estimating causal effects. Int J Epidemiol. 2002;31:422–429.
Fleiss JL, Levin BA, Paik MC. Statistical Methods for Rates and Proportions. Hoboken, NJ: Wiley; 2003.
Hannan EL, Wu C, Ryan TJ, Bennett E, Culliford AT, Gold JP, Hartman A, Isom OW, Jones RH, McNeil B, Rose EA, Subramanian VA. Do hospitals and surgeons with higher coronary artery bypass graft surgery volumes still have lower risk-adjusted mortality rates? Circulation. 2003;108:795–801.
Krumholz HM, Wang Y, Mattera JA, Wang Y, Han LF, Ingber MJ, Roman S, Normand SL. An administrative claims model suitable for profiling hospital performance based on 30-day mortality rates among patients with an acute myocardial infarction. Circulation. 2006;113:1683–1692.
Krumholz HM, Wang Y, Mattera JA, Wang Y, Han LF, Ingber MJ, Roman S, Normand SL. An administrative claims model suitable for profiling hospital performance based on 30-day mortality rates among patients with heart failure. Circulation. 2006;113:1693–1701.
D’Agostino RB Jr. Propensity scores in cardiovascular research. Circulation. 2007;115:2340–2343.
Joffe MM, Rosenbaum PR. Invited commentary: propensity scores. Am J Epidemiol. 1999;150:327–333.
Society of Thoracic Surgeons. STS National Database. Available at: http://www.sts.org/sections/stsnationaldatabase/. Accessed September 5, 2007.
Social Security Death Index interactive search. Available at: http://ssdi.rootsweb.com/cgi-bin/ssdi.cgi. Accessed September 5, 2007.
Larsen K, Merlo J. Appropriate assessment of neighborhood effects on individual health: integrating random and fixed effects in multilevel logistic regression. Am J Epidemiol. 2005;161:81–88.
Draper D, Gittoes M. Statistical analysis of performance indicators in UK higher education. J R Stat Soc Ser A (Stat Soc). 2004;167:449–474.
Iezzoni LI. Risk Adjustment for Measuring Health Care Outcomes. 3rd ed. Chicago, Ill: Health Administration Press; 2003.
CLINICAL PERSPECTIVE
Risk-standardized outcomes are increasingly being used by various stakeholders to assess the quality of care delivered by healthcare providers. Although adjusted outcomes represent a substantial improvement over unadjusted results, they are commonly misinterpreted and misused, which can have important consequences for the provider and the healthcare system. Risk-standardized outcomes, as most commonly constructed, characterize a provider’s performance for a specific group of patients compared with what would have been expected had that care been delivered by an average provider in the reference population (typically a state or a country). These indirectly standardized outcomes, based on providers’ actual case mix, cannot necessarily be extrapolated to predict what their performance might be with a different (eg, more complex) group of patients. Moreover, if the number of providers in the reference population is small, the inclusion of a true outlying program in the development of the risk model may decrease the sensitivity of the resulting algorithm to detect true outliers. In Massachusetts, this problem is mitigated through the use of cross-validation, obtained by sequentially removing each hospital from risk model development and then assessing its performance with a model derived from the remaining hospitals. Finally, although risk-standardized outcomes are useful for comparing individual provider performance with that of the overall reference population, this does not imply that the outcomes of 2 providers can be directly compared with one another. Such a comparison would be justified only for the group of patients whose risk profiles overlap between the 2 providers, because these are the only types of patients the providers have in common.
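The leave-one-hospital-out idea described above can be sketched very simply. The following is a deliberately stripped-down illustration: each held-out hospital's expected rate is taken as the pooled crude rate of the remaining hospitals, standing in for the full hierarchical risk model the study actually used, and the hospital names and counts are hypothetical.

```python
def loho_expected_rates(hospital_data):
    """Predictive (leave-one-hospital-out) cross-validation sketch:
    each hospital's expected mortality rate is derived only from
    the remaining hospitals, so a true outlier cannot pull the
    reference model toward itself and mask its own deviation.

    hospital_data maps hospital name -> (deaths, cases).
    """
    expected = {}
    for name in hospital_data:
        deaths = sum(d for h, (d, n) in hospital_data.items() if h != name)
        cases = sum(n for h, (d, n) in hospital_data.items() if h != name)
        expected[name] = deaths / cases
    return expected

# Hypothetical data: hospital C is a true high-mortality outlier.
data = {"A": (10, 500), "B": (12, 600), "C": (40, 500)}
rates = loho_expected_rates(data)
# C's expected rate (2%) is built from A and B only, so C's
# observed rate of 8% stands out clearly against it.
print({h: round(r, 4) for h, r in rates.items()})
```

Had hospital C's own 40 deaths been included when fitting its reference model, the expected rate would have been inflated toward C's performance, shrinking the apparent gap; excluding each hospital in turn is what preserves sensitivity to true outliers.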
Footnotes

Guest Editor for this article was Harlan M. Krumholz, MD, SM.
Comparison of “Risk-Adjusted” Hospital Outcomes. David M. Shahian and Sharon-Lise T. Normand. Circulation. 2008;117:1955–1963. Originally published April 14, 2008. https://doi.org/10.1161/CIRCULATIONAHA.107.747873