Since the measuring device has been constructed by the observer . . . we have to remember that what we observe is not nature in itself but nature exposed to our method of questioning.
Werner Karl Heisenberg
Physics and Philosophy, 1958
Diagnostic testing is an important component of cardiovascular medicine, and optimal practice requires detailed knowledge of how to use and interpret tests. Measures of test performance, such as sensitivity and specificity, are widely used on the assumption that they are fairly constant descriptors of the ability of a given test result to detect the presence or absence of disease. Unfortunately, while such concepts are fundamental, accurate assessment of test performance is far more complex, and many factors significantly affect it.
In the initial evaluation of a test, investigators often assess its performance by comparing results in populations with very low and very high likelihoods of disease, such as healthy volunteers and afflicted patients. The test in question appears to be an excellent discriminator, although the clinically relevant question of correctly classifying a patient with an intermediate probability of disease has not been addressed. Nevertheless, the published results will indicate very high sensitivity and specificity. However, in subsequent assessments based on wider use of the test in less clearly segregated populations, sensitivity and specificity appear to plummet.1 The accuracy of the test has not changed; rather, the population referred for testing has. Thus, apparent test performance can be altered by referral patterns or pretest selection bias. Related to this is the influence of disease prevalence in the population examined. This traditional bayesian consideration is well recognized and can dramatically affect the accuracy of a given test result: as disease prevalence in the population under study decreases, the positive predictive value of a test declines while its negative predictive value rises.2
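The dependence of predictive value on prevalence follows directly from Bayes' theorem and can be illustrated with a short calculation. The sketch below is purely illustrative; the 90%/90% test characteristics are hypothetical and are not drawn from any of the studies discussed here.

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Positive and negative predictive values via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

# A hypothetical test with 90% sensitivity and 90% specificity,
# applied first to a high-prevalence and then a low-prevalence population:
for prev in (0.50, 0.05):
    ppv, npv = predictive_values(prev, 0.90, 0.90)
    print(f"prevalence {prev:.0%}: PPV {ppv:.0%}, NPV {npv:.0%}")
```

As prevalence falls from 50% to 5%, the positive predictive value of this hypothetical test drops from 90% to roughly a third, while the negative predictive value rises toward 99%, even though sensitivity and specificity are unchanged.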
Verification or posttest bias is a third factor that has been examined only rarely. This bias arises when a newly introduced technology is evaluated after physicians have begun to rely on its results and no longer require verification of them. Test results themselves become a factor in whether or not they are confirmed by some additional, definitive procedure. Thus, patients with positive tests (in this case, exercise echocardiography) are more likely to have their results verified (by undergoing angiography), while those with negative tests are rarely referred for subsequent studies. This practice increases apparent sensitivity, because false-negative results are unlikely to be discovered, and decreases apparent specificity, because true-negative results are less likely to be confirmed and therefore will be underrepresented. In this issue of Circulation, Roger and colleagues3 report the effects of such verification bias on the accuracy of exercise echocardiography for the detection of coronary artery disease. While this particular test was chosen because no similar analysis had yet been performed, the results are applicable to all forms of diagnostic testing.
The effects of correcting for verification bias are striking. Using a large clinical series of patients undergoing exercise echocardiography in whom only a small minority (9%) were subsequently referred for coronary angiography, Roger et al assessed test performance in two ways: (1) in the usual manner, in the angiographic subgroup, by the presence or absence of significant coronary artery narrowing, and (2) in all patients, by adding the angiographic results to those estimated in the nonangiographic majority by application of a clinically based logistic regression algorithm for disease presence derived from results in the angiographic subgroup.4,5 The first, or traditional, method resulted in fairly high sensitivity (78%), mediocre specificity (41%), and a reasonable correct classification rate, or overall predictive value (69%), in the angiographic group (Table 1). The second method, which incorporates a statistical correction for test verification bias,4 resulted in strikingly lower sensitivity (38%), higher specificity (85%), and a modestly altered correct classification rate (60%) in the entire population. The decrease in sensitivity and increase in specificity both confirm that exercise echocardiography results were used to determine who would subsequently undergo catheterization and demonstrate the importance of verification bias (in an unbiased population, neither would have changed substantially). The importance of test results to the process of verification is also supported by the independent association of a positive test with referral to catheterization (odds ratio 2.5 for exercise electrocardiography, 1.8 for exercise echocardiography).3
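Roger et al used a clinically based regression algorithm to estimate disease status in the unverified majority. A simpler closed-form correction in the same spirit (the Begg-Greenes approach, which assumes that the decision to verify depends only on the test result) can be sketched as follows; all numbers in the example are hypothetical and are not taken from their series.

```python
def begg_greenes(p_test_pos, p_disease_given_pos, p_disease_given_neg):
    """Verification-bias-corrected sensitivity and specificity.

    p_test_pos:          P(T+) observed in the full, unselected cohort
    p_disease_given_pos: P(D+ | T+) among patients whose result was verified
    p_disease_given_neg: P(D+ | T-) among patients whose result was verified
    Assumes verification depends only on the test result, so the
    disease probabilities within each result group are unbiased.
    """
    p_test_neg = 1 - p_test_pos
    sens = (p_disease_given_pos * p_test_pos) / (
        p_disease_given_pos * p_test_pos + p_disease_given_neg * p_test_neg)
    spec = ((1 - p_disease_given_neg) * p_test_neg) / (
        (1 - p_disease_given_neg) * p_test_neg
        + (1 - p_disease_given_pos) * p_test_pos)
    return sens, spec

# Hypothetical cohort: 30% positive tests; among the (selectively) verified,
# 80% of positives and 20% of negatives prove to have disease.
sens, spec = begg_greenes(0.30, 0.80, 0.20)
print(f"corrected sensitivity {sens:.0%}, corrected specificity {spec:.0%}")
```

The key point is that the correction reweights the verified results by how often each test result actually occurred in the whole cohort, rather than by how often it was sent for angiography.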
On first examination, it appears that Roger et al have discovered a fatal flaw in exercise echocardiography. However, their results could have been anticipated by recognizing the influence of the frequency of an abnormal test result on measures of test accuracy.5 Further, for every cardiovascular noninvasive test analyzed for the effects of verification bias (exercise ECG,6 exercise thallium,5,7 exercise radionuclide angiography,5 and now exercise echocardiography), results are similar (see Table 8 in Roger et al). No type of test escapes this effect; all are similarly insensitive when assessed in an unbiased manner, assuming that validation studies are performed after the test has entered clinical use, which is usually the case when large populations are studied. The degree to which verification bias affects test accuracy is related to the extent to which the diagnostic test results influence the decision to refer to angiography. These studies provide a clear illustration of the difference between test performance characteristics in a selected series and those obtained in general clinical practice.
Why does cardiac diagnostic testing appear to perform so poorly? A cursory review of the data is deceiving because the impact of correction for verification bias is negative only if the sole desirable outcome of exercise echocardiography is the detection of coronary disease presence. Correction for verification bias actually improved both specificity (41% to 85%) and negative predictive accuracy (40% to 54%) (see Table 1), implying that disease can be excluded more reliably in a general population than data derived from a (selectively) angiographically validated series would suggest. However, this result is dependent on disease prevalence, which in Roger and colleagues' population was quite high, at 74% in the angiographic group. The result in an individual patient may differ, yet it is the relevant consideration. Consider a hypothetical patient with a 50% pretest probability of disease. (While arbitrarily selecting 50% makes calculations easier, it also represents a patient with an intermediate probability of disease in whom testing is most likely to be beneficial and in whom a clinician is least able to make a confident diagnosis.) For this patient, bayesian principles9 dictate that a positive test increases the likelihood of disease presence to 57% using biased test characteristics but to 72% using adjusted sensitivity and specificity, thereby tripling the incremental gain from test performance (Table 2). In contrast, adjustment for verification bias reduces the incremental gain from a negative test result by half. When analyzed in this manner, correction for verification bias actually improves the clinical value of a positive test while reducing that of a negative test.
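The posttest probabilities quoted above follow directly from Bayes' theorem. A minimal sketch reproducing the calculation for the hypothetical patient with a 50% pretest probability, using the biased (78%/41%) and adjusted (38%/85%) test characteristics reported in the study:

```python
def post_test_probability(pretest, sensitivity, specificity, result_positive=True):
    """Posttest probability of disease given a test result (Bayes' theorem)."""
    if result_positive:
        true_pos = pretest * sensitivity
        false_pos = (1 - pretest) * (1 - specificity)
        return true_pos / (true_pos + false_pos)
    false_neg = pretest * (1 - sensitivity)
    true_neg = (1 - pretest) * specificity
    return false_neg / (false_neg + true_neg)

# Positive test, 50% pretest probability:
biased = post_test_probability(0.50, 0.78, 0.41)    # biased sens/spec -> ~57%
adjusted = post_test_probability(0.50, 0.38, 0.85)  # adjusted sens/spec -> ~72%
print(f"positive test: biased {biased:.0%}, adjusted {adjusted:.0%}")
```

The adjusted characteristics move the posttest probability from 50% to 72% rather than 57%, which is the roughly threefold larger incremental gain described above; the same function with `result_positive=False` shows the smaller gain from a negative test.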
There are other factors that can influence the accuracy of diagnostic testing in general and exercise echocardiography in particular, but an exhaustive cataloging is not germane to this discussion. However, mention of at least one of these factors is appropriate in that it supports the value of diagnostic testing in suspected coronary artery disease, which can be questioned given the low sensitivities reported after adjustment for verification bias. Since all diagnostic tests detect the physiological consequences of ischemia (poor perfusion, decreased membrane function, decreased myocardial shortening, etc) while angiography detects only the anatomic stenosis, catheterization is a flawed gold standard that may detect “disease” in the absence of any physiological significance (thereby raising the apparent false-negative rate of noninvasive testing and reducing its sensitivity). This is especially true when a moderate degree of stenosis is used for the definition of angiographic disease (Roger et al used 50%), ensuring that many patients with anatomic “disease” will not have inducible ischemia. It is also particularly true for a diagnostic test such as exercise echocardiography, which relies on the development of systolic dysfunction for the diagnosis of ischemia, as this is a late event in the ischemic cascade. Use of a more appropriate standard might yield more favorable test characteristics.
The second part of Roger and colleagues' work explores the effects of sex on the value of exercise echocardiography.3 Many prior investigators have noted differences in test performance between men and women for all forms of cardiac noninvasive testing. All of the types of bias discussed above may contribute to these differences: Women generally have a lower prevalence of coronary disease than men, are less likely to be referred for testing even if symptomatic, and are less likely to be referred for angiography once they have a positive noninvasive test result.9 There are also sex-based differences in the pathophysiology of coronary disease10: Women generally have less severe disease, their risk factors are somewhat different, and even traditional risk factors convey different levels of risk, making clinical determinations of disease likelihood inaccurate if not done in a sex-specific manner.11 Finally, factors intrinsic to the testing procedure itself may differ: Estrogen may affect exercise performance and ischemic threshold, and women are older, less likely to exercise adequately, more likely to have both hypertension and left ventricular hypertrophy, and more likely to use medications that affect the results of exercise ECG.
Given these possible influences on test results, it is perhaps surprising that Roger and colleagues' results were so similar in men and women. After adjustment for verification bias, they reported an 11 percentage point higher sensitivity of exercise echocardiography in men compared with women, a small but significant difference despite overlapping 95% confidence intervals. These results are consistent with those previously published in a similarly debiased population undergoing exercise ECG.6 However, a portion of the difference in sensitivity (3 percentage points) is explained by the greater extent of disease in the men. Further, there was a significant difference in the prevalence of disease in the angiographic subgroup (80% in men and 60% in women), which explains the observed sex differences in positive predictive value and in correct classification rate. Similarly, although the negative predictive value of exercise echocardiography was greater in women than in men in both the biased and adjusted groups, this may be related to lower disease prevalence in women. To aid interpretation of test results in an individual patient, let us turn again to the hypothetical patient with a 50% pretest probability of disease. Calculation of the influence of either a positive or negative test result on posttest probability of disease reveals similar incremental gains regardless of whether the patient is male or female (Table 2). Taken together, these results suggest little difference in the intrinsic performance of exercise echocardiography in individual men and women; apparent differences are due to variations in prevalence and extent of disease in the populations studied.
Much has been written about sex bias in the referral of women to invasive cardiac testing and access to interventions, with investigators presenting evidence both of bias and of its absence. It is useful to examine Roger and colleagues' cohort with this question in mind, as comparison of the effects of correction for verification bias in men and women will shed light on the relative influence of exercise echocardiography results on the decision to refer to angiography.
Overall, fewer women with a positive exercise echocardiogram were referred for angiography (19% versus 27%),3 but, since this decision incorporated clinical variables and was not solely based on test results, the numbers are difficult to interpret. More importantly, adjustment for verification bias resulted in a greater magnitude of correction for women, with sensitivity falling 47 percentage points in women compared with only 36 in men and specificity rising 49 percentage points in women and only 39 in men (Table 3). This suggests that physicians responded differently to a positive test result in women and relied more heavily on test results for the decision to proceed to catheterization in women than in men. Roger and colleagues'3 data do not address the issue of sex differences in pretest bias or whether women are referred to diagnostic testing differently than men, nor do they address whether increased reliance on test results in women represents “good” or “bad” medical practice.
How should Roger and colleagues'3 results alter the way we use diagnostic testing? Certainly their data are strikingly different from those previously presented for exercise echocardiography and clearly demonstrate the dramatic effects of correction for verification bias. They indicate that such a correction should be included, when appropriate for the population under study, in the evaluation of all forms of diagnostic testing. In addition to these general statements, two specific points arising from their findings can be made: First, in a clinical population, exercise echocardiography, like all cardiac diagnostic tests, is not a very sensitive test when compared with the anatomic gold standard of angiography. However, it is highly specific in a given population and provides incremental gain in estimating disease likelihood in individual patients, both important and valuable attributes. Second, in populations with sex-based differences in disease prevalence and extent, there will be sex-based differences in the accuracy of test results, which suggests that test results must be analyzed in a sex-specific fashion and that the decision to proceed to angiography must take into account sex-based differences in measures of test accuracy.11 What makes these results so important, however, is that they are more likely to represent day-to-day test performance in a routine clinical setting, so that the adjusted measures of test value that Roger et al and others have derived provide the clinician with a more accurate basis for the interpretation of the results of exercise echocardiography.
The author would like to thank George A. Diamond, MD, for his insightful critique.
The opinions expressed in this editorial are not necessarily those of the editors or of the American Heart Association.
Copyright © 1997 by American Heart Association
Roger VL, Pellikka PA, Bell MR, Chow CWH, Bailey KR, Seward JB. Sex and test verification bias: impact on the diagnostic value of exercise echocardiography. Circulation. 1997;95:405-410.
Schwartz RS, Jackson WG, Celio PV, Richardson LA, Hickman JR. Accuracy of exercise 201Tl myocardial scintigraphy in asymptomatic young men. Circulation. 1993;87:165-172.
Douglas PS. Coronary artery disease in women. In: Braunwald E, ed. Heart Disease. In press.