(Circulation. 2001;103:3150.)
© 2001 American Heart Association, Inc.
Current Perspective
From the University of Texas School of Public Health, Houston, Tex.
Correspondence to Lemuel A. Moyé, MD, PhD, University of Texas School of Public Health, RAS Building E-815, 1200 Herman Pressler, Houston, TX 77030. E-mail lmoye{at}utsph.sph.uth.tmc.edu
| Abstract |
|---|
Key Words: trials statistics population epidemiology
| Introduction |
|---|
| The Sampling Nemesis |
|---|
This difference in the data across samples is sampling variability, and its presence raises the very real possibility that the sample, merely through the play of chance, will mislead the researcher about the characteristics of the total population. This can happen even when the investigator uses modern sampling techniques to obtain a representative sample. In the end, the investigator has one and only one sample, and therefore, she can never be sure whether conclusions drawn from the sample data truly characterize the entire population.
In sample-based research, researchers cede the high ground of certainty. We do insist, however, that bounds be placed on the size of the sampling errors that can be produced from sample-based conclusions. These sampling errors are commonly known as type I and type II errors. If the researcher identifies the effect of therapy in his sample, the researcher must prepare an answer to the question, "How likely is it that the total population, in which there was no therapy effect, produced a sample that experienced a therapy effect?" This is the type I error, addressed by the P value. The reverse sampling error, in which the total population does experience a therapy effect but produces a sample with no therapy effect, is the type II error. These are solely measurements of sampling error and must be combined with the sample size, effect size (eg, relative risk), and the effect size variability in order to successfully convey the results of the research. However, in order for the medical community to successfully draw a conclusion about the pertinence of these findings to the larger population from which the sample was taken, these estimates must be accurate. They are accurate only when the sole source of variability in the experiment is the data.
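The two sampling errors can be made concrete with a short calculation. The sketch below is illustrative only: it assumes a two-sided, two-sample z test on a normally distributed outcome with a known common standard deviation, and the effect size (0.5 SD) and group size (64 per arm) are hypothetical values chosen for the example, not figures from any study discussed here.

```python
from statistics import NormalDist


def rejection_probability(true_effect, n_per_arm, alpha=0.05, sd=1.0):
    """Chance that a two-sided, two-sample z test declares a therapy effect.

    Assumes (for illustration only) normal outcomes, a known common
    standard deviation, and equal group sizes.
    """
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha / 2)
    standard_error = sd * (2.0 / n_per_arm) ** 0.5
    shift = true_effect / standard_error
    # Probability that the test statistic falls in either rejection tail.
    return std_normal.cdf(shift - critical) + std_normal.cdf(-shift - critical)


# Population with no therapy effect: any "finding" is a type I error.
type_i_error = rejection_probability(true_effect=0.0, n_per_arm=64)

# Population with a true effect of 0.5 SD: a null sample is a type II error.
type_ii_error = 1.0 - rejection_probability(true_effect=0.5, n_per_arm=64)

print(f"type I error:  {type_i_error:.3f}")
print(f"type II error: {type_ii_error:.3f}")
```

Under these assumptions the type I error equals the chosen alpha by construction, and the type II error is roughly 0.19. Both figures are trustworthy only because the end point and the test were fixed before the data were seen.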
If executed correctly, a research program will obtain its sample of data randomly and will lead to accurate statistical computations. When more than the data is random, however, the underlying assumptions are no longer in play, and the statistical computations are corrupted: they are incorrect and cannot be corrected. Consider the following example.
| Troubled Enthusiasm and End Points |
|---|
Many researchers would have no problem with Dr C's EDV-to-ESV end point switch, arguing that each end point is a measurement of the same underlying physiology and pathophysiology. Why should Dr C be criticized for making an initial wrong guess about the best end point to choose? Because she had the insight to measure several different indicators of left ventricular function, perhaps she should be commended for (1) her foresight in measuring ESV and (2) her courage in raising the significant ESV result to a prominent place in her report. Others among us would be uncomfortable with the end point change but may be uncertain as to exactly what the problem is. We might say that the decision to change the end point was "data driven." Well, what's so wrong with that? Aren't the results of any study data driven?
| Random Research Versus Anchored Research |
|---|
The research as Dr C chose to execute it has a new, random component. The choice of the end point was a random selection: the data dictated which end point would be chosen. Of course, the data always contribute to the end point, but, in this unfortunate case, the data chose the end point. Because the data are random, the end point selection process is now random. A different sample might have led to a different choice of end point. This new source of variation wrecks the standard computations of relative risks, standard deviations, confidence intervals, and type I/II errors because the underlying assumptions have altered. These quantities are still computed, but the computations can no longer be trusted.
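A small simulation can illustrate how letting the data choose the end point distorts the type I error. The sketch below is a simplified assumption, not a reconstruction of any actual trial: it generates two-arm trials with no true therapy effect on two end points that are modeled as independent standard normal variables (real measures such as EDV and ESV are correlated, which would reduce but not remove the inflation). All function names and parameter values are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def z_statistic(treated, control):
    """Two-sample z statistic, assuming unit variance and equal group sizes."""
    n = len(treated)
    return (fmean(treated) - fmean(control)) / (2.0 / n) ** 0.5


def simulate(n_trials=4000, n_per_arm=50, n_endpoints=2, alpha=0.05, seed=1):
    """Return the false positive rates (prespecified, data-chosen end point)."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    prespecified_hits = 0
    data_chosen_hits = 0
    for _ in range(n_trials):
        z_values = []
        for _ in range(n_endpoints):
            # No therapy effect: both arms come from the same distribution.
            treated = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
            control = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
            z_values.append(abs(z_statistic(treated, control)))
        # Anchored research: end point 0 was named before the data were seen.
        if z_values[0] > critical:
            prespecified_hits += 1
        # Random research: report whichever end point looks most impressive.
        if max(z_values) > critical:
            data_chosen_hits += 1
    return prespecified_hits / n_trials, data_chosen_hits / n_trials


prespecified_rate, data_chosen_rate = simulate()
print(f"prespecified end point: {prespecified_rate:.3f}")
print(f"data-chosen end point:  {data_chosen_rate:.3f}")
```

Under these assumptions the prespecified analysis rejects in about 5% of null trials, whereas the data-chosen analysis rejects in close to 1 - 0.95² ≈ 9.75% of them, even though each individual test was computed correctly. The nominal P value reported for the chosen end point therefore understates the true type I error.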
There are additional problems with changing the end point from EDV to ESV. Magnitude of effect, variability of the estimate, and required sample size would be different, as well. Although logistically important, these concerns are not the focus of this discussion.
The reader should note that the problem introduced by random research is not one of sloppy calculations. On the contrary, great effort is expended on these computations, using state-of-the-art statistical packages. When a statistician is asked to consider the development of a test statistic for the change in EDV, he commonly begins by saying, "Let x represent the EDV change. Then x has the following probability distribution...." From these statements, estimators of effect size, standard errors, confidence intervals, and probability value are available to effectively convey the strength of evidence the data contain. However, this paradigm falls apart when x is not prospectively and arbitrarily chosen but is instead selected by the data, which themselves contain sampling error. In this second (random) paradigm, there is a new probability distribution that governs the selection of the EDV variable itself. This change in assumption results in a more complicated second paradigm in which our commonly used, familiar estimators are no longer the best. Therefore, estimators developed for the first (arbitrary, prospective) paradigm are useless when the paradigm has shifted to the second (random) one. The only protection against this dilemma is the prospective specification of the end points in complete detail, leaving nothing to chance. This is the central motivation for the tenet "first say what you will do, then do what you said" in research.
This dilemma is resolved at once if the researcher is able to study the entire population. If a researcher is interested in identifying the effect of therapy in a community hospital for a short time period with no desire to extend the findings to another hospital or to a future time, there is no concern for estimate trustworthiness. As long as the conclusions will be applied only to those patients in the hospital at that point in time (ie, the sample equals the total population), the notion of prospective determination, from a sampling point of view, is not necessary. However, the freedom gained in the end point selection process by studying the entire population is counterbalanced by the inability to generalize.
| Clinical Trials With Random End Points |
|---|
The US Carvedilol Heart Failure program and the ELITE study have each been afflicted with the same source of debilitating variability. In each of these efforts, morbidity end points were chosen prospectively. In each case, these end points were upstaged by other findings. In ELITE, the authors stated that "the study demonstrated that losartan reduced mortality compared with captopril."2 It is important in this discussion to focus on the purpose of the implication. There is no doubt that the use of carvedilol is associated with reduced mortality in the US Carvedilol Heart Failure program sample. Analogously, ELITE clearly demonstrated that in its sample of patients, losartan was associated with fewer deaths. However, because many findings in a research sample cannot be extended to the larger population, the scientific community must carefully consider which findings should be extended and which should be characterized as inconclusive. Statistical measurements are useful tools in this analysis. The random research component in both of these studies makes these tools unreliable.
Additional investigations involving carvedilol are underway. The Carvedilol Prospective Randomized Cumulative Survival (COPERNICUS) trial is a multicenter, randomized, double-blind, placebo-controlled study to determine the effect of carvedilol on mortality in patients with severe chronic heart failure. This study has the prospectively selected primary end point of total mortality and will permit a more definitive conclusion about the magnitude of the effects of carvedilol than did the US Carvedilol Heart Failure study.
In the case of ELITE, additional information is available. The Losartan Heart Failure Survival study (ELITE II)10 examined the effect of losartan and captopril in 3152 patients aged ≥60 years with NYHA Class II-IV heart failure. This study, designed to formally test the hypothesis that losartan could reduce mortality, concluded that there was no mortality effect in the population. Possible reasons for the differences in the findings between ELITE II and the original ELITE could be a difference in age (the mean age in ELITE II was lower) and differences in the use of concomitant medications. In addition, more patients had NYHA Class III-IV heart failure in ELITE II than in the earlier ELITE. However, the explanation for the differences in the studies' results should begin with consideration of the sampling error. Measurements of sampling error are useless in the original ELITE. In ELITE II, the end point selection was fixed and not subject to change after an examination of the data. Thus, the research is well anchored and the statistical error measurements are accurate in ELITE II.
With most clinical investigations, we do not have the luxury of an ELITE II to point out the erroneous conclusions of prior studies. It is more likely that we will have only one study on which to base our opinions. The best we can do in such cases is to ensure that the measurements of statistical error are reliable. In random research, no matter how carefully calculated, these measurements cannot be trusted and must be discarded.
| Sampling Error and Multiple End Points |
|---|
This profound difficulty has elicited recent discussion.11-15 The major recommendations from this body of work are (1) to require that each primary end point and each of the secondary end points have type I error attached in a prospective and reasoned fashion, (2) to separate the type I error allocated for the primary end point from that allocated for the experiment, and (3) to allow the total experimental type I error to be >0.05 while requiring that the type I error for the primary outcome be ≤0.05. This collection of procedures increases the rigor for the prospective statements concerning secondary end points and allows for the straightforward interpretation of a research effort that is positive for secondary end points when the primary end point is not statistically significant. Although no consensus has yet been reached, important new dialogue is well underway.
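The arithmetic behind such a prospective allocation can be sketched briefly. The end point names and alpha values below are hypothetical, chosen only to show how a primary end point held at 0.05 can coexist with a total experimental type I error above 0.05. The sum of the allocations is an upper bound on the experiment-wide error (the Bonferroni inequality); the product form is exact only under the added assumption that the tests are independent.

```python
from math import prod


def experiment_alpha(allocations):
    """Return (upper bound, value if tests are independent) for the
    experiment-wide type I error implied by per-end-point allocations."""
    alphas = list(allocations.values())
    upper_bound = min(1.0, sum(alphas))
    if_independent = 1.0 - prod(1.0 - a for a in alphas)
    return upper_bound, if_independent


# Hypothetical prospective allocation, fixed before any data are examined.
allocations = {
    "primary: total mortality": 0.05,
    "secondary: hospitalization": 0.025,
    "secondary: ejection fraction": 0.025,
}

upper_bound, if_independent = experiment_alpha(allocations)
print(f"primary end point alpha:      {allocations['primary: total mortality']:.3f}")
print(f"experiment alpha (bound):     {upper_bound:.3f}")
print(f"experiment alpha (indep.):    {if_independent:.4f}")
```

With this allocation the primary end point is tested at 0.05, while the experiment as a whole carries a type I error of at most 0.10 (about 0.097 if the three tests were independent). A significant secondary finding can then be interpreted against its own prospectively declared threshold.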
| Conclusions |
|---|
| Footnotes |
|---|
| References |
|---|
2. Pitt B, Segal R, Martinez FA, et al. Randomised trial of losartan versus captopril in patients over 65 with heart failure (Evaluation of Losartan in the Elderly Study, ELITE). Lancet. 1997;349:747-752.
3. Friedman L, Furberg C, DeMets D. Fundamentals of Clinical Trials. 3rd ed. St. Louis, MO: Mosby; 1996:307-308.
4. Meinert CL. Clinical Trials: Design, Conduct, and Analysis. New York, NY: Oxford University Press; 1986:214-215.
5. Moyé LA, Abernethy D. Carvedilol in patients with chronic heart failure. N Engl J Med. 1996;335:1318-1319. Letter.
6. Fisher LD. Carvedilol and the Food and Drug Administration (FDA) approval process: the FDA paradigm and reflections on hypothesis testing. Control Clin Trials. 1999;20:16-39.
7. Moyé LA. End-point interpretation in clinical trials: the case for discipline. Control Clin Trials. 1999;20:40-49.
8. Packer M, Cohn JN, Colucci WS. Response to Moyé and Abernethy. N Engl J Med. 1996;335:1318-1319.
9. Moyé LA. Statistical Reasoning in Medicine: The Intuitive P value Primer. New York, NY: Springer-Verlag; 2000. Chapters 7 and 8.
10. Pitt B, Poole-Wilson PA, Segal R, et al. Effect of losartan compared with captopril on mortality in patients with symptomatic heart failure: randomized trial: the Losartan Heart Failure Survival Study, ELITE II. Lancet. 2000;355:1582-1587.
11. D'Agostino RB. Controlling alpha in a clinical trial: the case for secondary endpoints. Stat Med. 2000;19:763-766.
12. Moyé LA. Alpha calculus in clinical trials: considerations and commentary for the new millennium. Stat Med. 2000;19:767-779.
13. Koch GG. Discussion for "Alpha calculus in clinical trials: considerations and commentary for the new millennium." Stat Med. 2000;19:781-784.
14. O'Neill RT. Commentary on "Alpha calculus in clinical trials: considerations and commentary for the new millennium." Stat Med. 2000;19:785-793.
15. Moyé LA. Alpha calculus in clinical trials: considerations and commentary for the new millennium: rejoinder. Stat Med. 2000;19:767-779.