Applying a Big Data Approach to Biomarker Discovery
Running Before We Walk?
Circulating biomarkers play an increasingly important role in cardiovascular medicine. Clinicians routinely measure biomarkers of cardiac injury, neurohormonal activation, and renal function for diagnosis and risk assessment and for guiding therapeutic decision making. Several intersecting trends have heightened interest in the discovery and validation of additional cardiovascular biomarkers, including the highly anticipated transition to personalized or precision medicine and the pending availability of “big data” sets that promise a dizzying array of patient-level data, including unprecedented numbers of biomarkers.
Article see p 2297
A number of strategies have been used to search for novel biomarkers of cardiovascular disease (CVD). Unbiased technologies, including genomics, proteomics, and metabolomics, all use a big data approach for novel biomarker discovery,1 but to date, these technologies have failed to deliver on their initial promise, yielding no new clinically useful biomarkers in cardiac care. An alternative strategy is to focus on known proteins reflecting mediating pathways to ensure a higher probability of association with CVD, an approach that can now be implemented on a massive scale with the use of new multiplex immunoassay techniques that allow conservation of sample volume.
In this issue of Circulation, Gerstein et al,2 representing the Outcome Reduction With an Initial Glargine Intervention (ORIGIN) Trial investigators, used a commercially available Luminex multiplex immunoassay platform to screen hundreds of distinct biomarkers for their ability to predict CVD events. Although the marker selection process was described as an unbiased approach, each of the biomarkers tested was a known protein preselected by either the company or the investigators on the basis of some a priori rationale for inclusion in the test panel, a strategy that may increase the yield of discovery while limiting the full scope of what can be discovered. Within the field of circulating cardiovascular biomarkers, this effort is almost unprecedented in its scope and scale, given the large numbers of biomarkers measured (284), study subjects enrolled (8401), and primary composite clinical end points included (1405). The investigators created uniform algorithms to define which biomarkers were excluded on the basis of analytic concerns, how the biomarkers were modeled, and what criteria were used to select markers for inclusion in the final panels. This approach identified 8 to 15 biomarkers that were independently associated with various individual and composite end points. Adding these biomarkers to a multivariable model improved discrimination and risk classification compared with standard risk factors alone. Although N-terminal pro-B-type natriuretic peptide, a well-known risk marker, was consistently identified in all analyses, perhaps serving as a good positive control, angiopoietin-2 and glutathione S-transferase-α were also consistently selected across multiple different modeling conditions, suggesting that they may represent biological pathways contributing to CVD risk. However, it is premature to suggest that they would be useful as clinical biomarkers, given the exploratory nature of this study.
The authors are to be congratulated for tackling this very ambitious project, which we hope will begin a dialog on how big data approaches for biomarker discovery can provide a return on investment by identifying potentially useful new biomarkers. In our view, the authors have clearly shown that multiple-biomarker panels can improve risk prediction in a heterogeneous patient population. However, there are no gold-standard processes for marker selection or model building; therefore, it remains unclear whether the algorithms used in this study selected those candidate markers most likely to be clinically useful or biologically relevant. Although it could be argued that it matters little which individual biomarkers are included in the panel, as long as the panel works well for risk prediction, we believe that the greatest opportunities from biomarker discovery lie in a better understanding of disease processes, delineating new treatment targets, and guiding precision medicine decisions. For these goals, it matters greatly that the correct biomarkers for the outcomes of interest are selected from the pool of potential candidates.
Myriad considerations may influence the results of biomarker discovery projects, including factors related to the biomarker measurements, study population, end-point selection, and statistical methodologies used (Table). Preanalytic, analytic, and biological variation may affect biomarker measurements, leading to false conclusions regarding the associations of biomarkers with disease phenotypes. For example, blood collection media, processing protocols, and storage conditions may have important and sometimes unpredictable effects on biomarkers, hindering further clinical development of some biomarkers that may otherwise be promising candidates.3 In addition, variation in the quality of the individual assays and potential interference between assays on multiplex immunoassay platforms may lead to decreased accuracy and precision.4 Finally, even for assays with minimal preanalytic or analytic variation, biological variation may be substantial even in the absence of apparent changes in disease status. For example, natriuretic peptide concentrations may vary by ≥50% across serial measurements even in the absence of change in any clinical parameters.5
In the Gerstein et al study, we know little about the performance of the individual assays except that assays with very low sensitivity were excluded. Even this seemingly straightforward strategy of excluding low-sensitivity assays may be problematic because some insensitive markers may have high positive predictive value for events and be very useful for risk prediction. An important example in this regard is cardiac troponin, which is low or below the detection range in most individuals. Troponin measurements would probably not pass the sensitivity criterion even to be evaluated in the present study, even though troponins are well-validated biomarkers for risk prediction in primary and secondary prevention cohorts.6,7 Obviously, with >200 biomarkers included in the present study, upfront consideration of all these assay-related factors may not be feasible, but these issues could nevertheless have profound effects on the validity of the markers selected. Moving forward, we would propose that due diligence on preanalytic and analytic factors be performed for promising candidate markers selected after initial screening to determine which of the newly discovered biomarkers are suitable for further clinical development.
Another important consideration with biomarker discovery is the composition of the study population and the selection of end points. From a purely scientific standpoint, we believe that biomarker discovery should ideally be performed in relatively homogeneous populations with the use of narrowly defined end points that represent the most specific phenotypes possible. In contrast, validation should be performed in more diverse cohorts that better reflect the clinical circumstances in which the biomarkers might be used. In the present study, participants with and without existing CVD were combined, and the discovery end points used were relatively broad composites. Inclusion of individuals with existing CVD may alter not only the biomarker levels but also the associations of some biomarkers with recurrent disease. In addition, patients with CVD are typically on multiple pharmacological therapies, which can confound associations between biomarker levels and events. With regard to end points, heterogeneous cardiovascular end points reflect overlapping but distinct mediating pathways and therefore will likely associate with differing patterns of markers. For example, a marker that strongly associates with a thrombotic end point such as stroke may not be selected if a combined end point that includes heart failure is used, whereas a useful heart failure biomarker may not emerge when the composite end point combines atherosclerotic with heart failure events. Broad composite end points may be reasonable ultimately for clinical risk prediction but in our view are not an ideal initial approach for scientific discovery, for which the most precise phenotype would be the most informative. 
For example, the end-point “phenotype” of myocardial infarction includes both type 1 (plaque rupture) and type 2 (non–plaque rupture) events, stroke may include hemorrhagic and multiple subtypes of ischemic stroke, and heart failure includes distinct phenotypes of heart failure with preserved and reduced ejection fraction. As we move toward precision medicine, novel biomarkers may be more useful to help subphenotype these broad categories rather than to predict the risk of heterogeneous composite end points.
Many biomarkers demonstrate sex, race, and obesity interactions and associate differently with phenotypes among different subgroups. For example, high-sensitivity C-reactive protein levels are highly influenced by sex and obesity8 and lose their association with atherosclerosis phenotypes among obese individuals.9 In addition, medications used to treat lipids, diabetes mellitus, and other conditions may influence levels of biomarkers or modify the association between a biomarker and outcome; the authors did not consider such interactions here. Moreover, biomarkers that represent similar biological pathways may be correlated, and 2 informative biomarkers in the same or related pathways may effectively neutralize each other, with neither moving forward in the selection process. Approaches such as principal component analysis and correlation analysis may be useful to cluster markers before testing them in the multivariable models.
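The clustering idea can be illustrated with a minimal sketch. All data below are simulated (not from the ORIGIN study): 6 hypothetical markers are generated from 2 latent "pathways," the correlation matrix flags the redundant pairs, and a principal component analysis collapses each correlated cluster into a single pathway score that could then enter a multivariable model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 200 subjects and 6 biomarkers: markers 0-2 share one latent
# pathway, markers 3-5 share another, so within-group correlations are high.
n = 200
pathway_a = rng.normal(size=n)
pathway_b = rng.normal(size=n)
X = np.column_stack(
    [pathway_a + 0.3 * rng.normal(size=n) for _ in range(3)]
    + [pathway_b + 0.3 * rng.normal(size=n) for _ in range(3)]
)

# Correlation matrix: a simple screen for redundant markers before modeling.
R = np.corrcoef(X, rowvar=False)

# PCA via SVD of the standardized data: the leading components summarize
# correlated marker clusters as single "pathway" scores.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
explained = s**2 / np.sum(s**2)   # fraction of variance per component
scores = Z @ Vt.T                 # component scores, one column per component
```

In this simulation, the first 2 components capture most of the variance of all 6 markers, so 2 pathway scores, rather than 6 collinear markers, would compete in the selection algorithm.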
As if the issues above were not sufficiently daunting, perhaps the largest challenge is that discovery is highly sensitive to the statistical approaches used. A multitude of different selection techniques and model-building algorithms are available for sorting through large numbers of biomarkers to “find the winners.” In our experience, model building performed with automated variable selection techniques such as forward, backward, and stepwise procedures or with automated criterion-based procedures such as Akaike and Bayesian information criteria may result in different panels of biomarkers selected. Our personal experiences in this regard are very similar to those of the authors, namely that the biomarkers selected among the candidates differ substantially depending on the conditions of the statistical analytic strategy.
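For readers unfamiliar with these procedures, the following sketch shows one of the automated strategies mentioned above, forward selection by the Akaike information criterion, on simulated data (marker indices and effect sizes are hypothetical). Swapping in backward elimination or a Bayesian information criterion penalty can, and often does, return a different panel from the same data, which is the instability the text describes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate an outcome driven by 2 of 8 candidate "biomarkers" (a linear
# outcome is used purely to keep the sketch simple).
n, p = 300, 8
X = rng.normal(size=(n, p))
y = 1.0 * X[:, 0] + 0.8 * X[:, 3] + rng.normal(size=n)

def aic(features):
    """Gaussian-likelihood AIC for an OLS fit on the given feature subset."""
    Xd = np.column_stack([np.ones(n)] + [X[:, j] for j in features])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = np.sum((y - Xd @ beta) ** 2)
    k = Xd.shape[1] + 1  # coefficients + error variance
    return n * np.log(rss / n) + 2 * k

# Greedy forward selection: at each step, add the marker that most improves
# AIC; stop when no remaining marker helps.
selected = []
current = aic(selected)
while True:
    candidates = [j for j in range(p) if j not in selected]
    trials = {j: aic(selected + [j]) for j in candidates}
    best = min(trials, key=trials.get)
    if trials[best] >= current:
        break
    selected.append(best)
    current = trials[best]
```

Because each step conditions on the markers already chosen, a correlated competitor entering first can permanently exclude a truly informative marker, one concrete mechanism behind the divergent panels described above.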
In the article by Gerstein et al, several biomarkers selected in the derivation set did not replicate in the validation set; moreover, several of those that did replicate are well-studied biomarkers that have been identified previously through more standard biomarker approaches. In our view, the "discovery" of previously established biomarkers or known risk factors with the use of the preset algorithms is a welcome result because it serves as a positive control and increases confidence in the more novel biomarkers that emerge. Splitting samples into derivation and validation subsets is a reasonable approach to address replication within the same sample but diminishes statistical power. Alternative strategies using bootstrapping with optimism correction across the entire study sample can preserve statistical power without exaggerating biomarker performance.10 Moving forward, we believe investigators should justify the use of 1 selection technique or algorithm over another and consider sensitivity analyses using a number of different statistical approaches.
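The optimism-corrected bootstrap mentioned above can be sketched in a few lines. This is a simplified illustration on simulated data with R2 as the performance measure (clinical applications would typically use the C statistic): the model is refit on each bootstrap sample, the difference between its bootstrap-sample and original-sample performance estimates the optimism, and the average optimism is subtracted from the apparent performance, so the full sample is used for both fitting and validation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate a setting prone to overfitting: 100 subjects, 10 candidate
# markers, only the first carrying any true signal.
n, p = 100, 10
X = rng.normal(size=(n, p))
y = 0.5 * X[:, 0] + rng.normal(size=n)

def fit(Xa, ya):
    """OLS coefficients (intercept first)."""
    Xd = np.column_stack([np.ones(len(ya)), Xa])
    beta, *_ = np.linalg.lstsq(Xd, ya, rcond=None)
    return beta

def r2(beta, Xa, ya):
    """R2 of a fitted model evaluated on (Xa, ya)."""
    Xd = np.column_stack([np.ones(len(ya)), Xa])
    resid = ya - Xd @ beta
    return 1 - resid.var() / ya.var()

# Apparent performance: model fit and evaluated on the same data.
beta_full = fit(X, y)
apparent = r2(beta_full, X, y)

# Optimism: refit on each bootstrap sample, compare its performance on the
# bootstrap sample vs the original sample, and average the difference.
B = 200
optimism = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)
    b = fit(X[idx], y[idx])
    optimism.append(r2(b, X[idx], y[idx]) - r2(b, X, y))
corrected = apparent - np.mean(optimism)
```

In this simulation the corrected estimate falls well below the apparent one, quantifying exactly how much the derivation data flatter the model without sacrificing half the sample to a holdout set.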
Another important statistical issue is that the findings may be highly sensitive to which covariates are included in the multivariable adjustment and how the covariates are modeled. In the present study, for example, different biomarkers emerged on the basis of the seemingly trivial distinction of whether age was considered as a categorical or a continuous variable. Biomarkers that are heavily correlated with age are more likely to be eliminated when the full range of age on a continuous scale is considered.11 Some biomarkers may reflect pathways between risk factors and CVD events, and simply adjusting for traditional risk factors may obscure important mediating relationships.
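A small simulation (hypothetical marker and effect sizes, not ORIGIN data) makes the age-modeling point concrete: when the outcome depends only on age and the "biomarker" is merely an age proxy, adjusting for age as a continuous variable removes the marker's apparent effect, whereas a crude binary age split leaves residual age signal that the marker absorbs.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulate: outcome risk is driven entirely by age; the "biomarker" is an
# age proxy with noise and carries no independent information.
n = 2000
age = rng.uniform(40, 80, size=n)
marker = age + 5 * rng.normal(size=n)
y = 0.05 * age + rng.normal(size=n)

def marker_coef(age_covariate):
    """OLS coefficient for the marker after adjusting for an age covariate."""
    Xd = np.column_stack([np.ones(n), age_covariate, marker])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta[2]

# Continuous adjustment: the marker's coefficient is near zero.
coef_continuous = marker_coef(age)

# Binary adjustment (age >= 60): within-category age variation is
# unadjusted, so the age-correlated marker picks it up and appears to
# predict the outcome independently.
coef_binary = marker_coef((age >= 60).astype(float))
```

The same arithmetic explains the paper's observation in reverse as well: a marker whose value lies in being an efficient summary of age-related risk will survive coarse adjustment but be eliminated once the full continuous range of age is modeled.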
These observations highlight inherent limitations of a mechanical approach to big data in which the computer makes all the decisions on the basis of prespecified algorithms, incorporating little to no user input. We would argue that more directed user input is necessary on both the input and output sides of model-building algorithms to maximize the potential and to minimize the error rate of the big data approach.
In summary, despite the promise of big data and biomarker discovery, this important article by Gerstein et al highlights that the road to biomarker discovery is going to be tortuous. The technological capability to measure hundreds of biomarkers has outpaced the sophistication of our analytical approaches in interpreting this amount of data. Moreover, assessment of assay-related factors affecting measurements may not be feasible with explorations of this size and scope. Discovery efforts in biomarker science lag behind those in genomics, where large-scale collaboration and multiple-step replication are now standard operating procedures and where the discovery bar is well established. Collaboration in biomarker discovery research is more difficult than in genetics because of the limited biospecimen repositories, higher cost of assays, and lack of incentive for investigators to use their limited supply of biospecimens for replication studies.
None of these challenges are insurmountable, but they require a systematic “walk before we run” approach to yield the most valid scientific results. Large-scale biomarker collaborations, which facilitate replication of major results across study populations, are only beginning to be developed. Moreover, we need standards to define optimal strategies (or at least minimum requirements) for large-scale biomarker discovery and validation. Otherwise, the risks of the big data approach are substantial, with the potential for sins of both commission (ie, false discoveries that do not replicate) and omission (failure to detect important biomarkers as a result of study conditions). This important article by Gerstein et al highlights both the opportunities and the challenges that lie ahead.
Dr de Lemos has received grant support from Roche Diagnostics and Abbott Diagnostics and consulting income from Roche Diagnostics, Abbott Diagnostics, and Diadexus, Inc. Dr Rohatgi and C.R. Ayers report no conflicts.
The opinions expressed in this article are not necessarily those of the editors or of the American Heart Association.
- © 2015 American Heart Association, Inc.