Connecting the Dots
From Big Data to Healthy Heart
Rising capacity to measure extensive arrays of biological parameters has ushered in an era of biomedical big data. As massive datasets from large cohorts become the norm, the discipline of data science has emerged to tackle data-driven problems at the intersection of biomedical research and patient care. We introduce several sources of cardiovascular big data and discuss the importance of maximizing participation in data-driven knowledge production models.
What Are Big Data?
Every day, our world produces a staggering 2.5 quintillion (10¹⁸) bytes of data, including a steadily increasing amount of data from health care and biomedicine. The whole-genome sequence of a patient can reach 100 gigabytes (10¹¹ bytes) in size, whereas a cardiology division may perform >1000 echocardiograms per month, totaling >200 gigabytes of data. The term biomedical big data has been coined to describe healthcare and biomedical datasets that reach remarkable scale, volume, or complexity. Four biomedical big data sources are of particular interest to cardiovascular biomedicine:
Functional phenotypes: Demographics, hemodynamics, electrocardiography, echocardiograms, and imaging data are pouring in from large cohorts, such as those among the ≈11 500 cardiac-related studies listed by clinicaltrials.gov. Popular personal fitness-tracking devices likewise have created a deluge of mobile health data (eg, heart rate, physical activities, lifestyle) awaiting exploitation. The ability to extract features from phenotypic data and to identify complex interrelationships offers tremendous potential to enhance diagnoses and to improve care.
Molecular profiles: Large-scale omics data on genes, transcripts, proteins, and metabolites can now be acquired in large studies, in the clinic, or even commercially and may be integrated with functional data to allow a better understanding of disease pathogenesis. For example, the National Center for Biotechnology Information Database of Genotypes and Phenotypes alone lists >2000 molecular and functional datasets from >250 cardiovascular studies, including the Framingham Heart Study, the Jackson Heart Study, and other National Institutes of Health–sponsored cohort studies. Collection of even more omics data from cardiovascular cohorts is being promoted by funding agencies, for example, the National Heart, Lung, and Blood Institute X01 funding mechanism for “Omics Phenotypes of Heart, Lung, and Blood Disorders.”
Medical records: Patient electronic medical records abound with physician’s notes, billing codes, laboratory test results, and other valuable information on disease, treatment, and epidemiology that may be mined for association studies and predictive modeling on prognosis and drug responses. Billing code data have been most readily analyzable because of their codified structure. Ongoing efforts to translate less structured but more information-rich physician notes for computation will open new analytic avenues.
Literature knowledge: PubMed boasts a treasure trove of >2.2 million cardiovascular-related articles from 1809 to 2016, and it is estimated that there is a new publication every ≈2.7 minutes. This volume of data overwhelms the capacity of human readers to keep abreast of biomedical knowledge. Biomedical articles are written in natural languages, which are syntactically complex and do not come in predefined structures that allow them to be easily parsed by computer programs. Methods to allow biomedical corpora to be read and computed by software (ie, rendering them machine readable) will unleash tremendous power to mine knowledge on genes, diseases, and drugs.
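As a back-of-envelope check on the scale figures quoted in this section, the short Python sketch below works out what they imply per study and per year. It assumes decimal gigabytes (10⁹ bytes) and treats the quoted lower and upper bounds as exact values, so the results are rough orders of magnitude rather than measurements.

```python
# Back-of-envelope arithmetic on the figures quoted in this section.
# Assumption: decimal gigabytes, consistent with 100 GB = 10^11 bytes.
GIGABYTE = 10**9

# ">1000 echocardiograms per month, totaling >200 gigabytes" (lower bounds).
echo_bytes_per_month = 200 * GIGABYTE
echos_per_month = 1000
bytes_per_echo = echo_bytes_per_month / echos_per_month

# "can reach 100 gigabytes" (upper bound) versus 2.5 quintillion bytes per day.
genome_bytes = 100 * GIGABYTE
world_bytes_per_day = 2.5 * 10**18
genomes_per_day_equivalent = world_bytes_per_day / genome_bytes

# "a new publication every ~2.7 minutes", extrapolated to a 365-day year.
minutes_per_year = 365 * 24 * 60
articles_per_year = minutes_per_year / 2.7

print(f"~{bytes_per_echo / 10**6:.0f} MB per echocardiogram")
print(f"~{genomes_per_day_equivalent:,.0f} genome-sized files per day worldwide")
print(f"~{articles_per_year:,.0f} new articles per year")
```

Under these assumptions, each echocardiogram averages about 200 megabytes, and the quoted publication rate corresponds to roughly 195 000 new articles per year.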
What Benefits Can Data Science Offer?
Bigger data can bring richer phenotypic measurements, truer representation of populations, and more granular information on disease susceptibility and treatment responsiveness, all of which are prerequisites for precision medicine.¹ As an illustrative example, investigators from the Electronic Medical Records and Genomics network developed natural language processing algorithms to extract patient electrocardiographic features and clinical data contained in physician notes from >5000 electronic medical records across 5 participating hospitals. Informaticians then integrated the data with genomic variants in the CHARGE (Cohorts for Heart and Aging Research in Genomic Epidemiology) cohorts to identify genetic loci controlling certain electrocardiographic variables such as QRS intervals. Subsequent phenome-wide association identified a handful of polymorphic variants that may be used by physicians to assess and predict the risks of patients developing arrhythmias.²
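To make the idea of extracting electrocardiographic features from free-text notes concrete, the following is a minimal Python sketch that pulls a QRS duration out of a note with a regular expression. It is not the network's actual algorithm, which the text does not describe; the pattern, the function name `extract_qrs_ms`, and the three example notes are all invented for illustration, and real clinical text would need far more robust handling.

```python
import re
from typing import Optional

# Hypothetical pattern: matches phrases such as "QRS duration 98 ms",
# "QRS: 142 msec", or "QRS interval of 104 ms" (case-insensitive).
QRS_PATTERN = re.compile(
    r"\bQRS(?:\s+(?:duration|interval))?\s*(?:of|is|was|:|=)?\s*"
    r"(\d{2,3})\s*(?:ms|msec)\b",
    re.IGNORECASE,
)


def extract_qrs_ms(note: str) -> Optional[int]:
    """Return the first QRS duration (in ms) mentioned in a note, or None."""
    match = QRS_PATTERN.search(note)
    return int(match.group(1)) if match else None


# Invented example notes keyed by made-up patient identifiers.
notes = {
    "patient-001": "ECG today: sinus rhythm, QRS duration 98 ms, QT 400 ms.",
    "patient-002": "Wide-complex rhythm noted; QRS: 142 msec.",
    "patient-003": "No ECG performed at this visit.",
}

# Structured output that could then be joined with genotype data.
qrs_by_patient = {pid: extract_qrs_ms(text) for pid, text in notes.items()}
print(qrs_by_patient)
```

Once values like these are in structured form, they can be linked to other data, such as genomic variants, by patient identifier. That linkage is the integration step the example above relies on.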
This byte-to-bedside, data-to-knowledge paradigm is now being explored globally to enhance clinical decision making and to power biomedical discoveries. However, data alone, whether big or small, do not automatically lead to answers. Success requires the confluence of large-scale phenotypic and molecular data and the ability to computationally integrate them across cohorts. As the rate-limiting step in knowledge production shifts from data generation to interpretation, computational and analytic advances often become indispensable for extracting value from data. Take the challenge of electronic medical record extraction above: If the data volume were small, human data entry and curation specialists could be used to manually extract data from electronic medical records and put them into databases. With millions of patient records, however, it is not feasible to simply scale up the workforce needed for data entry, a predicament that has spurred informatics innovations on text-mining and crowd-sourcing alternatives. Opportunities abound for data science advances to address biomedical questions and to expand knowledge.³
What Roles Do Data Commons Play?
Data commons are shared virtual environments where data users can come together and interact with the building blocks of data science (data and tools) to complete data-driven analyses that suit their professional interests (Figure). Although implementations vary, the data science sandboxes provided by commons allow data generators to share data and data consumers to locate the data, to access the tools required to decode them, to perform analyses that address biomedical questions, and to distribute the results. A hypothetical commons may contain entry-restricted servers where cohort data can be securely deposited; operational servers where data may be selected and deidentified for consumption; online informatics platforms comprising resource registries and search engines, both of which allow consumers to locate and access digital objects, that is, datasets and tools, through direct queries (discovery indexes); and unified portal user interfaces or computing environments.
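The registry and discovery-index components described above can be sketched in a few lines of Python. This is a toy illustration rather than the design of any actual commons: the `DigitalObject` and `DiscoveryIndex` classes, their fields, and the three registered entries are hypothetical, and a production system would add access control, deidentification, and persistent storage.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class DigitalObject:
    """A dataset or tool registered in the (hypothetical) commons."""
    identifier: str
    kind: str                  # "dataset" or "tool"
    title: str
    keywords: Tuple[str, ...]


class DiscoveryIndex:
    """Resource registry plus a keyword index for direct queries."""

    def __init__(self) -> None:
        self._registry: Dict[str, DigitalObject] = {}
        self._index: Dict[str, Set[str]] = defaultdict(set)

    def register(self, obj: DigitalObject) -> None:
        # Store the object and index it under each lower-cased keyword.
        self._registry[obj.identifier] = obj
        for term in obj.keywords:
            self._index[term.lower()].add(obj.identifier)

    def search(self, *terms: str, kind: Optional[str] = None) -> List[DigitalObject]:
        # Return objects matching ALL terms, optionally filtered by kind.
        if not terms:
            return []
        ids = set.intersection(
            *(self._index.get(t.lower(), set()) for t in terms)
        )
        hits = [self._registry[i] for i in sorted(ids)]
        return [o for o in hits if kind is None or o.kind == kind]


# Made-up entries for illustration only.
index = DiscoveryIndex()
index.register(DigitalObject("ds-001", "dataset", "Example cohort echocardiograms",
                             ("echocardiography", "cohort")))
index.register(DigitalObject("ds-002", "dataset", "Example cohort proteomics",
                             ("proteomics", "cohort")))
index.register(DigitalObject("tool-001", "tool", "Example proteomics pipeline",
                             ("proteomics",)))

print([o.identifier for o in index.search("cohort")])
print([o.identifier for o in index.search("proteomics", kind="tool")])
```

Keeping datasets and tools in the same registry reflects the point made above that consumers need to locate both kinds of digital object through the same queries.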
Commons are important because they provide critical technical and regulatory infrastructures that promote universal participation in data-driven knowledge production.⁴ A corollary of the data-driven model is that the original generators of a dataset are no longer the only individuals invested in (or even best equipped for) analyzing the data and drawing conclusions. Contrary to the traditional paradigm in which data are generated for the purpose of testing a specific hypothesis, data consumers today may initiate new and original investigations from preexisting data by asking new questions or by applying new analytic approaches. To ensure long-term utility and productivity, data must not only be deposited for reuse but also be made broadly accessible and discoverable by users with very different perspectives, career interests, and professional expertise.
The reconceptualization of data from a single-use throwaway to a permanent, reusable resource is a paradigm shift that calls for a rethink of knowledge generation and management models. Participation from all stakeholders will ensure far-reaching implications for cardiovascular biomedicine, empowering precision medicine and realizing the benefit of big data for all.
Sources of Funding
Drs Lau, Watson, and Ping are supported by National Institutes of Health grant award number U54 GM114833.
The opinions expressed in this article are not necessarily those of the editors or of the American Heart Association.
Circulation is available at http://circ.ahajournals.org.
© 2016 American Heart Association, Inc.
References
Ritchie MD, Denny JC, Zuvich RL, Crawford DC, Schildcrout JS, Bastarache L, Ramirez AH, Mosley JD, Pulley JM, Basford MA, Bradford Y, Rasmussen LV, Pathak J, Chute CG, Kullo IJ, McCarty CA, Chisholm RL, Kho AN, Carlson CS, Larson EB, Jarvik GP, Sotoodehnia N, Manolio TA, Li R, Masys DR, Haines JL, Roden DM
Deo RC
Contreras JL, Reichman JH