Admin | Feb 16, 2019 | 6 min read

The epidemiology of databases: Part I: Four principles of working with real-world data

As the U.S. Food and Drug Administration continues its formal exploration of the use of real-world evidence (RWE) in regulatory decision-making, Aetion has submitted a comment letter to the agency outlining our approach to database epidemiology to help guide that process.

Our perspective is straightforward: When RWE is generated from real-world data (RWD) that are handled according to the four key principles below, the resulting analyses are clinically meaningful and all stakeholders can be confident in them.

These principles already guide our ongoing collaboration with FDA on RCT DUPLICATE, a series of demonstration projects with Harvard Medical School and Brigham and Women’s Hospital—a shared learning process that will benefit regulators, product developers, payers, providers and, most importantly, patients.

To generate scientifically valid and accurate results using real-world data, investigators must ensure that:

1. Data are “fit for purpose”

Principled database epidemiology starts with framing questions in a meaningful way for the ultimate decision-making audience. When assessing the relevance of a data set for an RWD analysis, the following should be considered:

  • Prior experience with a data source
  • Availability of validation studies against a gold standard
  • Detailed documentation of the data generation mechanism
  • Detailed description of the data curation process
  • Detailed description of any mapping to medical concepts
  • Documentation of any coding shifts over time

In particular, for a specific study of a causal treatment effect in an RWD set, the following four features must be measurable:

  • The population inclusion and exclusion criteria to characterize the target population and understand its generalizability
  • The exposure status
  • The outcome(s) of interest
  • The factors that influence both the treatment decision and the outcome which, if not properly accounted for, can confound (bias) the estimate of the treatment effect

How well each of these features is measured should be quantified with metrics such as sensitivity and specificity for binary variables, or mean squared difference and proportion missing for continuous variables, so that reviewers of the study can assess its predictive validity.
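The metrics named above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names, the use of `None` to mark a missing value, and the toy inputs are all assumptions, and the reference values stand in for whatever gold standard a validation study provides.

```python
def sensitivity_specificity(recorded, reference):
    """Compare a binary variable in the RWD source with a gold-standard reference."""
    pairs = list(zip(recorded, reference))
    tp = sum(1 for r, g in pairs if r and g)          # true positives
    fn = sum(1 for r, g in pairs if not r and g)      # false negatives
    tn = sum(1 for r, g in pairs if not r and not g)  # true negatives
    fp = sum(1 for r, g in pairs if r and not g)      # false positives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("reference must contain both positives and negatives")
    return tp / (tp + fn), tn / (tn + fp)


def continuous_metrics(recorded, reference):
    """Mean squared difference over non-missing pairs, plus proportion missing.

    Missing recorded values are represented as None (an assumption of this sketch).
    """
    pairs = [(r, g) for r, g in zip(recorded, reference) if r is not None]
    if not pairs:
        raise ValueError("no non-missing recorded values")
    proportion_missing = 1 - len(pairs) / len(recorded)
    msd = sum((r - g) ** 2 for r, g in pairs) / len(pairs)
    return msd, proportion_missing


# Hypothetical toy data: a diagnosis flag and a BMI value, each with a reference.
sens, spec = sensitivity_specificity([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1])
msd, missing = continuous_metrics([24.0, None, 31.0, 28.0], [25.0, 22.0, 31.0, 26.0])
```

Reporting these numbers alongside each variable definition lets a reviewer judge how much measurement error a given exposure, outcome, or covariate is likely to carry.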

Claims data included in certain RWD sets are not well suited to testing causal effects of medical products in every clinical area. For example, claims-based analyses in oncology may be less useful because critical clinical parameters such as biomarkers, staging, grading, and progression-free survival may not be reliably captured. Instead, specialty registries may be more appropriate.

Another example is a registry-based study of a drug for rheumatoid arthritis that draws on EHRs from rheumatology clinics. Some care for this condition, however, may be delivered by primary care doctors whose services might not be tracked in these systems.

In short, when data are fit for purpose, we obtain a more accurate view of the patient’s health history.

2. Real-world data sets should be fully transparent and traceable

RWD can come from a range of sources such as claims databases, registries, EHRs, and clinical trial data. Decisions about how to transform raw RWD into analyzable data elements can affect the resulting estimates. Therefore, all decisions made in the selection and preparation of data must be clearly documented and available for review.

Because proper data handling is paramount to good “study hygiene” and responsible stewardship of RWD, processes should be in place to authenticate and document all users of the data throughout each stage of evidence generation. And because reproducibility and traceability are cornerstones of trust in real-world evidence, fully archived and auditable logs should record all transactions and provide comprehensive versioning of the data, including data history, provenance, linkages, and transformations.
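One common way to make a transaction log auditable is to chain its entries by hash, so that any later alteration becomes detectable. The sketch below is illustrative only and does not describe any particular platform's implementation; the function names `append_entry` and `verify_log` and the entry fields are hypothetical. A production log would also carry timestamps and authenticated user identities.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body):
    """SHA-256 over a canonical JSON encoding of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log, user, action, details):
    """Append an entry whose hash covers its content and the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    body = {"user": user, "action": action, "details": details, "prev_hash": prev_hash}
    log.append({**body, "hash": _digest(body)})
    return log


def verify_log(log):
    """Return True only if every entry is intact and correctly chained."""
    prev_hash = GENESIS_HASH
    for entry in log:
        body = {k: entry[k] for k in ("user", "action", "details", "prev_hash")}
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


# Hypothetical usage: two recorded transformations, then a tampering attempt.
log = []
append_entry(log, "analyst_1", "link", {"sources": ["claims", "registry"]})
append_entry(log, "analyst_2", "transform", {"step": "map diagnosis codes"})
intact = verify_log(log)
log[0]["details"]["sources"].append("ehr")  # edit an archived entry after the fact
tampered_detected = not verify_log(log)
```

Because each hash depends on the one before it, changing any archived entry invalidates the rest of the chain, which is what allows a reviewer to trust the recorded history.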

Reliable RWD is derived from codes or combinations of codes that represent relevant medical concepts. Key elements of reliability are:

  • Ability to explore data completeness, including trends over time and consistency
  • Fully transparent documentation and versioning of every methodological decision, including the analytic cohort used in the analysis (if permitted within governance rules) and its data history, provenance, and transformation

Data reliability is higher during the periods in which the patient is considered “observable,” that is, the time over which a patient’s recorded health events are more likely to be complete and to reflect their health encounters. Therefore, reliability analyses should assess whether exposures and outcomes are observable in the RWD source used.
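A simple observability check can be sketched as follows. It assumes, purely for illustration, that observable time is approximated by enrollment periods (a common proxy in claims data); the function names and data layout are hypothetical.

```python
from datetime import date


def is_observable(event_date, periods):
    """True if the event falls inside any (start, end) observable period."""
    return any(start <= event_date <= end for start, end in periods)


def observable_fraction(events, enrollment):
    """Share of events that occur while the patient is observable.

    events: list of (patient_id, event_date)
    enrollment: dict mapping patient_id to a list of (start, end) periods
    """
    if not events:
        return None
    observed = sum(
        1 for pid, when in events if is_observable(when, enrollment.get(pid, []))
    )
    return observed / len(events)


# Hypothetical enrollment periods and outcome events.
enrollment = {
    "p1": [(date(2018, 1, 1), date(2018, 12, 31))],
    "p2": [(date(2018, 6, 1), date(2018, 9, 30))],
}
events = [
    ("p1", date(2018, 3, 1)),   # inside p1's period
    ("p2", date(2018, 1, 15)),  # before p2's period starts
    ("p2", date(2018, 7, 1)),   # inside p2's period
    ("p3", date(2018, 7, 1)),   # patient with no known observable time
]
fraction = observable_fraction(events, enrollment)
```

A low fraction for exposures or outcomes suggests that the source may be missing encounters during the study period and that reliability deserves closer review.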

3. Data should be minimally processed and transformed

While data processing and transformation can make data fit for one purpose, they may also make the data unusable for other purposes. Therefore, RWD must be available in a minimally processed state at the beginning of the study, with no translation of coding systems, medical terminologies, or even data format and structure, so that scientifically appropriate choices can then be made for each use of the data.

Minimal processing ensures that critical transformations turning recorded drug prescriptions, diagnoses, and procedures into study-relevant definitions of exposures, outcomes, and covariates are explicitly documented, rather than obfuscated in a separate process of data transformation.

It is also important to perform most data “cleaning” at the time of the study implementation, using so-called “late-bound transformations,” so that a general process of data cleaning does not reduce the utility of RWD datasets, or worse, introduce misleading information.

Consider the case of a patient’s body mass index (BMI) recorded as “180,” which could plausibly be a mistyped 18.0, a blood pressure entered in the BMI field, or simply an error. A one-time cleaning process might recode this value to 18, to “high,” or to missing. However, the best choice for such a data element depends on the circumstances of the study. For a study of an anti-diabetic drug’s effect on BMI, an investigator will likely wish to exclude this patient because the recorded BMI is outside a clinically meaningful range.

For a study of the risk of myocardial infarction, however, where BMI is one of many relevant covariates, a more approximate definition may well be viable. Late-bound cleaning of this data element—so that in the first study it is set to missing and thus excluded while in the second study it is set to “high” and can be included—allows investigators to make the best decision in each case.
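The BMI example can be sketched as two study-specific cleaning rules applied to the same untouched raw records. The plausible range and the cutoff for “high” below are assumptions chosen for illustration, as are the function and field names.

```python
# Raw records are stored as recorded and are never modified by either study.
RAW_RECORDS = [
    {"patient_id": "A", "bmi": 27.4},
    {"patient_id": "B", "bmi": 180.0},  # the implausible value from the example
    {"patient_id": "C", "bmi": 33.1},
]

PLAUSIBLE_BMI = (10.0, 80.0)  # assumed clinically meaningful range
HIGH_BMI_CUTOFF = 30.0        # assumed threshold for the coarse "high" category


def bmi_as_outcome(record):
    """Study 1 (drug effect on BMI): out-of-range values are set to missing."""
    low, high = PLAUSIBLE_BMI
    bmi = record["bmi"]
    return bmi if low <= bmi <= high else None


def bmi_as_covariate(record):
    """Study 2 (myocardial infarction risk): a coarse category is sufficient."""
    return "high" if record["bmi"] >= HIGH_BMI_CUTOFF else "not high"


# Each study binds its own rule at analysis time.
outcome_values = [bmi_as_outcome(r) for r in RAW_RECORDS]
covariate_values = [bmi_as_covariate(r) for r in RAW_RECORDS]
```

Because the rules live in the study code rather than in a one-time cleaning pass, each decision is visible to reviewers and the raw value of 180 remains available for any future study that needs a different treatment.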

In all cases, decisions made and the rationales behind them must be clearly documented for reviewers and decision-makers to ensure full transparency in the study results.

4. Use and storage of RWD should adhere to relevant governance standards

Appropriate safeguards must be in place to guard protected health information. The most critical safeguard is to de-identify data sets so that there is no reasonable basis to believe that the health information can be used to identify an individual.

If the data cannot be de-identified, however, because doing so would limit the utility of the RWD, obtaining a certification from the Health Information Trust Alliance (HITRUST) that the HITRUST Common Security Framework (CSF) has been implemented provides reasonable assurance that the standards in the Health Insurance Portability and Accountability Act (HIPAA) have been met. For example, encrypting RWD from creation to destruction is a reasonable method to help ensure that the data remain in a closed system and cannot be manipulated by unauthorized users.

More to come

Applying these four principles to an RWD study delivers confidence that the results are clinically meaningful and scientifically valid. The principles also give product developers, patients, payers, and other stakeholders a sufficient level of predictability and certainty in regulatory decision-making that incorporates RWE, a key consideration for the use of RWE in improving the efficiency of clinical trials.

Real-world evidence is a field in which further development is both likely and necessary. These four key principles are just the beginning as we all work toward new and improved methods, technology, data sources, and applications to achieve regulatory-grade standards for RWE.

Part II of this series, The Epidemiology of Databases, presents four principles for generating real-world evidence, the results of RWD studies. Part III presents applications of RWE.