Admin · Apr 5, 2019 · 4 min read

A perfect data set? Yes and no: The All of Us Project

In an ideal world, all health data would be meticulously collected, capture the whole patient from genomics to medical interventions to environmental factors, reflect diverse patients in the context of clinical care, and be tracked across extended time spans.

In the real world, no single source of health data—whether derived from randomized controlled trials or routinely collected health care data—is both meticulously collected and sufficiently comprehensive to capture all drivers of health outcomes for all patient populations.

Enter the All of Us project, an ambitious $4.5 billion initiative of the National Institutes of Health (NIH) to collect comprehensive, longitudinal health data from one million Americans, or about 1 in 325.

Announced in 2015 as the Precision Medicine Initiative Cohort Program, the project was spurred by the success of the Human Genome Project and the promise of precision medicine. By recruiting one million diverse volunteers to share their health data over ten years, the NIH aims to create a massive, clean, longitudinal data set for researchers, accelerating discoveries in prevention and treatment and taking personalized care from concept to mainstream practice.

“Just like analyzing our DNA teaches us more about who we are than ever before,” said President Obama at the launch, “analyzing data from one of the largest research populations ever assembled will teach us more about the connections between us than ever before.”

Four years later, the project has recruited 200,000 volunteers in the 12 months since its planning and beta-test phases ended, and is on target to reach one million by 2025 or before. For each of those participants, the scope of data collected surpasses that of any data set yet produced by randomized controlled trials or real-world sources.
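The "on target" claim can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the enrollment figures come from this article, while the function name and the assumption that the monthly recruitment pace stays constant are ours, not the NIH's.

```python
import math

RECRUITED = 200_000      # volunteers enrolled so far (from the article)
MONTHS_ELAPSED = 12      # recruitment window to date (from the article)
TARGET = 1_000_000       # enrollment goal (from the article)


def months_to_target(recruited, months_elapsed, target):
    """Months still needed to hit the target.

    ASSUMPTION: the average monthly pace observed so far stays constant.
    """
    remaining = target - recruited
    # remaining / (recruited / months_elapsed), rearranged to limit rounding error
    return math.ceil(remaining * months_elapsed / recruited)


months = months_to_target(RECRUITED, MONTHS_ELAPSED, TARGET)
print(f"{months} more months (about {months / 12:.0f} years) at the current pace")
```

At a steady pace, the remaining 800,000 volunteers would take roughly four more years, which is consistent with reaching one million by 2025 or before.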

With privacy safeguards in place, the process for each participant begins with three health surveys covering more than 100 data points on demographics, daily activities, health access, lifestyle, and personal and family medical histories, followed by annual surveys for ten years. Participants consent to sharing their electronic health records (EHRs), and those data are updated every six months. At an initial visit to a participating clinic, height, weight, and hip and waist circumference are each measured twice. Blood pressure and heart rate are measured three times each. Seven tubes of blood are drawn: one each for plasma, serum, and sodium heparin; two for whole blood; one for DNA; and one for RNA. A urine sample is taken. The participant’s health is then tracked for ten years.

“The more complete the longitudinal health information, the higher the utility of data,” says Jeremy Rassen, Sc.D., chief science officer of Aetion. “Combining physical, genomic, and EHR data will allow real-world researchers to dive into a variety of questions about both causes and outcomes, getting to precision medicine. With proper real-world evidence research methods in place, the project’s data has the potential to be an excellent source for certain studies.”

Potential limitations of the data derive from the difference between risk-factor epidemiology and clinical epidemiology. The eventual data set, says Sebastian Schneeweiss, M.D., Sc.D., chief of the Division of Pharmacoepidemiology and Pharmacoeconomics of Brigham and Women’s Hospital, “will be useful in risk-factor epidemiology: studies of genetic versus environmental risk factors and how they play out in the incidence of diseases.” It is likely to be less useful in clinical pharmacoepidemiology, where even one million participants may not be enough. “You could look at very commonly prescribed medications but couldn’t evaluate new medications or drugs new to the market, highly targeted medications, or drugs for rare diseases. You’ll run out of patients.”
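To see why a cohort of one million can still "run out of patients," consider a rough back-of-the-envelope sketch. The cohort size is the project's stated target; the usage rates below are hypothetical values chosen purely for illustration and are not drawn from the project or from any real prescribing data.

```python
COHORT_SIZE = 1_000_000  # target enrollment (from the article)

# HYPOTHETICAL usage rates, for illustration only; not real prescribing data.
hypothetical_usage_rates = {
    "commonly prescribed medication": 0.10,     # 1 in 10 participants
    "drug new to the market": 0.001,            # 1 in 1,000 participants
    "drug for a rare disease": 0.00001,         # 1 in 100,000 participants
}


def expected_users(cohort_size, usage_rate):
    """Expected number of cohort members taking a drug at a given usage rate."""
    return round(cohort_size * usage_rate)


for drug, rate in hypothetical_usage_rates.items():
    print(f"{drug}: about {expected_users(COHORT_SIZE, rate):,} expected users")
```

Under these assumed rates, the common medication would have a large exposed group, while the rare-disease drug would have only a handful of users, too few to support most comparative analyses.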

The data could also be less representative of “all walks of life” than NIH Director Francis Collins, M.D., Ph.D., envisions. Participants who consent to share their EHR data could skew toward those who do not have conditions that carry stigma, such as mental illness, substance use disorders, and HIV or other STDs. Conversely, as the project partners with community health centers to recruit diverse participants, the data may over-represent those with diabetes and other diseases prevalent in underserved communities. The annual participant surveys are another potential source of bias. Will self-reported measures be accurate? Is the project accounting for potential recall bias? In all these cases, says Rassen, potential biases would not invalidate the project, “but researchers need to take them into account when using the data.”

The All of Us Project could present a new opportunity for pragmatic randomized trials, says Schneeweiss. “Once you have one million people in your system, you may be able to impose randomized interventions on subsets. The project is in sustained contact with the participants. It’s a small extra step to obtain informed consent to intervene for certain conditions. The participants are motivated; they may consent to be randomized. But if you waited for participants to take certain medications, you may never have enough for a study.”

These cautions aside, data scientists applaud the goal of the All of Us Project: generating comprehensive, longitudinal data to drive precision medicine. “Working with diverse, real-world data sets has not always been possible,” says epidemiologist Erin Comerford, “but it is an important step as we work to identify the causes of individual differences in response to commonly used drugs.”