Xi Fang | Apr 20, 2024 | 1 min read

How to unleash the replicability potential of synthetic data

In today's data-driven world, privacy concerns keep growing, and the need for solutions that keep pace with them is more pressing than ever. Stringent data regulations, such as HIPAA in the United States and GDPR in the European Union, mandate strict protocols to safeguard the confidentiality of health data and permit only limited access to it. Synthetic data offers a way forward: artificial datasets that mirror real-world data while protecting individual identities. As a result, more and more researchers are exploring the replicability potential of synthetic data as an alternative to real data.


In this context, replicability describes whether synthetic data can yield an analytical result “comparable” to that of the real data when the same analysis method is applied. But the burning question remains: how do we test for replicability? Are there reliable and feasible guidelines for using synthetic data effectively, specifically tailored to unleash its replicability potential in health research?


The answer is YES! 


In our paper, published in Scientific Reports, a journal of Nature Portfolio, we offer an in-depth evaluation of the replicability of analyses that use synthetic health data, covering 240 distinct simulation scenarios. From these simulation results, we distilled guidelines for using synthetic data validly and optimizing replicability. The guidelines answer the questions researchers are most curious about, for example:


  • How many synthetic datasets, and which techniques, are sufficient to ensure good replicability performance?
  • Can we benefit from synthetic data amplification?
  • If so, which type of generative model should we use to synthesize the health data?
  • Is privacy still a concern in the synthetic data generation process?


Most importantly, the successful application of these guidelines to different replicability studies illustrates the robustness and adaptability of this approach. In a real-world application, the synthetic datasets produced precise estimates of the effect sizes of key covariates in eight breast cancer clinical trials, with confidence interval overlap of up to 99% and very low privacy risk.
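
For readers who want a feel for what confidence interval overlap measures, here is a minimal Python sketch. It assumes one commonly used definition: the intersection of the real-data and synthetic-data intervals, expressed as a fraction of each interval's width and then averaged. The exact metric used in the paper may differ, and the function names and numbers below are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist


def confidence_interval(estimate, std_error, level=0.95):
    """Normal-approximation confidence interval for an estimate."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * std_error, estimate + z * std_error


def ci_overlap(real_ci, synth_ci):
    """Average fraction of each interval covered by their intersection.

    Returns 1.0 for identical intervals. Intervals that do not intersect
    are clipped to 0.0 here, which is a simplifying assumption.
    """
    real_lo, real_hi = real_ci
    synth_lo, synth_hi = synth_ci
    shared = min(real_hi, synth_hi) - max(real_lo, synth_lo)
    if shared <= 0:
        return 0.0
    real_share = shared / (real_hi - real_lo)
    synth_share = shared / (synth_hi - synth_lo)
    return 0.5 * (real_share + synth_share)


# Hypothetical effect estimates, not results from the paper.
real_ci = confidence_interval(estimate=0.50, std_error=0.10)
synth_ci = confidence_interval(estimate=0.52, std_error=0.10)
print(f"CI overlap: {ci_overlap(real_ci, synth_ci):.2f}")
```

A value close to 1 means the synthetic-data analysis would lead to nearly the same interval estimate as the real-data analysis; a value near 0 means the two analyses disagree.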


Embracing synthetic data while adhering to robust guidelines enables researchers to unlock its full replicability potential, and contributes to the health research community by fostering greater collaboration and knowledge sharing. If you're eager to delve deeper into these methods and maximize the utility of synthetic data, reach out to us today for further insights.