Efficient Synthesis of Complex Health Data

Written by Lucy Mosquera | Jul 17, 2024

Blog authored by Lucy Mosquera (Sr. Director, Generate Operations and Data Science) and Julie Zhang (Lead Data Scientist)

Synthetic data has gained great attention in recent years, especially in the health industry, where privacy regulations limit data access. Synthetic data is produced by a machine learning or deep learning model trained on real health data. It has the same patterns, relationships, and structure as the real dataset, but the values themselves are generated by the synthesis model. Synthetic data enables easy, fast, and effective access to high-utility data while complying with privacy regulations.

The Challenge of Synthesizing Complex Health Data

Complex health data is usually longitudinal: patient information across a range of visits, tests, and outcomes is collected over a long period of time. Complex health data includes:

Clinical trial data: data captured from participants of clinical trials that is used to assess the safety and efficacy of the treatment under investigation;
Open and closed claims data: electronic records of patient-provider interactions and healthcare transactions that are collected for the purpose of billing and reimbursement;
Electronic health record data: a digital version of a patient’s medical records as used by clinicians in the delivery of healthcare.

Complex health data can contain many different relational structures between the data elements. For example, there can be nested relations where, for each oncology patient, multiple tumors are detected, and tumor-specific data is collected over time during subsequent patient visits. Each patient ID will then uniquely link to a set of tumor IDs, where distinct data is collected for each tumor ID, and different patients may have different numbers of tumors. There can also be non-hierarchical relations, such as when multiple patients are on the same insurance plan or one patient has multiple insurance plans.
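To make the two kinds of relations concrete, here is a minimal sketch in Python. All class and field names (Patient, Tumor, TumorMeasurement, Enrollment, diameter_mm, and so on) are hypothetical and chosen only for illustration; they do not come from any real schema. The nested relation is modeled by containment, while the non-hierarchical patient-to-plan relation is modeled with a separate link record.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TumorMeasurement:
    visit_day: int        # days since the first visit (hypothetical field)
    diameter_mm: float    # hypothetical tumor-specific measurement


@dataclass
class Tumor:
    tumor_id: str
    measurements: List[TumorMeasurement] = field(default_factory=list)


@dataclass
class Patient:
    patient_id: str
    # Nested relation: each patient owns a variable number of tumors.
    tumors: List[Tumor] = field(default_factory=list)


@dataclass(frozen=True)
class Enrollment:
    # Non-hierarchical relation: one link row per (patient, plan) pair.
    patient_id: str
    plan_id: str


patients = [
    Patient("P1", [
        Tumor("T1", [TumorMeasurement(0, 12.0), TumorMeasurement(30, 10.5)]),
        Tumor("T2", [TumorMeasurement(30, 4.0)]),
    ]),
    Patient("P2", [Tumor("T3", [TumorMeasurement(0, 8.0)])]),
]

enrollments = [
    Enrollment("P1", "PLAN_A"),
    Enrollment("P1", "PLAN_B"),   # one patient, multiple plans
    Enrollment("P2", "PLAN_A"),   # multiple patients, same plan
]


def plans_for(patient_id: str) -> List[str]:
    """All plans a patient is enrolled in."""
    return sorted(e.plan_id for e in enrollments if e.patient_id == patient_id)


def members_of(plan_id: str) -> List[str]:
    """All patients enrolled in a plan."""
    return sorted(e.patient_id for e in enrollments if e.plan_id == plan_id)
```

Neither side of the enrollment relation owns the other, which is why it cannot be expressed as simple nesting the way the patient-to-tumor relation can.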

These relational structures are important because they allow data to be collected in a manner that reflects the complexity of how patients interact with the healthcare system. However, fully synthesizing these complex relationships requires an equally complex deep learning model. Full synthesis can help achieve maximum privacy, but training deep learning models requires costly graphics processing units (GPUs), large amounts of data, and a substantial amount of training time. Additionally, with the complex structures mentioned above, even highly customized deep learning models may be unable to preserve all the correlations and relationships.

Our Solution: Partial Longitudinal Synthesis

Partial Longitudinal Synthesis (PLS) is our solution to synthesizing complex health data. In PLS, not all variables are synthesized. Instead, only the quasi-identifier variables that are cross-sectional in the dataset (recorded once per patient rather than repeatedly over time) are synthesized. The other variables remain unchanged.

PLS provides privacy protection for the data by synthesizing quasi-identifiers in the data. Additionally, information on the longitudinal variables is given to the model to ensure the relationship between the synthesized quasi-identifier variables and the longitudinal variables is maintained. Finally, the synthesis complexity is significantly reduced compared to the full synthesis of longitudinal data, allowing synthesis to be completed quickly at a lower cost.
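This post does not describe the model used in PLS, so the following is only a toy sketch of the general idea under stated assumptions, not the actual implementation. It assumes a hypothetical cross-sectional patient table with two quasi-identifiers (age_band and sex) and a hypothetical longitudinal visit table. As a stand-in for a trained generative model, it draws each quasi-identifier from the empirical distribution of patients with a similar longitudinal summary (here, a binned visit count). A stand-in this simple would not provide adequate privacy protection on its own; it only shows the shape of the data flow.

```python
import random
from collections import defaultdict

# Hypothetical cross-sectional quasi-identifiers to synthesize.
QUASI_IDENTIFIERS = ("age_band", "sex")


def visit_count_bin(n_visits):
    """Coarse longitudinal summary used as conditioning information."""
    return "few" if n_visits <= 2 else "many"


def partial_synthesis(patients, visits, seed=0):
    """Return new patient rows with resampled quasi-identifiers.

    The visits table is only read, never modified. The resampling step is a
    placeholder for a real generative model.
    """
    rng = random.Random(seed)

    # Summarize the longitudinal table per patient.
    n_visits = defaultdict(int)
    for v in visits:
        n_visits[v["patient_id"]] += 1

    # Pool observed quasi-identifier values by longitudinal summary.
    pools = defaultdict(lambda: defaultdict(list))
    for p in patients:
        stratum = visit_count_bin(n_visits[p["patient_id"]])
        for qi in QUASI_IDENTIFIERS:
            pools[stratum][qi].append(p[qi])

    synthetic = []
    for p in patients:
        stratum = visit_count_bin(n_visits[p["patient_id"]])
        row = dict(p)  # copy, so the input rows are not mutated
        for qi in QUASI_IDENTIFIERS:
            # Draw each quasi-identifier conditional on the stratum.
            row[qi] = rng.choice(pools[stratum][qi])
        synthetic.append(row)
    return synthetic


patients = [
    {"patient_id": 1, "age_band": "30-39", "sex": "F", "site": "A"},
    {"patient_id": 2, "age_band": "40-49", "sex": "M", "site": "A"},
    {"patient_id": 3, "age_band": "60-69", "sex": "F", "site": "B"},
    {"patient_id": 4, "age_band": "70-79", "sex": "F", "site": "B"},
]
visits = [
    {"patient_id": 1, "day": 0, "lab": 5.1},
    {"patient_id": 2, "day": 0, "lab": 4.8},
] + [
    {"patient_id": pid, "day": day, "lab": 6.0}
    for pid in (3, 4)
    for day in (0, 30, 60)
]

synthetic_patients = partial_synthesis(patients, visits)
```

Because the quasi-identifiers are drawn conditional on a summary of each patient's longitudinal history, the link between the two is roughly preserved, and the synthesis problem shrinks to a single cross-sectional table. The patient_id column is kept here only so that the synthetic rows can be joined back to the unchanged visit records.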

Conclusion

PLS has been applied successfully to different kinds of health data, resulting in high-utility, low-privacy-risk data that can be safely reused. We have observed:

Synthetic data generated from PLS is highly similar to real data in terms of data structure, distributions, and analytic results;
Privacy risks associated with the synthetic data are far below an acceptable risk threshold.

If you have complex data and want synthetic data with high utility and privacy, please contact an Aetion expert today.
