PeterMar 1, 20235 min read

Why is synthetic data used?

By Khaled El Emam | Published: March, 2023 The following article was published by OneTrust DataGuidance and can be accessed on their platform via subscription. Reprinted with permission.

The first article in this series on synthetic data looked at what this type of data is and how it is generated. In this article, Dr. Khaled El Emam, SVP and General Manager of Replica Analytics Ltd, will examine some of the use cases of synthetic data in a bit more detail.

Given that different types of data can be synthesised, it is helpful to limit our scope to synthetic structured data, which is data that can be stored, for example, in a spreadsheet or a relational database. The applications of other types of synthetic data, such as images, overlap with structured data but also has important differences. Therefore, here we are concerned with structured data only.

The use cases for synthetic data can be divided up into two broad categories: data privacy and data enhancement. Both types of use cases will be summarised here since both would be relevant for privacy professionals.

Generating data

The data privacy use cases pertain to enabling more responsible data sharing and data reuse. As there is not a one-to-one mapping between the synthetic records and real people, the ability to identify synthetic records would be quite small. This means that synthetic data would be treated as non-identifiable information, making it easier to use and disclose for secondary purposes with minimal obligations. For example, consent would not be a requirement in many jurisdictions for such secondary uses.

Data can be shared on one of two modalities. The first is the more common approach of providing actual synthetic data files to the data consumers. The second is whereby the data consumers are provided access to the generative models. As the process of synthetic data generation requires the training of these generative AI models, it would make sense to set up a catalogue of generative models, each one representing a unique dataset, and letting the data consumers search for and generate datasets for themselves. For example, if a biostatistician wanted a dataset on a specific type of oncology trial, they would search the catalogue for such datasets, identify the relevant generative models, and then synthesise an arbitrarily sized dataset from the generative model.

This process of sharing generative models gives the data consumers additional advantages in that they can be selective about the cohorts they generate data for. For example, a data user can generate a dataset with 80% females and 20% males, or vice versa. Or the data consumer can define additional conditions on socio-economic status and location for their data generation. This gives them significantly more flexibility. Source: Replica Analytics Ltd

Data privacy use cases

Once data is generated, the following are the types of privacy uses that they can be applied to.

Internal secondary uses of datasets

Much can be learned from the re-analysis of existing data. However, access to internal datasets for exploratory and detailed analytics faces significant friction due to privacy requirements. Existing anonymisation techniques distort datasets too much and take too long. High-quality synthetic datasets can be reused more easily within organisations with less friction.

External data sharing with partners and collaborators

Synthetic data can help address various external data sharing and access challenges, for instance, in research and development. Organisations often want to collaborate with outside parties to solve problems or innovate, and being able to make synthetic data available can accelerate the time to value from these collaborations.

Realistic datasets for software testing

Software always needs to be updated. Every software application change must go through a verification (testing) process. Synthetic data generation produces realistic datasets with high utility for software testing and data engineering, with fewer regulatory constraints and privacy risks. Some privacy regulators have recommended using synthetic data for software testing to avoid privacy breaches. Another advantage is that a library of synthetic datasets can be made available on-demand to software teams.

Data for education, training, and hackathons

Synthetic data can be used as a training method when students or employees learn how to manage personal data. Current anonymised datasets are too distorted for effective teaching and many are inaccessible. Synthetic data generation produces readily available, realistic, reusable, and high granularity datasets to improve learning and skills development. Synthetic data can also be used in hackathons that bring groups together to address real-world problems.

Long-term data retention

Synthetic data can be applied to share, reuse, and retain datasets without compromising privacy requirements. As a result of regulations and costs associated with retaining data, records of personal information are disposed of when they are no longer needed for the original purposes. Organisations can maintain data utility by applying synthetic data techniques to real datasets, and retain the synthetic records or the generative models for an extended period of time.

Vendor technology assessments using synthetic data

New data-centric technologies developed by third parties (e.g., academics or startups) must be evaluated using realistic datasets. However, providing access can take months of extensive contracting and auditing. With synthetic data generation, readily available datasets can be used and vendor evaluations are scalable, realistic, and more efficient.

Data enhancement use cases

In addition to privacy protection, synthetic data can be applied to improve datasets. The following are some examples of such applications:

Data amplification

Amplifying small datasets. When datasets are small (e.g., in healthcare settings these would be for diseases, pediatric clinical studies, etc.) then synthetic data generation methods can amplify the datasets by adding virtual patients with the same characteristics and patterns. This can improve machine learning model accuracy and stability.

Bias correction

If a dataset is biased because of, for example, poor representation of certain individuals or groups, data synthesis can re-balance these datasets while maintaining key relationships. This way, inherent biases in the data can be mitigated. For example, if a dataset has an under-representation of a certain race, then synthetic data generation can simulate additional individuals of that missing race to de-bias the dataset.

Conclusion

Synthetic data generation can address a series of privacy use cases, as well as other data enhancement use cases. The privacy use cases all concern enabling responsible data sharing and data access of non-identifiable information for secondary purposes that we have seen in multiple sectors, including healthcare, finance, and retail.

Dr. Khaled El Emam SVP and General Manager, kelemam@replica-analytics.com, Replica Analytics Ltd, Ottawa

Peter

SOFTWARE

SERVICES

PARTNERSHIPS

Why is synthetic data used?

Generating data

Data privacy use cases

Data enhancement use cases

Conclusion

RELATED ARTICLES