Sometimes we’re asked whether individuals can be re-identified in synthetic datasets, so in this blog post we tackle that question.
Remember that synthetic data are generated by machine learning models that learn the patterns and statistical properties of real data. There's no one-to-one mapping between synthetic and real records, so the concept of re-identification, which is more often discussed in the context of de-identification, doesn't really fit here: a record generated by a machine learning model is no longer about any specific real person.
For synthetic data, there are a couple of possible privacy risks to be mindful of and assess, but re-identification isn’t really one of them.
The two privacy risks that are more relevant for synthetic data are what we call Attribution Disclosure and Membership Disclosure. Both start from the basic assumption that the real dataset is a sample from some population. I'll explain each of them briefly below.
Attribution Disclosure: Assume that an adversary has some background information about individuals in the population. For Attribution Disclosure to occur, two conditions need to be met. The first is that a synthetic record matches a real person on the attributes in the adversary's background information, for example, their gender, age and the date of a medical appointment. The second is that something new can then be learned about that person from the synthetic dataset, such as their medical diagnosis. In this scenario we find a synthetic record that matches a real record on some attributes, and we then test whether the adversary's information gain from such a match would be high.
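To make this concrete, here is a minimal sketch of an attribution-disclosure check in Python. The field names (gender, age, diagnosis) and the scoring rule (adversary guesses the most common sensitive value among matching synthetic records) are illustrative assumptions, not a standard metric.

```python
# Hypothetical attribution-disclosure sketch: match synthetic records to
# real records on quasi-identifiers, then score how often the adversary's
# best guess of the sensitive attribute is correct.
from collections import Counter

def attribution_risk(real, synthetic, quasi_ids, sensitive):
    """Fraction of matched real records whose sensitive value the
    adversary would infer correctly from the synthetic data."""
    hits = 0
    matched = 0
    for r in real:
        key = tuple(r[q] for q in quasi_ids)
        matches = [s for s in synthetic
                   if tuple(s[q] for q in quasi_ids) == key]
        if not matches:
            continue  # no synthetic record matches this person
        matched += 1
        # Adversary's best guess: most common sensitive value among matches
        guess = Counter(s[sensitive] for s in matches).most_common(1)[0][0]
        if guess == r[sensitive]:
            hits += 1
    return hits / matched if matched else 0.0

# Toy data: two real patients, two synthetic records
real = [
    {"gender": "F", "age": 34, "diagnosis": "asthma"},
    {"gender": "M", "age": 51, "diagnosis": "diabetes"},
]
synthetic = [
    {"gender": "F", "age": 34, "diagnosis": "asthma"},
    {"gender": "M", "age": 51, "diagnosis": "hypertension"},
]
print(attribution_risk(real, synthetic, ["gender", "age"], "diagnosis"))  # 0.5
```

In this toy run, one of the two matched real records would have its diagnosis correctly inferred, giving a risk of 0.5. A real assessment would also weigh how surprising the inferred value is (the information gain), not just whether the guess is correct.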
Membership Disclosure: This risk involves an adversary learning that someone was in the real dataset used to train the machine learning model. The mere knowledge of membership in the real data can reveal private attributes about the individual. For example, if the real dataset contains information about patients with a particular medical condition, then learning that a person was in that dataset tells the adversary that the person has this condition.
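One simple way an adversary might attempt this is to check whether any synthetic record is unusually close to a target individual's record, on the intuition that models can leak records they memorized during training. The sketch below illustrates that idea; the distance metric, the feature encoding, and the threshold are all illustrative assumptions rather than an established test.

```python
# Hypothetical membership-inference sketch: guess that a target was in the
# training data when a synthetic record lies unusually close to it.

def min_distance(target, synthetic):
    """Smallest Euclidean distance from the target to any synthetic record."""
    return min(
        sum((t - s) ** 2 for t, s in zip(target, rec)) ** 0.5
        for rec in synthetic
    )

def infer_membership(target, synthetic, threshold):
    """Adversary's guess: 'member' if some synthetic record is very close."""
    return min_distance(target, synthetic) < threshold

# Toy synthetic records encoded as (age, blood pressure); the threshold
# of 2.0 is arbitrary for illustration.
synthetic = [(34.0, 120.0), (51.0, 140.0)]
print(infer_membership((34.0, 121.0), synthetic, threshold=2.0))  # True
print(infer_membership((70.0, 180.0), synthetic, threshold=2.0))  # False
```

A proper membership-disclosure assessment would calibrate this kind of attack against a holdout set of individuals known not to be in the training data, so the measured risk reflects how much better than chance the adversary can do.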
The good news is that specific measures have been developed to quantify both of these risks, so we can ascertain the risk level and confirm that it is acceptably low.
If you’re interested in learning more about this topic, sign up to attend our free webinar, Managing & Regulating Privacy Risks from Synthetic Data, on March 30, 2022 at 11 am. The session is focused on the privacy assurance use case for synthetic data generation and we’ll be examining technical measures for managing risks, as well as regulatory perspectives as they stand today.