By Khaled El Emam | Published: April, 2023
The following article was published by OneTrust DataGuidance and can be accessed on their platform via subscription. Reprinted with permission.
The first article in this series looked at what synthetic data is and how it is generated, and the second article examined the use cases of synthetic data. In this article, Dr. Khaled El Emam, SVP and General Manager of Replica Analytics Ltd., looks at a number of critical success factors (‘CSFs’) for the implementation of synthetic data generation (‘SDG’) in an enterprise. Some of these CSFs apply to any artificial intelligence (‘AI’) technology implementation, but they are still worth emphasising, and some are particular to, or amplified in, the context of SDG implementation projects. The article closes with an actual case study that illustrates the implementation of SDG for a highly sensitive dataset on opioid users.
CSF 1: Identify a specific business problem
SDG is sometimes assumed to solve many types of problems, some of which go beyond what the technology is capable of solving. We often see organisations make a list of their many challenges with the hope or expectation that SDG can address all of them. As with any technology, SDG will have the biggest impact on a subset of an organisation’s problems, and there may be technical and regulatory limitations. Therefore, the business problem being addressed must fit the current state of SDG technology so that adopting it can deliver meaningful results. Also, because SDG and the whole area of generative AI are new, there is a lot of curiosity about them, and organisations engage in projects to learn about SDG even if these projects will not necessarily generate business value. Such projects are not sustainable and are good candidates for cancellation when resources are limited. Therefore, there must be an identified business problem that is causing sufficient pain to make it worth investing in a solution, which means that there is nontrivial business value in solving that problem. Furthermore, the technology must be capable of solving the problem; otherwise the project ends up being a research and development effort that can take longer to produce usable solutions than the business is willing to wait.
CSF 2: Transition SDG knowledge to the data custodians
When solving a privacy problem, SDG is applied to a source dataset. This source dataset is most often a pseudonymised dataset, meaning a dataset with the direct identifiers removed and/or encrypted. Pseudonymised datasets are still personal information, so the team working on SDG must have both the authority to access personal information and the actual access to it. There are two general approaches to SDG technology deployment: have the data custodian perform the SDG on the source dataset, or send the source dataset to a third party to perform the SDG on their behalf. Because of these data access requirements, it is easier and more expeditious to transition the technology and expertise to the data custodians, who already have authority to access the personal information. They then do not need to go through the more complex process of disclosing personal information to a third party. Avoiding a disclosure means that the advantages of SDG are more evident to the data custodian and data users. It also ensures that the implementation of SDG does not get caught in legacy processes for the disclosure of personal information (which are often inefficient).
CSF 3: Establish continuous training
As SDG is a form of AI technology, and the individuals in the organisation working in AI are in high demand, we have seen in some cases high churn in those skills. This means that there is a need for continuous training on SDG and the problems that SDG can solve, as well as on the specific technology that is being implemented. This is important to emphasise because budget needs to be allocated for continuous training. More generally, ongoing training is always a good refresher for the team working on the SDG projects, reinforcing concepts as the team gains more experience with their application.
CSF 4: Engage with the privacy team early
While this is obvious to a privacy audience, it is important to engage with and involve the privacy team early in the SDG deployment effort. This ensures that they are aware of the new privacy enhancing technology that is being introduced into the organisation, can ensure proper integration and alignment with current policies and procedures (or adapt them accordingly), and are able to support the business in treating the synthetic data as non-personal information.
CSF 5: Ensure adequate capacity planning for computational resources
SDG, whether it is based on statistical machine learning techniques or deep learning techniques, is computationally demanding. Different technical solutions will vary in their computational footprint, and this affects the total cost of ownership. Therefore, it is necessary to have an early understanding of the datasets that will be synthesised (e.g., the dataset size and relational data model) and the expected turn-around times for SDG jobs, so that sufficient computational resources are made available for the project to succeed. For example, if a data analytics team expects 24-hour turn-around times for synthetic datasets but the computational resources are not able to meet that timeline, then the end-users will not be satisfied.
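The turn-around reasoning above can be turned into a rough back-of-the-envelope check. The following Python sketch is illustrative only: the `SdgJob` type, the `workers_needed` function, and the per-job hour estimates are hypothetical, and the result is a lower bound that ignores scheduling overhead and how jobs actually pack onto machines.

```python
import math
from dataclasses import dataclass


@dataclass
class SdgJob:
    """A planned synthesis job (hypothetical structure for illustration)."""
    name: str
    est_hours: float  # assumed wall-clock hours on a single worker


def workers_needed(jobs, turnaround_hours):
    """Lower bound on parallel workers needed to finish all jobs in the window.

    Raises ValueError if any single job already exceeds the window, since
    adding workers cannot help in that case (assuming one job runs on one worker).
    """
    if not jobs:
        return 0
    longest = max(jobs, key=lambda job: job.est_hours)
    if longest.est_hours > turnaround_hours:
        raise ValueError(
            f"job {longest.name!r} needs {longest.est_hours}h, "
            f"which exceeds the {turnaround_hours}h turn-around target"
        )
    total_hours = sum(job.est_hours for job in jobs)
    return math.ceil(total_hours / turnaround_hours)


# Example with made-up estimates and a 24-hour turn-around expectation.
planned = [SdgJob("claims", 10.0), SdgJob("labs", 20.0), SdgJob("visits", 6.0)]
print(workers_needed(planned, turnaround_hours=24))
```

An estimate like this is most useful early, when agreeing turn-around expectations with the analytics team; real job durations would need to be measured on representative datasets.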
CSF 6: Do not start with the most difficult business problems
We sometimes see organisations starting to apply SDG to their hardest business problems. These may be hard because of the complexity of business workflows, the existence of adversarial relationships, highly complex datasets, and/or complex contractual arrangements. The thinking is to test the ability of SDG to operate on particularly challenging problems. Unless this particular context is typical, focusing on the most complex scenarios as the starting point for the implementation of SDG can result in a highly confounded experience that is not going to carry over to other parts of the organisation. It is better to start with a typical project of moderate complexity to develop the integrations and procedures around the technology, and to gain learnings that are more broadly applicable when deploying SDG across other parts of the organisation.
CSF 7: Source data that is fit-for-purpose
As SDG starts off with the real (source) data, it is necessary to ensure that the source dataset is fit for purpose. If the original dataset has quality problems that would make it unsuitable for addressing the business problem, then SDG is not likely to solve that problem. For example, if the objective is to train an AI model to support a resource allocation decision, but the original data has many data collection errors, then the generated synthetic data would likely not be suitable for that type of AI model.
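As one illustration of a fit-for-purpose check on the source data before synthesis, the sketch below flags fields with a high proportion of missing values. It is a minimal example under stated assumptions: the function names, the record layout (a list of dictionaries), and the 20% default threshold are hypothetical, and missingness is only one of many quality dimensions that would matter in practice.

```python
def missing_rates(rows, fields):
    """Fraction of records in which each field is missing (None or empty string)."""
    if not rows:
        raise ValueError("no records to assess")
    total = len(rows)
    return {
        field: sum(1 for row in rows if row.get(field) in (None, "")) / total
        for field in fields
    }


def fields_failing_quality(rows, fields, max_missing=0.2):
    """Names of fields whose missing rate exceeds the (assumed) threshold."""
    rates = missing_rates(rows, fields)
    return sorted(field for field, rate in rates.items() if rate > max_missing)


# Example with made-up records.
records = [
    {"age": 40, "dose": None},
    {"age": None, "dose": None},
    {"age": 55, "dose": 10},
    {"age": 61, "dose": ""},
]
print(fields_failing_quality(records, ["age", "dose"], max_missing=0.5))
```

Fields flagged this way would typically be reviewed with the data custodian before synthesis, since the synthetic data would be expected to reproduce the same gaps.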
CSF 8: Source data restrictions
When the source dataset, or parts of the source dataset, do not belong to the organisation, it is necessary to ensure that SDG is a permitted use, and that there are no further restrictions in contracts on how the synthetic dataset can be used or disclosed. Specifically, one can argue that SDG is a form of de-identification, and if there are contractual limitations on de-identification, either explicitly or by omission, then these may apply to SDG as well. Furthermore, if there are contractual constraints on the disclosure of derived datasets, then these would need to be examined to determine their applicability to the generated synthetic dataset. We have seen disputes arise over whether synthetic data was different enough from the source data that contractual limitations on uses of the source data would no longer apply.
Case Study: Sharing opioid use data
This example of the application of SDG took place in Alberta [1]. The context was to provide a provincial health system dataset on opioid users to academic researchers. Historically it has been a challenge to share high quality health data with academic researchers, with the process taking a long time and requiring many approvals from multiple stakeholders. The creation of a synthetic dataset with its high privacy protective characteristics was seen as a mechanism that would accelerate data access. In addition to enabling the academics to accelerate their innovation, better data access was seen as a positive for the economy of the province by attracting more research, more pharmaceutical companies that may be interested in the access advantages in the province, and more research funding.
The involvement of the Office of the Information and Privacy Commissioner of Alberta (‘OIPC’) was a priority from the start, and the OIPC was informed of the project and consulted along the way. In this manner, the OIPC was aware of the technology, its strengths and weaknesses, and how it was going to be applied in this particular project.
One of the key requirements to convince the data analysts to use the generated synthetic data was to demonstrate its utility. In this particular case it meant that the same complex analysis that was performed on the original real dataset was performed on the synthetic data, and the results and conclusions were compared. The analysis consisted of training multiple survival models to predict mortality based on opioid use, and to predict specific diagnoses such as pneumonia. Overall, the multivariate models from the real and synthetic datasets were similar, with the same conclusions drawn. This was a critical component in convincing the academic analysts that the synthetic dataset would be a good proxy for the real dataset.
All CSFs were satisfied in this implementation project, which ensured commitment from all stakeholders. More importantly, it ensured that the project was sustained until completion.
Dr. Khaled El Emam
SVP and General Manager
kelemam@replica-analytics.com
Replica Analytics Ltd, Ottawa
1. See: https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-023-01869-w