Aetion® Generate FAQ

Written by Admin | Feb 15, 2023

What is Generate?

Aetion® Generate, formerly Replica Synthesis, is Aetion’s scalable enterprise software that allows users to create synthetic data. The main features of the product are as follows:

Synthetic Data Generation (SDG): Generate's main functionality allows users to create synthetic datasets from source datasets.
SDG Workflows: The workflow designer lets users define complex pipelines, including joins, pooling, and cohort definitions, and provides a powerful scripting capability for pre- and post-processing datasets.
Privacy Assurance: The privacy risks in synthetic data can be evaluated using our unified privacy assurance model.
APIs and SDKs: Generate’s engine can be accessed programmatically through a REST API, an R package, and a Python library. This allows easy integration into analytics pipelines.

The overall architecture of the product is shown below. The software allows the computations to scale in a cluster to accommodate larger and more complex datasets.

Can Generate be installed on-prem?

Yes, it is possible to run Generate’s software on-prem.

Some healthcare and life sciences organizations have not moved their computing workloads to multi-tenant cloud services, although we are seeing more and more workloads moving to the cloud. Part of the reason is a hesitation to have sensitive data reside in an environment that is not under their control.

An on-prem installation, for the purposes of this response, is an installation on hardware that is operated by the data custodian/data controller. This can be on actual servers or in a virtual private cloud. When Generate is installed on-prem, the SDG computations are all within the organization’s direct control, and no sensitive data needs to leave the organizational boundary.

To support that, Aetion provides documentation and support for:

Air-gapped computing environments, or environments where no external communication is permitted, in which additional steps are needed to activate licenses through the license server and to access the on-line help.
Defining the appropriate hardware/virtual machine and system requirements for common deployments.
Installing the software, which uses containers.

How much of the SDG process is automated in Generate?

This question has multiple layers and it is best to parse them out and address them separately.

Is manual intervention needed to synthesize data?

We have worked very hard to maximize the automation in Generate. The software automatically discovers the characteristics of the data and shapes the data to make it ready for synthesis, then reverses that shaping at the end of the process.

The user, of course, has to load data or connect to data sources. If any cohorts need to be defined, then that is also a necessary task. However, in many situations, the synthesis process itself is automated, including all of the necessary hyperparameter tuning needed for training the generative models. This applies when using the GUI or when using the different APIs (R or Python) to perform SDG.

However, advanced users also have the option to tweak this automated process. While the automated pre-processing works very well, there may be cases where some adjustments are needed. Generate provides this capability, but we hope you never have to use it.

How much knowledge about SDG is needed to use Generate?

By design, very little knowledge about SDG is required to use Generate. Of course, the user needs to know the data and how to access their data sources. The user will need to understand the data domain to be able to define meaningful cohorts. But training on SDG is not necessary as that complexity is hidden from the user in Generate.

The on-line help is also a great resource for using the software.

Can Generate be included in automated data provisioning pipelines?

We have clients who have done exactly that. Because of the high level of automation, Generate can be inserted in data pipelines to convert the original datasets into synthetic variants. This can be done by training a generative model for every data cut that comes through the pipeline. For example, when a dataset request is approved, the original dataset can be sent to Generate, and the resultant synthetic dataset is then forwarded to the analyst to work on.
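As a rough illustration of the approve, synthesize, and forward flow described above, here is a minimal Python sketch. It does not use Generate's actual API: `DataRequest`, `provision_synthetic`, and `toy_synthesize` are hypothetical names invented for this example, and the synthesis step is passed in as a callable so that a real call into Generate's REST API, R package, or Python library could be substituted. The toy stand-in only shuffles each column independently; it is not a generative model and offers no privacy guarantees.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]
Dataset = List[Row]


@dataclass
class DataRequest:
    """Hypothetical record of a dataset request moving through a pipeline."""
    request_id: str
    approved: bool
    dataset: Dataset


def provision_synthetic(
    request: DataRequest,
    synthesize: Callable[[Dataset], Dataset],
    deliver: Callable[[str, Dataset], None],
) -> bool:
    """Forward a synthetic variant of an approved request's dataset.

    `synthesize` is a placeholder for whatever performs SDG (an assumption:
    in practice this would wrap a call to the SDG engine). Only the synthetic
    output is passed to `deliver`; the original never reaches the analyst.
    """
    if not request.approved:
        return False
    synthetic = synthesize(request.dataset)
    deliver(request.request_id, synthetic)
    return True


def toy_synthesize(dataset: Dataset, seed: int = 0) -> Dataset:
    """Toy stand-in for SDG: shuffles each column independently.

    NOT a real synthesizer and NOT privacy-preserving; it exists only so
    the pipeline sketch can run end to end.
    """
    if not dataset:
        return []
    rng = random.Random(seed)
    columns = list(dataset[0].keys())
    shuffled = {}
    for col in columns:
        values = [row[col] for row in dataset]
        rng.shuffle(values)
        shuffled[col] = values
    return [{col: shuffled[col][i] for col in columns} for i in range(len(dataset))]


if __name__ == "__main__":
    outbox: Dict[str, Dataset] = {}
    request = DataRequest(
        request_id="req-001",
        approved=True,
        dataset=[{"age": 34, "sex": "F"}, {"age": 51, "sex": "M"}, {"age": 47, "sex": "F"}],
    )
    sent = provision_synthetic(request, toy_synthesize, outbox.__setitem__)
    print(sent, len(outbox.get("req-001", [])))
```

Keeping the synthesis step behind a callable means the surrounding pipeline logic (approval check, delivery) stays the same whichever interface to the SDG engine is used.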
