Generating High-Fidelity and Privacy-Preserving Synthetic Electronic Health Records – Google AI Blog



Analysis of Electronic Health Records (EHR) has tremendous potential for enhancing patient care, quantitatively measuring the performance of clinical practices, and facilitating clinical research. Statistical estimation and machine learning (ML) models trained on EHR data can be used to predict the probability of various diseases (such as diabetes), track patient wellness, and predict how patients respond to specific drugs. For such models, researchers and practitioners need access to EHR data. However, it can be challenging to leverage EHR data while ensuring data privacy and conforming to patient confidentiality regulations (such as HIPAA).

Conventional methods to anonymize data (e.g., de-identification) are often tedious and costly. Moreover, they can distort important features of the original dataset, decreasing the utility of the data significantly; they can also be susceptible to privacy attacks. Alternatively, an approach based on generating synthetic data can maintain both important dataset features and privacy.

To that end, we propose a novel generative modeling framework in “EHR-Safe: Generating High-Fidelity and Privacy-Preserving Synthetic Electronic Health Records”. With the methodology in EHR-Safe, we show that synthetic data can satisfy two key properties: (i) high fidelity (i.e., they are useful for the task of interest, such as yielding similar downstream performance when a diagnostic model is trained on them), and (ii) certain privacy measures (i.e., they do not reveal any real patient’s identity). Our state-of-the-art results stem from novel approaches for encoding/decoding features, normalizing complex distributions, conditioning adversarial training, and representing missing data.

Generating synthetic data from the original data with EHR-Safe.

Challenges of Generating Realistic Synthetic EHR Data

There are multiple fundamental challenges to generating synthetic EHR data. EHR data contain heterogeneous features with different characteristics and distributions. There can be numerical features (e.g., blood pressure) and categorical features with many or two categories (e.g., medical codes, mortality outcome). Some of these may be static (i.e., not varying during the modeling window), while others are time-varying, such as regular or sporadic lab measurements. Distributions can also come from different families: categorical distributions can be highly non-uniform (e.g., for under-represented groups), and numerical distributions can be highly skewed (e.g., a small proportion of values being very large while the vast majority are small). Depending on a patient’s condition, the number of visits can also vary drastically. Some patients visit a clinic only once while others visit hundreds of times, leading to a variance in sequence lengths that is typically much higher than in other time-series data. There can also be a high ratio of missing features across different patients and time steps, as not all lab measurements or other input data are collected.

Examples of real EHR data: temporal numerical features (upper) and temporal categorical features (lower).

EHR-Safe: Synthetic EHR Data Generation Framework

EHR-Safe consists of a sequential encoder-decoder architecture and generative adversarial networks (GANs), depicted in the figure below. Because EHR data are heterogeneous (as described above), direct modeling of raw EHR data is challenging for GANs. To circumvent this, we propose using a sequential encoder-decoder architecture to learn the mapping from the raw EHR data to latent representations, and vice versa.

Block diagram of the EHR-Safe framework.

While learning the mapping, esoteric distributions of numerical and categorical features pose a great challenge. For example, some values or numerical ranges might dominate the distribution, but the capability of modeling rare cases is essential. The proposed feature mapping and stochastic normalization (transforming original feature distributions into uniform distributions without information loss) are key to handling such data: they convert the features to distributions for which the training of the encoder-decoder and the GAN is more stable (details can be found in the paper). The mapped latent representations, generated by the encoder, are then used for GAN training. After training both the encoder-decoder framework and the GANs, EHR-Safe can generate synthetic heterogeneous EHR data from any input, for which we feed randomly sampled vectors. Note that only the trained generator and decoders are used for generating synthetic data.
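The stochastic normalization described above can be illustrated with a minimal sketch. It assumes a simple scheme in which each unique value of a feature is assigned a sub-interval of [0, 1) whose width equals the value's empirical frequency, and each occurrence is replaced by a uniform draw from that sub-interval. The function names (`fit_intervals`, `normalize`, `denormalize`) are hypothetical and this is not the EHR-Safe implementation; the paper gives the actual procedure.

```python
import bisect
import random
from collections import Counter


def fit_intervals(values):
    """Assign each unique value a sub-interval of [0, 1) sized by its frequency."""
    counts = Counter(values)
    total = len(values)
    intervals = {}
    lower, cumulative = 0.0, 0
    for value in sorted(counts):
        cumulative += counts[value]
        upper = cumulative / total
        intervals[value] = (lower, upper)
        lower = upper
    return intervals


def normalize(values, intervals, rng):
    """Replace each value by a uniform draw from its own sub-interval."""
    out = []
    for value in values:
        lower, upper = intervals[value]
        out.append(lower + (upper - lower) * rng.random())
    return out


def denormalize(uniforms, intervals):
    """Invert the mapping: find the sub-interval that contains each draw."""
    keys = sorted(intervals)
    uppers = [intervals[k][1] for k in keys]
    out = []
    for u in uniforms:
        idx = min(bisect.bisect_right(uppers, u), len(keys) - 1)
        out.append(keys[idx])
    return out


# A highly skewed toy feature: mostly zeros with a few rare large values.
feature = [0] * 8 + [3, 250]
intervals = fit_intervals(feature)
uniform_feature = normalize(feature, intervals, random.Random(0))
recovered = denormalize(uniform_feature, intervals)
```

In this toy example the dominant value 0 spreads over 80% of the unit interval instead of forming a single spike, while the rare values keep their own sub-intervals, so the original values can be recovered from the normalized ones.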

Datasets

We focus on two real-world EHR datasets to showcase the EHR-Safe framework: MIMIC-III and eICU. Both are inpatient datasets that consist of sequences of varying lengths and include multiple numerical and categorical features with missing components.

Fidelity Results

The fidelity metrics focus on the quality of synthetically generated data by measuring how realistic the synthetic data are. Higher fidelity implies that it is more difficult to differentiate between synthetic and real data. We evaluate the fidelity of synthetic data through multiple quantitative and qualitative analyses.

Visualization

Having similar coverage and avoiding under-representation of certain data regimes are both important for synthetic data generation. As the t-SNE analyses below show, the coverage of the synthetic data (blue) is very similar to that of the original data (pink). With membership inference metrics (introduced in the privacy section), we also verify that EHR-Safe does not just memorize the original training data.

t-SNE analyses on temporal and static data on MIMIC-III (upper) and eICU (lower) datasets.

Statistical Similarity

We provide quantitative comparisons of statistical similarity between original and synthetic data for each feature. Most statistics are well aligned between original and synthetic data. For example, the KS statistics, i.e., the maximum difference in the cumulative distribution function (CDF) between the original and the synthetic data, are mostly lower than 0.03. More detailed tables can be found in the paper. The figure below shows example CDF graphs for original vs. synthetic data for selected features; overall they are very close in most cases.
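To make the KS statistic mentioned above concrete, here is a minimal sketch that computes the maximum gap between two empirical CDFs for a single feature. The function name `ks_statistic` and the toy values are assumptions for illustration only, not code or data from the paper.

```python
import bisect


def ks_statistic(original, synthetic):
    """Maximum absolute difference between two empirical CDFs.

    Both empirical CDFs are step functions that only change at observed
    values, so it is enough to evaluate the gap at every observed value.
    """
    a = sorted(original)
    b = sorted(synthetic)
    max_gap = 0.0
    for x in set(a) | set(b):
        cdf_a = bisect.bisect_right(a, x) / len(a)  # fraction of a <= x
        cdf_b = bisect.bisect_right(b, x) / len(b)  # fraction of b <= x
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Hypothetical values of one numerical feature.
original_feature = [1, 2, 3, 4]
synthetic_feature = [1, 2, 3, 5]
gap = ks_statistic(original_feature, synthetic_feature)
```

A value of 0 means the two empirical distributions coincide, and a value of 1 means they do not overlap at all, so values below 0.03 indicate closely matching per-feature distributions.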

CDF graphs of two features between original and synthetic EHR data. Left: Mean Airway Pressure. Right: Minute Volume Alarm.

Utility

Because one of the most important use cases of synthetic data is enabling ML innovations, we focus on the fidelity metric that measures the ability of models trained on synthetic data to make accurate predictions on real data. We compare such model performance to that of an equivalent model trained on real data. Similar model performance would indicate that the synthetic data capture the relevant informative content for the task. As one of the important potential use cases of EHR, we focus on the mortality prediction task. We consider four different predictive models: Gradient Boosting Tree Ensemble (GBDT), Random Forest (RF), Logistic Regression (LR), and Gated Recurrent Units (GRU).

Mortality prediction performance with the model trained on real vs. synthetic data. Left: MIMIC-III. Right: eICU.

In the figure above we see that in most scenarios, models trained on synthetic data and models trained on real data perform very similarly in terms of Area Under the Receiver Operating Characteristic Curve (AUC). On MIMIC-III, the best model (GBDT) trained on synthetic data is only 2.6% worse than the best model trained on real data, whereas on eICU, the best model (RF) trained on synthetic data is only 0.9% worse.
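As a rough illustration of this comparison, the sketch below computes AUC on the same real test labels for two sets of predicted scores, one from a model trained on real data and one from a model trained on synthetic data. The labels and scores are made-up numbers, the `auc` helper is a hypothetical name, and the model training itself is omitted.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical real test labels (1 = mortality) and predicted risks.
real_labels = [0, 0, 1, 1, 0, 1]
scores_trained_on_real = [0.1, 0.3, 0.8, 0.7, 0.2, 0.9]
scores_trained_on_synthetic = [0.2, 0.4, 0.7, 0.35, 0.1, 0.9]

auc_real = auc(real_labels, scores_trained_on_real)
auc_synthetic = auc(real_labels, scores_trained_on_synthetic)
print(f"AUC, trained on real data:      {auc_real:.3f}")
print(f"AUC, trained on synthetic data: {auc_synthetic:.3f}")
```

Both models are scored against the same real labels, so any difference in AUC reflects what the synthetic training data failed to capture for this task.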

Privacy Results

We consider three different privacy attacks to quantify the robustness of the synthetic data with respect to privacy.

  • Membership inference attack: An adversary predicts whether a known subject was present in the training data used for training the synthetic data model.
  • Re-identification attack: The adversary explores the probability of some features being re-identified using synthetic data and matching to the training data.
  • Attribute inference attack: The adversary predicts the value of sensitive features using synthetic data.

Privacy risk evaluation across three privacy metrics: membership inference (top-left), re-identification (top-right), and attribute inference (bottom). The ideal value of privacy risk for membership inference is random guessing (0.5). For re-identification, the ideal case is to replace the synthetic data with disjoint holdout original data.

The figure above summarizes the results along with the ideal achievable value for each metric. We observe that the privacy metrics are very close to the ideal in all cases. The risk of determining whether a sample of the original data was used for training the model is very close to random guessing, which also verifies that EHR-Safe does not just memorize the original training data. For the attribute inference attack, we focus on the prediction task of inferring specific attributes (e.g., gender, religion, and marital status) from other attributes. We compare prediction accuracy when training a classifier with real data against the same classifier trained with synthetic data. Because the EHR-Safe bars are all lower, the results demonstrate that access to synthetic data does not lead to higher prediction performance on specific features compared to access to the original data.
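One simple way to picture a membership inference attack is a nearest-neighbor test: records that lie unusually close to some synthetic record are guessed to be training members. The sketch below is an illustrative assumption, not the attack evaluated in the paper; the function names, the median threshold, and the toy records are all hypothetical.

```python
import math


def nearest_distance(record, synthetic_records):
    """Euclidean distance from a record to its closest synthetic record."""
    return min(math.dist(record, s) for s in synthetic_records)


def membership_attack_accuracy(train_records, holdout_records, synthetic_records):
    """Guess 'member' for records closer than the median distance; return accuracy."""
    candidates = [(r, True) for r in train_records]
    candidates += [(r, False) for r in holdout_records]
    distances = [nearest_distance(r, synthetic_records) for r, _ in candidates]
    threshold = sorted(distances)[len(distances) // 2]
    correct = sum(
        (d < threshold) == is_member
        for d, (_, is_member) in zip(distances, candidates)
    )
    return correct / len(candidates)


# Worst case for privacy: a generator that simply copies its training data.
train = [(0.0, 0.0), (1.0, 1.0)]
holdout = [(5.0, 5.0), (6.0, 6.0)]
memorized_synthetic = list(train)
leak_accuracy = membership_attack_accuracy(train, holdout, memorized_synthetic)
```

With equal numbers of training and holdout records, an accuracy near 0.5 matches random guessing, the ideal value cited above, while a generator that memorizes its training data pushes this kind of attack toward perfect accuracy.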

Comparison to Alternative Methods

We compare EHR-Safe to alternatives (TimeGAN, RC-GAN, C-RNN-GAN) proposed for time-series synthetic data generation. As shown below, EHR-Safe significantly outperforms each.

Downstream task performance (AUC) in comparison to alternatives.

Conclusions

We propose a novel generative modeling framework, EHR-Safe, that can generate highly realistic synthetic EHR data that are robust to privacy attacks. EHR-Safe is based on generative adversarial networks applied to the encoded raw data. We introduce multiple innovations in the architecture and training mechanisms that are motivated by the key challenges of EHR data. These innovations are key to our results, which show almost-identical properties to real data (when desired downstream capabilities are considered) with almost-ideal privacy preservation. An important future direction is generative modeling capability for multimodal data, including text and images, as modern EHR data might contain both.

Acknowledgements

We gratefully acknowledge the contributions of Michel Mizrahi, Nahid Farhady Ghalaty, Thomas Jarvinen, Ashwin S. Ravi, Peter Brune, Fanyu Kong, Dave Anderson, George Lee, Arie Meir, Farhana Bandukwala, Elli Kanal, and Tomas Pfister.
