Using generative, differentially-private models to build privacy-enhancing, synthetic datasets from real data.

TL;DR

Introduction

Anonymizing data

Diving into the ride share dataset

Example GBFS-formatted bike share record from Los Angeles

The privacy challenge

Build a generative model

Design the model

Build the RNN model

Train the model

Train for 15 epochs
Epoch 1/15
15/15 [==============================] - 55s 4s/step - loss: 3.1980
Epoch 2/15
15/15 [==============================] - 53s 4s/step - loss: 2.7035
Epoch 3/15
15/15 [==============================] - 54s 4s/step - loss: 2.2097
Epoch 4/15
15/15 [==============================] - 54s 4s/step - loss: 1.6440
Epoch 5/15
15/15 [==============================] - 54s 4s/step - loss: 1.3700
Epoch 6/15
15/15 [==============================] - 54s 4s/step - loss: 1.2817
Epoch 7/15
15/15 [==============================] - 54s 4s/step - loss: 1.2333
Epoch 8/15
15/15 [==============================] - 54s 4s/step - loss: 1.2052
Epoch 9/15
15/15 [==============================] - 54s 4s/step - loss: 1.1777
Epoch 10/15
15/15 [==============================] - 54s 4s/step - loss: 1.1635
Epoch 11/15
15/15 [==============================] - 54s 4s/step - loss: 1.1498
Epoch 12/15
15/15 [==============================] - 54s 4s/step - loss: 1.1359
Epoch 13/15
15/15 [==============================] - 54s 4s/step - loss: 1.1272
Epoch 14/15
15/15 [==============================] - 54s 4s/step - loss: 1.1208
Epoch 15/15
15/15 [==============================] - 54s 4s/step - loss: 1.1149

Generate synthetic records

  • The generation code uses a categorical distribution to sample the next character from the model's predictions.
  • The newly predicted character is appended to the existing text string and fed back into the model on the next iteration.
  • Experiment with the temperature setting to change how random the model's output is. Temperature rescales the categorical distribution before the next character is selected: lower values make the output more predictable, higher values make it more varied. Reasonable values to try are 0.1 to 1.0.
  • To improve the model, try training on more data or increasing the number of epochs to 30.
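The sampling loop described in the bullets above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the article's TensorFlow code: `sample_with_temperature`, `generate`, and the `predict_logits` callable are hypothetical names, and `predict_logits` stands in for whatever trained model returns one unnormalized score per character in the vocabulary.

```python
import math
import random


def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample an index from a categorical distribution over temperature-scaled logits."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Dividing by the temperature sharpens (<1.0) or flattens (>1.0) the distribution.
    scaled = [x / temperature for x in logits]
    # Subtract the max before exponentiating for numerical stability (softmax).
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate(predict_logits, vocab, seed_text, num_chars, temperature=1.0, rng=random):
    """Grow seed_text one character at a time, feeding the text back in each step.

    predict_logits is a hypothetical callable: text -> list of scores, one per
    character in vocab. A real model would be wrapped to fit this shape.
    """
    text = seed_text
    for _ in range(num_chars):
        logits = predict_logits(text)
        idx = sample_with_temperature(logits, temperature, rng)
        text += vocab[idx]
    return text
```

With a temperature close to 0.1 the sampler almost always picks the highest-scoring character, while a temperature of 1.0 samples from the model's distribution unchanged.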

Examining Synthetic Data

hour, bike_id, src_lat, src_lon, dst_lat, dst_lon
0, ISO583, 33.979251, -118.432186, 33.976328, -118.4362
17, 14606, 34.010976, -118.495253, 34.018563, -118.494045
16, YLB877, 34.03097, -118.48204, 34.036536, -118.492773
0, WLU532, 34.0573, -118.30018, 34.052703, -118.300841
23, QNA776, 34.071466, -118.308093, 34.06929, -118.308638
9, FPQ307, 34.097783, -118.328214, 34.093128, -118.32654
0, TPE726, 34.09905, -118.344236, 34.098878, -118.344676
17, 29816, 34.006241, -118.436058, 34.030886, -118.44072
3, AMU276, 33.97965, -118.462685, 33.990011, -118.461383
23, XDF543, 34.07141, -118.29108, 34.07227, -118.298716
0, 18765, 33.992941, -118.472911, 33.991931, -118.47385
23, TTD075, 34.078546, -118.29179, 34.073496, -118.295458
0, 26728, 34.050535, -118.474988, 34.062818, -118.459506
22, 32521, 34.03313, -118.46946, 33.988963, -118.449293
23, 14437, 34.00505, -118.485803, 34.036113, -118.465366
23, AYP357, 34.077973, -118.290186, 34.069976, -118.305033
18, CVO250, 34.05929, -118.301266, 34.053966, -118.290508
1, 29137, 34.012515, -118.497033, 34.015683, -118.497155
23, DNX577, 34.062081, -118.265396, 34.059896, -118.27588
23, 29830, 33.97656, -118.390121, 33.975181, -118.3941
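Because a character-level model emits raw text, it can help to check that each generated line parses into a well-formed record before using it. The following is a minimal sketch under stated assumptions, not part of the original pipeline: `parse_records` and `haversine_km` are hypothetical helper names, the validity ranges are the generic ones for hours, latitude, and longitude, and the two sample rows are copied from the table above.

```python
import csv
import io
import math

SAMPLE = """hour, bike_id, src_lat, src_lon, dst_lat, dst_lon
0, ISO583, 33.979251, -118.432186, 33.976328, -118.4362
17, 14606, 34.010976, -118.495253, 34.018563, -118.494045
"""

COORD_FIELDS = ("src_lat", "src_lon", "dst_lat", "dst_lon")


def parse_records(text):
    """Parse generated CSV text, silently dropping rows that are malformed."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    records = []
    for row in reader:
        try:
            rec = {"hour": int(row["hour"]), "bike_id": row["bike_id"]}
            for field in COORD_FIELDS:
                rec[field] = float(row[field])
        except (KeyError, TypeError, ValueError):
            continue  # missing column or non-numeric value
        lats_ok = all(-90.0 <= rec[f] <= 90.0 for f in ("src_lat", "dst_lat"))
        lons_ok = all(-180.0 <= rec[f] <= 180.0 for f in ("src_lon", "dst_lon"))
        if 0 <= rec["hour"] <= 23 and rec["bike_id"] and lats_ok and lons_ok:
            records.append(rec)
    return records


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    radius_km = 6371.0088  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


if __name__ == "__main__":
    for rec in parse_records(SAMPLE):
        dist = haversine_km(rec["src_lat"], rec["src_lon"], rec["dst_lat"], rec["dst_lon"])
        print(rec["hour"], rec["bike_id"], round(dist, 2), "km")
```

Trip distance is one simple plausibility check: a synthetic trip that spans an implausible distance suggests the model has not yet learned the structure of the coordinates.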

Why Differential Privacy

Applying Differential Privacy to our model

Visualizing the Synthetic Datasets

Quantifying the Privacy of Synthetic Data

Thoughts

Creative engineers and data scientists working to make data safe and useful. https://gretel.ai

Alexander Watson

Co-Founder at Gretel.ai, previously GM at AWS. Love artificial intelligence and security. @alexwatson405