Differential Privacy in Federated Models

Federated learning

Brooke Joseph
7 min read · Mar 6, 2023

Introduced by Google in 2017, Federated Learning (FL) is a way to train a model without compromising the privacy of a client's potentially sensitive data. Pretty cool!

This is done by having clients train models locally on their devices and send their updates to a central server. The model lives directly on each client's device. The central server then aggregates the updates to create a new global model, which is sent back to the clients. This process is repeated until the model converges. Federated Learning essentially makes it possible to train AI models without the raw data ever leaving a client's device. This is a very basic introduction to Federated Learning, and I go a little more in-depth in the video I linked above!

What is FedAvg?

I briefly mentioned above the idea of all the updated parameters being aggregated. This can be, and most commonly is, done using an algorithm called FedAvg. FedAvg takes a weighted average of the updated parameters from clients at the server. The algorithm works by having the server randomly select a subset of clients and send the current parameters to each of them. The clients receive the parameters, run a number of steps of small-batch SGD locally, and return the newly updated parameters to the server. The server then aggregates them by taking a weighted average, where the weight of each client's parameters is proportional to the number of local samples available on that client's device. In other words, it depends on the amount of data each client used when training the model locally.
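
To make the weighted average concrete, here is a minimal NumPy sketch of the server-side aggregation step. The function name and inputs are my own illustration, not any library's API.

import numpy as np

def fedavg_aggregate(client_params, client_num_samples):
    """Weighted average of client parameter vectors.

    client_params: list of 1-D NumPy arrays, one per client.
    client_num_samples: list of ints, the number of local training
        samples each client used.
    """
    total_samples = sum(client_num_samples)
    new_params = np.zeros_like(client_params[0])
    for params, n_k in zip(client_params, client_num_samples):
        # Each client's update is weighted by its share of the data.
        new_params += (n_k / total_samples) * params
    return new_params

# Example: three clients with different amounts of local data.
updates = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
counts = [10, 30, 60]
print(fedavg_aggregate(updates, counts))  # weighted toward the 60-sample client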

Privacy issues with FL

Despite the many privacy benefits that come with using an FL model, there are still areas where it can fall short of security needs. For example, the updated parameters sent to the central server may contain sensitive information that the client would not want shared. Therefore, it's crucial to ensure that the data is protected in every way possible. This is where Differential Privacy (DP) comes in. The idea here is to add noise to the gradients using the Gaussian mechanism. Note that other forms of noise can be added; however, researchers have found Gaussian noise to be the most effective. This is called DP-SGD, Differentially Private Stochastic Gradient Descent. If you're unfamiliar with what DP is, I highly suggest checking out my article where I provide a brief overview of the fundamental ideas.

Differentially Private Stochastic Gradient Descent

Normal SGD works by selecting mini-batches of data, computing the average loss gradient over each batch, and updating the model based on that gradient. As I mentioned, in DP-SGD noise is added to the gradient updates to protect the privacy of the training data. However, to keep the model useful, we want to make sure there isn't too much noise added. With DP there is always a privacy/accuracy tradeoff: when you add a lot of noise, the privacy guarantee is strong, but the results you get from the model will likely be useless; when you don't add enough noise, the accuracy is meaningful, but you risk not having enough privacy. To balance this tradeoff, we clip the L2 norm of the gradient before adding noise. Lots of fancy vocab there, but don't worry, I'll break it down.
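
For reference, here is what one step of plain mini-batch SGD looks like before any privacy machinery is added. The toy linear model and data are made up purely for illustration.

import numpy as np

def sgd_step(params, batch_x, batch_y, lr=0.1):
    """One plain mini-batch SGD step for a toy linear model with squared loss."""
    preds = batch_x @ params
    # Average loss gradient over the mini-batch.
    grad = 2 * batch_x.T @ (preds - batch_y) / len(batch_y)
    return params - lr * grad

# Toy usage: fit y = 2x from a mini-batch of three points.
params = np.zeros(1)
for _ in range(50):
    params = sgd_step(params, np.array([[1.0], [2.0], [3.0]]), np.array([2.0, 4.0, 6.0]))
print(params)  # approaches [2.]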

The L2 norm refers to the magnitude of the gradient vector, so "clipping" it simply means restricting it to a maximum value. This gives us more control: if the L2 norm of the gradient exceeds the clipping threshold, the gradient is scaled down so that its L2 norm equals the threshold, ensuring that it is always less than or equal to the threshold. Noise is then added to the clipped gradient in order to achieve differential privacy. The amount of noise added is proportional to the clipping threshold and is chosen to satisfy the desired privacy guarantee. The rest is the same as normal SGD.
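
Here is a rough NumPy sketch of the clip-then-noise step described above, mimicking the per-example clipping that real DP-SGD implementations perform before averaging. The clipping threshold and noise multiplier values are arbitrary placeholders.

import numpy as np

def clip_and_noise(per_example_grads, clip_norm=1.0, noise_multiplier=1.1):
    """Clip each example's gradient to L2 norm <= clip_norm, average them,
    then add Gaussian noise scaled to the clipping threshold."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale down any gradient whose L2 norm exceeds the threshold.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    avg = np.mean(clipped, axis=0)
    # Noise standard deviation is proportional to the clipping threshold.
    sigma = noise_multiplier * clip_norm / len(per_example_grads)
    return avg + np.random.normal(scale=sigma, size=avg.shape)

grads = [np.array([3.0, 4.0]), np.array([0.1, -0.2])]  # toy per-example gradients
print(clip_and_noise(grads))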

As the model runs through the training data using DP-SGD, the privacy loss is measured. The moments accountant technique was proposed to track the privacy budget in DP-SGD. The moments accountant works by keeping track of a bound on the moments of the privacy loss random variable. Lots of fancy words here, so let me try to make it make sense. Moments are simply mathematical functions that provide information about the shape of a probability distribution. The accountant uses the moments of the privacy loss distribution to bound the total privacy loss over a sequence of queries, which gives a much tighter estimate than simply adding up the loss of each individual step.
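
In practice you rarely implement this accounting yourself. For example, TensorFlow Privacy ships a helper that reports the epsilon spent for a given noise multiplier, sampling rate, and number of steps; the module path, signature, and hyperparameter values below come from an older release and may have changed, so treat this as a sketch of the idea rather than a copy-paste recipe.

# A sketch of DP-SGD privacy accounting with TensorFlow Privacy.  The import
# path and signature below are from an older tensorflow_privacy release and
# may have moved in newer versions.
from tensorflow_privacy.privacy.analysis import compute_dp_sgd_privacy

# Hypothetical training setup: 60,000 examples, batches of 256, noise
# multiplier 1.1, 60 epochs, and a target delta of 1e-5.
eps, opt_order = compute_dp_sgd_privacy.compute_dp_sgd_privacy(
    n=60000, batch_size=256, noise_multiplier=1.1, epochs=60, delta=1e-5)
print(f'epsilon spent: {eps:.2f}')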

The code

Please note this code was taken from TensorFlow's website; I'm just explaining the code here and providing a little bit of background!

Basic set-up

import collections

import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_federated as tff


def get_emnist_dataset():
  emnist_train, emnist_test = tff.simulation.datasets.emnist.load_data(
      only_digits=True)

  def element_fn(element):
    return collections.OrderedDict(
        x=tf.expand_dims(element['pixels'], -1), y=element['label'])

  def preprocess_train_dataset(dataset):
    return (dataset.map(element_fn)
                   .shuffle(buffer_size=418)
                   .repeat(1)
                   .batch(32, drop_remainder=False))

  def preprocess_test_dataset(dataset):
    return dataset.map(element_fn).batch(128, drop_remainder=False)

  emnist_train = emnist_train.preprocess(preprocess_train_dataset)
  emnist_test = preprocess_test_dataset(
      emnist_test.create_tf_dataset_from_all_clients())
  return emnist_train, emnist_test

train_data, test_data = get_emnist_dataset()

The TensorFlow Federated library already includes some datasets to play around with; here the EMNIST dataset is being loaded. All the data is preprocessed here, which would look a little different in a real federated environment, and the dataset is split into training data and test data.

def my_model_fn():
  model = tf.keras.models.Sequential([
      tf.keras.layers.Reshape(input_shape=(28, 28, 1), target_shape=(28 * 28,)),
      tf.keras.layers.Dense(200, activation=tf.nn.relu),
      tf.keras.layers.Dense(200, activation=tf.nn.relu),
      tf.keras.layers.Dense(10)])
  return tff.learning.from_keras_model(
      keras_model=model,
      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
      input_spec=test_data.element_spec,
      metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

A very simple and standard neural network is defined. The EMNIST dataset contains 28 x 28 pixel images of handwritten digits, so we first reshape each image into a flat vector before feeding it through the network. Two hidden layers of 200 units are defined, followed by a 10-unit output layer. The loss function is Sparse Categorical Cross Entropy, a standard loss for multiclass classification problems.

total_clients = len(train_data.client_ids)
clients_per_thread = 5
tff.backends.native.set_sync_local_cpp_execution_context(
    max_concurrent_computation_calls=total_clients / clients_per_thread)

def train(rounds, noise_multiplier, clients_per_round, data_frame):
  # Differentially private aggregation of the client model updates.
  aggregation_factory = tff.learning.model_update_aggregator.dp_aggregator(
      noise_multiplier, clients_per_round)

  # Probability that any given client is sampled in a round.
  sampling_prob = clients_per_round / total_clients

The first line determines the number of clients in the training data. Then the number of clients processed per thread is set to five, and the local execution context is configured so that at most total_clients / clients_per_thread computations run concurrently; “clients_per_thread” is what controls the parallelism of the simulation. Now that that's set up, we can move into the fun stuff! The train function takes four arguments:

  • “rounds”: the number of rounds of training to perform.
  • “noise_multiplier”: controls the amount of noise added.
  • “clients_per_round”: the number of clients selected for each round of training.
  • “data_frame”: the data frame used to record training metrics.

“aggregation_factory” defines how the model updates from the clients are aggregated. The “dp_aggregator” provides differentially private aggregation, with the noise multiplier controlling the amount of noise added. The next important step is to compute the sampling probability, which is the probability of picking any specific client during each round of training. This quantity matters for the privacy analysis.

  learning_process = tff.learning.algorithms.build_unweighted_fed_avg(
      my_model_fn,
      client_optimizer_fn=lambda: tf.keras.optimizers.SGD(0.01),
      server_optimizer_fn=lambda: tf.keras.optimizers.SGD(1.0, momentum=0.9),
      model_aggregator=aggregation_factory)

  eval_process = tff.learning.build_federated_evaluation(my_model_fn)

Now we can go back and look at how the model is trained. The learning process is created using the “build_unweighted_fed_avg” function, which builds an iterative training process for an FL model using the FedAvg algorithm. It takes a few arguments:

  • “my_model_fn”: the basic neural network structure previously defined above.
  • “client_optimizer_fn”: the optimizer used to update the model on each client during training. Stochastic gradient descent with a learning rate of 0.01 is used here.
  • “server_optimizer_fn”: similar to the client optimizer, but used by the server to update the global model during each training round.
  • “model_aggregator”: tells how the model updates from the different clients will be aggregated, which here is the differentially private aggregator we previously defined.

Finally, the performance of the federated model must be evaluated, which is done with “eval_process”, built from “build_federated_evaluation”.

  # Training loop.
  state = learning_process.initialize()
  for round in range(rounds):
    if round % 5 == 0:
      # Evaluate the current global model every five rounds.
      model_weights = learning_process.get_model_weights(state)
      metrics = eval_process(model_weights, [test_data])['eval']
      if round < 25 or round % 25 == 0:
        print(f'Round {round:3d}: {metrics}')
      # Note: DataFrame.append was removed in pandas 2.0; this snippet
      # assumes an older pandas version.
      data_frame = data_frame.append({'Round': round,
                                      'NoiseMultiplier': noise_multiplier,
                                      **metrics}, ignore_index=True)

    # Sample each client independently with probability `sampling_prob`.
    x = np.random.uniform(size=total_clients)
    sampled_clients = [
        train_data.client_ids[i] for i in range(total_clients)
        if x[i] < sampling_prob]
    sampled_train_data = [
        train_data.create_tf_dataset_for_client(client)
        for client in sampled_clients]

    # Use selected clients for update.
    result = learning_process.next(state, sampled_train_data)
    state = result.state
    metrics = result.metrics

  # Final evaluation after the last round.
  model_weights = learning_process.get_model_weights(state)
  metrics = eval_process(model_weights, [test_data])['eval']
  print(f'Round {rounds:3d}: {metrics}')
  data_frame = data_frame.append({'Round': rounds,
                                  'NoiseMultiplier': noise_multiplier,
                                  **metrics}, ignore_index=True)

  return data_frame

The code above is the training loop itself. Every five rounds the current global model is evaluated on the test data and the metrics are logged to the data frame. Each round, every client is sampled independently with probability “sampling_prob”, and the sampled clients' datasets are passed to “learning_process.next” to run one federated round. After the last round, the final model is evaluated once more and the data frame of metrics is returned.
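
To actually kick off training you would call “train” with a fresh data frame. The hyperparameter values below (100 rounds, 50 clients per round, and a small sweep of noise multipliers) are only illustrative:

data_frame = pd.DataFrame()
rounds = 100
clients_per_round = 50

# Train once without noise as a baseline, then with increasing noise.
for noise_multiplier in [0.0, 0.5, 1.0]:
  print(f'Starting training with noise multiplier: {noise_multiplier}')
  data_frame = train(rounds, noise_multiplier, clients_per_round, data_frame)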

To conclude, Differential Privacy is a powerful tool that can be applied in federated models to further secure clients' data and information, specifically by adding noise to the gradients during SGD.
