Mirza Mrahorovic
Apr 20, 2020

This work is created by: Stefan Hofman, Mirza Mrahorovic & Yen Lin Wu

Reproducing: Communication-Efficient Learning of Deep Networks from Decentralized Data

Introduction

In 2017, H. Brendan McMahan et al. published Communication-Efficient Learning of Deep Networks from Decentralized Data. The paper introduces the principle of Federated Learning and demonstrates a practical method for training deep networks based on iterative model averaging.

This project was done as part of the CS4240 Deep Learning course at Delft University of Technology. We reproduced the paper’s results, performed experiments with randomized uneven data distributions among clients, and added noise to the weights before they are communicated to the server.

Federated Learning

With the increasing use of mobile devices, the data stored on them can be used to improve, for example, language models, voice recognition, and text entry. However, such data are privacy-sensitive in nature. Federated Learning allows users to collectively reap the benefits of shared models trained on such data without the need to store it centrally.

In this project, we investigate a learning technique proposed by Google. It is termed Federated Learning, since the learning task is solved by a loose federation of participating devices (clients) coordinated by a central server (H. Brendan McMahan et al.). The algorithm works as follows. A central server shares its model weights with clients. Each client computes its own update of the model on local training data and uploads it to the server that maintains the global model. The local training data set is thus never uploaded; instead, only the update (in weight/bias parameters) is communicated to the global model. More formally, every client k trains on its local data with SGD with batch size B and learning rate η to obtain a gradient estimate g_k.

After E epochs, the updated weights w_k are sent to the server. The server then combines the weights of the clients to obtain a new model w_{t+1}, assigning a higher weight to clients that hold a larger fraction of the data.
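The round described above can be sketched as follows. This is a minimal NumPy illustration with a linear model and squared loss, not the paper’s CNN setup; client_update, server_average, and all parameter values are our own illustrative names.

```python
import numpy as np

def client_update(w, X, y, lr=0.05, epochs=1, batch_size=5):
    """One client's local update: E epochs of mini-batch SGD on a linear model."""
    w = w.copy()
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            residual = X[batch] @ w - y[batch]
            g = 2 * X[batch].T @ residual / len(batch)  # gradient estimate g_k
            w -= lr * g
    return w

def server_average(weights, counts):
    """Combine client models, weighting each by its fraction of the data."""
    total = sum(counts)
    return sum(n_k / total * w_k for w_k, n_k in zip(weights, counts))
```

One communication round then consists of calling client_update for each selected client and server_average on the results.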

This model is then redistributed back to the clients for further training. A schematic visualizing the process is shown below.

Several questions arise here:

  1. What happens to the convergence rate if the data is distributed in an uneven manner among users?
  2. Would a simple weighted averaging of the models lead to faster convergence if the data is distributed in an uneven manner?
  3. How much noise can the weight updates tolerate and what are the implications for data privacy?

In order to answer these questions, we adjusted the FederatedAveraging algorithm introduced in the original paper. A series of 3 experiments is carried out to demonstrate its robustness to unbalanced and non-IID data distributions, as well as its ability to reduce the number of communication rounds needed to train a deep network on decentralized data by orders of magnitude.

We start by replicating the results of the paper using the existing code on GitHub. These results are presented in the section Replication of Paper Results. Next, we present and discuss the results for the three questions above in the sections Uneven Distribution without Weighted Model Updates, Uneven Distribution with Weighted Model Updates, and Noise Robustness.

Replication of Paper Results

To validate the correctness of the code and check for bugs, we first replicate the results from the paper. We attempted to reproduce the original results using a CNN on the MNIST dataset for both the IID and non-IID cases. For clarity, we restate the model update (performed by the clients) and the weighted averaging (performed by the server):

Here, the gradient estimate g_k of client k is written out, where L is the loss function evaluated on data sample i with model parameters w.
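Written out (our reconstruction of the lost equation, with P_k the set of data indices held by client k and n_k = |P_k|):

$$g_k = \frac{1}{n_k} \sum_{i \in P_k} \nabla L(w_t; i)$$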

The results are shown below. The top row shows the replicated results, the bottom row shows the original results from the paper.

We can see a small discrepancy between the results, which we attribute to running fewer communication rounds. Further discussion follows in the Discussions and Conclusions section.

Uneven Distribution without Weighted Model Updates

To answer the first question, we must distribute the data unevenly among users. Furthermore, the algorithm is adjusted so that the weighted averaging is taken out. In fact, the original code on GitHub did not have the weighting term, and hence no adjustments were needed. Note that the model update is given by the following formulas:
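(Our reconstruction of the update formulas, using the notation introduced above:)

$$g_k = \frac{1}{|P_k|} \sum_{i \in P_k} \nabla L(w_t; i), \qquad w_{t+1}^{k} = w_t - \eta\, g_k$$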

The set of data points P (of client k) that is used to estimate the gradient is denoted in red to emphasize that the parameter is altered. Now, the Python code that achieves this functionality is:

import numpy as np
import random

def mnist_iid(dataset, num_users):
    """Sample an uneven, random split of dataset indices over num_users clients."""
    num_items = int(len(dataset) / num_users)
    dict_users, all_idxs = {}, [i for i in range(len(dataset))]
    for i in range(num_users):
        dict_users[i] = set(np.random.choice(
            all_idxs,
            random.randint(1, num_items),  # unevenly distribute data
            replace=False))
        all_idxs = list(set(all_idxs) - dict_users[i])
    return dict_users
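As a quick sanity check of the splitting logic, here is a compact, self-contained restatement on a hypothetical dataset of 100 items and 5 users (uneven_split is our own illustrative name):

```python
import random
import numpy as np

def uneven_split(n_samples, num_users, seed=0):
    """Compact restatement of the uneven split above, on raw index lists."""
    random.seed(seed)
    np.random.seed(seed)
    num_items = n_samples // num_users
    dict_users, all_idxs = {}, list(range(n_samples))
    for i in range(num_users):
        chosen = set(np.random.choice(
            all_idxs,
            random.randint(1, num_items),  # each client gets 1..num_items samples
            replace=False))
        dict_users[i] = chosen
        all_idxs = list(set(all_idxs) - chosen)
    return dict_users

splits = uneven_split(100, 5)
sizes = [len(s) for s in splits.values()]  # client dataset sizes vary
```

The resulting client datasets are disjoint and of varying size, which is exactly the uneven distribution used in this experiment.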

The learning curve is shown below:

Uneven Distribution with Weighted Model Updates

The question that arises now is: how can we improve model convergence given an uneven distribution of data? A simple solution is to alter the algorithm so that the server averages the weights of the models (provided by the k users) in a weighted manner. Note that the Federated Averaging algorithm already has this functionality according to the original paper. However, the code on GitHub did not, and hence required adjustments.

In equations, the red highlighted part is now investigated.
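The weighted averaging step in question can be written as (our reconstruction, with n_k the number of samples on client k and n the total number of samples):

$$w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n}\, w_{t+1}^{k}$$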

Now, the Python code that achieves this functionality is:

import copy
import torch

def FedAvg(w, clients):
    """Weighted average of client state dicts; clients[i] is client i's sample count."""
    w_avg = copy.deepcopy(w[0])
    for k in w_avg.keys():
        w_avg[k] = torch.mul(w_avg[k], clients[0])  # weight the first client as well
        for i in range(1, len(w)):
            w_avg[k] += torch.mul(w[i][k], clients[i])
        w_avg[k] = torch.div(w_avg[k], sum(clients))
    return w_avg
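The weighting can be sanity-checked on scalars with a pure-Python analogue, independent of torch (fed_avg_scalar and the weights are hypothetical):

```python
def fed_avg_scalar(ws, clients):
    """Weighted mean of scalar 'models', weighted by client sample counts."""
    return sum(w * c for w, c in zip(ws, clients)) / sum(clients)

# a client holding 3x the data pulls the average 3x as hard
avg = fed_avg_scalar([1.0, 3.0], [1, 3])
```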

The learning curve is shown below:

From the learning curve we can see that weighting the clients by the amount of data they have is not very effective. In our experiments, the differences in dataset size were small, so having slightly more data gives no clear advantage; clients with slightly more data can still produce inaccurate local models, which are then weighted more heavily in the global model. A possible direction for further research is to vary the distribution and see whether larger differences in dataset size would result in better accuracy compared to the unweighted uneven distribution.

Noise Robustness

Lastly, we would like to know how much noise the weights can tolerate. The reason for investigating model convergence under noisy weight updates is that attacks exist in which the client’s original image can be reconstructed from the updated weights and biases sent to the server. If these weights are ‘polluted’ with noise, we expect that the reconstructed image will also be polluted and thereby made useless to the attacker.

In equations, we have adjusted the algorithm as follows:
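The adjustment amounts to adding Gaussian noise to each weight before it is sent to the server (our reconstruction of the adjusted update):

$$\tilde{w}_{t+1}^{k} = w_{t+1}^{k} + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I)$$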

Now, the Python code that achieves this functionality is:

import numpy as np
import torch

sigma_squared = 0.2  # noise variance
for layer in w:  # w is the client's state dict
    # np.random.normal expects the standard deviation, hence the square root
    x = np.random.normal(0, np.sqrt(sigma_squared), w[layer].size())
    x = torch.from_numpy(x).float()  # match the float32 weights
    w[layer] = w[layer] + x.cuda()
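A quick numerical check of the noise scale (illustrative; np.random.normal’s scale argument is the standard deviation, so a variance of 0.2 requires a scale of √0.2):

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(0.0, np.sqrt(0.2), size=100_000)  # std = sqrt(0.2)
empirical_var = noise.var()  # close to the intended variance 0.2
```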

The learning curve for 100 epochs is presented below:

Discussions and Conclusions

What can we tell from the 3 questions?

Comparing the first 2 results (without and with weighted model updates), we see that both perform poorly in the B=∞ cases. The former (without weighted updates) shows faster convergence, reaching an accuracy of 0.9 after 100 communication rounds in 5 out of 6 cases. To answer the 2nd question: a simple weighted averaging of models does NOT lead to faster convergence.

From the results above, we have seen that with Gaussian noise of variance 0.2 the model still converges to a reasonable accuracy (0.95). With Gaussian noise of variance 0.3, the model does not converge. The oscillations in the former case are limited but clearly present; in the latter they are too severe. To investigate the implications for data privacy, reconstruction attacks would have to be carried out. Work on this topic is investigated in a (still) unpublished master’s thesis; we therefore leave it as a direction for future work.

Why does our reproduction differ from the paper’s results?

We can see that our results did not reach the accuracy reported in the paper. Several factors could have caused this, e.g. an unlucky initial model instantiation or stopping training too early (due to limited computational power). Nevertheless, the results show reasonably fast convergence of the algorithm and a relatively high test accuracy (>0.98).

Limited computational power

Our biggest challenge while reproducing the results was limited access to computational power. In particular, the cases where each client trains for 20 epochs take 1 hour per 10 communication rounds; running a single E=20 case would have taken 100 hours. As a result, we ran only 200 rounds for the replication and 100 rounds for the 3 questions. (The paper mentions neither the GPU used nor the computation time.)
