https://arxiv.org/abs/1610.05492

(Summary) Federated Learning: Strategies for Improving Communication Efficiency

Yuan Ko
Feb 10, 2020


In federated learning, each client sends its update H to the server; the server aggregates these updates and produces the new model W_t+1.

This H is the difference between the client's locally trained model and the current global model, and it is what this paper tries to make cheap to communicate.
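The aggregation step, roughly in my own notation (H^i_t is client i's update in round t, S_t is the set of clients selected in that round, and η_t is the server learning rate; the paper additionally weights the clients, which I drop here for simplicity):

$$
H^i_t := W^i_t - W_t, \qquad
W_{t+1} = W_t + \eta_t \cdot \frac{1}{|S_t|} \sum_{i \in S_t} H^i_t
$$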

There are two kinds of updates: structured updates and sketched updates.

  • Structured update: learn the update using fewer variables in the first place
  • Sketched update: compute the full update, then compress it before sending

Let’s talk about structured updates first. There are two approaches (a minimal sketch of both follows the list).

  • Low rank: represent H as a product AB, fix A to a random matrix that can be regenerated from a random seed, and send only the trained B.
  • Random mask: constrain H to be a sparse matrix following a predefined random sparsity pattern, and send only the kept values (plus the seed that defines the pattern).
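Here is a minimal NumPy sketch of what a client would transmit under each structured update. The function and variable names are my own; in the paper, B (for low rank) or the unmasked entries (for random mask) are trained directly by SGD, whereas this sketch just fits them to an existing H to show what is sent and how the server can rebuild it.

```python
import numpy as np

# H is the update a client would otherwise send in full: H = W_local - W_global.

def low_rank_update(H, k, seed):
    """Structured update, low-rank flavor: H ≈ A @ B with A frozen.

    A is generated from a shared random seed and never transmitted;
    only the seed and B (k x d2) are sent. Here B is fit by least
    squares purely for illustration -- in the paper it is learned
    directly during local training.
    """
    d1, _ = H.shape
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((d1, k))            # regenerated by the server
    B, *_ = np.linalg.lstsq(A, H, rcond=None)   # transmitted
    return seed, B

def random_mask_update(H, keep_fraction, seed):
    """Structured update, random-mask flavor: H is forced to be sparse.

    The sparsity pattern comes from the seed, so only the seed and the
    kept values need to be transmitted.
    """
    rng = np.random.default_rng(seed)
    mask = rng.random(H.shape) < keep_fraction
    return seed, H[mask]                        # transmitted

def rebuild_masked(shape, keep_fraction, seed, values):
    """Server side: regenerate the mask from the seed, scatter the values back."""
    rng = np.random.default_rng(seed)
    mask = rng.random(shape) < keep_fraction
    H_hat = np.zeros(shape)
    H_hat[mask] = values
    return H_hat
```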

Random mask gets better performance, as mentioned in the paper, but the paper does not explain the reason.

In my opinion, the low-rank constraint zeroes out some directions, throwing part of the dimensions away, and that does more damage to accuracy.

Sketched updates are another way to address communication cost.
There are also two tools here: subsampling and probabilistic quantization.

  • Subsampling: communicate only a matrix H′ formed from a random subset of the values of H^i_t; the server averages the subsampled updates.
    (It is similar to random mask, but here the full model is trained and some parameters are thrown away afterwards; random mask trains the sparse update directly.)
  • Probabilistic quantization: round each value of H to one of a few levels at random, so that the quantized value is correct in expectation (the 1-bit case is written out right after this list).
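The 1-bit case, as given in the paper: for each scalar h in the update, with h_max and h_min the largest and smallest values of the update,

$$
\tilde{h} =
\begin{cases}
h_{\max} & \text{with probability } \dfrac{h - h_{\min}}{h_{\max} - h_{\min}},\\[4pt]
h_{\min} & \text{otherwise},
\end{cases}
$$

so that E[\tilde{h}] = h, i.e. the quantized update is unbiased.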

With 1 bit per value, this compresses the update by a factor of 32 relative to 4-byte floats.
(But probabilistic quantization causes a problem of its own.)
This example is mentioned in the paper:

For example, when max = 1 and min = −1 and most of the values are 0, 1-bit quantization will lead to a large error.
We note that applying a random rotation to h before the quantization (multiplying h by a random orthogonal matrix) solves this issue. This claim has been theoretically supported in Suresh et al. (2017). In that work, it is shown that the structured random rotation can reduce the quantization error by a factor of O(d / log d), where d is the dimension of h. We will show its practical utility in the next section.

With random rotation applied, federated learning becomes more stable.
Quantizing to 2 bits does not change accuracy much.
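A small NumPy sketch of the rotate-then-quantize idea, using the paper's motivating example (mostly zeros plus a ±1 outlier pair). The names are my own; the paper uses a structured rotation (a product of a diagonal sign matrix and a Walsh-Hadamard matrix) so the rotation costs O(d log d), while I use a dense random orthogonal matrix only to keep the sketch short, and I omit subsampling.

```python
import numpy as np

def random_rotation(d, seed):
    """Dense random orthogonal matrix from a seeded Gaussian via QR.

    Client and server derive the same R from the shared seed, so R itself
    is never transmitted. (The paper uses a structured rotation instead.)
    """
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q

def quantize_1bit(h, rng):
    """Unbiased 1-bit probabilistic quantization of a vector."""
    h_min, h_max = h.min(), h.max()
    p = (h - h_min) / (h_max - h_min)     # probability of rounding up
    bits = rng.random(h.shape) < p
    return bits, h_min, h_max

def dequantize_1bit(bits, h_min, h_max):
    return np.where(bits, h_max, h_min)

rng = np.random.default_rng(0)
d = 1024
h = np.zeros(d)
h[0], h[1] = 1.0, -1.0                    # the max = 1, min = -1, mostly-zero case

R = random_rotation(d, seed=42)

# Without rotation: every zero is rounded to +1 or -1, a huge error.
bits, lo, hi = quantize_1bit(h, rng)
err_plain = np.linalg.norm(dequantize_1bit(bits, lo, hi) - h)

# With rotation: quantize R @ h, then the server undoes the rotation (R^-1 = R.T).
bits, lo, hi = quantize_1bit(R @ h, rng)
err_rotated = np.linalg.norm(R.T @ dequantize_1bit(bits, lo, hi) - h)

print(err_plain, err_rotated)             # the rotated error is much smaller
```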

Experiment

Low rank vs. random mask comparison: CIFAR-10

Random mask performs significantly better than low rank, so low rank is omitted from the other comparisons.
With random mask, accuracy seems unaffected by the reduced update size.

Comparing random mask and rotation: CIFAR-10

Random mask (red line) learns better than the sketched update (blue and green lines), which combines subsampling and rotation, but the sketched update reaches a modest accuracy faster.
Since the sketched update throws part of the parameters away, random mask ends up with better accuracy in the end.
Moreover, the model performs better with rotation.

(I am not sure whether subsampling is used in this experiment. The subsampling ratio is not shown, but the paper mentions throwing some information away, so I guessed it is used.)

Comparing sketched updates only (no structured updates): CIFAR-10

Random rotation improves performance, particularly with a small number of quantization bits and smaller models.
Sketching out all but 6.25% of the values and quantizing to 2 bits causes only a minor drop in convergence.

Sketched update (no random mask): Reddit data

With random rotation, performance is better.
Quantizing to 2 bits does not cause any loss in performance (1 bit might).
Only 50 clients are subsampled per round, without ever touching the other 1,950 clients.

Effect of the number of clients: Reddit data

The more clients participate in each round, the fewer communication rounds are needed; with fewer clients per round, more communication is needed.

The next paper I will read

Communication-Efficient Learning of Deep Networks from Decentralized Data

If you find anything wrong, it is probably my fault. Respond to me and I will fix it.
