Tracking model performance metrics in federated training

Anton Ivanov
Apr 22 · 10 min read

Federated learning is a machine learning technique that trains a model across multiple decentralized devices, each holding a local data sample, without exchanging these data samples. Let’s imagine that by using this technique you have trained a binary classification model. You want to test it on the data of several devices by calculating the model ROC-AUC score on each of them and then averaging the results. The following questions arise:

  • How much does this score differ from the ROC-AUC score that could have been obtained if all the data was located on the same device?
  • Under which conditions are both scores equal? Are they equal only in the case when the data is identically distributed among all devices?

In this article, we will address these questions by doing an analytical case study supported by numerical examples and interactive visualizations.

Motivation: Federated learning

Federated learning architecture. Source: http://vision.cloudera.com/wp-content/uploads/2018/11/2018-10-31-181344-federated_learning_animated_labeled.gif

Compared to the centralized learning approach where the data is stored on a single server and we have constant access to it during training, in the federated learning setting the data is held by multiple devices that can participate at different times in the training process without sharing their data. A possible federated training procedure is composed of the following steps which are repeated multiple times until the model is trained:

  • send the model to every active device that is willing to participate in the training round;
  • on each device, train the model on a single batch of that device's local data;
  • return the trained model weights to the central server, where they are aggregated in a secure way such that the server cannot attribute contributions to individual devices;
  • check the model performance on a validation data set and decide whether to stop the training process.
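The steps above can be sketched in a few lines of numpy. The setup below is entirely hypothetical (a linear model, three simulated devices, plain unweighted averaging of the local weights) and leaves out the secure-aggregation and device-sampling machinery of a real federated system:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 3 devices, each holding private data for a linear model y ≈ x @ w.
true_w = np.array([1.0, -2.0, 0.5])
devices = []
for _ in range(3):
    X = rng.normal(size=(64, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=64)
    devices.append((X, y))

w = np.zeros(3)   # global model weights held by the server
lr = 0.05

for _round in range(200):
    local_weights = []
    for X, y in devices:                                 # 1. send model to each device
        w_local = w.copy()
        grad = 2 * X.T @ (X @ w_local - y) / len(y)      # 2. one local gradient step
        w_local -= lr * grad
        local_weights.append(w_local)                    # 3. return weights to server...
    w = np.mean(local_weights, axis=0)                   #    ...which averages them

# 4. the server checks performance on held-out validation data
X_val = rng.normal(size=(32, 3))
mse = np.mean((X_val @ w - X_val @ true_w) ** 2)
```

Averaging the locally updated weights after a single step per device is equivalent here to one gradient step on the devices' averaged loss, which is why the global model converges without any raw data leaving a device.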

In this setting, the central server receives only the averaged model weight updates from all devices, which, in combination with other privacy-preserving methods, makes it difficult for the server to infer what the data on the individual devices (for example, mobile phones) looks like.

If we want to check the model performance, we would like to do this while gathering as little information about the data as possible. A possible solution is to calculate a model performance metric A (such as the ROC-AUC score) on every device for the data contained in that device and then send the securely computed average of the metrics to the server. In this case, the model owner (the server) receives this averaged metric but does not know how it relates to the metric A(D) that would be obtained if all the data D were in one place, i.e. he does not know whether:

    A(D) = (1/M) Σₘ A(Dₘ)    (1)

where we have split the data D into M subsets.

ROC-AUC definition

We will consider the case of having a binary classification problem. A data set composed of features (x ∈ ℝⁿ) and target values (y ∈ {0, 1}) is used to train a model g: ℝⁿ → ℝ that assigns to every feature vector a score ξ = g(x). By comparing the score ξ with a threshold T we can decide whether a given element belongs to the class y = 1 (when ξ > T) or to the class y = 0 (otherwise).

For example, we can consider the case of having a data set with three features (x ∈ ℝ³), target variable y and a model g that is defined as follows:

    ξ = g(x) = ω₀ + ω₁x₁ + ω₂x₂ + ω₃x₃    (2)

where the ω’s denote the trained parameters of the model.
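As a toy illustration of scoring and thresholding (the weight values below are made up for the example, not trained parameters):

```python
import numpy as np

# Hypothetical parameters ω for the three-feature model above.
w = np.array([0.8, -1.2, 0.4])   # ω₁, ω₂, ω₃
b = 0.1                          # bias term ω₀

def score(x):
    """ξ = g(x): assign a real-valued score to a feature vector x ∈ R³."""
    return b + x @ w

def predict(x, T=0.0):
    """Class decision: y = 1 when ξ > T, else y = 0."""
    return int(score(x) > T)

x = np.array([1.0, 0.5, -0.2])
xi = score(x)    # 0.1 + 0.8 - 0.6 - 0.08 = 0.22
```

Moving the threshold T trades false positives against false negatives, which is exactly what the ROC curve below captures.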

We can separate the calculated scores ξ into two groups: scores obtained from data points belonging to the positive class (y = 1) and to the negative class (y = 0), respectively. The two groups of scores can be interpreted as samples from probability distribution functions f₊ and f₋, respectively. A sample visualization of both distributions is given in the figure below.

Possible scores distribution of the elements belonging to the positive class (f₊), and to the negative class (f₋). All predictions that are on the right side of the threshold T will be classified by the trained model as members of the positive class.

The figure represents a common case where we are not able to set the threshold T such that both probability distribution functions can be completely separated. Some of the members of the negative class are classified as positive (False positive = FP; this is the area under the blue curve on the right side of T) and some members of the positive class are classified as negative (False negative = FN; this is the area under the red curve on the left side of T). In general, the better the model the smaller the overlap between f₊, f₋ and with it the smaller the number of false positives and false negatives.

The areas under f₋ and f₊ to the right of T are equal to the false positive rate (FPR) and the true positive rate (TPR), respectively. Both are defined as:

    FPR(T) = ∫_T^∞ f₋(ξ) dξ    (3)

    TPR(T) = ∫_T^∞ f₊(ξ) dξ    (4)

[Figure: ROC curve]
Typical ROC curve of a trained model (red solid line). The ROC-AUC score is equal to the area under the curve. It is equal to 1 for a perfect classification model and to 0.5 for a model that randomly guesses the target value (area under the red dashed line).

The receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the ability of a model to discriminate between two classes as its discrimination threshold T is varied. The ROC-AUC score is defined as the area under the ROC curve and it is used to measure the model performance. This score can be expressed as a function of both probability distribution functions f₊, f₋, as follows:

    AUC = ∫₀¹ TPR d(FPR)    (5)

    AUC = −∫_{−∞}^{+∞} TPR(ξ) FPR′(ξ) dξ    (6)

where FPR′(ξ) refers to the derivative of FPR(ξ) with respect to ξ. This equation will be used to explain some of the differences between the ROC-AUC score and the proposed averaged ROC-AUC score in federated learning mode.
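Since −FPR′(ξ) = f₋(ξ), equation (6) says the ROC-AUC is the probability that a randomly drawn positive score exceeds a randomly drawn negative one. A small numpy sketch can check this against the closed-form value Φ((μ₊ − μ₋)/√(σ₊² + σ₋²)) that the integral yields for Gaussian score distributions (the means, variances, and sample sizes below are arbitrary choices for the example):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)

def auc(pos, neg):
    """ROC-AUC as P(ξ₊ > ξ₋): probability that a random positive-class
    score exceeds a random negative-class score (ties count half)."""
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

# Gaussian score distributions f₊ = N(μ₊, σ₊²), f₋ = N(μ₋, σ₋²)
mu_n, mu_p, s_n, s_p = 0.0, 1.5, 1.0, 1.0
neg = rng.normal(mu_n, s_n, 2000)
pos = rng.normal(mu_p, s_p, 2000)

# For Gaussians the integral in (6) has the closed form Φ((μ₊-μ₋)/√(σ₊²+σ₋²)).
phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))
exact = phi((mu_p - mu_n) / sqrt(s_p ** 2 + s_n ** 2))

empirical = auc(pos, neg)
```

The same pairwise `auc` helper is reused in the numerical experiments below.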

ROC-AUC in the federated learning mode

As already mentioned in the Motivation: Federated learning section, we will look at (1) for the case where the metric A is equal to the ROC-AUC score, i.e:

    AUC(D) = (1/M) Σₘ AUC(Dₘ)    (7a)

The number of devices is M and D_m denotes the data contained in the m-th device. In the following sections, we will see that the equality in (7a) depends on the distribution functions (f₊, f₋) of the scores for each device. We will consider several cases.

Case 1: the f₊, f₋ distributions among the devices are identical

To warm up, we will consider the trivial case where the data points contained in each of the M devices are identically distributed. This means that the distributions f₊,ₘ, f₋,ₘ (m = 1, 2, …, M) of the scores obtained from applying the trained model to the data on each of the M devices are the same, i.e.:

    f₊,ₘ(ξ) = f₊(ξ),  f₋,ₘ(ξ) = f₋(ξ),  m = 1, 2, …, M    (8)

The same statement can be applied to the false positive (FPR_m) and to the true positive (TPR_m) rates which are derived from f₋ and f₊, respectively. By using (6) we can conclude that the ROC-AUC score for each one of the M devices will be the same, i.e.:

    AUC(D₁) = AUC(D₂) = … = AUC(D_M) = AUC(D)

It follows that the average ROC-AUC score across all devices is equal to the ROC-AUC score for the entire data set.

    (1/M) Σₘ AUC(Dₘ) = AUC(D)    (9)

For example, we can consider the case of having 4 devices, each holding 500 elements of the positive class and 500 elements of the negative class (the source code used to generate this example is provided at the end of the post). The elements are split between the devices such that the distributions of the scores

    f₊,ₘ(ξ), f₋,ₘ(ξ),  m = 1, …, 4

are approximately the same and normally distributed with means μ₋, μ₊ and standard deviations σ₋, σ₊, as shown in the figure below. Such a situation arises if the data is identically distributed among all devices.

[Figure: score distributions f₊,ₘ and f₋,ₘ of the four devices]

In this particular case the ROC-AUC scores of the four devices are:

[Figure: ROC-AUC scores of the four devices]

The average ROC-AUC score and the ROC-AUC score for the entire data set are equal, as expected:

[Figure: average ROC-AUC score vs. ROC-AUC score of the entire data set]

The slight differences between the ROC-AUC scores of the different devices can be explained by the finite number of data points (n = 1000) in every device, which leads to small differences in the subsample distributions.
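This experiment can be reproduced in a few lines of numpy; the sketch below stands in for the original source code and draws per-device scores directly from two fixed normal distributions rather than from a trained model:

```python
import numpy as np

rng = np.random.default_rng(2)

def auc(pos, neg):
    """ROC-AUC as P(ξ₊ > ξ₋), with ties counting half."""
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

# 4 devices, each with 500 positive and 500 negative scores drawn from the
# SAME pair of normal distributions (identically distributed data).
mu_n, mu_p, sigma = 0.0, 1.5, 1.0
device_scores = [(rng.normal(mu_p, sigma, 500), rng.normal(mu_n, sigma, 500))
                 for _ in range(4)]

per_device = [auc(p, n) for p, n in device_scores]
avg_auc = float(np.mean(per_device))

# Pool all scores as if the data sat on a single device.
all_pos = np.concatenate([p for p, _ in device_scores])
all_neg = np.concatenate([n for _, n in device_scores])
pooled_auc = auc(all_pos, all_neg)
```

With identical score distributions, the plain average of the per-device scores and the pooled score agree up to finite-sample noise, as equation (9) predicts.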

Case 2: the f₊, f₋ distributions among the devices are not identical

If we consider the case where the data points contained in every one of the M devices are not identically distributed, we can represent f₊ and f₋ as follows:

    f₊(ξ) = Σₘ (I₊,ₘ / I₊) f₊,ₘ(ξ)    (10a)

    f₋(ξ) = Σₘ (I₋,ₘ / I₋) f₋,ₘ(ξ)    (10b)

where I₊ and I₋ refer to the number of points of the positive and negative class in the entire data set, and I₊,ₘ and I₋,ₘ to the corresponding numbers in the m-th device, respectively. It follows that:

    TPR(T) = Σₘ (I₊,ₘ / I₊) TPRₘ(T)    (11a)

    FPR(T) = Σₘ (I₋,ₘ / I₋) FPRₘ(T)    (11b)

Case 2.1: only the f₋ distributions among the devices are identical

This has the following implication:

    FPRₘ(T) = FPR(T)    (12a)

    FPR′ₘ(T) = FPR′(T)    (12b)

where FPR′(T) = d FPR(T)/dT. The weighted average ROC-AUC score across all devices is then given by:

    Σₘ (I₊,ₘ / I₊) AUC(Dₘ) = −∫ [Σₘ (I₊,ₘ / I₊) TPRₘ(T)] FPR′(T) dT
                           = −∫ TPR(T) FPR′(T) dT = AUC(D)    (13)

i.e. it is equal to the ROC-AUC score of the entire data set. If the number of positive-class elements I₊,ₘ is the same in every device, the last equation reduces to (9).

To illustrate this, we reuse the example with 4 devices from the previous section. The same data is now distributed among the devices in such a way that only the data points belonging to the negative class are identically distributed across all devices.

[Figure: score distributions of the four devices; identical f₋,ₘ, different f₊,ₘ]

In this particular case the ROC-AUC scores of the four devices are different:

[Figure: ROC-AUC scores of the four devices]

but the average ROC-AUC score and the ROC-AUC score for the entire data set are equal up to the second digit:

[Figure: weighted average ROC-AUC score vs. ROC-AUC score of the entire data set]

The slight difference can be explained by the fact that the data sets in the devices have a finite number of points.
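A numpy sketch of this case (again with scores drawn directly from assumed normal distributions rather than the article's original code) confirms that weighting the per-device scores by their share of positive-class points recovers the pooled ROC-AUC:

```python
import numpy as np

rng = np.random.default_rng(3)

def auc(pos, neg):
    """ROC-AUC as P(ξ₊ > ξ₋), with ties counting half."""
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

# Negative scores: identical N(0, 1) on every device (f₋,ₘ = f₋).
# Positive scores: a different mean AND a different count per device.
pos_means  = [0.5, 1.0, 1.5, 2.0]
pos_counts = [300, 500, 700, 900]
devices = [(rng.normal(mu, 1.0, n), rng.normal(0.0, 1.0, 600))
           for mu, n in zip(pos_means, pos_counts)]

# Weighted average of per-device scores, weights = share of positive points.
total_pos = sum(pos_counts)
weighted_avg = sum(n / total_pos * auc(p, neg)
                   for (p, neg), n in zip(devices, pos_counts))

pooled = auc(np.concatenate([p for p, _ in devices]),
             np.concatenate([n for _, n in devices]))
```

The per-device scores themselves differ (the device with μ₊ = 2.0 separates its classes far better than the one with μ₊ = 0.5), yet the I₊,ₘ-weighted average matches the pooled score, as in (13).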

Case 2.2: only the f₊ distributions among the devices are identical

This has the following implication:

    TPRₘ(T) = TPR(T),  m = 1, 2, …, M    (14)

We can use equation (14) in the same way as (12b) was used to prove (13):

    Σₘ (I₋,ₘ / I₋) AUC(Dₘ) = −∫ TPR(T) [Σₘ (I₋,ₘ / I₋) FPR′ₘ(T)] dT
                           = −∫ TPR(T) FPR′(T) dT = AUC(D)    (15)

In the case of having the same number of negative-class elements I₋,ₘ in every device, this equation reduces to (9), i.e. the average ROC-AUC score is again equal to the ROC-AUC score of the entire data set.

To illustrate this, we again reuse the example with 4 devices from the previous section. The same data is now distributed among the devices in such a way that only the data points belonging to the positive class are identically distributed across all devices.

[Figure: score distributions of the four devices; identical f₊,ₘ, different f₋,ₘ]

In this particular case the ROC-AUC scores of the four devices are different:

[Figure: ROC-AUC scores of the four devices]

but the average ROC-AUC score and the ROC-AUC score for the entire data set are equal up to the third digit, as expected:

[Figure: weighted average ROC-AUC score vs. ROC-AUC score of the entire data set]

An interactive chart that allows the user to experiment with different degrees of similarity between the score distributions f₊ and f₋ is available at the following link.
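The mirror-image experiment for this case can be sketched as follows; here the positive scores share one assumed normal distribution while the negative distributions and counts differ per device, and the weights are the shares of negative-class points:

```python
import numpy as np

rng = np.random.default_rng(4)

def auc(pos, neg):
    """ROC-AUC as P(ξ₊ > ξ₋), with ties counting half."""
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

# Positive scores: identical N(2, 1) on every device (f₊,ₘ = f₊).
# Negative scores: a different mean AND a different count per device.
neg_means  = [-1.0, -0.5, 0.0, 0.5]
neg_counts = [400, 600, 800, 1000]
devices = [(rng.normal(2.0, 1.0, 500), rng.normal(mu, 1.0, n))
           for mu, n in zip(neg_means, neg_counts)]

# Weighted average of per-device scores, weights = share of negative points.
total_neg = sum(neg_counts)
weighted_avg = sum(n / total_neg * auc(p, neg)
                   for (p, neg), n in zip(devices, neg_counts))

pooled = auc(np.concatenate([p for p, _ in devices]),
             np.concatenate([n for _, n in devices]))
```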

Case 2.3: Neither the f₊ nor the f₋ distributions among the devices are identical

In this case, we cannot expect that equation (9) will be fulfilled. In our numerical example, we indeed see that both sides of (9) are not equal if f₊ and f₋ are different among the devices.
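A small numpy sketch (with deliberately constructed, hypothetical score distributions) makes the failure concrete: each device separates its own classes almost perfectly, yet the pooled score is markedly lower than the plain average of the per-device scores:

```python
import numpy as np

rng = np.random.default_rng(5)

def auc(pos, neg):
    """ROC-AUC as P(ξ₊ > ξ₋), with ties counting half."""
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

# Both f₊ and f₋ shift together between the two devices:
# each device separates its own classes almost perfectly...
dev1 = (rng.normal(1.0, 0.1, 500), rng.normal(0.0, 0.1, 500))  # (pos, neg)
dev2 = (rng.normal(3.0, 0.1, 500), rng.normal(2.0, 0.1, 500))

avg = (auc(*dev1) + auc(*dev2)) / 2

# ...but pooled, device 2's negatives (around 2) outrank device 1's
# positives (around 1), so the global score drops well below the average:
# of the four positive/negative pairings, one is almost always misordered,
# giving a pooled AUC near 3/4.
pooled = auc(np.concatenate([dev1[0], dev2[0]]),
             np.concatenate([dev1[1], dev2[1]]))
```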

Summary

In this article, we have looked at how a weighted average of ROC-AUC scores across multiple devices, each holding a local data set, compares to the ROC-AUC score calculated on the complete data set on a single device. A sufficient condition for both metrics to be equal is that either the positive-score or the negative-score distribution is identical across all devices.

This work was done within the scope of the polypoly project.

Resources:

KI labs Engineering

KI labs Technical Blog https://www.ki-labs.com/
