# Tracking model performance metrics in federated training

Federated learning is a machine learning technique that trains a model across multiple decentralized devices, each holding a local data sample, without exchanging these samples. Imagine that by using this technique you have trained a binary classification model. You want to test it on the data of several devices by calculating the model's ROC-AUC score on each of them and then averaging the results. The following questions arise:

- How much does this score differ from the ROC-AUC score that could have been obtained if all the data was located on the same device?
- Under which conditions are both scores equal? Are they equal only in the case when the data is identically distributed among all devices?

In this article, we will address these questions by doing an analytical case study supported by numerical examples and interactive visualizations.

# Motivation: Federated learning

Compared to the centralized learning approach where the data is stored on a single server and we have constant access to it during training, in the federated learning setting the data is held by multiple devices that can participate at different times in the training process without sharing their data. A possible federated training procedure is composed of the following steps which are repeated multiple times until the model is trained:

- send the model to every active device that is willing to participate in the training process;
- on every device, train the model on a single batch of that device's local data;
- return the trained model weights to the central server, where they are aggregated in a secure way such that the server is not aware of the contribution of any individual device to the averaged model weights;
- check the model performance on a validation data set and decide whether to stop the training process.
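The loop above can be sketched in code. The following is a minimal illustration only: it assumes a logistic-regression model and replaces secure aggregation with a plain average; all names and hyperparameters are illustrative, not part of the original procedure.

```python
import numpy as np

def federated_round(global_weights, device_batches, lr=0.1):
    """One federated round (sketch): every device takes a single gradient
    step on a local batch; the server averages the returned weights."""
    updated = []
    for X, y in device_batches:
        w = global_weights.copy()
        # one SGD step on an assumed logistic-regression model
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (p - y) / len(y)
        updated.append(w - lr * grad)
    # plain average stands in for the secure aggregation step
    return np.mean(updated, axis=0)

rng = np.random.default_rng(0)
w = np.zeros(3)
batches = [(rng.normal(size=(32, 3)), rng.integers(0, 2, 32).astype(float))
           for _ in range(4)]
for _ in range(10):
    w = federated_round(w, batches)
```

In a real deployment, the averaging step would be performed with a secure aggregation protocol so that individual device updates are never revealed to the server.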

In this setting, the central server receives only the averaged model weight updates from all devices, which, in combination with other privacy-preserving methods, makes it difficult for the server to infer what the data on the individual devices, for example mobile phones, looks like.

If we want to check the model performance, we would like to do so in a way that gathers as little information about the data as possible. A possible solution is to calculate a model performance metric *A* (like the ROC-AUC score) on every device for the data contained in that device, and then send the securely computed average of the metrics to the server. In this case, the model owner (the server) receives this averaged metric but does not know how it relates to the metric *A(D)* that would have been calculated if all the data *D* was in one place, i.e. he does not know if:

(1/*M*) Σₘ₌₁ᴹ *A(Dₘ)* ≟ *A(D)*    (1)

where we have split the data *D* into *M* subsets *Dₘ*.

# ROC-AUC definition

We will consider the case of a binary classification problem. A data set composed of features (*x ∈ ℝⁿ*) and target values (*y ∈ {0, 1}*) is used to train a model *g*: ℝⁿ ↦ ℝ that assigns to every feature vector a score *ξ = g(x)*. By comparing the score *ξ* with a threshold *T* we can decide if a given element belongs to the class *y = 1* (when *ξ > T*) or to the class *y = 0* (otherwise).

For example, we can consider the case of having a data set with three features (*x ∈ ℝ³*), target variable *y*, and a model *g* that is defined as follows:

*ξ = g(x) = ω₀ + ω₁x₁ + ω₂x₂ + ω₃x₃*

where the *ω*'s denote the trained parameters of the model.
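As a concrete sketch of scoring and thresholding (with made-up parameter values *ω*, chosen for illustration only):

```python
import numpy as np

# Hypothetical trained parameters (omega_0, ..., omega_3); illustrative only
omega = np.array([0.5, 1.2, -0.7, 0.3])

def g(x):
    """Linear scoring model for a feature vector x in R^3."""
    return omega[0] + x @ omega[1:]

x = np.array([1.0, 2.0, 3.0])
xi = g(x)             # score: 0.5 + 1.2*1.0 - 0.7*2.0 + 0.3*3.0 = 1.2
T = 0.0               # decision threshold
y_pred = int(xi > T)  # predicted class: 1, since xi > T
```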

We can separate the calculated scores *ξ* into two groups: scores obtained from data points that belong to the positive class (*y = 1*) and scores from points of the negative class (*y = 0*). Both groups of scores can be interpreted as samples from probability distribution functions *f₊* and *f₋* for the elements of the first and second group, respectively. A sample visualization of both distributions is given in the figure below.

The figure represents a common case where we are not able to set the threshold *T* such that both probability distribution functions are completely separated. Some members of the negative class are classified as positive (false positives, FP; the area under the blue curve to the right of *T*) and some members of the positive class are classified as negative (false negatives, FN; the area under the red curve to the left of *T*). In general, the better the model, the smaller the overlap between *f₊* and *f₋*, and with it the smaller the number of false positives and false negatives.

The areas under *f₋* and *f₊* to the right of *T* are equal to the false positive rate (FPR) and the true positive rate (TPR), respectively. Both are defined as:

FPR(*T*) = ∫_*T*^∞ *f₋*(*ξ*) d*ξ*,  TPR(*T*) = ∫_*T*^∞ *f₊*(*ξ*) d*ξ*
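Empirically, FPR(*T*) and TPR(*T*) are simply the fractions of negative-class and positive-class scores above the threshold. A small sketch with assumed normal score distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative score samples: negatives ~ N(-1, 1), positives ~ N(1, 1)
xi_neg = rng.normal(-1.0, 1.0, 10_000)
xi_pos = rng.normal(1.0, 1.0, 10_000)

def fpr(T):
    """Fraction of negative-class scores above the threshold T."""
    return np.mean(xi_neg > T)

def tpr(T):
    """Fraction of positive-class scores above the threshold T."""
    return np.mean(xi_pos > T)

# At T = 0 these approximate Phi(-1) ~ 0.159 and Phi(1) ~ 0.841
```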

The receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the ability of a model to discriminate between two classes as its discrimination threshold *T* is varied. The ROC-AUC score is defined as the area under the ROC curve, and it is used to measure the model performance. This score can be expressed as a function of both probability distribution functions *f₊* and *f₋*, as follows:

ROC-AUC = −∫₋∞^∞ TPR(*ξ*) FPR′(*ξ*) d*ξ*    (6)

where FPR′(*ξ*) refers to the derivative of FPR(*ξ*) with respect to *ξ*. This equation will be used to explain some of the differences between the ROC-AUC score and the proposed averaged ROC-AUC score in the federated learning mode.
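The integral can be checked numerically. The sketch below draws scores from two assumed normal distributions, evaluates TPR and FPR on a grid of thresholds, and compares the trapezoid-rule value of the integral with the equivalent pairwise-comparison (Mann-Whitney) estimate of the ROC-AUC score:

```python
import numpy as np

rng = np.random.default_rng(1)
xi_neg = rng.normal(-1.0, 1.0, 5_000)   # negative-class scores (f_minus)
xi_pos = rng.normal(1.0, 1.0, 5_000)    # positive-class scores (f_plus)

# Empirical FPR(T) and TPR(T) on a grid of thresholds
T = np.linspace(-6.0, 6.0, 2_001)
fpr = np.array([np.mean(xi_neg > t) for t in T])
tpr = np.array([np.mean(xi_pos > t) for t in T])

# ROC-AUC = -integral of TPR(T) * FPR'(T) dT, via the trapezoid rule
# (FPR decreases with T, hence the minus sign)
auc_integral = -np.sum(0.5 * (tpr[1:] + tpr[:-1]) * np.diff(fpr))

# Cross-check: the AUC is also P(positive score > negative score)
auc_pairs = np.mean(xi_pos[:, None] > xi_neg[None, :])
```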

# ROC-AUC in the federated learning mode

As already mentioned in the *Motivation: Federated learning* section, we will look at (1) for the case where the metric *A* is equal to the ROC-AUC score, i.e.:

(1/*M*) Σₘ₌₁ᴹ ROC-AUC(*Dₘ*) ≟ ROC-AUC(*D*)    (7a)

where *M* is the number of devices and *Dₘ* denotes the data contained in the *m*-th device. In the following sections, we will see that the equality in (7a) depends on the distribution functions (*f₊*, *f₋*) of the scores on each device. We will consider several cases.
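Both sides of (7a) can be computed directly. A minimal numerical sketch, with illustrative and identically distributed per-device scores (not the article's exact data):

```python
import numpy as np

def roc_auc(xi_pos, xi_neg):
    """ROC-AUC as the probability that a positive score beats a negative one."""
    return np.mean(xi_pos[:, None] > xi_neg[None, :])

rng = np.random.default_rng(2)

# M = 4 devices; positives ~ N(1, 1), negatives ~ N(-1, 1) on every device
devices = [(rng.normal(1.0, 1.0, 500), rng.normal(-1.0, 1.0, 500))
           for _ in range(4)]

auc_avg = np.mean([roc_auc(p, n) for p, n in devices])      # averaged per-device score
auc_all = roc_auc(np.concatenate([p for p, _ in devices]),  # score on the pooled data
                  np.concatenate([n for _, n in devices]))
```

With identically distributed devices the two values agree up to sampling noise; the cases below examine when this holds in general.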

*f₋*## Case 1: the f₊, f₋ distributions among the devices are identical

To warm up, we will consider the trivial case where the data points contained in every one of the *M* devices are identically distributed. This means that the distributions *f₊ₘ*, *f₋ₘ* (*m = 1, 2, …, M*) of the scores obtained from applying the trained model to the data on each one of the *M* devices are the same, i.e.:

*f₊ₘ = f₊*,  *f₋ₘ = f₋*,  *m = 1, 2, …, M*

The same statement can be applied to the false positive rates (FPRₘ) and to the true positive rates (TPRₘ), which are derived from *f₋* and *f₊*, respectively. By using (6) we can conclude that the ROC-AUC score is the same for each one of the *M* devices, and therefore:

(1/*M*) Σₘ₌₁ᴹ ROC-AUCₘ = ROC-AUC    (9)

i.e. the average ROC-AUC score across all devices is equal to the ROC-AUC score for the entire data set.

For example, we can consider the case of having 4 devices, each holding 500 elements of the positive and 500 elements of the negative class (the source code used to generate this example is provided at the end of the post). The elements are split between the devices such that the distributions of the scores *f₊ₘ*, *f₋ₘ* are approximately the same and normally distributed with means *μ₊*, *μ₋* and standard deviations *σ₊*, *σ₋*, as shown in the figure below. Such a situation can be achieved if the data is identically distributed among all devices.

In this particular case the ROC-AUC scores of the four devices are:

The average ROC-AUC score and the ROC-AUC score for the entire data set are equal, as expected:

The slight differences between the ROC-AUC scores of the different devices can be explained by the finite number of data points (*n = 1000*) in every device, which leads to small differences in the subsample distributions.

## Case 2: the f₊, f₋ distributions among the devices are not identical

If we consider the case where the data points contained in every one of the *M* devices are not identically distributed, we can represent *f₊* and *f₋* as weighted sums of the per-device distributions:

*f₊* = Σₘ₌₁ᴹ (*I₊ₘ*/*I₊*) *f₊ₘ*,  *f₋* = Σₘ₌₁ᴹ (*I₋ₘ*/*I₋*) *f₋ₘ*

where *I₊*, *I₋* refer to the number of points of the positive/negative class in the entire data set, and *I₊ₘ*, *I₋ₘ* to the corresponding numbers in the *m*-th device. It follows that:

TPR(*T*) = Σₘ₌₁ᴹ (*I₊ₘ*/*I₊*) TPRₘ(*T*),  FPR(*T*) = Σₘ₌₁ᴹ (*I₋ₘ*/*I₋*) FPRₘ(*T*)

## Case 2.1: only the f₋ distributions among the devices are identical

This has the following implication:

*f₋ₘ = f₋*, *m = 1, 2, …, M*  ⇒  FPRₘ(*T*) = FPR(*T*)    (12a)

FPR′ₘ(*T*) = FPR′(*T*)    (12b)

where FPR′(*T*) = *d*FPR(*T*)/*dT*. The weighted average ROC-AUC score across all devices is then given by:

Σₘ₌₁ᴹ (*I₊ₘ*/*I₊*) ROC-AUCₘ = −∫ [Σₘ (*I₊ₘ*/*I₊*) TPRₘ(*T*)] FPR′(*T*) d*T* = −∫ TPR(*T*) FPR′(*T*) d*T* = ROC-AUC    (13)

i.e. it is equal to the ROC-AUC score of the entire data set. If the number of elements from the positive class *I₊ₘ* is the same in every device, the last equation reduces to (9).
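A numerical sketch of this case (with illustrative distributions, not the article's exact data): the negative scores are identically distributed on every device, the positive scores differ in mean and count, and the per-device scores are weighted by the positive-class counts *I₊ₘ*:

```python
import numpy as np

def roc_auc(xi_pos, xi_neg):
    """ROC-AUC as the probability that a positive score beats a negative one."""
    return np.mean(xi_pos[:, None] > xi_neg[None, :])

rng = np.random.default_rng(3)

# Identical negative distribution N(-1, 1) on every device, but positive
# scores with different means and counts per device (illustrative values)
pos_means, pos_counts = [0.5, 1.0, 1.5, 2.0], [200, 400, 600, 800]
devices = [(rng.normal(mu, 1.0, n), rng.normal(-1.0, 1.0, 2_000))
           for mu, n in zip(pos_means, pos_counts)]

I_pos = sum(pos_counts)
# Per-device scores weighted by the positive-class counts
auc_weighted = sum(n / I_pos * roc_auc(p, neg)
                   for (p, neg), n in zip(devices, pos_counts))
auc_all = roc_auc(np.concatenate([p for p, _ in devices]),
                  np.concatenate([neg for _, neg in devices]))
```

Up to finite-sample noise, the weighted average matches the pooled score even though the individual device scores differ.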

To illustrate this we can use again the same example with 4 devices that we have used in the previous section. We use the same data which in this case is distributed among the devices in a way that only the data points belonging to the negative class are identically distributed among all devices.

In this particular case the ROC-AUC scores of the four devices are different:

but the average ROC-AUC score and the ROC-AUC score for the entire data set are equal up to the second digit:

The slight difference can be explained by the fact that the data sets in the devices have a finite number of points.

## Case 2.2: only the f₊ distributions among the devices are identical

This has the following implication:

*f₊ₘ = f₊*, *m = 1, 2, …, M*  ⇒  TPRₘ(*T*) = TPR(*T*)    (14)

We can use equation (14) in the same way as (12b) was used to prove (13):

Σₘ₌₁ᴹ (*I₋ₘ*/*I₋*) ROC-AUCₘ = −∫ TPR(*T*) [Σₘ (*I₋ₘ*/*I₋*) FPR′ₘ(*T*)] d*T* = −∫ TPR(*T*) FPR′(*T*) d*T* = ROC-AUC

In the case of having the same number of elements from the negative class *I₋ₘ* in every device, this equation reduces to (9), i.e. the average ROC-AUC score is again equal to the ROC-AUC score of the entire data set.

To illustrate this, we can use again the same example with 4 devices that we have used in the previous section. We use the same data which in this case is distributed among the devices in a way that only the data points belonging to the positive class are identically distributed among all devices.

In this particular case the ROC-AUC scores of the four devices are different:

but the average ROC-AUC score and the ROC-AUC score for the entire data set are equal up to the third digit, as expected:

An interactive chart that allows the user to experiment with different degrees of similarity between the score distributions *f₊* and *f₋* is linked in the *Resources* section at the end of the post.

## Case 2.3: Neither the f₊ nor the f₋ distributions among the devices are identical

In this case, we cannot expect that equation (9) will be fulfilled. In our numerical example, we indeed see that both sides of (9) are not equal if *f₊* and *f₋* are different among the devices.

# Summary

In this article, we have looked at how a weighted average of ROC-AUC scores among multiple devices, each of them holding a local data set, changes in comparison to the case of calculating the ROC-AUC score of the complete data set on a single device. A sufficient condition for equivalence of both metrics is that either the distribution of the positive or of the negative scores among all devices is identical.

This work was done within the scope of the polypoly project.

**Resources**:

- Interactive visualization: http://35.234.91.20:80
- Source code used to generate the examples in the article: https://gist.github.com/ImScientist/764484ef4a04cd40e6512c078e869d0e
- Source code for the interactive visualization: https://github.com/ImScientist/plotly-web-app