https://leaf.cmu.edu/

Realistic Federated Datasets for Federated Learning

LEAF: A Benchmark for Federated Settings

Disassembly
Oct 7, 2020


In statistics, data can be IID (independent and identically distributed) or non-IID.

Independent means that the samples we draw do not depend on each other: knowing one sample tells us nothing about another. Identically distributed means that every sample is drawn from the same distribution.
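To make the definition concrete, here is a minimal sketch of what an IID split of a centralized dataset across clients could look like. The function name `iid_split` and the toy labels are my own assumptions for illustration, not something taken from LEAF. Because the indices are shuffled before being dealt out, every client ends up with roughly the same label distribution.

```python
import random
from collections import Counter


def iid_split(labels, num_clients, seed=0):
    """Shuffle the sample indices and deal them out evenly to the clients.

    Hypothetical helper for illustration; not part of LEAF.
    """
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    # Client i takes every num_clients-th index, starting at position i.
    return [indices[i::num_clients] for i in range(num_clients)]


# Toy data (assumption): 100 samples for each of the 10 digit classes.
labels = [digit for digit in range(10) for _ in range(100)]
clients = iid_split(labels, num_clients=5)

# Print the label histogram of each client.
for client_id, idx in enumerate(clients):
    histogram = Counter(labels[i] for i in idx)
    print(client_id, sorted(histogram.items()))
```

Each client's histogram should contain all ten digits in similar proportions, which is exactly what real federated clients usually do not look like.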

As you can imagine, it does not make sense to assume that real-world data in federated learning is IID. Each client may have its own unique habits. Therefore, we need non-IID data, which is what LEAF provides. You can find more information in the LEAF paper. It is a simple but useful piece of work for federated learning.

I am quoting the paper's descriptions directly here, since I believe it explains well which datasets LEAF includes.

  • Federated Extended MNIST (FEMNIST), which is built by partitioning the data in Extended MNIST based on the writer of the digit/character.
  • Sentiment140, an automatically generated sentiment analysis dataset that annotates tweets based on the emoticons present in them. Each device is a different Twitter user.
  • Shakespeare, a dataset built from The Complete Works of William Shakespeare. Here, each speaking role in each play is considered a different device.
  • CelebA, which partitions the Large-scale CelebFaces Attributes Dataset by the celebrity on the picture.
  • Reddit, where we preprocess comments posted on the social network on December 2017.
  • A Synthetic dataset, which modifies a synthetic dataset presented in earlier work (cited in the paper) to make it more challenging for current meta-learning methods. See Appendix A of the paper for details.

The points above are quoted from the paper, where you can find more detail on each dataset.
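What most of these datasets have in common is that the samples are grouped by a natural key (the writer, the Twitter user, the speaking role, the celebrity). Here is a rough sketch of that idea. It is not LEAF's actual preprocessing code: the `partition_by_key` helper, the record fields, and the toy tweets are all made up for illustration.

```python
from collections import defaultdict


def partition_by_key(samples, key):
    """Group samples into clients by a natural key such as the user.

    Hypothetical helper for illustration; not LEAF's preprocessing.
    """
    clients = defaultdict(list)
    for sample in samples:
        clients[sample[key]].append(sample)
    return dict(clients)


# Toy records (assumption): each tweet carries its author and a sentiment label.
tweets = [
    {"user": "alice", "text": "great day :)", "label": 1},
    {"user": "bob", "text": "missed my bus :(", "label": 0},
    {"user": "alice", "text": "love this song :)", "label": 1},
]

by_user = partition_by_key(tweets, "user")
for user, samples in by_user.items():
    print(user, len(samples))
```

Each key becomes one client, so the number of samples and the label mix per client come from the data itself instead of being chosen by us.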

What if we need to split a centralized dataset into a federated dataset?

With MNIST, for example, you could send client 0 only the digit 0, client 1 only the digit 1, and so on. However, that is not the best way to achieve a realistic federated dataset. I believe a model trained on such a split will perform worse than one trained on a LEAF dataset, because the extreme label skew makes learning tougher.
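The one-digit-per-client split described above can be sketched in a few lines. This is only an illustration under my own assumptions: `one_digit_per_client` is a hypothetical helper, and the labels are toy values standing in for real MNIST labels.

```python
def one_digit_per_client(labels, num_classes=10):
    """Give client d the indices of all samples whose label is d.

    Hypothetical helper for illustration; labels must be ints in [0, num_classes).
    """
    clients = [[] for _ in range(num_classes)]
    for index, label in enumerate(labels):
        clients[label].append(index)
    return clients


# Toy labels (assumption): 50 samples cycling through the digits 0..9.
labels = [i % 10 for i in range(50)]
clients = one_digit_per_client(labels)

print(clients[3])  # [3, 13, 23, 33, 43]

# Every client sees exactly one class, which is the extreme non-IID case.
for digit, idx in enumerate(clients):
    assert {labels[i] for i in idx} == {digit}
```

No client ever sees more than one digit here, which is much more skewed than the naturally partitioned LEAF datasets.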

Leave a comment if you know a better way to split a centralized dataset into a federated dataset, or if you find something wrong in my article.

Want to know more about FL?

Architecture of three Federated learning

(Summary)Federated Learning: Strategies for Improving Communication Efficiency

Federated Learning Aggregate Method (1)

