Distributed Machine Learning
A Review of current progress
Machine Learning is a long-established research and application field in Computer Science that is rapidly becoming part of our daily lives: think of song and movie recommendation systems, cell phone and web personalization, computer vision, CCTV applications and so forth. One of the main drivers of the current boom in demand for Machine Learning is the huge amount of data produced since the Web 2.0 era. Facebook alone was processing 500 TB of new data per day in 2012 [1], and the Digital Universe is estimated to reach 44 zettabytes in 2020, a 50-fold growth since 2010 [2]. With such a pile of untapped data at hand, companies need to make use of it by finding patterns and insights, which in turn can lead to better business performance and a better understanding of users. This is where Machine Learning comes to the rescue.
As part of my studies I was asked to conduct a review of the current status of Distributed Machine Learning, and here is what I found:
1- Efforts to tackle this problem started in 2007 with a Stanford research paper [3], followed by the inception of Apache Mahout
2- Many products currently address this problem, such as Apache Mahout, Apache Spark, GraphLab and others
3- Not all machine learning algorithms can be executed in a distributed environment (for example, sequential algorithms in which each step depends on results computed in previous steps, or algorithms that need the whole dataset to compute a single item)
4- Some work is under way towards more declarative Machine Learning
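To make point 3 more concrete, here is a minimal sketch of the kind of algorithm that does distribute well. The paper in [3] targets algorithms that can be written in a "summation form", where the result depends only on sums over the data points. A simple least-squares line fit is one such case: each partition can be reduced to five sums, and the sums from all partitions are simply added together. The function names and the sample data below are my own, made up for illustration and not taken from the paper or from any of the products above, and the built-in map only stands in for real workers.

```python
from functools import reduce


def partial_stats(chunk):
    """Map step: summarize one partition of (x, y) pairs as five sums."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return (n, sx, sy, sxx, sxy)


def merge_stats(a, b):
    """Reduce step: sums from two partitions combine by plain addition."""
    return tuple(u + v for u, v in zip(a, b))


def fit_line(stats):
    """Solve least squares for y = slope * x + intercept from the merged sums."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


def distributed_fit(partitions):
    # On a real cluster each partial_stats call would run on a different
    # worker; here the built-in map stands in for that (an assumption made
    # to keep the sketch self-contained).
    return fit_line(reduce(merge_stats, map(partial_stats, partitions)))


# Hypothetical data: points on the line y = 2x + 1, split into three partitions.
points = [(x, 2 * x + 1) for x in range(12)]
partitions = [points[0:4], points[4:8], points[8:12]]
slope, intercept = distributed_fit(partitions)
print(slope, intercept)
```

No worker ever needs to see another worker's data, only the five numbers it produces, which is why this style of computation spreads across machines easily. An algorithm whose next step needs the full result of the previous one has no such independent map step.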
You can read the full review at this link:
http://www.researchgate.net/publication/277020368_Distributed_Machine_Learning_A_Review_of_current_progress
References
[1] CNet, “Facebook processes more than 500 TB of data daily” http://www.cnet.com/news/facebook-processes-more-than-500-tb-of-data-daily
[2] EMC, “New Digital Universe Study Reveals Big Data Gap” http://www.emc.com/about/news/press/2012/20121211-01.htm
[3] Chu, Cheng, et al. “Map-reduce for machine learning on multicore.” Advances in Neural Information Processing Systems 19 (2007): 281. http://ai.stanford.edu/~ang/papers/nips06-mapreducemulticore.pdf