Why do I need a High Performance Framework?

Simple…we want to process data.

How much data, exactly?

SOME NUMBERS

Facebook

New data per day:
• 200 GB (March 2008)
• 2 TB (April 2009)
• 4 TB (October 2009)
• 12 TB (March 2010)

Google

• Data processed per month: 400 PB (in 2007!)
• Average job size: 180 GB

What if you have that much data?

What if you have just 1% of that amount?

“No Problemo”, you say?

Reading 180 GB sequentially off a disk takes roughly 45 minutes, and you only have 16 to 64 GB of RAM per computer, so you can't process everything at once. The general rule of modern computers:

data can be processed much faster than it can be read
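
As a rough sanity check, here is the arithmetic behind that number as a small Python sketch. The ~70 MB/s sequential read speed is an assumption chosen because it reproduces the ~45 minute figure quoted above; it is not a measured value.

```python
# Back-of-the-envelope I/O arithmetic (assumed numbers, not measurements)

JOB_SIZE_GB = 180      # average job size from the Google numbers above
DISK_MB_PER_S = 70     # assumed sequential throughput of a single disk

seconds = JOB_SIZE_GB * 1024 / DISK_MB_PER_S
print(f"~{seconds / 60:.0f} minutes to read {JOB_SIZE_GB} GB sequentially")
# prints "~44 minutes", which is where the ~45 minute figure comes from
```

Split the same 180 GB across, say, 60 disks and each one only has to read 3 GB, which is under a minute of I/O per node. That is the whole point of parallelizing the reads.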

Solution: parallelize your I/O. But now you need to coordinate what you're doing, and that's hard. What if a node dies? Is the data lost? Will other nodes in the grid have to restart? How do you coordinate all of this?

ENTER: OUR HERO

Introducing MapReduce.

In the olden days, the workload was distributed across a grid and the data was shipped around between nodes, or even stored centrally on something like a SAN. That was fine for small amounts of information, but today, on the web, we have big data and an I/O bottleneck.

Then, in 2004, along came a Google publication: MapReduce: Simplified Data Processing on Large Clusters (http://labs.google.com/papers/mapreduce.html).

Now:
• the data is distributed
• computing happens on the nodes where the data already is
• processes are isolated and don't communicate (share-nothing)
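
To make the model concrete, here is a minimal word-count sketch of MapReduce in plain Python. This is not Hadoop and not Google's implementation; the function names and sample input are made up for illustration. The point is the shape of the computation: each map call sees only its own chunk of input, each reduce call sees only the values for one key, and the two never communicate directly (share-nothing).

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    """Runs on the node that already holds this chunk of data."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """Receives all values for one key after the shuffle/sort step."""
    yield (word, sum(counts))

lines = ["to be or not to be", "to see or not to see"]

# map: each input split is processed independently
mapped = [pair for line in lines for pair in map_phase(line)]

# shuffle/sort: group intermediate pairs by key (the framework does this)
mapped.sort(key=itemgetter(0))

# reduce: each key is handled independently
for word, group in groupby(mapped, key=itemgetter(0)):
    for result in reduce_phase(word, (count for _, count in group)):
        print(result)   # ('be', 2), ('not', 2), ('or', 2), ('see', 2), ('to', 4)
```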

See more at http://www.slideshare.net/Wombert/largescale-data-processing-with-hadoop-and-php-confoo2012-20120302