235x faster than Hadoop
http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html — a must-read article for anyone interested in big data processing, MapReduce, and Hadoop.
In my experience, something as simple as a quick command-line application in Java (I am far from an expert on the more interesting shell options like xargs or awk, although grep | find | wc goes a long way) is often quicker to write and run, and far more explicit about its CPU and memory profile, than provisioning a cluster of Hadoop or Spark jobs. It easily scales up to hundreds of gigabytes too.
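A minimal sketch of the pipeline style the linked article benchmarks: a streaming count-and-aggregate over many files, in roughly constant memory, with no cluster involved. The sample files and the `/tmp/demo_logs` path are placeholders standing in for a real multi-gigabyte dataset.

```shell
# Stand-in data: a few small files playing the role of a large log directory.
mkdir -p /tmp/demo_logs
printf 'ok\nerror\nok\n' > /tmp/demo_logs/a.log
printf 'error\nok\n' > /tmp/demo_logs/b.log

# Stream every line, group identical lines, count them, sort by frequency.
# Each stage processes data as it arrives, so memory stays bounded
# regardless of total input size.
cat /tmp/demo_logs/*.log | sort | uniq -c | sort -rn
```

The same shape — filter, group, count — covers a surprising share of "big data" jobs, which is exactly the article's point.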
The trick, more often than not, is convincing the business that their datasets really are finite.