FROM ZERO TO BIG DATA

Richard Simoes
Published in noiselessdata · Jul 19, 2015 · 2 min read

Data analysis as we know it today is nothing new as a practice. People have been doing similar statistical research for decades; fields like data mining, operations management, and business intelligence are clear examples of this.

From an analyst's perspective, what is really disruptive is the scale and the nature of the datasets that can now be used to improve the outcome of this kind of analysis, and the fact that all of this is economically feasible.

The latter is closely related to the widespread adoption of the cloud and open source technologies that have democratized the large-scale computing required to process the volume of data available. In our previous post we talked a little about the evolution of the newest set of technologies; now we are going to show you the easiest way to get started using them.

My recommendation for anyone who wants to learn and get hands-on experience with the kind of setup required to build today's data pipelines is to start with the sandboxes offered by providers like Cloudera or Hortonworks. These are distributed as virtual machines that you can boot using either the VirtualBox or VMware hypervisor. Both Cloudera and Hortonworks also offer a large set of tutorials that you can follow to start doing interesting things with them.
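For example, once you have downloaded one of the sandbox images for VirtualBox, you can import and boot it from the command line instead of the GUI. This is a minimal sketch using VirtualBox's `VBoxManage` tool; the file and VM names below are illustrative, so substitute the ones from your actual download:

```shell
# Import the appliance (.ova/.ovf) that ships in the sandbox download.
# The filename here is illustrative; use the file you actually downloaded.
VBoxManage import cloudera-quickstart-vm.ova

# List the registered VMs to confirm the import and get the exact VM name.
VBoxManage list vms

# Boot the sandbox with the VirtualBox GUI.
VBoxManage startvm "cloudera-quickstart-vm" --type gui
```

Once the VM finishes booting you can follow any of the providers' tutorials from inside the guest, exactly as you would after importing through the VirtualBox GUI.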

To download the latest QuickStart VM from Cloudera go here and follow the instructions.

To download the latest Sandbox from Hortonworks go here and follow the instructions.

For those of you who are more experienced, I found the following table interesting: it shows the versions of the packages currently supported by each of the major big data platforms. Credit for the research goes to the great Merv Adrian.

[Table: package versions supported by the major Hadoop distributions, 2015]



Pragmatic computer engineer and data analysis nerd.