Big Data Strategies: Build applications, not infrastructure

The variety of data now available to organizations is remarkable: internally, there is website clickstream data, typed notes from call centre operators, and e-mail and instant messaging repositories; externally, open data initiatives from public and private entities have made massive troves of raw data available for analysis. The challenge is that traditional tools are poorly equipped to deal with the scale and complexity of much of this data.

The critical data of software development organizations used to be limited to their transactional databases and data warehouses. In these kinds of systems, data was organized into orderly rows and columns, where every byte of information was well understood in terms of its nature and its business value. These databases and warehouses are still extremely important, but businesses are now differentiating themselves by how they find value in the large volumes of data that are not stored in a tidy database.

If we wish to use a cluster of hardware as flexibly as possible, hosting multiple parallel workloads, the answer is to push the smarts into the software and away from the hardware.

In this model, the hardware is treated as a pool of resources, and the responsibility for allocating hardware to a particular workload falls to the software layer. This allows the hardware to be generic, and hence both easier and less expensive to acquire, while the logic for using it efficiently moves into the software, where the knowledge needed to perform that task resides.
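
To make this concrete, here is a minimal Java sketch of the idea, using hypothetical names (Node, Request, Allocator) and a toy capacity model (records require Java 16+). Production resource managers such as Apache YARN or Mesos add queues, fairness policies, and data locality on top of the same principle.

```java
import java.util.Comparator;
import java.util.List;

/** A node is just a bundle of generic capacity; it carries no task-specific logic. */
record Node(String id, int freeCores, int freeMemGb) {}

/** A workload's resource request, expressed to the software layer. */
record Request(String workload, int cores, int memGb) {}

class Allocator {
    /** Pick the node with the most free capacity that can satisfy the request. */
    static Node allocate(List<Node> cluster, Request req) {
        return cluster.stream()
                .filter(n -> n.freeCores() >= req.cores() && n.freeMemGb() >= req.memGb())
                .max(Comparator.comparingInt(Node::freeCores))
                .orElse(null); // no node can host this workload right now
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(new Node("n1", 8, 32), new Node("n2", 16, 64));
        Node chosen = allocate(cluster, new Request("etl-job", 4, 16));
        System.out.println("placed on " + (chosen == null ? "no node" : chosen.id()));
    }
}
```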

When thinking about this design, many people focus on the questions of data movement and processing. But anyone who has ever built such a system knows that less obvious elements, such as job scheduling, error handling, and coordination, are where much of the real magic lies.
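
As one example of this hidden machinery, a cluster framework typically detects failed tasks and re-executes them automatically. The sketch below uses a hypothetical Retry helper and an illustrative policy (a fixed number of attempts with linear backoff) to show the kind of wrapper that would otherwise need to surround every unit of work.

```java
import java.util.concurrent.Callable;

/** Minimal retry wrapper, illustrating the error handling a cluster framework
 *  performs on every task so that application code does not have to. */
class Retry {
    static <T> T withRetries(Callable<T> task, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();             // attempt the unit of work
            } catch (Exception e) {
                last = e;                       // remember the failure...
                Thread.sleep(1000L * attempt);  // ...and back off before retrying
            }
        }
        throw last; // surfaced only once the framework gives up
    }
}
```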

If we had to implement the mechanisms for deciding where to execute processing, performing that processing, and combining all the sub-results into the overall result, we wouldn't have gained much over the older model, where we needed to explicitly manage data partitioning; we'd just be exchanging one difficult problem for another.
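
To see the burden, consider what manually partitioning, processing, and combining looks like even in the simplest single-machine case. This toy Java example sums a list across four threads; on a real cluster, serialization, data placement, and node failure would also be ours to handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Manual split/process/merge: we pick the partition boundaries,
 *  schedule each slice, and combine the sub-results ourselves. */
class ManualAggregation {
    public static void main(String[] args) throws Exception {
        List<Integer> data = new ArrayList<>();
        for (int i = 1; i <= 1_000_000; i++) data.add(i);

        int partitions = 4;
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        List<Future<Long>> subResults = new ArrayList<>();

        int chunk = data.size() / partitions;
        for (int p = 0; p < partitions; p++) {
            int from = p * chunk;
            int to = (p == partitions - 1) ? data.size() : from + chunk;
            List<Integer> slice = data.subList(from, to);
            // we decide where and how each slice is processed
            subResults.add(pool.submit(() ->
                    slice.stream().mapToLong(Integer::longValue).sum()));
        }

        long total = 0;
        for (Future<Long> f : subResults) total += f.get(); // merge sub-results
        pool.shutdown();
        System.out.println("total = " + total);
    }
}
```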

This touches on the most recent trend, which we'll highlight here: a system that handles most of the cluster mechanics transparently and lets the developer think in terms of the business problem. Frameworks that provide well-defined interfaces abstracting all this complexity (smart software), upon which business domain-specific applications can be built, give the best combination of developer and system efficiency.
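
The canonical illustration is word count on Hadoop MapReduce, reproduced here as a sketch of the pattern rather than a tuned implementation: the developer supplies only the per-record map logic and the per-key reduce logic, while the framework handles partitioning, scheduling, the shuffle, and failure recovery.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    /** Per-record business logic: emit (word, 1) for every token in a line. */
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    /** Per-key business logic: sum the counts the framework groups for each word. */
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that nothing in this class mentions nodes, partitions, or retries; those concerns live entirely in the framework, which is exactly the division of labour this trend is about.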
