Big Data Terminology: 16 Key Concepts Everyone Should Understand (Part I)

The phrase “Big Data” has been around for a while, and its impact grows by the day, a trend that shows no sign of slowing down.

With that in mind, I am putting together a series of posts for those who might not be too familiar with the subject (you can see my complete beginner’s guide to Big Data in 2017 here). As a companion to my guide, I’ve written a post explaining the meaning of some of the jargon and buzzwords that have built up around this topic.

So here goes. These are the terms that anyone who wants to know more about Big Data should have a general understanding of.

Data-as-a-service, software-as-a-service, platform-as-a-service: these all refer to the idea that data, licences to use data, or platforms for running Big Data technology can be provided “as-a-service” rather than as a distinct product. This reduces the upfront capital investment customers need in order to put their data, or platforms, to work, because the provider bears the costs of setting up and hosting the infrastructure. For customers, as-a-service infrastructure can greatly reduce the initial cost and setup time of getting Big Data initiatives up and running.

Data science is the professional field that deals with turning data into value, such as new insights or predictive models. It brings together expertise from fields including statistics, mathematics, computer science and communication, as well as domain expertise such as business knowledge. The role of data scientist was recently voted the number 1 job in the U.S., based on current demand, salary and career opportunities.

Data mining is the process of discovering insights from data. Because Big Data sets are so large, this is generally done computationally, in an automated way, using techniques such as decision trees, cluster analysis and, most recently, machine learning. Think of it as using the brute mathematical power of computers to spot patterns in data that would otherwise be invisible because of the dataset’s complexity.
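To make the clustering idea concrete, here is a minimal sketch of the k-means algorithm in plain Python, run on a handful of made-up 2-D points. It is illustrative only: the data is invented, and real implementations (such as those in scikit-learn or Spark MLlib) use smarter initialisation and convergence checks than the fixed iteration count used here.

```python
def kmeans(points, k, iters=20):
    """Naive k-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster,
    repeating for a fixed number of iterations.
    Simplification: the first k points seed the centroids."""
    centroids = points[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Index of the nearest centroid (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                              + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Two obvious groups of points; the algorithm should find both centres.
data = [(0.9, 1.1), (1.0, 0.9), (1.1, 1.0),
        (8.9, 9.1), (9.0, 8.8), (9.1, 9.0)]
centroids, clusters = kmeans(data, k=2)
```

Even this toy version shows the pattern-spotting idea: the algorithm is never told there are two groups at coordinates near (1, 1) and (9, 9), yet the centroids settle on them anyway.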

Hadoop is a framework for Big Data computing that has been released as free, open-source software under the Apache licence, so it can be freely used by anyone. It consists of several modules, each tailored to a different vital step of the Big Data process, from file storage (Hadoop Distributed File System, HDFS) to database (HBase) to carrying out data operations (Hadoop MapReduce, see below).
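Hadoop MapReduce itself runs jobs across a cluster (typically written in Java), but the programming model behind it can be sketched in a few lines of plain Python using the classic word-count example: the map step emits a (word, 1) pair for every word, and the reduce step sums the counts per word. This is an illustration of the model only, not how you would invoke Hadoop.

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document.
    In real Hadoop, this runs in parallel across many machines."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce step: group the pairs by word and sum the counts.
    Hadoop shuffles pairs so all counts for a word reach one reducer."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data is big", "data science uses big data"]
counts = reduce_phase(map_phase(docs))
# counts["big"] == 3 and counts["data"] == 3
```

The power of the model is that both phases parallelise naturally: documents can be mapped on different machines at once, and words can be reduced independently, which is what lets Hadoop scale the same simple logic to petabytes.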
