Big Data Nomenclature for Managers

An explained list of common big data concepts for managers

This article aims to explain a few common concepts and terms from the big data world for an audience that is less technical than an engineering one but at least somewhat familiar with the big data space. The terms are in no particular order, and a bit of technical language is used. Reach out if something is not clear!


Relational database management system (RDBMS)

A system that stores structured data in a predetermined schema (tables) and scales vertically through large symmetric multiprocessing (SMP) servers or horizontally through clustering software. These databases are usually easy to create, access, and extend. The standard language for relational database interoperability is the Structured Query Language (SQL).
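
As a minimal illustration of the relational model, here is a sketch using Python's built-in sqlite3 module (the table and column names are invented for the example):

```python
import sqlite3

# Create an in-memory relational database with a fixed schema (a table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.execute("INSERT INTO customers (name, country) VALUES (?, ?)", ("Alice", "US"))
conn.execute("INSERT INTO customers (name, country) VALUES (?, ?)", ("Bob", "UK"))

# SQL is the standard language for querying the structured data.
for row in conn.execute("SELECT name FROM customers WHERE country = 'US'"):
    print(row)  # ('Alice',)
```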


Non-relational database

A database that does not store data in tables but makes it accessible through special query APIs. These databases are commonly grouped under the label Not Only SQL (NoSQL): they do not impose a fixed schema, they relax consistency guarantees through the BASE model (basically available, soft state, eventually consistent), and they use sharding (horizontal partitioning) to scale horizontally. Examples are MongoDB and CouchDB, which both store JSON-style documents but differ in how they organize and query them. NoSQL databases commonly use the JavaScript Object Notation (JSON) data format (MongoDB uses BSON, a binary form of JSON), and they mainly work as key-value stores (KVS), i.e., collections of varied data types that are not known in advance (while an RDBMS stores data in tables knowing exactly the data type of each column).
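
To make the contrast concrete, here is a minimal sketch of the schema-less, key-value idea using only Python's standard library (the keys and documents are invented for the example):

```python
import json

# A key-value store: values are JSON documents with no fixed schema.
store = {}
store["user:1"] = json.dumps({"name": "Alice", "interests": ["ski", "chess"]})
store["user:2"] = json.dumps({"name": "Bob", "age": 42})  # different fields: no schema

doc = json.loads(store["user:2"])
print(doc["age"])  # 42
```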


Programming language

It is a formally constructed language designed to communicate instructions to a machine. The main ones for data science applications are Java, C, C++, C#, R, and Matlab. Scala is another language that is becoming extremely popular right now, and it is an example of a functional language.
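
To give a flavor of the functional style that Scala popularized, here is the same toy computation written imperatively and then functionally, in Python rather than Scala itself:

```python
nums = [1, 2, 3, 4]

# Imperative style: mutate state step by step.
total = 0
for n in nums:
    total += n * n

# Functional style: compose pure functions, no mutation.
total_fn = sum(map(lambda n: n * n, nums))

assert total == total_fn == 30
```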


Hadoop

An open source software framework for analyzing huge amounts of data on a distributed system. Its primary storage system is the Hadoop Distributed File System (HDFS), which replicates the data and distributes it across different nodes. Hadoop is written in Java. It is a core technology of the big data revolution: it stores data in its native raw format, and it can be used for several purposes (Dull, 2014), such as a simple data staging or landing platform complementary to the existing EDW (acting as an enterprise data hub, i.e., EDH), or for managing data (even small data), transforming it into a specific format in HDFS and sending it back to the EDW, thus lowering costs while increasing processing power. Furthermore, it can integrate external data sources, archive data (whether on-premises or in the cloud), and reduce the burden on a standard EDW.


MapReduce

A software framework for processing huge amounts of data in parallel. It works in two phases: a map step that processes the data independently across the nodes of the cluster, and a reduce step that aggregates the partial results.
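
The canonical example is counting words. The sketch below simulates the two phases in plain Python on a single machine; on a real cluster, the map and reduce steps would run in parallel on different nodes:

```python
from collections import defaultdict

documents = ["big data is big", "data is everywhere"]

# Map phase: each document is turned into (key, value) pairs independently,
# so different documents can be processed on different nodes.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group all values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate each group, again parallelizable per key.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```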


Flume

A service to gather, aggregate, and move large amounts of data (typically log data) from several sources to a centralized system.


Cassandra

An open source database system for analyzing large amounts of data on a distributed system. It is characterized by high performance and by high availability with no single point of failure (i.e., no component whose failure would stop the whole system). It fosters data denormalization, which means grouping data or adding redundant information in order to optimize read performance.
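
A minimal sketch of denormalization in plain Python (the records are invented for the example): instead of joining two tables at read time, the customer's name is copied into every order, so each query touches a single record.

```python
# Normalized: reading an order requires a second lookup into customers.
customers = {1: {"name": "Alice"}}
orders_normalized = [{"order_id": 100, "customer_id": 1, "total": 25.0}]

# Denormalized: the name is stored redundantly with each order,
# trading extra storage for faster, join-free reads.
orders_denormalized = [
    {"order_id": 100, "customer_id": 1, "customer_name": "Alice", "total": 25.0},
]
print(orders_denormalized[0]["customer_name"])  # no second lookup needed
```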


Distributed System

Multiple machines (nodes) that communicate with one another. A problem is divided into many tasks, and each task is assigned to a node. The system scales easily as further nodes are added.
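
The sketch below mimics the idea on a single machine with Python's multiprocessing module: the problem (summing a large list) is split into tasks, each handled by a separate worker process, and adding workers scales the system.

```python
from multiprocessing import Pool

def task(chunk):
    # Each worker processes its own piece of the problem.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]  # divide the problem into 4 tasks
    with Pool(processes=4) as pool:          # 4 worker "nodes"
        partial_results = pool.map(task, chunks)
    print(sum(partial_results))  # combine partial results: 499999500000
```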


Google File System

Google's proprietary distributed file system for managing large datasets efficiently.


HBase

An open source non-relational database (column-oriented) built on top of HDFS. It is very useful for real-time random read and write access to data, as well as for storing sparse data (small, specific chunks of data within a vast collection). Its proprietary counterpart at Google is called Bigtable.
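
A rough sketch of the column-oriented, sparse idea in plain Python (row keys and column names are invented): each row stores only the columns it actually has, so missing values cost nothing.

```python
# row key -> {column family:qualifier -> value}; rows can have different columns.
table = {
    "user#1": {"info:name": "Alice", "info:email": "a@example.com"},
    "user#2": {"info:name": "Bob"},  # sparse: no email column stored at all
}
print(table["user#2"].get("info:email"))  # None: nothing was ever stored
```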


Enterprise Data Warehouse (EDW)

A system used for analysis and reporting that consists of central repositories of integrated data from a wide spectrum of different sources. The typical process feeding an EDW is extract-transform-load (ETL), the most representative case of bulk data movement. Three other important related systems are data marts (i.e., subsets of the EDW extracted in order to address a specific question), online analytical processing (OLAP), used for multidimensional, low-frequency analytical queries, and online transaction processing (OLTP), used instead for high-volume, fast transactional data processing. The wider system that includes a set of servers, storage, operating systems, databases, business intelligence, data mining, etc., is called a data warehouse appliance (DWA).
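
A toy ETL sketch in Python (the fields and records are invented for the example): extract raw records from a source, transform them into the warehouse schema, and load them into a relational store.

```python
import sqlite3

# Extract: pull raw records from a source system (here, a hard-coded list).
raw = [{"name": "alice", "amount": "25.0"}, {"name": "bob", "amount": "40.5"}]

# Transform: clean and cast the records into the warehouse schema.
rows = [(r["name"].title(), float(r["amount"])) for r in raw]

# Load: bulk-insert into the warehouse (an in-memory SQLite table here).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?)", rows)
print(warehouse.execute("SELECT SUM(amount) FROM sales").fetchone())  # (65.5,)
```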


Resilient Distributed Datasets (RDD)

A logical collection of data partitioned across machines. The best-known example is Spark, an open source cluster computing framework designed to accelerate analytics on Hadoop thanks to multi-stage in-memory primitives (basic data types defined in programming languages or built in with their support). It is reported to run up to 100 times faster than Hadoop MapReduce for some workloads, but its disadvantage is that it does not provide its own distributed storage system.
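
A minimal PySpark sketch of working with an RDD, assuming a local Spark installation with the pyspark package available (the app name and data are invented for the example):

```python
from pyspark import SparkContext

sc = SparkContext("local", "rdd-example")

# An RDD: a logical collection of data partitioned across machines
# (here, partitioned across local threads).
rdd = sc.parallelize([1, 2, 3, 4], numSlices=2)

# Transformations are lazy; the in-memory pipeline runs only when an
# action (reduce) is called.
result = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(result)  # 30

sc.stop()
```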


Hive

An additional example of EDW infrastructure, built on top of Hadoop, that facilitates data summarization, ad-hoc queries, and specific analyses of large datasets.


Pig

A platform for processing huge amounts of data through its native programming language, called Pig Latin. Pig scripts are compiled into sequences of MapReduce jobs that run on the cluster.


Scripting Language

It is a programming language that supports scripts: pieces of code written for a run-time environment that interprets them (rather than compiling them) and that automate the execution of tasks. The main ones in the big data field are Python, JavaScript, PHP, Perl, Ruby, and Visual Basic Script.
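
A tiny example of such a script in Python: it is interpreted at run time and automates a small task (the log lines are invented for the example).

```python
# Count error lines in a log: a typical small automation task for a script.
log_lines = [
    "2016-01-01 INFO started",
    "2016-01-01 ERROR disk full",
    "2016-01-02 ERROR timeout",
]
errors = [line for line in log_lines if "ERROR" in line]
print(f"{len(errors)} errors found")  # 2 errors found
```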


Data Mart

It is a subset of the data warehouse used for a specific purpose. Data marts are therefore department-specific or related to a single line of business (LoB). The next level of data marts is the virtual data mart, i.e., a virtual layer that creates various views of data slices: instead of physically creating a data mart, it just takes a snapshot of the data. The final evolution is instead the data lake, a massive repository of unstructured data with substantial computational capability. Hence, data marts physically create repositories (slices) of data; virtual data marts leave the data where they are and create virtual constructs, reducing the cost of transferring and replicating them; and data lakes work like virtual data marts but accept any kind of data format.
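
The physical-versus-virtual distinction can be sketched in SQL via Python's sqlite3 module (the table and view names are invented for the example): a physical data mart copies a slice of the warehouse into its own table, while a virtual data mart is just a view over it.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE warehouse_sales (region TEXT, amount REAL)")
db.executemany("INSERT INTO warehouse_sales VALUES (?, ?)",
               [("EU", 10.0), ("US", 20.0), ("EU", 5.0)])

# Physical data mart: the EU slice is copied into a separate table.
db.execute("CREATE TABLE mart_eu AS SELECT * FROM warehouse_sales WHERE region = 'EU'")

# Virtual data mart: a view leaves the data where it is.
db.execute("CREATE VIEW vmart_eu AS SELECT * FROM warehouse_sales WHERE region = 'EU'")

print(db.execute("SELECT SUM(amount) FROM mart_eu").fetchone())   # (15.0,)
print(db.execute("SELECT SUM(amount) FROM vmart_eu").fetchone())  # (15.0,)
```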

Reference

Dull, T. (2014). A Non-Geek's Big Data Playbook. SAS Best Practices White Paper. Retrieved from http://www.sas.com/content/dam/SAS/en_us/doc/whitepaper1/non-geeks-big-data-playbook-106947.pdf

Note: the above is an adapted excerpt from my book “Big Data Analytics: A Management Perspective” (Springer, 2016).
