The term “Big Data” has been in use since the 1990s. It refers to the study and application of data sets that are too large or complex for traditional data-processing software, and it is commonly defined by three key concepts called “The Three Vs”.
What are the three Vs?
Volume: The quantity of data being generated and stored. Big data typically arrives in high volumes, and much of it is unstructured. Examples include Twitter data feeds and clickstreams from a web page or a mobile application.
Velocity: The speed at which data is generated and processed. High-velocity data often streams directly into memory instead of being written to a database first, and it is often handled in real time, which allows us to analyze data as it is being generated.
Variety: The many types of data that are available. By commonly cited estimates, about 80% of the world’s data is now unstructured. With Big Data, text, images, audio, and video can now be processed and analyzed to derive meaning and insight.
But wait, there’s more!
Veracity: This refers to the quality and trustworthiness of the data being generated and processed. With enormous amounts of data being generated every second, quality and accuracy are harder to control. (Just think about all the fake news on Facebook and Twitter.)
Value: Probably the most important of all the Vs: VALUE!!! What’s the point of collecting and analyzing all this data if it does not produce any value? Big data analysis can become very costly, so businesses need to be very clear about the costs and benefits before starting any big data initiative.
The Importance of Big Data
Fun Fact: More data is being generated than ever before, and the amount keeps growing. According to IBM, as of 2012, 2.5 exabytes (2.5×10¹⁸ bytes) of data were being generated every day. IDC estimates that between 2013 and 2020, the amount of data that we generate will increase from 4.4 zettabytes to 44 zettabytes. And by 2025, there will be over 163 zettabytes of data. That’s a lot of data!
Here’s one zettabyte: 10²¹ (1,000,000,000,000,000,000,000 bytes).
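To get a feel for these numbers, here is a small Python sketch that works through the arithmetic. It only uses the figures quoted above (the IBM and IDC estimates) and decimal byte units. The variable names are made up for illustration, and the "annual growth" line assumes steady compound growth, which is a simplification.

```python
from fractions import Fraction

EXABYTE = 10**18      # bytes (decimal units)
ZETTABYTE = 10**21    # bytes (decimal units)

# IBM's 2012 figure: 2.5 exabytes generated per day.
daily_bytes_2012 = 25 * 10**17

# Days needed at that rate to accumulate one zettabyte.
days_per_zettabyte = ZETTABYTE // daily_bytes_2012

# IDC's estimate: 4.4 ZB in 2013 growing to 44 ZB in 2020.
growth_factor = Fraction(44) / Fraction(44, 10)
years = 2020 - 2013

# Assumption: steady compound growth over those years.
annual_growth = float(growth_factor) ** (1 / years) - 1

print(days_per_zettabyte)        # 400
print(growth_factor)             # 10
print(f"{annual_growth:.1%}")
```

In other words, at the 2012 rate it would take 400 days to produce a single zettabyte, and a tenfold increase over seven years works out to roughly 39% growth per year.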
How we’re using all this data:
Predictive analytics: Using a variety of statistical techniques, such as data mining (discovering patterns in large data sets), predictive modeling (using statistics to predict outcomes), and machine learning (using statistical techniques to give computers the ability to “learn”), to analyze current data and predict future events.
User behavior analytics (UBA): Defined by Gartner as a cybersecurity process in which big data is used to analyze human behavior patterns, and algorithms and statistical analysis are then applied to detect potential threats. Instead of tracking devices or security events, UBA tracks a system’s users for any suspicious activity.
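As a toy illustration of the statistical side of UBA, the sketch below flags a user whose activity today is far outside their own history. Everything here is an assumption made for the example: the function name is_suspicious, the sample login counts, and the threshold of 3 standard deviations. Real UBA tools use much richer behavior models than this.

```python
from statistics import mean, stdev

def is_suspicious(history, today, threshold=3.0):
    """Flag today's count if it is far from the user's own baseline.

    history: past daily login counts for one user (hypothetical data)
    today: today's login count
    threshold: how many standard deviations count as unusual (assumed)
    """
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # No variation in the past: any change at all is unusual.
        return today != baseline
    return abs(today - baseline) / spread > threshold

# Hypothetical users: (past daily logins, today's logins)
activity = {
    "alice": ([4, 5, 6, 5, 4, 5, 6], 5),
    "bob": ([3, 4, 3, 4, 3, 4, 3], 40),
}

flags = {user: is_suspicious(past, now) for user, (past, now) in activity.items()}
print(flags)  # {'alice': False, 'bob': True}
```

Note that the check is per user: 40 logins is only suspicious because it is so far from what that particular user normally does.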
How is Big Data being analyzed?
One of the most popular software frameworks for analyzing big data is Apache Hadoop, which was first released in 2006 (version 1.0 followed in late 2011). Its core components are the Hadoop Distributed File System (HDFS) and MapReduce. Hadoop splits files into large blocks and distributes them across the nodes in a cluster. Each node can then process the data stored on it, in parallel with the other nodes. This allows for faster and more efficient data processing.
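The idea of splitting a file into blocks and spreading them over nodes can be sketched in a few lines of Python. This is not how HDFS is actually implemented: the function names, the tiny block size, the node names, and the simple round-robin placement are all assumptions for illustration. Real HDFS uses much larger blocks (128 MB by default in recent versions) and also keeps several copies of each block on different nodes.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_blocks(num_blocks, nodes):
    """Place block IDs on nodes in round-robin order (assumed policy)."""
    placement = {node: [] for node in nodes}
    for block_id in range(num_blocks):
        node = nodes[block_id % len(nodes)]
        placement[node].append(block_id)
    return placement

# A pretend 1,000-byte "file" and a tiny 256-byte block size.
file_data = b"x" * 1000
blocks = split_into_blocks(file_data, 256)

# Hypothetical three-node cluster.
placement = assign_blocks(len(blocks), ["node-a", "node-b", "node-c"])

print([len(block) for block in blocks])  # [256, 256, 256, 232]
print(placement)  # {'node-a': [0, 3], 'node-b': [1], 'node-c': [2]}
```

Once the blocks are spread out like this, each node can work on its own blocks at the same time as the others, which is where the speedup comes from.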
MapReduce refers to the two separate tasks (Map and Reduce) that Hadoop uses to process the data. The Map task takes a set of data and converts it into another set of data, in which individual elements are broken down into key-value pairs. The Reduce task then takes the output of the Map task and combines those key-value pairs into a smaller set of key-value pairs.
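The classic way to see this in action is counting words. The sketch below imitates the Map and Reduce steps in plain Python on a single machine. It does not use Hadoop's API; the function names map_phase, shuffle, and reduce_phase and the sample sentences are made up for the example. The shuffle step, which groups pairs by key between Map and Reduce, is handled by the framework in real Hadoop.

```python
from collections import defaultdict

def map_phase(line):
    """Map: turn one line of text into (word, 1) key-value pairs."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single pair."""
    return key, sum(values)

lines = ["big data is big", "data is everywhere"]

mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())

print(len(mapped))  # 7
print(counts)       # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Seven key-value pairs come out of the Map step, and the Reduce step boils them down to four, one per distinct word.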