Big Data and Data mining: Where to get started as a Data Analyst

Darren Willenberg
Published in MLthinkbox
Oct 23, 2022

What exactly is Big Data? And how is it used in our everyday lives? In this blog post, we’ll explore those questions and more. Read on to learn all about Big Data and its implications for our daily lives.

Big Data is something we interact with every day, often without realising it. It has become such an integral part of our lives that it is now hard to imagine living without it. By one widely quoted estimate, the world generates more than 2.5 quintillion bytes of data every day, and the total volume of data created each year is expected to grow from about 100 zettabytes in 2022 to about 180 zettabytes by 2025. A zettabyte is 10^21 bytes; that is a huge number of zeros, and it is really hard for me to fathom the significance of this number. Anyway, let me start with a graph and try to unpack this further.

Volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2020, with forecasts from 2021 to 2025 (Statista, 2022)

What is Big Data and How Does it Work?

With the recent emergence of the internet, digitalization, and a connected world, there has been exponential growth in data. This has led to the term “Big Data” being coined. Big Data is data that is too large or complex to process with traditional methods. It can be anything from social media posts to medical records and transactions on e-commerce websites.

When we share our photos on social media, stream a movie on Netflix, or search for something on Google, we are creating data. And that data is valuable to companies like Facebook, Amazon, and Google. They can use it to target ads, recommend products, and improve their services.

Most of this data takes the form of qualitative or unstructured data such as text, images, and video. Within the qualitative data stream, there is an abundant supply of journals, research and high-quality content helping us to solve almost any problem that we can dream of.

But there are also potential drawbacks to big data. It can be used to invade our privacy, manipulate our behavior, and unfairly discriminate against us. We need to be aware of these risks and take steps to protect ourselves from them.

How is Big Data processed and analysed?

Typically, data becomes Big Data when its volume, velocity or variety exceeds the abilities of IT systems to ingest, store, analyse and process the data. To address IT system limitations we need to think about data in a different way.

One way is to process data in a distributed manner. Instead of scaling IT systems vertically (e.g. more processor power or more memory on a single machine), we scale horizontally via distributed processing. A distributed computer system consists of multiple software components that sit on multiple computers but run as a single system. Distributed systems are more scalable (machines can be added as needed) and also offer redundancy (when one machine fails, the others can take over its work, so the system as a whole is less likely to go down).
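As a rough illustration of the scale-out idea, the sketch below splits a word count into partitions, lets several workers count their own partition, and then merges the partial results. It uses only Python's standard library on one machine, so the thread pool merely stands in for a cluster of computers. The function names `count_words` and `distributed_word_count` and the sample `partitions` are made up for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(partition):
    """Map step: a worker counts the words in its own slice of the data."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def distributed_word_count(partitions, workers=3):
    """Spread the partitions across the workers, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    total = Counter()
    for partial in partial_counts:  # reduce step: combine the partial results
        total.update(partial)
    return total


# Hypothetical data, already split into three partitions.
partitions = [
    ["big data is everywhere", "data is valuable"],
    ["spark processes big data"],
    ["streams of data arrive over time"],
]

totals = distributed_word_count(partitions)
print(totals.most_common(3))
```

Adding capacity here means adding partitions and workers rather than making any single worker faster, which is the essence of horizontal scaling.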

Distributed computing example

Another paradigm shift is to think of data as a stream: instead of processing a defined amount of data of a specific shape and size, we consider the data to be unbounded (without end). Streaming data applications process data over time as it arrives. This presents many opportunities for efficiency, as streaming data can be processed on the fly and results generated without necessarily storing much of the data. Multiple data streams can also be combined to provide even more insights.
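To make the streaming idea a little more concrete, here is a minimal sketch using Python generators. Two unbounded streams are merged by timestamp, and a running mean is updated after every event while keeping only two numbers in memory. The `sensor_stream` and `running_mean` functions and the random sensor values are assumptions invented for this illustration, not part of any streaming framework.

```python
import heapq
import itertools
import random


def sensor_stream(name, start, step, seed):
    """An unbounded stream of (timestamp, source, value) events."""
    rng = random.Random(seed)
    for i in itertools.count():
        yield (start + i * step, name, rng.uniform(0, 100))


def running_mean(events):
    """Emit an updated mean after every event, storing only two numbers."""
    count, total = 0, 0.0
    for timestamp, source, value in events:
        count += 1
        total += value
        yield timestamp, source, total / count


# Two hypothetical streams, each already ordered by timestamp.
stream_a = sensor_stream("a", start=0, step=2, seed=1)
stream_b = sensor_stream("b", start=1, step=3, seed=2)

# heapq.merge lazily interleaves the sorted streams into one ordered stream.
combined = heapq.merge(stream_a, stream_b)

# The streams never end, so we only look at the first ten results.
results = list(itertools.islice(running_mean(combined), 10))
for timestamp, source, mean in results:
    print(timestamp, source, round(mean, 2))
```

Nothing is ever loaded in full: each event is consumed, folded into the running total, and discarded.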

Conceptualising data as multiple streams

Using Data mining to gain value out of Big Data

The wisdom of the ages dictates that we should learn from the past mistakes of others. Therefore, it is no surprise that much of our decision-making today is guided by looking into the past and trying to discover trends that may help us make useful inferences about the future.

Data mining is the process of sorting through large data sets to identify patterns and relationships that can help solve business problems through data analysis. Effective data mining applications use historical data, machine learning and/or various statistical methods to help uncover hidden insights.
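A toy version of pattern discovery is sketched below: it counts which pairs of items appear together in historical shopping baskets and keeps the pairs that occur often enough. This is plain co-occurrence counting, not a full association-rule algorithm, and the `frequent_pairs` function, the `min_support` threshold and the basket data are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs that appear together in at least min_support baskets."""
    pair_counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


# Hypothetical purchase history: one list of items per transaction.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["eggs", "milk"],
    ["bread", "butter"],
    ["bread", "butter", "milk", "eggs"],
]

pairs = frequent_pairs(baskets, min_support=3)
print(pairs)
```

A business might use a result like this to decide which products to recommend or shelve together; real data mining tools apply the same idea to far larger data sets and richer statistical methods.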

Today, data mining solutions are used in almost every industry and are an essential part of any successful business intelligence (BI) endeavour. Data mining also supports many other important business use cases, such as fraud detection, risk management, and cybersecurity planning. It is important in many other fields as well, including governance, science, mathematics, and sports.

Where to get started with Big Data Analytics?

Big Data architectures typically involve the streaming of a variety of data sources of various types into a distributed data storage environment. Data can be transformed before storage in preparation for the application of data mining.

Currently, I build most of my Big Data analytics applications in Apache Spark. While my research is far from exhaustive, I could not find many straightforward alternatives to Spark.

Apache Spark

Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

With a Python-friendly API and a DataFrame syntax similar to pandas, it is easy for anyone with a basic understanding of Python to get started.

Databricks

Databricks is a Unified Analytics Platform on top of Apache Spark that can be used for data science, engineering and business. It offers fully managed Spark clusters in the cloud, and you can provision clusters by following a relatively simple setup procedure. To be fair, there are many other cloud computing options available, but these are beyond the scope of this exploratory article.

Conclusions

  • Big Data is everywhere,
  • Distributed computing along with data streaming is a common approach to analysing Big Data,
  • We can unlock powerful insights by building data mining applications,
  • Finally, understanding the fundamentals of Spark is a great place to start with Big Data analytics.

I think this is enough for now. Thanks for reading!

References


Engineer | Analyst | Data Science Enthusiast | UCT | MLthinkbox Publication Founder