Big Data: Basic Overview

The term “Big Data” combines two words: BIG and DATA. But what does it actually mean? What should you have in mind when talking about Big Data? In this post, let’s look at a definition of Big Data and at the place of data systems in the world of Big Data.

Everything is data, apart from analog signals

The following words define Big Data: Volume, Velocity, Variety, Veracity, Value. For fun, I call them the V5 of Big Data :). Let’s see, one by one, what each V tells us.

Volume: How big is big in Big Data?

This is the impact of the exponential growth of data. Everything in the range of terabytes to tens of petabytes and beyond is considered big.

Velocity: Data is produced continuously

Sending and receiving emails, tweeting on Twitter, posting on Facebook, uploading videos to YouTube and much more … all of these produce data continuously. In addition, machine-generated data is another endless source. Ingestion rates in the range of 30 KiB to 30 GiB per second are considered high velocity.
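
To get a feeling for how velocity turns into volume, here is a small back-of-the-envelope sketch in Python; the 30 MiB/s rate is only an illustrative assumption, not a threshold from any standard.

```python
# Back-of-the-envelope sketch: how quickly a continuous stream turns into volume.
# The 30 MiB/s ingestion rate below is just an illustrative assumption.
MIB = 2**20
TIB = 2**40

rate_bytes_per_second = 30 * MIB     # assumed ingestion rate
seconds_per_day = 24 * 60 * 60

daily_volume = rate_bytes_per_second * seconds_per_day
print(f"{daily_volume / TIB:.1f} TiB per day")   # ~2.5 TiB/day at 30 MiB/s
```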

Variety: Type of data

It stands for the word “everything”. Data come from all kinds of digital sources. They can be structured or unstructured, text or images/videos, curated or automatically collected …

Veracity: Variation in the quality of captured data

Data from sources are not always what we expect. They involve some uncertainty and ambiguity. For example, a user likes a post on Facebook today and unlikes it tomorrow.

Value: Analyzing data

An important aspect of dealing with Big Data is the value of the data: analytics can be performed to extract insight from what has been captured. Questions such as: what happened, why did it happen, what’s wrong, what will happen, what should we do and why, … can be answered by analyzing the data.
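
As a tiny illustration of the “what happened?” kind of question, here is a sketch in plain Python on a made-up event log; real analytics would of course run on far larger data with dedicated tooling.

```python
# Minimal "what happened?" sketch on a tiny, made-up event log.
# Real Big Data analytics run on far larger data with dedicated tools;
# the point here is only the kind of question being asked.
from collections import Counter

events = [
    {"user": "alice", "action": "like"},
    {"user": "bob",   "action": "post"},
    {"user": "alice", "action": "unlike"},   # veracity: she changed her mind
    {"user": "carol", "action": "like"},
]

# Descriptive analytics: what happened?
print(Counter(event["action"] for event in events))
# Counter({'like': 2, 'post': 1, 'unlike': 1})
```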

And that’s it: when you talk about Big Data, think of the V5 (Volume, Velocity, Variety, Veracity, Value) ;).

Data systems in the world of Big Data

A data system is one that stores data, provides access to data and, ideally, makes data analysis easy. It sits at the heart of Big Data. There are actually many types of data systems, and different data systems use different data models.

Data model

  1. Relational data model: Data are represented as tuples, i.e. rows in tables (any SQL database).
  2. Key-value data model: Data are stored as a collection of key/value pairs, where the key of each pair is unique (ex: Redis).
  3. Wide-column data model: Data are also stored as key/value pairs, but the key consists of 3 parts: row key, column key and timestamp (ex: HBase).
  4. Document data model: Data are stored as self-describing documents with no predefined schema (ex: MongoDB).
  5. Graph data model: Data are stored as nodes and relationships, each of which can hold any number of key/value pairs (ex: Neo4j).
  6. Multi-model: several of the above models are combined in one database (ex: ArangoDB).

More generally, data models are often divided into two families: RDBMS (Relational Database Management Systems), which use SQL as their query language, and NoSQL (Not only SQL), which covers the other data models listed above.
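
To make these families a bit more concrete, here is a small sketch of the same made-up user record expressed in three of the models above; the field names and values are purely illustrative.

```python
# The same made-up "user" record sketched in three of the data models above.

# Relational: a tuple (row) in a "users" table with a fixed schema.
sql_row = "INSERT INTO users (id, name, city) VALUES (42, 'Alice', 'Berlin');"

# Key-value: one value stored under a unique key (Redis-style).
key_value = {"user:42": '{"name": "Alice", "city": "Berlin"}'}

# Document: a self-describing document with no predefined schema (MongoDB-style).
document = {
    "_id": 42,
    "name": "Alice",
    "city": "Berlin",
    "interests": ["big data", "nosql"],   # an extra field needs no schema change
}
```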

How to find the right data model for your application?

If you want your application to be able to deal with Big Data, the answer is often NoSQL. Why?

  1. NoSQL databases can scale out, i.e. grow by adding more machines rather than a bigger machine, which gives the application near-unlimited scalability (see the sketch after this list).
  2. NoSQL databases are simpler and often faster.
  3. With NoSQL databases, getting an answer quickly is often more important than getting a perfectly up-to-date answer.
  4. Most NoSQL databases come with simple, predefined interfaces and query languages that are easy (and fun) to learn.
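
Here is the scale-out sketch mentioned in point 1: a minimal, hypothetical example of spreading keys across nodes by hashing. It only illustrates the idea; real databases use more robust schemes such as consistent hashing so that adding a node does not reshuffle most keys.

```python
# Minimal sketch of scaling out by sharding: each key is assigned to a node by a hash.
# Purely illustrative; real systems use consistent hashing or range partitioning.
import hashlib

def node_for(key: str, nodes: list) -> str:
    """Pick the node responsible for a key (simple modulo sharding)."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

nodes = ["node-1", "node-2", "node-3"]
for key in ["user:1", "user:2", "user:3", "user:4"]:
    print(key, "->", node_for(key, nodes))
```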

The CAP property: to be considered when using a NoSQL data model

The CAP property tells us that only two of Consistency, Availability and Partition tolerance (the ability to keep operating when the network splits the nodes of a distributed system into groups that cannot reach each other) can be achieved at the same time. It is therefore important to know which of these properties matters more or less for your application, and which data model fits your needs.
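
As a rough illustration (not a real replication protocol), the hypothetical sketch below shows the choice a partitioned system faces: answer with possibly stale data (availability) or refuse to answer (consistency).

```python
# Toy sketch of the CAP trade-off: two replicas of one value and a simulated
# network partition between them. Purely illustrative.

replicas = {"node-1": "v1", "node-2": "v1"}
partitioned = True                  # the network between the nodes is split

# A new write reaches only node-1; node-2 cannot be updated during the partition.
replicas["node-1"] = "v2"

def read(node: str, prefer: str) -> str:
    """Read from a node, preferring availability ('AP') or consistency ('CP')."""
    if partitioned and prefer == "CP":
        return "error: value may be stale, refusing to answer"   # consistent, not available
    return replicas[node]                                        # available, maybe stale

print(read("node-2", prefer="AP"))   # 'v1': an answer, but an outdated one
print(read("node-2", prefer="CP"))   # an error: no stale answer, but no answer at all
```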

So that is my basic overview of Big Data. I hope it helps you get started with this topic :D.