Big Data 101: What, Why and How

Big data refers to data sets that are so large and complex that they cannot be handled by traditional data processing software, as well as to the infrastructure needed to support their analysis. For this reason, we need specialized tools to clean and manipulate the data and to extract trends and other relevant information. These trends often relate to human behaviour and interactions.

Companies use Big Data on a day-to-day basis for a variety of reasons. Google uses Big Data to process millions of search queries per second. Facebook uses Big Data to store and serve images, user information and status updates. PayPal uses Big Data to predict fraud across millions of transactions.


Big Data is characterized using “The Four V’s”:

Volume

Velocity

Variety

Veracity

Volume refers to the size of data, which is usually measured in gigabytes, terabytes and petabytes. It is said that 90% of the world's data was created in the last two years alone. This is attributed to the growth of our digital footprint: social media, online transactions, digital photos and videos, and GPS data.

Velocity refers to the speed at which data is generated and must be ingested and processed. The velocity that matters is defined by the problem space: detecting payment fraud calls for near real-time processing, while a monthly sales report can tolerate overnight batch jobs.

Variety refers to the diversity of data. Data can be classified as structured or unstructured, time-dependent, sparse or dense. Other types of data include geospatial data, sensor and IoT data, and media in the form of audio, video and images.

Veracity refers to the trustworthiness of the data: how valid and accurate it is.


Domains of Data Storage:

Relational databases are built on a blueprint of organization known as a schema. The primary means of accessing a relational database is by querying it, and dedicated query languages exist for this purpose, with SQL being the most common. Examples of relational databases include MariaDB, MySQL, SQLite and PostgreSQL. These perform well on data that is less sparse.
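
As a minimal sketch of the schema-plus-query pattern, using Python's built-in sqlite3 module and a made-up users table:

```python
import sqlite3

# Connect to an on-disk database file (created if it does not exist).
conn = sqlite3.connect("example.db")
cur = conn.cursor()

# The schema defines the structure up front: columns, types and constraints.
cur.execute("""
    CREATE TABLE IF NOT EXISTS users (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        email TEXT UNIQUE
    )
""")

cur.execute("INSERT OR IGNORE INTO users (id, name, email) VALUES (?, ?, ?)",
            (1, "Ada", "ada@example.com"))
conn.commit()

# Querying is the primary way of getting data back out.
for row in cur.execute("SELECT id, name FROM users WHERE name LIKE 'A%'"):
    print(row)

conn.close()
```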

Non-relational databases are based around documents and use formats such as JSON and XML as their building blocks. These perform better across distributed systems and allow parallel access across a cluster. An example of a non-relational database is MongoDB, which stores documents as BSON.
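
A rough sketch of document storage, assuming a MongoDB instance running locally and the pymongo driver (the database, collection and field names here are made up):

```python
from pymongo import MongoClient

# Assumes MongoDB is running locally on the default port.
client = MongoClient("mongodb://localhost:27017")
db = client["demo"]

# Documents are JSON-like and schemaless; fields can vary from record to record.
db.users.insert_one({
    "name": "Ada",
    "interests": ["search", "graphs"],
    "last_login": "2024-01-01",
})

# Queries match on document fields rather than table columns.
for doc in db.users.find({"interests": "graphs"}):
    print(doc["name"])
```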

Highly distributed storage is primarily attributed to the introduction of HDFS, the Hadoop Distributed File System, an open-source implementation of ideas from the Google File System, which Google built to support indexing the entire web. Cassandra, originally developed at Facebook, is one such system, and Amazon S3, which Dropbox famously built on, is another.
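
As a hedged sketch of working with distributed object storage, assuming AWS credentials are already configured and using the boto3 library with a hypothetical bucket name:

```python
import boto3

# Assumes AWS credentials are configured (e.g. via environment variables).
s3 = boto3.client("s3")

BUCKET = "my-example-bucket"  # hypothetical bucket name

# Objects are stored by key; the service replicates them across machines
# transparently, which is what makes the storage highly distributed.
s3.put_object(Bucket=BUCKET, Key="logs/2024/01/01.json", Body=b'{"event": "login"}')

# Listing keys under a prefix is the closest analogue to a directory listing.
response = s3.list_objects_v2(Bucket=BUCKET, Prefix="logs/2024/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```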

Graph databases consist of nodes and edges and are primarily used to understand relationships and find patterns. Social networks use graph databases. Examples of graph databases include Neo4j and Dgraph.
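
As a rough sketch of the nodes-and-edges model, assuming a local Neo4j instance, the official neo4j Python driver, and placeholder credentials, finding friends-of-friends might look like:

```python
from neo4j import GraphDatabase

# Assumes a local Neo4j instance; the credentials here are placeholders.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Nodes model entities; edges (relationships) model connections between them.
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FRIENDS_WITH]->(b)",
        a="Ada", b="Grace",
    )

    # Pattern matching over edges is what graph queries excel at:
    # here, friends and friends-of-friends of one person.
    result = session.run(
        "MATCH (p:Person {name: $name})-[:FRIENDS_WITH*1..2]->(friend) "
        "RETURN DISTINCT friend.name AS name",
        name="Ada",
    )
    for record in result:
        print(record["name"])

driver.close()
```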


Big Data is primarily employed for three major uses:

Generalized Data Processing

Search

Machine Learning

Generalized data processing is the foundation of big data systems, in the sense of processing data as batches and streams. This is done through frameworks such as Apache Hadoop and Apache Spark. Apache Hadoop was originally created to support indexing and ranking web pages, and Spark, which can run on top of a Hadoop cluster, provides much faster access to data, supports streaming, and allows in-memory processing.
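
A minimal PySpark sketch of batch processing, assuming pyspark is installed and that "access.log" is a hypothetical local input file:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# A local SparkSession; on a real cluster the master would point at YARN or
# a standalone cluster manager instead of "local[*]".
spark = SparkSession.builder.master("local[*]").appName("word-count").getOrCreate()

# Spark reads the file lazily and distributes the work across executors.
lines = spark.read.text("access.log")

# Split each line into words, then count occurrences in parallel.
counts = (
    lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
         .groupBy("word")
         .count()
         .orderBy(F.col("count").desc())
)

counts.show(10)
spark.stop()
```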

Search involves breaking documents and search terms into tokens, individual terms that can be indexed and matched. Lucene, and Solr which is built on top of it, were created for this purpose and are used by companies like Netflix, Instagram and Twitter. Elasticsearch indexes data on disk and provides a front end that lets clients perform searches on the indexed data.
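
A small sketch of indexing and searching, assuming a local, unsecured Elasticsearch node and a recent (8.x) elasticsearch-py client; the index name and documents are made up:

```python
from elasticsearch import Elasticsearch

# Assumes a local Elasticsearch node without security enabled.
es = Elasticsearch("http://localhost:9200")

# Index a document; Elasticsearch tokenizes the text fields as it stores them.
es.index(index="articles", id="1", document={
    "title": "Big Data 101",
    "body": "An introduction to volume, velocity, variety and veracity.",
})

# Full-text search runs against the inverted index built from those tokens.
result = es.search(index="articles", query={"match": {"body": "velocity"}})
for hit in result["hits"]["hits"]:
    print(hit["_source"]["title"], hit["_score"])
```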

Machine Learning allows computers to learn from data. This is commonly done using libraries such as scikit-learn and TensorFlow. Examples of machine learning applications include recommendation systems, self-driving cars, and graph-based systems for predicting fraud.
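
A toy sketch with scikit-learn; the data below is synthetic, standing in for real transaction features such as amount and hour of day:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in data: two features per transaction and a label
# marking whether it was fraudulent (generated by a toy rule).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train a simple classifier; real fraud systems use far richer features and models.
model = LogisticRegression()
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```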