Big Data Tools: Overview

Abhinav Vinci
4 min read · Jan 27, 2023

--

Earlier, in Part 1: Concepts: MapReduce, HDFS. Tools: Hadoop, Spark.

In this blog:

  • Tools: Presto, Hive, Apache Pig
  • Column databases: Bigtable, HBase, Cassandra

Presto:

Presto is an open-source, distributed SQL query engine that is designed for big data processing.

Benefits of Presto?

  • It allows users to run interactive, ad-hoc queries on large data sets stored in various data sources, such as HDFS, Amazon S3, and Apache Cassandra, among others.
  • It is designed for fast, low-latency, parallel processing. It can handle large data sets, and it is designed to work with different data sources, such as data lakes and data warehouses.
  • Additionally, Presto is highly scalable: it can handle thousands of concurrent users and queries, and it can process hundreds of terabytes of data in a single query.

How Presto works:

  1. Users can run SQL-like queries on the data using Presto SQL, which is a variant of SQL that is optimized for big data processing.
  2. The query is broken down into a series of smaller sub-queries, which are executed in parallel across the data sources.
  3. The results of the sub-queries are then combined to produce the final result set.

Usage: If you have a table called “sales” stored in your Hadoop cluster, you can use Presto to query the total sales by region:

SELECT region, SUM(sales) AS total_sales
FROM sales
GROUP BY region;

This query returns a table with two columns, “region” and “total_sales”, showing the total sales for each region.

Presto allows you to query the data stored in your Hadoop cluster without the need to move the data out of it, making it a popular choice for big data analytics.
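Because Presto connects to each data source through a catalog, a single query can even join tables that live in different systems. A minimal sketch, assuming a hive catalog exposing the sales table and a hypothetical mysql catalog with a regions lookup table (all catalog, schema, and table names here are illustrative):

-- join HDFS data with a relational lookup table in one Presto query
SELECT r.name, SUM(s.sales) AS total_sales
FROM hive.default.sales AS s
JOIN mysql.shop.regions AS r
  ON s.region = r.id
GROUP BY r.name;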

Apache Pig:

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. It allows users to write complex data processing tasks using a simple scripting language called Pig Latin.

Pig Latin scripts are then executed on a Hadoop cluster and can process large amounts of data in parallel.

How Pig works:

  1. Data is loaded into HDFS. Users write Pig Latin scripts that define a series of data processing operations, such as filtering, sorting, and grouping, to be performed on the data.
  2. The Pig Latin script is converted into a series of MapReduce jobs, which are executed on the data stored in HDFS.
  3. The results of the MapReduce jobs are returned to the user in the form of a relation, a data structure that represents the processed data.

Pig provides a rich set of built-in operators for common data processing tasks, such as filtering, sorting, and joining data, as well as a flexible programming model that allows users to create their own custom operators. A short Pig Latin sketch follows.
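Here is what the earlier sales-by-region task could look like as a Pig Latin script. This is a minimal sketch: the input file sales.csv, its comma delimiter, and the column names are assumptions for illustration, not part of the original example.

-- load a hypothetical CSV of (region, amount) records from HDFS
sales = LOAD 'sales.csv' USING PigStorage(',') AS (region:chararray, amount:double);
-- group the records by region
by_region = GROUP sales BY region;
-- sum the amounts within each group
totals = FOREACH by_region GENERATE group AS region, SUM(sales.amount) AS total_sales;
-- write the result back to HDFS
STORE totals INTO 'sales_totals';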

Pig is particularly well-suited for big data scenarios where data needs to be cleaned and transformed before it can be analyzed.

Apache Hive:

Apache Hive is a data warehousing tool with a SQL-like query language, built on top of HDFS. It allows users to query and analyze large amounts of data stored in HDFS using a SQL-like language called HiveQL.

How Hive works:

  1. Data is loaded into HDFS. Hive creates a logical view of the data stored in HDFS by creating a table on top of it. This table contains metadata about the data, such as the column names and data types.
  2. Users can then query the data using HiveQL, which is translated into a series of MapReduce jobs that are executed on the data in HDFS.
  3. The results of the query are returned to the user in a tabular format, similar to a traditional relational database.
  4. Hive also provides a metastore service, which is a relational database that stores the metadata for Hive tables and partitions.
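For example, a table can be declared over files that already sit in HDFS and then queried with HiveQL. A minimal sketch, assuming comma-delimited sales data under a hypothetical /data/sales directory:

-- declare a table over existing files in HDFS (schema-on-read)
CREATE EXTERNAL TABLE sales (
  region STRING,
  amount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/sales';

-- a HiveQL query; Hive compiles it into MapReduce jobs
SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region;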

Comparisons:

Pig vs Presto?

Pig is more focused on data cleaning, transformation, and preparation, while Presto is more focused on interactive querying and analysis of large data sets. Both tools can be used together to perform data processing, cleaning, and querying tasks on big data.

Hive vs Presto?

Hive is more focused on providing a SQL-like interface for querying and analyzing large data sets, and it is optimized for batch processing. Presto is more focused on interactive querying and analysis of large data sets, and it is optimized for low-latency query performance.

Bigtable

Google Bigtable is a distributed NoSQL database designed for low-latency, high-throughput access to large amounts of data.

  • It uses a column-family data model, which is well-suited for storing large amounts of sparse data.
  • It uses a distributed architecture, where data is spread across multiple machines for scalability and fault tolerance.

HBase

Apache HBase is a distributed, column-family NoSQL database that runs on top of the Hadoop Distributed File System (HDFS). It is modeled after Google’s Bigtable and is designed for low-latency, high-throughput access to large amounts of data.

Benefits of HBase?

HBase is a good choice when you require random, real-time read/write access to your big data. It can be integrated with other tools in the Hadoop ecosystem, such as Pig, Hive, and MapReduce.
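To make “random, real-time read/write access” concrete, here is what single-row writes and reads look like in the HBase shell. A minimal sketch; the table name, the cf column family, and the row keys are illustrative:

# create a table with one column family
create 'sales', 'cf'
# write a single cell, addressed by row key, column family, and qualifier
put 'sales', 'emea#2023-01-27', 'cf:amount', '1250'
# read that row back by key
get 'sales', 'emea#2023-01-27'
# scan the first few rows of the table
scan 'sales', {LIMIT => 10}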

Bigtable vs HBase?

HBase is built on top of the Hadoop ecosystem and runs on the Hadoop Distributed File System (HDFS), while Bigtable is built on top of the Google File System (GFS). In general, HBase is more flexible, but the differences are minor and depend on the use case.

Cassandra

Apache Cassandra is a distributed, NoSQL database that is designed to handle large amounts of data across multiple commodity servers. Cassandra uses a column-family data model.

Cassandra vs HBase?

Cassandra is more often chosen for distributed systems with high write rates and high-availability requirements, while HBase is a better choice when you require random, real-time read/write access to your big data and your data is already stored in the Hadoop ecosystem.

Cassandra’s data model is more suited for use cases like real-time data serving, real-time analytics, and time-series data.
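As an illustration of the time-series fit, here is how such data might be modeled in CQL, Cassandra’s SQL-like query language. A minimal sketch; the table name, columns, and partitioning scheme are assumptions for illustration:

-- one partition per region; rows inside it are ordered by time,
-- so "latest activity for a region" is served from a single partition
CREATE TABLE sales_by_region (
  region    text,
  sale_time timestamp,
  amount    double,
  PRIMARY KEY (region, sale_time)
) WITH CLUSTERING ORDER BY (sale_time DESC);

-- recent sales for one region
SELECT sale_time, amount
FROM sales_by_region
WHERE region = 'emea'
LIMIT 10;

Partitioning by region spreads writes across the cluster while keeping each region’s recent rows together, which is what makes this layout fast for both high write rates and time-ordered reads.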
