What is Amazon EMR?

Akashgupta
3 min read · May 14, 2023


Amazon EMR (Elastic MapReduce) is a fully managed big data processing service offered by Amazon Web Services (AWS). It allows users to easily and quickly process vast amounts of data using popular distributed computing frameworks such as Apache Hadoop, Apache Spark, and Presto.

EMR lets users scale their clusters dynamically, adding or removing nodes as their processing needs change.

Let’s begin with the fundamentals.

Mapper & Reducer Class

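Map: Take the input data, split it into smaller chunks, and process each chunk in parallel to emit intermediate (key, value) pairs.
Reduce: Take the intermediate pairs and combine the values for each key to produce the final output.

Before the reduce function runs, the mapper output passes through two phases:

  1. Shuffling Phase: This phase brings together all values associated with the same key.
  2. Sorting Phase: Once shuffling is done, the output is sent to the sorting phase, where all the (key, value) pairs are automatically sorted by key.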
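To make the phases above concrete, here is a minimal word-count sketch in plain Python. It is not Hadoop or EMR code; the function names (`map_phase`, `shuffle_phase`, `sort_phase`, `reduce_phase`, `word_count`) are made up for illustration, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: collect all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def sort_phase(grouped):
    """Sort: order the grouped (key, values) pairs by key."""
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, sort, and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = sort_phase(shuffle_phase(mapped))
    return [reduce_phase(key, values) for key, values in grouped]
```

For example, `word_count(["big data on EMR", "big clusters on EC2"])` returns `[('big', 2), ('clusters', 1), ('data', 1), ('ec2', 1), ('emr', 1), ('on', 2)]`. On a real cluster, the map and reduce calls would run on different nodes and the framework would handle the shuffle and sort for you.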

EMR dashboard

When you create an EMR cluster, the EMR service automatically launches EC2 instances to run the master, core, and task nodes of the cluster. You can use the EMR dashboard to view and manage these instances, for example by resizing the cluster's instance groups and by viewing instance details such as instance types, IP addresses, and security groups.

The EMR dashboard provides a convenient way to manage the EC2 instances used by your EMR cluster, without having to navigate to the EC2 console separately. This allows you to have fine-grained control over the compute resources used by your EMR cluster, and to optimize your usage and costs.

In addition to managing EC2 instances, the EMR dashboard also provides a way to work with other resources used by EMR clusters, such as S3 locations, security groups, and the applications installed on the cluster. This allows you to manage most of the EMR infrastructure from a single interface.

In short, EC2 instances are the compute resources that run the nodes of an EMR cluster, and the EMR dashboard is where you manage them.
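The instance details shown in the dashboard can also be read programmatically. The sketch below assumes the boto3 SDK and its EMR `list_instances` call; the helper names (`summarize_instances`, `fetch_instances`) and the default region are hypothetical choices for this example, and pagination is ignored for brevity.

```python
def summarize_instances(response):
    """Flatten a ListInstances-style response into one small dict per instance."""
    rows = []
    for instance in response.get("Instances", []):
        rows.append({
            "ec2_instance_id": instance.get("Ec2InstanceId"),
            "instance_type": instance.get("InstanceType"),
            "private_ip": instance.get("PrivateIpAddress"),
            "state": instance.get("Status", {}).get("State"),
        })
    return rows


def fetch_instances(cluster_id, region_name="us-east-1"):
    """Ask EMR for the core and task instances of one cluster.

    Requires boto3 and AWS credentials. Only the first page of results
    is read here; a fuller version would follow the response's Marker.
    """
    import boto3  # imported lazily so summarize_instances works without it

    emr = boto3.client("emr", region_name=region_name)
    response = emr.list_instances(
        ClusterId=cluster_id,
        InstanceGroupTypes=["CORE", "TASK"],
    )
    return summarize_instances(response)
```

Keeping the formatting logic in `summarize_instances` separate from the AWS call makes it easy to try out with a hand-written response before pointing it at a real cluster.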

Types of Nodes in EMR Cluster

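An Amazon EMR cluster runs open source tools such as Apache Spark, Hadoop, and Hive on a group of EC2 instances, and each instance in the cluster is called a node. Within an Amazon EMR cluster, there are typically three types of nodes:

  1. Master node (also called the primary node): Manages the cluster, coordinates how data and tasks are distributed to the other nodes, and tracks their status.
  2. Core node: Runs tasks and stores data in the cluster's Hadoop Distributed File System (HDFS).
  3. Task node: Runs tasks only and does not store data in HDFS. Task nodes are optional.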
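In the EMR API, each node type corresponds to an instance group role (`MASTER`, `CORE`, or `TASK`). As a rough sketch, assuming the boto3 SDK and its `run_job_flow` call, a cluster request might be put together as follows. The helper names, release label, instance types, counts, and IAM role names are placeholder assumptions for illustration, not recommendations.

```python
def build_cluster_request(name, core_count=2, task_count=2):
    """Build keyword arguments for boto3's EMR run_job_flow call."""
    instance_groups = [
        # One master (primary) node coordinates the cluster.
        {"Name": "Master", "InstanceRole": "MASTER",
         "InstanceType": "m5.xlarge", "InstanceCount": 1},
        # Core nodes run tasks and store data in HDFS.
        {"Name": "Core", "InstanceRole": "CORE",
         "InstanceType": "m5.xlarge", "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes are optional and only run tasks.
        instance_groups.append(
            {"Name": "Task", "InstanceRole": "TASK",
             "InstanceType": "m5.xlarge", "InstanceCount": task_count}
        )
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.10.0",  # placeholder release label
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": instance_groups,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region_name="us-east-1"):
    """Submit the request to EMR; needs boto3, credentials, and the IAM roles."""
    import boto3  # imported lazily so the request builder works without it

    emr = boto3.client("emr", region_name=region_name)
    return emr.run_job_flow(**request)["JobFlowId"]
```

Calling `build_cluster_request("demo")` only builds a dictionary, so you can inspect it safely. `launch_cluster` is the step that actually starts (and bills for) EC2 instances.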

With Amazon EMR, you can easily spin up and configure clusters of EC2 instances to perform various big data tasks, such as batch processing, machine learning, real-time processing, and data analysis. Amazon EMR offers various benefits, including:

  1. Scalability: Amazon EMR can easily scale to process petabytes of data by adding or removing nodes to your cluster.
  2. Cost-effective: You can use Amazon EMR to reduce your infrastructure costs as you only pay for the resources you use.
  3. Flexibility: You can choose from a variety of big data tools and frameworks and customize your cluster to meet your specific needs.
  4. High availability: HDFS on an EMR cluster can replicate data across multiple core nodes, which improves availability and durability if a single node fails.
  5. Security: Amazon EMR provides a range of security features to protect your data, including encryption, authentication, and authorization.
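As an illustration of the scalability point, resizing a running cluster through the API comes down to changing the instance count of an instance group. The sketch below assumes boto3's EMR `list_instance_groups` and `modify_instance_groups` calls; the helper names and the choice to resize only the first task group are assumptions made for this example.

```python
def build_resize_request(cluster_id, instance_groups, target_count):
    """Build keyword arguments for modify_instance_groups.

    instance_groups is the "InstanceGroups" list from a
    list_instance_groups response. Only the first TASK group is resized.
    """
    task_groups = [
        group for group in instance_groups
        if group.get("InstanceGroupType") == "TASK"
    ]
    if not task_groups:
        raise ValueError("cluster has no task instance group to resize")
    return {
        "ClusterId": cluster_id,
        "InstanceGroups": [
            {"InstanceGroupId": task_groups[0]["Id"],
             "InstanceCount": target_count},
        ],
    }


def resize_task_nodes(cluster_id, target_count, region_name="us-east-1"):
    """Look up the cluster's instance groups and request the new task count."""
    import boto3  # imported lazily so the request builder works without it

    emr = boto3.client("emr", region_name=region_name)
    groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
    emr.modify_instance_groups(
        **build_resize_request(cluster_id, groups, target_count)
    )
```

Resizing task nodes is usually the safer knob to turn, since task nodes do not hold HDFS data.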

Overall, Amazon EMR is a powerful and flexible platform that can help you to easily process and analyze large amounts of data.
