Map-Reduce with Amazon EC2 and S3

Sanchit Gawde
5 min read · Dec 22, 2019


(Saavn_TrendFinder)

Introduction

This Map-Reduce algorithm finds the top 100 trending songs in the Saavn app streaming dataset hosted in an Amazon S3 bucket.
The algorithm uses the Map-Reduce framework for processing and computing the data. A sliding-window technique ranks each song for a target date based on its occurrences on the preceding days, and the chaining map-reduce concept is used to filter the data and assign weights to it.

Technologies used:

  • Amazon EC2
  • Amazon S3 Bucket
  • Cloudera Manager Community Edition
  • MapReduce Framework (Processing layer)
  • HDFS with AWS S3 (Storage layer)

High-Level Architecture:

MapRed with AWS S3 as a storage layer.

Definitions:

The following definitions and assumptions were taken into consideration while developing the algorithm.

  • Trending Songs
    A song is considered trending only when the following criteria are met:
    - The song must be present in the dataset for a minimum of five consecutive days, i.e. there must be at least one occurrence of the song on each of the five consecutive days preceding the target date.
      e.g. For song-id “X4q99UL7” to be considered trending on 25/12/2017, it must have at least one occurrence on each of the following dates:
20/12/17
21/12/17
22/12/17
23/12/17
24/12/17
    - There should be no spikes in the data over those 5 consecutive dates.
      Definition of a spike in the data:
      - An increase in the more recent value must not be equal to or more than 1000% of the previous value.
      - If it is, it is considered a spike and the song-id is discarded immediately.
  • Song Rank Algorithm:
    The rank of a song, which ultimately determines whether it is trending, is computed as follows:
    - The algorithm uses a sliding-window technique to determine the song weight/rank for a specific day.
    - The weight assigned to each day's count decays with its distance from the target date. E.g. if the target date is 25/12/2017, the weight assigned to the song on each preceding day is as follows (a small code sketch of the eligibility checks and this weighting follows below):
20/12/17 -> (Actual Count)*(0.9)^5
21/12/17 -> (Actual Count)*(0.9)^4
22/12/17 -> (Actual Count)*(0.9)^3
23/12/17 -> (Actual Count)*(0.9)^2
24/12/17 -> (Actual Count)*(0.9)^1
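
For concreteness, here is a small, self-contained sketch of the two checks and the weighting above. The class and method names and the sample counts are purely illustrative, not the project's actual code; counts[0] is the count five days before the target date and counts[4] is the count on the day just before it.

// Illustrative sketch of the trending rules: eligibility (presence on all
// 5 preceding days, no spike) and the sliding-window weighted score.
public class TrendRules {

    // counts[0] = 5 days before the target date, counts[4] = 1 day before.
    static boolean isEligible(int[] counts) {
        for (int c : counts) {
            if (c == 0) return false;              // must appear on all 5 days
        }
        for (int i = 1; i < counts.length; i++) {
            // Treat a value of 1000% (10x) or more of the previous day's
            // count as a spike and reject the song-id.
            if (counts[i] >= counts[i - 1] * 10) {
                return false;
            }
        }
        return true;
    }

    // Weighted score for the target date: each day's count * 0.9^(days before target).
    static double weightedScore(int[] counts) {
        double score = 0.0;
        for (int i = 0; i < counts.length; i++) {
            int daysBefore = counts.length - i;     // 5 .. 1
            score += counts[i] * Math.pow(0.9, daysBefore);
        }
        return score;
    }

    public static void main(String[] args) {
        int[] counts = {120, 150, 160, 200, 240};   // hypothetical daily play counts
        if (isEligible(counts)) {
            System.out.printf("weighted score = %.2f%n", weightedScore(counts));
        }
    }
}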

Prerequisites:

Before moving to the actual implementation, the following configuration changes are necessary for the MapRed jobs to run successfully.
Since the dataset provided in the Amazon S3 bucket is nearly 44 GB, the following settings need to be updated through Cloudera Manager on the Amazon EC2 instance:

  • Cloudera-manager UI: <PublicIPV4 Addr>:7180 -> Select “Yarn” -> Configurations
yarn.scheduler.maximum-allocation-mb - 10 GB
yarn.nodemanager.resource.memory-mb - 8 GB
Map Task Maximum Heap Size - 3 GB
Reduce Task Maximum Heap Size - 3 GB
Client Java Heap Size - 3 GB
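
For reference, a rough equivalent of the same settings when editing the Hadoop configuration files directly instead of the Cloudera Manager UI. The mapping of the CM fields to these property names is an assumption on my part, so verify it against your CM version:

<!-- yarn-site.xml -->
<property><name>yarn.scheduler.maximum-allocation-mb</name><value>10240</value></property>
<property><name>yarn.nodemanager.resource.memory-mb</name><value>8192</value></property>

<!-- mapred-site.xml: 3 GB heap for map and reduce tasks -->
<property><name>mapreduce.map.java.opts</name><value>-Xmx3072m</value></property>
<property><name>mapreduce.reduce.java.opts</name><value>-Xmx3072m</value></property>

# Client Java heap (shell environment, e.g. in hadoop-env.sh)
export HADOOP_CLIENT_OPTS="-Xmx3072m"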

Note:
Since this dataset carries a high amount of I/O overhead, create an AWS EC2 instance of the m4.large category (an example launch command is sketched below).
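
For illustration, one way to launch such an instance from the AWS CLI; the AMI ID, key pair, and security group below are placeholders to be replaced with your own values:

# Placeholder AMI ID, key pair name, and security group ID.
aws ec2 run-instances \
  --image-id ami-xxxxxxxx \
  --instance-type m4.large \
  --count 1 \
  --key-name <your-key-pair> \
  --security-group-ids <your-security-group-id>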

Detailed Implementation:

The following is the detailed implementation of the whole map-reduce algorithm execution. It consists of two phases:

  • Phase 1: Filter MapReduce Job
    The Filter MapReduce job implements the following functionality:
    - Takes the unstructured CSV input and transforms it into key-value pairs based on the filter conditions.
    - Filters irrelevant data out of the dataset.
    - Trending-song condition: this condition is checked in the reducer phase, inside the cleanup() method.
    - Spike anomaly detection: once the trending-song condition is verified, the next condition to check is spike detection; compareAlgo() in the algorithm checks for spikes in the data.
    - Once the data is filtered and all the irrelevant records are drained out, the key-value pairs are written to the Amazon S3 bucket directory via HDFS.
    - This key-value pair data is then used as input to the Phase-2 map-reduce job.
Phase-1: Initial Output of MapRed job 1
  • Phase 2: Trend MapReduce Job
    The Trend MapReduce job implements the following functionality:
    - In the trendSongMapper() phase, each song-id is assigned the weight associated with its date relative to the target date, as described above.
    - trendSongCombiner() is used as a combiner that sums up the values for each key on the mapper side itself, thus reducing network I/O.
    - trendSongPartitioner is used to partition the output data files according to the dates for which the top 100 trending-songs list needs to be written to the S3 bucket directory.
    - trendSongReducer is the final reducer phase, which writes out the key-value pairs in the cleanup() method. All the values are summed up and then ingested into a TreeMap, which arranges them in descending order.
    - Once all the key-value pairs are ingested into the TreeMap, the top 100 entries are written to the S3 bucket directory (a sketch of this reducer appears after this list).
Phase-2: Final Trending Song Output
  • Yarn Application Logs:
YARN application logs showing that both MapRed jobs succeeded.
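
As an illustration of the Phase-2 reducer described above, here is a minimal sketch of the top-100 logic. It is not the exact project code, and keying the TreeMap on the weight means ties overwrite one another; a real implementation might key on a (weight, song-id) pair instead.

import java.io.IOException;
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the weighted counts per song-id, keeps them in a TreeMap sorted by
// descending weight, and emits only the top 100 entries in cleanup().
public class TrendSongReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

    // Highest weights first.
    private final TreeMap<Double, String> topSongs =
            new TreeMap<>(Collections.reverseOrder());

    @Override
    protected void reduce(Text songId, Iterable<DoubleWritable> weights, Context context) {
        double total = 0.0;
        for (DoubleWritable w : weights) {
            total += w.get();                        // sum per-day weighted counts
        }
        topSongs.put(total, songId.toString());
        if (topSongs.size() > 100) {
            topSongs.remove(topSongs.lastKey());     // keep only the top 100 in memory
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit the top 100 song-ids with their aggregated weights.
        for (Map.Entry<Double, String> entry : topSongs.entrySet()) {
            context.write(new Text(entry.getValue()), new DoubleWritable(entry.getKey()));
        }
    }
}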

Commands important for MR jobs on Amazon EC2-Servers:

The following are the commands to keep in mind when executing the MR jobs in distributed cluster mode (a couple of commands for inspecting the final output follow the list):

  • To execute the Hadoop MR job for S3:
hadoop jar SaavnProject.jar com.bits.upgrad.App \
s3a://mapreduce-project-bde/<input> \
s3a://<input path>/finalFilterOutputSaavn \
s3a://<input path>/finalFilterOutputSaavn \
s3a://saavn-<outputPath>/finalTrendOutputSaavn
  • To check for the state of all the jobs:
mapred job -list
  • To check for the application logs through Yarn CLI:
yarn logs -applicationId application_1576382256655_0019
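
Once both jobs report as SUCCEEDED, the final output can be inspected directly from the cluster. The commands below assume the same finalTrendOutputSaavn directory used in the submit command and the default part-r-00000 reducer output file name:

# List the output directory and preview the first 100 trending entries.
hadoop fs -ls s3a://saavn-<outputPath>/finalTrendOutputSaavn
hadoop fs -text s3a://saavn-<outputPath>/finalTrendOutputSaavn/part-r-00000 | head -n 100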

Execution time and other details:

  • Since the MapRed framework does not support in-memory computation the way Spark does, the execution time is somewhat higher than for equivalent Spark jobs.
  • The dataset on the S3 bucket was nearly 44 GB; thus, the overall execution time for the two jobs was around 25 minutes.
