Challenges during migration from On-Premise to AWS Cloud

Sourabh Jain
5 min read · May 26, 2020


With an on-premise setup, infrastructure management is one of the big challenges for a company. Nowadays, companies are moving to the cloud to reduce this cost, but the migration has its own challenges (here, we are talking about technical challenges).

Based on my experience, here I am sharing the challenges we faced while migrating from an on-premise Cloudera Distribution setup to the AWS cloud. Of course, the tech choices below can change according to the client's requirements.

Let's say the on-premise system used the following tools and frameworks:

On-Premise Tech Stack

While moving to AWS, the following components changed (highlighted in bold):

AWS (EMR 5.x) Cloud Tech Stack

In the big data space, most frameworks follow the same conventions, so migrating the system is not really hard. But there are always challenges when working with the AWS cloud. Here are a few that might help somebody working with AWS EMR:

Infrastructure Provisioning:

  • Avoid Installing Libraries on the EMR Cluster: An EMR cluster already comes with most libraries preinstalled (as per your configuration), e.g. pyspark. So, while spinning up a cluster, install only project-specific libraries and skip framework-specific ones (see the bootstrap sketch after this list).
  • Resource Management in YARN Queues: This issue and its solution are explained here very clearly. To summarize: if we have one long-running job and a few small jobs, the long-running job may hold all the resources, and the small jobs have to wait for it to complete before they get any (see the queue configuration sketch below).
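
As a minimal sketch of the first point, a bootstrap action can install just the project-specific packages; the package names and bucket below are hypothetical, and PySpark itself already ships with EMR's Spark application.

    #!/bin/bash
    # bootstrap.sh - runs on every node while the cluster is provisioning.
    # Install only project-specific libraries; frameworks such as Spark/PySpark
    # are already provided by EMR, so they are deliberately not installed here.
    set -euxo pipefail
    sudo pip install requests pandas   # hypothetical project dependencies

    # Wire it up at cluster creation (other required flags omitted):
    # aws emr create-cluster --bootstrap-actions Path=s3://my-bucket/bootstrap.sh ...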

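For the YARN point, one common fix is to split cluster capacity into dedicated queues so a long-running job cannot starve the rest. A hedged sketch using EMR's capacity-scheduler classification; save it as queues.json (the queue names and splits are made up for illustration):

    [
      {
        "Classification": "capacity-scheduler",
        "Properties": {
          "yarn.scheduler.capacity.root.queues": "longrunning,adhoc",
          "yarn.scheduler.capacity.root.longrunning.capacity": "60",
          "yarn.scheduler.capacity.root.longrunning.maximum-capacity": "70",
          "yarn.scheduler.capacity.root.adhoc.capacity": "40"
        }
      }
    ]

Capping the longrunning queue below 100% keeps headroom for the small jobs. Pass the file at launch, then point each job at a queue explicitly:

    # other required flags omitted
    aws emr create-cluster --configurations file://queues.json ...
    # and per job, e.g.: spark-submit --queue adhoc ...
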
Data Ingestion:

  • INSERT query changes: When we insert data into a Hive database using the VALUES form, Hive creates a temporary table like Values__Tmp__Table__1 and does not delete it after the query is executed, leaving a garbage table behind.
    Unsupported query: INSERT INTO TABLE table_name VALUES ('value')
    Supported query: INSERT INTO TABLE schema.table_name SELECT 'value1', 'value2'
  • Sqoop import has issues: There are two ways to sqoop a parquet table, but with both of them it is not possible to sqoop directly to parquet on EMR 5.x:
    a) Using --as-parquetfile: Sqoop uses the Kite SDK to read/write Parquet, which has some limitations, so --as-parquetfile is not usable. As per AWS Support, EMR will remove the Kite SDK in the future.
    b) Using HCatalog: Parquet support through HCatalog was added to Hive in v2.4.0/v2.3.7 (jira card) and v3.0.0 (jira card), but EMR 5.x uses Hive 2.3.5.
    Workarounds on EMR 5.x for now (the first is sketched after this list):
    1. Use an intermediate text table to pull the data, then a separate Hive query to copy it from the text table into the desired parquet table.
    2. Update the Kite SDK and use --as-parquetfile. ref: this link
  • Sqoop export has issues: Sqoop uses the Kite SDK to read and write non-text data (such as parquet), so non-text data has to go through HCatalog, and the Kite SDK has some limitations on EMR as of now.
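
A sketch of workaround 1 above, with hypothetical connection details and table names: land the data in a plain-text staging table (no Kite SDK involved), then let Hive copy it into the parquet table.

    # 1) Pull from the source database into a text-format Hive staging table.
    sqoop import \
      --connect jdbc:mysql://mysql-host:3306/sales \
      --username etl -P \
      --table orders \
      --hive-import --hive-table staging.orders_text \
      --as-textfile \
      -m 4

    # 2) Copy from the staging table into the parquet-backed target with Hive.
    hive -e "INSERT OVERWRITE TABLE warehouse.orders_parquet
             SELECT * FROM staging.orders_text;"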

Data Processing:

  • Oozie not able to read workflow files from S3: Symlink the EMRFS jars from their shared location into the Oozie lib directory on the EMR cluster master node:
    sudo ln -sf /usr/share/aws/emr/emrfs/lib/* /usr/lib/oozie/lib/
  • Enable EMRFS Consistency View: If we use EMRFS to read files from S3, we might hit the intermittent error below; it can be avoided by enabling EMRFS Consistency View (a sketch of enabling it follows this list).
    java.io.IOException:com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found;
    Two caveats: Consistency View uses DynamoDB, which is billed separately, and Oozie retry does not work while it is enabled.

  • Cannot run spark-submit/beeline from an Oozie shell action: In YARN, the application master is created only on CORE nodes, and generally we don't install any tools on core nodes (otherwise it impacts autoscaling timing), so it is not possible to run beeline/spark-submit there.
    If you still want to do it, Spark and beeline need to be installed on the CORE nodes explicitly.
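
Consistency View itself is switched on through the emrfs-site classification at cluster creation. A minimal sketch, saved as emrfs.json (the retry count shown is just the documented default):

    [
      {
        "Classification": "emrfs-site",
        "Properties": {
          "fs.s3.consistent": "true",
          "fs.s3.consistent.retryCount": "5"
        }
      }
    ]

    # Keep the two caveats above in mind: this creates a DynamoDB table that
    # is billed separately, and Oozie retry stops working while it is on.
    aws emr create-cluster --configurations file://emrfs.json ...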

Data Storage:

  • Athena has security limitations: As Athena is serverless, access to it cannot be restricted based on IP.
  • Athena views and Hive views are different: Athena is a serverless implementation of Presto, and Presto-defined views and Hive-defined views are not compatible with each other, even if they are defined within the same catalog (see the sketch below).
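
In practice this means a view created from Hive has to be recreated from Athena (in Presto SQL) before Athena can query it. A hedged sketch via the CLI; the database, view, and bucket names are hypothetical.

    # Recreate the Hive view as a Presto/Athena view with the same definition.
    aws athena start-query-execution \
      --query-string "CREATE OR REPLACE VIEW sales.orders_v AS
                      SELECT order_id, amount FROM sales.orders" \
      --result-configuration OutputLocation=s3://my-bucket/athena-results/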

Networking and Security:

  • Requests to the AWS DNS server are limited: To limit CPU utilization, the Amazon-provided DNS server throttles queries (the documented limit is 1,024 packets per second per network interface). If you send requests faster than that, you will start getting the exception below. To solve this, install dnsmasq on every node for DNS caching (a sketch follows this list).
    java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request:<bucket-name>.s3.amazonaws.com
  • "Access Denied" issues: AWS services talk to each other through security groups (inbound rules) and VPCs. Make sure the correct configuration is in place; otherwise you may hit errors like Hive failing to talk to the Glue metastore, or failing to connect to external systems like MySQL.
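
A sketch of the dnsmasq fix as a bootstrap action; paths and the service manager vary by AMI, and 169.254.169.253 is the Amazon-provided DNS endpoint reachable from any instance in a VPC.

    #!/bin/bash
    # dns-cache.sh - run on every node so repeated lookups are answered locally.
    set -euxo pipefail
    sudo yum install -y dnsmasq
    # Forward cache misses to the Amazon-provided resolver.
    echo "server=169.254.169.253" | sudo tee -a /etc/dnsmasq.conf
    sudo service dnsmasq start
    sudo chkconfig dnsmasq on
    # Put the local cache first in the resolver list.
    sudo sed -i '1i nameserver 127.0.0.1' /etc/resolv.conf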

So, these are a few things to take care of while moving from a Cloudera Distribution to the AWS cloud.
In the Cloudera Distribution, all these tools are customized and most of these issues have already been fixed; on the AWS cloud, things are steadily getting better.

Some quick pointers about Amazon EMR:

1. For more cost savings, go for "spot instances" (see the create-cluster sketch after this list).

2. The Oozie web console is not available on an EMR cluster, so extjs.zip needs to be copied manually (location: /usr/lib/oozie/libext); a sketch follows this list.

3. To enable OIDC authentication, code changes need to be made in Hue's Django service.
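
For pointer 1, a hedged create-cluster sketch: giving an instance group a BidPrice makes it a Spot group. Instance types, counts, and the bid are placeholders; task nodes are the safest to bid for, since they hold no HDFS data.

    aws emr create-cluster \
      --release-label emr-5.30.0 \
      --applications Name=Hadoop Name=Spark \
      --use-default-roles \
      --instance-groups \
        InstanceGroupType=MASTER,InstanceType=m5.xlarge,InstanceCount=1 \
        InstanceGroupType=CORE,InstanceType=m5.xlarge,InstanceCount=2 \
        InstanceGroupType=TASK,InstanceType=m5.xlarge,InstanceCount=4,BidPrice=0.10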

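For pointer 2, roughly what the manual copy looks like once ext-2.2.zip has been downloaded from a mirror; the restart command may differ by EMR release.

    # The Oozie web console needs the ExtJS library, which EMR does not ship.
    sudo cp ext-2.2.zip /usr/lib/oozie/libext/
    sudo chmod 644 /usr/lib/oozie/libext/ext-2.2.zip
    sudo service oozie restart   # Oozie picks up libext contents on restart
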
When it comes to cost savings, with the AWS cloud as your infrastructure, costs can be cut in half or more (depending on your implementation).

Thanks to Snigdhajyoti Ghosh for pairing with me on this :)
