9 ways to put your Big Data Platform (BDP) cloud migration at risk

Radek Stankiewicz
Google Cloud - Community
7 min read · Dec 1, 2021

As a Strategic Cloud Engineer (SCE), I had the opportunity over the last year to work on several big data platform migrations. Let's call the companies ScrollMarkt, HugeCart, and BrowseNSwipe; all three are global e-commerce companies. With my support, they kicked off their projects and finished a Minimum Viable Product (MVP): an end-to-end use case that meets business needs.

As a Professional Services consultant, I help customers solve their biggest challenges. Some customers are migrating for the first time and want to learn best practices. Others tried to migrate to the cloud in the past and failed. Still others need to speed up the journey. In these projects, my role is not to execute the migration but to set the engineering team up for success by educating, upskilling, and sharing best practices and lessons from other projects.

The best way to learn is to fail in a safe environment, and the MVP is exactly that: the team can experiment, fail, and learn what it takes to succeed in a large-scale migration.

During those three projects, the customers made decisions that put their own migrations at risk. Here are the nine that had the biggest impact.

1. Choosing between lift & shift and modernisation

This is the story of BrowseNSwipe. The company had time constraints and wanted to balance lift & shift (L&S) against modernisation. If, pressed for time, they go with L&S, they are later forced into a second migration to modernise; this hurts the business, because the team keeps working on technical migrations instead of business as usual. Another risk is price: running a long-lived Hadoop cluster with HDFS as storage, or HBase, may be expensive. If they focus on modernisation instead, they risk missing their deadlines, because modernisation requires learning, refactoring, and rewriting, which may be impossible to automate.

BrowseNSwipe chose lift & shift for most of the ETL (Spark SQL on Dataproc) and modernised the analytics part with BigQuery.

Understand your constraints, be smart about what matters, and decide what kind of technical debt you can accept.

2. Onboard others without learning first

It takes time to understand a new platform and learn new tricks. Moving from an on-prem platform like Hadoop to the cloud is not only a change of technology stack (BigQuery, Dataproc, Dataflow) and new tools (governance, automation); it is also a new operating model. Teams are no longer bound by fixed capacity and need to learn that there is a cost behind every action.

HugeCart rushed into onboarding users without understanding how the platform worked, resulting in a big bill and a migration project put on hold. Preventative measures were missing, monitoring wasn't in place, and neither was proper training.
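
One cheap preventative measure on BigQuery is a hard cap on how much a single query may bill. A minimal sketch with the official Python client (the project, table, and limit are made-up placeholders):

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # hypothetical project

    # Refuse to bill more than ~100 GB of scanned data for a single query.
    job_config = bigquery.QueryJobConfig(maximum_bytes_billed=100 * 10**9)

    query = """
        SELECT user_id, COUNT(*) AS events
        FROM `my-project.clickstream.events`  -- hypothetical table
        GROUP BY user_id
    """
    job = client.query(query, job_config=job_config)  # errors out instead of overspending
    print(list(job))

The same config object also supports dry_run=True, which returns the estimated bytes scanned without running (or billing) anything; both are useful guardrails while the first team is still learning the cost model.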

When doing migrations, I plan with the team how to run a proper MVP, so a first team can understand and learn how everything works and later upskill the rest of the company. This first team builds the foundations, learns the platform, documents typical issues, and builds dashboards. It's train-the-trainer.

The HugeCart team is on its way to success now.

3. Leave manual steps in the migration

Part of the assessment I run with customers is generalising their applications and solutions and discussing how to migrate them at scale: for example, how to quickly adapt any Spark job from on-prem Hadoop to Dataproc (usually the easy one), or how to rewrite an on-prem scheduler's jobs for Composer (the hard one).
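
For the easy end of that spectrum, repointing a Spark job at Dataproc is often just a change in how it is submitted. A hedged sketch with the Dataproc Python client (cluster name, bucket, main class, and project are placeholders):

    from google.cloud import dataproc_v1

    region = "europe-west1"
    client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # The same jar that ran on on-prem Hadoop, now pointed at a Dataproc cluster.
    job = {
        "placement": {"cluster_name": "etl-cluster"},
        "spark_job": {
            "main_class": "com.example.DailyAggregates",
            "jar_file_uris": ["gs://my-bucket/jobs/daily-aggregates.jar"],
            "args": ["--date=2021-12-01"],
        },
    }

    operation = client.submit_job_as_operation(
        request={"project_id": "my-project", "region": region, "job": job}
    )
    print(operation.result().driver_output_resource_uri)

The scheduler rewrite is the hard part precisely because there is rarely a one-to-one mapping; see the translation sketch later in this section.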

Most companies bring automation for the following activities:

  • Data migration — for example by using Hadoop distributed copy
  • Data loading — for example by writing custom Spark jobs or Bash scripts
  • SQL translation — for example by using CompilerWorks (recently acquired by Google)
  • Validation — for example by creating a SQL generator (a sketch follows this list) or running our data validator tool
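
For the validation bullet, the generator can be as small as emitting the same count-and-checksum probe for both sides of the migration. A minimal sketch in Python producing BigQuery SQL (the table and key names are invented; Google's Data Validation Tool automates a more thorough version of this):

    def validation_query(table: str, key: str) -> str:
        """Row count plus an order-independent checksum over the key column."""
        return (
            f"SELECT COUNT(*) AS row_count, "
            f"BIT_XOR(FARM_FINGERPRINT(CAST({key} AS STRING))) AS key_checksum "
            f"FROM `{table}`"
        )

    # Run the same probe on source and target, then diff the two result rows.
    for side, table in [("source", "onprem_mirror.orders"), ("target", "analytics.orders")]:
        print(f"-- {side}\n{validation_query(table, 'order_id')}\n")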

For the rest, they write playbooks. Migration time can then be estimated quickly as (effort to migrate one instance × number of instances) ÷ number of team members, plus extra testing time to mitigate the risk of human error.

BrowseNSwipe, the time-constrained company, had successfully verified automation for most of the activities mentioned above, but it was still not enough to meet the demanding deadlines: the manual remainder was either infeasible (it would take too long) or too expensive (it would need too big a team). It was clear that they needed to automate more.

The company built an automatic translation from its proprietary ETL solution to Cloud Composer, and as a result I could put together a migration plan that met the time constraints.
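
The idea generalises: if the legacy scheduler's job definitions are machine-readable, a small translator can render them into Composer DAG files. A deliberately tiny illustration (the job-spec format, names, and commands are all invented; a real translator must cover far more of the source system's semantics):

    # Made-up minimal job spec exported from the legacy scheduler.
    spec = {
        "name": "daily_sales",
        "schedule": "0 4 * * *",
        "steps": [
            {"id": "extract", "command": "hdfs dfs -get /in /tmp/in", "after": []},
            {"id": "transform", "command": "spark-submit transform.py", "after": ["extract"]},
        ],
    }

    # Render an Airflow DAG file, one line at a time.
    lines = [
        "from airflow import DAG",
        "from airflow.operators.bash import BashOperator",
        "import pendulum",
        "",
        f'with DAG(dag_id="{spec["name"]}", schedule_interval="{spec["schedule"]}",',
        '         start_date=pendulum.datetime(2021, 12, 1, tz="UTC"), catchup=False) as dag:',
    ]
    for step in spec["steps"]:
        lines.append(f'    {step["id"]} = BashOperator(task_id="{step["id"]}", '
                     f'bash_command="{step["command"]}")')
    for step in spec["steps"]:
        for upstream in step["after"]:
            lines.append(f'    {upstream} >> {step["id"]}')

    print("\n".join(lines))  # in practice, written to the Composer DAGs bucket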

4. Have the data team migrate everyone

At each company I interacted with the core team that built the original big data stack. It is the team that brought consumers onto the platform, and those consumers depend and rely on it.

Here is the common expectation: the core team migrates everything to the target platform. But the data team is much smaller than the consumer teams combined and cannot scale the migration on its own.

When working with ScrollMarkt, we had to pay special attention to how to scale the migration. The first step was to include the consumer teams in the migration plan: they need to migrate themselves, with the support of the data team.

5. Stick to the 5 Rs (Rehost, Refactor, Revise, Rebuild, Replace) instead of taking shortcuts

HugeCart had a typical ingestion pattern: a clickstream application writes data to a topic (Kafka or Pub/Sub), and a NiFi pipeline reads the messages, validates them, and distributes them across many topics and tables (a fanout pattern).

When planning the migration, the team was pushing to refactor the application and rebuild the pipelines with a new processing framework based on Apache Beam.

That proposal introduced risk, as they would have to learn a new framework.

They took the recommended approach instead: retire NiFi and expand the responsibilities of the app.

Here, the cost of the extra reads from and writes to the messaging platform (not free!) helped drive the decision.
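
What "expanding the responsibilities of the app" can look like, sketched with the Pub/Sub Python client (the topic names and routing rules are invented): the producer validates each event and publishes it straight to the right downstream topic, so no intermediate pipeline has to re-read and re-write every message.

    import json

    from google.cloud import pubsub_v1

    ROUTES = {"click": "clicks", "view": "page-views"}  # event type -> topic

    publisher = pubsub_v1.PublisherClient()

    def publish_event(project: str, raw: bytes) -> None:
        """Validate one clickstream event and route it: the old NiFi fanout, in-app."""
        try:
            event = json.loads(raw)
            topic = ROUTES.get(event.get("type"), "dead-letter")
        except ValueError:
            topic = "dead-letter"  # malformed events are parked, not dropped
        publisher.publish(publisher.topic_path(project, topic), raw).result()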

6. Not focus on automation from day one

The HugeCart team lacked automation. There was a process in place, but in the end someone executed the changes by hand (databases, DDLs, ETL jobs, permissions).

Migration is an excellent opportunity to bring in more automation. Bringing it in from day one will speed up the project; manual steps will slow it down.

Reverse-engineering infrastructure as code from a hand-built environment is possible, and terraformer may help, but by then you have already wasted time.
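
For context, terraformer works roughly like this: point it at a project and a resource list and it emits Terraform files for what already exists. The resource names below are illustrative; check the tool's documentation for the exact set your provider supports:

    terraformer import google --resources=gcs,instances \
        --projects=my-project --regions=europe-west1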

7. Ignore the ecosystem

ScrollMarkt is hermetic and doesn’t trust consultants. It was difficult to get any information — sizing, use cases, or architecture diagrams. It took time to gain trust.

Partners are not there to steal your business. I am there to bring knowledge and consultancy on how to migrate; partner companies specialise in executing data platform migrations.

Partners can scale your migration much faster than your HR team can hire. Once you've set up a migration factory, give them a backlog and let them do the work at scale. Your own team can then focus on oversight, design, or business as usual.

8. Ignore business users in the migration

For the data team, moving to the cloud is a fun project: everyone is learning new skills, you can modernise the platform and erase part of the technical debt. For the business it is not fun: it results in a freeze on business changes.

They will come and put a stick in your wheels, questioning why it is taking so long, or they won't even allow the migration to start.

Bring business users into the MVP; they need to see the value of moving the BDP to the cloud, whether it is faster analytics, shorter time to innovate, or a new ML platform. Once they understand the value, they will help you with the migration.

The business users will also tell you what is important to them. You may get to solve actual pain points and challenges they are facing, and they will give you the perspective on where you need to build extra resiliency into the architecture.

With ScrollMarkt, finding business users is the biggest conundrum we are still trying to solve.

9. Lock in temporary solutions

You may have heard the saying that nothing is more permanent than a temporary solution.

We ran into one at HugeCart: a temporary solution chosen from an offering of temporary solutions.

HugeCart is a happy Kafka user. For the migration, they accepted a managed Kafka offering as an intermediate step, planning to migrate to Pub/Sub once they refactor all the applications.

The problem starts when the team relies on the proprietary connectors that come with the offering. That locks them in and makes the transition to the final architecture more difficult.

In the end, the customer decided not to rely on that tooling, as it was not flexible enough.

DIY — don't engage PSO (or a partner)

All the companies mentioned talked to PSO very late.

BrowseNSwipe started the consultancy project after they had already begun the PoC, putting it at risk of failing. HugeCart started discussions six months before their deadline, resulting in uncertain timelines. ScrollMarkt started the MVP before starting the project with us; we have worked with them since and managed to change course.

Doing it yourself is a brave approach, but a risky one. Talk to me if you are considering it.


Radek Stankiewicz is a Strategic Cloud Engineer at Google Warsaw, helping customers solve their biggest data & analytics challenges using the best of Google.