So, now I need to move my application’s production workload to a new environment. What’s next?
That was me thinking out loud a few months back, when we decided it was time to move our production workload to a new environment.
Our team’s mission was to migrate our website’s workload from one AWS account to another, the reason being account separation.
Along the way, we learned about a few important factors that greatly increased the project’s success rate and ensured that every stakeholder involved understood and supported it.
Everyone’s experience will be different depending on their company culture, team members, and so on. Below are some of the lessons we learned along our project’s journey.
6 Lessons Learned
1. Understand the Values
Our Product Owner may ask:
Why do we spend time doing this?
What do we get out of this?
Why do we need a few months to do this? Isn’t it just a “Ctrl + C” and “Ctrl + V” thing?
Well okay, the last question might be a bit extreme, but you get the point. These are just a few of the questions we might get from non-technical stakeholders when we mention that we’re spending X months on this.
It is important that our stakeholders, whether business stakeholders or development teams, understand the purpose of doing this. Even though we’re simply tasked with the work, we also have to be on board with the reasoning behind it instead of blindly following others’ instructions.
In our case, these are some of the benefits:
- Immediate Benefits: Flexibility and cost savings
- Side Benefits: Control and cleaner architecture
- Lastly: A happier team!
In the past, we faced an overloaded server issue and had to wait a week for the upgrade. User experience was impacted for almost that entire week: the website was significantly slower, and pages timed out. With better infrastructure control, we can react and mitigate such risks much quicker.
2. Understand our Application
Migrating a workload isn’t just lift and shift; there’s a little more involved, and we needed to understand the full picture of our application’s architecture and dependencies.
Be like a user: What I would normally do is first understand the business use case, what the user is trying to do, and then follow the user’s request into the backend to see how it’s all connected. I would also talk to other engineers on the team to get an idea of the things we might be overlooking.
By having a deep understanding of our application, we have higher confidence in making better decisions. Questions like these can be easily answered:
- What does the traffic pattern look like?
- What kind of downtime can we afford?
- When is the best time, and what is the best strategy, to cut over on production?
- Does component A need to boot up before component B?
In our application, we have a cache engine, a full-text search engine, a database, object storage, a load balancer, background cron jobs, and more. It might not be the most complex architecture out there, but it’s enough to easily break things if we overlook them.
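Boot-order questions like the one above can be made explicit with a small dependency graph. Below is a minimal Python sketch that derives a safe startup order using the standard library’s `graphlib`; the component names mirror our stack, but the exact edges are illustrative assumptions, not our actual configuration:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each component lists what must be up first.
dependencies = {
    "database": set(),
    "cache_engine": set(),
    "search_engine": set(),
    "object_storage": set(),
    "app_server": {"database", "cache_engine", "search_engine", "object_storage"},
    "load_balancer": {"app_server"},
    "cron_jobs": {"app_server"},
}

# static_order() yields components in an order that respects every edge.
boot_order = list(TopologicalSorter(dependencies).static_order())
print(boot_order)
```

Writing the dependencies down like this also doubles as documentation: the same map answers both “what do I start first?” and “what breaks if X is down?”.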
Generally, I prefer to document my understanding of the application using online diagramming tools to illustrate the architecture. Instead of spending a few hours trying to explain what we’re doing, it’s easier to just show the diagram to other stakeholders during a discussion. Trust me, we saved a lot of emails and meetings because of it.
As I always believe — “A picture is worth a thousand words”
3. Improve When Necessary
It’s easy to fall into the trap of migrating the application exactly as it is designed today. However, technology changes, and there might be better AWS services that could replace some of the existing workloads.
If the effort is low and the gain is huge, try it: In our migration journey, we learned that AWS has a new database service called AWS Aurora Serverless. We tested it out, learned how the underlying components work, and were amazed; after comparing cost, performance, scalability, and maintainability, we decided to switch over to AWS Aurora Serverless.
4. Planning is the Key to Success
At first, I didn’t believe in it. Why spend so much time planning a migration? Why not just do it and figure things out along the way? But the thing is, there are many uncertainties involved, and we can’t figure them all out on the fly.
Plan with the information we have in hand: We should at least plan with the information we already have, especially since the task doesn’t involve just me, but several engineers, the Product Owner, QA, and other teams.
Checkpoint with a clear goal: A clear checkpoint of what we want to achieve in each phase of the migration is helpful; it helps not only to communicate the plan to stakeholders but also to increase motivation each time a checkpoint is reached.
Some other relevant items to look at while planning the migration:
- Clear roles & responsibilities
- Reasonable timeline with sufficient buffer time
- Get a feeling for where bottlenecks might occur
- Get the buy-in from the Product Owner and development team
Understand your success criteria: Define a checklist of success criteria that must be met for a successful production cutover, so we know what “done” means. Some of the items on our checklist include:
- Data is migrated successfully
- User latency, error rates, and compute resource utilization are below agreed thresholds
- Updated DNS records are propagated across the world (we used a DNS A record for our production cutover)
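To make the DNS criterion concrete, here is a rough Python sketch, using only the standard library, of a check that the updated A record is visible to the local resolver. The function name and IPs are illustrative; a real propagation check would also query several public resolvers in different regions, since what you see depends on caching and record TTLs:

```python
import socket

def dns_points_to(hostname: str, expected_ips: set[str]) -> bool:
    """Return True if the hostname currently resolves to one of the
    expected IPs, i.e. the updated A record is visible to this resolver."""
    try:
        # gethostbyname_ex returns (hostname, aliases, ipaddrlist)
        _, _, resolved = socket.gethostbyname_ex(hostname)
    except socket.gaierror:
        return False
    return any(ip in expected_ips for ip in resolved)

# Hypothetical usage: has the site cut over to the new load balancer IPs?
# dns_points_to("www.example.com", {"203.0.113.10", "203.0.113.11"})
```

Running a check like this in a loop during the cutover gives an objective signal for the checklist item instead of relying on “it looks fine from my machine”.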
5. Communicate, Communicate, Communicate
Communication might be one of the most undervalued items on this list. I truly believe that most issues and delays come from miscommunication.
In our project, we operate using the Scrum methodology in our day-to-day work. The daily scrum meeting is a good opportunity to communicate where we stand on the migration project.
Over-communicate it, no surprises: Even though this is a purely technical project, we also communicate it to our Product Owner and any other relevant parties. This matters especially when there’s a disruption you expect to happen, say a database reset on the QA environment during which no testing can be performed, which might delay other deliveries. Make sure everyone is well aware of what’s going on, and let others know about any planned activities that might affect them. It’s better to over-communicate than under-communicate; that way, no one gets surprised.
Inform and be transparent: Ensure everyone has enough information about what will be changing, especially for the production cutover, and document the procedures. Some of the items we communicated for our production cutover:
- When is it going to happen?
- What will be performed? E.g., data migration, traffic cutover, and so on.
- What does the user experience look like during the production cutover?
- What contingency plans are in place in case of failure, and what are the consequences?
- How do we ensure the changes are reflected for all users?
- What are the plans in place to closely monitor the application’s health after the production cutover?
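As a sketch of that last point, post-cutover monitoring can be reduced to sampling a health probe and comparing latency and error rate against thresholds. This is an illustrative Python outline, not our actual tooling; in practice the probe would typically be an HTTP GET against the application’s health endpoint, and the thresholds would come from the success criteria above:

```python
import time

def healthy(probe, latency_threshold_s=1.0, error_rate_threshold=0.01, samples=10):
    """Sample a health probe and check latency and error rate against
    thresholds. `probe` is any callable that raises on failure,
    e.g. an HTTP GET against the app's /health endpoint."""
    errors = 0
    latencies = []
    for _ in range(samples):
        start = time.monotonic()
        try:
            probe()
        except Exception:
            errors += 1
            continue
        latencies.append(time.monotonic() - start)
    if errors / samples > error_rate_threshold:
        return False
    return all(latency <= latency_threshold_s for latency in latencies)
```

Taking the probe as a parameter keeps the check testable: you can exercise the threshold logic with a fake probe before pointing it at production.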
6. Plan for the “What If” Scenario
Things never go entirely as planned; something unexpected will happen despite walking through the production cutover activities 10,000 times. Embrace that failures will happen: we can’t fully prevent them, but we can mitigate the risk as much as possible.
Have a fallback plan in place: This will be useful when the production cutover doesn’t go as planned and you need to fall back to the previous infrastructure setup.
Have sufficient buffer time for the unexpected: There’s always something we only discover we need to do in the middle of the migration, and a reasonable buffer helps when that happens.
It’s been a fun and challenging journey for my team and for me personally. On top of that, it’s equally rewarding in the end once it’s all done! We can’t know every single thing we’ll need to do well in advance, so be open-minded and learn as you go.
Certain soft skills are just as important as technical skills: communication, leadership, project management, and so on.
Most importantly, learn from the mistakes. There are a few things in the migration project that I wish I had done differently. However, that’s a story for another time.