2021: ManoMano IT Odyssey - Part III
Cloud Mission
After Antoine told us about the Dawn of ManoMano IT and Laurent about our Architecture Spaceship, I will continue by talking about our new Cloud Mission. I am Cyril and I joined ManoMano mid-2019 as a Solution Architect.
Back in early 2019, we were evolving very fast but were still tied to an old-fashioned hosting company. You know the feeling: you get a lot of visitors one day, you didn't plan for it, and you know you have only one spare server left and it will not be enough. So you watch the graphs and wait, crossing your fingers. Auto scaling or automation were out of the question.
In the whole company, the move to the Cloud became necessary for multiple reasons: scalability, ownership, automation, redundancy, innovation, …
At this time, Amazon Web Services was the leader in terms of data centers, services, innovation and adoption, so our choice was simple. As the project involved 40 applications, plus the migration of our databases and message broker, almost 20 teams worked on it and it had high visibility in the internal ManoMano universe.
How do you plan a migration from an on-premise environment to the Cloud? Even though this subject has been covered many times in posts across the Internet, I will explain how we did it at ManoMano with minimal downtime.
The idea was quite simple: first move to the Cloud by doing a lift-and-shift (without reworking any application), and then embrace the power of AWS and explore new services.
For this first step, the migration took several months. After analyzing the situation, the SRE (Site Reliability Engineering) team developed a three-month migration plan: double run, testing, DNS switching and a lot of monitoring.
Some tooling was also necessary to build, deploy and monitor our applications on AWS services, so teams worked on rebuilding our pipeline to fit our cloud objectives: fast and simple.
Then the time came to migrate the first microservice to an EC2 instance. After checking in non-production environments that everything was fine, the switch was done and we learned a lot from it.
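To give a purely illustrative idea of that first step, here is a minimal boto3 sketch of launching an instance from a prepared image. The AMI, subnet and instance names below are hypothetical, not the ones we actually used.

```python
import boto3

# Hypothetical IDs and names: none of these values come from the actual migration.
ec2 = boto3.client("ec2", region_name="eu-west-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",      # image prepared for the microservice (hypothetical)
    InstanceType="t3.medium",
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0123456789abcdef0",  # private subnet of the target VPC (hypothetical)
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "first-microservice"}],
    }],
)
instance_id = response["Instances"][0]["InstanceId"]

# Wait until the instance is running before pointing any traffic at it.
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
print(f"Instance {instance_id} is running")
```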
After that, we migrated the read/write databases: stop the write operations on the database, sync the data to the new instance, then resume the write operations there. Once these two important steps were done, we migrated almost all the remaining components, improving our knowledge of AWS and strengthening our operating procedures. We applied a progressive approach, migrating applications gradually to avoid a big bang and mitigate risk.
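For the curious, here is a minimal sketch of that write-freeze sequence, assuming a self-managed MySQL primary replicating to the new instance and using PyMySQL. Hosts and credentials are hypothetical, and our real runbook was more involved than this.

```python
import time
import pymysql

# Hypothetical hosts and credentials, for illustration only.
OLD = dict(host="db-old.internal", user="admin", password="***")
NEW = dict(host="db-new.aws.internal", user="admin", password="***")

old_db = pymysql.connect(**OLD)
new_db = pymysql.connect(**NEW)

with old_db.cursor() as cur:
    # 1. Stop write operations on the old primary.
    cur.execute("SET GLOBAL read_only = ON")

with new_db.cursor(pymysql.cursors.DictCursor) as cur:
    # 2. Wait for the new instance (a replica of the old one) to catch up.
    while True:
        cur.execute("SHOW SLAVE STATUS")
        status = cur.fetchone()
        if status and status["Seconds_Behind_Master"] == 0:
            break
        time.sleep(1)
    # 3. Promote the new instance and resume writes there.
    cur.execute("STOP SLAVE")
    cur.execute("SET GLOBAL read_only = OFF")

# From here, applications are repointed (configuration or DNS) at the new instance.
```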
We also took care of the Events Messaging System, because in some cases production events could not be lost. We had to identify all producers and consumers to prepare a migration order: migrate the producers, wait for the consumers to consume all remaining events, then migrate the consumers to the new queues.
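As an illustration of the drain step, here is a small sketch that waits for the legacy queues to empty before their consumers are moved. It assumes, purely for the example, a RabbitMQ broker reached through pika, with hypothetical queue names.

```python
import time
import pika

# Assumption for illustration: the legacy broker is RabbitMQ, reached via pika.
# Queue names are hypothetical.
LEGACY_QUEUES = ["orders.events", "sellers.events"]

connection = pika.BlockingConnection(pika.ConnectionParameters(host="legacy-broker.internal"))
channel = connection.channel()

def queue_depth(name: str) -> int:
    # passive=True only inspects the queue; it never creates it.
    result = channel.queue_declare(queue=name, passive=True)
    return result.method.message_count

# Producers already publish to the new queues, so the legacy queues only drain.
for queue in LEGACY_QUEUES:
    while (remaining := queue_depth(queue)) > 0:
        print(f"{queue}: {remaining} events left to consume")
        time.sleep(5)
    print(f"{queue} is empty, its consumers can now be migrated")

connection.close()
```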
The last stage of this migration was to move our 3 monoliths. As it is difficult to stop them for a while, to change their configuration dynamically, or even to test all their processes, we decided on a 3-hour slot during which the Marketplace would be offline for the switch (migrate the monoliths and switch DNS).
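The DNS switch itself is conceptually a single record change. Assuming the zone is hosted on Route 53 (the zone ID, domain and target below are hypothetical), it could look like this with boto3:

```python
import boto3

# Hypothetical zone ID, record name and target, for illustration only.
route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0123456789ABCDEFGHIJ",
    ChangeBatch={
        "Comment": "Point the Marketplace at the new AWS load balancer",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "www.example.com.",
                "Type": "CNAME",
                "TTL": 60,  # a low TTL lets the switch (and a rollback) propagate quickly
                "ResourceRecords": [{"Value": "marketplace-alb.eu-west-1.elb.amazonaws.com"}],
            },
        }],
    },
)
```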
It was done during the night of 22nd October 2019, with a squad of 30 people from two countries, dedicated to operating, monitoring and celebrating. And it was done, with our Marketplace unavailable for only 4 hours.
The following week was not so easy: we found some holes in our perfect migration plan. Email exchanges between Customers and Sellers were not sent for a week… Some filtered IPs were hardcoded… Login issues… A lack of communication to Sellers about the application DNS change… But we saw that all Manas and Manos were very reactive and united in fixing these issues.
At the same time as we moved to the Cloud, we changed our mindset about deployment. Training was given to curious developers and the culture evolved to better understand the power of the Cloud. To reduce Time To Market and give more autonomy to our Feature Teams, all the tools needed to deploy and monitor applications were created or improved during this period. Feature Teams were given access to database migrations, multi-branch deployments, application configuration and instance resource declaration.
As we were now entirely responsible for the proper functioning of the platform, and therefore of the instances behind it, our teams developed a new tool to declare and manage incidents, called FireFighter.
The switch to Infrastructure as Code was made with Terraform and Ansible at the beginning (I invite you to read this post). With the goal of deploying fast and consistently, we based our applications on Amazon Machine Images. The idea of packing the code into a prepared base image to build a new image that you then deploy across several environments was quickly adopted by the teams. Not yet containers, but a big step in technology and culture.
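The baking step can be illustrated with a small boto3 sketch: create an AMI from an instance already provisioned with the application code, wait for it, then reference the same image in every environment. The instance ID and names are hypothetical, and our real pipeline had its own tooling around Terraform and Ansible.

```python
import boto3

# Hypothetical instance ID and naming, for illustration only.
ec2 = boto3.client("ec2", region_name="eu-west-1")

# 1. Bake an image from an instance already provisioned with the application code.
image = ec2.create_image(
    InstanceId="i-0123456789abcdef0",
    Name="checkout-service-2019-10-22",
    Description="Checkout service baked for all environments",
)
image_id = image["ImageId"]

# 2. Wait for the image to be usable.
ec2.get_waiter("image_available").wait(ImageIds=[image_id])

# 3. The same image ID can then be referenced by every environment,
#    so staging and production run identical bits.
print(f"AMI {image_id} is ready to deploy")
```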
We were now able to redeploy our entire vessel fleet in a couple of minutes.
IaaS (Infrastructure as a Service) is far better than on-premise, but we slowly realized that we added no value by operating some of our services ourselves, such as databases, file storage and logs.
We naturally migrated from NFS file storage to S3 buckets. We switched from MySQL databases managed by multiple SREs to RDS databases managed by a single DRE.
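Moving files from NFS to S3 is essentially a copy with a key-naming convention. A minimal sketch, with a hypothetical mount point and bucket name:

```python
import os
import boto3

# Hypothetical mount point and bucket name, for illustration only.
NFS_ROOT = "/mnt/nfs/product-images"
BUCKET = "manomano-example-assets"

s3 = boto3.client("s3")

# Walk the NFS share and copy every file to S3, keeping the relative path as the key.
for dirpath, _dirnames, filenames in os.walk(NFS_ROOT):
    for filename in filenames:
        local_path = os.path.join(dirpath, filename)
        key = os.path.relpath(local_path, NFS_ROOT)
        s3.upload_file(local_path, BUCKET, key)
        print(f"Uploaded {key}")
```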
We also entrusted our logs to Datadog. Even if it is hard to change habits in a company, people quickly discovered that the new tools deliver more value and features. And if you ask developers today whether they prefer our new SaaS data analytics platform or our old logging system, I know the answer: they won't ask to switch back.
We were more and more focused on Applications and Data, to deliver a better Product and a better Experience to our Users. This turnaround at ManoMano also opened the door to SaaS options. Before, we were proud to develop everything ourselves, but we are learning to delegate our supporting domains to others and to strengthen our expertise on core domains. When you want to go to Mars, you don't build the toilet light bulb yourself.
So we were now at a high level of Space Expertise, with a scalable system and dedicated infrastructure to support our needs, but new issues appeared. The multiplication of services meant more flows to manage in our security system and communication gaps between growing teams, and our SRE teams were more and more consumed by the orchestration of all these components.
But this story will be part of another post!