From Chaos to Recovery: How We Restored Our AWS Microservice After Accidental Deletion at Dolap

Berkay Bilgin · Published in Trendyol Tech · 8 min read · Jun 23, 2023

Hello! In this article, I will give an insightful account of the temporary shutdown and subsequent restoration of a high-traffic microservice owned by our team.

I would like to share an intriguing incident involving the Comment Service, a crucial component managed by the Moderation & Fraud team at Dolap, a thriving C2C (consumer-to-consumer) marketplace platform known for second-hand shopping. On the platform, users rely heavily on the comment feature beneath product details to communicate with one another. Since there is no alternative communication method, the comment feature plays a pivotal role in the Dolap application. Allow me to recount the tale that unfolded around this vital service.

Product Detail (Dolap App)

First, let me provide some information about the relevant project and working environment.

Our focal project is the Comment Service, a fundamental element of the Dolap application entrusted with comment management. Built on Java and Spring Boot, the service runs on AWS infrastructure. PostgreSQL on Amazon RDS serves as the database, while Redis on Amazon ElastiCache provides caching.
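
To make the cache's role concrete, here is a minimal sketch of what a cached read path might look like in a Spring Boot service of this kind; the class and method names are illustrative, not Dolap's actual code.

```java
import java.util.List;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

// Illustrative domain types standing in for the real ones.
record Comment(long id, long productId, String text) {}

interface CommentRepository {
    List<Comment> findByProductId(long productId);
}

@Service
public class CommentService {

    private final CommentRepository commentRepository;

    public CommentService(CommentRepository commentRepository) {
        this.commentRepository = commentRepository;
    }

    // With spring-boot-starter-data-redis and @EnableCaching configured,
    // a cache hit is answered from Redis; only a miss reaches PostgreSQL.
    @Cacheable(cacheNames = "comments", key = "#productId")
    public List<Comment> getCommentsForProduct(long productId) {
        return commentRepository.findByProductId(productId);
    }
}
```

While Redis is healthy, repeated reads never touch PostgreSQL; that detail becomes important later in this story.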

To manage our project’s infrastructure, we rely on the AWS CDK (Cloud Development Kit). This tool lets us create and manage the various components of the AWS ecosystem from code. With the AWS CDK, we orchestrate essential services including ECS (Elastic Container Service), ECR (Elastic Container Registry), RDS (Relational Database Service), ElastiCache, network configurations, and CloudWatch for monitoring and observability.

For more information, you can visit the AWS CDK documentation.
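
As a rough illustration, the skeleton of such a stack might look like the following (AWS CDK v2, Java). The construct IDs and sizing are ours for the example; ECR, the RDS instance, alarms, and dashboards are omitted here for brevity.

```java
import software.amazon.awscdk.App;
import software.amazon.awscdk.Stack;
import software.amazon.awscdk.StackProps;
import software.amazon.awscdk.services.ec2.Vpc;
import software.amazon.awscdk.services.ecs.Cluster;
import software.amazon.awscdk.services.elasticache.CfnCacheCluster;
import software.constructs.Construct;

public class CommentServiceStack extends Stack {

    public CommentServiceStack(final Construct scope, final String id, final StackProps props) {
        super(scope, id, props);

        // Shared network for ECS, RDS, and ElastiCache.
        Vpc vpc = Vpc.Builder.create(this, "CommentVpc").maxAzs(2).build();

        // ECS cluster that hosts the Comment Service tasks.
        Cluster.Builder.create(this, "CommentCluster").vpc(vpc).build();

        // Redis on ElastiCache (subnet group and security groups omitted).
        CfnCacheCluster cache = CfnCacheCluster.Builder.create(this, "CommentCache")
                .engine("redis")
                .cacheNodeType("cache.r6g.large")
                .numCacheNodes(1)
                .build();
    }

    public static void main(final String[] args) {
        App app = new App();
        new CommentServiceStack(app, "CommentServiceStack", null);
        app.synth();
    }
}
```

A single cdk deploy materializes every one of these resources, which is exactly what made the recovery described below possible.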

Now, allow me to guide you through the captivating narrative of the events that transpired within the Comment Service, followed by our prompt and effective resolution to restore order amidst the chaos.

A Case Study: The Impact of Human Error on an Incident

In a critical work scenario, the consequences of a single human error can have far-reaching implications. Consider a recent incident involving the accidental deletion of a CloudFormation stack.

Within the final 15 minutes of the workday, the individual in question found themselves juggling responsibilities, attempting to simultaneously engage in a meeting and review data pertaining to the Comment Service on the CloudFormation screen. Amidst the competing demands for attention, a momentary lapse resulted in an unintended click on the DELETE button, swiftly followed by confirmation of the irreversible action.

The aftermath of this inadvertent mistake unfolded rapidly, plunging the situation into chaos. Deleting a CloudFormation stack is an irreversible operation, leaving no room for corrective measures or reversals. The consequences of this error were now poised to impact various aspects of the system and its users.

Everything is at Stake

Immediately after clicking the DELETE button, all the services associated with the Comment Service on AWS started to be deleted. The cluster and service on ECS, the log configurations, alarms and monitoring dashboards in CloudWatch, the Docker images stored in ECR, and all the network configurations… Everything began to vanish.

This deletion process was going to cost us dearly. Deleting resources on AWS takes time, and CloudFormation displayed the deletions one by one. Helplessly, I watched as all the resources tied to the Comment Service were deleted, unable to take any action.

Redis Deletion and the Dimensions of Chaos

During those tension-filled moments, our vigilant monitoring of the deletion process on the CloudFormation screen brought an alarming discovery: the resources tied to Redis on ElastiCache were next in line for deletion. This compounded the panic, as we realized that not only the Comment Service itself but also the crucial caching layer provided by Redis was being eradicated.

The gravity of the situation prompted swift action. Recognizing the urgency, we immediately rallied the entire team for an emergency meeting. Bringing together our collective expertise, devising a strategy, and mitigating the rapidly unfolding crisis took precedence.

Redis contained approximately 35GB of data, and it played a crucial role in the performance of our application. The deletion of Redis was turning into a major disaster for us.

The potential loss of the database loomed as an even greater catastrophe, threatening to plunge the organization into an abyss of complex and challenging recovery.

In a stroke of fortune, the proactive measure of enabling deletion protection on the PostgreSQL databases hosted on RDS (Relational Database Service) played a pivotal role. This protective feature acted as a safeguard, shielding the databases from the deletion process and preventing them from being irreversibly lost.

AWS -> RDS -> Databases -> Select Database -> Modify
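
The console path above is one way to flip the switch; since the infrastructure lives in CDK, the same safeguard can also be declared in code. A minimal sketch, assuming a plain RDS PostgreSQL instance defined inside a stack like the one shown earlier:

```java
import software.amazon.awscdk.services.ec2.Vpc;
import software.amazon.awscdk.services.rds.DatabaseInstance;
import software.amazon.awscdk.services.rds.DatabaseInstanceEngine;
import software.amazon.awscdk.services.rds.PostgresEngineVersion;
import software.amazon.awscdk.services.rds.PostgresInstanceEngineProps;

// Inside the stack constructor; vpc is the stack's shared VPC.
DatabaseInstance db = DatabaseInstance.Builder.create(this, "CommentDb")
        .engine(DatabaseInstanceEngine.postgres(
                PostgresInstanceEngineProps.builder()
                        .version(PostgresEngineVersion.VER_14)
                        .build()))
        .vpc(vpc)
        // Rejects any DeleteDBInstance call, including the one issued
        // while a CloudFormation stack is being deleted.
        .deletionProtection(true)
        .build();
```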

While the database remained secure, the deletion of all AWS services associated with the Comment Service, especially Redis, posed a significant challenge for us.

Escaping Chaos Through Teamwork

Recognizing the urgency of the situation, the entire team convened to evaluate swift solutions and address the problem at hand. Leveraging the power of Slack as the platform for internal communication, the issue was promptly shared in relevant channels, creating a central hub for collaboration, coordination, and information dissemination.

Given the nature of the deletion process within CloudFormation, the unfortunate reality was that it couldn’t be halted or reversed. We therefore had to grapple with the challenging reality of waiting for alerts to signal the completion of the deletion process. Patience became a virtue as we navigated the tense and uncertain waiting period.

The fact that our infrastructure was managed through a code-based CDK (Cloud Development Kit) project presented both advantages and challenges in the face of this incident. Leveraging this approach, the team was able to respond swiftly and decisively, redeploying the infrastructure project to AWS with relative ease.

The deployment process, which took approximately 20 minutes, marked a race against time to recreate all the essential components that had been inadvertently deleted. With each passing minute, the team worked diligently to rebuild the network configurations, log configurations, monitoring dashboards, alarms, ECS cluster and service, Redis, and other critical components that formed the backbone of the application.

With the complete deletion of Redis, the consequences swiftly manifested as the application’s traffic shifted to the database without the crucial support of Redis for caching. The implications of this shift became immediately apparent, as the database servers struggled to cope with the increased load generated by read-and-write operations.

The limitations of the database servers quickly became evident as they strained under the weight of the traffic. The absence of Redis, which had previously alleviated the burden on the database, now led to a cascade of issues. Timeout errors and connection failures began to plague the service, hindering users from accessing the application and impeding their ability to interact with its features.

Recognizing the urgency to address the strain on the database servers, the team devised a plan of action to alleviate the traffic load and restore stability to the service. Although upgrading the existing database instance to a more powerful machine proved challenging, alternative measures were explored.

To distribute the load and enhance the database’s capacity, the team decided to augment the existing infrastructure by adding two additional reader instances to the database cluster. By introducing these new instances alongside the existing writer and reader instances, we aimed to spread read traffic across more machines and scale horizontally.
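
The exact database topology isn't spelled out here, so as a hedged sketch for a plain RDS PostgreSQL setup, adding the two readers in CDK could look roughly like this (on Aurora, the cluster's reader instances play the same role); db and vpc are the instance and network from the earlier snippet:

```java
import java.util.List;
import software.amazon.awscdk.services.ec2.InstanceClass;
import software.amazon.awscdk.services.ec2.InstanceSize;
import software.amazon.awscdk.services.ec2.InstanceType;
import software.amazon.awscdk.services.rds.DatabaseInstanceReadReplica;

// Inside the stack constructor: two read replicas to absorb read traffic
// while the rebuilt cache warms up.
for (String id : List.of("CommentDbReader1", "CommentDbReader2")) {
    DatabaseInstanceReadReplica.Builder.create(this, id)
            .sourceDatabaseInstance(db) // the existing writer instance
            .instanceType(InstanceType.of(InstanceClass.BURSTABLE3, InstanceSize.LARGE))
            .vpc(vpc)
            .build();
}
```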

Insights Gained for the Future

As part of our company culture, we are not afraid of making mistakes; instead, we learn from them. This incident, caused by a simple human error, taught us several valuable lessons, and we identified action items to prevent such errors from recurring.

  • We realized that our CloudFormation permissions were too permissive, which left us vulnerable to errors. Therefore, we restricted deletion permissions in CloudFormation to specific individuals such as team leads and the SRE team (see the sketch after this list).
  • We understood the importance of protecting critical services like Redis and the need to add deletion protection wherever it is available.
  • This incident was an unplanned chaos test, and it taught us the importance of staying calm and acting quickly in chaotic situations.
  • We once again recognized how critical infrastructure-as-code projects are. If we had not had our infrastructure-as-code project and had needed to recreate the infrastructure manually through the AWS console, resolving this incident could have taken hours.
  • There were instances where we had performed manual operations in the AWS console without updating our infrastructure-as-code project. We have since shifted our focus to keeping the infrastructure-as-code project up to date and performing all operations exclusively through the CDK project.
  • One positive outcome of this incident was discovering areas where we could still ensure fast responses even with an empty Redis. With this awareness, we completely disabled some caches and reduced the TTL (time-to-live) of others. As a result, we saved costs by using smaller and fewer instances on the ElastiCache side.
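
Several of these action items can be expressed directly in CDK and IAM. The following is a sketch under the assumptions of the earlier snippets, not our exact configuration; each piece is annotated with where it would live:

```java
import java.util.List;
import software.amazon.awscdk.App;
import software.amazon.awscdk.RemovalPolicy;
import software.amazon.awscdk.StackProps;
import software.amazon.awscdk.services.iam.Effect;
import software.amazon.awscdk.services.iam.PolicyStatement;

// 1) In the CDK app entry point: termination protection makes the console's
//    DELETE action fail until the flag is explicitly disabled on the stack.
App app = new App();
new CommentServiceStack(app, "CommentServiceStack",
        StackProps.builder().terminationProtection(true).build());

// 2) Inside the stack constructor, right after defining the ElastiCache
//    construct: keep the cache alive even if the stack itself is deleted.
cache.applyRemovalPolicy(RemovalPolicy.RETAIN);

// 3) Attached to developer IAM roles (not to leads/SRE): deny stack
//    deletion to everyone outside the permitted group.
PolicyStatement denyStackDelete = PolicyStatement.Builder.create()
        .effect(Effect.DENY)
        .actions(List.of("cloudformation:DeleteStack"))
        .resources(List.of("*"))
        .build();
```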

The Impact of the Incident

  • For approximately 20 minutes, the comment services (comment write/read) in the Dolap application were unavailable.
  • Throughout the incident, when viewing products in the application, users encountered the error message “An error occurred while loading the information,” resulting in a negative user experience.

    (We did not want a problem in the Comment Service to surface as an error on the product detail page. Therefore, we created a task to add a feature flag that can completely disable the comment feature in case of an error. This way, we will be able to deactivate the Comment Service and provide a better user experience when a problem arises. A sketch of such a flag follows this list.)
  • The deletion of log groups in CloudWatch resulted in the loss of all logs. Since we had set the log retention to 3 days, we lost the past 3 days of logs.
  • Because the new Redis cache started empty, we needed to run a few extra instances on the database side. As a result, we paid an extra $100 for the database for around 3 days.
  • We decided to run the service with 20 tasks instead of the usual 6 until everything stabilized. As a result, we paid an extra $110 for ECS.
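
For illustration, here is a minimal sketch of such a kill switch in a Spring Boot controller, building on the hypothetical CommentService from the beginning of this article; the property name and endpoint are ours, not Dolap's:

```java
import java.util.List;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class CommentController {

    // Flip comments.enabled=false (e.g. via an environment variable) to hide
    // comments gracefully instead of surfacing an error on product detail.
    @Value("${comments.enabled:true}")
    private boolean commentsEnabled;

    private final CommentService commentService;

    public CommentController(CommentService commentService) {
        this.commentService = commentService;
    }

    @GetMapping("/products/{productId}/comments")
    public List<Comment> getComments(@PathVariable("productId") long productId) {
        if (!commentsEnabled) {
            // Feature disabled: clients render an empty comment section
            // rather than an error state.
            return List.of();
        }
        return commentService.getCommentsForProduct(productId);
    }
}
```

A production version would read the flag from a dynamically refreshable source (a config service or remote flag store) so it can be flipped without a redeploy.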

The End of the Incident

The lessons learned and the measures taken after an incident in an AWS-based project provide valuable experience for preventing similar issues and building a more robust infrastructure. This incident demonstrated how a single human error can have a significant impact, and how quick action can resolve the problem. Infrastructure-as-code (IaC) tools like CDK and Terraform made it possible to rebuild the project’s infrastructure quickly and safely.

Configuring infrastructure with code enhances control over operations and enables quick recovery. Protection policies such as RDS’s deletion protection feature ensure data safety by shielding critical services from accidental deletion.

By applying these lessons to your own AWS projects, you can build more resilient systems and respond quickly and effectively to potential issues.

Thank you for reading.

If you want to be part of a team that tries new technologies, and you want to experience a new challenge every day, come join us.
