Shareplex real-time observability using AWS Infrastructure | Case Study

Samir Cano Iriarte
Condor Labs Engineering
7 min read · Apr 8, 2020

One of the biggest challenges, and a crucial part, of implementing SRE in a company is increasing the visibility of each service by providing a way to continuously see the status of our applications. This allows us to improve the reliability of our product, make decisions faster, and raise alarms when something is not working as expected.

First, it is essential to understand how our architecture is defined and why it forces us to look for solutions that keep our data synchronized across geographically distant data centers and that also give us a way to check that the system is working as expected.

In our company, we understand that any system is prone to failure at any time, and these failures should not hinder the user experience of our customers. That’s why we implement an active/active architecture: we have two geographically separated data centers, and each one hosts a redundant copy of each of our micro-services. If any node fails, or whenever else it is necessary, we can redirect all our traffic to the healthy node. This increases our tolerance for disasters.

In our architecture, we have two main Oracle databases (one in each data center) actively used by our applications. This creates the need for a solution that keeps the information synchronized as changes are made in each database. This is where Shareplex comes into play as a complete replication solution. We use Shareplex as an asynchronous data replication system: it detects the changes made in a source database and, through an internal queuing system, sends them to the destination database. This can be appreciated in the following diagram:

Although Shareplex satisfies the main need to replicate information across multiple instances of our databases, it does not have an easy-to-access, user-friendly interface for monitoring the progress and status of synchronization. Quest (the company responsible for Shareplex) offers paid tools to solve this, but we concluded that they did not meet our needs and would not give us the desired degree of freedom. On the contrary, we would be subject to the restrictions of the tool.

In our company, we believe that software should adjust and meet our needs, not the other way around. Therefore, instead of investing money in a monitoring tool, we decided to create one that would allow us to have the results we wanted and resolve our Shareplex visibility issue.

More specifically, we needed a solution that would give us real-time alarms about the synchronization status between the instances of our main Oracle database and alert us if an anomaly occurred in their health status. It would also let us collect information to better understand how the system evolves over time as modifications are made to it.

We found that some services provided by the AWS platform were a quick, effective and simple way to solve our problem. They let us create a cloud infrastructure in a short time, without having to worry about deploying new on-premise servers, their future maintenance, or meeting basic security requirements (already covered by AWS). Based on that premise, we decided to implement our solution in the cloud using AWS.

At this point, it’s worth mentioning that what we accomplished by creating a monitoring tool not only benefited us but also had a great impact on one of our main clients. Let's call it E Corp (EC for short), after the company in one of our favorite series, Mr. Robot. From now on, every time we want to refer to our client, we will do so under this alias.

To start, AWS EC2 allowed us, with just a couple of clicks, to launch the compute instance where our monitoring application’s main API would be deployed and to which we would send all the data and performance statistics captured by our collector server. The result is shown below:

Just like that, we had a way to process and interpret the information coming from our DB server.
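
As a rough illustration of this flow, here is a minimal sketch of how a collector could package one statistics sample and send it to the monitoring API. The endpoint URL, the payload fields and the function names are all assumptions made for the example; the actual API contract is not described here.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Hypothetical endpoint: the real address of the monitoring API is not given.
API_URL = "https://monitoring.example.com/stats"


def build_stats_request(queue_name, backlog, source_db, api_url=API_URL):
    """Build (but do not send) a POST request carrying one statistics sample."""
    payload = {
        "queue": queue_name,      # hypothetical field: replication queue name
        "backlog": backlog,       # hypothetical field: pending messages
        "source": source_db,      # hypothetical field: originating database
        "collected_at": datetime.now(timezone.utc).isoformat(),
    }
    return urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_stats(request, timeout=5):
    """Send the prepared request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping the request construction separate from the network call makes the payload easy to inspect and test without a running API.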

Although having a way to monitor our system increases visibility, this is only valuable when, based on the information collected, we can make decisions that increase reliability and have a positive impact on the SLA. Therefore, we needed to present the data in a way that was useful in the decision-making process. For this, we developed a website whose static content is hosted in an AWS S3 bucket and which shows a dashboard with the most relevant information, so that the stakeholders involved can access it easily:

Thanks to AWS static website hosting, we didn’t have to worry about creating a server for our website. We just upload its content to the bucket, which reduced our development time and effort.

We applied the KISS principle to the interface of our website. The first version looks like this:

We can easily restrict or allow access to certain kinds of data based on the needs of the end-user. We can even customize the whole interface according to the changing requirements of our software. Also, as you can see, it is easy for the user to tell when something is wrong: a red panel indicates a delay in some queues of the replication process, while green means that everything is good and you can go for a sandwich.
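
The red/green logic can be sketched in a few lines. This is only an illustration under assumptions: the threshold value, the `QueueSample` type and the queue names are hypothetical, and the real dashboard may use different criteria to decide that a queue is delayed.

```python
from dataclasses import dataclass

# Hypothetical threshold: a backlog above this value marks a queue as delayed.
DELAY_THRESHOLD = 1000


@dataclass
class QueueSample:
    name: str      # replication queue name (illustrative)
    backlog: int   # messages still waiting to be replicated


def panel_color(sample, threshold=DELAY_THRESHOLD):
    """Return "red" when the queue is delayed, "green" otherwise."""
    return "red" if sample.backlog > threshold else "green"


def dashboard_colors(samples, threshold=DELAY_THRESHOLD):
    """Map each queue name to the color of its dashboard panel."""
    return {s.name: panel_color(s, threshold) for s in samples}


if __name__ == "__main__":
    samples = [QueueSample("queue_a", 12), QueueSample("queue_b", 4500)]
    print(dashboard_colors(samples))
```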

So far, we have a system that is capable of collecting useful information, interpreting it and presenting it to the end-user through a web interface. However, as a key principle of visibility, the system must also be able to notify us when an anomaly occurs in the operation. To achieve this, we needed a tool that would let us easily integrate the third-party services we already had and would give us a high degree of flexibility. For this, we used AWS SQS to enqueue the generated notifications and an AWS Lambda function attached to it to process them and submit them to the end applications that are already part of our daily platform monitoring workflow:

With AWS SQS, we were able to gather messages from our monitoring API when an anomaly occurred without implementing a queue system ourselves, and to manage the concurrency of outgoing notifications without developing that logic inside our API. These messages are processed by an AWS Lambda function in charge of sending the alarms to our third-party services. For us, this was an elegant and simple way to respect the Single Responsibility Principle: the functions that communicate with our third-party monitoring services (PagerDuty & Site24x7) are delegated to a serverless application, so they can be deployed in a self-managed environment, run on demand, and have their code maintained independently from the rest of the system.
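
A minimal sketch of this producer/consumer split might look as follows. It assumes the API publishes alerts with a boto3 SQS client (`send_message` with `QueueUrl` and `MessageBody`) and that the Lambda function receives the standard SQS event, where each entry in `Records` carries the message in its `body` field. The alert fields and the `notify_third_party` helper are hypothetical; the real calls to PagerDuty or Site24x7 are not shown.

```python
import json


def enqueue_alert(sqs_client, queue_url, alert):
    """API side: publish one alert to the queue.

    `sqs_client` is expected to behave like boto3.client("sqs").
    """
    return sqs_client.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps(alert),
    )


def notify_third_party(alert):
    """Hypothetical placeholder for the call to an external monitoring service."""
    print(f"ALERT [{alert['severity']}] {alert['message']}")


def lambda_handler(event, context, notify=notify_third_party):
    """Lambda side: forward every queued alert in the SQS event."""
    processed = 0
    for record in event.get("Records", []):
        alert = json.loads(record["body"])  # message body written by the API
        notify(alert)
        processed += 1
    return {"processed": processed}
```

Passing the notifier as a parameter keeps the handler independent of any particular third-party service, which matches the idea of maintaining the notification code separately from the rest of the system.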

An example of a simple notification sent by our AWS Lambda notifier to our Slack channel through third-party monitoring services looks like this:

AWS made it much easier for us to create a service to monitor the main flows of our system, not only because of the ease and speed with which we can deploy services but also because it eliminates many concerns related to maintenance and operation, allowing us to focus our effort on building the software. In the end, we were able to build the whole system in far less time than we would have spent creating it on-premise, meeting our deadline and reducing work.

Even though our main goal in this first version was to increase visibility, we will keep working to automate our processes and improve our SLI. For that, we believe that using AWS is a good way to go.

Keywords

SRE: Stands for Site Reliability Engineering. It is a discipline that incorporates aspects of software engineering, with the goal of creating reliable and scalable systems.

SLA: Stands for Service Level Agreement. It is an agreement between a provider and a client about responsibilities, uptime, metrics, etc.

SLI: Stands for Service Level Indicator. It is a measure of the service level provided by a service provider to a client.

AWS: Stands for Amazon Web Services, one of the world's most adopted cloud platforms, offering more than 170 services.

Shareplex: A solution for replicating data between Oracle databases.

E CORP: Alias for our main client name.

EC: Alias for E Corp.
