Cermati Infrastructure Development: Five Years of Progress

Edwin Tunggawan
Cermati Group Tech Blog
Jan 25, 2023

When I joined Cermati back in May 2017, we didn’t have a specialized team to take care of our infrastructure. One of our engineering managers was the go-to person for most infrastructure-related matters in our development and production environments at the time, but his job was, and still is, mainly in product development.

I happened to have some knowledge of systems administration and computer networks, and at the time I was hired to work on our VoIP infrastructure project. Since I was also looking for a new domain to focus on after about four years as an individual contributor in product development at my previous company, our CTO decided to let me handle our cloud infrastructure and data pipelines when I wasn’t working on the VoIP infrastructure.

Our infrastructure platform team didn’t take shape as a proper team until March 2018, when another engineer with an interest in infrastructure finally joined the company.

In this article, I’d like to share what we achieved from 2018 to 2022, based on notes I made while revisiting our progress to look for inspiration for our next steps.

The painting “The Persistence of Memory” by Salvador Dali.

2018

I used to handle the infrastructure by myself, but in March 2018 a new team member joined to work in the infrastructure domain and I was assigned to supervise him. Hence, the infrastructure platform team was officially formed.

At this point, we didn’t really have a well-defined scope of work because my scope at the time was basically everything that no other engineer really owned — including a certain component in our product’s back-office module. I was also pairing with another engineer who wasn’t working on our infrastructure projects to improve our ETL setup, after we found several issues with the unmaintained ETL modules built by our previous interns. This status quo pretty much stayed the same after we formed a team of two.

The scope of work of the infrastructure platform team was anything that required working closely with the infrastructure components, including the ETL pipeline (because it required a lot of interaction with the data infrastructure products) and the VoIP infrastructure (because it required networking knowledge and hacking the VoIP appliance we were using at the time).

Due to the lack of clear separation between data work and infrastructure work at the time, some of the things we built fell under the data domain. We were also still trying to assess what kind of improvements we should make to our infrastructure while handling various work that fell into our lap because no other team owned it.

Considering the infrastructure setup we were using at the time wasn’t very fancy and the team was still very small (the whole engineering team was under 15 people), there wasn’t a real need for a specialized infrastructure platform team, and the infrastructure work was practically odd jobs. I still remember one time we were asked to reverse engineer the client app of a third-party application with a client-server model so we could learn how it worked and enable our product team to understand how to integrate it better with our system.

We were still deploying our services directly to cloud VM instances, using Ansible for configuration management and for executing the deployment sequence.

But in the second half of 2018, after getting a new series of funding, we started preparing our infrastructure and data pipelines to handle more load and more users.

The following is some of the notable work we did back in 2018.

Improving the Data Infrastructure

One of the things we did in 2018 was rewriting our data pipeline modules. The data pipeline had been developed by a few interns over the span of a few years. This led us to a point where the pipeline was broken and nobody knew exactly what was broken, because nobody owned the pipeline. So we were asked to investigate the issue and get it back up.

The implementation built by our interns worked well for the volume of data handled during their internship period. That volume had grown considerably by 2018, and the pipeline was overloaded and brought down by the volume of data it needed to handle.

There were also heavy operations that depended on Apache Spark at the time, which required the VM running the ETL pipelines to be quite big. If I remember correctly, it had about 16GB of memory. In the end, we eliminated Spark from the ETL workflow and rewrote the pipeline to use only vanilla Python code and Pandas for the data transformation process. The pipeline then ran without issues and we could downgrade the machine to a smaller VM with just 4GB of memory.
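
To illustrate the idea (this is a minimal sketch, not our actual pipeline code; the connection strings, table, and transformation below are hypothetical), processing the source table in Pandas chunks keeps memory usage low enough to run on a small VM:

```python
# A minimal sketch of a chunked Pandas ETL step (hypothetical table,
# columns, and connection strings). Streaming the source table in chunks
# keeps memory usage small, which is what let us drop Spark and the big VM.
import pandas as pd
import sqlalchemy

source = sqlalchemy.create_engine("postgresql://etl:secret@source-db/app")
target = sqlalchemy.create_engine("postgresql://etl:secret@warehouse-db/olap")

def transform(chunk: pd.DataFrame) -> pd.DataFrame:
    # Example transformation: normalize column names and derive a column.
    chunk.columns = [c.lower() for c in chunk.columns]
    chunk["loan_amount_idr"] = chunk["loan_amount"].fillna(0).astype(int)
    return chunk

# Process the source table piece by piece instead of loading it all at once.
for chunk in pd.read_sql("SELECT * FROM applications", source, chunksize=50_000):
    transform(chunk).to_sql("applications", target, if_exists="append", index=False)
```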

Another thing that we fixed was the way the data was loaded to our OLAP infrastructure in the cloud. After this, the infrastructure team owned the data pipeline and the OLAP cluster for a while — until the data platform team was formed in 2019.

We also applied some hacks to the VoIP server suite we were using in order to improve the data processing workflow from the VoIP servers to our business analytics data pipeline.

Prometheus Adoption and Fotia Development

We used to utilize NewRelic for our application performance monitoring and alerting, but we started to explore ways to capture more metrics from the application runtime to monitor our production environment and product pipeline funnels.

Prometheus seemed to be an obvious choice, so we decided to explore how to set it up so our services’ metrics could be collected by Prometheus. Prometheus can retrieve metrics by scraping them from a metrics endpoint exposed by the service, or the service can push its metrics to Prometheus (through a push gateway). We found the former simpler to set up, so we went with scraping.

At the time our services were all developed using NodeJS, and there was an issue with the Prometheus NodeJS library we were using. We were running the NodeJS back end server with multiple processes, and each process maintained its own metrics. Whenever Prometheus scraped the metrics, it got only the metrics from one of the processes, and which process’s metrics were served to Prometheus appeared to be random.

To solve this problem, we developed Fotia, an internal service that collected the metrics from our services and served them to Prometheus. Prometheus then only needed to scrape the Fotia service endpoint, where all of the metrics we wanted to monitor were collected in one place. Fotia used Redis to store the metrics, so older data could be dropped once it no longer needed to be served to Prometheus.
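
For illustration, here is a rough sketch of the pattern Fotia followed, written in Python rather than the actual implementation (which served our NodeJS services); the endpoint names and Redis keys are made up. Worker processes push their metrics under per-process keys with a TTL, and a single /metrics endpoint aggregates them into one scrape target:

```python
# A rough sketch of the Fotia pattern (hypothetical endpoints and keys, not
# the actual implementation): workers push their metrics into Redis, and one
# /metrics endpoint serves the aggregated view to Prometheus.
import redis
from flask import Flask, request, Response

app = Flask(__name__)
store = redis.Redis(host="localhost", port=6379)

@app.route("/push/<process_id>", methods=["POST"])
def push(process_id):
    # Each worker posts its metrics in the Prometheus text format. In practice
    # each worker's metrics would carry a distinguishing label (e.g. a pid
    # label) so Prometheus doesn't see duplicate series.
    store.set(f"metrics:{process_id}", request.data, ex=120)  # expire stale data
    return "", 204

@app.route("/metrics")
def metrics():
    # Serve the concatenated metrics of all live workers as one scrape target.
    chunks = [store.get(key) for key in store.scan_iter("metrics:*")]
    body = b"\n".join(c for c in chunks if c)
    return Response(body, mimetype="text/plain")

if __name__ == "__main__":
    app.run(port=9105)
```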

The painting “Prometheus Carrying Fire” by Jan Cossiers. Φωτιά (fotia) is the Greek word for “fire.”

Fotia was retired in 2019 after our Kubernetes adoption since we configured the pods running NodeJS services to only run one instance of the NodeJS process per pod — which eliminated the condition that led us to build Fotia in the first place.

Service Deployment Slack Bot

During the second half of 2018, we started to prepare to scale our organization. Our CTO requested that we start enforcing strict rules regarding who should be allowed to access our production network.

This required us to rethink our deployment setup since we used to require our engineers to run Ansible playbooks directly from one of the machines in the production network to trigger service deployments. While access to the other machines was limited, access to the deployment machine was granted to many of our engineers in order to enable them to deploy the services they own.

We decided to develop a back end service that runs on the deployment machine to trigger the deployment playbooks as requested by the engineers performing the deployment, so the engineers no longer need access to the production network to perform the deployment. The engineers can communicate with this service by interacting with a bot in their deployment coordination Slack channel. We (the infrastructure engineers) call this bot ChatOps, but each product development team is free to name the bot in their Slack channel as they wish so we had some funny names there.

ChatOps can maintain access control rules, so we can configure it to only respond to commands invoked by people who are authorized to perform deployments.
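
The pattern is roughly the following (a simplified sketch with hypothetical user IDs, service names, and playbook paths, not the actual ChatOps code): the bot receives the slash-command payload from Slack, checks the caller against an access control list, and kicks off the matching Ansible playbook:

```python
# A simplified sketch of the ChatOps pattern (hypothetical names, not the
# actual bot): receive a deploy command from Slack, check the caller against
# an access control list, then run the matching Ansible playbook.
import subprocess
from flask import Flask, request, jsonify

app = Flask(__name__)

# Hypothetical ACL mapping Slack user IDs to the services they may deploy.
ACL = {
    "U01ALICE": {"loan-service", "web-frontend"},
    "U02BOB": {"web-frontend"},
}

@app.route("/deploy", methods=["POST"])
def deploy():
    user = request.form["user_id"]          # Slack slash-command payload fields
    service = request.form["text"].strip()
    if service not in ACL.get(user, set()):
        return jsonify(text=f"You are not allowed to deploy {service}."), 200
    # Kick off the deployment playbook for the requested service.
    subprocess.Popen(["ansible-playbook", f"playbooks/deploy-{service}.yml"])
    return jsonify(text=f"Deploying {service}..."), 200

if __name__ == "__main__":
    app.run(port=8080)
```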

The usage of ChatOps went out of fashion after our Kubernetes adoption, as we later developed a more powerful tool called svctl to perform the software build and deployment sequence. But ChatOps is still used by a few teams even at the time this article is being written, in January 2023.

ChatOps is no longer under the infrastructure platform team’s responsibility, as we dropped support for it because it’s no longer part of our core infrastructure services and we wanted to direct our focus on maintaining the other services and tools. But one of the product development teams found that some of the features provided by ChatOps were pretty useful for improving their development workflow even after they migrated to svctl for service deployment, and they decided to take over ownership of ChatOps after the infrastructure platform team stopped maintaining it. So I guess it’s a very successful internal product.

2019

2019 was the year when we started building the foundation for our infrastructure development in the coming years. A new member of the infrastructure platform team joined the team in January 2019. With an additional engineer on board, we could start to divide our work better and increase our capacity.

The ownership of the data components was taken over by the data platform team around this time, and we started to have a better separation between the components owned by the infrastructure platform team and those owned by the data platform team. Sort of, because some of the things we built couldn’t be handed over that smoothly until a bit later. Also, some of the data components deployed in our infrastructure were still very tightly coupled with the infrastructure itself, so they remained in a gray area regarding who should own them.

We started to develop an idea of what the infrastructure should look like in the future, and invested more in the development and standardization of our internal tooling. We developed BCL to define the standard interface and structure for our internal CLI-based tools, which also handles the tools’ distribution workflow — we also wrote an article about it — and started to move towards the adoption of Kubernetes. BCL is a very simple tool that we built quickly very early in 2019, but it powers some of the tools we’re going to talk about in the later sections: svctl, dbctl, and pkictl.

The following is some of the notable work we did back in 2019.

Database Replication Tool

We were helping the data platform team improve their data ingestion pipeline from the VoIP servers in our call centers to their data warehouse, as they were asked to reduce the time interval for the data to be transported into the data warehouse to enable the business intelligence team to provide a real-time analytics dashboard to our business ops team.

Previously, we were using some hacks we had implemented when we were asked to assist the call center IT infrastructure team. But the way it was implemented made it quite difficult to reliably transport only incremental changes to the data warehouse by querying the MariaDB server in batches. The database table structure of the VoIP appliance and the way its columns are updated made it difficult for us to improve the query to fulfill the business intelligence team’s needs.

We ended up building myshipper to solve this problem, a DB replication utility for MySQL and MariaDB that incrementally ships the DB’s binlog files to cloud object storage, where they are archived and then retrieved by a worker in our data warehouse. The usage of myshipper was later deprecated in 2022, as our network infrastructure setup in 2022 allowed us to use MariaDB’s native replication capabilities to directly replicate the data into a replica DB instance inside the data warehouse network segment.
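
The core idea can be sketched as follows (hypothetical directory, bucket, and file naming; not the actual myshipper code): ship every binlog file that hasn't been uploaded yet to an object storage bucket, where a worker on the data warehouse side picks it up:

```python
# A minimal sketch of the binlog-shipping idea behind myshipper (hypothetical
# paths and bucket name, not the actual tool): upload any binlog file that
# hasn't been shipped yet to a cloud object storage bucket.
import os
from google.cloud import storage

BINLOG_DIR = "/var/lib/mysql"          # where MariaDB writes its binlogs
BUCKET_NAME = "voip-binlog-archive"    # hypothetical bucket

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

for name in sorted(os.listdir(BINLOG_DIR)):
    if not name.startswith("mysql-bin."):
        continue
    # A real implementation would also skip the binlog currently being
    # written to (its name is listed in the binlog index file).
    blob = bucket.blob(f"binlogs/{name}")
    if blob.exists():                   # skip files already shipped
        continue
    blob.upload_from_filename(os.path.join(BINLOG_DIR, name))
    print(f"shipped {name}")
```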

Kubernetes Adoption and Service/DB Deployment Tools

We decided to aim for Kubernetes adoption in 2019. In order to push for Kubernetes adoption, we needed to define a set of standards that our services had to comply with in order to make our lives easier after the migration. These standards became the contract between the engineers and our infrastructure, and were then formalized in the form of svctl, our service build and deployment utility tool.

A bronze figurine of a helmsman, or κυβερνήτης (kubernetes) in Greek.

svctl automates the process of setting up the service build and containerization workflow, along with the deployment sequence to our Kubernetes clusters. svctl was initially built as a client-side wrapper of kubectl, which also had the capability to generate standardized configurations for building the services’ monitoring dashboards in Grafana and for running build pipelines on Jenkins. svctl’s architecture would be heavily modified in 2021 to turn it into a client-server architecture, which we’re going to discuss in a later section.
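
As a rough illustration of that original client-side design (the manifest template and service names below are hypothetical, not svctl's actual output), the wrapper essentially rendered a standardized manifest and shelled out to kubectl:

```python
# A rough sketch of the original client-side svctl pattern (hypothetical
# manifest template and names, not the actual tool): render a standardized
# Kubernetes manifest for a service and apply it with kubectl.
import subprocess
import tempfile

DEPLOYMENT_TEMPLATE = """\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {name}
spec:
  replicas: {replicas}
  selector:
    matchLabels:
      app: {name}
  template:
    metadata:
      labels:
        app: {name}
    spec:
      containers:
      - name: {name}
        image: {image}
"""

def deploy(name: str, image: str, replicas: int = 2) -> None:
    manifest = DEPLOYMENT_TEMPLATE.format(name=name, image=image, replicas=replicas)
    with tempfile.NamedTemporaryFile("w", suffix=".yml") as f:
        f.write(manifest)
        f.flush()
        # The standardized contract: every service is deployed the same way.
        subprocess.run(["kubectl", "apply", "-f", f.name], check=True)

deploy("loan-service", "registry.example.com/loan-service:1.4.2")
```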

Aside from svctl, another major tool we implemented to standardize our infrastructure management workflow was dbctl (not to be confused with this DBCTL). If svctl is used to standardize our service build and deployment life cycle, dbctl is used to standardize our managed cloud database instance provisioning and management, as well as database server and access management. dbctl leverages Terraform for interfacing with the target cloud platform’s managed database instance management functionalities, and provides a framework for performing administrative database operations on the database schemas and tables.

Public Key Infrastructure

We have already written a few articles about the PKI (Public Key Infrastructure) before.

The PKI was initially developed as a component of the set of CLI tools we developed, along with svctl and dbctl mentioned in the previous section, in order to prepare our infrastructure workflow and standards for Kubernetes adoption. The PKI component is called pkictl and has been used to manage the developers’ access to our internal software supply chain and delivery infrastructure, allowing them to access our internal library repositories and deploy services to our Kubernetes clusters.

We initially developed it as a part of our internal developer tools that are very tightly coupled with the development workflow and infrastructure, which made it use the developers’ GitHub accounts as their identity source and required heavy use of Git for certificate request and signing processes — since the list of signed certificates at the time was stored in a Git repository.

We soon saw its potential to be used in a wider context, so not long after the initial iteration we decided to rewrite a major part of it so it can use our company’s Google Workspace as the identity source of truth and use a proper database for storing the list of valid certificates of the employees. A GUI desktop application called PKI Toolbox was also developed to support less technical employees and we built the PKI Toolbox for Windows, Linux, and Mac OS— since pkictl is only usable on Linux and Mac OS, and it is a CLI application which makes it more difficult for non-technical employees to use.

PKI Toolbox’s GUI.

Once the PKI was usable by our non-engineering divisions, we started to integrate our VPN authentication with our PKI so the certificates we issued to our employees could be used to log into their VPN accounts. To this end, we also added extra functionality to the PKI Toolbox application so it can be used to generate VPN configurations for its users — improving the experience of our employees when they need access to a different VPN server.

After we integrated the VPN authentication workflow with our PKI and saw that it was a success, we did the same for our SSH access management and developed CSSH to integrate our SSH access workflow with our PKI. CSSH has four components: the cssh-authz back end, which contains the access control rule definitions for CSSH authorization; the CSSH PAM module, which is deployed to each of our servers to let the SSH servers use our PKI for authentication; the cssh-daemon, which runs on our machines to manage the SSH users on those machines; and csshctl, which runs on our engineers’ workstations to generate short-lived SSH certificates they can use when accessing our machines over SSH.
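
To give a sense of how short-lived SSH certificates work in this kind of setup, here is a sketch using OpenSSH's built-in certificate support; the paths, identity, and principal below are hypothetical and this is not the actual csshctl code:

```python
# A minimal sketch of short-lived SSH certificate issuance in the spirit of
# csshctl (hypothetical paths and principals, not the actual tool), using
# OpenSSH's certificate support via ssh-keygen.
import subprocess

CA_KEY = "/etc/cssh/ca"                     # CA private key trusted by sshd
USER_PUBKEY = "/home/alice/.ssh/id_ed25519.pub"

subprocess.run(
    [
        "ssh-keygen",
        "-s", CA_KEY,                       # sign with the CA key
        "-I", "alice@cermati",              # certificate identity
        "-n", "alice",                      # allowed principal (login user)
        "-V", "+1h",                        # valid for one hour only
        USER_PUBKEY,
    ],
    check=True,
)
# This produces /home/alice/.ssh/id_ed25519-cert.pub, which sshd accepts as
# long as it trusts the CA (TrustedUserCAKeys in sshd_config).
```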

Maintaining the access control for the VPN and SSH servers was still quite a hassle at this point. The VPN servers’ access control was maintained locally on each VPN server instance and needed to be manually populated using Ansible whenever there was an update. The SSH access control rules were already stored and managed by a central cssh-authz back end service, but we still needed to use Ansible to update the rules whenever a change had to be made. This issue would be addressed by the components we built in the following years.

2020

We continued building upon the foundations we laid in the previous year to improve our infrastructure processes and component integration flows this year. We had a database administrator join in late Q1, along with two security analysts in late Q4.

The remaining data components, whose ownership was still blurry between the infrastructure platform and the data platform teams, were finally handed over to the data platform team. We did it by leveraging the Kubernetes setup we built in the previous year and moving the deployment of the remaining components to Kubernetes, which allowed the data platform team to tweak the runtime environment without affecting the stability of the lower-level infrastructure and effectively let them own the whole data stack.

We also started designing and developing a web-based access management and auditing dashboard that would later serve as our company-wide IAM (Identity Access Management) platform.

The following is some of the notable work we did back in 2020.

Replacing Nginx with OpenResty

One of the major changes we made was replacing our Nginx setup with a new OpenResty setup. Previously, our Nginx servers were deployed directly to a public-facing cloud VM located in our DMZ network segment.

While the old Nginx setup worked quite well for us, this setup put some limitations on our operations workflow.

First, because rolling out changes to the Nginx setup — even just for a simple route update, which is commonly requested by the developers — required us to run Ansible playbooks against the cloud VM and apply changes directly to the machine, any mistake during the execution risked bringing the whole site down. Depending on the kind of mistake made during the operation, we might not have been able to recover fast enough.

Second, given how much risk operating this setup carried, we were reluctant to offload even the simple act of adding new routes to the developers. Not to mention that the developers would need access to the Nginx cloud VM instance, which could become another problem as the number of developers authorized to make changes increased. This led to any changes to the Nginx configuration being restricted to the infrastructure platform team.

Third, we only had one Nginx instance acting as the gateway to our application, making it a single point of failure. We could have deployed a redundant Nginx server on another cloud VM instance and set up a failover mechanism in case one of the servers failed, which would have been somewhat complicated to set up and might still have caused noticeable disturbance to our users. Or we could try another approach leveraging Kubernetes, since we had already set up Kubernetes and it provides self-healing capabilities we could utilize.

To solve these problems, we decided to containerize our Nginx setup to be deployed on top of Kubernetes using svctl, which makes it considerably more convenient to deploy new Nginx configurations to our VPCs. We also replaced the vanilla Nginx we previously used with OpenResty — which is basically just Nginx, but with Lua scripting capabilities.

The moon, or lua in Portuguese.

With OpenResty, we implemented some Lua modules that contain additional logic to be included in the routing rules. We also restructured the server configurations to make it easier for the developers to add their application routes to the existing config, allowing us to offload quite a lot of the work to the developers themselves while we take a back seat and act only as reviewers of the configuration changes.

Redash on Kubernetes

At the time, most of the data components the infrastructure platform team used to develop were already under the ownership of the data platform team. But we had an issue with Redash because it was still using our old setup, where Redash was deployed directly on top of a cloud VM. This made it difficult for the data platform team to perform upgrades and modifications to our Redash, because doing so could affect the cloud VM’s stability, and if anything broke they might need the infrastructure platform team’s help to fix it.

So just as we did with OpenResty in the previous section, we also reconfigured our Redash deployment setup to allow it to be containerized and deployed on top of Kubernetes, allowing the data platform team to own the whole Redash setup without the risk of breaking anything on the infrastructure that might take some work to fix.

There’s one thing unique about this Redash setup compared to the other services we had deployed on Kubernetes before. The Redash setup required a lot of memory for its pods because it needed to load a lot of data from the database for analytics purposes. Because of this, we needed to provide a special set of worker nodes for the Redash deployment and used taints to reserve those worker nodes for Redash.
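
As a sketch of that scheduling setup (node pool label, taint key, and values below are hypothetical), the high-memory nodes are tainted so that only pods carrying a matching toleration, i.e. the Redash pods, get scheduled onto them:

```python
# A sketch of the dedicated-node idea (hypothetical node pool and label
# names): taint the high-memory nodes so only pods that explicitly tolerate
# the taint (the Redash pods) get scheduled onto them.
import subprocess

# Reserve the high-memory node pool for Redash.
subprocess.run(
    ["kubectl", "taint", "nodes", "-l", "pool=redash-highmem",
     "dedicated=redash:NoSchedule"],
    check=True,
)

# The Redash pod spec would then carry a matching toleration (and a node
# selector so it actually lands on those nodes):
REDASH_SCHEDULING = {
    "nodeSelector": {"pool": "redash-highmem"},
    "tolerations": [{
        "key": "dedicated",
        "operator": "Equal",
        "value": "redash",
        "effect": "NoSchedule",
    }],
}
```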

After this setup was done, we practically eliminated Redash maintenance from our work.

IAMX Development

Our organization was growing fast, and there were a lot of access provisioning tasks that needed to be performed. The way we were doing it, we wasted a considerable amount of time provisioning accounts for new joiners in our product and engineering teams. The access provisioning tasks themselves are pretty repetitive, and theoretically we should be able to automate them if we have a good standardized interface to the platforms on which we provision user access.

Because we didn’t have a standardized interface to the target platforms, we decided to build one ourselves. So we started designing and developing IAMX (Identity Access Management eXecutor).

IAMX serves as an engine that validates the interface support for the target platforms implemented in plugin modules we call IAMX connectors, and executes the access provisioning, retrieval, and revocation logic implemented in those connectors.

We designed IAMX in a way that allows the IAMX connector modules to be independently developed by people from other teams — or even other companies, if anybody from another company is interested in implementing one for their use case and publishing the code as an open-source project — and then plugged into the application running IAMX, so that IAMX can leverage the connector module to perform access management tasks on the target platform it’s connected to.
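
As a hypothetical sketch of what such a connector contract could look like (this is not the actual IAMX interface; the method names are made up for illustration), the engine only knows the abstract operations while each connector implements them for a single target platform:

```python
# A hypothetical sketch of the connector contract idea behind IAMX (not the
# actual interface): the engine only knows the abstract operations, and each
# connector module implements them for one target platform.
from abc import ABC, abstractmethod

class IAMXConnector(ABC):
    """Contract every connector module must fulfill."""

    platform: str

    @abstractmethod
    def provision(self, user: str, role: str) -> None: ...

    @abstractmethod
    def show(self, user: str) -> list[str]: ...

    @abstractmethod
    def revoke(self, user: str, role: str) -> None: ...

class Engine:
    def __init__(self) -> None:
        self._connectors: dict[str, IAMXConnector] = {}

    def register(self, connector: IAMXConnector) -> None:
        # "Validating" a connector here simply means enforcing the contract;
        # the ABC already refuses to instantiate one missing a required method.
        self._connectors[connector.platform] = connector

    def provision(self, platform: str, user: str, role: str) -> None:
        self._connectors[platform].provision(user, role)
```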

IAMX by itself is not very useful, since it can only be used to trigger access provisioning and revocation actions on the target platforms and nothing else. Because of that, we needed to build another application — with IAMX as its core — to manage the company’s end-to-end IAM process workflows. Thus, we started designing and developing IAMD (Identity Access Management Dashboard), which implements the whole business flow for Cermati’s internal IAM process.

IAMD was launched in Q1 2021, so we’re going to talk more about it in a later section.

2021

We got more people on our team in Q1: a security analyst and two software engineers.

In this period, we mainly focused our efforts on improving our infrastructure operations security and governance capabilities. With the newly-formed information security team working with us, we also implemented new processes and structures in order to enable us to run our system in an even more secure manner.

We had shed off quite a lot of our work on supporting the development team by setting up OpenResty deployment on top of Kubernetes before. We also shed off our work on supporting the data platform team in managing their Redash setup by deploying the Redash instances on top of Kubernetes.

But we still had quite a lot of manual work to do regarding access management. We developed IAMX in the previous year, but we still needed to finish the development of IAMD to finally shed off the manual IAM work.

The following is some of the notable work we did back in 2021.

IAMD

We have already written articles about IAMD before.

With more software engineers in our team, we could invest more manpower into developing IAMD and more IAMX connectors to allow IAMD to cover more target platforms, such as access to our back-office dashboards, digital marketing platform tools, and — most importantly — the rest of our infrastructure components such as SSH servers, VPN servers, and Kubernetes clusters.

As we mentioned in a previous section, back in 2019 we integrated our SSH and VPN authentication workflow with our PKI. But the way we maintained the access control rules for authorization purposes was still quite messy.

We wanted to integrate both of them (SSH and VPN access management) with IAMD to eliminate the manual work we needed to perform — manually making changes to Ansible playbooks, getting the changes reviewed, and running the playbooks — in order to provision access to our VPN and SSH servers. But due to the mess with how the access control was managed, building IAMX connectors to directly adjust the existing access control rule according to the operations performed in IAMD didn’t sound like a good idea to us.

So we developed an intermediate service called Torii — which uses Ory Keto to manage the access rules — that acts as the interface IAMX communicates with to manage the access control rules for our SSH and VPN servers. We then proceeded to migrate our SSH and VPN access rules to Keto so they could be managed in one place. Finally, we created an IAMX connector module for Torii to handle the SSH and VPN IAM workflows and plugged it into IAMD.
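
For illustration, an IAMX connector talking to an intermediate service like Torii might look something like the sketch below; the endpoint paths and payload fields are made up, and we deliberately avoid assuming Ory Keto's own API since Torii fronts it:

```python
# A hypothetical sketch of how an IAMX connector might talk to an
# intermediate service like Torii (endpoint paths and payload fields are
# invented for illustration; the real service fronts Ory Keto).
import requests

TORII_URL = "https://torii.internal.example.com"   # hypothetical host

def grant_ssh_access(user_email: str, host_group: str) -> None:
    resp = requests.post(
        f"{TORII_URL}/v1/ssh/grants",
        json={"subject": user_email, "host_group": host_group},
        timeout=10,
    )
    resp.raise_for_status()

def revoke_ssh_access(user_email: str, host_group: str) -> None:
    resp = requests.delete(
        f"{TORII_URL}/v1/ssh/grants",
        json={"subject": user_email, "host_group": host_group},
        timeout=10,
    )
    resp.raise_for_status()
```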

A torii (鳥居) gate.

We also leveraged Torii to implement our Kubernetes cluster access control management integration with IAMD, which completed our goal to unify all of our infrastructure components’ IAM workflow to IAMD.

Revamping svctl

svctl, our service build and deployment tool, was developed as a client-side wrapper for various Kubernetes and Jenkins operations to improve the development experience of our engineers when developing services for their domains.

But with svctl performing all of the required logic in a CLI tool that resides on the developers’ machines, it was difficult for us to make major changes to the infrastructure without coordinating with the developers to deprecate the svctl version they were using and upgrade to the latest version — which sometimes had a lot of incompatibilities with the version they were on, especially if they were many versions behind.

To address this issue, we decided to perform a major architectural revamp on svctl, where the new version would be a client-server application with svctl only acting as a thin client that communicates with svapp, a back end service that performs the core logic previously handled by svctl to perform operations to our Jenkins and Kubernetes clusters.

With this new client-server setup, we can be more aggressive in improving our software build and delivery infrastructure and workflow without worrying as much about the changes affecting our developers. This is because we can simply roll out an update to svapp to accommodate our infrastructure changes while the developers keep using the svctl version they already have, without any disturbance on their side, as long as the interface between svctl and svapp remains backward compatible.

And if we want to deprecate old versions of the svctl thin client, we can simply include a deprecation warning as part of the response shown by the affected svctl versions to the developers, which simplifies how the deprecation notice is communicated.
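
As a sketch of the thin-client pattern (hypothetical endpoint, payload, and header names; not the actual svctl code), the client simply forwards the request to svapp and prints whatever comes back, including any deprecation warning the server attaches:

```python
# A rough sketch of the thin-client pattern the revamped svctl follows
# (hypothetical endpoint and payload, not the actual tool): the CLI forwards
# the request to svapp and prints the response, including any deprecation
# warning the server attaches.
import sys
import requests

SVAPP_URL = "https://svapp.internal.example.com"   # hypothetical host
CLIENT_VERSION = "2.3.0"

def deploy(service: str, version: str) -> None:
    resp = requests.post(
        f"{SVAPP_URL}/v1/deployments",
        json={"service": service, "version": version},
        headers={"X-Svctl-Version": CLIENT_VERSION},
        timeout=30,
    )
    resp.raise_for_status()
    body = resp.json()
    if body.get("deprecation_warning"):
        print(f"WARNING: {body['deprecation_warning']}", file=sys.stderr)
    print(body["status"])

if __name__ == "__main__":
    deploy(sys.argv[1], sys.argv[2])
```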

Infrastructure Security Improvements

With the information security team on board, we started working to push for even more improvements in our infrastructure security setup.

From the process and monitoring side, we rearchitected our GCP access rules to make them easier to manage and applied a stricter permission structure to lower the chances of an employee being granted access beyond what they require to perform their job functions. We also ingested our infrastructure activity logs into Wazuh to improve the log audit and alerting structure we had in place.

We also enforced MFA with TOTP on our VPN and SSH access workflow to make it even more secure. We use PrivacyIDEA for the TOTP authentication system’s back end.
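
To illustrate the TOTP mechanism itself (using the pyotp library purely as an example; this is not PrivacyIDEA's API), each user enrolls a secret once and then every login requires a short-lived code derived from that secret and the current time:

```python
# An illustration of the TOTP mechanism (using pyotp as an example, not
# PrivacyIDEA's actual API): a per-user secret is enrolled once, then each
# login requires a 6-digit code derived from the secret and the current time.
import pyotp

# Enrollment: generate and store a secret for the user (shown to them as a
# QR code for their authenticator app).
secret = pyotp.random_base32()
print(pyotp.TOTP(secret).provisioning_uri(name="alice@cermati.com",
                                          issuer_name="Cermati VPN"))

# Verification at login time: the submitted code must match the current
# 30-second window (valid_window=1 also accepts the adjacent windows to
# tolerate small clock drift).
def verify(user_secret: str, submitted_code: str) -> bool:
    return pyotp.TOTP(user_secret).verify(submitted_code, valid_window=1)
```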

Since we also received a directive from top management to pursue PCI-DSS certification, we started rearchitecting our infrastructure setup for the deployment of payment-related services to comply with PCI-DSS requirements regarding the security of the cardholder data environment network segments and the network segments of the components interfacing with it. We didn’t manage to get audited for PCI-DSS compliance in 2021, but we passed the audit and got certified in 2022.

2022

Now that we had practically eliminated most of the manual work that added unnecessary toil to our day-to-day work, we could invest more in improving our system’s architecture, tools, and processes that we hadn’t managed to touch before. We also expanded our team once more and had another security analyst join us in April 2022.

Our main focus in 2022 was scaling the organization’s capabilities to take advantage of our infrastructure while improving the overall security posture of the organization in aspects relevant to the usage of our infrastructure services. We also improved our toolkit to better manage our infrastructure at scale and optimized our existing system setups to make them easier to maintain while providing better service to the users.

In 2021, we redesigned the architecture of our payment system’s infrastructure to comply with PCI-DSS. In 2022, we improved the other aspects of our security operations standards to meet the PCI-DSS compliance requirements and arranged to have our system audited. We passed the audit and got our PCI-DSS compliance certificate.

The following is some of the notable work we did back in 2022.

Jenkins Setup Improvements

We planned to improve the Jenkins setup to allow our developers and test engineers to define their own custom Jenkins pipelines, but our initial Jenkins setup made this dangerous because a custom pipeline could be used to destroy the Jenkins cluster or exfiltrate confidential system information contained within the network segment by running malicious code in the pipeline. Hence, the infrastructure team had to be the one setting up custom pipelines as requested by the developers and test engineers whenever they needed one.

This was because our Jenkins workers were deployed directly on cloud VMs inside the same network segment as the rest of our infrastructure services, which made the setup prone to breaking if a pipeline was configured to run commands that broke something on the worker host, while the worker hosts were also allowed to access some infrastructure-related resources within that segment.

In order to avoid these issues, we provisioned a new Jenkins cluster inside a separate network segment, and this new Jenkins cluster’s workers are deployed as containers on a Kubernetes cluster. This new Jenkins cluster is safe for our developers and test engineers to tinker with and define custom pipelines on, removing the infrastructure platform team as the bottleneck for their custom pipeline needs. As a bonus, we configured the cluster to auto-scale to save cost during periods when the load on the Jenkins cluster is low — something we hadn’t set up on the old Jenkins cluster.

One issue we initially had with our new Jenkins cluster was that it needed to pull all the Git commits from the repository every time a pipeline was run, instead of simply pulling the latest changes as with our old Jenkins setup — where the commits from previous Git pulls were stored on disk, ready to be reused by the next pipeline runs.

Because our new Jenkins pipelines run inside containers on a Kubernetes cluster, the cloned repositories couldn’t be reused by later pipeline runs. To solve this issue we set up a mechanism to cache the results of previous Git clone and pull operations for later pipeline runs to use, which sped up the pipeline runtime significantly for our older repositories — which have significantly more commits than the newer ones and were badly affected by the need to pull every commit on every pipeline run.
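
One possible caching mechanism looks like the sketch below (hypothetical paths; not necessarily the exact approach we use): keep a bare mirror of the repository on a persistent volume, refresh it, and clone with --reference so the bulk of the history is reused instead of re-downloaded:

```python
# One possible way to cache Git objects across ephemeral pipeline pods
# (hypothetical paths, not necessarily our exact mechanism): keep a bare
# mirror on a shared volume, refresh it, then clone with --reference so the
# bulk of the history comes from the cache instead of the remote.
import os
import subprocess

CACHE_DIR = "/cache/git"   # a persistent volume mounted into the pipeline pod

def cached_clone(repo_url: str, workdir: str) -> None:
    name = repo_url.rstrip("/").split("/")[-1]
    mirror = os.path.join(CACHE_DIR, f"{name}.git")

    if os.path.isdir(mirror):
        # Refresh the cached mirror with whatever is new upstream.
        subprocess.run(["git", "-C", mirror, "fetch", "--prune", "origin"], check=True)
    else:
        subprocess.run(["git", "clone", "--mirror", repo_url, mirror], check=True)

    # Borrow objects from the mirror instead of downloading the full history.
    subprocess.run(["git", "clone", "--reference", mirror, repo_url, workdir], check=True)

cached_clone("https://github.com/example/old-monolith.git", "/workspace/old-monolith")
```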

These Jenkins cluster improvements could be implemented relatively quickly because of the svctl revamp we did in the previous year, which allowed us to mostly just interface with the svapp back end when setting up the new Jenkins cluster without disrupting the developers’ workflow.

Security Process and Tooling

Starting in the previous year, the information security team — which is now a subdivision within the infrastructure platform team — started experimenting with our security processes and technologies. One of the things we started doing was internal phishing campaigns, which are integrated into our security awareness programs in order to train our employees to recognize phishing threats.

Fish teaching fish to recognize baits.

The first phishing campaign we conducted in 2021 was done with bare minimum tools, so we tried conducting another phishing campaign with a trial run of a paid phishing campaign toolkit. The toolkit didn’t really meet our expectations so we decided not to subscribe.

After some discussion, we decided to develop our own phishing campaign toolkit. The resulting toolkit has the features we needed: delivering the phishing emails to the targets, analyzing the delivery and open rates of the phishing emails, and tracking how far the victims proceeded into the phishing funnel.

We also enforced stricter rules on our Cloud Armor web application firewall setups, which are closely monitored by our information security team for ongoing attacks and potential tweaks we can make to our rule set.

For the security of our VPC network internals, we decided to apply stricter network firewall rules for our internal VPCs. But this was a bit difficult for us to manage, so we decided to build a network firewall management tool that allows us to manage the firewall rules in a more convenient manner while enabling better coordination with the developers on firewall rule management within their staging and production VPC networks.

To help us detect potential threats that manage to get inside our VPC network segments, we also set up a few honeypots — which will trigger an alarm for the information security team to respond to if suspicious activities are detected on the honeypots.

Database and Data Exchange Infrastructure

The network infrastructure integration between our call center site and our data warehouse had undergone a few improvements over the years, and we were already at the point where we could set up the VoIP server databases to replicate the data directly into the data warehouse instead of using cloud object storage buckets for synchronizing the data as we initially did with myshipper. So we finally deprecated myshipper — which we developed back in 2019 for data replication from our VoIP servers — and leveraged MariaDB’s native replication capabilities instead.

We also started to upgrade our Postgres database instances from Postgres 10 — which was nearing its end of life — to Postgres 13. For stability reasons, we intentionally decided not to upgrade to the latest version (Postgres 14, when the decision to upgrade was made); instead, we upgraded to the second-latest version, which we consider more stable since it has been around a bit longer.

Due to the number of Postgres 10 instances we needed to upgrade and the coordination required with the stakeholders of the systems depending on those database instances during the upgrade planning and execution, some of the upgrades had to be carried over to 2023.

For partnership purposes, we maintain some SFTP servers for data exchange, as some of our partners require us to provide them to exchange reports with us. As the number of partners we have grows, so does the number of our SFTP servers. To allow us to better manage the SFTP servers at scale, we decided to adopt SFTPGo with a cloud storage bucket as its storage backend (instead of cloud VM disks, which were used in our old standard SFTP setup).

Conclusion

Five years isn’t a short time, and we have achieved quite a lot in the five years since the conception of our infrastructure platform team. The infrastructure architecture and process have evolved and matured a lot during these five years.

The earlier years were quite brutal since our scope wasn’t very well-defined and we needed to handle many things that should have belonged to a separate team, while also working towards developing the infrastructure domains we should focus on — all with just two or three people in the team working to make sure everything was handled during our company’s period of growth. It wasn’t until 2020 that we finally managed to shed the domains that shouldn’t be the main focus of our ownership.

Starting in 2020, we invested in hiring more manpower to enable better workload distribution and started hiring for more specialized roles such as database administrator and security analyst. We also started hiring more software engineers to boost our development speed in platform development and to add capacity for the various custom infrastructure work we might need to do to support our product launches.

We’re definitely in a much better position now compared with where we were back in 2018 in terms of our infrastructure systems’ operability, security, and reliability. It’s all thanks to the hard work of everyone on the team and those who have been supporting us throughout the process.
