Our Journey to the Cloud
The WisePorter smart product catalog (PIM) is not the simplest type of software (you can read more about its use on our product blog), and it is not easy to run, because it consists of a number of technological components:
- Designer-frontend written in React
- Designer-backend and Runtime written in Java (Spring Boot, Webflux)
- A relational database, ideally PostgreSQL
- NoSQL storage: Elastic
- Cache, ideally Redis
- A rule engine for defining, managing, and making calculations, e.g. OpenL Tablets or Apache Drools
- IAM — KeyCloak
- RabbitMQ / ActiveMQ / Kafka for asynchronous queues, used for communication between the WisePorter Designer and Runtime components and for integration with surrounding systems: importing data into the catalog or exporting information about product changes from it
WisePorter is used by companies large and small, each with different requirements for the operation of its applications. Some prefer WisePorter running in their own environment, while others prefer the product catalog provided as a service (SaaS), where they do not have to worry at all about the application itself or its operation. It is for these customers that we created WisePorter SaaS.
From Heroku to MS Azure
Our first cloud environment was Heroku, which we successfully used for several years, but over time we realized that it would not be our target environment, for several reasons:
- Heroku is a platform that is simple and easy for developers to use, and to a certain degree it offsets the need for DevOps. This advantage, however, turns into a drawback when you need full automation and a clean separation of development from infrastructure and operations; it works very well only for simple cases.
- Heroku has a great command-line interface (CLI) and it is also possible to use Terraform, but the options for automation are limited compared to the major cloud providers. Our goal was to set up a new SaaS instance with a single click.
- Heroku offers a limited technology stack, and if you lack certain services or applications, they can be added using add-ons. Even so, you cannot do everything; your flexibility in using the technologies you need remains limited. Heroku is not for geeks, but for developers whose primary wish is to write applications without having to manage the configuration and operation of the technologies needed to run them.
- Heroku comes with a number of limitations (e.g. an HTTP request must complete within 30 seconds). Often these limitations cannot simply be changed and generally have to be worked around by changing the solution pattern.
- Heroku does not offer the PAYG (pay-as-you-go) model, that is, paying only for the computing power you actually use. Instead, on Heroku you reserve some computing power or data (resources in general), which is then available to you the entire time for a fixed price. Scaling is possible, but you constantly have to bear in mind that you cannot exceed the reserved resources. The selection of virtual machines is limited, and often you cannot size CPU and memory separately: you have to reserve and pay for both, even if you do not actually need that much of one of them.
- Heroku does not offer managed Kubernetes, which is a major disadvantage for us. We consider Kubernetes to be the basic building block for the operation of our application components, and the fact that K8s is a de facto standard reduces the risk of lock-in to one particular cloud provider.
- Heroku does not provide any guaranteed SLA; it is a best-effort service.
- Monitoring and logging are not part of the solution and must be outsourced to third parties.
At the beginning of this year, we got down to selecting our new cloud platform and, as expected, we eventually had three main options to choose from — Amazon Web Services (AWS), MS Azure, and Google Cloud Platform (GCP).
From the start, we were clear about the architecture of the entire solution, including the preferred technologies. On each platform, we created a PoC with the aim of getting WisePorter running with all its components. To implement the PoCs, we brought in experienced specialists for each platform: local partners for Azure and GCP and a freelance specialist for AWS. The aim was to collect a larger set of input information for the final decision, which you cannot get just by reading articles.
Unfortunately, this did not produce a clear winner. All three platforms have a lot to offer, and yes, the largest range of services is offered by AWS, which is probably also the technological leader, but we only wanted the basics (managed Kubernetes, data storage, queues, monitoring, etc.), nothing “advanced” or too “special” that AWS, Azure, or GCP could not offer. If we had had to decide at this particular stage, we would probably have gone for AWS, because it is the platform we are most familiar with and have already gained a lot of experience with on past projects.
So why did we finally choose MS Azure? The answer is simple: money. We do not mean money for the operation of our product catalog on a given platform; we found no significant differences in this respect. We mean money to start with, to be able to launch and subsequently tune and test the solution, which can cost a lot. AWS did not offer anything, GCP offered hundreds of dollars, and only Microsoft offered as much as $120,000 under the Microsoft for Startups programme. It did not come for free: we had to apply to the programme and succeed, in other words convince the “jury” that we were doing something meaningful and, most importantly, that we had potential for growth.
Application adjustments for cloud environment
The product catalog is a Java Spring Boot application which runs on any application server, such as Apache Tomcat, or can be launched in embedded mode. The architecture of the application is not entirely monolithic, but one cannot call it microservice-based either. It is just the right kind of component decomposition :).
The first major challenge was to adjust the WisePorter application to run on the cloud platform, so that we could use all the cloud capabilities, such as scalability, and run tens of WisePorter SaaS instances simultaneously while still handling monitoring and support. Our list of non-functional requirements had over fifty items, of which more than half were “must-have”; without them we could not deploy WisePorter in the cloud. To illustrate, the list contained the following groups of requirements:
- bootstrap / provisioning / graceful shutdown of all components
- log unification
- health probes
- user and technical security
- operation in a cluster
- configuration, including secrets
- docker image
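Two items from the list above, health probes and graceful shutdown, can be illustrated with a minimal sketch. In the real application these are handled by Spring Boot; the sketch below uses only the JDK's built-in HTTP server, and all names in it (`HealthProbe`, `/livez`, `/readyz`) are illustrative, not WisePorter's actual endpoints.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class HealthProbe {
    private static volatile boolean ready = false;

    public static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        // Liveness: the process is up and able to serve HTTP at all.
        server.createContext("/livez", exchange -> respond(exchange, 200, "OK"));
        // Readiness: returns 200 only once bootstrap has finished, so that
        // Kubernetes does not route traffic to a half-started pod.
        server.createContext("/readyz", exchange ->
                respond(exchange, ready ? 200 : 503, ready ? "READY" : "STARTING"));
        server.start();
        ready = true; // in a real application, set after bootstrap completes
        // Graceful shutdown: stop reporting ready, then drain in-flight work.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            ready = false;
            server.stop(5); // allow up to 5 s for in-flight exchanges
        }));
        return server;
    }

    private static void respond(HttpExchange ex, int code, String body) throws IOException {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        ex.sendResponseHeaders(code, bytes.length);
        try (OutputStream os = ex.getResponseBody()) {
            os.write(bytes);
        }
    }
}
```

The point of separating the two probes is that a pod can be alive but not yet (or no longer) ready for traffic, which is exactly the state during bootstrap and graceful shutdown.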
Even though the implementation itself took several people several months, the hardest part was recognizing and defining these requirements in the first place, which would not have been possible without our previous experience with microservice architecture. I am getting ahead of myself here, but this was the most important part of our successful transition to the SaaS model, which the experience of other companies confirms.
A new environment upon a click
The second major challenge was the creation of the cloud environment where Kubernetes and the other necessary technologies would run. First, we created the environment manually and then tackled its full automation with Terraform. Today, we can create a brand new, unified WisePorter environment with a single click based on an initial configuration; it takes about 2–3 minutes and the entire process is fully automatic.
Most of the technologies mentioned at the beginning of this article are available as a service directly in MS Azure:
- Kubernetes (AKS) as an application server for all Java components, React frontend, KeyCloak, and Elastic
- Service Bus input / output queues
- Azure Database for PostgreSQL
Although Azure offers Elastic as a service, it is aimed primarily at full-text search and Kibana, while we need Elastic first and foremost as a data store. For the managed Elastic Cloud service outside Azure, we would have to pay from the start, because it is not covered by the Microsoft credits. Another problem would be integrating it into our automated operations, and we were also concerned about latency (which is why we now run it inside AKS, in a single virtual machine).
Similarly, we made our own provisions for Redis: we run Redis Sentinel on AKS, mainly for reasons of price and latency. Moreover, we know both technologies very well, so running and managing the service was easy.
What have we gained?
We successfully launched WisePorter SaaS on the 1st of September, and since October we have had our first production customer on this platform and in this mode. Now we are gradually transferring our existing Heroku instances here. We have managed to complete the mission we started at the beginning of the year. Nevertheless, there is still a lot of work to be done. The platform runs fully, but it needs fine-tuning: optimizing the settings, sizing all parts of the platform to minimize operational costs, setting up monitoring and active alerting, optimizing backups, and improving how we handle inconvenient failures (which will surely occur sooner or later).
What the WisePorter SaaS brings us:
- incomparably more options and higher flexibility than Heroku could offer
- everything standardized (each environment is identical), everything fully automatic
- we have a guaranteed SLA, which we can subsequently offer to our customers as well
- we can offer each customer the computing power they actually need. We do not have any de facto limitations; during stress tests, we were able to run WisePorter in 40 instances …
- we can offer our customers the PAYG model
- Azure Monitor is a strong tool for monitoring standard and custom metrics, searching logs, and alerting when things do not work as envisaged
Where are the pitfalls?
Reading this article, one might get the impression that everything was simple and straightforward. Unfortunately, that impression would be wrong, because there were many challenges and difficulties.
It is necessary to learn to balance the almost unlimited technical options offered by a cloud platform against reasonable costs. Most notably, this means changing the mindset of developers and DevOps engineers and working continuously on cost optimization: it is not difficult to run a SaaS application with unlimited costs; the mastery lies in running it as cost-effectively as possible without affecting the quality of the service for users. A few examples:
- The most expensive items are data (logs, transfers to/from the cloud). It is therefore necessary to balance the amount of logged information and its retention, which should make product support and error correction as fast as possible, against the price of storing the logs.
- Suitable sizing (particularly CPU and RAM), as well as the correct selection of disks, virtual machine types, and other items, always involves a trade-off between price and computing power, and to a certain degree also the certainty and stability of the entire solution. This cannot be optimized without ongoing measurement and tuning.
- The size of the backed-up data is proportional to the volume of data in the primary database and the frequency of changes. Is it better to use the backup service, putting data aside and paying for a data space several times larger than the original database, or to expand the primary data storage, rely on time slicing (an Azure Database feature), and possibly change the way data is stored so as to minimize the scope of the changes, thereby generating less change data for each change?
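The backup question above can be made concrete with a back-of-the-envelope sketch. All numbers used here (database size, daily change ratio, storage price) are hypothetical, chosen only to show how the two options compare; they are not Azure list prices or WisePorter figures.

```java
public class BackupCostSketch {

    /** Monthly cost of keeping retainedDays full daily copies of the primary database. */
    public static double fullBackupCost(double primaryGb, int retainedDays,
                                        double pricePerGbMonth) {
        return primaryGb * retainedDays * pricePerGbMonth;
    }

    /** Monthly cost of keeping only the daily change set instead of full copies. */
    public static double incrementalCost(double primaryGb, double dailyChangeRatio,
                                         int retainedDays, double pricePerGbMonth) {
        return primaryGb * dailyChangeRatio * retainedDays * pricePerGbMonth;
    }

    public static void main(String[] args) {
        // Hypothetical inputs: 100 GB database, 7 days of retention,
        // 5 % of the data changing per day, $0.10 per GB per month.
        double full = fullBackupCost(100, 7, 0.10);          // 7 full copies on the side
        double incr = incrementalCost(100, 0.05, 7, 0.10);   // change data only
        System.out.println("full backups:  $" + full + "/month");
        System.out.println("change data:   $" + incr + "/month");
    }
}
```

Under these made-up numbers, seven full copies cost roughly $70 a month while seven days of change data cost about $3.50; of course the real decision also depends on restore-time requirements, not just on storage price.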
These examples illustrate that the optimization options are numerous: some concern only the settings of the cloud platform, but most relate to the application itself, the way it is used and integrated with other systems outside the cloud platform, and so on. This kind of thinking only comes with practice and experience. If a simple optimization saves us just $10 per month per environment, then with 20 environments we save a fair $200 per month …
Considering the above, I cannot even imagine having to pay for all of it in full from the first moment; that would cost us thousands of dollars monthly. We have several environments for different needs (development environments for our production team and for individual projects/customers, environments for performance fine-tuning, for presale demos for potential customers, etc.), we are at the start of the optimization process, we are collecting the first real operational data, and we monitor our metrics every day, trying to learn from all this and set priorities for further optimization. I am really grateful for the support from Microsoft, and I think they have come up with a well-thought-out programme which will pay off many times over in the future.
Contrary to a common myth, the cloud is definitely not a cheap solution. The cloud (just like microservices) is not for everybody, and you have to define your “why” beforehand. We are learning how to price WisePorter SaaS for our customers, because one customer may have a good picture of how cloud solutions work and be ready to accept the PAYG model of paying for infrastructure, while another would rather have one guaranteed price that is the same each month. We have to be able to satisfy both types of customers.
I like the options and functions that Azure Monitor brings: anything I need, I can visualize; I can create my own metrics and add graphs, which I can monitor over time. Similarly, I can actively watch certain indicators, so that Azure warns me of any issues, for instance when the average response time rises above 200 ms or the number of messages in the input queue exceeds 1,000. Nevertheless, this requires a change of mindset from developers: direct access to the applications, the individual technologies, and the logs is often not possible (or desirable), and developers have to rely on Azure Monitor, where a lot of information is mediated; it is not the same as connecting to Redis via a command line and finding out exactly what is going on with a few commands.
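An alert rule like "fire when the average response time exceeds 200 ms" boils down to a sliding-window check. The sketch below is not Azure Monitor's API, just a minimal Java illustration of the logic; the class name and its parameters are made up for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Fires when the average of the last windowSize response times
 * exceeds thresholdMs. Names here are illustrative only; in practice
 * this rule is configured declaratively in Azure Monitor.
 */
public class ResponseTimeAlert {
    private final Deque<Long> window = new ArrayDeque<>();
    private final int windowSize;
    private final double thresholdMs;

    public ResponseTimeAlert(int windowSize, double thresholdMs) {
        this.windowSize = windowSize;
        this.thresholdMs = thresholdMs;
    }

    /** Record one response time; return true if the alert should fire now. */
    public boolean record(long responseMs) {
        window.addLast(responseMs);
        if (window.size() > windowSize) {
            window.removeFirst(); // keep only the most recent samples
        }
        double avg = window.stream().mapToLong(Long::longValue).average().orElse(0);
        // Fire only once the window is full, to avoid alerts on startup noise.
        return window.size() == windowSize && avg > thresholdMs;
    }
}
```

Averaging over a window rather than reacting to single samples is what keeps such alerts from firing on every transient spike; the same trade-off applies when tuning the real alert rules.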
I am convinced that our decision to move to the cloud was right, and I am proud of our team, who coped so well with this difficult technical challenge. From the start, we had a fixed launch date, because we no longer wanted to offer Heroku-based solutions to our customers. We did not have much time, so we worked in two parallel streams: the application stream, which adapted WisePorter to the cloud platform, and the cloud stream, which prepared and automated the new environment. We have gained a lot of new experience and expanded our knowledge, which we can build on during further fine-tuning. There is still a long way to go …