Tech @ Wehkamp: A travelogue

Robert Kranenburg · Published in wehkamp-techblog · Jan 5, 2022
Original photo by Annie Spratt on Unsplash.

I have been at Wehkamp for quite a few years now. An important reason for that is our unofficial motto: “Never a dull moment.” That is true because of the nature of our business: in commerce it’s important to keep moving with trends and sudden demand. It is also true because of our tech stack, which has changed a lot over the years. So much so, that when I was discussing it with a friend recently, I lost track myself. I thought it would be a nice idea to write it down.

A long, long time ago

When I joined Wehkamp as a web developer in 2007, we were in the middle of a migration from classic ASP to ASP.NET (2.0). The application was a monolith with a layered architecture. We were using Microsoft SQL Server as our data store and the site was running on Microsoft Internet Information Services. The back office was a Java application, and we used the Hessian (binary) protocol to transfer data between the website and the back office. We adopted new technologies as they came along, like LINQ, ASP.NET MVC (used to build a new checkout and self-service environment) and ASP.NET Web API (our first step towards a service-oriented architecture).

The rise of Blaze

In 2014 Wehkamp decided to expand its market to Belgium (mainly the Flemish-speaking part). Although that adventure was short-lived (we closed the doors after only two years), it was a huge driver for innovation. Instead of reusing the existing code base, we saw this as an opportunity to do things differently: build a new platform for our website and back office systems from scratch, using the latest technologies. We realized that this was the perfect opportunity to move from our monolithic applications to microservices. Many articles have been written about the advantages of that architecture, so I won’t bore you with them. For us, the fact that microservices are small and relatively easy to maintain was an important reason.

To find a catchy name for our new platform, we held a poll. The name Blaze won by a hair from the runner-up: Phoenix. Our tech stack was dubbed the SMAC stack: Scala, Mesos, Angular and Cassandra. We decided to use Elasticsearch to help customers quickly find the product they are looking for. This search engine became the backbone of the overview pages as well.
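To give an idea of what that looks like in practice, here is a minimal sketch of a product search against Elasticsearch using the official Python client. The index name and field names are made up for illustration; our real mappings and queries are more elaborate.

```python
# Minimal product-search sketch against Elasticsearch.
# Index and field names ("products", "title", "brand", "description") are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

response = es.search(
    index="products",
    body={
        "query": {
            "multi_match": {
                "query": "blue jeans",
                "fields": ["title^2", "brand", "description"],
            }
        },
        "size": 10,
    },
)

# Print the best-matching products and their relevance scores.
for hit in response["hits"]["hits"]:
    print(hit["_source"]["title"], hit["_score"])
```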

Fun fact: SMAC is also a kind of Spam-like canned meat, marketed by Unox.

At the core of the platform is our Mesos cluster, with Marathon managing our containers. Consul is used for configuration management. The combination of HAProxy proxies and Consul-driven configuration enables easy routing. The gateway to the platform is an nginx proxy with some custom routing logic, written in Lua. With these technologies we were able to launch a new website and order management system in six months, opening our doors to Belgian customers.
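To illustrate the Consul-driven part: a service only needs to make itself known to Consul, and the routing configuration follows from there. A rough Python sketch of such a registration against Consul's HTTP agent API is shown below; the service name, port and health check are hypothetical, and on Blaze this is handled by the platform rather than by hand.

```python
# Illustrative sketch: register a service with the local Consul agent so that
# Consul-driven tooling (e.g. consul-template) can regenerate the HAProxy
# configuration and start routing traffic to it.
import requests

registration = {
    "Name": "product-service",          # hypothetical service name
    "Port": 8080,
    "Check": {
        "HTTP": "http://localhost:8080/health",
        "Interval": "10s",
    },
}

resp = requests.put(
    "http://localhost:8500/v1/agent/service/register",
    json=registration,
)
resp.raise_for_status()
```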

“Why this stack, when you have a history with Microsoft?”, you may wonder. There were several reasons, actually. First of all, breaking with Microsoft would force us to rebuild — and thus rethink — our applications, instead of just copying code. We wanted to use open-source products, and to go faster we wanted to split the front end (Angular) from the back end (Scala). Another important reason was that running .NET in a container simply wasn’t possible at that time.

This changed with the release of .NET Core, which many of our developers, who were still working on the “ancient code”, embraced with much enthusiasm. By this time, some other changes in the stack had been made as well. We moved from Angular to React, because Angular at that time did not support server-side rendering, which was imperative for SEO. We also introduced Cypress for end-to-end testing.

We moved away from Cassandra, as we found that the one-database-to-rule-them-all approach was not the best strategy. During certain IO-intensive operations it sometimes became unresponsive, which is not what you want for a fast website. We decided to let the teams choose their own data store, as long as it was AWS-native. Based on the scope of the team and application, most of them chose either PostgreSQL or ElastiCache (Amazon’s managed Redis).

When envisioning the platform, we wanted it to be developer-friendly and support our new way of working. This required a smooth CI/CD pipeline, with Jenkins at its heart. Scripts frequently poll our GitHub repositories: build scripts are created for new repositories and executed when code changes are detected. Builds are done in a container, so it doesn’t matter which language (or version of that language) you use. A command-line tool, called blaze-cli, helps in the process. It detects the language and version of the application that will be built and spins up the right build container.
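The idea behind that detection step could look something like the sketch below. This is not the actual blaze-cli code; the marker files and image names are invented to show the principle.

```python
# Rough sketch of the idea behind build-container selection: look at marker
# files in the repository to guess the language, then start the matching build
# image with the repository mounted into it.
import os
import subprocess

BUILD_IMAGES = {
    "scala": "blaze/build-scala:latest",   # hypothetical image names
    "node": "blaze/build-node:latest",
    "dotnet": "blaze/build-dotnet:latest",
}

def detect_language(repo_path: str) -> str:
    if os.path.exists(os.path.join(repo_path, "build.sbt")):
        return "scala"
    if os.path.exists(os.path.join(repo_path, "package.json")):
        return "node"
    if any(f.endswith(".csproj") for f in os.listdir(repo_path)):
        return "dotnet"
    raise ValueError("Could not detect project language")

def build(repo_path: str) -> None:
    image = BUILD_IMAGES[detect_language(repo_path)]
    # Run the build inside the container, with the repo mounted at /src.
    subprocess.run(
        ["docker", "run", "--rm", "-v", f"{repo_path}:/src", image],
        check=True,
    )

if __name__ == "__main__":
    build(".")
```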

The output from a build container — an application container — is pushed to a container registry (originally Docker Hub). The newly created container is then deployed on what we call our development environment, which hosts the latest, greatest and sometimes untested versions of our applications. Promotion to the production environment can be done automatically, based on the test results. Basically, a developer can deploy an application to production, within minutes, by just creating a repo in GitHub. And because it is so easy to deploy, our teams do it all the time. If something breaks, a rollback (or, more often, a roll-forward) is never far away.

As more technologies became available, we started to use serverless solutions as well. At the time of writing, we have almost fifty Lambda functions running in AWS, doing everything from maintenance tasks and providing metrics to image modification and running machine learning models for image classification.
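For a flavour of the simpler end of that spectrum, here is a minimal sketch of a Lambda handler that reacts to an S3 upload and records a metric. The event shape matches S3 notifications; the metric namespace and names are made up for the example.

```python
# Minimal Lambda sketch: react to S3 "ObjectCreated" events and record a
# CloudWatch metric. Namespace and metric names are hypothetical.
import boto3

cloudwatch = boto3.client("cloudwatch")

def handler(event, context):
    for record in event.get("Records", []):
        key = record["s3"]["object"]["key"]
        cloudwatch.put_metric_data(
            Namespace="ImagePipeline",
            MetricData=[{
                "MetricName": "ImagesReceived",
                "Value": 1,
                "Unit": "Count",
            }],
        )
        print(f"Processed upload: {key}")
    return {"status": "ok"}
```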

On the Cloudflare edge, we run Workers for various tasks: reverse proxies for some of our sites, and a worker that handles 19k (!) redirects for SEO purposes.

DevOps and increased productivity

Although perhaps not the most sensational change from a technology perspective, one of the most important ones was the introduction of Slack. Besides being nice for person-to-person communication — indispensable in the current COVID world — it became the basis for some of the things the tech hub runs on: ChatOps and SRE. Several bots help us get information in a jiffy, create resources like databases without needing to write code or access the AWS Console, and support our incident process. Slack, in combination with our bots, really shines during incidents. Anyone can request graphs or other information without having access to databases, logs and dashboards; a bot will get the data for you, which helps to reduce the time to repair.
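As an illustration of the ChatOps idea (not our actual bots), a minimal Slack bot built with the slack_bolt library could look like this; the slash command and the metrics lookup are hypothetical.

```python
# Illustrative ChatOps bot: answer a slash command by looking up a metric and
# posting the result back to the channel. Command name and lookup are made up.
import os
from slack_bolt import App

app = App(token=os.environ["SLACK_BOT_TOKEN"],
          signing_secret=os.environ["SLACK_SIGNING_SECRET"])

def fetch_error_rate(service: str) -> float:
    # Placeholder for a real query against Prometheus, a database or a log store.
    return 0.42

@app.command("/error-rate")                      # hypothetical slash command
def error_rate(ack, command, respond):
    ack()
    service = command["text"].strip() or "unknown"
    respond(f"Error rate for {service}: {fetch_error_rate(service):.2f}%")

if __name__ == "__main__":
    app.start(port=3000)
```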

The power behind this is Prometheus: an application for event monitoring and alerting. Prometheus frequently “scrapes” our services and sites and collects the metrics that they expose. These metrics are used for both dashboards — for which we use Grafana — and alerting.
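The model is simple: a service exposes its metrics over HTTP and Prometheus pulls them in on a schedule. A minimal example using the official prometheus_client library, with made-up metric names, looks like this:

```python
# Expose metrics on /metrics so Prometheus can scrape them.
# Metric names are illustrative, not the ones our services actually expose.
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["path"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency")

if __name__ == "__main__":
    # Prometheus scrapes http://<host>:8000/metrics on its own schedule.
    start_http_server(8000)
    while True:
        with LATENCY.time():
            time.sleep(random.random() / 10)   # simulate some work
        REQUESTS.labels(path="/products").inc()
```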

For quite a while, one of our biggest challenges has been storing the huge amount of data we collect. For normal operations and incidents, the last couple of hours is usually sufficient, but what if you want to compare traffic now to that of the week or month before? Or perhaps even the year before? How did this Black Friday’s traffic compare to last year’s? We couldn’t tell, because we could only store the last few weeks. Needless to say, monitoring Service Level Objectives over time was impossible. Until recently. We now have Thanos in place, which covers long-term data storage — with the option of compacting it — and also improves performance by offloading queries for many dashboards.
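The kind of comparison this unlocks can be expressed as a single PromQL query with an offset. A small sketch against the Prometheus (or Thanos) HTTP query API, with a stand-in metric name:

```python
# Compare the current request rate with the same metric one year ago, via the
# Prometheus/Thanos HTTP API. The metric name is a stand-in.
import requests

PROM_URL = "http://localhost:9090/api/v1/query"

def instant_query(promql: str) -> float:
    result = requests.get(PROM_URL, params={"query": promql}).json()
    return float(result["data"]["result"][0]["value"][1])

now = instant_query('sum(rate(http_requests_total[5m]))')
last_year = instant_query('sum(rate(http_requests_total[5m] offset 1y))')
print(f"Traffic vs. a year ago: {now / last_year:.0%}")
```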

Perhaps one of the greatest initiatives was the introduction of Databricks. The idea was to make data available to everyone and provide them with the tools to work with that data in an easy way. Everyone with a little programming knowledge can write queries to find whatever data they need.

Although this was a bridge too far for people without a programming background, it really helped to make more data available to more people. It became possible for anyone to quickly find answers to questions that previously required the help of specific colleagues. For example, a team trying to solve a certain problem can access Cloudflare logs to investigate suspicious requests. In the past this would have required help from the Platform team to get that data and share it, which would have taken more time and work.
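In a Databricks notebook, that Cloudflare-logs example could look roughly like the PySpark snippet below. The table and column names are hypothetical; the point is that it takes only a handful of lines.

```python
# Hypothetical Databricks/PySpark query: top sources of blocked (403) requests
# in the Cloudflare logs. Table and column names are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()   # already available in Databricks

logs = spark.table("cloudflare.http_requests")

suspicious = (
    logs
    .where(F.col("status") == 403)
    .groupBy("client_ip", "uri")
    .count()
    .orderBy(F.desc("count"))
    .limit(20)
)

suspicious.show(truncate=False)
```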

Part of our tech stack over the years.

Dawn of a new era

For many of the back office applications, BizTalk was used to transfer messages between systems. While the technology works well, it requires specific knowledge (i.e. a separate team) to run it. As we like our teams to be responsible for their own products, we decided to move these flows to Apache NiFi, which is based on NiagaraFiles, originally developed by the NSA. This enables teams to build and maintain their own data pipelines.

Although the Blaze platform has brought us many good things, there are some things that can be improved as well. Scaling out — increasing the number of container instances for an application, for example — still requires manual steps. Even more important: it takes time to have every team increase the number of their instances, which becomes a problem when our Prime Minister announces a new lockdown and many people want to buy toilet paper at the same time. Or when new stock of the Sony PlayStation 5 becomes available. Even riskier is a massive increase in the number of instances of every microservice at the same time. The Mesos nodes on which those services run need time to grow in number. If the desired number of service instances requires more resources than the nodes can provide, every deployment stalls until additional capacity is available.

Our teams are currently working on moving to a new platform. We named it Atlas, as the platform needs to carry the full load of sites, services, gateways and processors for both Wehkamp and Kleertjes.com, which is quite a load to carry.

Although we want the developer experience to mostly stay the same, many things will change under the hood. First of all, we will move to Kubernetes as the heart of the platform, and ArgoCD will replace blaze-cli and Marathon for CD. Our containers have already been migrated to Amazon ECR, so we said bye-bye to Docker Hub a while ago.
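One example of what that buys us: the manual scaling mentioned earlier becomes a single declarative change or API call on Kubernetes, and in practice an autoscaler does it for you. A sketch using the official Kubernetes Python client, with a made-up deployment name and namespace:

```python
# Illustrative only: scale a deployment to 10 replicas through the Kubernetes
# API. Deployment name and namespace are hypothetical.
from kubernetes import client, config

config.load_kube_config()                      # or load_incluster_config()
apps = client.AppsV1Api()

apps.patch_namespaced_deployment_scale(
    name="product-service",
    namespace="atlas",
    body={"spec": {"replicas": 10}},
)
```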

Conclusion

Although we’re not the first to use new technologies, at Wehkamp we strive to apply them as soon as possible. They help us move fast in an ever-changing world. It’s interesting to see the move from self-hosted products to cloud-based solutions, and how we add new tools to our tech stack while abandoning those that don’t fit anymore. The journey has been an exciting one so far, and I am looking forward to what the future will bring.
