TL;DR: As a rapidly growing startup, our entire business depends on our ability to build, scale and reliably operate the platform powering our customer experience and logistics. Learn why and how we do DevSecOps to achieve all of this while keeping our team happy and productive.
At Kolonial.no, we are rethinking the way you purchase groceries. We are one of Norway’s most recognized tech scale-ups, with way more under our hood than just an app. Technology is at the heart of our entire business, and our in-house digital platform powers everything from our personalized shopping experience to complex warehouse logistics and delivery to doorsteps all over Eastern Norway.
Having grown rapidly since we started in 2013, we’re now handling thousands of orders every day. Our customers expect to effortlessly place orders at any time of the day, and our warehouse is in full operation around the clock, with people and automation working together to prepare customer orders in time. Hundreds of vehicles are out on the road delivering from sunrise to sunset, with both drivers and customers relying on our platform to get everything delivered correctly and on time. All this means our systems are truly business-critical, and any problems or outages quickly affect both customers and employees.
A key challenge for our tech team is to be able to build, improve and scale these business-critical systems as we grow, while keeping a growing team of engineers happy and productive. To achieve these goals, we are true believers in DevSecOps.
What is DevSecOps?
There are lots of good descriptions of DevOps around, but one thing unites them all: although DevOps has a lot to do with technology, it has just as much to do with people. At Kolonial.no, we consider DevOps the combination of our technical infrastructure and tooling, and the way we build, operate and improve our systems. In practice, these tools and processes are always evolving, so one of the most important functions of DevOps is to tie them together into something we can reason about and discuss.
Just like Dev and Ops, we consider security a responsibility shared between everyone involved. Instead of trying to tack security on as an afterthought, we strive to follow the principles of DevSecOps. One of the most important principles, “shifting left on security”, involves considering security aspects from the earliest phases of development as a collaborative effort.
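To make “shifting left” a little more concrete, here is a minimal sketch of the kind of check that can run on every commit or pull request, long before code reaches production. It is an illustration only, not a description of our actual tooling: the rule names, the patterns and the `find_secrets` function are all hypothetical, and a real scanner would use a much larger rule set.

```python
import re

# Hypothetical rules for illustration; real secret scanners ship far more.
SECRET_PATTERNS = {
    # AWS access key IDs start with "AKIA" followed by 16 uppercase letters/digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    # A quoted literal assigned to something called "password".
    "hardcoded_password": re.compile(r"(?i)\bpassword\s*=\s*['\"][^'\"]{4,}['\"]"),
}


def find_secrets(text):
    """Return (line_number, rule_name) pairs for every suspicious line."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, rule_name))
    return findings


if __name__ == "__main__":
    sample = 'DEBUG = True\npassword = "hunter22"\n'
    for line_number, rule_name in find_secrets(sample):
        print(f"line {line_number}: possible {rule_name}")
```

Failing the build when `find_secrets` returns anything gives the developer feedback while the change is still small and cheap to fix, which is the point of shifting left.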
Our tech stack
Our tech stack consists mostly of modern open source languages, frameworks and tools. Python is our lingua franca, and we use it for everything from building Django web apps (we’re proud sponsors!) and machine learning pipelines, to controlling robots and real-time automation in our warehouse. We run our infrastructure on Google Cloud, along with some on-premises servers for latency and redundancy (e.g. for controlling hardware). Everything is infrastructure as code using tools like Salt and Terraform on Linux, and we instrument and monitor in real time using tools like Datadog.
We have opted to standardize on languages, frameworks and tools across our teams, and believe this approach has a few important advantages. Instead of each team spending scarce and valuable time building tooling and knowledge about how to run different things in production, we can double down on creating robust solutions that can be used by multiple teams. We’d rather build things that make our customers happy and solve real-world problems than reinvent scaffolding and supporting technology. We also want a culture of team mobility, and think a familiar stack reduces the friction involved in switching teams and contributes to delaying Conway’s law. That being said, we do of course evolve our tech stack continuously. It’s definitely not about having a static stack, but rather about giving deliberate thought to what we bring into it.
Scaling systems and productivity with growth
In the beginning, when you’re only a handful of developers, things are simple. Everyone knows everyone, and everyone is familiar with most parts of the majestically monolithic codebase. Deployment is easily automated with a few scripts, and intimate knowledge of the systems makes debugging and fixing issues easy enough.
Fast-forwarding a few years to our current state, with multiple two-pizza teams working on distributed systems, it has become apparent that remaining reliable, secure and productive requires more deliberate action. This need for speed and reliability in both the short and long run is the reason we choose to invest in our technical infrastructure, tooling and processes in addition to working on important business problems.
We have generally been able to stay ahead of increasing complexity as we’ve scaled by making incremental improvements to existing components. However, sometime last year we started seeing increasing symptoms like slow deployment (taking 15–20 minutes instead of 2–3 minutes), broken database migrations leading to outages, failing tests with unclear ownership, and increasing complexity in resolving production incidents.
This led us to start working towards a more structured approach to DevSecOps, including assessing our current capabilities based on the State of DevOps report and spending extra engineering time fixing our most important issues. Here are a few examples of our recent wins:
- Automatically built cloud test environments for all GitHub pull requests
- Improving our deployment pipeline to get deploys down to a couple of minutes
- Switching our data warehouse from PostgreSQL to Snowflake
- Tools for improved visibility into database migrations
- Rolling out Vault for better secrets management
- … and much more
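As a rough idea of what visibility into database migrations can look like, the sketch below reviews the SQL a migration would run (for a Django app, the output of `python manage.py sqlmigrate <app_label> <migration_name>`) and flags statements that deserve a second look. It is not our internal tool: it assumes a PostgreSQL database, and the rule list and the `review_migration_sql` function are hypothetical and deliberately simplistic.

```python
import re

# Hypothetical review rules, assuming PostgreSQL semantics.
RISKY_STATEMENTS = [
    (re.compile(r"\bDROP\s+(?:TABLE|COLUMN)\b", re.I),
     "destructive: drops data"),
    (re.compile(r"\bCREATE\s+(?:UNIQUE\s+)?INDEX\b(?!\s+CONCURRENTLY\b)", re.I),
     "index built without CONCURRENTLY blocks writes to the table"),
    (re.compile(r"\bALTER\s+COLUMN\b.*\bTYPE\b", re.I),
     "column type change may rewrite the table"),
]


def review_migration_sql(sql):
    """Return (statement, reason) pairs for statements worth reviewing."""
    warnings = []
    # Naive split on ';' (good enough for a sketch, wrong for quoted semicolons).
    for statement in sql.split(";"):
        statement = " ".join(statement.split())  # collapse whitespace
        if not statement:
            continue
        for pattern, reason in RISKY_STATEMENTS:
            if pattern.search(statement):
                warnings.append((statement, reason))
    return warnings


if __name__ == "__main__":
    sql = 'CREATE INDEX "orders_note_idx" ON "orders" ("note");'
    for statement, reason in review_migration_sql(sql):
        print(f"{reason}\n    {statement}")
```

Posting warnings like these as a comment on the pull request puts the risk in front of the author and reviewer before the migration ever runs against production.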
How we work
A basic principle of DevOps is that developers and operations people work together from initial development to operations in a production environment. If you have a DevOps department that’s supposed to do it all, then you’re not doing it right. Instead, we believe in empowering our product teams to own their stuff in production, while standing on the shoulders of well-proven shared platform components and support from an infrastructure team.
In practice, we ideally have one or two DevSecOps engineers embedded in each of our product teams. These aren’t full-time roles, but software engineers who have an extra interest in infrastructure, systems and tooling. They contribute to site reliability and sustainable engineering by encouraging good DevSecOps practices and building platform components, in addition to contributing as software engineers in the team. Our product teams are responsible for prioritizing these tasks to make sure that each team is both responsible for and capable of fixing their pain points.
In addition to the teams’ DevSecOps engineers, we have an infrastructure team that works closely with the product teams. The infrastructure team has the overall responsibility for our infrastructure, serves as experts, discussion partners and contributors to infrastructure and platform efforts, and has the final say on things that affect our production infrastructure. By working closely with the product teams to fix real pain points, we reduce the risk of wasting time building “solutions looking for a problem”.
We have lots of good DevSecOps stuff in store:
- Fully migrating to container-based infrastructure for CI/CD, testing, deployment and production operations
- Introducing “KEP”, a lightweight process, inspired by PEP, for evolving our stack asynchronously
- Enabling high reliability and visibility into warehouse robots and automated hardware with improved tools for operators and our on-site maintenance crew
- … and even faster CI/CD and deploy
Do you get excited by topics and challenges like these? We’re looking for DevSecOps Engineers for our product and infrastructure teams and would love to have a chat with you. Apply to one of our open positions or get in touch with our tech recruiter Sabrina if you’d like to talk!