Modern Infrastructure for Modern Software Delivery — Moving Porsche to the Cloud

Florian Sellmayr
Published in PorscheDev
8 min read · Oct 23, 2017

Moving towards an agile software delivery model brings changes to the way teams work. They

  • are empowered
  • need to gather feedback quickly
  • move fast
  • use modern technologies

For this to succeed, they need infrastructure tools and processes that support this workflow. They need insights and control over their technology stack to iterate quickly and build the best solutions.

IT departments’ traditional ticket-driven processes and standard products are often not the right fit for this type of work. They need to evolve alongside software delivery teams.

This post describes the Porsche Connected Car program's journey towards a new, AWS-based infrastructure.

Finding the hook

For business leaders, it can be hard to grasp the challenges, tradeoffs and benefits of technical tasks (be it a small refactoring or replacing the whole infrastructure). So if we want to get their support, we need to frame technical decisions in a way that makes sense in their world:

“This might make us better at developing software” might be true and obvious to any engineer, but for outsiders it is vague and hard to follow.

“This will improve customer experience and increase revenue” is concrete, measurable and a straightforward argument for any business leader, so finding it will go a long way towards bringing them to our side.

For us, this argument was the performance experience for our customers. As a car manufacturer that aims to provide a superior quality to customers all over the world, serving all of them from data centers in Germany does not seem to be the best approach. But how to prove it?

Measuring Performance

If we want to argue from a well-founded base, we first have to find out if we even have a problem.

So first, we need to look at how fast our product is for users all over the world. For this, WebPagetest served us well: it allowed us to run simple scripted user sessions from a wide range of locations and to gather detailed performance data. It also generates videos comparing the user experience across regions, which helps non-technical people understand the performance impact.

So, we established that we have a problem. Now it’s time to drill down to find the actual bottlenecks affecting performance.

Properly instrumented applications can provide a wealth of data to do this, logging performance metrics for incoming and outgoing requests that we can query and visualise in a log analysis tool (in our case Splunk).

Once we had a more detailed picture of the critical requests in our application's path, we used Gatling to build load tests that allowed us to reproduce the critical aspects in various scenarios.

In the end, we found a number of different aspects that affected performance. We were also able to prove our hypothesis that network latency between our customers, their cars and a central data center played a role in end-user performance, giving us a great starting point for discussions with various stakeholders.

A vision and new team

Having convinced various stakeholders that changes to our infrastructure stack would not only improve the ability to build and operate our software but also give our users tangible benefits, a small team formed around a core vision:

An agile, modern infrastructure stack owned by empowered delivery teams.

By design, this team would be temporary. They would prototype and clear the way, then coach delivery teams to take ownership of their infrastructure.

Experimentation and Prototyping

Before we could get started migrating our applications, we first needed to get a feel for the new platform. We needed a realistic testbed that would take real-world constraints into account.

To accomplish this, we decided to deploy a minimal subset of our existing application landscape, consisting of part of our edge layer (responsible for, among other things, managing authentication of requests) as well as one of the simpler applications with a minimum of dependencies on external systems.

This setup was simple enough to iterate quickly but also complete enough to run full user journeys and to test a number of more complicated requirements such as dependencies between systems, encryption in transit and at rest, secret management and monitoring.
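
To make one of those requirements more concrete: on AWS, aspects like encryption at rest and secret management can themselves be expressed as infrastructure code, for example with Terraform (which we come back to later). The sketch below is purely illustrative; the key, parameter name and variable are placeholders, not our actual setup.

# A KMS key for encryption at rest and a secret stored as an encrypted
# SSM parameter, both managed as code next to the application.
resource "aws_kms_key" "testbed" {
  description         = "Encryption at rest for the testbed application"
  enable_key_rotation = true
}

resource "aws_ssm_parameter" "backend_api_key" {
  name   = "/testbed/backend/api-key"   # hypothetical secret
  type   = "SecureString"               # encrypted using the key above
  key_id = aws_kms_key.testbed.arn
  value  = var.backend_api_key          # injected at apply time, never committed
}

variable "backend_api_key" {
  type      = string
  sensitive = true
}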

Using this testbed, we were able to figure out which technologies and approaches worked for us, which solutions we wanted to carry over from the existing infrastructure and where we needed to find new ones. We could experiment with migration strategies and find dependencies and pitfalls. We could extend it to new use cases, allowing us to learn about aspects we didn't know before.

Scale it

After a while, our testbed evolved more and more into a production-like setup we were comfortable with. We felt we had covered most of the critical aspects and found solutions that worked for our case.

At this point, our testbed became the seed for our new infrastructure. Since it was built from the start with the same approaches we would use for a “real production” system (testing and monitoring, infrastructure as code, resilience), there was no need for re-engineering. We could use our existing testbed infrastructure and scale it.

In our case, this meant starting to hand over responsibilities. While application delivery teams were always in touch with the experiments of the cloud enablement team, now was the time to get more hands-on: to work together to create infrastructure they would own and evolve, for the applications each team owned.

This is where we are right now. Team by team, application by application, we are moving pieces onto a new kind of infrastructure, learning, adjusting and improving as we go.

Learnings

While this process is far from over, we have already learned quite a lot. So here's a collection of things we found out, that worked well for us or that we learned the hard way.

Be prepared for surprises and plan accordingly

It is in the nature of transformation that not all aspects can be planned reliably. You are trying to establish something new, something your organisation doesn't have experience with, so there will necessarily be surprises and things you didn't expect. Things might take longer or won't work the way you want them to. Your stakeholders should be prepared for this, be on board with it and account for it in their planning.

Use realistic examples and build for production from the start

When starting something new, we might be tempted to start with something that is simple, to “do it properly after we are done experimenting”. This divorces your experiments from reality. You end up with situations where your results don’t work when tried “for real” or where a lot of rework is necessary.

Instead, try to find a small subset of what you are trying to do and do it as if it were a production system: Build testing and monitoring in from the start. Automate. Create delivery pipelines. Design for resiliency and scalability.

You don't have to do all of this at once, but until you take those aspects into account, you won't discover the little edge cases, hidden constraints and intricacies that will make or break your solution sooner or later.

Find out what you can reuse but don’t shy away from building new things

When moving from on-premise to cloud infrastructure, it will be tempting to reuse as much as possible. After all, you probably already have a lot of infrastructure code, custom deployment tools, technologies, architectures and processes that contain years of organisational knowledge. And some of those can be very valuable. Others don't apply in the new world and there's no sense in keeping them around: in many respects, the cloud works fundamentally differently from your on-premise IT.

As one example, on-premise data centers often work with long-lived, mutable servers, while cloud providers optimise for a model where individual servers are unreliable, short-lived and immutable. This has consequences for the way you build and maintain your infrastructure and for the tools you use (see the sketch after this list):

  • Deployment changes from deploying a new artefact to replacing the whole server
  • You roll out new images instead of orchestrating changes from a puppet-master
  • Scaling up and down based on demand replaces static upfront capacity planning
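
The sketch mentioned above: a minimal Terraform example of the immutable model on AWS. The resource names, sizes and variables are placeholders rather than our actual configuration; the point is that a release becomes a freshly baked machine image, and rolling it out means the auto scaling group replaces servers instead of changing them in place.

resource "aws_launch_template" "web" {
  name_prefix   = "web-"
  image_id      = var.web_ami_id   # new, pre-baked image per release
  instance_type = "t3.small"
}

resource "aws_autoscaling_group" "web" {
  name_prefix         = "web-"
  min_size            = 2
  max_size            = 6                        # scale on demand, not fixed upfront capacity
  vpc_zone_identifier = var.private_subnet_ids   # list of subnet ids

  launch_template {
    id      = aws_launch_template.web.id
    version = aws_launch_template.web.latest_version
  }

  # When the image changes, instances are replaced gradually rather than patched in place.
  instance_refresh {
    strategy = "Rolling"
  }
}

variable "web_ami_id" {
  type = string
}

variable "private_subnet_ids" {
  type = list(string)
}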

Make sure you start working on organisational aspects early

While most of this post focuses on technical aspects, we shouldn’t forget that in large organisations, non-technical or organisational aspects will dominate a lot of what you do.

Architecture Committees will want to know about your plans, IT Security might need to sign off. Accounting and Procurement need to figure out how to pay the bills and business units will want to know how this affects their existing features and your ability to build new ones.

Most of these things will take a while to sort out, and they can derail a project if we don't start early enough. So involve people early and work with them, but don't let them set your agenda: you are trying to build something new, so by definition it will be different in some sense, and standardised approaches will not apply.

Build for your teams' current capabilities but design for growth

For a team with the mission to enable others, one of the hardest tradeoffs is how much abstraction to design upfront vs. how much freedom to allow individual delivery teams. Do we want delivery teams to have a single “deploy” button and nothing more, or do we want them to design their own networking structure, CIDR ranges and all?

Too much flexibility means more things to learn and keep track of for the individual, more things to go wrong and more inconsistencies across teams. Too much abstraction means longer lead times (because abstractions need to be built), less flexibility, more dependencies and the very real danger of building the wrong abstractions.

The truth will almost certainly (unless all your employees already are expert cloud engineers) be somewhere in between and will depend on your organisation. But as a general rule of thumb, try to find a solution that works for your audience's current capabilities (to allow them to learn and adapt quickly) but design it in a way that allows them to grow out of it eventually and to drop the abstractions when necessary.

For example, for us that meant starting to prepare Terraform modules for some common infrastructure patterns (e.g. a web application behind a load balancer). Teams can use them if the defaults work for them but can always choose to abandon them and work at a lower layer, with different tools, if they need a more custom solution.
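
As a hedged sketch of what using such a module might look like (the module name, source path and inputs here are hypothetical, not our actual modules):

module "vehicle_service" {
  # Shared pattern: application servers behind a load balancer.
  source = "./modules/web-application"

  name          = "vehicle-service"
  image_id      = var.vehicle_service_ami_id
  instance_type = "t3.small"
  subnet_ids    = var.private_subnet_ids
}

The important property is the escape hatch: the defaults are convenient, but nothing stops a team from replacing the module with the underlying resources when it no longer fits.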

Prevent scope creep, aim for production as early as possible

Our approach started with a phase of experimentation, prototyping and preparation. It allowed us to gain confidence and find solutions that we could recommend to delivery teams.

It’s also very vulnerable to scope creep: There’s always another thing to explore, another thing that “just needs to be done before we are ready”, another thing that just “isn’t good enough yet”.

So keep in mind that in isolation, this phase doesn’t add value. Value is only added once delivery teams start building on top of it, once real software is running on real infrastructure for real users.

So aim for production early. Aim for production not when things are perfect (they never are) but when things are just barely good enough. This will tell you what's important and what's just nice to have. It will give you feedback on whether you are on the right track and tell you all the things you missed.

Florian Sellmayr
Software and Infrastructure Engineer @ThoughtWorks