Lessons learned from automating Network Configuration Management at TransferWise
In January 2020, I joined the Governance, Infrastructure and Networks team (yes, that does spell GIN!) at TransferWise. Our team provided other platform teams with interfaces and tooling to create compute resources using best practices and standardised methods. My challenge was to automate configuration management for all cloud network appliances we were responsible for.
Why? We needed automation to scale the team as our network estate grew in size and complexity. Automation also buys us a lot of efficiency at audit time. As a FinTech, we’re constantly being audited. Reducing the network audit process from an infrastructure audit involving many SSH and show commands, to checking out a Git repository is a huge efficiency improvement.
This is the story of how our network automation tool — twnet — was born, and what we learned along the way.
Getting started — what did configuration management look like for us?
The benefits of configuration management were clear, but getting there was not.
We started with a network inventory, had a short discussion on what configuration management looked like, and agreed on the following:
Atomic configuration builds and deployments. In this context, “atomic” implies we will never push or build partial sections of configuration. This:
- Avoids configuration drift — a common issue with manual configuration.
- Simplifies rollback.
- Enforces template based configuration.
Using Juniper devices means atomicity is relatively trivial to achieve and is (largely) non-disruptive, unlike with some other vendors. More on that later!
Configuration changes would be peer reviewed using version control. Using pull request reviews for changes lets us:
- Reduce human error. The pull request process ensures at least one other engineer has reviewed the code.
- Simplify maintaining an audit trail. All changes, proactive or reactive, require an approved pull request.
Using Python for configuration management. We chose Juniper’s PyEZ library — which uses NETCONF — to wrap appliance operations like deployment and rollback, along with YAML and Jinja2 for templating. Alternatives considered were Ansible, Netmiko and NAPALM. Custom code means we can use libraries like pytest to test builds and deployments, versus using a framework like Ansible.
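The build half of this toolchain can be sketched in a few lines: YAML holds the device model, Jinja2 renders it into Junos configuration. The data, file contents and template below are invented for illustration; twnet's real models are more involved.

```python
import yaml
from jinja2 import Template

# Hypothetical device model, as it might live in a YAML file.
data = yaml.safe_load("""
hostname: edge-fw-1
dns_servers:
  - 10.0.0.53
  - 10.0.1.53
""")

# Hypothetical Jinja2 template for the system stanza of a Junos config.
template = Template("""\
system {
    host-name {{ hostname }};
    name-server {
{%- for server in dns_servers %}
        {{ server }};
{%- endfor %}
    }
}""")

config = template.render(**data)
print(config)
```

The same pattern scales up: one YAML file per device, one template per configuration stanza, and a build step that stitches the rendered pieces together.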
We were now ready to begin templating our configurations.
Templating our devices, step-by-step
It’s worth noting templates without a delivery mechanism are of limited value. It’s nice to be able to generate a template, but if you’re still relying on a human to paste it into a terminal window, you’ve not removed your largest risk of an outage.
We embarked on fully templating a device once we’d verified that our chosen toolset could deploy a minimal bootstrap configuration.
Let’s take a look at how we started out with templating:
We let the device do the work…
Junos can represent configuration files as JSON. This played heavily in our favour, as we used YAML to model the data held in our devices, and Jinja2 templates to render the data to generate a device configuration.
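The JSON-to-YAML step is mechanical in Python. The snippet below mimics the shape of device output rather than querying a real appliance (with PyEZ you would fetch the configuration over NETCONF instead):

```python
import json
import yaml

# Stand-in for the device's JSON-formatted configuration output;
# the structure here is a simplified, illustrative example.
device_json = '{"configuration": {"system": {"host-name": "edge-fw-1"}}}'

model = json.loads(device_json)
as_yaml = yaml.safe_dump(model, default_flow_style=False)
print(as_yaml)
```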
… but not all the work
The challenge, once we had the data, was to make it more readable. For example, converting the JSON output from a Juniper vSRX into a YAML representation of its NAT configuration yields the following:
- name: src-nat-foo-pool
  address:
    - name: 10.1.1.1
- name: src-nat-foobar-pool
  address:
    - name: 10.1.1.2
The above YAML structure tries to enumerate a list of NAT pool addresses.
A more succinct representation would be:
- src-nat-foo-pool: 10.1.1.1
- src-nat-foobar-pool: 10.1.1.2
This shorter representation uses the pool name as the key for the value it stores, letting you readily distinguish between source NAT pools at a glance.
You can try this for yourself as an exercise to see where you can make the device’s representation a lot more succinct. BGP configuration in Junos is a prime candidate for this, as are a lot of policy statements.
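A sketch of that flattening in Python, assuming the verbose structure shown earlier (the field names mirror the shape Junos emits; the addresses are the example values from above):

```python
import yaml

# Verbose structure, shaped like the device's JSON/YAML output.
verbose = [
    {"name": "src-nat-foo-pool", "address": [{"name": "10.1.1.1"}]},
    {"name": "src-nat-foobar-pool", "address": [{"name": "10.1.1.2"}]},
]

# Collapse each pool to a "pool-name: address" pair.
succinct = {pool["name"]: pool["address"][0]["name"] for pool in verbose}

print(yaml.safe_dump(succinct, default_flow_style=False))
```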
If there’s one takeaway from this whole exercise, it’s this: think about your data.
Some questions you could ask yourself are:
- How would you represent the functions you want to configure?
- What makes the most sense for you as an engineering team, or business?
- How would you make a function self service? What would let a user achieve a given outcome and write the least lines of YAML? Can they still understand what they’re doing?
- Can you map data models to business logic?
Keep complexity out of your template
Templating languages aren’t the place to handle complexity in your configuration.
Don’t use Jinja2 to handle complexity. Use a higher order language (Python in our case) instead. If you find yourself writing several if/else statements in Jinja2, chances are you need to standardise, or find a different way to represent the model you’re trying to parse and render.
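As an illustration, here is a hypothetical normalisation step: instead of an if/else in the template coping with two shapes of peer data, Python canonicalises the model first. All field names here are invented for the example, not taken from twnet.

```python
from jinja2 import Template

def normalise_peer(peer):
    """Collapse legacy variants into one canonical shape, so the
    template never needs to branch."""
    # Hypothetical: older records used 'remote-ip', newer use 'address'.
    return {
        "name": peer["name"],
        "address": peer.get("address") or peer["remote-ip"],
    }

# The template stays branch-free: every key is guaranteed present.
template = Template("set security ike gateway {{ name }} address {{ address }}")

for raw in ({"name": "old-peer", "remote-ip": "192.0.2.1"},
            {"name": "new-peer", "address": "192.0.2.2"}):
    print(template.render(**normalise_peer(raw)))
```

The conditional lives in one tested Python function instead of being copied into every template that touches peer data.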
Which brings me to the next point.
Manual configuration led to a lot of variance. Variance then led to complexity. VPN establishment parameters were a good example of where we found a lot of needless variance we were able to eliminate.
Pay your tech debt down as you go. Document it if you can’t.
Sysadmins and Software Engineers alike are familiar with tech debt. Maybe you put in a firewall rule to test something and it never got removed after tests were complete. Maybe you forgot to turn logging off for something. Maybe you put a hack in to get a feature across the line.
We found a lot of tech debt while templating and we’re now paying it down.
If you can’t address these immediately (usually because the “temporary” hack is still in production!), document it, and address it later. JIRA works well for this.
We now had templating working, and it was time to test a deployment. But before this…
What else did twnet need to do?
So far, we had enough code to:
- Build configuration locally.
- Deploy what we built in the step above.
It was time to refine the process.
We replicated the workflow a typical engineer uses during configuration. A common workflow is to configure something manually, validate it (e.g. commit check), compare it with the existing configuration (e.g. show | compare) and note the differences.
As we build and deploy configurations atomically, we also needed a step to perform a “go/no-go” decision when applying configuration. So we wrote code to add the following features:
- A “check” mode deployment. This does a dry run of a deployment to validate the configuration built, and displays the difference (diff) without applying changes.
- A confirmed deployment, which displays a diff, applies the configuration if confirmed, and aborts the deployment if not.
These two features let us approximate a typical configuration workflow.
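The go/no-go flow can be sketched as below. The device object is a stub standing in for the real thing; with PyEZ, the load/diff/commit/rollback steps map onto the configuration utility, but the names and logic here are illustrative only.

```python
class FakeDevice:
    """Stub modelling a device's running and candidate configuration."""
    def __init__(self):
        self.running = "old config"
        self.candidate = None

    def load(self, config):
        self.candidate = config          # stage the candidate

    def diff(self):                      # like "show | compare"
        return f"- {self.running}\n+ {self.candidate}"

    def commit(self):
        self.running, self.candidate = self.candidate, None

    def rollback(self):
        self.candidate = None


def deploy(device, config, check=False, confirm=lambda diff: False):
    """Check mode shows the diff and changes nothing; otherwise the
    confirm callback makes the go/no-go decision on the diff."""
    device.load(config)
    diff = device.diff()
    if check or not confirm(diff):
        device.rollback()                # dry run or "no-go": abort cleanly
        return diff, False
    device.commit()                      # "go": apply the candidate
    return diff, True


dev = FakeDevice()

# "Check" mode: display the diff, apply nothing.
diff, applied = deploy(dev, "new config", check=True)
print(diff)
assert not applied and dev.running == "old config"

# Confirmed mode: the callback sees the diff and says "go".
diff, applied = deploy(dev, "new config", confirm=lambda d: True)
assert applied and dev.running == "new config"
```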
Time to test!
Up to now, deployments had been in our test environment. It was time to use twnet in production! Things went well, barring a hiccup in the deployment mechanism. When replacing configuration atomically, Junos allows you to force a full replacement or perform an incremental change.
Inadvertently choosing the former when we wanted the latter meant we ran into a bug on our first deployment. A full configuration replacement, as opposed to an incremental change, results in process restarts, which in turn cause temporary service degradation as those processes go offline! Even with automation, you need to know how the internals work. Obvious, but worth stating.
After fixing this minor setback with a quick read of the PyEZ documentation and a quick pull request (yay for VCS!), we rolled twnet out to our remaining environments. If you’re a developer, you’ll notice this looks a lot like a software release ;-)
If you’re a sysadmin, you’ll see network configuration has gone from someone — usually a single engineer — handcrafting a set of commands and pasting them into a terminal with little, to no oversight — to editing code, having it peer reviewed, and running a command (or pushing a button in your CD solution) to deploy it. No more worrying if you’re on the right device, or if you’re in the right config section, or if someone else is simultaneously making a change. You write code, submit a pull request, and deploy once approved. Need to roll back because something broke? Just check out the last good commit and deploy. You can even write and run unit and integration tests to ensure you’re not breaking things, as part of your deployment.
But automation doesn’t mean it’s over…
The more devices we assimilated into twnet, the more time began to free up. We could use this to pay down the tech debt, and start asking some interesting questions like:
- Can we put twnet into a CI/CD pipeline? Spoiler: yes we could!
- What network functions could become self service for other teams?
Taking the IPsec example, we’ve now gone from an IKE proposal being the following commands typed by a network engineer:
set security ike proposal foo authentication-method pre-shared-keys
set security ike proposal foo dh-group group2
set security ike proposal foo authentication-algorithm sha1
set security ike proposal foo encryption-algorithm aes-128-cbc
set security ike proposal foo lifetime-seconds 28800
to the following, which any DevOps-savvy engineer can write:
- name: foo
There’s a lot of potential to unearth here. It takes a lot less time to write too!
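One way to make that expansion work is to keep the standard values as defaults in the template, so a bare name renders the full proposal. A sketch follows; the defaults mirror the set commands above, but the template itself is illustrative, not twnet's.

```python
import yaml
from jinja2 import Template

# Hypothetical template: each parameter falls back to a team-wide
# default, so users only write YAML for what differs.
template = Template("""\
{% for p in proposals -%}
set security ike proposal {{ p.name }} authentication-method {{ p.get('authentication_method', 'pre-shared-keys') }}
set security ike proposal {{ p.name }} dh-group {{ p.get('dh_group', 'group2') }}
set security ike proposal {{ p.name }} authentication-algorithm {{ p.get('authentication_algorithm', 'sha1') }}
set security ike proposal {{ p.name }} encryption-algorithm {{ p.get('encryption_algorithm', 'aes-128-cbc') }}
set security ike proposal {{ p.name }} lifetime-seconds {{ p.get('lifetime_seconds', 28800) }}
{% endfor %}""")

# One line of user-facing YAML expands to the full proposal.
rendered = template.render(proposals=yaml.safe_load("- name: foo"))
print(rendered)
```

Standardising the defaults in one place also attacks the variance problem described earlier: every proposal that doesn't explicitly deviate is identical.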
Also, now that we’d gotten to a point where things were automated, it was crucial to ensure we didn’t regress to manual changes for quick wins. At a startup like TransferWise, things change fast, deadlines are always round the corner, and so is the temptation to “just change one little thing manually”. Fortunately, we’re very pro-automation. As engineers, our value to the business is more than being human keyboards.
What we learned from the process
There’s definitely a cultural shift required once you make the decision to automate things.
For starters, once you decide to automate, you’re asking people to change an approach they’ve been successfully using for a long time. This is not a friction free process. Once again, team and — as importantly — management buy in, is crucial.
In addition to this, let’s not forget the technical challenges:
CD tools aren’t built with network device deployments in mind
Note the distinction between CI and CD: CI lets you integrate your code into your code base; CD lets you deploy it.
We evaluated a few CI and CD tools and performed a proof of concept exercise. The harsh reality is that most CD tools don’t play well with network devices. “Thing-doers” like Concourse work better, but they have a steep learning curve.
Local development for network automation is hard
Look at how far server virtualisation has come. As a network engineer, it’s hard not to look across the aisle at our systems engineering colleagues with some jealousy. They get to use things like Docker, Vagrant and other cool virtualisation tools to test their deployments. (The network team at TransferWise does get to use those things too, more for tooling than networking, but we live in hope!)
With networks — especially vendor appliances — your choices previously were to test in production (never a good idea), or maintain costly lab infrastructure. At a cloud first organisation like TransferWise things are a little better. We could maintain an on-demand lab environment, but it still costs a decent 5 figure sum annually if always on. Unfortunately, “cloud first” doesn’t mean “cloud only” either. Regulatory requirements dictate physical appliances sometimes, but if anyone can make that change, it’s us!
Immature deployment code
NETCONF is nice, but it’s relatively new, as are the libraries using it. It’s not uncommon for deployments to fail with essentially blank error messages. When we’re really stuck, part of our troubleshooting workflow involves replaying the deployment by hand on the CLI in our test environment. Contrast this with server deployments, where you can expect useful, detailed error messages and a suite of mature tools to troubleshoot.
What’s next for network automation at TransferWise?
Network automation — despite its challenges — has undoubtedly been an interesting project to work on at TransferWise. We’ve got a long way to go, but we’re on the right path. It’s one of those things you can’t stop once you start (like most tasks you automate).
Some things we’ve got in the pipeline for the future are CD for networks, turning YAML models into an API and auto remediation (which feeds into a personal goal to never get paged — but that’s something for another day!). Network automation is a lot more than just configuration management.
It’s hard to not be excited by this. So if you’re reading, I’m hoping you’re considering being a part of it!
P.S. Interested to join us? We’re hiring. Check out our open Engineering roles.