Smooth Sailing: The Benefits of Automation for Offboarding

Published in Airwallex Engineering, May 20, 2024

Airwallex takes an engineering-first approach to solving common security problems. This article discusses an event-driven approach that fully automates employee offboarding, improving the efficiency, accuracy and maintainability of a core security function.

Problem Space

Employee offboarding is a key part of the employee lifecycle, and every company needs to do it. Although the process is common practice, it is often treated as a “taboo” topic, so it is rarely prioritised and best practices are seldom shared. A well-run offboarding process, however, can help avoid common pitfalls and make the overall experience a positive one.

At Airwallex, we approached this head-on as an opportunity. An employee’s exit brings a lot of paperwork and process, which can make for a tedious and poor experience for both the team and the departing employee. Accounts need to be locked, access restricted and data protected. As a company, we want the exit process to be smooth and seamless, with all loose ends tied up with the departing employee.

To solve this, we fully automated our legacy manual offboarding processes. This blog post covers the lessons we learned in building a scalable, extensible and maintainable offboarding infrastructure that significantly reduced manual effort, improved security and allowed our teams to automate various other post-employment actions.

Early Lessons

Our first iteration was a single Python offboarding script deployed as a cloud function on Google Cloud Platform (GCP). This function would regularly poll our HR system for employee details to identify which employees were due for offboarding and compile a historical list of offboarded employees. It would then make sequential API calls to check the statuses of these users in critical tools like Okta, Slack and Google Workspace. If any were still active, the function would suspend their access.

This initial implementation sped up our offboarding processes but had three key drawbacks.

Consistency: Offboarding should be executed completely, or not at all. A failure partway through the script could terminate it prematurely, leaving users only partially offboarded. An employee might be deactivated in one system like Okta, but not in another like Google Workspace.
Efficiency: We ran the function at regular intervals, but most of the time our script didn’t find anyone to offboard. The function still polled all users in Okta, Google Workspace, and Slack, regardless of whether anyone needed to be offboarded.
Maintainability: Adding new features would make the function unnecessarily long and complex, raising efficiency and maintenance concerns. Although we used different classes and functions, the sequential nature of the script meant that each new tool and each additional employee in the offboarding process would increase the execution time and the risk of not completing within our target service-level agreement of two minutes.

Based on these lessons, we redesigned our offboarding infrastructure to mitigate these issues. Instead of treating this service as a business-as-usual (BAU) task, we modelled our offboarding process as a core business product and approached the problem space with the same rigour that we apply when building delightful and secure customer-facing features. Starting from requirements capture and supported by a first-class continuous build, test and deploy framework, we created a scalable offboarding infrastructure that is open to evolution and extension.

Revisiting our Requirements

Creating an improved offboarding infrastructure meant building a system that is efficient, durable, and scalable. We took a step back to review what ultimate goals we wanted to achieve.

Efficiency: We wanted to consistently suspend and revoke all of a user’s access within two minutes of their offboarding date and time. This includes access to apps, physical devices, and other tooling. We also wanted minimal computational overhead, with processes running as quickly as possible.
Durability: Fallbacks such as retrying when an offboarding run fails would make the system more resilient to failures. To reduce the likelihood of missing any steps in the offboarding process, we wanted to make sure that an error encountered when offboarding a user from one service wouldn’t impact other parts of the system.
Scalability: As the company continues to grow, the offboarding process will likely become more complex. We wanted to be able to easily add new services or workflows to the system as needed, so we can adapt to new use cases as they arise.
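To make the durability requirement more concrete, here is a minimal sketch of a retry fallback in Python. The function name `run_with_retries`, the attempt count and the exponential backoff are all assumptions for illustration; the article does not describe how retries are actually configured.

```python
import time


def run_with_retries(task, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `task` until it succeeds or `attempts` is exhausted.

    Hypothetical helper: waits base_delay * 2**attempt seconds between
    tries and re-raises the last error if every attempt fails.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return task()
        except Exception as exc:
            last_error = exc
            # Only wait if another attempt is still to come.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error
```

A bounded number of attempts is used here so that a persistent failure surfaces as an error instead of looping silently.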

Event-driven Implementation

An event-driven system composed of independent cloud functions was a good fit for these requirements. This architecture allowed us to separate our monolithic offboarding function into smaller, independent “deactivator” functions for each critical tool, with Pub/Sub topics for triggering and relaying information. The resulting web of functions removes the single point of failure, lets processes run in parallel, and makes failures easy to pinpoint and triage. Additionally, adding new functionality is now as simple as connecting a new cloud function to the system.

During the implementation, we had to decide how to best translate the offboarding process into functions within our system. Since the bulk of IT offboarding lies in going through different tools and revoking access, we separated each cloud function based on the specific tools where employees needed to have their access removed, and added the necessary deactivation features within each of these functions.

These “deactivators” are triggered by a polling function which runs every two minutes. Similar to our original implementation, this polling function retrieves historical data of offboarded employees.

A key innovation was to store a hash of this historical data and compare the hash values between runs. If the hashes differ, we know that an additional employee needs to be offboarded and relay this information to our deactivator functions. Because we can now identify a change in state, our current system only runs deactivations when it needs to, avoiding the computational load and cost of unnecessary API calls.
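The hash comparison described above could be sketched roughly as follows. This is an illustration under assumptions, not our production code: the `state` dict stands in for whatever persistent store holds the previous run's data, the choice of SHA-256 is assumed, and keeping the previous list of IDs alongside the hash (to work out who is new) is also an assumption.

```python
import hashlib
import json


def roster_hash(offboarded_ids):
    """Return a stable SHA-256 digest of the offboarded-employee list."""
    # Sort and de-duplicate so that ordering changes in the HR export
    # do not register as a change in state.
    canonical = json.dumps(sorted(set(offboarded_ids)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def poll_once(state, current_ids):
    """Compare hashes between runs and return the IDs to relay.

    `state` is a hypothetical dict holding the previous hash and IDs.
    """
    current_hash = roster_hash(current_ids)
    if current_hash == state.get("hash"):
        return []  # nothing changed, so no downstream API calls
    # The hashes differ: work out which employees are new since last run.
    to_relay = sorted(set(current_ids) - set(state.get("ids", [])))
    state["hash"] = current_hash
    state["ids"] = list(current_ids)
    return to_relay
```

With this shape, a run that finds no change costs one HR lookup and one hash comparison, and the deactivators are never triggered.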

By making use of Pub/Sub topics, we were now also able to trigger the deactivators concurrently. This way, a user’s access can be revoked from multiple tools in parallel to one another, unlike our initial sequential implementation. Doing this greatly helps in our goal of making offboarding as “moment in time” as possible.
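As a rough illustration of this fan-out, the sketch below uses an in-memory stand-in for a topic and a thread pool from the Python standard library. It is not the GCP Pub/Sub client: `InMemoryTopic` and the three deactivator functions are hypothetical names, and the simulated Slack failure exists only to show that one failing tool does not block the others.

```python
from concurrent.futures import ThreadPoolExecutor


class InMemoryTopic:
    """Stand-in for a Pub/Sub topic that fans messages out to subscribers."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, name, handler):
        self._subscribers.append((name, handler))

    def publish(self, message):
        """Run every subscriber concurrently and collect per-tool outcomes."""
        results = {}
        workers = max(1, len(self._subscribers))
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = {name: pool.submit(handler, message)
                       for name, handler in self._subscribers}
            for name, future in futures.items():
                try:
                    results[name] = future.result()
                except Exception as exc:
                    # A failure in one tool is recorded, not propagated.
                    results[name] = f"failed: {exc}"
        return results


def deactivate_okta(message):
    return f"okta suspended {message['employee_id']}"


def deactivate_slack(message):
    raise RuntimeError("slack API unavailable")  # simulated outage


def deactivate_workspace(message):
    return f"workspace suspended {message['employee_id']}"
```

Subscribing the three functions and publishing `{"employee_id": "e123"}` returns a result for each tool, with the Slack entry marked as failed while Okta and Google Workspace still complete.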

Because GCP Pub/Sub uses “at-least-once” delivery by default, we have high confidence that messages published to a topic will reach our deactivators, even if network failures occur. At-least-once delivery also means that a message can occasionally arrive more than once, so a deactivation needs to be safe to repeat.
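One common way to live with at-least-once delivery is to make each handler idempotent, so that a redelivered message does no harm. The sketch below checks whether the account is still active before suspending it, mirroring the status check our original script performed. `FakeDirectory` is a hypothetical stand-in for a tool's user API, not a real client.

```python
class FakeDirectory:
    """Hypothetical stand-in for a tool's user-management API."""

    def __init__(self, active_users):
        self.active = set(active_users)
        self.suspend_calls = 0

    def is_active(self, employee_id):
        return employee_id in self.active

    def suspend(self, employee_id):
        self.suspend_calls += 1
        self.active.discard(employee_id)


def handle_offboarding(directory, employee_id):
    """Suspend the user only if still active, so repeats are harmless."""
    if not directory.is_active(employee_id):
        return "already-suspended"
    directory.suspend(employee_id)
    return "suspended"
```

Handling the same message twice returns `"suspended"` the first time and `"already-suspended"` the second, with only one suspend call made.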

Impact

Extensibility

Now that we’ve designed the system to be more modular, it’s easy to add new deactivators for other new tools that we might use in the future. We just need to implement a new cloud function for the tool and connect it to Pub/Sub topics. For example, we added new functions which connect to our deactivator function for Okta to remote-lock user laptops through our mobile device management (MDM) services. When the Okta deactivator completes its offboarding run, it sends out messages containing details of the offboarded employee to a device deactivation topic which will trigger offboarding in our MDM.
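The chaining between the Okta deactivator and the MDM lock could look something like the sketch below. It uses a small synchronous, in-memory bus in place of real Pub/Sub topics. The class `Bus`, the topic names `"offboarding"` and `"device-deactivation"`, and the message fields are assumptions made for illustration.

```python
from collections import defaultdict


class Bus:
    """Minimal synchronous stand-in for a set of Pub/Sub topics."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self._handlers[topic]:
            handler(message)


def make_okta_deactivator(bus, audit):
    def deactivate(message):
        # Record the (simulated) Okta suspension first.
        audit.append(("okta", message["employee_id"]))
        # Then relay the employee's details to the device topic.
        bus.publish("device-deactivation", {
            "employee_id": message["employee_id"],
            "device_name": message["device_name"],
        })
    return deactivate


def make_mdm_locker(audit):
    def lock(message):
        # Record the (simulated) remote lock issued through the MDM.
        audit.append(("mdm-lock", message["device_name"]))
    return lock
```

Adding another downstream step in this model means subscribing one more handler to the relevant topic, with no change to the existing deactivators.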

In addition to this, we can implement helper functions that further reduce manual effort. Aside from deactivating users from tools, we’ve added functions that run housekeeping tasks for us, like transferring user drives into a centralised archive, revoking user application sessions, and freeing up IT resources. What started as a primarily security-driven project led to broader collaboration, which resulted in significant efficiency gains for multiple teams across the business!

Observability

Using this framework also makes it easy to debug the flow of data through the system. We’ve configured each cloud function to log important steps in its execution, including details like the user, device name, and any errors for the current run. We collect these logs in a centralised security information and event management (SIEM) tool and set up alerts that notify us of any errors that occur in the functions’ runs. We can easily build a timeline of where the data has flowed and through which functions, and pinpoint where in the execution an error occurred.
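A minimal sketch of this kind of structured logging is shown below, using only the Python standard library. The field names, the `log_step` and `timeline` helpers and the JSON layout are assumptions for illustration; the article does not specify the log schema or how the SIEM ingests it.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("offboarding")


def log_step(function_name, step, employee_id, device_name=None,
             error=None, timestamp=None):
    """Emit one JSON log line describing a step, and return that line."""
    entry = {
        "timestamp": timestamp or datetime.now(timezone.utc).isoformat(),
        "severity": "ERROR" if error else "INFO",
        "function": function_name,
        "step": step,
        "employee_id": employee_id,
        "device_name": device_name,
        "error": error,
    }
    line = json.dumps(entry, sort_keys=True)
    logger.log(logging.ERROR if error else logging.INFO, line)
    return line


def timeline(lines, employee_id):
    """Order one employee's entries by timestamp as (function, step, severity)."""
    entries = [json.loads(line) for line in lines]
    mine = [e for e in entries if e["employee_id"] == employee_id]
    mine.sort(key=lambda e: e["timestamp"])
    return [(e["function"], e["step"], e["severity"]) for e in mine]
```

Because every line carries the employee, the function and a severity, an alert can key on `severity` while `timeline` reconstructs the path a single offboarding took through the functions.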

This system also notifies us of offboarding status updates through functions that run on a schedule to collect batches of offboarding information and send them to us in Slack channels. This allows us to have much better tracking of the offboardings and awareness if any errors occur, and supports centralising communication and notes about any offboardings in Slack.

Productivity

Productivity has increased since our team no longer has to channel significant manual effort into offboarding. It also helps other teams, like our HR team members, to focus on more strategic work. Our security has also improved, with significantly more consistent offboarding and better tracking, allowing us to detect potential incidents and respond to them much more quickly. Additionally, the modularity of the system has made it more robust and maintainable, allowing for more efficient troubleshooting and freeing up more time for building out new features.

Conclusion

At Airwallex, we take an engineering-first approach to solving our security and IT challenges. We’ve extended this mindset to our offboarding process, modelling it as a core business product and approaching the problem space with the same rigour that we apply when building our customer-facing features. Adopting an event-driven pub/sub deactivation pattern has proven to be a great fit, meeting our core requirements of efficiency, scalability, and durability, while simultaneously realising additional benefits such as extensibility, observability, and productivity.

Looking forward, we will continue to build on and improve our system. There are a lot of new features that we can still implement. Automatic shipment handling for returning devices from users, API and SSH key revocation in applications, and deactivation of corporate-managed mobile devices are a few of the features we plan to add in the near future.

This offboarding service will remain an evolving process and still has further iterations to come!