Navigating the Driver Portal Migration: A Technical Odyssey

Published in Gett Tech · 7 min read · Jun 24, 2024

In September 2023, news broke that Gett had secured the Israel Airport Authority (IAA) tender, an achievement that generated considerable excitement both within the company and in the media. However, the initial celebration quickly gave way to the practical realities of meeting the tender’s stringent requirements. Product managers and operations personnel immediately began outlining the new capabilities the system would need to support, with the Driver Portal playing a pivotal role in many critical workflows. It was clear that the portal would need to meet Gett’s highest technological standards for reliability, availability, scalability, and monitoring.

As highlighted in the first part of this series, the Driver Portal ecosystem required improvements to fully meet the standards we have for Gett’s platform. Recognizing this, in October 2023, the R&D management set a clear goal: deprecate the existing system and transition to Gett’s state-of-the-art infrastructure just in time for the project’s rollout. Only through a complete migration could we attain the desired technological standards.

As we began planning the roadmap for this ambitious project, we expected to run into challenges we could not foresee. To mitigate potential delays, we prepared for the possibility that the migration might extend beyond our planned timeline. Thus, from the outset, we also implemented the necessary features in the legacy system as a backup solution, ensuring we could meet the IAA tender requirements regardless of migration progress.

Despite the challenges, the team executed the migration within an impressive five-month timeframe, concluding ahead of schedule. This rapid turnaround was achieved while simultaneously delivering the functional requirements for the IAA tender, ensuring that both the migration and the new capabilities were ready for the early 2024 launch.

Let’s now dive into the details of the endeavor!

Code base migration

The migration process started with a comprehensive identification of all services used by Gett Drivers and Gett Israel Operations personnel. We analyzed each service’s code base to identify its dependencies: other services within the Driver Portal ecosystem, services owned by other Gett R&D teams (located in a different AWS account), external third-party applications (such as SAP and Salesforce), storage solutions (like databases and S3 buckets), and AWS-specific services (such as Lambda functions and SNS/SQS).

Below you can find schematics for a service-centered diagram and for an illustration of a complete system architecture.

Once the system architecture was clearly mapped, down to each and every application belonging to the Driver Portal ecosystem, we forked the relevant GitHub repositories into our main R&D account and adapted the code to run like any other R&D service at Gett. This required numerous key modifications, to name a few:

all secrets were moved into the secure storage used by the Gett platform
environment variables were transferred to a dedicated repository to enable seamless deployment to testing and production environments
the Go version of the microservices was updated to a common standard
API requests relying on Gett’s public API were replaced with direct calls, eliminating the authentication/authorization mechanisms previously required for inter-VPC communication
new Dockerfiles were created for efficient deployment via dedicated CI/CD pipelines
we switched to Kubernetes for container orchestration, ensuring automatic scaling and efficient pod management; before the migration, each service ran at all times on 2 separate EC2 instances for redundancy
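
To make the first two modifications more concrete, here is a minimal Go sketch of a service that reads its settings from environment variables injected at deploy time and refuses to start when a required secret is missing. The variable names (DATABASE_URL, LISTEN_ADDR), the Config type, and the default port are hypothetical and only serve as an illustration; this is not a description of Gett's actual secret store or deployment tooling.

```go
package main

import (
	"fmt"
	"os"
)

// Config holds the settings a migrated service needs at startup.
// Field and variable names are hypothetical, chosen for this sketch.
type Config struct {
	DatabaseURL string // assumed to be injected from the secret store at deploy time
	ListenAddr  string // non-secret value, assumed to come from the environment repository
}

// LoadConfig reads settings through lookup (os.LookupEnv in production).
// It fails fast when the required secret is missing, so a misconfigured
// deployment is caught at startup rather than on the first request.
func LoadConfig(lookup func(string) (string, bool)) (Config, error) {
	dbURL, ok := lookup("DATABASE_URL")
	if !ok || dbURL == "" {
		return Config{}, fmt.Errorf("required variable DATABASE_URL is not set")
	}
	addr, ok := lookup("LISTEN_ADDR")
	if !ok || addr == "" {
		addr = ":8080" // fallback for local runs
	}
	return Config{DatabaseURL: dbURL, ListenAddr: addr}, nil
}

func main() {
	cfg, err := LoadConfig(os.LookupEnv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "config error:", err)
		os.Exit(1)
	}
	fmt.Println("listening on", cfg.ListenAddr)
}
```

Passing the lookup function as a parameter keeps the loader easy to exercise in tests without touching the real process environment.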

This collaborative effort between the infrastructure and Driver Portal teams spanned several months. Once the services and their dependencies were ready, we began the testing phase.

Testing strategy

At this stage, our focus was on validating the application code along with the changes made during the code migration step. We therefore kept the same storage and AWS-related components to ensure consistency, which was made possible by a VPC peering connection between the two AWS accounts. Since the underlying data was the same, the services were expected to look and behave exactly as their legacy versions did.

The URL domains of the migrated services were initially shared with a restricted group, including IL Ops personnel, QA engineers, and product managers. From here on we entered an iterative process of debugging every possible feature/scenario to validate the new applications until all issues were resolved.

Below you can find some of the main technical problems we encountered:

missed external dependencies: Google Sheets documents and static files hosted on S3 that weren’t initially identified
missed request headers: when we stopped going through Gett’s API interface, we overlooked certain headers that it added automatically, which broke specific flows; we had to add those headers to our direct API calls to restore functionality
shortage of logs: as some functionalities broke, we had to add extensive logging to trace and identify the sources of problems
resource constraints: some logic had to be re-implemented so that services would operate efficiently within the pods’ CPU and memory limits in Kubernetes
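
The missed-headers problem lends itself to a small illustration. One way to make sure every direct call carries the headers that a gateway used to add is to wrap the HTTP client's transport. The sketch below is written under assumptions: the header name X-Request-Source, its value, and the helper names are hypothetical, and the post does not say which headers were actually missing or how they were restored.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// headerTransport adds a fixed set of headers to every outgoing request,
// unless the caller has already set them.
type headerTransport struct {
	base    http.RoundTripper
	headers map[string]string
}

func (t headerTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// A RoundTripper should not modify the caller's request, so work on a clone.
	clone := req.Clone(req.Context())
	for name, value := range t.headers {
		if clone.Header.Get(name) == "" {
			clone.Header.Set(name, value)
		}
	}
	return t.base.RoundTrip(clone)
}

// newInternalClient returns a client for direct service-to-service calls.
func newInternalClient(headers map[string]string) *http.Client {
	return &http.Client{
		Transport: headerTransport{base: http.DefaultTransport, headers: headers},
	}
}

// demo starts a throwaway local server that echoes one header back, calls it
// through the wrapped client, and returns what the server saw.
func demo() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.Header.Get("X-Request-Source"))
	}))
	defer srv.Close()

	// Hypothetical header; the real set would come from inspecting the old flow.
	client := newInternalClient(map[string]string{"X-Request-Source": "driver-portal"})
	resp, err := client.Get(srv.URL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	seen, err := demo()
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("server saw X-Request-Source =", seen)
}
```

Centralizing the headers in the transport means individual call sites cannot forget them, which is the failure mode described in the list above.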

For services requiring real user traffic, such as the call center management service, we employed canary deployment, redirecting a small percentage of traffic (typically 10%) to the new application for a predetermined period of time (typically 1 hour). By integrating appropriate monitoring and logging tools, we quickly identified and resolved any issues.
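
The post does not describe how the traffic split was implemented; in practice it is often configured at the load balancer or ingress layer rather than in application code. Purely to illustrate the idea of sending a fixed share of traffic to the new application, the Go sketch below assigns each session key to one of 100 buckets using a hash, so the same key always lands on the same side. The function names, the key format, and the use of FNV-1a are assumptions made for this example.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// inCanary reports whether key falls into the canary share.
// The key is hashed into one of 100 buckets; buckets below percent
// belong to the canary, so a given key is always routed the same way.
func inCanary(key string, percent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(key)) // Write on a hash.Hash never returns an error
	return h.Sum32()%100 < percent
}

// backendFor picks which deployment should serve the request.
func backendFor(key string, percent uint32) string {
	if inCanary(key, percent) {
		return "migrated"
	}
	return "legacy"
}

func main() {
	const canaryPercent = 10 // the typical share mentioned above
	const total = 10000

	migrated := 0
	for i := 0; i < total; i++ {
		// Hypothetical session identifiers, used only to exercise the split.
		if inCanary(fmt.Sprintf("session-%d", i), canaryPercent) {
			migrated++
		}
	}
	fmt.Printf("%d of %d sessions routed to the migrated service\n", migrated, total)
}
```

Hashing a stable key instead of drawing a random number per request keeps a given user on one version for the whole canary window, which makes reported issues easier to reproduce.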

Once the Driver Portal and its dependencies were validated, we opened it to beta users (a select group of Gett drivers) to gather feedback and increase real-world exposure.

Monitoring and alerts

Legacy services lacked any type of monitoring and alerting. Instead, they relied on a data analytics dashboard that exposed raw table data, allowing IL Ops personnel to debug application behavior to some extent. Additionally, certain actions and errors emitted events that could be queried through the data analytics interface.

To align with Gett’s best practices, we restricted database access outside the R&D department to the selected personnel who need it to operate the system, and deprecated the existing data analytics interface. Additionally, we implemented high-quality monitoring and alerting mechanisms, improving internal logging and creating alerts based on metrics such as error rate, Apdex, web response time, and throughput, as demonstrated in the example below.

The events being emitted by old services were migrated into Gett’s event pipeline, which can be consumed by visualization tools like Grafana and Mixpanel. This allowed us to deprecate the SNS/SQS implementation previously in place, streamlining data handling.

Storage migration, optimizations and rollout

Confident that the application layer was functioning correctly, we proceeded to the final step of the migration: transitioning the data stores (RDS and S3).

The databases were migrated into our AWS production cluster with minimal downtime for users during a pre-scheduled nightly maintenance window. Post-migration, we established robust monitoring and alerting for the newly migrated databases, tracking CPU utilization, free storage, database connections, and long queries through performance insights.

Analyzing a month’s worth of pod memory usage and database metrics allowed us to scale instances down to sizes that matched actual demand, avoiding wasted resources. Moreover, by identifying the most expensive queries, we were able to add indexes that improved performance and reduced cost. We later estimated the actual savings: AWS expenses decreased by 80%, from $100k to $20k annually.
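
The post does not state which rule was used to pick the new sizes. One common approach, sketched below in Go as an assumption rather than as the team's actual method, is to take a high percentile of the observed memory usage and add some headroom. The 95th percentile, the 20% headroom, the nearest-rank percentile definition, and the sample values are all hypothetical choices for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the p-th percentile of samples using the nearest-rank
// method. p must be in 1..100 and samples must be non-empty.
func percentile(samples []int, p int) int {
	sorted := append([]int(nil), samples...) // copy so the caller's slice is untouched
	sort.Ints(sorted)
	rank := (p*len(sorted) + 99) / 100 // ceil(p*n/100) in integer arithmetic
	return sorted[rank-1]
}

// recommendedLimitMiB suggests a memory limit: the 95th percentile of the
// observed usage plus headroomPercent on top.
func recommendedLimitMiB(usageMiB []int, headroomPercent int) int {
	return percentile(usageMiB, 95) * (100 + headroomPercent) / 100
}

func main() {
	// Hypothetical memory samples in MiB, standing in for a month of pod metrics.
	usage := []int{120, 135, 128, 140, 132, 150, 138, 145, 131, 129}
	fmt.Println("p95 usage (MiB):", percentile(usage, 95))
	fmt.Println("suggested limit (MiB):", recommendedLimitMiB(usage, 20))
}
```

A percentile is used instead of the maximum so that a single outlier does not dictate the size, while the headroom leaves room for short spikes.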

With the migrated services already supporting back-office operations, we gradually rolled out the Driver Portal to a larger driver base, moving from beta users to full scale in under a week! Careful traffic analysis on legacy EC2 instances and databases allowed us to safely shut down all old infrastructure, thus completing the transition.

Project Outcome and Future Directions

The migration project not only met its immediate objectives but also brought about significant improvements across the board. By integrating the Driver Portal ecosystem into Gett’s unified R&D infrastructure, we enhanced reliability, availability, scalability, and monitoring. The project enabled streamlined code management, automated deployment processes, and optimized resource utilization, which collectively reduced overhead costs and improved operational efficiency. Additionally, the migration facilitated better monitoring and data integration capabilities, empowering data-driven decision-making and enhancing the customer experience.

As part of the R&D infrastructure, we also enjoy easy access to various other resources, including well-maintained and highly available testing environments, built-in Continuous Integration (CI) pipelines for unit and integration tests, as well as Role-based access control (RBAC) for effective user access management.

One of the critical improvements was the optimization of our global firewall. Initially, the firewall was handling a significant number of inter-VPC sessions across different accounts, which strained its resources. This strain often resulted in database requests taking up to five times longer, potentially slowing down the applications. By eliminating inter-VPC connections, we reduced the overall session load by approximately 85% (from an average of 20k to 3k, as illustrated in the image below). This change significantly improved performance and reliability across our systems.

The successful rollout of the Driver Portal also demonstrated its potential as a platform for innovation and swift implementation, as it was initially intended to be. A couple of months after the completion of the project, during Gett Hackathon 2024, it already served as a stage for exciting new features conceptualized by visionary Getters, showcasing the portal’s potential for rapid yet meaningful development.

As other Gett business units outside of Israel learn about the potential of this renewed product, projects such as the revival of the UK Driver Portal suddenly make sense, and are starting to take shape. This ongoing drive for improvement and consolidation underscores Gett’s commitment to technological excellence.