<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[The Glovo Tech Blog - Medium]]></title>
        <description><![CDATA[Read about our biggest technical challenges and what it’s like to work at Glovo. https://engineering.glovoapp.com/ - Medium]]></description>
        <link>https://medium.com/glovo-engineering?source=rss----9bbfe5be0af5---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>The Glovo Tech Blog - Medium</title>
            <link>https://medium.com/glovo-engineering?source=rss----9bbfe5be0af5---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Fri, 15 May 2026 15:52:24 GMT</lastBuildDate>
        <atom:link href="https://medium.com/feed/glovo-engineering" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Aurora MySQL at Glovo — The Foundation]]></title>
            <link>https://medium.com/glovo-engineering/aurora-mysql-at-glovo-the-foundation-df1d2ca642a7?source=rss----9bbfe5be0af5---4</link>
            <guid isPermaLink="false">https://medium.com/p/df1d2ca642a7</guid>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[aws-aurora]]></category>
            <category><![CDATA[kubernetes-operator]]></category>
            <category><![CDATA[kubernetes]]></category>
            <category><![CDATA[mysql]]></category>
            <dc:creator><![CDATA[Nishaad Ajani]]></dc:creator>
            <pubDate>Mon, 02 Dec 2024 13:46:51 GMT</pubDate>
            <atom:updated>2024-12-02T13:46:51.838Z</atom:updated>
            <content:encoded><![CDATA[<h3>Aurora MySQL at Glovo — The Foundation</h3><p>Let me take you back to a time when managing Aurora MySQL databases at Glovo felt like wrestling with a growing beast. It was 2021, and our small Platform team was juggling a rapidly expanding fleet of databases with tools that, while powerful, were showing their limits. Every new challenge — scaling clusters, rolling out updates, handling upgrades — felt like a mountain we had to climb manually, armed with Terraform, custom scripts and a lot of caffeine.</p><p>We knew there had to be a better way. We dreamed of a system where managing Aurora MySQL clusters didn’t require late-night interventions or painstaking coordination across teams. What if we could build something that just worked — automatically, safely, and at scale? That dream led us down an ambitious path, one where a handful of engineers would build a Kubernetes operator that changed everything.</p><p>This blog series is the story of that journey. It’s about how a small team tackled big problems, transforming database management at Glovo from a tedious manual process into a seamless, automated system. It’s a story of innovation, challenges, and the power of leveraging Kubernetes to not just solve problems but create a foundation for future growth. Join me as we dive into how we built this operator, the impact it had, and what we learned along the way.</p><h3>The Challenge: Growing Pains in a Rapidly Expanding Company</h3><p>Back in early 2021, our database infrastructure at Glovo was manageable — barely. With just a handful of Aurora MySQL clusters, a single Terraform module was enough to keep things running. But as Glovo grew, the cracks in this setup started to show, and what once felt straightforward turned into a maze of complexity.</p><p>It began with <strong>distributed configurations</strong>. Each team owned its own git repository and Terraform workspace, which sounded great for autonomy but quickly turned into a headache. Rolling out a simple update meant tracking down dozens of configurations, hoping nothing broke along the way. It wasn’t long before essential tasks — like scaling, backups, and reboots — became anything but straightforward. These jobs ate up hours of engineering time, and as the number of clusters grew, so did the grind.</p><p>The real pain came with <strong>major version upgrades</strong>. MySQL upgrades are tricky at the best of times, but doing them manually, often late at night to avoid disrupting traffic, was downright brutal. And then there were the inevitable <strong>mishaps</strong> — a misplaced configuration or a poorly reviewed Terraform apply could mean downtime or worse, leaving us scrambling to recover a deleted database cluster.</p><p>As our database fleet ballooned to over 200 clusters, even simple updates became <strong>cumbersome and error-prone</strong>, taking weeks to roll out across all teams. It was clear that the system we’d relied on for so long just wasn’t built to handle this level of growth. We needed a new approach, one that didn’t just patch over the problems but completely rethought how we managed our databases. It was time to scale smarter, not harder.</p><h3>Terraform Scalability Challenges</h3><p>Terraform was our trusty tool for managing infrastructure, but as our needs grew, we started to hit its limits. It’s great for describing the end state you want — “Make it so!” — but not so much for handling the messy in-between. 
Managing Aurora MySQL clusters highlighted these gaps, especially when we tried to scale.</p><p>Take <strong>complex business logic</strong>, for example. Imagine you need to change the instance type of a database cluster, but only during a specific maintenance window. Terraform doesn’t natively support adding that kind of conditional logic. Either you manually intervene at just the right time or lean on AWS features, like maintenance scheduling, when they’re available. And if they aren’t? You’re stuck with manual effort and a bit of hope.</p><p>Then there were the <strong>orchestration challenges</strong>. For operations like scaling or major version upgrades, we often needed multi-step workflows. A task as simple as resizing an instance might involve draining traffic, updating configurations, restarting instances, and checking everything is back online — steps Terraform can’t sequence dynamically. This left us juggling AWS automation tools where possible and writing custom scripts to fill in the gaps.</p><p>Perhaps the trickiest part was <strong>state management</strong>. Terraform’s state file is great for tracking what’s been done but doesn’t handle transitions well. If an instance type change fails during the AWS-scheduled maintenance window, Terraform might think everything’s fine just because the desired state was technically applied. Recovery often meant rolling up our sleeves to manually fix state files — a risky, tedious process.</p><p>It became clear that while Terraform was powerful, it wasn’t designed for the dynamic, time-sensitive workflows that managing Aurora MySQL at scale demanded. We needed something more — something that could handle the transitions, incorporate business logic, and still let us leverage Terraform’s strengths. That’s where our Kubernetes operator came into play.</p><h3>Interim Solutions: Bridging the Gap</h3><p>We knew we couldn’t solve all our challenges overnight, so we introduced several interim measures to ease the growing pains and reduce operational overhead. These stopgap solutions weren’t perfect, but they gave us the breathing room we needed to keep things running as our infrastructure scaled.</p><p>One of the key changes was making better use of existing <a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Concepts.DBMaintenance.html"><strong>maintenance windows</strong></a> on AWS. Instead of leaving them unmanaged, we optimised how we used them, scheduling them during low-traffic hours to reduce risk and improve efficiency. By carefully distributing updates, we minimised the impact of potential issues and ensured non-critical problems could be addressed promptly. This approach wasn’t revolutionary, but it was effective — it preserved high availability and provided a reliable, structured way for teams to make certain changes with greater confidence.</p><p>Another improvement was an internal tool that partially automated <strong>MySQL version upgrades</strong>. 
This tool streamlined a notoriously complex process with a structured workflow:</p><ol><li><strong>Clone Creation</strong>: A new clone of the database was provisioned.</li><li><strong>Upgrade Process</strong>: The clone was upgraded to the target MySQL version.</li><li><strong>Binlog Replication</strong>: Synchronisation was maintained between the old and new clusters.</li><li><strong>Integrity Checks</strong>: Data integrity was validated to catch issues early.</li><li><strong>Traffic Cutover</strong>: With manual approval, traffic was shifted to the upgraded cluster.</li></ol><p>These were just two examples. Across the platform, we worked to streamline other operational tasks and build tools that tackled immediate pain points. From automating routine maintenance to refining monitoring and alerting, we made incremental improvements wherever we could.</p><p>While these measures helped reduce some of the toil and risks, they weren’t enough to address the underlying complexity of managing Aurora MySQL at scale. Each solution felt like a patch on a system that needed a complete rethink. We knew the only way forward was to build a more cohesive and automated approach — one that could handle the scale and complexity of our growing infrastructure. That vision set us on the path to creating our Kubernetes operator.</p><h3>The Turning Point: Introducing a Kubernetes Operator</h3><p>The breakthrough came with the decision to build a <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/operator/"><strong>Kubernetes operator</strong></a> tailored to manage Aurora MySQL clusters. Kubernetes operators extend the Kubernetes API, encapsulating the logic required to automate the lifecycle of complex applications. This approach aligned perfectly with our goals:</p><h4>Why Operators?</h4><ul><li>Automate complex, application-specific tasks (e.g., scaling, backups, upgrades).</li><li>Manage stateful applications (like databases) seamlessly in Kubernetes environments.</li><li>Provide consistent deployment and management across environments.</li><li>Encapsulate domain-specific knowledge, reducing manual interventions.</li></ul><p>For Glovo, this meant transitioning from manual, distributed workflows to a <strong>centralised and automated control plane</strong>, tailored for Aurora MySQL at scale.</p><h4>The First Generation: A Hybrid Approach</h4><p>In its first iteration, our operator leveraged the <strong>existing Terraform module</strong> instead of building everything from scratch or relying on the AWS RDS operator. 
This allowed us to capitalise on the rich, business-critical logic already built into our Terraform setup, including:</p><ul><li><strong>Custom Metrics Collectors</strong>: Automated provisioning of Lambda functions to capture detailed InnoDB table and query-level metrics that went beyond CloudWatch’s default capabilities.</li><li><strong>MySQL Partition Rotation</strong>: Lambda functions to automate the creation and rotation of MySQL <a href="https://dev.mysql.com/doc/refman/8.0/en/partitioning-range.html">range partitions</a>, optimising query performance and storage retention for time-series data.</li><li><strong>Disaster Recovery Readiness</strong>: Support for provisioning <strong>Aurora global clusters</strong>, ensuring a robust setup in our disaster recovery (DR) region.</li></ul><p>However, we designed the architecture to clearly separate <strong>developer responsibilities</strong> from <strong>platform management</strong>, ensuring simplicity and safety.</p><h4>Developer-Centric YAML Configuration</h4><p>Developers interacted with the system via a <strong>minimal YAML configuration</strong> stored directly in their <strong>service repositories</strong>. This specification included only the details they cared about, such as instance size, scaling limits, and partitioned tables. For example:</p><pre>apiVersion: storage.platform.glovoapp.com/v1alpha1<br>kind: AuroraResource<br>metadata:<br>  name: orders-db<br>spec:<br>  version: 5.7.mysql_aurora.2.11.2<br>  instanceClass: db.r6g.large<br>  scaling:<br>    targetCpuUsage: 70<br>    minReaders:     1<br>    maxReaders:     3<br>  parameters:<br>    maxConnections: 200<br>  mysqlPartitionedTables:<br>  - name: my_table<br>    intervalType: &quot;DAY&quot;<br>    intervalFormat: &quot;Snowflake&quot;<br>    retention: 7<br>    buffer: 5</pre><p>This approach allowed product teams to define their database requirements declaratively, abstracting away the complexities of underlying infrastructure.</p><h4>Platform-Controlled Terraform Repository</h4><p>On the platform side, all <strong>Terraform code</strong> was centralised in a dedicated repository managed by the Platform team. This repository contained all the Terraform configurations for Aurora MySQL clusters, which were <strong>automatically generated by the Kubernetes operator</strong> based on the developer-provided YAML specifications.</p><p>The repository served as a standardised and centralised home for all database clusters, replacing the fragmented, team-specific Terraform setups that had been manually maintained before. This approach allowed us to:</p><ul><li><strong>Provision and Update Aurora Clusters</strong>: Automatically translate YAML configurations into Terraform code to handle cluster lifecycle tasks.</li><li><strong>Roll Out Updates Faster</strong>: Updates to our custom metrics collectors and other advanced functionality could be rolled out quickly and transparently to developer teams.</li><li><strong>Enforce Guardrails</strong>: Use automated checks to validate Terraform plans, ensuring safety and consistency.</li></ul><p>All configurations for database clusters were linked to their own <strong>Terraform Cloud workspace</strong>, creating a controlled environment for running plans and applies.
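</p><p>To make this translation concrete, below is a minimal sketch of how an operator could render a developer’s spec into Terraform variables for the central repository. It assumes the spec layout shown above; the field mapping and the render_tfvars helper are illustrative, not Glovo’s actual implementation:</p><pre>import json<br><br>import yaml<br><br><br>def render_tfvars(spec: dict) -&gt; str:<br>    &quot;&quot;&quot;Map a developer-facing AuroraResource spec onto the variables<br>    of a standard Terraform module (field names are illustrative).&quot;&quot;&quot;<br>    tfvars = {<br>        &quot;engine_version&quot;: spec[&quot;version&quot;],<br>        &quot;instance_class&quot;: spec[&quot;instanceClass&quot;],<br>        &quot;min_readers&quot;: spec[&quot;scaling&quot;][&quot;minReaders&quot;],<br>        &quot;max_readers&quot;: spec[&quot;scaling&quot;][&quot;maxReaders&quot;],<br>        &quot;target_cpu_usage&quot;: spec[&quot;scaling&quot;][&quot;targetCpuUsage&quot;],<br>        &quot;db_parameters&quot;: spec.get(&quot;parameters&quot;, {}),<br>    }<br>    return json.dumps(tfvars, indent=2)<br><br><br># The operator would commit the rendered variables to the platform-owned<br># Terraform repository, where a Terraform Cloud workspace picks them up.<br>with open(&quot;orders-db.yaml&quot;) as f:<br>    resource = yaml.safe_load(f)<br>print(render_tfvars(resource[&quot;spec&quot;]))</pre><p>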
The <strong>lift-and-shift</strong> process brought all existing Terraform configurations under a single, standardised structure, ensuring consistency across all database clusters.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*bU-TLk7rm7cRDo67" /></figure><p>This setup completely eliminated the need for product developers to interact directly with Terraform, reducing errors and freeing them to focus on their applications. Instead, their simple YAML configurations drove the entire process, with the operator handling the generation and application of Terraform code behind the scenes.</p><h4><strong>How It Worked</strong></h4><h4><strong>Developer Workflow:</strong></h4><ol><li>Developers updated their database configurations in a minimal <strong>YAML file</strong> located in their service repositories.</li><li>These changes triggered <strong>GitHub Actions</strong>, which synced the YAML to the corresponding <strong>Kubernetes CRD</strong>.</li><li>The operator then took over, orchestrating the necessary Terraform updates in the platform’s central repository and managing the lifecycle of the database cluster.</li><li>Status updates were reported back to the developer repository via <strong>GitHub commit statuses</strong>, providing visibility into the progress and outcome of the changes.</li></ol><h4>Centralised CI/CD Pipeline:</h4><ul><li>The operator translated the developer’s YAML spec into standardised <strong>Terraform configurations</strong> and committed these to the platform’s central Terraform repository.</li><li>Updates were validated in <strong>Terraform Cloud workspaces</strong>, enforcing safety and consistency:</li><li><strong>Sentinel Checks</strong>: Automatically blocked unsafe changes, such as accidental deletions or misconfigurations.</li><li><strong>Automated PR Validation</strong>: Ensured all changes adhered to predefined standards before being merged and applied.</li></ul><h4>Manual Review When Needed:</h4><p>For high-impact changes — such as major version upgrades, provisioning global clusters, or adjusting disaster recovery setups — the system flagged updates for manual review and approval to ensure additional oversight.</p><h4>Safe Application:</h4><p>Once validated, the operator applied the changes via <strong>Terraform Cloud</strong>, ensuring consistent and safe updates across all environments. Developers could monitor the entire process through the commit status updates in their service repository, ensuring transparency without requiring direct interaction with the Terraform workflows.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rEXnSk7yi1ZrrA5suyjTNw.png" /></figure><h3>Challenges Faced Along the Way</h3><p>Building the Kubernetes operator wasn’t without its hurdles. One particularly tricky challenge arose from how Terraform Cloud interacted with GitHub. As we scaled, we ran into significant bottlenecks caused by <strong>GitHub API rate limits</strong>.</p><h4>Here’s what happened:</h4><ul><li><strong>Rate Limiting on GitHub API</strong>: Terraform Cloud frequently updated Git commit statuses to report the state of each workspace. However, as our fleet of Aurora MySQL clusters grew, these calls overwhelmed the GitHub API, triggering rate limits.</li><li><strong>Unintended Consequences</strong>: When Terraform Cloud hit the rate limit, it couldn’t accurately detect which files had changed. 
Instead of running plans only for the affected database, Terraform would trigger plans for <strong>all workspaces</strong> in the central repository. This created a cascade of issues:</li><li>The Terraform apply queue became overwhelmed, blocking changes from other teams.</li><li>With limited Terraform Cloud agents, critical updates were delayed, impacting productivity across multiple projects.</li></ul><h4>The Solution: Smarter Commit Status Updates</h4><p>To address this, we made a critical adjustment:</p><ul><li>We updated the operator to enable <strong>commit status updates</strong> in Terraform Cloud workspaces <strong>only on demand</strong>.</li><li>For any database change, the operator dynamically toggled this setting to ensure that only the affected workspace updated Git commit statuses.</li></ul><p>This adjustment drastically reduced the number of API calls to GitHub, avoiding rate limits and ensuring Terraform Cloud only processed the necessary plans. It also prevented the apply queue from being flooded, allowing teams to work without interference.</p><p>This challenge highlighted the complexities of integrating multiple systems at scale, but it also reinforced the value of automation. With this workaround, we ensured our operator could continue to scale alongside Glovo’s growing infrastructure needs.</p><h3>The Results: A Transformed Landscape</h3><p>The introduction of the Kubernetes operator was a game-changer for how we manage Aurora MySQL clusters at Glovo. What started as a small-scale experiment soon became the backbone of our database infrastructure. Here’s what changed:</p><ul><li><strong>Centralised Control</strong>: Gone were the days of fragmented configurations. Now, everything was unified — one consistent approach to managing all clusters across the platform.</li><li><strong>Reduced Toil</strong>: Routine tasks like Terraform module updates became automated, giving engineers more time to focus on strategic projects that added value.</li><li><strong>Enhanced Safety</strong>: Built-in guardrails, canary releases of new Terraform changes, and automated checks dramatically reduced the risk of human error, ensuring safer deployments and fewer incidents.</li><li><strong>Improved Developer Experience</strong>: With simple YAML files in their service repositories, developers no longer needed to worry about the complexities of Terraform or underlying infrastructure. They could self-service their database needs, boosting productivity and reducing friction.</li></ul><p>This shift didn’t just streamline operations — it reshaped how we think about infrastructure management. The operator turned a complicated, manual process into something that scaled with us, providing reliability and efficiency.</p><h3>Looking Ahead</h3><p>This operator has laid the foundation for a more scalable and efficient infrastructure management system at Glovo. In <strong>Part 2</strong>, we’ll explore how this architecture enabled us to automate one of the most complex and critical tasks: <strong>MySQL version upgrades</strong> — and the advanced features we built to support product teams. 
Stay tuned!</p><hr><p><a href="https://medium.com/glovo-engineering/aurora-mysql-at-glovo-the-foundation-df1d2ca642a7">Aurora MySQL at Glovo — The Foundation</a> was originally published in <a href="https://medium.com/glovo-engineering">The Glovo Tech Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using Airflow in Glovo]]></title>
            <link>https://medium.com/glovo-engineering/using-airflow-in-glovo-6754a2fe79a5?source=rss----9bbfe5be0af5---4</link>
            <guid isPermaLink="false">https://medium.com/p/6754a2fe79a5</guid>
            <category><![CDATA[data]]></category>
            <category><![CDATA[data-orchestration]]></category>
            <category><![CDATA[data-mesh]]></category>
            <category><![CDATA[apache-airflow]]></category>
            <category><![CDATA[glovo]]></category>
            <dc:creator><![CDATA[Pablo Rodríguez Madroño]]></dc:creator>
            <pubDate>Mon, 02 Dec 2024 13:46:29 GMT</pubDate>
            <atom:updated>2024-12-02T13:46:29.516Z</atom:updated>
            <content:encoded><![CDATA[<h3>Using Airflow in Glovo for data orchestration</h3><figure><img alt="Apache Airflow logo." src="https://cdn-images-1.medium.com/max/1024/0*YwROkhgQRCBQXAuW" /><figcaption>Apache Airflow logo</figcaption></figure><h3>Summary</h3><p>In this article we briefly introduce Apache Airflow as a data workflow orchestrator, and we present Glovo’s data strategy based on the Data Mesh paradigm. We then illustrate how Airflow is used in Glovo, and present some customizations that have made a successful implementation of Data Mesh possible. We finally show the evolution towards a declarative approach that truly democratizes data production and usage, making Airflow a cornerstone of Glovo’s data architecture.</p><h3>What is Apache Airflow?</h3><p><a href="https://airflow.apache.org/">Apache Airflow</a> is “an open-source platform for developing, scheduling, and monitoring batch-oriented workflows” [<a href="#1439">1</a>]. It was created by <a href="https://maximebeauchemin.medium.com/">Maxime Beauchemin</a> in 2014 while working at Airbnb to handle increasingly complicated data engineering pipelines. The project joined the <a href="https://www.apache.org/">Apache Software Foundation</a> in 2016, and became a top-level project in 2019, ensuring its future continuity [<a href="#a4a0">2</a>]. Today, it’s probably the most used orchestration tool in the Data Engineering field [<a href="#03d9">3</a>][<a href="#062a">4</a>].</p><p>Orchestration, in the context of Data Engineering, automates the <strong>scheduling of jobs</strong>, and the <strong>sequencing of the steps</strong> required to perform the movement and transformation of data between systems. It is crucial to ensure timeliness and quality of the data to be used in analytics, reporting, modeling or machine learning [<a href="#8b69">5</a>][<a href="#b0b7">6</a>][<a href="#d97c">7</a>].</p><figure><img alt="Sarah Ioannides conducting an orchestra." src="https://cdn-images-1.medium.com/max/620/0*bwivaPefpcbluMpw" /><figcaption>Sarah Ioannides conducting an orchestra. Credit: Izabel.zambrzycki, <a href="https://creativecommons.org/licenses/by-sa/4.0">CC BY-SA 4.0</a>, via <a href="https://commons.wikimedia.org/wiki/File:Sarah_Ioannides-_Conducting.jpg">Wikimedia Commons</a></figcaption></figure><p>Airflow organizes the data pipelines or workflows in so-called <a href="https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html">DAGs</a> (Directed Acyclic Graphs):</p><ul><li>The different jobs to be performed are represented as nodes in a graph, which are called <strong>tasks</strong> in Airflow. Tasks are instances of <strong>operators</strong> that perform a certain type of work (for example, reading from a database or writing to a file).</li><li>The relationships between the tasks are reflected as arcs connecting the nodes, and they are called <strong>dependencies</strong> in Airflow. These relationships between tasks are directed: a certain task needs to be executed after one or more other tasks.</li><li>Being acyclic means the graph contains no cycles: following the dependencies, execution can never loop back from a task into one that runs earlier in the workflow.</li></ul><p>DAGs are not exclusive to Airflow, and there are <a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph">many applications</a> of this data structure.</p>
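<p>For readers new to Airflow, a minimal DAG makes these concepts tangible. The sketch below (standard Airflow 2.x syntax) defines two tasks and a single dependency, so that “extract” always runs before “load”:</p><pre>from datetime import datetime<br><br>from airflow import DAG<br>from airflow.operators.bash import BashOperator<br><br># Two tasks connected by one dependency: extract runs before load.<br>with DAG(<br>    dag_id=&quot;minimal_example&quot;,<br>    start_date=datetime(2024, 1, 1),<br>    schedule_interval=&quot;@daily&quot;,<br>) as dag:<br>    extract = BashOperator(task_id=&quot;extract&quot;, bash_command=&quot;echo extracting&quot;)<br>    load = BashOperator(task_id=&quot;load&quot;, bash_command=&quot;echo loading&quot;)<br>    extract &gt;&gt; load</pre>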
src="https://cdn-images-1.medium.com/max/480/1*azAz7BrFAyFgA9vJTqiIiQ.png" /><figcaption>Example of a DAG</figcaption></figure><p>These properties allow Airflow to implement the “sequencing of the steps” component of orchestration by ensuring that:</p><ul><li>There is a clear beginning of the execution of the tasks.</li><li>There is a clear path forward when each task is completed.</li><li>There is a clear ending of the execution of the tasks.</li><li>Eventually all the tasks will be completed.</li></ul><p>Additionally, Airflow implements the “scheduling of jobs” component of orchestration by allowing a <a href="https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/cron.html">cron-like expression</a> in the DAGs, and determining at which moment in time a DAG needs to be run. More complicated running configurations can be set through <a href="https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/timetable.html">timetables</a>, which do not need to be periodic in time.</p><p>Dependencies between DAGs can be handled in several ways:</p><ul><li>The dependent DAG’s schedule could be set up so that it starts after the dependencies have normally completed execution. However, this setup is vulnerable to errors or delays, as there is no way to verify whether the dependencies have effectively run.</li><li>It is also possible to trigger a DAG run from another DAG through the <a href="https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/trigger_dagrun/index.html">TriggerDagRun operator</a>.</li><li>Airflow’s recommended way is to use the <a href="https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/external_task_sensor.html">ExternalTaskSensor operator</a> in the dependent DAG to check for the dependencies to be completed.</li></ul><h3>Glovo’s data strategy</h3><p>Three years ago, data at Glovo was in distress. The growth of the business implied also a growth in the usage of data for informed decision-making, but the infrastructure and the organization supporting the increased usage could not scale more. There was no way to make our centralized Data Warehouse larger or more powerful. Outages were frequent, recovery times were in the order of days, and there was no clear technical, operational or business owner for many of the data processes.</p><p>Glovo decided to <a href="https://medium.com/glovo-engineering/data-mesh-its-not-just-about-tech-it-s-about-ownership-and-communication-b8f70a61affe">migrate</a> this centralized approach to a <a href="https://martinfowler.com/articles/data-monolith-to-mesh.html">Data Mesh organization</a>. After more than two years of work, the decentralization strategy has been a resounding success, and we have been able to turn off our gigantic Data Warehouse. The four pillars of Data Mesh have been crucial to achieve this level of success [<a href="#3f3c">8</a>]:</p><figure><img alt="Logical architecture of the Data Mesh approach, showing the four pillars, from https://martinfowler.com/articles/data-mesh-principles.html." 
src="https://cdn-images-1.medium.com/max/1000/0*xHoAFybkAZSGojWA" /><figcaption>Logical architecture of the Data Mesh approach, showing the four pillars, from <a href="https://martinfowler.com/articles/data-mesh-principles.html">https://martinfowler.com/articles/data-mesh-principles.html</a></figcaption></figure><h4>Domain ownership</h4><p>In the same way as Engineering teams have achieved decentralization by embracing <a href="https://www.domainlanguage.com/ddd/">domain-driven design</a> and adopting the <a href="https://www.agileanalytics.cloud/blog/team-topologies-the-reverse-conway-manoeuvre">Reverse Conway Maneuver</a>, Data teams need to be arranged into separate business domains. Each domain covers a <a href="https://martinfowler.com/articles/data-monolith-to-mesh.html#ArchitecturalFailureModes">bounded context</a>, for which the data team has full ownership and is fully accountable. Each domain data team decides which data to expose to other domains, while keeping the implementation details internally.</p><h4>Data as a product</h4><p>In Data Mesh, a data product is the smallest architectural block that can be deployed as a cohesive unit, and is the result of applying product thinking to domain-oriented data. It is composed of the code needed to perform the required transformations, the data resulting from those transformations, the metadata that identifies the product and the outputs, and the infrastructure required to run the previous elements.</p><p>A data product in Glovo represents a set of tables designed to fulfill the same use cases / user needs, with the same timeliness, loading frequency and criticality requirements. These tables can be exposed to users via multiple interfaces such as other data products, query engines, BI tools or others. There is no way to produce a data table that is outside of the data product enclosure.</p><h4>Self-serve data platform</h4><p>Data domain teams can autonomously own their data products by having access to a data platform that provides a higher level of abstraction than the direct management level. This platform removes the complexity and friction of provisioning and managing the lifecycle of data products by providing simple declarative interfaces, and implementing the cross-cutting concerns that are defined as a set of standards and global conventions across the organization. The self-serve data platform also includes capabilities to lower the cost and specialization needed to build data products.</p><h4>Federated computational governance</h4><p>Data Mesh follows a distributed system architecture: a collection of separate data products, each with independent lifecycles, built and deployed by autonomous data domain teams. However, to get greater value, these independent data products need to interoperate. which is possible through a governance model that embraces decentralization and domain self-sovereignty, global standardization, a dynamic topology, and, most importantly, automated execution of decisions by the data platform.</p><h3>Usage of Airflow in Glovo</h3><p>Airflow honors its orchestrator role by acting as the central piece in the computation of Data Products in Glovo, as illustrated below:</p><figure><img alt="Airflow as an orchestrator for Data Products." 
src="https://cdn-images-1.medium.com/max/1024/1*__AJ2f1grl43TmxyrPG7tA.png" /><figcaption>Airflow as an orchestrator for Data Products</figcaption></figure><p>Each Data Product unit includes at least one Airflow DAG for periodic computation, although in many cases there are additional DAGs for a variety of purposes:</p><ul><li>Running some transformations that are different from the main ones: they have a different periodicity or temporality, a different intent or even for splitting outputs with different criticality.</li><li>Performing initial data loads.</li><li>Backfilling data.</li><li>Doing auxiliary operations such as table definition changes or deletions.</li></ul><p>Regardless of their purpose, all the Data Product DAGs have the same general structure, although some of the blocks may not always be present:</p><figure><img alt="General structure of a DAG in Glovo." src="https://cdn-images-1.medium.com/max/977/0*9Myhmm8aPFwGsvs-" /><figcaption>General structure of a DAG in Glovo</figcaption></figure><h4>DagFactory to simplify DAG creation</h4><p>This general structure has led us to build an internal package to abstract and simplify the definition of DAGs. Writing convoluted Python code defining a workflow is replaced by a much simpler file setting up a DAG configuration. We call this module <strong>DagFactory</strong>, much inspired by the dag-factory package by Astronomer [<a href="#3eb3">9</a>] and, to a lesser degree, by the airflow-declarative project [<a href="#6064">10</a>].</p><p>Below there is an example of how a DAG is defined in DagFactory:</p><pre>from datetime import datetime<br>from datetime import timedelta<br>from pathlib import Path<br><br><br>from data_pipeline_tools.airflow.dag_factory.dag_factory import DagFactory<br><br><br>dag_configuration = {<br>   &quot;dag_name&quot;: &quot;my_first_dag_factory_dag&quot;,<br>   &quot;image_version&quot;: &quot;0.2.22&quot;,<br>   &quot;dags_path&quot;: str(Path(__file__).parent.resolve()),<br>   &quot;process_name&quot;: &quot;calculate_odps_courier_distances&quot;,<br>   &quot;domain&quot;: &quot;courier&quot;,<br>   &quot;data_product_name&quot;: &quot;order_flow&quot;,<br>   &quot;data_product_key_prefix&quot;: &quot;OF&quot;,<br>   &quot;schedule_interval&quot;: &quot;30 5 * * *&quot;,<br>   &quot;default_args&quot;: {<br>       &quot;owner&quot;: &quot;Operations Data Engineering&quot;,<br>       &quot;description&quot;: &quot;A set of ODPs covering order-level and city-day KPIs related to courier distances of Orders.&quot;,<br>       &quot;start_date&quot;: datetime(2021, 11, 12),<br>       &quot;retries&quot;: 2,<br>       &quot;email_on_failure&quot;: False,<br>       &quot;email_on_retry&quot;: False,<br>       &quot;retry_delay&quot;: timedelta(minutes=5),<br>       &quot;depends_on_past&quot;: False,<br>       &quot;max_active_runs_per_dag&quot;: 1,<br>   },<br>   &quot;runtime_date_local&quot;: &quot;&#39;2022-01-08&#39;&quot;,<br>   &quot;runtime_date_dev&quot;: &quot;&#39;2022-04-15&#39;&quot;,<br>   &quot;num_days_local&quot;: 8,<br>   &quot;num_days_default&quot;: 30,<br>   &quot;slack_channel_prod&quot;: &quot;data-mesh-monitors-courier&quot;,<br>   &quot;slack_channel_dev&quot;: &quot;data-mesh-monitors-courier-dev&quot;,<br>   &quot;slack_conn_id&quot;: &quot;slack_webhook&quot;,<br>   &quot;script_module&quot;: &quot;order_flow.jobs.courier_order_flow_job&quot;,<br>   &quot;jobs&quot;: {<br>       &quot;courier_distances_points_intermediate&quot;: [],<br>       
       &quot;courier_distances_order_level_attributes&quot;: [<br>           &quot;courier_distances_points_intermediate&quot;<br>       ],<br>   },<br>   &quot;sensor_specs&quot;: {<br>       &quot;ORDER_DESCRIPTORS_ORDER_DESCRIPTORS&quot;: {<br>           &quot;checkpointer_path&quot;: &quot;COD__DAG_CHECKPOINT_PATH&quot;,<br>           &quot;domain&quot;: &quot;central&quot;,<br>           &quot;product&quot;: &quot;central_order_descriptors&quot;,<br>           &quot;task&quot;: &quot;order_descriptors&quot;,<br>           &quot;freshness_hours&quot;: 8,<br>           &quot;timeout_hours&quot;: 8,<br>           &quot;mode&quot;: &quot;reschedule&quot;,<br>           &quot;poke_interval&quot;: 300,<br>       },<br>   },<br>}<br><br><br># Magic words, DO NOT MODIFY<br>airflow_dag_factory = DagFactory(**dag_configuration)<br>globals()[dag_configuration[&quot;dag_name&quot;]] = airflow_dag_factory.create_pyspark_dag()</pre><p>Although this seems quite a simple DAG, in reality it is formed by 9 tasks (not counting groups). There is a stark difference from the code required by a standard Airflow DAG definition: the DagFactory script is much shorter, reducing the cognitive load required to understand the structure, and lowering the possibility of introducing errors.</p><p>As a parallel benefit, this package has brought a high level of standardization in the definition and operation of the Data Products. Before DagFactory, the DAGs defined by the different domains, and even the ones in the same domain, had significant differences in the grouping of tasks, naming, parameterisation, and other aspects. This made operating the DAGs quite dangerous, as mistakes were relatively easy to make, even leading to accidental deletion of data. After implementing DagFactory, all the workflows behave in the same way, and anyone can operate them with confidence that no unexpected side effects will occur.</p><p>Another benefit of the standardization is that the blocks of tasks forming the general structure of the DAGs have the same name across all of the data pipelines. In particular, the block of transformations always ends with a “transformations_end” task. This has been crucial to facilitate the creation of early alerts for DAG failures in observability tools using only <a href="https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html#metric-descriptions">standard Airflow metrics</a>.</p><figure><img alt="The “transformations_end” task is always present at the end of the transformations group in the general structure of the DAGs." src="https://cdn-images-1.medium.com/max/1024/0*D-CypZJ7_i25Tm3K" /><figcaption>The “transformations_end” task is always present at the end of the transformations group in the general structure of the DAGs</figcaption></figure><p>In summary, DagFactory has been an accelerator for the Data Engineers tasked with creating Data Products.</p><h4>Checkpoints and sensors to manage DAG dependencies</h4><p>Another component that has been developed in Glovo is an alternative way to ensure that the data dependencies for the transformations contained in a DAG are met before starting to process them:</p><ul><li>In every DAG, a custom operator writes a small JSON file indicating the time it has executed. 
We call these files checkpoints, and the operator is named CheckpointerOperator (a sketch follows below).</li><li>Each dependent DAG can use custom <a href="https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/sensors.html">sensor</a> operators that understand the previous file format, and are able to determine whether the execution can take place or should be kept on hold while the dependencies finish. We call this sensor operator CheckpointSensor.</li></ul><pre>{<br>  &quot;data&quot;:{<br>    &quot;updated_at&quot;: &quot;2024-09-24T08:07:51.058120+00:00&quot;,<br>    &quot;backfilling&quot;: false<br>  }<br>}</pre><p>These custom checkpoints allow greater flexibility than the standard solutions, as they abstract out the Data Product contents from the DAG that generates them. That is, they work at the table level, and free the owners of a Data Product to define how it is computed without affecting their consumers. As a consequence, they favor the separation into domains that is key to our data strategy.</p><p>In the general structure of a DAG we saw how the sensors are the first set of tasks to be run. As for the checkpoints, they are created as part of the transformations group of tasks. This group is formed by chaining together different transformation blocks, each of them composed of three stages:</p><ul><li>The computation of the transformation, either a <a href="https://spark.apache.org/docs/latest/api/python/index.html">PySpark</a> step or a set of <a href="https://www.getdbt.com/">dbt</a> models.</li><li>A data quality assessment of the transformed data.</li><li>The creation of a checkpoint file.</li></ul><p>If the transformation computation or the data quality assessment fails, then the checkpoint is not generated, and downstream users are not signaled that a particular Data Product output is ready for consumption.</p><figure><img alt="The three steps of a data transformation block." src="https://cdn-images-1.medium.com/max/505/0*nUrMhy4tLelq4Eka" /><figcaption>The three steps of a data transformation block</figcaption></figure><p>Different 3-step transformation blocks can be linked according to their dependencies so that the overall process is performed in the right order. Also, transformations computing more than a single output can be split into parallel blocks to allow more granular control of the checkpoints. In this case, some transformation blocks may have failed, but checkpoints would be generated for the successful blocks. This pattern increases the robustness of the Data Mesh, as subsequent Data Product DAGs dependent on the successful outputs can start their updates.</p><figure><img alt="Example of chained transformation blocks to account for dependencies." src="https://cdn-images-1.medium.com/max/787/0*WNsCyqeWt1Hh9XC9" /><figcaption>Example of chained transformation blocks to account for dependencies</figcaption></figure><figure><img alt="Example of split transformation blocks to increase robustness." src="https://cdn-images-1.medium.com/max/460/0*v78Gdcbyf3UQFO5C" /><figcaption>Example of split transformation blocks to increase robustness</figcaption></figure>
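<p>To make the checkpointing mechanism concrete, here is a minimal sketch of what such a CheckpointerOperator could look like, assuming checkpoints are stored on S3 and using the JSON format shown above. The real operator at Glovo is more elaborate, and the path handling here is illustrative:</p><pre>import json<br>from datetime import datetime, timezone<br><br>from airflow.models import BaseOperator<br>from airflow.providers.amazon.aws.hooks.s3 import S3Hook<br><br><br>class CheckpointerOperator(BaseOperator):<br>    &quot;&quot;&quot;Writes a checkpoint JSON file signalling that an output is ready.&quot;&quot;&quot;<br><br>    def __init__(self, checkpoint_path: str, backfilling: bool = False, **kwargs):<br>        super().__init__(**kwargs)<br>        self.checkpoint_path = checkpoint_path  # e.g. &quot;s3://bucket/checkpoints/table.json&quot;<br>        self.backfilling = backfilling<br><br>    def execute(self, context):<br>        payload = {<br>            &quot;data&quot;: {<br>                &quot;updated_at&quot;: datetime.now(timezone.utc).isoformat(),<br>                &quot;backfilling&quot;: self.backfilling,<br>            }<br>        }<br>        bucket, key = self.checkpoint_path.replace(&quot;s3://&quot;, &quot;&quot;).split(&quot;/&quot;, 1)<br>        # Overwrite the previous checkpoint so consumers always see the latest run.<br>        S3Hook().load_string(json.dumps(payload), key=key, bucket_name=bucket, replace=True)</pre>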
src="https://cdn-images-1.medium.com/max/460/0*v78Gdcbyf3UQFO5C" /><figcaption>Example of split transformation blocks to increase robustness</figcaption></figure><p>Checkpoints along with transformation splitting have increased the overall availability of data, which would be seriously impaired if checkpoints worked only at the Data Product level.</p><h3>The next iteration: more democratization and autonomy</h3><p>The first implementation of Data Mesh has been so successful that Glovo has developed an even simpler approach based on purely declarative interfaces, as was publicly introduced in our <a href="https://www.meetup.com/es-ES/barcelona-tech-talks/events/298482872/">Data Experts’ RoundTable Meetup</a> some months ago. The main advantage of this approach is a higher abstraction layer over the way of defining data transformations and the underlying infrastructure that runs them. As a consequence, the technical skills and the cognitive load needed to build a data product are highly reduced. This has produced a powerful democratization of data, and a surge in the value of the Data Mesh (which is based on the number of meaningful relationships among data products, not on the number of data products <em>per se </em>[<a href="#ef1a">11</a>]).</p><p>The comparison below shows the differences between the first and the second approaches to Data Mesh:</p><ul><li>In Data Mesh v1 the <em>data product creators</em> were only Data Engineers, whereas in the Declarative Data Mesh they can be anyone with SQL and basic coding skills.</li><li>In Data Mesh v1 the <em>infrastructure management </em>was distributed and handled by each data domain, whereas in the Declarative Data Mesh it belongs to a centralized data platform.</li><li>In Data Mesh v1 the <em>computing engine availability </em>was limited and created on demand, whereas in the Declarative Data Mesh it is always on.</li><li>In Data Mesh v1 the <em>orchestration </em>was managed by each domain, whereas in the Declarative Data Mesh it belongs to a centralized data platform.</li><li>In Data Mesh v1 there was no <em>golden path and standardized structure</em>, whereas in the Declarative Data Mesh it is well-defined and enforced.</li><li>In Data Mesh v1 the <em>maintainability </em>was low due to the lack of standardized structure, whereas in the Declarative Data Mesh it is high due to the centralization of infrastructure and orchestration.</li><li>In Data Mesh v1 the <em>cost efficiency </em>was low and managed by each domain, whereas in the Declarative Data Mesh it is high and centrally managed.</li><li>In Data Mesh v1 the <em>technical complexity </em>and the <em>testability </em>were high, whereas in the Declarative Data Mesh they are low.</li><li>In Data Mesh v1 the <em>time to develop a data product </em>was in the order of hours to weeks, whereas in the Declarative Data Mesh it is in the order or minutes to hours.</li></ul><p>Airflow plays a crucial role in Glovo’s data platform. Each declarative Data Product is mapped to a DAG in a centralized Airflow instance, which is one of the main visible interfaces of the data platform (the other being the query/computing engine). Owners have full capacity to operate the DAGs of their data products: clear tasks or DAG runs, marking them as successes or failures, trigger manual runs, and enable or disable entire DAGs. 
This capacity goes in line with the “domain ownership” and the “self-serve data platform” principles of Data Mesh: owners cannot ensure timeliness and quality of the data they are responsible for if they are not able to operate their data products effectively. <a href="https://www.goodreads.com/quotes/8630521-with-great-power-comes-great-responsibility-it-is-true-but">A great responsibility needs to bring great power along</a>.</p><h4>Declarative DAG definition</h4><p>In the new approach, DAGs are automatically created from a declarative definition of the tasks to be executed. Only an internal Python package defining an SDK for Data Product creation is required to start building pipelines, as illustrated in the following code fragment:</p><pre>from glovo_data_platform.declarative.manager import DataProductManager<br>from glovo_data_platform.declarative.utils import print_deployment_info<br><br><br>def data_product_definition() -&gt; DataProductManager:<br>   data_product_manager = DataProductManager(<br>       domain=&quot;growth&quot;,<br>       name=&quot;sample_ddp_scripting&quot;,<br>       owner=&quot;pablo.rodriguez@glovoapp.com&quot;,<br>       tier=&quot;t2&quot;,<br>       contacts=[<br>           {&quot;kind&quot;: &quot;email&quot;, &quot;value&quot;: &quot;ga.eng@glovoapp.com&quot;},<br>           {&quot;kind&quot;: &quot;email&quot;, &quot;value&quot;: &quot;pablo.rodriguez@glovoapp.com&quot;},<br>       ],<br>   )<br><br><br>   data_product_manager.add_sql_transformation(<br>       data_classification=&quot;l0&quot;,<br>       sql=&quot;&quot;&quot;SELECT<br>               gsc_date,<br>               COUNT(1) as cnt_records<br>           FROM<br>               &quot;delta&quot;.&quot;growth_master_attribution_odp&quot;.&quot;google_search_console&quot;<br>           GROUP BY gsc_date&quot;&quot;&quot;,<br>       partition_by=[],<br>       target_table=&quot;summary_gsc&quot;,<br>       write_mode=&quot;FULL&quot;,<br>       is_odp=True,<br>   )<br>   return data_product_manager<br><br><br>if __name__ == &quot;__main__&quot;:<br>   schedule = None<br>   publish = False<br>   creation_reason = &quot;Testing creation of T2 DDPs through scripting.&quot;<br>   revision_name = None<br>   data_product_manager = data_product_definition()<br>   revision = data_product_manager.submit(<br>       schedule=schedule,<br>       publish=publish,<br>       creation_reason=creation_reason,<br>       revision_name=revision_name,<br>   )<br>   print_deployment_info(revision)</pre><p>The path from this code to an Airflow DAG is not straightforward, although this complexity is hidden from the Data Product creator. The SDK first encodes all the definitions in a JSON DTO (Data Transfer Object). Secondly, the SDK invokes an internal API to deploy the Data Product. Finally, the SDK communicates the result of the deployment to the Data Product developer.</p><p>The internal API is in reality the gateway to the <strong>Meshub</strong> system. Meshub stores the defining characteristics of the Data Products, manages their lifecycle, and provides information for Data Mesh Governance purposes, among several other functions. When deploying a Data Product, the appropriate lifecycle management methods of Meshub validate the encoded Data Product definition, attach the relevant parameters of the different Airflow operators to be used, and copy the modified JSON DTO into a Python file ready for Airflow to parse as a DAG. 
The following code block shows the file generated from the sample declarative Data Product code illustrated above.</p><pre>from glovo_data_platform.declarative_airflow.dag_creator import build_dags<br><br>REVISION_DEPLOY_SPEC_JSON = r&quot;&quot;&quot;<br>{<br>  &quot;revision&quot;:{<br>    &quot;revision_id&quot;:&quot;2af464bc-96f5-4443-b73f-f0507a61fdde&quot;,<br>    &quot;data_product&quot;:{<br>      &quot;domain&quot;:&quot;growth&quot;,<br>      &quot;name&quot;:&quot;sample_ddp_scripting&quot;,<br>...<br>}<br>&quot;&quot;&quot;<br><br>globals().update(build_dags(REVISION_DEPLOY_SPEC_JSON, __file__))</pre><p>The conversion of the file to an actual Airflow DAG is performed by a custom package at every <a href="https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/dagfile-processing.html">file processing</a> cycle. This step translates the JSON DTO into Airflow operators, relationships, and parameters. Additional operators for execution control are also added. The following DAG is the result of processing the Python file shown above:</p><figure><img alt="Airflow DAG for a simple declarative Data Product." src="https://cdn-images-1.medium.com/max/720/0*g1KWpGdrLcBUGBzg" /><figcaption>Airflow DAG for a simple declarative Data Product</figcaption></figure><p>The whole process is summarized in the next diagram:</p><figure><img alt="Steps to deploy a declarative definition of a Data Product as an Airflow DAG." src="https://cdn-images-1.medium.com/max/200/1*12KVjMq8aTr0y6ga_N3SpQ.png" /><figcaption>Steps to deploy a declarative definition of a Data Product as an Airflow DAG</figcaption></figure><h4>Checkpoints and sensors to manage DAG dependencies</h4><p>The way to handle DAG dependencies has evolved towards a more orchestrator-independent solution, in exchange for more cloud-dependent components. In particular, the checkpointing subsystem leverages several services from AWS, Glovo’s cloud provider. Equivalent components can be found in other cloud providers.</p><p>The following diagram shows this process schematically:</p><figure><img alt="Architecture of declarative checkpoints." src="https://cdn-images-1.medium.com/max/200/1*X5JdPZZE3wQJ2CuL2SwMUQ.png" /><figcaption>Architecture of declarative checkpoints</figcaption></figure><p>As Data Product outputs are ultimately daily-partitioned Delta Lake files written in AWS S3, the checkpointing system is notified whenever a Delta changelog file is added. This triggers a Lambda function that processes the changelog and extracts the paths of the modified partitions. These paths are then checked in Glue to get the database and the table names. Finally, the information about which table and partitions have been modified is recorded in a DynamoDB table for later usage.</p><figure><img alt="Contents of the DynamoDB table for checkpoints." src="https://cdn-images-1.medium.com/max/1024/0*z-pafnWz6p924Bfv" /><figcaption>Contents of the DynamoDB table for checkpoints</figcaption></figure><p>Downstream Data Products can check whether their dependencies have completed their processes through a custom CheckpointSensor operator:</p><figure><img alt="Transformation task with a sensor checking a dependency." src="https://cdn-images-1.medium.com/max/499/0*QiZ-JFXMuP_aA0VA" /><figcaption>Transformation task with a sensor checking a dependency</figcaption></figure><p>The custom CheckpointSensor operator queries DynamoDB for the existence of a partition of a particular table. 
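</p><p>As an illustration, a sketch of such a sensor could look like the following; the DynamoDB table name and key schema here are assumptions for the example, not the production ones:</p><pre>import boto3<br><br>from airflow.sensors.base import BaseSensorOperator<br><br><br>class CheckpointSensor(BaseSensorOperator):<br>    &quot;&quot;&quot;Waits until a table partition is registered in the checkpoint store.&quot;&quot;&quot;<br><br>    def __init__(self, table: str, partition: str, **kwargs):<br>        super().__init__(**kwargs)<br>        self.table = table<br>        self.partition = partition<br><br>    def poke(self, context) -&gt; bool:<br>        # One item per (table, partition) pair is assumed in the checkpoint table.<br>        response = boto3.client(&quot;dynamodb&quot;).get_item(<br>            TableName=&quot;data_mesh_checkpoints&quot;,<br>            Key={<br>                &quot;table_name&quot;: {&quot;S&quot;: self.table},<br>                &quot;partition&quot;: {&quot;S&quot;: self.partition},<br>            },<br>        )<br>        return &quot;Item&quot; in response</pre><p>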
An interplay between <a href="https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html">Airflow macros</a> and partition names allows checking whether the daily data required for a given execution is ready or not:</p><pre>wait_for_order_descriptors_v2 = data_product_manager.add_wait_for_table(<br>    domain=&quot;central&quot;,<br>    data_product=&quot;order_descriptors&quot;,<br>    table=&quot;order_descriptors_v2&quot;,<br>    partitions=[&quot;p_creation_date={{ data_interval_start | ds }}&quot;],<br>)</pre><h3>Conclusion: the role of Airflow in Glovo’s Data Mesh</h3><p>Whether in the first implementation of Data Mesh or in the declarative approach, Airflow is a cornerstone of Glovo’s data architecture. Going beyond the already powerful features of Airflow, Glovo has implemented improved components to handle dependencies between the DAGs that orchestrate the computation of Data Products. Also, abstractions to simplify the definition of DAGs have been designed in order to reduce the cognitive load to build Data Products.</p><p>Airflow is the main interface to inspect and operate the computation of most of the Data Products that compose Glovo’s Data Mesh. Understanding how Airflow works is crucial, as anyone with basic coding skills is now able to create Data Products, bringing a true democratization of data transformation and usage across the company.</p><h3>References</h3><p>[1] <a href="#54b6">^</a> <a href="https://airflow.apache.org/docs/apache-airflow/stable/index.html">https://airflow.apache.org/docs/apache-airflow/stable/index.html</a></p><p>[2] <a href="#54b6">^</a> <a href="https://airflow.apache.org/docs/apache-airflow/stable/project.html#history">https://airflow.apache.org/docs/apache-airflow/stable/project.html#history</a></p><p>[3] <a href="#54b6">^</a> <a href="https://gradientflow.com/wp-content/uploads/2022/06/GradientFlow-2022-Workflow-Orchestration-Report.pdf">https://gradientflow.com/wp-content/uploads/2022/06/GradientFlow-2022-Workflow-Orchestration-Report.pdf</a></p><p>[4] <a href="#54b6">^</a> <a href="https://6sense.com/tech/workflow-automation">https://6sense.com/tech/workflow-automation</a></p><p>[5] <a href="#c952">^</a> <a href="https://en.wikipedia.org/wiki/Orchestration_(computing)">https://en.wikipedia.org/wiki/Orchestration_(computing)</a></p><p>[6] <a href="#c952">^</a> <a href="https://www.reddit.com/r/dataengineering/comments/uvckp1/can_someone_please_explain_orchestration_and_why/">https://www.reddit.com/r/dataengineering/comments/uvckp1/can_someone_please_explain_orchestration_and_why/</a></p><p>[7] <a href="#c952">^</a> <a href="https://www.ascend.io/blog/what-is-data-pipeline-orchestration-and-why-you-need-it/">https://www.ascend.io/blog/what-is-data-pipeline-orchestration-and-why-you-need-it/</a></p><p>[8] <a href="#f3a0">^</a> <a href="https://martinfowler.com/articles/data-mesh-principles.html">https://martinfowler.com/articles/data-mesh-principles.html</a></p><p>[9] <a href="#e9d9">^</a> <a href="https://github.com/astronomer/dag-factory">https://github.com/astronomer/dag-factory</a></p><p>[10] <a href="#e9d9">^</a> <a href="https://github.com/rambler-digital-solutions/airflow-declarative">https://github.com/rambler-digital-solutions/airflow-declarative</a></p><p>[11] <a href="#f961">^</a> Dehghani, Zhamak. Data Mesh: Delivering data-driven value at scale. O’Reilly Media, Inc. March 2022. 
ISBN: 9781492092391.</p><hr><p><a href="https://medium.com/glovo-engineering/using-airflow-in-glovo-6754a2fe79a5">Using Airflow in Glovo</a> was originally published in <a href="https://medium.com/glovo-engineering">The Glovo Tech Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How We Engineered a Scalable Architecture to Power Videos, Social, and Picks in Our Delivery App]]></title>
            <link>https://medium.com/glovo-engineering/how-we-engineered-a-scalable-architecture-to-power-videos-social-and-picks-in-our-delivery-app-1e0b7f7dfdca?source=rss----9bbfe5be0af5---4</link>
            <guid isPermaLink="false">https://medium.com/p/1e0b7f7dfdca</guid>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[innovation]]></category>
            <dc:creator><![CDATA[Glovo Technology]]></dc:creator>
            <pubDate>Fri, 29 Nov 2024 10:32:56 GMT</pubDate>
            <atom:updated>2024-11-29T10:28:32.114Z</atom:updated>
            <content:encoded><![CDATA[<p>One month ago, we launched three new features in Glovo: Videos, Social, and Picks. These features give users a more engaging, social, and personalized browsing experience. We aimed to introduce new ways for users to explore, connect, and save their favorite stores with minimal friction. The main surface for these features was the Store Wall, a screen within our app where users can see a list of stores or products for a specific category.</p><p>We wanted to deliver these features within a strict 90-day timeline, and this was possible only because we had an architecture built for high modularity, scalability, and rapid iteration. This post explores the architecture we built a year ago — powered by a plugin-based, server-driven design with a Backend-for-Frontend (BFF) service layer. We will explain how this approach allowed more than ten teams to work in parallel, coordinate complex dependencies, and maintain stability for a seamless user experience.</p><h3>Engineering Videos, Social, and Picks</h3><h4>1. Videos: Enhancing User Engagement with Dynamic Content</h4><p>Adding Videos to the Store Wall meant handling high-resolution media while maintaining performance. The goal was to add a new way of discovering products through videos; to achieve this, a new video carousel was added to the Store Wall.</p><p>This new feature delegates the domain logic to a specific microservice in charge of customer content discovery. This service encapsulates the logic for calling multiple microservices across different domains to retrieve videos available at the user’s location, filtering to keep only those related to available products, and enriching them with relevant information.</p><h4>2. Social: Integrating Friend Recommendations and Social Proof</h4><p>The Social feature leverages user networks by recommending products your friends often order. This involved connecting data streams from social profiles, orders, and product ratings.</p><p>Like Videos, this feature delegates the domain logic to a specific service in charge of customer content discovery, encapsulating all the logic of getting the product recommendations (provided by Data models), filtering by store and product availability, and enriching them with all relevant information.</p><h4>3. Picks: A Modular Approach to Personalized Curation</h4><p>A <em>pick</em> is a list of stores that the customer decides to group. It is similar to creating a playlist in a music app. Picks allow users to organize their favorite stores, adding personalization to the Store Wall. This introduced specific requirements, such as modularity for different types of stores and future support for sharing and social integration.</p><p>In the Store Wall, we provide quick and easy access to the user’s picks and favorites. For this, the Store Wall delegates to the Picks microservice the logic of getting the users’ picks, filtering by store availability, and enriching with relevant information.</p><h3>Evolution of the Store Wall with a Plugin Architecture and Server-Driven UIs: Building for Flexibility and Scalability</h3><p>We redesigned the Store Wall screen a year ago to use an architecture that enables dynamic and personalized store walls powered by templates. We needed to deliver high-performing, category-specific experiences without overloading the client application. 
Here’s how we structured it:</p><ul><li><strong>Templates</strong>: A Template is a group of components/features (internally known as Modules) inside a screen, each with its specific position. You can imagine this as how the visual components on your screen will be presented. These components come partly from configuration, which lets the business inject special features, and partly from Machine Learning (ML) models that target a better experience and higher adoption for each user. <br>Technically speaking, each Template provides a list of independent modules to execute, together with configuration details for rendering a specific Store Wall. For instance, the “Food” category has a different layout from “Retail,” with each view receiving specific modules in pre-defined positions.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/312/0*-CB6WCPRw2rkk0vf" /></figure><ul><li><strong>Template Selection</strong>: We allow selecting different templates for the Store Wall screen based on category, country, city, user segment, and even specific dates, which is usually helpful for events like Saint Valentine’s Day or Halloween. We also allow experimentation, testing different templates for the same category to identify the best alternative. As an example, here we have a template for a restaurant category in Spain where each selected module is marked with a red square.</li><li><strong>Orchestrator</strong>: The Orchestrator is the central execution engine, controlling the process of building all modules (see the sketch after this list). It receives a Template as input, together with a RequestContext, which carries the current request details, such as user location and device information. With these, it calls every module that needs to be executed, reports metrics for each module (successes, failures, and latencies), and throws errors in case a critical module fails. Lastly, it is responsible for ensuring that each module is resolved within its latency thresholds and that the whole template is generated before the timeout for that screen. This allows us to:<br> — Prevent failures and ensure a seamless experience: If a non-critical module fails, it is ignored and all the other content is returned to the user. The same happens if a module takes longer than expected to execute. This also prevents teams collaborating on the Store Wall from breaking it if a bug or unexpected behavior is introduced.<br> — Improve accountability: Since we report metrics for the execution of each module, each team can have their own monitors and metrics based on the modules they own.<br> — Improve performance: Since modules are executed in parallel, the latency is dictated by the slowest module on the screen, allowing us to introduce new modules in the same screen without penalizing the overall performance.</li><li><strong>Modules</strong>: A Module is a plugin that can be injected into the template. Modules are independently developed features, which allowed us to scale up with the Videos, Social, and Picks features while keeping our system modular and maintainable. 
Each module handles its own:<br> — Backend logic and data retrieval: Each module includes specific logic for fetching relevant data and transforming it into a server-driven component.<br> — Monitoring and metrics: Individual modules log their metrics, which allows us to monitor and address module-specific issues without impacting the entire Store Wall.</li><li><strong>Data Providers</strong>: Because modules are executed in parallel, several of them may require the same data to compute their business logic. As an efficiency measure, we applied the proxy design pattern to our data providers to ensure that each piece of data (e.g., store information, city metadata) is fetched only once per user request. For example, if three modules require store details, the data will be fetched once by a Proxy and shared across the modules, reducing redundant requests to backend services and improving response time and user experience.</li><li><strong>Server-Driven Components</strong>: The last piece of the architecture that enables high reusability is the Server-Driven UI layer that we injected on top of the plugin architecture. This layer acts as a support package used by Modules to render server-driven elements. This approach decouples backend logic from frontend dependencies, allowing us to reuse components across different modules/screens, which is essential for high-scale apps with numerous feature experiments and regional differences.</li></ul>
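<p>To make the orchestration pattern above more concrete, here is a minimal, illustrative Kotlin sketch of a coroutine-based orchestrator with per-module timeouts, plus a proxy-style data provider that fetches shared data once. The names and signatures are ours for illustration only, not Glovo’s actual code:</p><pre>import kotlinx.coroutines.*<br><br>// A Module builds one piece of the Store Wall; a critical module may fail the whole screen.<br>interface Module {<br>    val name: String<br>    val critical: Boolean<br>    val timeoutMs: Long<br>    suspend fun build(context: RequestContext): UiComponent?<br>}<br><br>data class RequestContext(val userLocation: String, val device: String)<br>data class UiComponent(val moduleName: String, val payload: String)<br><br>class Orchestrator(private val modules: List&lt;Module&gt;) {<br><br>    // Executes all modules in parallel; non-critical failures and timeouts are dropped.<br>    suspend fun buildScreen(context: RequestContext): List&lt;UiComponent&gt; = coroutineScope {<br>        modules.map { module -&gt;<br>            async {<br>                try {<br>                    withTimeoutOrNull(module.timeoutMs) { module.build(context) }<br>                } catch (e: CancellationException) {<br>                    throw e // never swallow cancellation<br>                } catch (e: Exception) {<br>                    if (module.critical) throw e // a critical failure fails the screen<br>                    null // non-critical failures are ignored<br>                }<br>            }<br>        }.awaitAll().filterNotNull()<br>    }<br>}<br><br>// Proxy-style data provider: the first module to ask triggers the fetch, the rest share it.<br>class StoreDataProvider(scope: CoroutineScope, fetch: suspend () -&gt; StoreData) {<br>    private val shared = scope.async(start = CoroutineStart.LAZY) { fetch() }<br>    suspend fun get(): StoreData = shared.await()<br>}<br><br>data class StoreData(val info: String)</pre>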
<p>Here, you can find an overview of the flow with all the components:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*PRN667v7WAI26ki0" /></figure><h3>Ensuring Stability and Scalability Under Tight Deadlines</h3><p>Even though our architecture allowed us to move fast and develop features independently, the critical nature of our 90-day deadline required meticulous dependency management and stability assurance, as each module had downstream dependencies that would receive additional traffic. To confirm that our features were production-ready and to ensure a stable release, here is how we achieved the simultaneous rollout of all three features:</p><ol><li><strong>Real-Time Monitoring</strong>: Each module collected metrics, enabling us to monitor and respond to issues in real time. We set up dedicated alerts for each feature, ensuring a quick response to any emerging issues.</li><li><strong>Load Testing</strong>: We simulated traffic load across all features to understand system behavior under peak conditions, adjusting resource allocation to manage surges without compromising performance.</li><li><strong>Caching</strong>: Some modules share the data they need as input to process their logic. Caching strategies designed to manage parallel requests effectively allowed us to keep the impact on downstream services to a minimum.</li></ol><h3>Conclusion: Powering the Future of Delivery with a Modular Store Wall</h3><p>With several teams working on various features simultaneously, the modular design proved that it not only enabled these parallel efforts but also minimized code conflicts and dependencies, contributing to a highly efficient and independent workflow. This decoupled approach ensured that new features could be added seamlessly without compromising app performance, allowing us to maintain a consistent user experience.</p><p>With features like Videos, Social, and Picks, we’re taking steps towards a more engaging, user-centric delivery app that enhances the user experience while preserving stability and scalability. This architecture will continue to support rapid feature introduction, evolving the Store Wall into a personalized, content-rich experience for all users.</p><p>Authors:</p><p><a href="https://medium.com/@VickyPerello">Victoria Perelló</a>, Software Engineer from Glovo</p><p><a href="https://medium.com/@hmalatini">Hernán Malatini</a>, Software Engineer from Glovo</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1e0b7f7dfdca" width="1" height="1" alt=""><hr><p><a href="https://medium.com/glovo-engineering/how-we-engineered-a-scalable-architecture-to-power-videos-social-and-picks-in-our-delivery-app-1e0b7f7dfdca">How We Engineered a Scalable Architecture to Power Videos, Social, and Picks in Our Delivery App</a> was originally published in <a href="https://medium.com/glovo-engineering">The Glovo Tech Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building consistency at scale: Our journey with Compose Design System]]></title>
            <link>https://medium.com/glovo-engineering/building-consistency-at-scale-our-journey-with-compose-design-system-8a12b6d261be?source=rss----9bbfe5be0af5---4</link>
            <guid isPermaLink="false">https://medium.com/p/8a12b6d261be</guid>
            <category><![CDATA[android]]></category>
            <category><![CDATA[api]]></category>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[engineering]]></category>
            <category><![CDATA[compose]]></category>
            <dc:creator><![CDATA[Matias Isella]]></dc:creator>
            <pubDate>Tue, 12 Nov 2024 08:57:48 GMT</pubDate>
            <atom:updated>2024-11-12T13:26:32.025Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*SQbGuRaglIh45c-k7mZMdQ.png" /></figure><p>Modern applications must provide a seamless experience across all platforms. At Glovo, we have constantly faced this challenge and we’ve recognized the need for a unified Design System to ensure consistency across our growing products.</p><p>In this article, I will share our journey of creating a Compose Design System to support multiple applications at scale, focusing on API design. Similar to many other Design Systems, we identified several key characteristics which have influenced our decisions:</p><ol><li>Consistent: Ensures a similar user experience on both web and mobile.</li><li>Extensible: Allows for unique configurations and components while sharing a core experience.</li><li>Flexible: Enables teams to test new ideas without reinventing the wheel.</li></ol><h3>Google Material Design System — yes or no?</h3><p>If you’re an Android developer, chances are your first question is the same as ours. Should we use Material Design?</p><p>Material is one of the most used Design Systems in the world and ranks number one in most of the charts. When facing this question, I believe there are three main options, and lucky for us, they are <a href="https://developer.android.com/develop/ui/compose/designsystems/custom">documented</a> by Google.</p><h4><strong>Option 1. Extend Material</strong></h4><p>The first option is to use Material and extend it if needed. The main disadvantage of this approach is that the Design System API will contain the Material Design API. Therefore, the UX team needs to be onboard with the usage of the Material API (i.e. tokens and api definitions), as well with some inconsistencies that might occur between the designed Components and the Material Implementation.</p><h4><strong>Option 2. Replace Material</strong></h4><p>The second option is to replace Material. If your UX team requires a specific semantic token and you need to reuse Material components internally, replacing Material will give you the best of both worlds. Your Design System will expose only the tokens and components you define, and it will have access to Material components when needed. The main disadvantage of this approach is the maintenance overhead of managing a custom implementation while ensuring compatibility with Material updates.</p><h4><strong>Option 3. Not use Material</strong></h4><p>The last option is to not use Material. By taking this path, there is a significant disadvantage: you lose access to the constant development of Material. Hence, you cannot leverage Material for experimentation or to fill gaps in your Design System while working on the final Components. Unless you’re building an application from scratch, chances are you already have Material as a dependency. So we believe that you’re better off having access to the Material Component Catalog.</p><h4><strong>Decision</strong></h4><p>The second option was the one for us. Since our UX team required specific semantic tokens and we wanted to have access to the Material Components catalog internally, this approach best met our needs.</p><h3>Compose (+ View System?)</h3><p>Once we decided the fate of our Design System regarding Material, we faced our next question: to support or not support Android View System. Depending on your codebase, this question might not be relevant, but for us, it was critical. 
This determines the scope of the Design System as well as the availability and success of the project.</p><h4><strong>Option 1: Support View System</strong></h4><p>Supporting the View System ensures compatibility with existing features and facilitates a smooth adoption of the Design System in older features. However, this approach may result in duplicated work for the Android Design System team, since they would need to create two implementations of the same Component.</p><p>At the same time, there is a big risk of sinking time into View System Components that might become outdated mid-term, with the added cost of delaying support for Jetpack Compose.</p><h4><strong>Option 2: Support only Compose</strong></h4><p>Alternatively, supporting only Jetpack Compose helps enforce the usage of Compose; teams can leverage this to push for adoption and foster consistency across the project. But it’s worth mentioning that the adoption of the Design System will be limited by legacy features without Compose, since those will not have access to the Design System.</p><p>The decision ultimately depends on the codebase. However, given the current state of Compose, there may be compelling reasons to prioritize support for Compose over the View System.</p><p>We take the opportunity to highlight this great talk from KotlinConf 2023, related to Compose but focused on the learnings from adopting new technologies.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2F6lBBpWX1x8Y%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D6lBBpWX1x8Y&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2F6lBBpWX1x8Y%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/e869b4edb7a0285a8eecd45e4d37e2ae/href">https://medium.com/media/e869b4edb7a0285a8eecd45e4d37e2ae/href</a></iframe><h3>Design System API</h3><p>As explained above, we will be focusing on the API and, based on our requirements and decisions at this point, we will be developing a Kotlin Compose API.</p><h4>Theme</h4><p>The Theme is the main entry point of the Design System and, by now, all Compose Design Systems are expected to implement this pattern and expose a public Theme in their APIs. This pattern is built using three main components:</p><p>1. Theme Composable Function: This encapsulates all the theme properties, like colors and typography, and provides them to the corresponding composition locals in the composable tree.</p><pre>@Composable<br>public fun CoreTheme(<br>    colorScheme: CoreColorScheme = CoreTheme.colorScheme,<br>    typography: CoreTypography = CoreTheme.typography,<br>    content: @Composable () -&gt; Unit,<br>) {<br>    CompositionLocalProvider(<br>        LocalCoreColorScheme provides colorScheme,<br>        LocalCoreTypography provides typography,<br>        content = content<br>    )<br>}</pre><p>2. Theme Composition Locals: These allow for the static referencing of theme values in composable functions, avoiding the need to repeat them or pass them as composable function parameters.</p><pre>internal val LocalCoreTypography = staticCompositionLocalOf { CoreTypography() }</pre><p>3. 
Theme Object: A singleton whose main purpose is to increase the discoverability of all the composition locals defined above.</p><pre>internal object CoreTheme {<br><br>    val colorScheme: CoreColorScheme<br>        @Composable<br>        @ReadOnlyComposable<br>        get() = LocalCoreColorScheme.current<br><br>    val typography: CoreTypography<br>        @Composable<br>        @ReadOnlyComposable<br>        get() = LocalCoreTypography.current<br>}</pre><p>For more details, find the full Theme Anatomy in <a href="https://developer.android.com/develop/ui/compose/designsystems/anatomy">Google’s documentation</a>.</p><p>For us, this pattern provides the right level of scalability and flexibility. First, it gives us the ability to wrap the main Theme and override the default values. Second, it allows the creation of new values based on the requirements of each Theme while sharing the core experience.</p>
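<p>For instance, a consumer can re-theme a single sub-tree by wrapping the main Theme. A minimal, hypothetical usage sketch (PromoColors and PromoSection are our illustrative names, and we assume CoreColorScheme(), like CoreTypography() above, provides default values):</p><pre>// Hypothetical usage: re-brand a sub-tree by wrapping the main Theme.<br>private val PromoColors = CoreColorScheme()<br><br>@Composable<br>fun PromoSection(content: @Composable () -&gt; Unit) {<br>    // Everything inside this block resolves CoreTheme.colorScheme to PromoColors.<br>    CoreTheme(colorScheme = PromoColors) {<br>        content()<br>    }<br>}</pre>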
<p>We highly recommend the <a href="https://developer.android.com/codelabs/jetpack-compose-theming-2">Compose Theming Codelab</a> to understand how theming works on Android. Although it uses Material 2 instead of Material 3, the basics of overriding a Theme, accessing the composition locals, and creating your own values are compatible.</p><p>Along with the Theme, the Design System must expose the components in its API. Ultimately, the goal of this API is to match the requirements from UX while minimizing the cognitive effort required by engineers to interpret the design.</p><h4>Content</h4><p>Composable APIs can follow a few different variants to deal with the Content of the Component.</p><p><strong>Option 1: Slot API</strong></p><p>From: <a href="https://android.googlesource.com/platform/frameworks/support/+/androidx-main/compose/docs/compose-component-api-guidelines.md">Compose Component API Guidelines</a></p><p>This type of API is the most flexible, allowing the content to be any composable, and it is the recommended approach for all Composable APIs.</p><p>For us, this type of API is used publicly for Layouts or Containers, and internally for generalizing Components within the Design System to foster reusability. Exposing solely this type of API, while increasing the flexibility of the Design System, could also increase the variants of each component, causing a drop in Consistency.</p><pre>@Composable<br>private fun Avatar(<br>    // ...<br>    modifier: Modifier = Modifier,<br>    content: @Composable () -&gt; Unit,<br>)</pre><p><strong>Option 2: Restrictive API</strong></p><blockquote>Such API ensures that developers will be able to use the component only in the predefined way, leaving no space for possible mistakes and inconsistency.</blockquote><p>From: <a href="https://medium.com/bumble-tech/refining-compose-api-for-design-systems-d652e2c2eac3">Refining Compose API for design systems</a></p><p>This Compose API doesn’t have a Composable lambda in the method signature. The Component can only be used in a few ways, restricted by the function parameters.</p><p>For us, this type of API is good for removing ambiguity in the handover process of a new Design: since the content is highly opinionated and pre-configured to look exactly like the UX team defined, we only need to request the minimum variable portions of the Component. In our case, under the hood, these APIs always consume a Slot-based one.</p><pre>@Composable<br>public fun Avatar(<br>    // ...<br>    modifier: Modifier = Modifier,<br>    painter: Painter,<br>)<br><br>@Composable<br>public fun Avatar(<br>    // ...<br>    modifier: Modifier = Modifier,<br>    text: String? = null,<br>)</pre><p><strong>Option 3: DSL based slots</strong></p><p>This pattern relies on the Composable lambda receiver Type to pass a Scope with Composable members.</p><blockquote>DSL for defining content of the component or its children should be perceived as an exception.</blockquote><p>From: <a href="https://android.googlesource.com/platform/frameworks/support/+/androidx-main/compose/docs/compose-component-api-guidelines.md">Compose Component API Guidelines</a></p><p>For us, this type of API is used when we need to tear down long APIs that have configuration for more than one subcomponent. As a rule of thumb, we always prefer to expose a Restrictive API with overloads rather than using a DSL Slot. The reason for this is that there is no limitation on invoking many members of the same custom Scope, potentially causing unexpected results.</p><pre>@Composable<br>internal fun TextField(<br>    // ...<br>    start: @Composable (TextFieldContentDefaults.() -&gt; Unit)? = null,<br>    end: @Composable (TextFieldContentDefaults.() -&gt; Unit)? = null,<br>    // ...<br>)<br><br>@Stable<br>public object TextFieldContentDefaults {<br><br>    @Composable<br>    public fun Icon(<br>        // ...<br>    ) {<br>        // ...<br>    }<br><br>    @Composable<br>    public fun Text(<br>        // ...<br>    ) {<br>        // ...<br>    }<br>}</pre><p><strong>Option 4: Inverted Slot API</strong></p><p>This type of API is reserved for Components that are expected to be Decorated. This method enforces a specific pattern where decorations are added before or after the inner Component.</p><p>The key difference between using an “Inverted” Slot API and a plain Slot API is that the Decoration is expected to share the same behavior as the inner Component.</p><pre>@Composable<br>fun BasicTextField(<br>    // ...<br>    decorator: TextFieldDecorator? = null,<br>    // ...<br>)<br><br>fun interface TextFieldDecorator {<br><br>    @Composable<br>    fun Decoration(innerTextField: @Composable () -&gt; Unit)<br>}</pre><h4><strong>Text</strong></h4><p>Text content is one of the most common in API definitions, as many components in a large design system will receive string content.</p><p>From an API perspective, the approach depends on your use case, and ultimately, there are two options: either expose an AnnotatedString overload or not.</p><pre>@Composable<br>public fun Banner(<br>    body: String,<br>)<br><br>@Composable<br>public fun Banner(<br>    body: AnnotatedString,<br>)</pre><p>The main difference between the two signatures is that by using AnnotatedString, your API opens up the capability of overriding TextStyle attributes, potentially causing unexpected TextStyles that are not defined in the Design System Typography.</p><p>Additionally, note that without getting into the implementation details, regardless of your API, all text will fall under one of these Compose modifiers: TextStringSimpleElement or TextAnnotatedStringElement. The second one is slower than the first one. When possible, prefer to use different BasicText components: one for the AnnotatedString and one for the String.</p>
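<p>A minimal sketch of what that separation could look like inside a Design System (CoreText is a hypothetical name, and the signatures are simplified):</p><pre>// Separate overloads so plain String text never pays the heavier<br>// AnnotatedString path (TextAnnotatedStringElement) under the hood.<br>@Composable<br>public fun CoreText(text: String, modifier: Modifier = Modifier) {<br>    BasicText(text = text, modifier = modifier)<br>}<br><br>@Composable<br>public fun CoreText(text: AnnotatedString, modifier: Modifier = Modifier) {<br>    BasicText(text = text, modifier = modifier)<br>}</pre>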
<h4>Style</h4><p>We refer to Style as those parameters or functions used in our Design System API to define the look and feel of a component. Usually, Style parameters or functions are key elements in the handover process from UX to Engineering. These should be consistent across platforms and UX tooling to produce the same output.</p><p>Regardless of the process for building styles (e.g., manual or code generation), we have identified at least three main options for APIs with styleable Components.</p><p><strong>Option 1: Closed Styles</strong></p><p>This is the simplest approach. The main advantage is that your Style constructor’s API visibility ensures no new styles will be created, making it closed for extension by design. The main disadvantage is that ad hoc styles for experimentation are not possible.</p><pre>@Composable<br>public fun Avatar(<br>    style: AvatarStyle,<br>)<br><br>public enum class AvatarStyle(internal val shape: CornerBasedShape) {<br>    SQUARE(RoundedCornerShape(64.dp)),<br>    CIRCLE(CircleShape)<br>}</pre><p>or</p><pre>@Composable<br>public fun Avatar(<br>    style: AvatarStyle,<br>)<br><br>public enum class AvatarStyle {<br>    Square,<br>    Circle,<br>}</pre><p><strong>Option 2: Open Styles</strong></p><p>The main disadvantage of this option is the lack of exhaustiveness in the evaluation of styles, which can be useful for some presentation use cases. The main advantage is that it facilitates easier collaboration for consumers of the Design System and allows for simpler experimentation.</p><pre>@Composable<br>public fun Avatar(<br>    style: AvatarStyle,<br>)<br><br>public data class AvatarStyle(internal val shape: CornerBasedShape)<br><br>public data object CoreAvatarStyle {<br>    public val Square: AvatarStyle = AvatarStyle(RoundedCornerShape(64.dp))<br>    public val Circle: AvatarStyle = AvatarStyle(CircleShape)<br>}</pre><p>There is no Compose stability difference between Open and Closed Styles as long as they encapsulate stable parameters.</p><p><strong>Option 3: Multiple components</strong></p><p>The previous two approaches use parameters for styling the component. Although this is easier from a handover perspective and for experimentation, the recommended convention is to specify separate @Composable functions with different names.</p><blockquote>Express dependencies in a granular, semantically meaningful way. Avoid grab-bag style parameters and classes, akin to ComponentStyle or ComponentConfiguration.</blockquote><p>From: <a href="https://android.googlesource.com/platform/frameworks/support/+/androidx-main/compose/docs/compose-component-api-guidelines.md#prefer-multiple-components-over-style-classes">Compose Component API Guidelines</a></p><p>The main advantage of this approach is that the semantic meaning is clear and doesn’t need to be unwrapped. The main disadvantage is that it is harder to discover these separate functions compared to using a style parameter.</p><pre>@Composable<br>public fun PrimaryAvatar() {<br><br>}<br><br>@Composable<br>public fun SecondaryAvatar() {<br><br>}</pre><p>In the example above, specifying an avatar as either square or circle does not carry inherent semantic meaning in the function. 
On the contrary, designating an avatar as primary or secondary conveys semantic meaning, indicating its importance and intended usage in the UI.</p><h4><strong>Modifiers</strong></h4><blockquote>Every component that emits UI should have a modifier parameter.</blockquote><p>Based on the <a href="https://android.googlesource.com/platform/frameworks/support/+/androidx-main/compose/docs/compose-component-api-guidelines.md#modifier-parameter">Compose component API guidelines</a>, every public Composable API should expose a Modifier.</p><p>Exposing a Modifier in the API ensures flexibility and consistency, since it allows callers to add functionality without actually changing the Component.</p><p>Modifiers in APIs are expected to be at a certain position in the parameter list: right after the required parameters and before the first optional parameter.</p><blockquote>Why? Required parameters indicate the contract of the component, since they have to be passed and are necessary for the component to work properly. By placing required parameters first, API clearly indicates the requirements and contract of the said component. Optional parameters represent some customisation and additional capabilities of the component, and don’t require immediate attention of the user.</blockquote><p>Note that omitting a Modifier from the API imposes several restrictions regardless of the content: testability and accessibility, for example, rely on the <a href="https://developer.android.com/develop/ui/compose/accessibility/semantics">semantic tree</a>, and this semantic tree is built using the <a href="https://developer.android.com/reference/kotlin/androidx/compose/ui/semantics/package-summary#(androidx.compose.ui.Modifier).semantics(kotlin.Boolean,kotlin.Function1)">semantics</a> Modifier.</p><h4><strong>Stability</strong></h4><p>When building a Composable library at scale, it is desirable that most components are skippable, since they are small units within the UI. A Composable Design System shouldn’t impact recomposition.</p><p>From an API perspective, using <a href="https://developer.android.com/jetpack/androidx/releases/compose-foundation">Compose foundations</a> ensures that the classes used as parameters are stable, making the components skippable. For us, most of the instability usually comes from using unstable parameters such as List or Painter. Overall, running regular Compose compiler reports to catch any regressions has been effective.</p><pre>composeCompiler {<br>   enableStrongSkippingMode = true<br>   reportsDestination = file(&quot;build/reports/compose&quot;)<br>}</pre><p>Note that strong skipping mode solves many of these caveats.</p><h4>Documentation</h4><p>As with any API, good code documentation is expected, providing relevant content to developers using the Design System. <a href="https://kotlinlang.org/docs/kotlin-doc.html">KDoc</a> offers several attributes that help developers understand and navigate your API more effectively.</p>
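<p>As an illustration, here is how a hypothetical component could be documented with KDoc (the package and sample names are ours, not from an actual codebase):</p><pre>/**<br> * A themed avatar that renders an image following the Design System styling.<br> *<br> * @param painter the image displayed inside the avatar.<br> * @param modifier the [Modifier] applied to the component.<br> * @see CoreTheme<br> * @sample com.example.designsystem.samples.AvatarSample<br> */<br>@Composable<br>public fun Avatar(<br>    painter: Painter,<br>    modifier: Modifier = Modifier,<br>) { /* ... */ }</pre>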
<p>In addition to the common @param and @see tags, which describe how inputs affect the component’s behavior and provide links to relevant classes or methods, the @sample tag is particularly important in a Design System API.</p><p>The <a href="https://kotlinlang.org/docs/kotlin-doc.html#sample-identifier">@sample</a> tag has many relevant features for us:</p><ol><li>It gives the developer an immediate example of how to use the Component.</li><li>The sample is a Composable @Preview, providing the developer with an immediate preview of the Component.</li><li>Samples are also compiled, offering an out-of-the-box integration test for free.</li><li>Samples are included as Code Blocks when using a documentation engine.</li></ol><p>Finally, a clear benefit of using KDoc is the ability to export this documentation for public availability by generating it in HTML and Markdown using <a href="https://kotlinlang.org/docs/dokka-introduction.html">Dokka</a>.</p><h4>Bonus: Explicit API</h4><p>By using strict explicit API mode, we ensure consistency in the codebase, since developers must follow the same conventions and are forced to think about each API’s visibility. This leads to better-designed APIs, build-time checks, and self-documented code through the use of explicit visibility modifiers.</p><pre>kotlin {<br>   explicitApi()<br>}</pre><h3>Summary</h3><p>In summary, these are the key points we would like to highlight:</p><ul><li>Material Design: We don’t use it directly. Instead, we reuse Material components when needed, but these are internal to the Design System.</li><li>Legacy View System: Not supported. We focus only on Jetpack Compose and use the Design System to drive increased adoption of Compose.</li><li>API Flexibility: We use different approaches depending on the required flexibility. However, we prefer to be as opinionated as possible and expose the minimum number of parameters to prevent unexpected variations of components. In line with this, we use Closed Styles to ensure exhaustive evaluations and to make sure no use case falls through the cracks.</li><li>API Documentation: We heavily rely on KDoc to explain the components to Design System users, providing good code examples.</li><li>API Visibility: Using explicit API mode has been key to maintaining consistency.</li></ul><p>This is just the first step of our journey. We are leaving the full implementation of all these APIs outside of this initial article. Diving deeper into this topic will require a few more articles, so for now, we will leave it here. Thank you for reading, and stay tuned for future updates where we’ll explore these concepts in more detail.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8a12b6d261be" width="1" height="1" alt=""><hr><p><a href="https://medium.com/glovo-engineering/building-consistency-at-scale-our-journey-with-compose-design-system-8a12b6d261be">Building consistency at scale: Our journey with Compose Design System</a> was originally published in <a href="https://medium.com/glovo-engineering">The Glovo Tech Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Slashing Lead Times by 30%: The Impact of Using Explicit Types in JVM]]></title>
            <link>https://medium.com/glovo-engineering/slashing-lead-times-by-30-the-impact-of-using-explicit-types-in-jvm-932b46a0f07d?source=rss----9bbfe5be0af5---4</link>
            <guid isPermaLink="false">https://medium.com/p/932b46a0f07d</guid>
            <category><![CDATA[explicit-types]]></category>
            <category><![CDATA[gradle]]></category>
            <category><![CDATA[java]]></category>
            <category><![CDATA[compile]]></category>
            <category><![CDATA[jvm]]></category>
            <dc:creator><![CDATA[alexxozo]]></dc:creator>
            <pubDate>Mon, 18 Dec 2023 10:31:49 GMT</pubDate>
            <atom:updated>2023-12-18T11:35:57.295Z</atom:updated>
<content:encoded><![CDATA[<p><strong>Quick summary:</strong> As a developer, there are few things more exasperating than waiting for your code to compile. Today we’ll delve into the steps we took to address a significant increase in build time for one of our microservices. By just <strong>adding explicit types to java.util.Map definitions</strong>, we managed to<strong> cut down our build time by 30%</strong>!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*trAnVsrLSxlgaiuN" /></figure><h3>👀 Context</h3><p>In Glovo, the home and store feed screens act as the primary entry points for our ordering process. It’s crucial for us to<strong> prioritise rapid deployment </strong>in the service delivering this content, as it helps us <strong>experiment fast and ensure swift responses in case of incidents</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*x4NLQbJDbZq0kZaVnTSMNA.png" /></figure><p>One of our technical objectives was to improve the <strong>lead time of the development process</strong>.</p><p><strong><em>Lead time refers to the duration from when a code change (commit) is initially made to the point where it’s fully implemented in the production environment</em></strong><em>. </em><strong><em>This process involves several stages, such as coding, testing, code reviews and deployment.</em></strong></p><p>By <strong>reducing it</strong>, we can increase efficiency, allowing updates to reach end users faster. This translates to<strong> faster product experimentation</strong>, which in turn makes us better at <strong>delivering value to our customers</strong>. Our 35-minute build process, followed by a 7-minute deployment phase, was more of a roadblock than a speedway! ❌</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*p3QSG3BkcmVM9V76" /></figure><h3>⭐ The goal: Make it fast!</h3><p>One of our Glovo values is GAS — <em>“We work hard and execute fast. We always ask ourselves ‘How can this be done faster?’”</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/348/1*Ymd_p2SAEsA27RzHoyKoQw.png" /></figure><p>We needed to <strong>cut down that 36-minute CI pipeline</strong> as much as possible!</p><p>But first, a little backstory: we had previously been switching from Jenkins to GitHub Actions (GHA) pipelines, the main reasons being <strong>resource/cost optimisation</strong> and ease of use. The catch for us was that in Jenkins we had gigantic instances running the build; in GitHub Actions, however, even the largest runners were optimized for resource usage and had set memory limits!</p><p>Since we had a lot of integration tests covering the most critical flows, <strong>this exposed scalability issues in the integration tests</strong>: they were simply consuming <strong>too much memory</strong>. Even before moving to GHA, our optimization journey had begun.</p><h3>📚 Memory consumption</h3><p>The <strong>builds in GHA were failing due to out-of-memory errors </strong>(i.e. some tests were taking too much memory). 
We temporarily stopped the migration process from Jenkins to GHA and started thoroughly investigating the <strong>performance of our integration tests</strong>.</p><p>While this topic is beyond the scope of this post, I’d like to mention some of the steps we took to optimize the tests:</p><ul><li>Identifying <strong>tests lacking <em>org.springframework.context.annotation.Profile</em> and not reusing the context</strong>, ensuring they now use the common TEST profile</li><li><strong>Isolating problematic tests</strong> that were causing the memory issues. E.g. some were using <strong>mock beans (@MockBean) for mocking an external client</strong>. This impacted the reusability of contexts, resulting in recreating them for each test and increasing the memory footprint</li><li>Some of the <strong>unit tests were also extending the integration base class</strong> <strong>(a lot of dependencies)</strong>, which was a waste of resources and execution time</li><li><strong>Cleanup</strong> of unused integration tests</li></ul><p>Finally, we also tweaked our Gradle build parameters to obtain better performance overall (<strong>memory and time reduction</strong>); the last two flags below live in <em>gradle.properties</em>:</p><pre>tasks.withType(JavaCompile) {<br>  options.compilerArgs &lt;&lt; &quot;-Xlint:-options&quot;<br>  options.encoding = &#39;UTF-8&#39;<br>  options.fork = true<br>  options.forkOptions.setMemoryMaximumSize(&quot;8g&quot;)<br>  options.incremental = true<br>}<br>org.gradle.parallel=true<br>org.gradle.caching=true</pre><p>Most notable were:</p><p><strong>options.fork = true</strong></p><p><em>When fork is set to true, the Java compiler runs in a separate process. This can be useful for several reasons, such as avoiding memory issues in the Gradle daemon process, or isolating the compile process from Gradle’s own classpath.</em></p><p><strong>options.incremental = true</strong></p><p><em>Incremental compilation allows Gradle to recompile only the parts of the code that have changed since the last build, which can significantly speed up the build process.</em></p><p><strong>org.gradle.parallel=true</strong></p><p><em>Allows Gradle to execute multiple tasks in parallel. This can greatly improve the build speed, especially on multi-core machines, as it makes better use of available CPU resources.</em></p><p><strong>org.gradle.caching=true</strong></p><p><em>The build cache can significantly reduce build times by reusing outputs from previous executions of tasks. For instance, if a task has already been executed with the same inputs (source files, configuration, etc.), Gradle can skip its execution and use the cached result instead.</em></p><p>By combining all these methods we managed to migrate from Jenkins to GHA successfully! 🚀</p><h3>⏱️ Compilation time</h3><p>Even though the tests were optimized, after a while we came to realize that <strong>build time was still painfully long</strong>, so we turned our attention to compilation time. And so we embarked on another journey to discover <strong>how to make our code compile FAST</strong>!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*QfGRP02WJQozhkIA" /></figure><p>To get a better grasp on the issue, we decided to utilize Gradle’s debugging options to dive into the underlying reasons for the prolonged build times.</p><ul><li><strong>Divide and Conquer — </strong>The first step was to isolate the slow-performing action in the build process. Initially there was a single GitHub Action for build and test. 
We split it into separate execution steps for build, unit tests and integration tests. This gave us a great insight: <strong>the main driver of the total time seemed to be the BUILD step 🚀</strong></li><li><strong>Searching for a Problematic Commit</strong> — Next, we <strong>inspected the commits made in the past weeks</strong> to check if anything could have led to longer compilation times. Unfortunately, we could not find anything on this route…</li><li><strong>Trial and Error with Gradle commands</strong> — We tried some of the common Gradle build commands to check if they would help; unfortunately, nothing really did. Here are some of them (they might be useful for your scenario):</li></ul><p><strong>gradle build --scan</strong></p><p><em>Gradle collects data about the build, such as how long tasks took to execute, what tasks were executed, and information about the environment (like Gradle version, Java version, operating system, etc.).</em></p><p>We ran this command locally and even though the analysis was exhaustive and highlighted that compiling Java took a lot of time, it could not narrow it down to the level of individual classes.</p><p><strong>gradle build --stacktrace</strong></p><p><em>A stack trace is a report of the active stack frames at a certain point in time during the execution of a program. It’s particularly useful for debugging because it shows the call sequence that led to the error or exception.</em></p><p>Since this just provides a summary of tasks and is mostly useful when there are errors, which wasn’t the case, it was again not helpful.</p><p><strong>gradle build --info &amp; gradle build --debug</strong></p><p><em>Tells Gradle to provide more detailed output than usual.</em></p><p>This gave a detailed description of the build, the tasks executed, dependency resolution, etc., but once again it was not enough to narrow down the root cause.</p><ul><li><strong>Trial and Error with JavaCompile Options</strong> — The <strong>WINNING</strong> <strong>option</strong> was <strong>options.verbose </strong>🚀! It gave us the visibility we needed to understand why compilation was taking so long. This is the final configuration we used:</li></ul><pre>tasks.withType(JavaCompile) {<br>  options.compilerArgs &lt;&lt; &quot;-Xlint:-options&quot;<br>  options.encoding = &#39;UTF-8&#39;<br>  options.fork = true<br>  options.forkOptions.setMemoryMaximumSize(&quot;8g&quot;)<br>  options.incremental = true<br>  options.debug = true<br>  options.verbose = true<br>}</pre><p>This option provided detailed information about the compilation step, and we noticed that for one particular class, compilation took more than <strong>5 minutes</strong>! <strong>The begin and end logs for that class showed this duration (see image below).</strong> We executed this a few times to confirm it was not random.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*sClf6RJJaFZzDXSl" /></figure><p>The class was quite simple; it just defined some java.util.Map objects holding static data for some of our experiments. It looked something like this:</p><pre>public static final Map&lt;String, String&gt; ABTestData = Map.ofEntries(<br>     Map.entry(&quot;A&quot;, &quot;B&quot;),<br>     // 500+ entries<br>);</pre><p>Our breakthrough came from <strong>a thread in the OpenJDK compiler-dev mailing list</strong> (see <a href="https://mail.openjdk.org/pipermail/compiler-dev/2021-July/017609.html">OpenJDK Mailing List</a>). 
Extract from the message:</p><pre>In javac we are doing a lot of heroics to try and keep the space of <br>inference variable as small as possible, by aggressively de-duping <br>inference variables where possible. This strategy works well in cases like:<br><br>a(b(c(d(...)))))<br><br>But in cases like the ones you report (or those in the JBS issues <br>above), which have a shape like:<br><br>a(b(), c(), d(), e() ... )<br><br>We do not yet perform any de-duping, so the inference engine has to run <br>with a very big set of (possibly very similarly looking) inference <br>variables. Performing incorporation will end up setting similarly <br>looking bounds on the inference variables of the outer a() call, which <br>all have to be validated, and so on and so forth.</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/0*_eaPFGUCKReBss_7" /></figure><p><strong>This was our AHAAAAA moment</strong>! 🚀</p><p>When you do not explicitly specify the type parameters (like Map.&lt;String, String&gt;ofEntries), the compiler has to infer them. This is straightforward for a few entries but becomes increasingly complex and resource-intensive as the number of entries grows (our situation).</p><pre>public static final Map&lt;String, String&gt; MAP = Map.&lt;String, String&gt;ofEntries(<br>     Map.entry(&quot;A&quot;, &quot;B&quot;),<br>     // More entries<br>);</pre><p>By explicitly specifying the type parameters (e.g. using Map.&lt;String, String&gt;ofEntries), we relieved the compiler of the need to infer the types for each entry, significantly speeding up the compilation process. <strong>🚀🚀</strong></p><h3>💡 The Result: Reduced Compilation Time by 30%</h3><p>This modest code adjustment led to a <strong>substantial 30% decrease in our build time</strong>! The impact was immediate! In our fast-paced environment <strong>every minute counts</strong>, and now, suddenly, we have<strong> cut down our waiting time by almost a third.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/947/0*z8kRSiViIaqE_hDQ" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/872/0*F_JeXgnRoqWQc-PI" /></figure><h3>🚀 Conclusion</h3><p>In this article we’ve covered Gradle, GHA, Jenkins, JavaCompile options and the inner workings of the Java compiler, all for the purpose of optimizing the build times of our microservice. I hope you’ll find some valuable information here that will help you do the same!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=932b46a0f07d" width="1" height="1" alt=""><hr><p><a href="https://medium.com/glovo-engineering/slashing-lead-times-by-30-the-impact-of-using-explicit-types-in-jvm-932b46a0f07d">Slashing Lead Times by 30%: The Impact of Using Explicit Types in JVM</a> was originally published in <a href="https://medium.com/glovo-engineering">The Glovo Tech Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Taste the World: How Our New Machine Translation Feature Transforms Your Ordering Experience]]></title>
            <link>https://medium.com/glovo-engineering/taste-the-world-how-our-new-machine-translation-feature-transforms-your-ordering-experience-5d73f1efb2b3?source=rss----9bbfe5be0af5---4</link>
            <guid isPermaLink="false">https://medium.com/p/5d73f1efb2b3</guid>
            <category><![CDATA[system-architecture]]></category>
            <category><![CDATA[innovation]]></category>
            <category><![CDATA[machine-translation]]></category>
            <category><![CDATA[translation]]></category>
            <category><![CDATA[order-food-online]]></category>
            <dc:creator><![CDATA[Ahmad Hamouda]]></dc:creator>
            <pubDate>Wed, 13 Dec 2023 09:18:09 GMT</pubDate>
            <atom:updated>2023-12-13T09:47:22.353Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/720/0*tcaiZWN2tJatLVuw" /></figure><h3>Authors:</h3><p>Ahmad Hamouda, Software Engineer IV at Glovo<br>Stefania Russo, Head of UX Content at Glovo</p><h3>Introduction</h3><p>Glovo’s mission is to “give everyone easy access to anything in their city”. Being live in 24 markets, we aim to provide a seamless experience without language barriers for all our customers.</p><p>While the interface of our apps and platforms are already localized in the languages of every country we operate in, there was still one more language barrier to overcome: our restaurant menus and store product lists, which are monolingual, were not translated into the user preferred language.</p><p>Imagine: you’re a native English speaker who lives in Barcelona and doesn’t speak Spanish. Your phone is in English and so is the Glovo app.</p><p>You are hungry on a Friday night and while browsing Glovo find all restaurant menus are either in Spanish or Catalan. There is no simple way in the app to translate the menu and having to copy/paste into Google Translate is frustrating. You’ll likely end up ordering something that you’re not sure of or close the app and go pick up the food yourself.</p><p>By giving our users the possibility to see a translated version of our menus in the language of their mobile devices, we can increase our reach and penetration in every market as well as improve the overall user experience.</p><p>In this article, we will give an overview of the solution we built, a deep dive into the localization challenges, and how we are measuring success.</p><h3>Customer pain point</h3><p>Glovo operates in countries with a high percentage of immigrants, so a monolingual catalog automatically prevents a significant number of users from placing an order due to language inaccessibility.</p><p>The lack of a translation solution for our restaurant menus and store product lists was already a known major pain point in many of Glovo’s countries.</p><p>On top of this, whenever a user had their phone in a language that was different from the country’s main language, they were exposed to an unexpected multilingual experience they did not choose.</p><h3>Users prefer content in their language</h3><p>In 2020, the content and language services firm CSA Research published <a href="https://csa-research.com/Featured-Content/For-Global-Enterprises/Global-Growth/CRWB-Series/CRWB-B2C">”Can‘t Read, Won‘t Buy”</a> summarizing people‘s attitudes towards using products in their language versus other languages. The results were eye-opening:</p><ul><li>65% prefer content in their language, even if it‘s poor quality</li><li>67% tolerate mixed languages on a website</li><li>73% prefer products with information in their own language</li><li>66% use online machine translation</li><li>40% will not buy from websites in other languages.</li></ul><h3>Our Product catalog challenges</h3><h3>Catalog ownership</h3><p>Menu and store catalogs are owned by the partners and even though clear global guidelines are provided, partners choose the language and format of their catalogs.</p><h3>Language mix and language detection</h3><p>While most countries have catalogs in the same local language, some partners in multilingual countries choose to duplicate their menus on Glovo to offer a multilingual experience to their customers (ie. 
in Georgia many restaurants have their menus in both Georgian and Russian).</p><p>This quick workaround on the partners’ side, meant to compensate for the lack of a menu translation feature, becomes a challenge when trying to establish a scalable one. Most menu items consist of one or two words, and language detection in such small units of text is a big technical challenge.</p><h3><strong>Catalog dimensions and updates</strong></h3><p>Our restaurant and store catalogs contain millions of products, and they undergo frequent updates.</p><p>Menus and product catalogs are structured into names (e.g. Pizza Margherita) and descriptions (e.g. ingredients like tomato, mozzarella, or basil). Partners can also organize content in sections, collections, and super-collections (e.g. Top sales, Combos, Starters, Salads, Classic pizzas, etc.).</p><p>Catalogs are updated frequently depending on the type of business: seasonal menus, special offers, changes in product offerings, etc.</p><p>These updates occur across all markets, for all partners, every single day. Millions of menu and catalog entries are updated daily.</p><h3>The Solution</h3><p>It was clear that what we needed was a real-time Machine Translation (MT) solution, integrated into our systems. We aimed to translate our menus and catalogs into the user’s device language via an API, without requiring any human intervention.</p><p>As the result of an effective and rewarding cross-team collaboration between <strong>Localization</strong>, <strong>Engineering</strong>, and <strong>Product</strong>, we closed the gap between the localized interface and the product catalog language.</p><h3>Getting Started</h3><p>We collected all the necessary requirements to select a third-party Machine Translation provider that could fit our needs.</p><p>Main considerations:</p><ul><li>Machine Learning Model customization: while the Machine Translation had to happen in real time, we needed a system to customize the machine learning process based on Glovo-specific content requirements and existing content</li><li>Machine translation coverage for not-so-common language combinations like Armenian and Georgian into English and Russian</li><li>Adaptive technology that could quickly learn from user feedback</li><li>Low latency and high availability: maintaining low latency for a personalized customer experience, with stringent SLAs to ensure service reliability and avoid degrading the customer experience</li><li>Quality monitoring</li><li>Scalability</li><li>Cost-effective solution</li><li>Robust data processing capabilities.</li></ul><h3>Our Machine Translation Partner</h3><p><a href="https://www.modernmt.com/"><strong>ModernMT</strong></a> is the provider that won the selection process, as it met our requirements in terms of tech solution, quality customization needs, and cost-effectiveness.</p><p>ModernMT is an <strong>adaptive neural machine translation system</strong> and one of the top-rated in the market, developed by <a href="https://translated.com/welcome">Translated</a>.</p><p>ModernMT was recently recognized as a leader in the<a href="https://translated.com/machine-translation-leader-IDC"> IDC MarketScape</a> for machine translation software, ahead of the likes of Google, Amazon, and Microsoft. 
Currently, ModernMT supports 200 languages, reaching over 6.5 billion native speakers worldwide.</p><h3>The MVP</h3><p>Enabling machine translation of the product catalog touches many phases of the customer journey, from the moment a customer starts their search for a product until they pay for their order at checkout.</p><p>We adopted a lean approach to get started, so we prioritized enabling the machine translation feature on the<strong> store screen [Figure 1] </strong>to make sure our customers could get the most out of it.</p><p>We adopted the same lean approach when deciding where to first roll out the feature. For the aforementioned reasons, we began with the two countries most in need of a machine translation solution: <strong>Georgia </strong>and<strong> Armenia</strong>.</p><p>After the first roll-out, and after applying a number of learnings from the initial trial, the feature was scaled to the store pages/screens for the remaining countries.</p><p>We determined the language combinations for enabling the machine translation feature based on the most used device languages in each country.</p><h3>Overview of the feature</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/375/0*Shm8Fi6RsMkmyfna" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/377/0*7ZSozh9SlSDukZeA" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/375/0*NfkM-Ih_q1RwQe-6" /></figure><p>Figure 1</p><h3>Localization deep-dive: the Machine Translation Customization effort</h3><p>Customizing a machine translation engine means providing it with relevant material so it learns from it and becomes better over time.</p><p>In our case, it involved injecting samples of correct product translations and language glossaries into the engine so that it learns from them and applies the learnings across the whole system.</p><p>This practice is extremely important when the content to machine-translate consists of short units of text, such as menus and product lists, because the machine doesn’t have a lot of contextual text to support its choices. Long sentences or paragraphs provide the machine with better context and therefore imply less training and a faster learning process.</p><p>The customization work has been divided into the following phases:</p><h4>Phase 1: Machine Translation Engine Pre-training</h4><p>MT engines are fed with the following datasets for each language combination:</p><ul><li>Existing translation memory databases</li><li>Exports of top-selling products for each language and market</li><li>Do-Not-Translate glossaries (a list of terms which we never want to translate)</li></ul><h4>Phase 2: Sample Human Reviews</h4><p>ModernMT learns dynamically and continuously. So, beyond the pre-training step above, we extracted samples of our product catalogs for each market, processed them through the engine, and had human translators perform a linguistic review.</p><p>This step allowed us to feed corrections and feedback directly into the ModernMT engine.</p><h4>Phase 3: Machine Translation Glossaries</h4><p>A glossary, in the context of Machine Translation, is a tool that facilitates the consistent translation of customer-specific terminologies, giving advanced control over the terms used.</p><p>ModernMT has an MT Glossary feature integrated into their API. 
This allows us to create Glovo-specific terminology lists per country that help us boost the quality and nuance of the machine translation engine’s output.</p><p>Our MT glossaries include:</p><ul><li>Universal food terminology (ingredients, dishes, generic products, etc.)</li><li>Local ingredients</li><li>Local dishes</li><li>Big food chains</li><li>Q-commerce products</li></ul><p>The MT glossaries are a live asset, which will be updated regularly to make sure we keep up with product catalog updates.</p><h4>Phase 4: Continuous Feedback Loop</h4><p>Being able to quickly implement any feedback is crucial for us and for our end users.</p><p>For this reason, we implemented an in-app feedback solution thanks to which Glovo employees using our beta app can easily report any wrong machine translation by tapping a button.</p><p>Tapping the button sends synchronized API requests, along with the necessary metadata, to the TranslationOS platform, which triggers a human review by a professional translator.</p><p>Once the review is completed, we receive the corrected text back via API, and it is then integrated internally into our ModernMT model.</p><p>This feedback loop, combined with the adaptive MT model, aims at continuous improvement of our solution.</p><h3>The Engineering deep-dive: building a dedicated microservice</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/785/0*WIJsNCIV6MBqFF9a" /></figure><p>At first glance, connecting third-party services directly with ours may appear straightforward. However, our stringent requirements related to security, data control, quality assurance, and cost reduction make this integration more complex than it seems.</p><p>Our primary goal is to keep latency exceptionally low, all while retaining control over our Service Level Agreements (SLAs).</p><p>To navigate these challenges, we’ve implemented an asynchronous approach to managing client requests.</p><p>When a client queries us, we promptly provide the data if it’s available. If not, we notify the client of data unavailability while presenting the original text.</p><p>Simultaneously, we initiate an asynchronous request to ModernMT for the translation, storing the result in our database. As a result, subsequent requests for the same word and language combination are instantly served from our storage. Although this method means the initial translation request is served untranslated, it reduces costs by approximately 95%.</p><p>This cost reduction is attributed to two key factors:</p><ol><li>We translate only a minimal part of our entire catalog (items visible to users).</li><li>We reuse the same translation without incurring additional translation costs for every subsequent request.</li></ol><p>We use a scoring system to decide when to display translated content. If at least 85% of our items are translated, we show the translations. This helps us ensure that our pages stay relevant, and we can adjust the threshold based on our evolving business needs.</p>
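<p>A minimal Kotlin sketch of this read-through, translate-in-the-background flow. The interfaces, names, and the threshold constant below are illustrative only, not our production code:</p><pre>import kotlinx.coroutines.CoroutineScope<br>import kotlinx.coroutines.launch<br><br>// Illustrative ports: in production the store is Redis and the provider is ModernMT.<br>interface TranslationStore {<br>    suspend fun get(text: String, lang: String): String?<br>    suspend fun put(text: String, lang: String, translation: String)<br>}<br><br>interface TranslationProvider {<br>    suspend fun translate(text: String, lang: String): String<br>}<br><br>class CatalogTranslator(<br>    private val store: TranslationStore,<br>    private val provider: TranslationProvider,<br>    private val scope: CoroutineScope,<br>    private val threshold: Double = 0.85, // show translations only above this coverage<br>) {<br>    // Returns translated items when coverage passes the threshold, otherwise the originals.<br>    suspend fun translateAll(items: List&lt;String&gt;, lang: String): List&lt;String&gt; {<br>        val translated = items.map { item -&gt;<br>            store.get(item, lang) ?: run {<br>                // Cache miss: serve the original text now and fetch the<br>                // translation in the background for subsequent requests.<br>                scope.launch {<br>                    runCatching { provider.translate(item, lang) }<br>                        .onSuccess { store.put(item, lang, it) }<br>                }<br>                null<br>            }<br>        }<br>        val coverage = translated.count { it != null }.toDouble() / items.size<br>        return if (coverage &gt;= threshold) {<br>            translated.mapIndexed { index, value -&gt; value ?: items[index] }<br>        } else {<br>            items // below the threshold, keep the whole screen in the original language<br>        }<br>    }<br>}</pre><p>In this sketch the first request for an untranslated item returns the original text, while the background job fills the store so that subsequent requests are served from it.</p>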
<p>This cost reduction is attributed to two key factors:</p><ol><li>We translate only a minimal part of our entire catalog (items visible to users).</li><li>We reuse the same translation without incurring additional translation costs for every subsequent request.</li></ol><p>We use a scoring system to decide when to display translated content. If 85% of our items are translated, we show the translations. This helps us ensure that our pages stay relevant, and we can adjust the threshold based on our evolving business needs.</p><p>To make things easier for our users, we’ve introduced an auto-translation feature. When a user’s device language matches one of the supported languages configured for their current country (i.e. your device language is English and you’re in Spain, where English is a supported target language), the translation happens automatically without any input from the user.</p><p>We’ve designed this to accommodate different language preferences, providing an inclusive experience for all users.</p><h3>Tech Mid-Level Solution: Making Informed Technology Choices</h3><p>As we planned our service, we started by figuring out how much storage we needed, predicting the traffic (mainly from the store screen), and setting our SLA goals.</p><p>In this process, we looked at three options, each with its own pros and cons.</p><h4>Database:</h4><p>Redis:</p><p>Redis emerged as an initial contender due to its cost-effectiveness compared to DynamoDB and its superior speed compared to a standalone MySQL setup. However, its challenge lies in data persistence. While there are more advanced options available, such as ElastiCache with persistence, they come with increased costs.</p><p>DynamoDB:</p><p>Although DynamoDB offered speed, ensuring stable access patterns and understanding Read Capacity Units (RCU) and Write Capacity Units (WCU) were critical requirements.</p><p>SQL:</p><p>We considered SQL solutions, which seemed cost-effective, but using them might have required adding Redis for extra features anyway. After careful thought, we decided to start the service with Redis. This lets us gather data on reads and writes, validate new features, and plan for the future based on data. Our iterative approach allows us to continuously improve the project.</p><p>Even though we designed models for DynamoDB and SQL, we structured our data model so that switching from Redis to either DynamoDB or SQL in the future is still possible.</p><p>This decision gives us the flexibility to adapt based on metrics and user feedback, ensuring the service remains reliable and efficient.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/847/0*aW-80cLEaMogFgBD" /></figure><h4>Communications: Streamlined Machine Translation Services</h4><p>To improve our machine translation services, we divided our system into two separate modules. Each module owns its key aspects, including deployment, source code, and infrastructure:</p><p>Event-Based Module</p><p>This module functions on an event-driven architecture, allowing asynchronous communication and enabling the retrieval of translations from our service providers. This setup ensures effective handling of various translation requests by separating functionalities, promoting scalable and independently configurable operations.</p><p>API Module</p><p>On the flip side, our API module handles user traffic and oversees fallback from TranslationOS, guaranteeing a responsive interface for users. This split allows for customized scaling and the use of distinct configuration libraries. This modular approach results in cleaner code, a more secure platform, and improved organization. It also lays a solid foundation for adapting to future changes and scaling requirements.</p><p>Our reliance on Kafka, the asynchronous messaging technology we use within Glovo, ensures the smooth operation of our event-based module. Additionally, we’ve incorporated robust features like retries, rate limiters, and circuit breakers in our communication protocols with ModernMT and TranslationOS. 
These measures are crucial for adhering to their limitations, respecting service capacities, and managing fallback scenarios effectively.</p><p>This detailed approach not only ensures optimized communication channels but also strengthens our system’s resilience, guarding against potential service disruptions and enhancing overall reliability.</p><h3>Clean Up and Refresh Data</h3><p>We manage our data in Redis with specific Time-to-Live (TTL) settings, which we refresh based on usage patterns. This means that frequently accessed items have their TTLs extended. However, if we updated the TTL on every access, we would generate too many unneeded writes to our storage. For this reason, we decided to use a statistical model and extend the TTL only occasionally (e.g. randomly, once every ~10 accesses). This reduces write operations and overall costs while keeping frequently used items available for longer.</p><p>Moreover, our feedback loop lets us update and enhance specific items with poor translation quality, as explained earlier.</p><p>As our machine learning model progresses, there comes a time when data cleanup becomes necessary. To address this, we’ve created an internal API for data cleansing based on language pairs or countries. This API comes into play when we reach a specific threshold or introduce a significant amount of new data to our glossaries, ensuring efficient maintenance.</p><h3>Security</h3><p>Ensuring security is a top priority for us, especially when dealing with third parties and managing incoming data that directly affects our company’s reputation and costs. To minimize the risks related to these interactions, we perform thorough risk analyses and make informed choices regarding password rotation, sharing practices, and identity validation.</p><p>Through close collaboration with ModernMT, we’re implementing crucial changes related to passwords and tokens. Their support and flexible approach have proven invaluable in strengthening the security measures that protect our systems and data.</p><h3>Measuring Success: Impact in Numbers</h3><p>Keeping track of the metrics below is pivotal in measuring the impact of our machine translation solution. It helps us refine strategies, confirm impact, and continuously improve the service based on user behavior and feedback.</p><h3>Conversion Rate (CVR)</h3><p>CVR is a key measure to evaluate how well our translation services work. By comparing the conversion rate before and after implementing translations, we can see how it affects user engagement and actions like purchases or interactions on the platform.</p><h3>New Customer Acquisition</h3><p>New customer acquisition measures how well we attract foreign customers and expats to our platform. Tracking the influx of new users post-translation helps quantify the service’s effectiveness in broadening our user base. It provides concrete data on how well our translation solutions resonate with a diverse audience, reflecting our ability to attract and retain foreign users, expatriates, and newcomers to the platform.</p><h3>Served Orders with Machine Translation</h3><p>Counting the orders handled through machine translation gives a clear sign of how widely the service is used and how practical it is. 
Keeping an eye on this metric helps us understand the extent of user interactions made possible by translated content, highlighting its role in making transactions smooth.</p><h3>Increased Orders</h3><p>A noticeable increase in order placements directly linked to the introduction of machine translation indicates the impact of the service on user behavior.</p><p>This metric clearly shows how translated content positively affects user engagement, leading to a boost in platform activity.</p><h3>Customer Satisfaction</h3><p>Collecting and analyzing customer feedback on the machine translation feature provides qualitative insights that help us keep improving our users’ experience.</p><h3>Customer Retention</h3><p>Assessing changes in customer retention rates after implementation allows us to gauge the impact of the service on user loyalty.</p><h3>Future Developments</h3><p>Our goal is to expand the machine translation feature to additional stages of the customer journey, ensuring a smooth user experience across the entire platform.</p><h3>Authors:</h3><p><a href="https://medium.com/u/196ce9426e9c">Ahmad Hamouda</a>, <a href="https://medium.com/u/415b798569e5">Stefania Russo</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5d73f1efb2b3" width="1" height="1" alt=""><hr><p><a href="https://medium.com/glovo-engineering/taste-the-world-how-our-new-machine-translation-feature-transforms-your-ordering-experience-5d73f1efb2b3">Taste the World: How Our New Machine Translation Feature Transforms Your Ordering Experience</a> was originally published in <a href="https://medium.com/glovo-engineering">The Glovo Tech Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Cracking the Code: JS, BigInt, and the Art of Future-Proofing Your App]]></title>
            <link>https://medium.com/glovo-engineering/cracking-the-code-js-bigint-and-the-art-of-future-proofing-your-app-2ef6d00f5d0c?source=rss----9bbfe5be0af5---4</link>
            <guid isPermaLink="false">https://medium.com/p/2ef6d00f5d0c</guid>
            <category><![CDATA[nuxtjs]]></category>
            <category><![CDATA[javascript]]></category>
            <category><![CDATA[json]]></category>
            <category><![CDATA[bigint]]></category>
            <category><![CDATA[nodejs]]></category>
            <dc:creator><![CDATA[Victor Borisov]]></dc:creator>
            <pubDate>Wed, 29 Nov 2023 09:56:25 GMT</pubDate>
            <atom:updated>2023-11-29T09:56:25.214Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/680/1*WCAMdsffEhq55gA7W31-sA.png" /></figure><h3>What’s this article about?</h3><p>Ever wondered about the mysteries of JSON.parse and why it sometimes throws unexpected surprises? In the real world where we have real traffic and real users, it turns out not every platform plays nice with JSON in the same way. Parsing numbers is just the tip of the iceberg — we’ve got to tackle the quirks of both nodeJS and browsers, as well as some server-side rendering frameworks.</p><p>Join us in this exploration where we spill the beans on real-world scenarios from Glovo, sharing insights on the nitty-gritty of JSON and big-number parsing that you might not have seen coming.</p><h3>How does JS handle numbers and what is BigInt?</h3><p>Historically, numbers in JS are represented using the `number` type, which is based on the<a href="https://en.wikipedia.org/wiki/IEEE_754"> IEEE 754 floating-point standard</a>. This basically means that every `number` is stored as a 64-bit double-precision value in memory, which can safely represent any integer between -9007199254740991 and 9007199254740991 (this is Number.MAX_SAFE_INTEGER in JS) and can even work with floating points (although with some limitations). This should be enough to cover most cases, <strong>but it still has some limitations</strong>, most notably with floating-point arithmetic and with storing numbers outside of the safe integer range.</p><p>Due to these limitations, <em>number</em> cannot safely represent big numbers or do arithmetic with them, and it also has issues with some floating-point arithmetic. We can use the BigInt type in JS instead of `number`; it only allows working with integers, but it can safely handle much bigger numbers and has pretty solid<a href="https://caniuse.com/bigint"> browser support</a>. This is all we need to know about how JS works with numbers for the purposes of this article, but if you’d like to dig a bit deeper into the topic,<a href="https://javascript.plainenglish.io/why-javascript-is-bad-at-math-9b8247640caa"> check this article out</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/620/0*HQOco7Oq-N2CJ_V2" /></figure><p>After reading this, you might think: “Oh, I am not planning to do any arithmetic or to work with bigger numbers, so I should not care about what you’ll say next”. And I want to assure you, we thought exactly the same until the thing that I will tell you about next happened.</p><h3>The Unexpected Twist: How Neglecting BigInt Almost Broke Our App (And why this can easily happen to you)</h3><p>Even though we use numbers for the IDs of products in Glovo, we didn’t consider using BigInts because the numbers were very far from MAX_SAFE_INTEGER, and we were sure we would not reach this limit during the lifetime of the app. I’d personally prefer never using numbers and would go for strings instead (then we would not have had this problem in the first place), but the API was designed with more focus on the mobile apps at that time, and the issues we’ll be talking about here do not exist on major mobile platforms for native apps.</p><p>At some point, we had to plan for migrating one of our API services to a new API, which unifies several different applications, and it actually uses IDs that are higher than MAX_SAFE_INTEGER. We immediately figured out this wasn’t going to work.</p>
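<p>To see the failure mode in isolation, here is a minimal console demo (generic code, not our app):</p><pre>// Two IDs that are distinct in the JSON payload...
const parsed = JSON.parse('{"a": 9007199254740993, "b": 9007199254740992}');

// ...collapse into the same `number` after parsing:
console.log(parsed.a === parsed.b); // true: both become 9007199254740992</pre>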
<p>Without making any changes to our code, we set up our testing environment to use the new API to see how bad things were — and indeed they were really bad. The app had a bunch of errors, and we had a store where all the products had IDs big enough to be coerced by the `number` limitations, meaning that after parsing the JSON, every product had exactly the same ID. To understand how this happens, see the screenshot below or <a href="https://github.com/vd3v/nextjs-bigint-demo-app/blob/main/app/page.tsx">check this repo with the demo app</a> from which this screenshot is taken.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*BD3ihFFMrul82tv1" /></figure><p>The result was that, when adding 1 product on the store page, all the products of this store were added to the cart, because their (originally distinct) IDs were coerced to the same number.</p><p>This happens because there are “unsafe” big numbers in the JSON response, but native `JSON.parse` has no idea what a BigInt is, and there is no way to make it correctly parse these numbers from the original JSON. To make matters worse, the RFC describing the JSON standard recommends using numbers inside the range of double-precision numbers (same as JS’s `number`), but it does NOT enforce any specific limit and states those limits are up to each implementation (you can see it here<a href="https://datatracker.ietf.org/doc/html/rfc8259#section-6"> https://datatracker.ietf.org/doc/html/rfc8259#section-6</a>).</p><p>So the problem here is that the service returns a technically valid JSON, but we can not parse it using built-in JS tools. This means that, if you are developing a webapp, it doesn’t matter what your FE framework/library is — there is a non-zero chance that one of the APIs you depend on may start returning unsafe numbers in its JSON: first, it’s not forbidden by the JSON RFC; second, many other programming languages do not have a “default” number type when it comes to serialising/deserialising JSON.</p><p>For instance, your Java backend may use the Java `Long` data type for bigger numbers, which is not compatible with the JS `number`, and nobody may even notice any issue until that limit is breached (e.g. IDs that are being incremented in the database). For example, code along these lines (reconstructed here as a simplified sketch, serialising with Jackson):</p><pre>import com.fasterxml.jackson.databind.ObjectMapper;

class Product {
    // Long.MAX_VALUE is 9223372036854775807, far above JS's MAX_SAFE_INTEGER
    public Long id = Long.MAX_VALUE;
}

class Demo {
    public static void main(String[] args) throws Exception {
        System.out.println(new ObjectMapper().writeValueAsString(new Product()));
    }
}</pre><p>will produce the following JSON (it is not possible to parse it correctly with JSON.parse; if you copy-paste it into the console you will see the result):</p><blockquote>{"id":9223372036854775807}</blockquote><p>This can be solved by parsing JSON without using JSON.parse and handling big numbers as BigInts. There is a library for that (tm) here:<a href="https://github.com/sidorares/json-bigint"> https://github.com/sidorares/json-bigint</a>. It should cover most of the cases, but I wouldn’t be writing this article if it was the end of our struggle.</p><p>Generally, if your app is 100% client-side (SPA) and has no nodeJS server runtime (for instance something like expressJS, nuxt or next), then you should be fine with just making sure you are always using a JSON parser that can parse bigger numbers, like the one I just mentioned.</p>
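<p>Its usage is straightforward (a minimal sketch; the useNativeBigInt option makes the library produce native BigInt values rather than its default BigNumber objects):</p><pre>const JSONbig = require('json-bigint')({ useNativeBigInt: true });

const parsed = JSONbig.parse('{"id":9223372036854775807}');
console.log(parsed.id);        // 9223372036854775807n
console.log(typeof parsed.id); // "bigint"</pre>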
<p>For us, it was not the end of the story. We have a customer web app with server-side rendering, meaning we render the HTML out of VueJS components using nodeJS and a server-side rendering framework called nuxtJS.</p><h3>Behind the Server Curtain: Tackling BigInt Issues in NodeJS SSR Apps</h3><p>On the server side, the problem is generally the same, with the main difference being that we now (generally) want to parse many more JSONs, non-stop. In order to render (almost) any page, we need to make several HTTP requests to some of our backend services from the nodeJS (nuxtJS) app — even with the HTTP caching that we have, this is still needed due to different customisations (for different users, cities, languages, stores, time of the day, etc), the experimentation we do a lot of (things like A/B tests) and of course the almost real-time nature of our data (stores open and close, they edit products, start or finish promotions, etc).</p><p>When testing the application locally, everything works well with json-bigint parsing every backend API network response on the nodeJS server. Similarly, in the real-life scenario of a user navigating our website, their browser will only parse a few JSON strings per minute (usually per page). In these “light” scenarios the json-bigint library works well. But in reality, each of our production nodeJS servers can be parsing tens or hundreds of JSONs every second, non-stop, 24/7. I am not entirely sure if there is a memory leak in the mentioned library or it’s just generally more memory-hungry, but the reality is that we didn’t find a way to use it on the server without a significant performance degradation caused by the excessive memory used when enabling this library. We kept it as-is on the client, but we needed to find a way to parse JSONs with BigInts on the server, and, for our use-case, it had to be something less hungry in terms of memory.</p><p>We could not find any alternative solution for our case on the web; we were looking for something relatively popular and supported, slim so it won’t inflate our JS bundle, or at least something simple so that we could fork and support it ourselves. Most of the solutions are either too heavy in terms of bundle size, or are not tested well enough and may fail on a valid JSON: <a href="https://github.com/Ivan-Korolenko/json-with-bigint/issues/3">here’s one example I found</a>. After losing all hope, we finally managed to come up with a solution that works (well, kinda, more about that later).</p><p>What we did was take the original JSON string and, before parsing it, run a regular expression on it. This regex replaces all the bigger numbers with a string which contains a predefined prefix and the number itself right after it. In our case it would transform this JSON:</p><blockquote>{"id":9223372036854775807}</blockquote><p>Into something like</p><blockquote>{"id":"APP_SERIALISED_BIGINT::9223372036854775807"}</blockquote><p>The tricky part here is the security and reliability, because manipulating JSON strings (especially with some constants that you later transform) might be dangerous. To make this approach safer we need to make sure that what we replace is actually a big number, and not something that just looks like one. For instance, if implemented poorly, there could be a number of ways to break it:</p><ul><li>If the original JSON already includes the “prefix” constant (APP_SERIALISED_BIGINT::) somewhere — this may easily break the app if there is no actual number after it. 
Among other risks, if the constant is placed into a JSON key, this can lead to many potential ways of allowing an attacker to manipulate the JSON structure.</li><li>Big numbers inside strings, especially if the JSON has a string property which is itself another JSON — the function may start replacing values inside it, which can break the JSON structure because it won’t work with escaped chars (this behaviour is not desirable, since the underlying JSON should be separately parsed with the same BigInt-enabled parser function).</li></ul><p>A well-tested regular expression should help here.</p><p>After this we can take the transformed string and feed it into the native JSON.parse. The second argument of the .parse method is called `reviver`; it is a function which is called for every value when parsing a JSON string, and it allows modifying each value found in the JSON object. The code for this would look like this:</p>
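<p><em>(A simplified sketch; treat the regex as illustrative, since the production parser linked below also guards against the string-value and escaping pitfalls described above.)</em></p><pre>const PREFIX = 'APP_SERIALISED_BIGINT::';

function parseWithBigInt(json) {
  // Quote any integer of 17+ digits that sits in value position (right
  // after ':', ',' or '[') and mark it with the prefix. Integers of 17+
  // digits are always unsafe; a thorough check would also compare
  // 16-digit ones against Number.MAX_SAFE_INTEGER.
  const prepared = json.replace(
    /(?<=[:,[])\s*(-?\d{17,})(?=\s*[,}\]])/g,
    (_, num) => `"${PREFIX}${num}"`
  );
  // Let the native JSON.parse do the heavy lifting; the `reviver` turns
  // the marked strings back into real BigInt values.
  return JSON.parse(prepared, (_key, value) =>
    typeof value === 'string' &amp;&amp; value.startsWith(PREFIX)
      ? BigInt(value.slice(PREFIX.length))
      : value
  );
}

// parseWithBigInt('{"id":9223372036854775807}').id === 9223372036854775807n</pre>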
<p>With this approach we do a (relatively) memory-cheap operation of replacing some things in the string, and we use the native JSON.parse, which in theory should be more efficient than any JSON parser written in JS. The only problem with this approach is that, in order to properly detect where the big number is (to avoid the cases mentioned above, like JSON values inside strings), we need to use a regular expression with a negative lookbehind assertion, which has<a href="https://caniuse.com/js-regexp-lookbehind"> very limited browser support</a>, most notably it needs both macOS and iOS Safari to be at version 16.4 or higher. Luckily, it’s been supported on nodeJS since 8.10.0, so this will still work well on the server. For the browsers (mostly because of Safari) we still had to ship the custom JSON parser based on json-bigint.</p><p>The source code of the regex-based parser we created is available on GitHub at <a href="https://github.com/vd3v/big-jason">https://github.com/vd3v/big-jason</a>, as well as the npm package at <a href="https://www.npmjs.com/package/big-jason">https://www.npmjs.com/package/big-jason</a>.</p><p>We did a performance test where we parsed a large JSON string containing different types of data, including some big numbers, using both the regex method and the json-bigint library, thousands of times. The results show that the regex approach consumes way less memory, around 70% less of both heap and RSS memory, but it is more CPU-intensive, so it takes around 40% more time. We had no choice but to see how much of a tradeoff that would be in production. As it turned out, there was no noticeable change in memory consumption compared to what we had before (when we were just losing big numbers while using the default JSON.parse), and the CPU didn’t show much change in the average load either. Before, when we were trying to use the json-bigint library in production, our servers had what we call a “slow memory leak”, where the memory would grow over time without being cleared up, until K8S started killing the machines that ran out of memory. Under high traffic a server would not survive more than a minute or two, which rendered that approach unusable for our application.</p><p>Now that we found a way to parse JSON strings with BigInt, we should be good to go, right?</p><p>Right… Not exactly. Since we are talking about an application with server-side rendering, all major SSR frameworks (like next, nuxt and svelteKit) have a process called “hydration”. This happens when the browser loads a page pre-rendered on nodeJS and then needs to instantiate the components of the UI framework (like react, vue or svelte); they need to have the same state as when they were rendered by the server, otherwise the user will see a page with all the data and then (once the UI library is mounted) it will become empty (it can produce many more issues, including completely breaking the app). To make this work, these SSR frameworks have their own ways to serialize the state, which will later be read on the client and injected into the components’ state before they are mounted.</p><p>As you might’ve guessed, this is exactly where the next problem with BigInts happened. Luckily, as of today this is not that much of an issue for fresh versions of both nuxt and nextjs, since they’ve released an update. They both rely on the<a href="https://github.com/Rich-Harris/devalue"> devalue</a> library, which got<a href="https://github.com/Rich-Harris/devalue/pull/32"> support for BigInts</a> in August 2022, so there is still a chance that you’re using an outdated version without BigInt support. In our case the problem was that we were stuck with nuxt2, which relies on its own fork of devalue that still doesn’t support BigInt. On top of that, we were using @nuxtjs/composition-api, which also handles some logic related to serialization and uses good old JSON.stringify. As a quick fix, we made a patch to those dependencies in our project with pnpm patch, but I will also open PRs to those repos with a patch to potentially help other developers struggling with BigInt on old nuxt.</p>
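<p>For context, this is the wall every such state serializer hits; the native JSON.stringify simply refuses BigInt values:</p><pre>JSON.stringify({ id: 9223372036854775807n });
// Uncaught TypeError: Do not know how to serialize a BigInt</pre>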
<h3>Conclusion</h3><p>With the recent patching of server-side rendering (SSR) libraries like Nuxt and Next to support BigInt, a brighter future appears on the horizon for BigInt in JavaScript. However, the true problem lies in the JSON standard itself, as there are different ways to understand and implement it when it comes to parsing numbers. The lack of native support for BigInt in JSON parsing poses an ongoing challenge, forcing developers to use third-party solutions for such a core thing as parsing a network response. The hope is that major browsers will collectively embrace a more inclusive approach, supporting numbers of all sizes within JSON, or that a new JSON standard will be established that allows representing big numbers, for instance with the JS BigInt literal (for example: <em>{ "id": 123456789123456789n }</em>). This would allow developers to use BigInts seamlessly across both web applications and native mobile apps, not to mention cross-service communications.</p><p>I hope reading about our struggles was useful or interesting to you. Please share your thoughts, questions, or even your own solutions in the comments. I am curious to hear how you handle BigInts in your apps (or whether you intentionally avoid them) and what your experience has been. Do you think this should become a part of the ECMA/JSON specification? Thank you very much for reading and have a nice day!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2ef6d00f5d0c" width="1" height="1" alt=""><hr><p><a href="https://medium.com/glovo-engineering/cracking-the-code-js-bigint-and-the-art-of-future-proofing-your-app-2ef6d00f5d0c">Cracking the Code: JS, BigInt, and the Art of Future-Proofing Your App</a> was originally published in <a href="https://medium.com/glovo-engineering">The Glovo Tech Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Accelerate Your Android Development: Essential Tips to Minimize Gradle Build Time (Part II of II)]]></title>
            <link>https://medium.com/glovo-engineering/accelerate-your-android-development-essential-tips-to-minimize-gradle-build-time-part-ii-of-ii-b74f5d505982?source=rss----9bbfe5be0af5---4</link>
            <guid isPermaLink="false">https://medium.com/p/b74f5d505982</guid>
            <category><![CDATA[developer-experience]]></category>
            <category><![CDATA[build-time]]></category>
            <category><![CDATA[optimization]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[gradle]]></category>
            <dc:creator><![CDATA[rolgalan]]></dc:creator>
            <pubDate>Mon, 06 Nov 2023 08:52:30 GMT</pubDate>
            <atom:updated>2023-11-06T08:52:30.214Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="Long exposure picture of a highway, where the lights from the vehicles are really long continuous lines covering the road (and vehicles are actually not visible), giving a sense of really high speeds." src="https://cdn-images-1.medium.com/max/1024/0*5sxKeJAFQ4ugQo8j" /><figcaption>Photo by <a href="https://unsplash.com/@i_am_g">Guillaume Jaillet</a> on <a href="https://unsplash.com/photos/Nl-GCtizDHg?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditShareLink">Unsplash</a></figcaption></figure><h3>Introduction</h3><p><a href="https://medium.com/@rolgalan/4f35aa4a1a17">In the previous part of this article</a>, we emphasized how reducing build time can enhance developer productivity and business value.</p><p>We highlighted that caching the output of previous tasks for reusability and leveraging parallel builds are the most impactful actions.</p><p>Let’s review now some other techniques and configurations to keep improving your build times. Even if some of these might not be as effective as the ones outlined in the first article, they are still quite relevant. After you have already applied all of the previous actions, all the new ones will become quite significant to reduce even more your build time.</p><p>As it was mentioned previously, all the learnings shared here have been acquired from Android projects, but <strong>all of the Gradle techniques discussed here can be applied to any other Gradle project unrelated with mobile</strong>.</p><h3>The hardware</h3><p>While it may seem obvious, upgrading the machines that build your app should be one of your first considerations to reduce build time. This means both the remote agents from your CI/CD and your local development laptop. (Are your engineers already using M2s? 👀).</p><p>Given that building an application is a CPU and memory-intensive process, it’s crucial to understand the machines on which the project runs. <strong>Number of cores is decisive to execute tasks in parallel</strong>, as well as their clock rate to execute fast. At the same time you are going to need a lot of memory available to be able to run the whole process (specially if you parallelize). We mentioned in the previous article that it is important to parallelize; if you are investing on that, it makes sense also to make sure your machines are going to support it. In the next section we’ll discuss how this parallelization impacts also the memory.</p><p>Although often overlooked, disk I/O throughput is critical as app building involves constant disk read and write operations. We learnt this the hard way!. Quite recently we detected huge penalties during a CI agents migration to different runners, specially during the Gradle <a href="https://docs.gradle.org/current/userguide/incremental_build.html#sec:how_does_it_work">task fingerprinting</a>. The <strong>time when reusing tasks from the cache was increased by 3x when changing from Fargate to EC2 due the default disk used in the latter had worse capabilities</strong>. If you are building your projects in AWS, make sure your disk is <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html#nvme-ssd-volumes">NVMe</a>.</p><p>While disk space may seem trivial in today’s context and often goes unmentioned, we encountered issues when using CI agents with only 20GB of disk space (this was the limit in AWS Fargate at some point). 
<h3>The JVM memory settings</h3><p>As previously mentioned, the build process demands a significant amount of memory, making <strong>memory configuration the most important setting for your project</strong>. Since Gradle executes in a JVM process, this should be done through the org.gradle.jvmargs property in the gradle.properties file.</p><p>By <a href="https://docs.gradle.org/current/userguide/build_environment.html#sec:gradle_configuration_properties">default</a>, Gradle sets org.gradle.jvmargs=-Xmx512m -XX:MaxMetaspaceSize=384m, which is arguably quite small for the development of any Android application nowadays.</p><p>There are many things playing together here and it’s important to take all of them into account, <strong>especially if your system is constrained and you cannot have all the RAM you would like</strong>. Let’s go one by one:</p><p><strong>The heap is the most important part</strong> and will help you reduce the time spent in Garbage Collection, maximizing throughput, so make sure to set a high enough Xmx value in org.gradle.jvmargs. At the same time, the initial heap size will also help to avoid wasting cycles dynamically growing the heap (which requires the GC to run), so you should also set a reasonable Xms value (maybe half of your Xmx, or matching it).</p><p>But you have to be careful, because the <strong>Gradle Daemon</strong> will spin up a separate process to compile the Kotlin code, the <strong>Kotlin Compiler Daemon</strong>. By default this process inherits the <em>jvmargs</em> settings from the main Gradle Daemon, unless you add an extra kotlin.daemon.jvmargs property in the gradle.properties file. I recommend this, and you can probably limit it to a lower heap than the main Gradle Daemon.</p><ul><li>Alternatively you can configure the <a href="https://kotlinlang.org/docs/gradle-compilation-and-caches.html#defining-kotlin-compiler-execution-strategy">Kotlin compiler to be executed inside the main Gradle Daemon</a>, but there might be a performance penalty. We had this for a while, as the available memory in our legacy CI was quite limited and this was a good way to keep the usage under control.</li></ul><p>If you still have some Java code, Gradle will spawn separate workers for it, which used to be disposable, but since <a href="https://docs.gradle.org/8.3/release-notes.html#faster-java-compilation">Gradle 8.3</a> these are promoted to long-lived daemons. Keep an eye on these as well when configuring the memory.</p><p>Please note that setting any value for org.gradle.jvmargs overrides the Gradle defaults mentioned above, so if you increase the heap, you will lose the existing limit on the <strong>JVM Metaspace</strong>, as the JVM doesn’t set any limit by default. At some point we had issues related to extremely high usage of Metaspace, which was growing uncontrollably for some unknown reason, and we needed to set a maximum for it with -XX:MaxMetaspaceSize to prevent it from cannibalizing our available memory; it has not been a problem recently, though, and we no longer need this setting. If you are using <a href="https://docs.sonarsource.com/sonarqube/latest/analyzing-source-code/scanners/sonarscanner-for-gradle/#troubleshooting">SonarQube</a>, it lists Metaspace in its troubleshooting guide, so keep an eye on it.</p>
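<p>Putting these pieces together, a configuration could look like this (illustrative values only; the right numbers depend entirely on your project and your machines):</p><pre># gradle.properties (example values, tune for your own build)
org.gradle.jvmargs=-Xmx6g -Xms3g -XX:MaxMetaspaceSize=1g
# Give the Kotlin Compiler Daemon its own, smaller heap
kotlin.daemon.jvmargs=-Xmx3g</pre>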
<p><strong>Unit tests</strong> are also executed in a separate JVM process, usually one <a href="https://blog.gradle.org/how-gradle-works-1">Gradle Worker</a> for each test module. By <a href="https://docs.gradle.org/current/dsl/org.gradle.api.tasks.testing.Test.html">default</a> these test workers have a maximum of 512MB for the heap (regardless of your gradle.properties settings). If you keep your modules small, this should be enough, but you can increase the value with maxHeapSize = &quot;1024m&quot; inside a test { } block in your Gradle script.</p><ul><li>What is important here is that <strong>all these test workers will be launched in parallel, one per core, spawning separate JVM processes that can easily occupy a big chunk of the available memory in the machine if you have many cores</strong>. In our case we are now running with 16 cores, so this can easily add up to 16 GB just for these workers in parallel (for big applications, engineers rarely run the whole suite locally, so this impacts mostly the CI).</li><li>It is true that you can set the org.gradle.workers.max property to limit the number of workers executing in parallel… but why would you do that? You are paying for all of these extra cores to maximize what you can run in parallel. So use this only as a last resort.</li><li>Worth highlighting also that <strong>maxHeapSize can be configured for the whole app or overridden in a specific module</strong>. You might have some outlier that requires much more memory than you can afford to give all modules (due to their parallelization), so you could set a higher value for this specific one and leave the rest with a smaller default.</li></ul><p>And remember, <a href="https://developers.redhat.com/articles/2021/09/09/how-jvm-uses-and-allocates-memory#calculating_jvm_memory_consumption">the heap is not everything</a>. Usually the heap represents around 70% of the memory used by each Java process, Metaspace ~20%, and the remaining 10% is a bunch of different areas of native memory not really relevant for us at this point. If you have a limited amount of available memory in your machines you will need to take this into consideration when choosing your memory settings, so you ensure there is some memory left over for all the processes.</p><h3>Update dependencies</h3><p>Even if it may seem obvious to some, I frequently encounter queries in public forums from individuals struggling with outdated versions of the basic tooling. The truth is that Gradle, the JDK, AGP, Kotlin… all are constantly introducing performance improvements, so <strong>ensuring that your dependencies are up to date is usually a good way to keep your build times under control “for free”</strong>.</p><p>One of the latest and most relevant examples is Hilt/Dagger, the most common Dependency Injection framework in Android. This is one of the top contributors to slow builds in big projects, since it makes heavy use of annotations. It was based on KAPT for a long time, and it was not until some weeks ago that they <a href="https://github.com/google/dagger/issues/2349#issuecomment-1699569360">finally made the required changes to use KSP</a> instead, which is way faster as it doesn’t require some intermediate steps in the middle. 
So… are you on the latest Dagger version already?</p><p>The best thing you can do is introduce tooling that automatically updates your dependencies, such as <a href="https://github.com/renovatebot/renovate">Renovatebot</a> or <a href="https://github.com/dependabot">Dependabot</a>, which will regularly open PRs in your repos to keep you on the latest versions while running all the CI checks.</p><h3>Other Minor optimizations</h3><p>Everything mentioned so far will introduce really noticeable improvements in your build times.</p><p>Once the major improvements are implemented, you can consider minor optimizations to further reduce build time and address edge cases. Let’s see some examples.</p><h4>Pre-cache dependencies</h4><p>Usually building a project requires several dependencies to be downloaded to the system. This can easily add a couple of minutes (or sometimes more) to your builds. It is also a rather erratic delay, as it depends on network variability. In general it is good to have all or some of them already available, so you should apply the advice for <a href="https://docs.gradle.org/current/userguide/dependency_resolution.html#sub:ephemeral-ci-cache">Dealing with ephemeral builds</a>.</p><p>Naturally, this extends beyond typical build dependencies to include Gradle itself, Android SDK tools, and the system images required for your Robolectric tests. These should already be pre-downloaded on the agents running your CI.</p><p>If you use the official <a href="https://docs.gradle.org/current/userguide/github-actions.html#enable_caching_of_downloaded_artifacts">gradle-build-action with GitHub Actions</a>, this is done by default, and it even <a href="https://github.com/gradle/gradle-build-action#which-content-is-cached">caches many other elements</a> that will help accelerate your builds even more (particularly many contents of the home ~/.gradle folder, such as compiled build scripts, and more).</p><h4>Different settings for CI and local</h4><p>The structure of your CI builds, and how their tasks are launched, influences this. In our case, we aim to validate multiple aspects in each PR, including unit tests, linting, release builds, and UI test artifacts (even if not executed). This allows us to ensure that PR changes do not impact other necessary tasks in different stages.</p><p>Since we had some powerful CI agents with many cores, we previously launched a single execution that requested all tasks simultaneously, allowing Gradle to parallelize everything with its internal workers. This has completely different memory requirements from the day-to-day local builds of engineers, who usually launch tasks one by one. For this reason we tweaked the memory settings for the CI, overriding the jvmargs and other gradle.properties values on our CI agents.</p><p>Remember that anything declared in your home directory (~/.gradle/gradle.properties) will override the project settings, making it easy to modify the configuration for many of the other settings mentioned in this section.</p>
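<p>For example, the home-level properties on a CI agent could override the project defaults like this (illustrative values only):</p><pre># ~/.gradle/gradle.properties on the CI agent (example values)
# CI runs many tasks at once on a many-core machine, so allow a bigger heap
org.gradle.jvmargs=-Xmx12g -Xms6g -XX:MaxMetaspaceSize=2g</pre>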
<h4>Avoid daemon duplication from the IDE</h4><p>If the JAVA_HOME environment variable is different from the IntelliJ IDEA (or Android Studio) JVM settings, and you usually run tasks both from the IDE and the terminal, Gradle might duplicate the Gradle Daemon, consuming extra CPU and memory. Make sure you <a href="https://developer.android.com/build/jdks#jdk-gradle">configure both to be the same</a>.</p><p>There is a nice plugin maintained by Gradle engineers that will flag this misconfiguration and other minor optimizations. Check the <a href="https://runningcode.github.io/gradle-doctor/">Gradle Doctor plugin</a>.</p><h4>Stop watching the file system in CI</h4><p>Gradle has a nice <a href="https://blog.gradle.org/introducing-file-system-watching">feature</a>, enabled by default, that significantly accelerates incremental builds. It allows Gradle to keep what it has learned about the file system in memory between builds instead of polling the file system on each build, reducing the amount of disk I/O needed. You should have this enabled for your local development.</p><p>However, this is probably overkill for the CI, as nothing is expected to change (and you probably only run a single execution), so you can disable it explicitly by adding org.gradle.vfs.watch=false to your gradle.properties. Make sure you <strong>disable this only for the CI</strong>.</p><p>We haven’t quantified the impact of this (as we applied many other changes at the time), but intuitively, this setting seems unnecessary in the CI. I would love to hear from anyone having some data around the impact of this setting.</p><h4>Garbage collector</h4><p>Since the release of JDK 9, G1 has been the default garbage collector; however, the Android documentation <a href="https://developer.android.com/build/optimize-your-build#experiment-with-the-jvm-parallel-garbage-collector">encourages you to use ParallelGC</a> instead. In most cases the difference might not be big, but in others it might be huge.</p><p>For reasons still unclear to us, some complex clean builds targeting several tasks at once resulted in GC overhead errors or took over an hour with ParallelGC, despite allocating substantial extra memory to the heap. However, we managed to reduce this to around 35 minutes simply by switching to the G1 collector.</p><p>So I am not telling you to change your garbage collector, but I do <strong>encourage you to test different settings</strong>. Do not hesitate to try this (or other, newer GCs) if you’re having memory issues, as it might help, even if the reasons why are not clear.</p><h4>Fork test execution</h4><p>I mentioned at the beginning that Gradle runs a separate parallel JVM process for each module when running the test suite. You can also <a href="https://docs.gradle.org/current/userguide/performance.html#execute_tests_in_parallel">execute the tests of the same module in parallel</a> inside this process.</p><p>In our experience, this approach penalized our CI builds but benefited local builds, likely because CI executes everything simultaneously using all available resources, while engineers typically run tests for a single module, which is a lighter task that leaves some cores available.</p><h3>Experiment</h3><p>There are numerous other minor aspects that you should test and measure to ensure they suit your needs. The last few points are good examples of settings you can evaluate to decide if they bring any improvement for you. The build process is quite complex and depends on so many different pieces that some configurations need to be tested before making a decision.</p><p>Do not hesitate to experiment with different settings in order to fine-tune your build configuration and keep reducing the build time. The <a href="https://github.com/gradle/gradle-profiler">Gradle Profiler</a> is a really good tool that can help you with this.</p>
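<p>For instance, a typical benchmarking run looks like this (standard Gradle Profiler usage; swap in your own task):</p><pre># Runs the given task several times after warm-ups and reports timings
gradle-profiler --benchmark --project-dir . :app:assembleDebug</pre>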
The <a href="https://github.com/gradle/gradle-profiler">Gradle Profiler</a> is a really good tool that could help you with this.</p><h3>Conclusion</h3><p>As demonstrated, there are numerous strategies you can employ to enhance your build times. There might be many others, but these are the most relevant ones that helped the Mobile Platform team to significantly reduce the build times both locally and in the CI for the Glovo mobile apps.</p><p>The most important ones are to introduce a remote cache (specially for your CI), to ensure that the cacheability of the tasks works correctly, making sure that you leverage parallelization correctly, and having the right hardware and memory settings. Review <a href="https://medium.com/@rolgalan/common-gradle-misconceptions-03269b1a559">the first part of this article for more details about cacheability and parallelization</a>.</p><p>After addressing the major improvements, begin exploring other strategies to further reduce your build time, and conduct regular reviews to ensure your build times remain optimal.</p><p>Last, but not least, make sure to keep monitoring your build times to ensure there is no degradation over time and to quickly catch any abnormal increase due to some misconfiguration or other changes in your projects.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b74f5d505982" width="1" height="1" alt=""><hr><p><a href="https://medium.com/glovo-engineering/accelerate-your-android-development-essential-tips-to-minimize-gradle-build-time-part-ii-of-ii-b74f5d505982">Accelerate Your Android Development: Essential Tips to Minimize Gradle Build Time (Part II of II)</a> was originally published in <a href="https://medium.com/glovo-engineering">The Glovo Tech Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to survive on-call in 4 steps]]></title>
            <link>https://medium.com/glovo-engineering/how-to-survive-on-call-in-4-steps-259361f4d1d8?source=rss----9bbfe5be0af5---4</link>
            <guid isPermaLink="false">https://medium.com/p/259361f4d1d8</guid>
            <category><![CDATA[incident-response]]></category>
            <category><![CDATA[on-call]]></category>
            <category><![CDATA[engineering-culture]]></category>
            <category><![CDATA[incident-management]]></category>
            <dc:creator><![CDATA[ema]]></dc:creator>
            <pubDate>Mon, 30 Oct 2023 09:17:14 GMT</pubDate>
            <atom:updated>2023-10-30T09:17:14.096Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/500/0*Qsbs1jCkT52QXGft" /></figure><h3>Intro</h3><p>Many tech companies have adopted the concept of “on-call”, meaning being ready to provide support in case their service is not working as expected.</p><p>But while there are well-established best practices for the monitoring/observability and incident response platforms, I think we need to focus more on the preparation and the semantics of Incident Response.</p><p>So in this article, you can find 4 very personal tips derived from my on-call experience that I hope could be useful to share!</p><h3>Breathe</h3><p><strong>First things first: Put things in perspective</strong></p><p>This depends on the domain, but in the majority of tech companies, the incident that popped up on your phone and is making your heart race is not going to put someone’s life at risk, so the first thing to do is to remind yourself that <strong>you’re not a surgeon</strong>, nobody is going to die but it’s “just” about money.</p><p>I found that thinking about this and taking a couple of deep breaths before starting to look into what happened helps reduce anxiety and be more focused, rational and effective in handling the problem.</p><p>Finally, remember also that your <strong>goal now is to mitigate, not investigate and fix it for good</strong>, so:</p><ul><li>Rollback is better than pushing a hotfix</li><li>If you suspect a specific feature could be the cause, don’t hesitate to disable it to check!</li><li>The next working day is the right time to investigate deeper</li></ul><p>If you forget this and just take your time to dive deep during an incident it’s:</p><ul><li>Bad for you ➡️ personal time lost</li><li>Bad for other on-call persons who joined to help you ➡️ time lost</li><li>Bad for the company ➡️ pay you extra to work overtime</li></ul><h3>Be prepared</h3><p>The biggest work of incident resolution can be done BEFORE the incident. Here are 4 things to include in your team practices:</p><h4>Build (and maintain) an on-call handbook</h4><p>Have a shared team handbook for on-call</p><ul><li>Need to be the index, the source of truth of incident management</li><li>Need to be useful ➡️ links to the dashboard, links to logs, toggles, …</li><li>Need to be SHORT, FAST ➡️ you won’t have time to read</li><li>Need to be shared and maintained by the team</li></ul><h4>Write SOPs</h4><p>Every time you solve a problem that:</p><ul><li>Require you some work, like coding a script or crafting an API request</li><li>Is not extremely specific, so could be useful again</li></ul><p>It’s time to create a Standard Operating Procedure!</p><p>This basically means the next time that during an incident there is a situation, before jumping into crafting a solution you take a look and find an SOP, then just follow the steps.</p><p><strong>Example</strong>: An event consumer stopped working due to a bug and some critical events got lost. 
To solve the issue, the on-call engineers will probably do something like:</p><ul><li>Move the offset of the consumer group, if the messages are still in the broker</li><li>Force the sender to resend the messages</li><li>Manually sync the data, reading from the sender’s DB</li></ul><p>Even though the problem was different, it happened another time that this service was out of sync, so you find an SOP and just follow the steps: after a few minutes, the problem is fixed.</p><p>These tasks are not straightforward and could even make the situation worse, plus it is much harder to think straight with anxiety, maybe after being woken up in the middle of your sleep. Imagine how much easier it would be to find a step-by-step guide at that moment.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/568/0*91SwGmQW98jUxRrn" /></figure><p>To recap: <strong>common problems have common solutions</strong></p><ul><li>It is hard to think straight during an incident</li><li>SOPs can immensely reduce mitigation time</li><li>Do your homework (the day after, not during the incident)</li></ul><h4>Roleplaying</h4><p>Take time to train (especially new joiners) on realistic scenarios!</p><p>An experienced engineer who has already managed many incidents can raise a fake incident, and then pretend to be an RTO agent who can provide details on the problem. The trainees will then need to go through logs and metrics and apply resolutions, while the expert engineer “shadows” them.</p><p>If you ever played D&amp;D, this is what I am talking about. I found this teaching technique to be far more effective than any other.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/0*Onpn5pO-equuO6kP" /></figure><h4>Prioritize OPS tickets</h4><p>This point is mainly for engineering managers or lead engineers: you need to make sure that critical ops tickets get prioritized, or they will never be done.</p><p>By this, I don’t mean forcefully pushing them into the sprint, but having a conversation with your PM/Business team and kindly explaining to them why this is important and what the business implications of NOT doing it are.</p><p>If you explain it correctly, I am sure that they will be the first ones willing to push for them.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/717/0*a6e2ia7qwtNXHz--" /></figure><h3>Team Power</h3><p>This point seems very straightforward, and yet I’ve seen people not doing it so many times!</p><p><strong>Call for help</strong>: the on-call engineers are a team; don’t try to solve the problem on your own if another service is involved. E.g. if there’s an infrastructure problem and you’re not an SRE: don’t wait! 
Call one!</p><p>The mitigation time can be hugely reduced. And since they are on-call too, you should likewise expect to be called <strong>not only for a problem on your specific service</strong> but for anything in the company you could be helpful in solving.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/400/0*8Vadfeaw6ANV3zSc" /></figure><p>Of course the same applies to you, so <strong>help others if you want to be helped</strong>: join others’ on-call incidents and actively help until they’re solved.</p><h4>About guilt</h4><p>It’s very important to set a blame-free culture in your company, not only because it’s the right thing to do but also because it’s more effective: people will have less anxiety and will focus more on the learnings from an error rather than on the blaming.</p><p>Also, even if it was you who wrote the bug that is destroying prod, <strong>it should not be that easy to compromise a company’s service</strong>: there should be processes put in place by engineering leaders like code reviews, automatic rollbacks, automatic migration checks, … everything that is commonly defined as “Guardrails” (ref. The Staff Engineer’s Path — Tanya Reilly).</p><h3>If everything else failed</h3><p>A couple of suggestions to put in your handbook, for when you encounter a difficult problem that is not easy to debug:</p><ul><li>Check recent deployments (<strong>not only of your service</strong>)</li><li>Check recent feature toggle changes (<strong>not only of your service</strong>)</li><li>Try to focus on a specific case: even if millions of orders/users have errors, <strong>focus on one</strong> and deeply debug it</li><li>Call for help: call other teams, call all your team, call your manager</li><li>Talk more with RTO agents:<br>- Ask for more cases or whether they can spot some pattern in the problem<br>- Consider a manual solution, sometimes it’s faster than coding a script<br>- Ask them for possible mitigations like closing the service or sending a message to the customers; you’re less prone to customer churn if you’re honest and declare that you have a problem and are working on it</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/498/0*G7VCsdo1NBTd-B8_" /></figure><h3>Recap</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/291/1*KsPqPfoiFnbmP4Nm-EpqHQ.png" /></figure><h4>DON’T</h4><ul><li>Try to fix the problem for good.</li><li>Hesitate to call other teams.</li><li>Blame or fear being blamed; every incident is everyone’s fault.</li><li>Forget that, even if you wrote a bug, it should not be that easy to compromise a company’s service.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/333/1*7xeGRW5g95iyy_PTnFCxAQ.png" /></figure><h4>DO</h4><ul><li>Do your homework: Handbook, SOPs, Prioritize Ops tickets, Training.</li><li>Go for the fastest way to mitigate.</li><li>Call for help, even if it’s your ownership.</li><li>Talk and coordinate with the RTO.</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=259361f4d1d8" width="1" height="1" alt=""><hr><p><a href="https://medium.com/glovo-engineering/how-to-survive-on-call-in-4-steps-259361f4d1d8">How to survive on-call in 4 steps</a> was originally published in <a href="https://medium.com/glovo-engineering">The Glovo Tech Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Accelerate Your Android Development: Top Techniques to Reduce Gradle Build Time (Part I of II)]]></title>
            <link>https://medium.com/glovo-engineering/accelerate-your-android-development-top-techniques-to-reduce-gradle-build-time-part-i-of-ii-4f35aa4a1a17?source=rss----9bbfe5be0af5---4</link>
            <guid isPermaLink="false">https://medium.com/p/4f35aa4a1a17</guid>
            <category><![CDATA[android]]></category>
            <category><![CDATA[gradle]]></category>
            <category><![CDATA[optimization]]></category>
            <category><![CDATA[developer-experience]]></category>
            <category><![CDATA[build-time]]></category>
            <dc:creator><![CDATA[rolgalan]]></dc:creator>
            <pubDate>Mon, 23 Oct 2023 09:11:37 GMT</pubDate>
            <atom:updated>2024-01-02T11:44:24.327Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="Super sonic plane, breaking the sound barrier, with the distinctive white cloud in the tail of the aircraft that happens at that high speed." src="https://cdn-images-1.medium.com/max/1024/0*UDmTD5X42PWU_IaU" /><figcaption><a href="https://commons.wikimedia.org/wiki/File:FA-18_going_transonic.JPG">Realbigtaco, CC BY-SA 3.0</a> (flipped)</figcaption></figure><h3>Introduction</h3><p>Everyone wants faster builds, don’t they? <strong>Reducing build time is one of the most important actions you can take to improve the Developer Experience, as it shortens the feedback loop and helps engineers iterate faster</strong>, reducing their idle time and allowing your team to stay focused on delivering new features and adding business value to your app.</p><p>Accelerating the build process helps your developers in two ways: first, while working locally, they can compile and execute tests faster; and second, CI checks pass sooner, so they don’t need to wait a long time before PRs are mergeable (besides review time, of course, but that’s a human problem this won’t solve). Both are important, and the good news is that most of the techniques for faster builds positively impact both local and CI builds.</p><p>Over a year ago, the CI checks executed on our Pull Requests took around 1 hour. By analyzing our builds and applying many different improvements, we managed to reduce them to around 15 min on average.</p><p>Gradle offers good documentation on how to <a href="https://docs.gradle.org/current/userguide/performance.html">improve the performance of the builds</a>, and the Android documentation also provides some good tips to <a href="https://developer.android.com/build/optimize-your-build">optimize your build speed</a>. However, <strong>most of these recommendations are not as straightforward as enabling a flag or changing a value.</strong> In this series of articles we’ll <strong>explain the most important ones in more detail and how to leverage them</strong>, but also go beyond them and highlight other techniques and findings that helped Glovo reduce its build times by 75%.</p><p>This article is divided into two posts: in this first one we will discuss the most impactful options that you can enable to cut your build times significantly. In the next one, we will cover actions that are less impactful but still quite relevant, and highly recommended once you have applied the first ones.</p><p>Even if all these learnings were extracted from a long journey optimizing Android projects, <strong>all of the Gradle techniques discussed here can be applied to any other Gradle project unrelated to mobile</strong>.</p><h3>Parallel tasks</h3><p>This is principles of computation 101: executing multiple tasks simultaneously rather than sequentially will expedite the entire process.</p><p>Parallel execution makes the build process significantly faster, as many tasks can run at the same time. <strong>Some of our projects build in around 6 min, whereas serially they would take around 30 min</strong>.</p><p>This is disabled by default in Gradle. You should add org.gradle.parallel=true to your properties file.</p><p>This flag allows Gradle to build independent subprojects in parallel. <strong>Projects</strong> are the Gradle terminology for what we usually refer to as <strong>modules</strong>.</p>
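<p>As a minimal sketch, this is what the relevant entries in gradle.properties would look like (the worker cap is an optional, illustrative value; by default Gradle uses the number of CPU cores):</p><pre># gradle.properties
# Build independent subprojects in parallel.
org.gradle.parallel=true

# Optional: cap the number of worker processes (defaults to the number of CPU cores).
org.gradle.workers.max=8</pre>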
<p>Fortunately, <a href="https://developer.android.com/topic/modularization">modularizing your application</a> is part of modern Android development practices, so I am not going to get into details about it. However, these two topics are highly connected.</p><p>Please do not think that enabling this flag and having modules is all that is required, as this is not true. The reality is that <strong>you need to design your module architecture in a conscious way to leverage all the parallelism</strong> that Gradle can provide. This is because each module depends on other modules, and a module cannot start building until the ones it depends on are completed.</p><p>For this reason you should <a href="https://www.droidcon.com/2022/11/15/modularization-flatten-your-graph-and-get-the-real-benefits/">flatten your module graph</a>, reducing its height and cross dependencies. Once you do, you get the real benefit of parallel task execution.</p><p>Besides the architecture, parallelization will be constrained by the hardware: not only the number of cores, but also the memory available to the system. We will review the hardware configuration in detail in the following article.</p><p>Note that some parallelization, at different levels, might still happen even with this flag disabled; see <a href="https://medium.com/@rolgalan/common-gradle-misconceptions-03269b1a559a">Common Gradle misconceptions</a> to learn more.</p><h3>Cacheability</h3><p>Another fundamental concept from computation is reusing the results of previously executed tasks.</p><p>Gradle’s core mechanism for improving build performance is <em>“the ability to avoid doing work that has already been done”</em>. Gradle bases this on the concepts of <a href="https://docs.gradle.org/current/userguide/incremental_build.html">incremental builds</a> and the <a href="https://docs.gradle.org/current/userguide/build_cache.html">build cache</a>.</p><p>Both are interconnected, as they are based on the same principle of fingerprinting each task’s inputs and storing its outputs. The only difference is that <strong>incremental builds live in the project scope</strong> (the outputs live in the build folder, so they don’t survive a clean), whereas <strong>the cache is persisted somewhere else</strong>, allowing you to reuse the outputs of previously executed tasks even if you clean up the project completely (as long as their inputs don’t change).</p><p><strong>The build cache itself has two levels: local and remote</strong>. Local caches are usually stored in your home directory (typically ~/.gradle/caches unless you <a href="https://docs.gradle.org/current/userguide/build_cache.html#sec:build_cache_configure_local">declared it</a> somewhere else). This really helps for local development, but it will barely help for most CI executions, especially if you are using ephemeral agents, which is the most common case (unless you have some shared “local” disk).</p><p>Incremental builds work out of the box in Gradle. In order to enable (<strong>local</strong>) caching, just add org.gradle.caching=true to your gradle.properties.</p><p>This is good already, but what really pays off is the <a href="https://docs.gradle.org/current/userguide/build_cache.html#sec:build_cache_configure_remote">remote cache</a>, which allows reusing tasks cached by a different machine, so if you have ephemeral builds this is a great way to reduce the execution time by reusing what was already built in previous jobs.</p>
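<p>To give a feel for what this looks like, here is a minimal sketch of declaring a remote HTTP cache node in settings.gradle.kts; the cache URL and environment variable names are hypothetical:</p><pre>// settings.gradle.kts: a minimal sketch; the cache URL and env vars are hypothetical.
buildCache {
    local {
        isEnabled = true // local cache for development machines
    }
    remote&lt;HttpBuildCache&gt; {
        url = java.net.URI.create("https://gradle-cache.example.com/cache/")
        // Only CI populates the remote cache; local builds only read from it.
        isPush = System.getenv("CI") != null
        credentials {
            username = System.getenv("GRADLE_CACHE_USER")
            password = System.getenv("GRADLE_CACHE_PASSWORD")
        }
    }
}</pre>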
<p>Engineers running local builds can also benefit from this remote cache (especially when switching branches or fetching changes from the upstream repository). Among all the techniques described in these articles, the remote cache was by far the most successful one for us, <strong>helping us cut our CI build times in half</strong> when it was introduced.</p><p>In order to do this, you will need to maintain a remote Gradle build cache node somewhere and declare it in your Gradle build files. You can either run your own <a href="https://hub.docker.com/r/gradle/build-cache-node/">remote Gradle build cache</a> (we had this for a while through the internal Artifactory instance that we use for internal dependencies) or buy one of the services offering it (we currently get it from Develocity, among other features).</p><p><strong>The cache node needs to be configured with enough disk to store the artifacts generated/used by your organization</strong>, at least over the last 24 hours. Otherwise you won’t leverage this completely, as “old” cache entries will be evicted too soon, producing a higher miss rate than you should have. When we first introduced the remote cache, we didn’t notice that it was using only 10 GB of space by default. When we increased it to 100 GB, the remote hit rate went from 82% to 96%, with a ~20% build time reduction across all our projects.</p><figure><img alt="Graph of build time showing a more or less stable line of around 28–30 min median time, going down to around 18 min after the mentioned changes increasing the remote node disk storage were applied." src="https://cdn-images-1.medium.com/max/1024/1*cqk2kJkBD-BiDRbi3br6YQ.png" /></figure><p>Please note that <strong>modularization also plays an important role in cacheability</strong>, because modules are Gradle’s basic building blocks for cacheable tasks. The more you modularize, the more chances you have to reuse work, and the faster your builds will be.</p><p>Also make sure you make adequate use of <a href="https://medium.com/mindorks/implementation-vs-api-in-gradle-3-0-494c817a6fa">api vs implementation</a> (<strong>as a rule of thumb: always use implementation</strong>) to ensure you don’t invalidate the cache unnecessarily: <em>implementation</em> only requires recompiling the modules that depend on the changed module, whereas <em>api</em> also invalidates those depending on the parent.</p><h3>Optimizing Cacheability</h3><p>Similar to the case of parallelization, one might assume that simply enabling the cache and setting up a remote cache node would suffice. However, the devil is in the details, and cacheability is not trivial at all: <strong>tasks need to be carefully designed with cacheability in mind, and there are many reasons why a cache entry can get invalidated</strong>.</p><p>Even if you don’t create many tasks yourself, you might still be impacted by third-party tasks added to your build. In those cases you might not be able to fix the issue, but if you detect it you should have enough data to report it to the library authors (or propose a fix yourself if it’s an open-source project).</p><p>One of the most common reasons for a miss is tasks that take files as input but declare them as absolute paths, making cache entries produced from different locations mutually incompatible. Usually there is not much you can do when you face these cases other than reporting the issue to the original author and hoping for a quick fix.</p>
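<p>For your own tasks, you can avoid this class of problem by declaring file inputs with relative path sensitivity, so the fingerprint does not depend on where the project is checked out. A hedged sketch (this task and its properties are hypothetical):</p><pre>// A hypothetical custom task, sketched to show cache-friendly input declarations.
import org.gradle.api.DefaultTask
import org.gradle.api.file.ConfigurableFileCollection
import org.gradle.api.file.RegularFileProperty
import org.gradle.api.tasks.CacheableTask
import org.gradle.api.tasks.InputFiles
import org.gradle.api.tasks.OutputFile
import org.gradle.api.tasks.PathSensitive
import org.gradle.api.tasks.PathSensitivity
import org.gradle.api.tasks.TaskAction

@CacheableTask // opt the task into the build cache
abstract class GenerateReportTask : DefaultTask() {

    // RELATIVE sensitivity fingerprints files by relative path and content,
    // so the same inputs hit the cache on any machine or workspace path.
    @get:InputFiles
    @get:PathSensitive(PathSensitivity.RELATIVE)
    abstract val sources: ConfigurableFileCollection

    @get:OutputFile
    abstract val report: RegularFileProperty

    @TaskAction
    fun generate() {
        report.get().asFile.writeText("Analyzed ${sources.files.size} files")
    }
}</pre>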
<p>One minor optimization you can do here is to verify that your CI agents always use the same path for the project. This might sound crazy, but Jenkins by default loads the project into a folder named after the branch or the PR number. You can use a fixed workspace with the <a href="https://www.jenkins.io/doc/pipeline/steps/workflow-durable-task-step/#ws-allocate-workspace">ws command of the Pipeline step</a>.</p><p>Fortunately, most of the common libraries used for Android development are currently properly implemented with relative inputs (thanks to all the people constantly watching this and reporting to the library owners), but it is good to keep this in mind when analyzing why cached tasks miss.</p><p>Besides file paths, there are many other reasons for tasks to miss the cache, and you will need to keep reviewing your builds, <a href="https://docs.gradle.org/current/userguide/build_cache_debugging.html">debugging and diagnosing cache misses</a>.</p><p>In my experience, some of the other most common culprits on Android are dynamic values in the Manifest or BuildConfig files: these files are inputs to many other tasks down the line, so introducing any variability in them invalidates all the subsequent tasks. For example:</p><ul><li><em>Version name/code</em>. Some projects automatically increase the version code for each commit, which will easily invalidate most of your tasks between executions.</li><li><em>BuildConfig values</em>. You might be embedding the commit hash for verification/tracking purposes, or updating any other BuildConfig value dynamically.</li></ul><p>It is quite possible that you cannot completely get rid of these values in all of your flavors, but <strong>you should ensure you use fixed values in your CI and in your development flavor for all the above examples and any other dynamic property</strong> that could change frequently during development.</p><p>However, failures are not the sole issue. It’s crucial to evaluate the effectiveness of any caching mechanism used. Why? Because storing and retrieving cache items introduces some overhead. Some tasks are so simple that caching them produces negative savings, so it’s better to always run them. You can override any task’s behavior by setting specific conditions in task.outputs.doNotCacheIf(), as sketched below. There is also a <a href="https://github.com/gradle/android-cache-fix-gradle-plugin">really good plugin</a>, maintained by Gradle engineers, that automatically disables caching for the most common Android tasks known to have negative savings.</p>
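<p>A minimal sketch of that escape hatch (the task name is hypothetical):</p><pre>// build.gradle.kts: a minimal sketch; "generateVersionFile" is a hypothetical task name.
tasks.named("generateVersionFile") {
    // Rerunning this task is cheaper than fetching its output from the cache.
    outputs.doNotCacheIf("cheaper to rerun than to pull from the cache") { true }
}</pre>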
<p>In general, you can get a lot of insights about your builds from the free <a href="https://scans.gradle.com/">Gradle build scans</a>. However, if you want to go deeper you will need the paid <a href="https://gradle.com/">Develocity</a> (formerly Gradle Enterprise), which allows deeper analysis (particularly comparing two build scans to see which input change made a task rerun instead of being reused from the cache).</p><p>I would also encourage you to run the <a href="https://github.com/gradle/gradle-enterprise-build-validation-scripts">Gradle build validation scripts</a> regularly to get a complete overview of your tasks’ cacheability and detect any regressions quickly.</p><h3>Configuration Cache</h3><p>You might be wondering if this is about configuring the cache… However, Configuration Cache is totally unrelated to cache configuration 😅.</p><p>Before executing your tasks, Gradle needs to create a task graph, by evaluating your build scripts, to know what the dependencies of your project are. Depending on your structure, how many plugins you have, and how big your project is, this process can be time-consuming.</p><p>Fortunately, Gradle can now <a href="https://docs.gradle.org/current/userguide/configuration_cache.html">cache the output of this configuration phase</a>, allowing it to be reused to win some time back in the next builds. In general we had this enabled only for local builds, as due to the ephemeral nature of CI builds it is most probably not worth spending time caching this there. You can enable it easily by adding org.gradle.configuration-cache=true to your gradle.properties file.</p><p>As with the previous cases, this is not as easy as enabling the flag. What a surprise, huh?</p><p>Initially, this requires your Gradle scripts to adhere to really strict rules, which you may not have been following (unless you’ve been keeping up with the latest Gradle best practices). Depending on how many Gradle customizations you have, you may need to invest a significant amount of time rewriting tasks to follow the new rules, which ensure that all tasks are truly independent of each other and do not rely on “global” inputs that might change due to side effects.</p><p>With that, you can start using the Configuration Cache. However, this might still not be enough. This one is funny: usually we only care about not violating the Configuration Cache rules, yet the way we define our tasks can still cause the cache to be invalidated so frequently that we don’t leverage the feature at all.</p><p>A good example of this happened to us quite recently. One of our custom tasks ensured that some lints ran only on the modified code, so it used the result of git status as an input. Therefore, any time <strong>a new file</strong> was modified, the configuration cache got invalidated. The same happens if you read the HEAD hash (maybe you want to keep it for some verification/tracking), or if you count the commits (for example, to automate the versionCode). In all of these cases you will see a warning like:</p><p>Calculating task graph as configuration cache cannot be reused because output of the external process ‘git’ has changed.</p>
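<p>To make this concrete, here is a hedged sketch of the kind of configuration-time logic that triggers that warning (it assumes the Android Gradle plugin; the BuildConfig field name is hypothetical):</p><pre>// build.gradle.kts: a sketch of the anti-pattern, not a recommendation.
// providers.exec runs at configuration time, so its output becomes an input of
// the configuration cache: every new commit invalidates the cached task graph.
val headHash: String = providers.exec {
    commandLine("git", "rev-parse", "--short", "HEAD")
}.standardOutput.asText.get().trim()

android {
    defaultConfig {
        // Hypothetical field; prefer a fixed value in your dev and CI flavors.
        buildConfigField("String", "GIT_SHA", "\"$headHash\"")
    }
}</pre>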
<p>There is another advantage of enabling this flag, and it was one of the most surprising things I learnt recently: Configuration Cache also <a href="https://docs.gradle.org/current/userguide/configuration_cache.html#config_cache:intro:performance_improvements">allows task parallelization inside each module</a>! <strong>Without the Configuration Cache, even if your module has some independent tasks, they will always run serially</strong>.</p><p>I am not familiar with the Gradle internals, but the Configuration Cache enforces really strict rules to avoid access to the project settings during execution, so my guess is that this “safeguard” also guarantees that independent tasks are not changing any state required by other tasks, allowing Gradle to run them in parallel.</p><h3>Conclusion</h3><p>In this article, we’ve outlined the most effective strategies to significantly decrease your Gradle build times.</p><p>If you think about it, it’s all based on a few basic yet powerful concepts: reusing what has already been done, and executing several tasks at the same time.</p><p>The beauty of this is that it all combines together: proper modularization allows better parallelization and caching/reusing more tasks; and when you reuse most of your tasks, more cores are free, leaving idle resources to parallelize at the module level.</p><p>Even if the underlying concepts are quite common, <strong>optimizing them requires constant dedication and a good understanding of their fundamentals</strong>. For this reason, at Glovo we have a Mobile Platform team monitoring and ensuring fast builds for the mobile projects, both in the CI and locally; this way the product engineers can focus on delivering business value fast, with quick feedback loops, without worrying about their build times.</p><p>In a few days we will share the second part, reviewing some other important techniques and settings to keep your build times low. Stay tuned!</p><p><strong>[UPDATE]</strong> Continue reading the second part in <a href="https://medium.com/glovo-engineering/accelerate-your-android-development-essential-tips-to-minimize-gradle-build-time-part-ii-of-ii-b74f5d505982">Accelerate Your Android Development: Essential Tips to Minimize Gradle Build Time (Part II of II)</a></p><hr><p><a href="https://medium.com/glovo-engineering/accelerate-your-android-development-top-techniques-to-reduce-gradle-build-time-part-i-of-ii-4f35aa4a1a17">Accelerate Your Android Development: Top Techniques to Reduce Gradle Build Time (Part I of II)</a> was originally published in <a href="https://medium.com/glovo-engineering">The Glovo Tech Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>