Monitoring Android Codebase

Published in

Tokopedia Engineering

8 min read, Nov 27, 2020

A story of how Android Engineers at Tokopedia became data-driven developers. Co-authored with Hendry Setiadi.

Working as an Android Engineer in a team is not merely about implementing features, debugging, or writing tests; we also need to think about how to maintain our codebase and development environment.

For example, when you start an Android project, you simply need to set up a hello-world project and implement a feature. But things grow quickly in just a couple of months, and you start adding more pieces to your development environment. You add CI/CD systems. Then you start refactoring the code. In our case, more than 50 engineers work together in the same codebase. Our codebase is changing constantly.

This is a sample screenshot from our GitHub showing how much code changes in just one week.

The data shown above is overwhelming. How can we manage a codebase in which, every week, new modules are created, new dependencies are incorporated, and a new app bundle is introduced? If this is not handled carefully, we might make bad decisions based on assumptions. We need to see the big picture from the high ground.

We need to see facts, not assumptions.

One of our principles is that we want to be Data-Driven Developers, who judge and make decisions based on data. For instance, is a module big enough that it needs to be broken down into smaller modules? How big is it? What part of it makes it so big? Is it the resources or the code? To answer these questions, we cannot just feel how big it is; we need to know the facts in numbers.

For that, we developed a data infrastructure that collects metrics from our codebase. These metrics include, but are not limited to: number of modules, Kotlin percentage, resource size, app size, number of dynamic feature modules, performance time, etc. In fact, we have hundreds of metrics that we can get insight from.

Data Architecture

Alright, we are just Android Engineers who need a simple approach. We are not talking about data architecture for the enterprise. We are just talking about data architecture for practical use. So here are the three essential processes when we want to work with data: collecting, storing, and visualizing.

In this blog post, we want to share our approach to providing data for our development environment. We will break down each process and share what technology stack we use to tackle each problem.

1. Collecting Data

The main source for this data collection is the codebase itself.

There are many kinds of data we can get from our source code. We can get source code information, such as source code size and source code history. At a more advanced level, we can get per-module dependency behavior, app bundle information, and much more.

For the simplest example: to find out how many Android modules are in the project, we can walk down from the root project and count the Android module folders. We can do this in many ways: a Java program, a shell or Python script, or a Gradle plugin. Anything is possible. But for gathering dependency and module data, we prefer to create a custom Gradle plugin, because we get better data integrity in terms of dependency and build configuration.

Pseudocode to get all module names in the Android project:

This can be translated to a Gradle plugin.
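As a rough illustration of the same idea outside Gradle, here is a minimal Python sketch. It assumes that every folder below the root containing a build.gradle or build.gradle.kts file is a module; the function names find_modules and write_csv are only illustrative, and this is a sketch rather than the actual Gradle plugin.

```python
import csv
import os

# Assumption: a folder is a module when it contains one of these build files.
BUILD_FILES = ("build.gradle", "build.gradle.kts")


def find_modules(root):
    """Return sorted (module_name, module_path) pairs found below `root`."""
    modules = []
    for current, dirs, files in os.walk(root):
        # Prune hidden folders and build outputs so they are never scanned.
        dirs[:] = [d for d in dirs if not d.startswith(".") and d != "build"]
        if current == root:
            continue  # the root project itself is not counted as a module
        if any(name in files for name in BUILD_FILES):
            rel_path = os.path.relpath(current, root).replace(os.sep, "/")
            modules.append((os.path.basename(current), rel_path))
    return sorted(modules, key=lambda module: module[1])


def write_csv(modules, output_path):
    """Write the module list with a ModuleName,ModulePath header."""
    with open(output_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["ModuleName", "ModulePath"])
        writer.writerows(modules)
```

Calling write_csv(find_modules("."), "output.csv") from the project root would produce a file with one row per module. A script like this only sees the folder layout, which is one reason a Gradle plugin, with access to the real build configuration, gives better data integrity.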

The above code will output a CSV file like the one below:

output.csv:
ModuleName,ModulePath
home,feature/home
search,feature/search
product,feature/product
image_picker,libraries/image_picker
…

The output does not have to be in CSV format. However, we recommend writing the output to a file. A JSON file is also great. This file can later become the input for the next step: Storing Data.

2. Storing Data

Once the data collection is ready, we might be able to analyze it directly. However, sometimes we want to compare today’s data with last month’s data and see any significant difference.

If we do not store this data, we lose the ability to compare against historical data. Therefore, storing data is important.

There are many options for data storage. We can use a database: a SQL database such as MySQL, a time-series database such as InfluxDB, etc. Or we can put the data in Google Sheets, which is more intuitive and easy to learn.
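To make the idea of historical comparison concrete, here is a minimal sketch that stores one snapshot of metrics per date and computes the difference between two dates. It uses SQLite from the Python standard library purely as a stand-in for whichever database is chosen; the table layout, the function names, and the metric names in the usage note are assumptions for illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a small metrics store: one row per date and metric."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics ("
        " snapshot_date TEXT NOT NULL,"
        " metric TEXT NOT NULL,"
        " value REAL NOT NULL,"
        " PRIMARY KEY (snapshot_date, metric))"
    )
    return conn


def save_snapshot(conn, snapshot_date, metrics):
    """Insert, or overwrite, the metrics collected on one date."""
    conn.executemany(
        "INSERT OR REPLACE INTO metrics VALUES (?, ?, ?)",
        [(snapshot_date, name, value) for name, value in metrics.items()],
    )
    conn.commit()


def diff_snapshots(conn, old_date, new_date):
    """Return {metric: new_value - old_value} for metrics present on both dates."""
    rows = conn.execute(
        "SELECT cur.metric, cur.value - prev.value"
        " FROM metrics AS cur JOIN metrics AS prev ON cur.metric = prev.metric"
        " WHERE prev.snapshot_date = ? AND cur.snapshot_date = ?",
        (old_date, new_date),
    )
    return dict(rows)
```

For example, after saving a snapshot such as {"module_count": 300} for one date and {"module_count": 310} for a later one (made-up numbers), diff_snapshots would report how much module_count changed between the two dates.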

3. Visualizing Data

Once the data is stored in a database or spreadsheet, we need to visualize it to analyze it better.

Again, we have many options for this. We can use any visualization dashboard tool: Google Data Studio, Grafana, Tableau, or Kibana.

Example of a visualization in Data Studio (mocked)

Data visualization is very helpful for better and faster analysis. Let’s say we want to compare the data between today and last week. It is easier to go to the dashboard and pick the dates, which immediately gives the result.

Initially, we used Google Sheets + Google Data Studio for development speed and simplicity. Google Sheets has an API, so we can populate the data with a Python program, while Google Data Studio has a gentle learning curve for integration and for building visualization blocks. Together they are a great combo, but sometimes we cannot get fine-grained control over them, especially when dealing with time-series data. Not only is Google Data Studio not optimized for data querying; storing data in Google Sheets is also not a good idea when you need more pivoting and flexibility for data exploration.

When we deal with data that is bigger and more frequent, we use the InfluxDB + Grafana stack alongside Google Sheets. InfluxDB is very well optimized for time-series data, and Grafana is an awesome tool for visualization. The best part is that both are free and open source ❤️. Using this stack also opens up more possibilities when we want to store any data: a REST API is ready out of the box, and the data can be explored faster. The paradigm has changed.
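As a sketch of what a time-series point might look like, the snippet below formats one metric as InfluxDB line protocol, the text format InfluxDB accepts for writes. The measurement, tag, and field names in the usage note are hypothetical, and the helper only builds the string; sending it to the write endpoint is left out.

```python
def _escape_key(text):
    # Tag keys, tag values and field keys escape commas, equals signs and spaces.
    return text.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def _escape_measurement(text):
    # Measurement names escape commas and spaces.
    return text.replace(",", "\\,").replace(" ", "\\ ")


def _format_field(value):
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'  # strings are double-quoted


def to_line(measurement, tags, fields, timestamp_ns):
    """Build one line: measurement,tag=value field=value timestamp."""
    tag_part = "".join(
        f",{_escape_key(key)}={_escape_key(str(value))}"
        for key, value in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape_key(key)}={_format_field(value)}" for key, value in fields.items()
    )
    return f"{_escape_measurement(measurement)}{tag_part} {field_part} {timestamp_ns}"
```

A call such as to_line("module_stats", {"module": "home"}, {"file_count": 120}, 1606435200000000000) yields a single line that can be posted to InfluxDB, with the module name as a tag, so that Grafana can filter and group by it.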

Automating The Process

We want the process of collecting and storing data to be done automatically and periodically. In the Tokopedia Android Team, there are several workflows for collecting data, depending on the use case and purpose. While we cannot explain each of them in this blog post, the diagram below shows one of the processes, in which we collect code and APK data for every app release.

Note that all of this is done automatically. Developers only need to monitor the respective Slack channels for notifications if anything goes wrong. Using this scheduling, we can easily see how our code is growing and see the trends, so we can anticipate or plan for better development in the future.

Examples of Codebase Reporting

A. Application Size - App bundle Percentage

Example Metrics/Analysis:

Total AAB size
What is the average size of the APK downloaded by users from the Play Store?
How many megabytes are saved by using on-demand Dynamic Features?
Breakdown of Dynamic Feature modules and their respective sizes

*Example of Simplified Application Size dashboard*

How we get the data:

Build the application bundle (AAB).
Convert the AAB into APKs. These APKs can later be extracted and analyzed. We can see that there is a “base” folder and also “dynamic features” folders. We can loop over all these folders to get approximate size information for the AAB.
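The folder loop described above could be sketched as follows. This assumes the APKs have already been extracted into a directory with one top-level folder per module, including one named "base"; the function names and that layout are assumptions, not the exact tooling we use.

```python
import os


def folder_sizes(extracted_root):
    """Return {top_level_folder: total_bytes} for an extracted bundle."""
    sizes = {}
    for entry in sorted(os.listdir(extracted_root)):
        path = os.path.join(extracted_root, entry)
        if not os.path.isdir(path):
            continue  # only per-module folders are measured
        total = 0
        for current, _dirs, files in os.walk(path):
            for name in files:
                total += os.path.getsize(os.path.join(current, name))
        sizes[entry] = total
    return sizes


def dynamic_feature_share(sizes, base_name="base"):
    """Percentage of the total size that lives outside the base folder."""
    total = sum(sizes.values())
    if total == 0:
        return 0.0
    return 100.0 * (total - sizes.get(base_name, 0)) / total
```

The sizes dictionary gives the per-module breakdown, and dynamic_feature_share turns it into a single app bundle percentage that is easy to plot over time. Sizes measured this way are approximate, since what a user downloads depends on the device configuration.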

B. Multi-Module

Example metrics/analysis:

Is that module already using Kotlin or still using Java?
What is the Kotlin/Java percentage for that module?
How big is the codebase for that module?
Can we know if the module is big because of the code or because of the drawable?
Is the module a Dynamic Feature module?
If it is a Dynamic Feature module, is it on-demand or install-time?
Does the module still depend on the legacy module or deprecated module?
Does the module have unit tests/instrumentation tests?
What is the coverage percentage?
Can we check code quality, for example bugs / vulnerabilities / code smells / etc.?

*Example of the simplified output in a spreadsheet*

How we get the data:

We create a Gradle task that runs after project configuration.
Our Gradle plugin loops over all the projects (sub-modules) in the root application project and gets the detailed information for each module. For example, to find out whether a module is already a Dynamic Feature module, we can check whether the module has the line “apply plugin: 'com.android.dynamic-feature'”.
Other information, such as JaCoCo coverage or lines of code, we can get from other sources through an API. SonarQube is one of the tools that we use.
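Two of these per-module checks can be approximated with plain file inspection. The sketch below is text-based and written in Python rather than as a Gradle task: it looks for the dynamic-feature plugin ID in a build file and estimates the Kotlin percentage by counting lines in .kt and .java files. Both function names are illustrative assumptions.

```python
import os

DYNAMIC_FEATURE_PLUGIN = "com.android.dynamic-feature"


def is_dynamic_feature(build_file_text):
    """True if a non-commented line mentions the dynamic-feature plugin ID."""
    for line in build_file_text.splitlines():
        code = line.split("//", 1)[0]  # drop trailing line comments
        if DYNAMIC_FEATURE_PLUGIN in code:
            return True
    return False


def kotlin_percentage(module_dir):
    """Share of Kotlin lines among all Kotlin and Java lines in a module."""
    lines = {".kt": 0, ".java": 0}
    for current, _dirs, files in os.walk(module_dir):
        for name in files:
            extension = os.path.splitext(name)[1]
            if extension in lines:
                path = os.path.join(current, name)
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    lines[extension] += sum(1 for _ in handle)
    total = lines[".kt"] + lines[".java"]
    return 100.0 * lines[".kt"] / total if total else 0.0
```

A text check like this can be fooled by block comments or plugins applied from shared scripts, which is why asking Gradle itself after configuration is the more reliable route.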

Also, note that we have a special column, “Owner”, so we know which team each module belongs to. This owner is populated from a special file. If using GitHub, we can use a CODEOWNERS file. The owner field is handy for notifying the respective team, especially about a particular module's technical debt.
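A simplified sketch of resolving the owner of a module from a CODEOWNERS file is shown below. It assumes every rule is either "*" or a directory path relative to the repository root, and it does not implement the full pattern syntax that GitHub supports; the team handles in the usage note are hypothetical.

```python
def parse_codeowners(text):
    """Return a list of (path_prefix, owners) rules in file order."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        parts = line.split()
        if len(parts) >= 2:
            rules.append((parts[0].strip("/"), parts[1:]))
    return rules


def owners_for(module_path, rules):
    """Return the owners of a module; the last matching rule wins."""
    path = module_path.strip("/")
    owners = []
    for prefix, rule_owners in rules:
        if prefix == "*" or path == prefix or path.startswith(prefix + "/"):
            owners = rule_owners
    return owners
```

With a file containing "* @android-platform" followed by "/feature/home/ @home-team", owners_for("feature/home", rules) resolves to the home team, while every other module falls back to the platform team. The result can be written straight into the “Owner” column.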

C. Library Dependencies Size

One of the unpredictable things in a huge codebase is the usage of its dependencies. So, we need to keep track of the dependencies and their usage in each module. Below are the metrics:

How many external library dependencies do we use to build the app?
What is the size of each dependency?
How heavily is each dependency used?
Does the library already use AndroidX or still use android.support?

*Example of Library size Spreadsheet (row 37–43). Data is not real.*

A quick explanation of how we get the data:

List all third-party dependencies across all modules in the application.
Get the size of each dependency by looking it up in the Gradle cache directory (~/.gradle/caches by default).
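The size lookup could be sketched like this. It assumes the downloaded artifacts sit under a folder laid out as group/artifact/version/hash/file, which is how the modules-2/files-2.1 folder inside the Gradle cache is typically organized; treat that layout and the function name as assumptions to verify against your own Gradle version.

```python
import os

ARTIFACT_EXTENSIONS = (".jar", ".aar")


def dependency_sizes(files_dir):
    """Return {"group:artifact:version": total_bytes} for cached artifacts."""
    sizes = {}
    for current, _dirs, files in os.walk(files_dir):
        parts = os.path.relpath(current, files_dir).split(os.sep)
        if len(parts) != 4:
            continue  # only <group>/<artifact>/<version>/<hash> folders hold files
        coordinate = ":".join(parts[:3])
        for name in files:
            # Count binaries only; skip POM files and source archives.
            if name.endswith(ARTIFACT_EXTENSIONS) and not name.endswith("-sources.jar"):
                size = os.path.getsize(os.path.join(current, name))
                sizes[coordinate] = sizes.get(coordinate, 0) + size
    return sizes
```

Joining this dictionary with the list of dependencies actually declared by the modules gives the size column of the report. The cache can also hold versions that are no longer used, so the join should start from the declared dependencies rather than from the cache contents.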

With this library dependency report, we can decide, for example, whether a library is really needed. Removing a library makes the application smaller. If a third-party library is too big, we can consider moving it into a Dynamic Feature module. Declaring unneeded “implementation” dependencies can also increase compilation time, so we can carefully plan to remove unnecessary implementation dependencies.

Conclusion

So far, we have seen that a codebase and its development environment are things we can measure. In development alone, there is a myriad of data that we should measure, such as application size, build time, each module’s statistics, and dependencies. Storing the data and automating its collection are also integral to seeing the trends and staying consistent.

By leveraging Codebase Reporting, we can have a better understanding of our application. It can also help us analyze and improve our application in terms of better source code and better quality products.