SW Development Process At MANTA

Pavel Mrázek
MANTA Engineering Blog
12 min read · Jun 6, 2022

Can our development team scale from 4 to 70 developers in a few years while using a source code mono-repo? We will talk about the core principles of source code modularity in connection with a branching model and the peculiarities of continuous integration.

Understanding the Context — What Is MANTA

MANTA is a software product company developing a product for data lineage analysis, used by dozens of large customers from all over the world.

Examples of MANTA customers
Examples of MANTA partners

The truth is, we have outgrown our baby shoes quite fast. If you had visited our office 4 years ago, you would have seen only 4 enthusiasts. Nowadays, 40 developers make the magic possible, and we expect to grow to 70 or 80 within two years. We maintain eight million lines of code, mainly written in Java, much of it nontrivial code that parses and analyzes SQL or integrates with 3rd-party systems through various APIs. Speaking of the volume of work, we usually develop about 1,000 features and generate an effort of 6,500 person-days per year.

I somehow feel we do not fit in the box labeled “Startup” anymore, although it would sound attractive, right? I know, a startup can also be seen as a kind of mindset, not limited by company size. No doubt. My point here is that growth sooner or later brings the need for a certain level of organization and actionable processes, which we are going to talk about. These facts have particular implications for the software development process we use today. We will focus especially on the core development part of our SW development process, deliberately omitting other important parts such as analysis, design, and testing.

Code Repository and Modularity

Mono-repo

Yes, we have a monolithic repository. It sounds almost reactionary when I say it out loud. Indeed, the mono-repo seems to be losing popularity in these modern times of microservice architecture and code structured in many independent repositories that give so much flexibility.

When migrating from Subversion to Git, we had a good chance to re-evaluate our source code versioning model and change to another model that would suit our needs better. The migration itself is another topic, so to keep a long story short: what was our rationale behind the mono-repo? We made a requirements analysis similar to what you would usually do as part of a standard development iteration. Our customers were the developers, our continuous integration system, our release process, and whoever else might use the Git repository now or soon. We elaborated on all the relevant usage scenarios and simply concluded, “For us, the fewer repos, the fewer complications with releasing, daily coding and building, inter-module dependencies, etc.” Impaired scaling caused by a growing codebase was assessed as a low risk. So, we stayed with one main mono-repo. The decision could be very different in another organization or team.

Modularity

After this decision, we immediately faced the question: “How are you going to organize the code and the build for dozens of programmers using a single codebase, and avoid a mess?” Complicated things are easier to understand and handle when they are broken into smaller pieces. We borrowed the idea of modularity from SW architecture and applied it at the level of the codebase to enable parallel, independent, efficient work. Secondly, software system architecture is often influenced by Conway’s Law: “Organizations design systems that mirror their communication structure.” This means that the organization chart is reflected in the SW architecture. And it can go even further: the architecture gets reflected in the codebase structure. Of course, there were many more factors and triggers that made us gradually decompose the repository into approximately 1,000 modules. These modules are grouped into logical components, such as the Oracle scanner, the export to Informatica Data Catalog, or one of our deliverables such as the web interface for data lineage visualization or the administration console.

When the Maven dependency tree is not helpful anymore. This is just a fraction.

From the project perspective, dividing the big repository into smaller cohesive components allows us to organize work effectively. Each team or developer is responsible for one or more components. If this division were missing, proper team scaling would be problematic. Adding a new developer to the team usually means assigning tasks connected to some existing modules or creating new, separate modules. This is how we watched the team grow from four to forty developers in four years, and there will probably be seventy-plus developers very soon.

From the design perspective, cohesive also means an encapsulated, well-defined functionality that can be grasped and used on its own. It can even be reused in multiple places, such as utility artifacts or platform code. Although the modules and components are cohesive, it doesn’t mean they are always independent. They form a complex system of direct and indirect compile-time dependencies which are checked and linked during the build phase.

Technically, all dependency management is handled by Maven. The high modularity of compilation units frees us from building and running the whole project of one thousand artifacts just to debug one small increment of code. Each modification is localized and can be addressed independently. Usually, a developer runs their code (one module) against a dedicated suite of unit or integration tests. Practiced consciously, this workflow encourages a test-driven approach and improves code quality and code coverage. Besides that, running only the related part of the codebase saves an incredible amount of time on the build and, in the case of server-side code, on deployment to the application server. On the other hand, if integration testing and end-to-end testing are delayed for too long, bugs pile up on one another and their causes become hard to track.

Modularity challenges

Nothing in the world is perfect. We have been facing several issues regarding modularity that I would like to sketch here, because they could save you a few wrinkles.

Each developer should stay in sync with the other developers and work with up-to-date snapshots of their modules. These snapshots must compile, work correctly, and pass the required quality standards before they are made available to other developers. We will describe the implementation side of this in the chapter dedicated to continuous integration.

There is another challenge connected to codebase fragmentation and complicated dependency management. By working on individual modules separately, you will inevitably break an API between modules, or corrupt a functional contract between your service and its client residing in a different module. This happens simply because you run and test only your own code, without considering what is behind the wall and who else consumes your API or relies on your particular behavior. Luckily, you and your consumer should find out by the next day, thanks to the continuous integration we will describe further.

Apart from the functional dependencies, the compile-time dependencies must be tackled as well. The main theme here is “what depends on what.” Direct dependencies are easy to track and find out. Transitive dependencies, however, are not so obvious, and you as a developer need to make sure the right versions of the dependencies propagate into the final assembly (the end product). The Maven dependency:tree command is helpful but can give very chaotic results for big projects with many dependencies. Also, cyclic dependencies should not occur, but they do. They will ruin your build and are tricky to resolve.
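For illustration, the tree can be narrowed to a single suspicious library to see through which transitive path it enters the build (the coordinates below are made up for the example, not taken from our codebase):

```shell
# Print the full dependency tree of the module in the current directory
mvn dependency:tree

# Narrow the output to one library (illustrative coordinates) to see
# which transitive path pulls it into the build
mvn dependency:tree -Dincludes=org.apache.commons:commons-lang3
```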

Handling all the versions of all your artifacts, and the versions of their dependencies, needs a systematic approach. Otherwise, you will drown in a sea of dependencies. Every time you branch off to start working on a new version of the application, you need to upgrade the versions. Maven helps here with its versions:set goal.
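A minimal sketch of such a version bump, using a made-up version number:

```shell
# Bump every module of the reactor to the new development version
mvn versions:set -DnewVersion=34.1.0-SNAPSHOT

# Inspect the changed POMs, then either confirm or roll back the change
mvn versions:commit     # removes the pom.xml.versionsBackup files
# mvn versions:revert   # or restore the previous versions
```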

But this can get even more complicated if you do minor (partial) releases, as we do. We regularly release a patch containing only a small subset of modules. In such a case, we need to reliably identify only the modified modules and release only them.
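One way to approximate this detection, assuming one top-level directory per module (the layout, branch names, and this helper are illustrative, not our exact tooling):

```shell
#!/bin/sh
# Reduce a list of changed file paths (as printed by `git diff --name-only`)
# to the set of top-level module directories that contain them.
changed_modules() {
  cut -d/ -f1 | sort -u
}

# Typical call against the integration branch:
#   git diff --name-only origin/develop...HEAD | changed_modules
```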

Another moment to watch the versions is during a release, when you need to turn the snapshot versions into release versions. The Maven release plugin can handle this.
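A typical invocation looks roughly like this (standard maven-release-plugin goals; the exact behavior is driven by configuration in the POM):

```shell
# Turn 34.0.0-SNAPSHOT into 34.0.0, commit, tag, and bump to the next SNAPSHOT
mvn release:prepare

# Check out the created tag and build/deploy the release artifacts
mvn release:perform

# If something goes wrong halfway through:
mvn release:rollback
```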

Remember, we maintain 1,000 modules across several teams. Such an amount of build configuration can get messy and inconsistent quite quickly. To avoid that, we built our own automation to control the build configuration. It helps us eliminate misconfiguration, such as the usage of one library in two different versions. On top of this automated process, there is also a dedicated architect watching for duplicate or similar external libraries on the classpath.

In the next chapters, we will show that dividing the project into many smaller modules can make the CI more complicated, especially if your code resides in a mono-repo and you want to build several branches in parallel.

Branching Model

As a branching model, we started using Git-flow.

Git-flow model inspired by https://nvie.com/posts/a-successful-git-branching-model

Very soon, it became obvious we needed to adjust the flow to cover all relevant scenarios at MANTA.

  1. There is only one permanent integration branch called develop.
  2. Considering we support up to six major production versions in parallel, we completely dropped the master branch as a holder for tags. We tag release revisions on the dedicated release branches.
  3. We use several parallel release branches, each hosting the code of a particular production version. A major release branch branches off from the develop branch, while a minor release branch starts from the last major release branch (minor releases contain only patches).
  4. A hotfix branch, similar to a minor release branch, branches off from the major release branch. After the hotfix branch is tagged, it is merged back to the major release branch.
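In plain Git commands, the adjusted flow can be sketched like this (the branch and tag names are illustrative, not our actual naming):

```shell
# Major release branch starts from develop
git checkout -b release/R34 develop

# Minor (patch-only) release branch starts from the last major release branch
git checkout -b release/R34.1 release/R34

# A hotfix branch also starts from the major release branch
git checkout -b hotfix/R34-login-fix release/R34

# Tags live on the release branches, not on master
git tag -a R34.0.0 -m "MANTA 34.0.0"
```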

The rest of the principles are similar to the standard Git-flow such as using short-lived feature branches to merge the little increments to the integration branch (develop or release branch). There are also a few additional rules we stick to.

  1. Feature branches should be merged to the integration branch using the --no-ff switch. This always creates a merge commit, which is more explicit than a silent fast-forward merge.
  2. Feature branches should be kept in sync with the integration branch using rebase (instead of merge). This keeps the history clean, without unnecessary crossroads.
  3. Squashing groups of related commits is recommended.
  4. The only way to merge into an integration branch is through a pull request that has passed code review.
  5. We set the integration branches as protected. Nobody can force-push and overwrite the history with their local changes.
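Taken together, one feature-branch cycle under these rules might look like this (the branch name is illustrative):

```shell
# Branch off the integration branch
git checkout -b feature/oracle-scanner-fix develop

# ... commit work ...

# Rule 2: stay in sync with develop via rebase, not merge
git fetch origin
git rebase origin/develop

# Rule 3: optionally squash related commits before opening the pull request
git rebase -i origin/develop

# Rule 1: after the pull request passes review, merge with an explicit
# merge commit instead of a silent fast-forward
git checkout develop
git merge --no-ff feature/oracle-scanner-fix
```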

The branching model must go hand in hand with the proper configuration management we talked about earlier. We must be able to support different production versions in parallel. Thus, in every type of branch (develop, feature, release, hotfix) you must correctly set the versions of the projects that are to be released in the target version. If you are not strict with the configuration management, you will likely deploy your artifacts into the central repository with the wrong version and others will use them unknowingly.

Continuous Integration

I have described code modularization as one of the ways of streamlining the development pipeline. You as a developer just change your source file, rebuild the increment, and verify the test results, right? Yes, but… it means the whole application is divided into pieces. If we leave it like that for too long, those pieces will be very difficult to glue back into a working whole. That is why we strongly focus on continuous integration and regularly integrate the modules to ensure consistency and correctness.

How does it work? Every merge to the develop branch triggers a Jenkins build, which pulls the code, compiles and tests it, and checks the source code quality. This static code analysis phase is implemented via SonarQube quality gates. A quality gate in SonarQube is a set of selected quality metrics guarding your code quality. We typically gate on blocker or critical issues, bugs, duplicated lines, code coverage, and vulnerabilities. If any of these thresholds is not met, the build fails. If all stages pass, the artifact is deployed to the central Nexus repository. Consequently, a cascade of related downstream jobs is triggered and the whole build pipeline is executed.
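Condensed into commands, one such pipeline run roughly amounts to the following (a sketch, not our exact Jenkins configuration; the SonarQube server connection is assumed to be configured elsewhere):

```shell
# Compile, run unit and integration tests, and submit the analysis to SonarQube
mvn clean verify sonar:sonar

# Only if the build and the quality gate pass: publish SNAPSHOT artifacts
# to the central Nexus repository, which triggers the downstream jobs
mvn deploy -DskipTests
```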

Jenkins build jobs

This is a more or less standard process. But what are the specifics and challenges of the multimodule mono-repo used in a rapidly growing team?

We build all types of branches (pull requests, develop, release, hotfix) to spot the problems as early as possible. Building pull requests before merging helps us keep the integration branches stable.

Maven jobs in Jenkins

The CI of the integration branches is implemented by Maven jobs in Jenkins. Each job contains several Maven projects bundled together. Each job has upstream and downstream projects, which allows Jenkins to run the jobs in the right order. These relationships are computed automatically from the dependencies in the POMs of the individual projects.

Jenkins upstream and downstream projects

In a multi-module mono-repo, it is practical to bind Jenkins jobs to the individual folders of the repo that host the individual modules. The solution is called sparse checkout: the workspace folder is restricted to only one sub-folder.
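With plain Git (the Jenkins Git plugin exposes the same thing as the “Sparse Checkout paths” additional behavior), this looks roughly like the following; the repository URL and module path are illustrative:

```shell
# Clone without materializing any files yet
git clone --no-checkout https://example.com/manta-mono-repo.git ws
cd ws

# Restrict the working tree to a single module folder
git sparse-checkout init --cone
git sparse-checkout set scanners/oracle

# Now only scanners/oracle (plus top-level files) appears in the workspace
git checkout develop
```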

Multibranch Pipeline

Some types of branches, such as feature branches, are created on the fly and are automatically tracked in Jenkins. We use the “Multibranch Pipeline” and the Git plugin to notify Jenkins about recent changes. The pipeline is defined in a “Jenkinsfile” describing the build behavior for the different types of branches and the separate stages of the build.

Multipipeline Job — Hotfix branches and feature branches

Different branches may point to different versions of the developed application. E.g., develop refers to MANTA 34.0.0-SNAPSHOT, while in the release branch we test and fix MANTA 33.0.0-SNAPSHOT. Both branches should deploy their artifacts to the Nexus repository to provide SNAPSHOT JARs; thanks to the different versions, these artifacts won’t interfere in Nexus. However, work in feature branches is in progress and may be unstable. That is why feature-branch builds must not be deployed to the artifact repository: they would overwrite the stable version deployed from the integration branch.
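One way to express this gating in a build script, assuming the branch name is available in Jenkins’s BRANCH_NAME variable (the branch naming scheme and this helper are illustrative):

```shell
#!/bin/sh
# Pick the Maven goal by branch type: integration branches publish SNAPSHOTs,
# everything else (feature branches, PRs) only builds and tests.
goal_for_branch() {
  case "$1" in
    develop|release/*|hotfix/*) echo "deploy" ;;  # publish to Nexus
    *)                          echo "verify" ;;  # build and test only
  esac
}

# In the pipeline:
#   mvn clean "$(goal_for_branch "$BRANCH_NAME")"
```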

Checking Vulnerabilities in External Libraries

Now about security. Generally, the majority of the source code in a large software system was not written by the authors of the system; it comes from the external libraries linked into the final executable. Often, the developed system must comply with some security guidelines, or you simply want to focus on the security concerns of your application. Then, mitigating vulnerabilities in the external libraries makes sense. We, at MANTA, have automated the whole process of checking the external libraries against databases of known security vulnerabilities. We use the Maven plugin called “OWASP Dependency-Check,” which scans Java JARs and JavaScript libraries.
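The plugin can be run on demand as well as inside CI; a minimal invocation looks like this (the CVSS threshold value is illustrative):

```shell
# Scan all dependencies of the reactor against known vulnerability databases
mvn org.owasp:dependency-check-maven:check

# Fail the build when a vulnerability at or above the given CVSS score is found
mvn org.owasp:dependency-check-maven:check -DfailBuildOnCVSS=7
```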

Vulnerabilities in Spring REST (https://cve.mitre.org/index.html)

CI Catches

Again, there are some common catches connected with continuous integration in a multi-module repository. Each of them caused us a headache, so make sure you’ve thought it all out well. Here are our recommendations:

  1. Creating a new release branch means creating all the Jenkins jobs as they exist in the develop branch. We have 160 of them. Doing this manually would be quite a frustrating experience, so we have automated this step via the Jenkins CLI, and now it is a matter of minutes.
  2. One PR can change several modules, and you need to build all the affected modules. Detecting only the changed modules within a pull request can be complicated.
  3. Map the logical dependencies between your modules. Upstream and downstream projects are not everything.
  4. Make sure a merged PR doesn’t trigger all the other build jobs or builds of all PRs.
  5. Define and enforce naming conventions for the different types of branches. CI will use the name patterns to detect new branches and to decide how to build each of them.
  6. Build artifacts are cached in the local Maven repo. Use force update (the -U switch) to get fresh snapshots from the central repository.
  7. Be ready for CI performance scaling. CI can be flooded when everybody starts pushing their features.
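For recommendation 5, the convention can be as simple as a prefix per branch type, which CI can then match mechanically (the prefixes and this helper are illustrative):

```shell
#!/bin/sh
# Classify a branch by its name prefix so CI can pick the right build behavior.
branch_type() {
  case "$1" in
    develop)   echo "integration" ;;
    release/*) echo "release" ;;
    hotfix/*)  echo "hotfix" ;;
    feature/*) echo "feature" ;;
    *)         echo "unknown" ;;  # refuse to build, or fall back to verify-only
  esac
}
```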

Summary

I wish you the smoothest possible development experience. Let me share our top 10 tips.

  1. For smooth scaling of your team, choose a proper source code versioning model, branching model, and continuous integration strategy
  2. Think carefully about the project structure: project perspective, design perspective, implementation perspective
  3. The branching model should support your development and release cycle
  4. Use CI
  5. Know the build dependencies and triggers
  6. Define the build process for different types of branches
  7. Define a release process
  8. Automate the release process
  9. Watch for build misconfiguration and the build path
  10. Deploy artifacts according to the related application version


Pavel Mrázek
MANTA Engineering Blog

Engineering manager at MANTA supervising multiple teams, software developer, SW process improvement, hiring, mentoring TLs, lecturing.