Scaling Bazel builds @Harness

Feb 27, 2023

Authors: Gaurav Nanda, Udham Singh

At Harness, we believe that developer productivity is one of the key pillars of software development. We are constantly looking for improvements to ensure developers at Harness have a great experience.

To that end, last quarter we identified a key area of improvement: build times. Historically, our repository, harness-core, had fat Bazel modules, with one BUILD file per top-level directory. This approach has several shortcomings, one of them being slow incremental builds. To fix this, we are moving toward Bazel's recommendation of one BUILD file per directory.
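To make this concrete, here is a minimal sketch of what a fine-grained package looks like under this approach (the package path, target names, and dependencies below are made up for illustration, not taken from harness-core):

# module/client/BUILD.bazel - one BUILD file for just this directory
java_library(
    name = "client",
    srcs = glob(["*.java"]),  # only the sources that live in this directory
    visibility = ["//visibility:public"],
    deps = [
        # sibling package with its own BUILD file, so it rebuilds independently
        "//module/common",
        "@maven//:com_google_guava_guava",  # external dependency, name illustrative
    ],
)

With targets this small, an incremental change to one directory invalidates only the packages that depend on it, instead of an entire top-level module.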

As the number of BUILD files has grown, we have run into several scaling issues. In this post, we go over those issues and how we debugged and fixed them.

“Class not found”

Issue

One of the first issues we ran into was a “Class not found” error while trying to run applications.

Could not find or load main class io.harness.ng.NextGenApplication
Caused by: java.lang.NoClassDefFoundError: io/dropwizard/Application

This started to happen a couple of months after we began refactoring to create smaller build targets. Eventually, we reached a point where adding new dependencies led to a build failure!

Debugging

The error message suggested that the class was missing from the classpath. We noticed that this error occurred only when the classpath length crossed 120K characters.

Solution

As a quick workaround, we specified a higher CLASSPATH_LIMIT of 400K. The underlying cause was still not clear at this point, but it got properly addressed when we faced the next issue, discussed below.
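The launcher script that Bazel generates reads CLASSPATH_LIMIT from the environment (as shown in the snippet in the next section), so the workaround amounted to exporting a larger value. A rough sketch of the kind of override we mean (the target label is a placeholder):

# for locally run binaries, export the variable before bazel run
export CLASSPATH_LIMIT=400000
bazel run //path/to/service:NextGenApplication

# for tests, pass it through the sandboxed test environment via .bazelrc
test --test_env=CLASSPATH_LIMIT=400000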

“Argument list too long”

Issue

Soon after the classpath limit fix, we started to run into the “Argument list too long” error.

Executing tests from //batch-processing/service:io.harness.batch.processing.config.k8s.recommendation.WorkloadCostServiceTest
-----------------------------------------------------------------------------
/tmp/sandbox/processwrapper-sandbox/5627/execroot/harness_monorepo/bazel-out/k8-fastbuild/bin/batch-processing/service/io.harness.batch.processing.config.k8s.recommendation.WorkloadCostServiceTest.runfiles/harness_monorepo/batch-processing/service/io.harness.batch.processing.config.k8s.recommendation.WorkloadCostServiceTest: line 366: /tmp/sandbox/processwrapper-sandbox/5627/execroot/harness_monorepo/bazel-out/k8-fastbuild/bin/batch-processing/service/io.harness.batch.processing.config.k8s.recommendation.WorkloadCostServiceTest.runfiles/local_jdk/bin/java: Argument list too long
/tmp/sandbox/processwrapper-sandbox/5627/execroot/harness_monorepo/bazel-out/k8-fastbuild/bin/batch-processing/service/io.harness.batch.processing.config.k8s.recommendation.WorkloadCostServiceTest.runfiles/harness_monorepo/batch-processing/service/io.harness.batch.processing.config.k8s.recommendation.WorkloadCostServiceTest: line 366: /tmp/sandbox/processwrapper-sandbox/5627/execroot/harness_monorepo/bazel-out/k8-fastbuild/bin/batch-processing/service/io.harness.batch.processing.config.k8s.recommendation.WorkloadCostServiceTest.runfiles/local_jdk/bin/java: Success
[root@harnessci-tin0-xvss4fzd io.harness.batch.processing.config.k8s.recommendation.WorkloadCostServiceTest]# grep ARG_MAX /usr/include/linux/limits.h
#define ARG_MAX 131072 /* # bytes of args + environ for exec() */
[root@harnessci-tin0-xvss4fzd io.harness.batch.processing.config.k8s.recommendation.WorkloadCostServiceTest]# getconf ARG_MAX

Debugging

This error suggested that we were crossing Linux's limit on argument length, which is 128K. This also meant our original fix of increasing CLASSPATH_LIMIT beyond 128K was not a great idea.

Digging deeper into the Bazel codebase, we noticed that Bazel already seemed to handle the long-argument issue in java_stub_template.txt. As per the following snippet, if the classpath length crosses 120K characters (7K on Windows, where the per-argument limit is 8K), Bazel should wrap the entire classpath inside a JAR file using the “Class-Path” manifest header.

# If the user didn't specify a --classpath_limit, use the default value.
if [ -z "$CLASSPATH_LIMIT" ]; then
  # Windows per-arg limit MAX_ARG_STRLEN == 8k
  # Linux per-arg limit MAX_ARG_STRLEN == 128k
  is_windows && CLASSPATH_LIMIT=7000 || CLASSPATH_LIMIT=120000
fi
if (("${#CLASSPATH}" > ${CLASSPATH_LIMIT})); then
  export JACOCO_IS_JAR_WRAPPED=1
  create_and_run_classpath_jar
else
  exec $JAVABIN -classpath $CLASSPATH "${ARGS[@]}"
fi
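For context, create_and_run_classpath_jar builds a small wrapper JAR whose manifest lists the real classpath entries under the standard Class-Path attribute, so java can be started with a single short -classpath argument. Conceptually, the wrapper JAR's META-INF/MANIFEST.MF looks something like this (entries shortened and made up for illustration):

Manifest-Version: 1.0
Class-Path: ../guava/guava.jar ../dropwizard/dropwizard-core.jar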

However, this was not working for us. A little more research suggested that Bazel had a bug in its earlier implementation of this logic, which was fixed in the 5.x releases.

Solution

To address this issue, we upgraded to Bazel 5.0.0, which took care of the underlying problem. We also reverted our CLASSPATH_LIMIT change, so we never go beyond the operating system's argument limits.
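If you pin the Bazel version with Bazelisk, which reads a .bazelversion file at the repository root, the upgrade is a one-line change; a minimal sketch:

# pin Bazel 5.0.0 for everyone who builds the repo via Bazelisk
echo "5.0.0" > .bazelversion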

“Too many open files”

Issue

Recently, many developers started to complain about Bazel build failures with “Too many open files” errors on their Mac machines.

Debugging

macOS limits the total number of open file descriptors (the limit can be checked by running the ulimit -n command). While running bazel build on the harness-core monorepo, Bazel's main process and its worker threads were trying to open more files than the limit allowed, terminating the build with the “Too many open files” error.

  • The first thing we tried was to increase the OS file limit using the ulimit command. This did not work, as recent macOS versions use launchctl¹ for setting the maximum-open-files limit² (see the sketch after this list). To our surprise, even after using launchctl we did not see any change in behavior, and builds were still failing beyond ~10K open file descriptors.
  • We also learned that the JVM sets its own file-descriptor limit, and that we can tell it to ignore that limit by passing the -XX:-MaxFDLimit argument³. However, there is no option in the java build target to accept JVM arguments.
  • Defining the flag build --jvmopt='-XX:-MaxFDLimit' in our bazelrc files did not work for us either; worker threads were still crashing.
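For reference, here is roughly what the first bullet above involved; the numbers are illustrative, and the exact values you need depend on your machine:

# check the current per-process limit on open file descriptors
ulimit -n

# raise the soft and hard maxfiles limits via launchd on recent macOS versions
sudo launchctl limit maxfiles 65536 200000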

Solution

  • We discovered that the child Java processes spawned during the build take their JVM arguments directly from the default Java toolchain.
  • Hence, we ended up extending the default Java toolchain and adding “-XX:-MaxFDLimit” to its jvm_opts. This ensured we used the system limits rather than the JVM-defined limit. Here is the PR for reference.
load(
    "@bazel_tools//tools/jdk:default_java_toolchain.bzl",
    "BASE_JDK9_JVM_OPTS",
    "DEFAULT_TOOLCHAIN_CONFIGURATION",
    "default_java_toolchain",
)

default_java_toolchain(
    name = "harness_no_fdLimit_jdk11_toolchain",
    configuration = DEFAULT_TOOLCHAIN_CONFIGURATION,  # one of the predefined configurations
    jvm_opts = BASE_JDK9_JVM_OPTS + ["-XX:-MaxFDLimit"],  # additional JVM options
    source_version = "11",
    target_version = "11",
)
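For builds to actually pick this toolchain up, it still has to be registered. With Java toolchain resolution, default_java_toolchain also creates a <name>_definition target, which can be registered in .bazelrc roughly like this (the package path below is a placeholder for wherever the toolchain target lives in your repo):

# .bazelrc: make Bazel resolve Java builds against the custom toolchain
build --extra_toolchains=//tools/bazel:harness_no_fdLimit_jdk11_toolchain_definition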

If you are also excited about the developer productivity domain and would like to contribute to solving such interesting problems and making an impact, feel free to take a look at Harness’ career page.
