Java vs. Kotlin — Part 1: Performance

Jakub Anioła
RSQ Technologies
12 min read · Aug 13, 2019


Almost one year ago, I was a student at the Poznań University of Technology studying Software Engineering, thinking about a master's thesis subject. All of the recommended topics seemed boring or totally out of the scope of my interests, so I decided to come up with my own topic for the research.

Around the same time, in October, my friends from RSQ Technologies and I headed to the KotlinConf 2018 conference in Amsterdam. We attended the closing panel with numerous interesting people from the Kotlin community. One of them was William Cook. The professor from the University of Texas mentioned that Kotlin does not have a large audience in the scientific community because there are not many papers about it. So I asked myself: maybe I can do something about that and try to write something about Kotlin?

Some days later, after the conference, I was relaxing at home catching up on older KotlinConf talks. I came across an awesome talk, “The Cost of Kotlin Language Features” by Duncan McGregor, about the overhead of Kotlin in code size and execution time. It motivated me to take a deep dive into the performance and build differences between Java and Kotlin.

So here we are, months later: I am a master's degree graduate trying to summarize my experiment and results in this Medium article. I hope you will enjoy it and learn something new.

Link to Java vs. Kotlin — Part 2: Bytecode

Research question

Icon made by Freepik from www.flaticon.com

First of all, I want to share my master's thesis research questions with you. There are two of them, but in this first article, we will focus on the first one, which concerns performance.

RQ: Are there significant differences in execution performance between various implementations of the same benchmark in Kotlin and Java using the Java Runtime Environment?

The Computer Language Benchmark Game

The dynamic metrics comparison is based on the idea popularized by one of the most popular cross-language benchmark suites: The Computer Language Benchmark Game (CLBG).

It was introduced by Doug Bagley in 2000 as The Great Computer Language Shootout, a project whose goal was to compare all major languages. Nowadays, the project has developed into The Computer Language Benchmark Game, the most popular cross-language benchmark used in scientific work. The project keeps growing with new problem benchmarks and language implementations. It is systematically updated by its maintainer to follow programming market trends (by adding new languages, removing ones that are no longer used, and updating the list of benchmark algorithms).

The goal behind the CLBG benchmark is to answer the following question, asked by a 4chan user:

My question is if anyone here has any experience with simplistic benchmarking and could tell me which things to test for in order to get a simple idea of each language’s general performance?

To answer that question, CLBG presents a set of 10 different algorithmic problems. All of the problems are presented and described on the official web page. Each algorithm comes with strict rules on how it must be implemented and what it may use to achieve a correct, equivalent result. Following that specification, the problems are implemented in the specific language to be measured.

To enable an objective comparison of the results, CLBG benchmark always uses fixed scripts that implement the metrics for all of the experiments. The collected measurements are independent of the implementations of algorithms in the indicated languages.

Benchmarks selection

Icon made by Freepik from www.flaticon.com

During the development of the experiment, The Computer Language Benchmark Game consisted of 10 benchmark programs (side note: the number of benchmarks in CLBG changes over time). Each of them poses a different kind of problem, requiring different language paradigms, language features, and methodologies to solve.
(I am not going to describe every benchmark in CLBG; if you are interested, I encourage you to read the information on the webpage.)

The main idea behind this experiment was to compare Java and Kotlin. To achieve that, the experiment used the base implementation of the programs in Java (taken from the CLBG benchmarks repository) and two implementations in Kotlin: converted and idiomatic (described in the next sections).

With these assumptions, benchmarks used in the experiment were selected based on two factors:

  1. the best Java implementation taken from the CLBG repository has to be convertible to Kotlin
  2. the programs must manipulate data that is as diverse as possible

The authors of the paper “JVM-Hosted Languages: They Talk the Talk, but do they Walk the Walk?” proposed a method of characterizing the CLBG corpus of programs by indicating whether a program mostly manipulates integers, floating-point numbers, pointers, or strings. This information will help us divide the benchmarks into groups.

Taking everything into account, only six of the CLBG benchmarks were used in the final experiment. Four of the ten benchmarks were rejected in order to stay consistent with the assumption that the Java code has to be convertible to Kotlin without large changes.

  • int - integer
  • fp - floating point
  • ptr - pointer
  • str - string
Table 1: Selected benchmarks with information about most manipulated data

Remarks

  • after benchmark selection (as depicted in Table 1), the final benchmark suite does not contain programs which manipulate mostly string resources

Implementations

Icon made by Freepik from www.flaticon.com

There are three implementations for every benchmark in this experiment.

  1. Java
  2. Kotlin-converted
  3. Kotlin-idiomatic

All of these implementations were used in the experiments: compiled, executed, and measured by external Python scripts. None of the implementations contains any measurement code or irrelevant parts which could interfere with the final results.

Kotlin was divided into two versions because I wanted to check the difference between multiple Kotlin implementations. I assumed that Kotlin-converted would be the code with performance and bytecode results most similar to the Java version. On the other hand, I used the Kotlin-idiomatic implementation to obtain benchmark results for code which is more likely to be produced by an experienced Kotlin programmer.

If you are interested in more implementation details, check out the Java vs Kotlin comparison repository.

Java implementation

All Java code was taken directly from the most recent benchmark implementations available in The Computer Language Benchmark Game repository, without making any changes.

In the benchmark repository, there were multiple versions of the Java benchmarks. The ones used in the experiment are those which achieved the best results, according to the leaderboard available on the CLBG webpage.

Kotlin-converted implementation

This implementation was created using the Java-to-Kotlin converter provided by IntelliJ IDEA. Multiple changes had to be made to the raw converted versions to bring the code to a compilable, executable state.

All Kotlin-converted implementations share one additional change: in order to make the code compilable using the command-line interface, the main() method was extracted outside the original class, into a top-level function.
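Assuming a hypothetical benchmark class (the names below are illustrative, not taken from the thesis code), the extraction looks roughly like this:

```kotlin
// Hypothetical benchmark class; only the entry point placement matters here.
class Benchmark {
    fun run(n: Int): Int = n * 2
}

// A top-level main() compiles to a static method on the generated
// BenchmarkKt class, so the program can be compiled and run directly:
//   kotlinc benchmark.kt -include-runtime -d benchmark.jar
//   java -jar benchmark.jar
fun main() {
    println(Benchmark().run(21))  // prints 42
}
```

The converter otherwise tends to keep main() inside a companion object of the class, mirroring Java's static method, which is less convenient to launch from the command line.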

Kotlin-idiomatic implementation

Changes introduced into the Kotlin-idiomatic implementations are mostly based on:

  • Idioms: a page listing frequently used Kotlin idioms, part of the official Kotlin documentation
  • Coding Conventions: a page containing the current recommended coding style for the Kotlin language, also part of the Kotlin documentation
  • IntelliJ IDEA's default code inspections

Kotlin idiomatic implementations are based on the converted versions.
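As a rough illustration (a hypothetical fragment, not taken from the benchmark code), a converter-style function and its idiomatic rewrite might look like this:

```kotlin
// Converted style: the Java-to-Kotlin converter tends to preserve
// index-based loops and mutable accumulators from the Java original.
fun sumOfSquaresConverted(values: IntArray): Int {
    var sum = 0
    for (i in values.indices) {
        sum += values[i] * values[i]
    }
    return sum
}

// Idiomatic style: the same logic rewritten following the official
// Idioms page, using an expression body and a stdlib function.
fun sumOfSquaresIdiomatic(values: IntArray): Int =
    values.sumOf { it * it }
```

Both functions compute the same result; the idiomatic version trades explicit control flow for stdlib abstractions, which is exactly the kind of difference the two Kotlin variants are meant to expose.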

Remarks

  • five out of six benchmarks work in parallel using threads (in both the Java and Kotlin implementations); none of the Kotlin implementations uses coroutines, whose use could significantly affect the performance results

Languages version and hardware

Icon made by Nikita Golubev from www.flaticon.com

Obligatory information about the software and hardware environment used for benchmark execution.

The experiments were executed in the hardware environment whose details are presented in Table 2. The choice of Linux Ubuntu was dictated by the fact that it is the recommended OS for the CLBG metrics measurement scripts.

Table 2: Hardware environment

Table 3 presents the versions of Java and Kotlin used for executing the benchmark implementations. Both were the newest available versions of the languages at the time.

Table 3: Languages version

Remarks

  • all of the benchmark executions are done on Oracle HotSpot VM

Dynamic metrics

Icon made by Freepik from www.flaticon.com

So yeah, what did I really measure in this dynamic metrics comparison?
I decided to compare the languages using the metrics most often considered important by programmers:

  • Execution time
  • Memory usage
  • CPU load

Every program was executed and measured 500 times.

Benchmark metrics are also based on those used in The Computer Language Benchmark Game. All of the programs were executed and measured using dedicated CLBG scripts.

Multiple measurement methods were experimented with during the development of the Kotlin and Java benchmark suite, and they are still available in the repository. Initially, time was measured from Java/Kotlin code with methods like currentTimeMillis() or nanoTime() from the System class, but the results had a large variance, so this measurement method was abandoned in further work. Measuring other load metrics like CPU and memory also turned out to be non-trivial and affected by various environment-dependent factors (a deep dive into benchmark measurement techniques is a subject for another long article!).
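For reference, the abandoned in-process approach looked roughly like this (a minimal sketch with illustrative names, not the actual thesis harness):

```kotlin
// Wrap the workload in System.nanoTime() calls and repeat it,
// collecting one wall-clock timing per repetition. In practice this
// kind of in-process timing showed high variance (JIT warm-up, GC
// pauses), which is why external measurement scripts were used instead.
fun measureRunsNanos(repetitions: Int, workload: () -> Unit): List<Long> {
    val timings = mutableListOf<Long>()
    repeat(repetitions) {
        val start = System.nanoTime()
        workload()
        timings += System.nanoTime() - start
    }
    return timings
}
```

The standard library's kotlin.system.measureNanoTime does the same wrapping for a single run; the variance problem lies not in the wrapping itself but in measuring from inside the JVM being benchmarked.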

All of that work led to the decision that applying the official CLBG measurement scripts would be the most objective method in this case. It also puts all of the Kotlin measurement conclusions in the context of the results for the other languages evaluated by the CLBG benchmark.

Details of how each parameter is measured by the Python script are available on the CLBG measurements page.

Results

Full list of measurement results for each benchmark is available in the project repository.

Execution time

The figure below presents execution time box plots for each benchmark and each implementation. The letters in brackets indicate the most manipulated data type.

Memory usage

The figure below presents memory consumption box plots for each benchmark and each implementation. The letters in brackets indicate the most manipulated data type.

CPU Load

The table below presents the CPU load on each CPU core for each benchmark and each implementation.

Table 4: CPU load results

Conclusions

The table below presents a comparison of benchmark results, with median memory consumption and median execution time. The letters in brackets stand for the most manipulated data type.

The first thing that catches the eye is that the Kotlin-idiomatic implementation never achieved the best result in any of the measurement medians. Its median execution time is higher than that of the other implementations. The same goes for memory consumption: the idiomatic implementation sometimes comes second, but never achieves the best result on a given benchmark. We can thus conclude that writing Kotlin code using the recommended techniques might not be the best option for someone looking for the lowest memory usage and fastest execution time.

Based on these results, we can also conclude that the Java implementation is usually better at memory management and program execution. The Java implementation achieves the best median execution time in four out of six benchmarks. The same implementations also achieve the best median memory usage in four out of six benchmarks. The Kotlin-converted measurements show shorter execution time in Mandelbrot and Fannkuch Redux, and lower memory consumption in Fasta and Binary Trees.

The largest gap between the highest and lowest median execution time is noticeable in the Binary Trees benchmark. The difference between the best execution time (Java) and the worst (Kotlin-idiomatic) is at the level of 6.76%. There are also benchmarks where the Java, Kotlin-converted, and Kotlin-idiomatic implementations achieved very similar results: in Spectral Norm and Mandelbrot, the difference between the highest and lowest median execution time is less than 1%.

Considerable memory usage variation is visible in the Fannkuch Redux benchmark. As mentioned earlier, Fannkuch Redux is a program which mostly manipulates integer numbers, and there are differences in how Java and Kotlin handle them. In Java, numbers without a fractional component are mostly stored as primitive types. Kotlin's Int also compiles down to the primitive int in most cases, but values are boxed as Integer objects whenever they are nullable or used through generic types.
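The distinction can be sketched as follows (an illustrative snippet, not taken from the benchmark code):

```kotlin
fun main() {
    // IntArray compiles to a primitive int[]: no boxing.
    val primitive: IntArray = intArrayOf(1, 2, 3)

    // Array<Int> compiles to Integer[], so every element is a boxed
    // java.lang.Integer object; the same applies to List<Int>.
    val boxed: Array<Int> = arrayOf(1, 2, 3)

    // A nullable Int is also represented as a boxed Integer.
    val maybe: Int? = 42

    println(primitive.javaClass.componentType)  // int
    println(boxed.javaClass.componentType)      // class java.lang.Integer
    println(maybe)                              // 42
}
```

This is why a benchmark that allocates many integers can show noticeably different memory behavior depending on whether the code ends up with primitive or boxed representations.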

In Figure 3, we can see that the Kotlin-converted implementation has a much lower median memory consumption than the other implementations; the difference between the memory usage medians is greater than 100 MB. The study conducted for this thesis provides no clear explanation for this result, because even the static bytecode analysis does not show any notable difference in favor of the Kotlin-converted implementation.

On the other hand, we have the last dynamic metric: CPU load. As mentioned in the previous section, there is no significant difference among the particular measurements. Taking into consideration that tasks are allocated randomly to CPU cores, the difference between the implementations is, in most cases, not higher than 3%. Differences at this level can vary between executions because of differences in computer hardware and software. The difference between benchmarks implemented in different languages can be much higher; CPU load results with more significant differences are available on the official CLBG benchmark webpage.

With all of that information, we can assume that both languages load the CPU in a similar manner. There is no notable difference in CPU load between the languages or implementations.

So… what’s the answer to the research question?

Icon made by Freepik from www.flaticon.com

At the beginning of this article, I presented my research question. Now it is time to form an answer based on the achieved results.

RQ: Are there significant differences in execution performance between various implementations of the same benchmark in Kotlin and Java using the Java Runtime Environment?

From the execution time perspective, the difference between the highest and lowest median elapsed time ranges from 6.7% in favor of the Java implementation (Binary Trees benchmark) to 1.2% in favor of the Kotlin-converted implementation (Fannkuch Redux benchmark). There is no benchmark where the Kotlin-idiomatic implementation achieves the best median execution time. Java code reaches the best median time in four out of six benchmarks; the remaining benchmarks performed best with the Kotlin-converted code.

Memory consumption appeared lowest in the Java implementation in four out of six benchmarks; the other two were lowest for the Kotlin-converted code. The best memory consumption results for each implementation do not overlap with its best execution time results. Memory management can reach very diverse values for the languages in different benchmarks: the Java implementation achieves memory usage better by as much as 13% in Fannkuch Redux, but on the other hand, Kotlin-converted reaches a 9.7% improvement in Binary Trees compared to the Java results.

The CPU load measurement results show no significant difference in load management across cores between the two languages. All of the benchmark implementations achieve very similar results, with differences at the error level.

What's next?

Icon made by Darius Dan from www.flaticon.com

That is it for dynamic metrics. I am not a statistics and data analysis expert, so there are probably tons of conclusions hidden inside these results which I did not present. If anyone is interested, I would love to see more analysis based on my data.

Also, if you have found any errors in my comparison (wrong implementations, median evaluation, and so on), hit me up in the comments section, on Twitter, or on the official Kotlin Slack (Jakub Aniola).

In the next part, I am going to present the methodology and results of “Java vs Kotlin” static analysis. I think that next time we should look at the differences between these two languages from a different perspective. The JVM bytecode perspective.

Link to Java vs. Kotlin — Part 2: Bytecode
