Scala: Internals and Intermediates

In our previous Scala guides, we covered Scala at a fairly external level: installation, syntax and semantics, debugging, and unit testing. In this guide, we aim to explore Scala’s toolchain and internals.

Before we begin, please ensure that you have installed both Scala and scalac, the language and its compiler respectively. This means you will also need the Java Development Kit (JDK), preferably JDK8. If you require assistance in setting up any of these, you can refer to our Scala installation guide.

Toolchains and the Scala Compiler (scalac)

A toolchain is a set of ‘chained’ software development tools used to perform complex software development tasks or create software products, including program compilation, debugging, etc.

The Scala Compiler, or scalac, performs a great deal of work so that your program is portable and can run on any target machine. Recall that Scala runs on the Java Virtual Machine (JVM). The purpose of scalac, then, is to parse your high-level code and generate the corresponding portable JVM bytecode with the help of abstract syntax trees (ASTs), alongside other compilation tasks such as loading packages.

A fresh installation of Scala and scalac already comes packaged with an efficient toolchain for compilation, which we examine in the following section.

Using Compiler Options

There are two primary ways to use compiler options:

1. Using scalac: Simply write the following command into your terminal:
scalac [ <options> ] <source files>

Here, [ <options> ] is the list of options you want, and <source files> are the Scala files to be compiled.

2. Using sbt: In your sbt project’s build.sbt file, add the following code:

scalacOptions ++= Seq(
  "-option1", "arg1", // Option and its argument on the same line
  "-option2"          // Each option on its own line
)

In the scalacOptions body, each option is on its own line, and any arguments for an option are on the same line as that option. All specified options and arguments are of the String type.
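As a concrete illustration of the layout described above, here is a minimal build.sbt fragment. The particular flags are only an example selection of standard scalac options, not a recommendation from this guide.

```scala
// build.sbt (fragment). The flags are chosen purely for illustration.
scalacOptions ++= Seq(
  "-encoding", "UTF-8", // an option followed by its argument
  "-deprecation",       // warn about uses of deprecated APIs
  "-feature"            // warn about language features that need an explicit import
)
```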

In this guide, we will be using scalac to explore the compiler.

Compilation Phases

Fortunately, scalac supports compiler options that let us peek at each individual compilation phase of our Scala project. Open a terminal in your project directory and run the following command:

scalac -Xshow-phases

This should print a rather long table of compilation phases, along with their IDs and descriptions:

At the time of this writing, these are the 24 compilation phases of scalac and their duties. These are the steps needed to turn your high-level Scala code into a portable, intermediate representation known as JVM bytecode, which the JVM then executes on your target platform.

Throughout the compilation process, scalac builds and modifies ASTs in order to produce the JVM bytecode. For example, consider the following simple program of a for loop:

object MyProgram extends App {
  class aFunctionDef(num: Int) {
    def ForLoopFunction(): Unit = {
      for (i <- 1 to num) {
        if (i % 2 == 0)
          println(i.toString + ": even")
        else
          println(i.toString + ": odd")
      }
    }
  }
  var testClass = new aFunctionDef(5)
  println("End of Test Program.")
}

We can actually view the evolution of the AST generated for our Scala program at each compilation phase! Just run the following command, replacing <yourfile> with the name of your Scala file:

scalac -Xprint:all <yourfile>.scala

Running this command on the above test program, we obtain the ASTs at each compilation phase. We compare the ASTs of the first phase (parser) and the last (terminal):

AST after Phase 1 (Parser)
AST of Phases 22–24 (Delambdafy, JVM, Terminal)

As we can see, the AST has evolved dramatically from beginning to end of compilation. We also see that there are phases where the AST remained unchanged.
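One of the first changes visible in the trees is how the for loop is rewritten. In Scala 2, the parser already turns a for loop over a range into a foreach call that takes a function literal. The sketch below shows the two forms side by side; the object and method names are made up for this illustration, and the loop body collects strings instead of printing so the results can be compared.

```scala
import scala.collection.mutable.ListBuffer

object ForDesugar {
  // The loop in the form it is written in source code.
  def withFor(num: Int): List[String] = {
    val out = ListBuffer.empty[String]
    for (i <- 1 to num) {
      val label = if (i % 2 == 0) s"$i: even" else s"$i: odd"
      out += label
    }
    out.toList
  }

  // Roughly the form it takes after parsing: foreach with a function literal.
  def withForeach(num: Int): List[String] = {
    val out = ListBuffer.empty[String]
    (1 to num).foreach { i =>
      val label = if (i % 2 == 0) s"$i: even" else s"$i: odd"
      out += label
    }
    out.toList
  }

  def main(args: Array[String]): Unit = {
    // Both versions build the same list.
    println(withFor(5) == withForeach(5))
  }
}
```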

Although the final AST is large, it is now in a form that can be easily translated into a portable JVM bytecode structure. In the Scala compiler, this is done using ASM, an all-purpose Java bytecode manipulation and analysis framework. For more information on the ASM framework, check out the official documentation.

Finally, after your program finishes compiling, the JVM bytecode for your classes should be written as .class files in the current working directory. If not, you can specify the option -Ydump-classes DIR, where DIR is a specified directory, to create these files after compilation.

Scala’s Intermediate Representations

An intermediate representation of a program is the data structure or code that a compiler or virtual machine uses internally to represent the source code.

In the case of Scala, intermediate representations include platform-independent JVM Bytecode, and platform-dependent Machine Code.

JVM Bytecode

In high-level programming, understanding JVM bytecode can help a Scala or Java programmer in much the same way that understanding assembly helps a C/C++ programmer.

JDK8 comes with javap, a tool that lets us disassemble JVM bytecode within .class files. We will use javap to inspect the JVM bytecode generated from the previous example.

In the terminal, we type in the following command:

javap -c MyProgram.class

Here, MyProgram.class is the JVM bytecode generated by scalac. This returns the following:

Fortunately, it is not necessary to understand JVM bytecode to understand your Scala programs. However, it is an undeniable advantage to understand what is happening at a much lower level, especially in terms of performance. For more information on JVM bytecode, check out the official Java Virtual Machine Specification.

Internal Functions of JVM in Scala

Scala is a functional programming language that runs on the JVM, which raises the question of how functions are implemented, since the JVM has no native notion of a first-class function. In this section, we will see how Scala functions are represented on the JVM, which comes in very handy when dealing with obscure bugs.

Decompiling Functions

If we look at a normal function which converts a number to its square:
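The original listing is not reproduced here. A minimal sketch that is consistent with the decompiled output in this section, assuming the function is a val named square inside an object named RunExample (both names taken from that output), could look like this:

```scala
object RunExample {
  // A function value. Int => Int is shorthand for Function1[Int, Int].
  val square: Int => Int = x => x * x

  def main(args: Array[String]): Unit = {
    println(square(4)) // prints 16
  }
}
```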

If this code is compiled and then disassembled, we get:

public static scala.Function1<java.lang.Object, java.lang.Object> square();
0: getstatic // Field RunExample$.MODULE$:LRunExample$;
3: invokevirtual // Method RunExample$.square:()Lscala/Function1;
6: areturn

This shows that the original code is compiled into an instance of Scala’s Function1 trait. This trait extends AnyRef, the root class of all reference types. From this example, we can conclude that every function in Scala is an object of some form. In this case, it is an instance of a FunctionN trait.
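To make the "function as object" idea concrete, the sketch below writes the same function twice: once with function literal syntax and once as an explicit Function1 instance. The object and value names are hypothetical and only serve this illustration.

```scala
object FunctionAsObject {
  // Function literal syntax.
  val square: Int => Int = x => x * x

  // The same function written out as an explicit Function1 instance.
  val squareExplicit: Function1[Int, Int] = new Function1[Int, Int] {
    def apply(x: Int): Int = x * x
  }

  // A function value can be assigned to AnyRef because it is an ordinary object.
  val asObject: AnyRef = square

  // Being an object, it also has methods of its own, such as andThen.
  val squareThenIncrement: Int => Int = square.andThen(_ + 1)

  def main(args: Array[String]): Unit = {
    println(square(3) == squareExplicit.apply(3)) // prints true
    println(squareThenIncrement(3))               // prints 10
  }
}
```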

Let’s consider a test class with a function:
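The original listing is not reproduced here. As an assumption, a class along the following lines would match the file names discussed in this section: a class named Run holding one anonymous function of type Int => Int. The member names square and run are hypothetical.

```scala
class Run {
  // The anonymous function on the right-hand side is what the compiler
  // turns into a function object behind the scenes.
  val square: Int => Int = (x: Int) => x * x

  // A regular method that uses the function value.
  def run(n: Int): Int = square(n)
}
```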

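With Scala 2.11 and earlier, compilation generates two files: Run.class and Run$$anonfun$1.class. (From Scala 2.12 onward, function literals are compiled using invokedynamic, so a separate anonfun class file is normally not emitted.)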

The Run$$anonfun$1.class is the function class. If we decompile it we get:

public final class Run$$anonfun$1 extends scala.runtime.AbstractFunction1$mcII$sp implements scala.Serializable {
public static final long serialVersionUID;
public final int apply(int);
public int apply$mcII$sp(int);
public final java.lang.Object apply(java.lang.Object);
public Run$$anonfun$1(Run);

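The generated class is simple: it provides a constructor, the apply methods (including apply$mcII$sp, a variant specialized for Int arguments and results), and serialization support.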

The FunctionN Trait

In the previous section, we saw an example involving the Function1 trait. In Scala, there are traits from Function0 to Function22. Did the developers really write all of these traits by hand?

The answer is that they did not have to. The top of each source file carries this comment:

// GENERATED CODE: DO NOT EDIT. See scala.Function0 for timestamp.

This indicates that these sources are generated rather than written by hand. Each FunctionN trait is generic in its parameter and result types, much like a generic interface in Java, which is how it handles heterogeneous types.

Now, what if a function requires more than 22 parameters? This is a valid question, but in practice the limit of 22 is hard to reach. Even if this scenario occurs, it can be overcome by using nested tuples or other structures like HList.
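The nested tuple workaround can be sketched as follows. The example uses only six values to stay short, but the same grouping applies when the count would exceed 22. All names here are hypothetical.

```scala
object WideFunction {
  // Hypothetical groups of parameters bundled into tuples.
  type Position = (Double, Double, Double)
  type Velocity = (Double, Double, Double)

  // A Function1 whose single parameter is a nested tuple,
  // instead of a function with six separate parameters.
  val speedSquared: ((Position, Velocity)) => Double = {
    case (_, (vx, vy, vz)) => vx * vx + vy * vy + vz * vz
  }

  def main(args: Array[String]): Unit = {
    val position: Position = (0.0, 0.0, 0.0)
    val velocity: Velocity = (1.0, 2.0, 2.0)
    println(speedSquared((position, velocity))) // prints 9.0
  }
}
```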

Keep in mind, though, that the internal implementation of a function can change at any time, but the core concept of a function being an object will not.

Functions vs Methods

Functions and methods share most of their keywords and syntax, but they are not the same. The basic difference is that a method is not a value: it cannot be stored in a variable or passed to another function directly, and the compiler must first convert it into a function value (a step known as eta-expansion). Methods are part of object-oriented programming and differ from functions, as can be seen in the bytecode generated for each.
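The difference can be sketched with a method and a function that compute the same result. The names below are made up for this illustration.

```scala
object MethodVsFunction {
  // A method: a member of the enclosing object, not a value in itself.
  def squareMethod(x: Int): Int = x * x

  // A function: a value of type Int => Int that can be stored and passed around.
  val squareFunction: Int => Int = x => x * x

  // Eta-expansion: the trailing underscore wraps the method in a function value.
  val lifted: Int => Int = squareMethod _

  def main(args: Array[String]): Unit = {
    println(List(1, 2, 3).map(squareFunction)) // prints List(1, 4, 9)
    // Here the compiler eta-expands the method automatically.
    println(List(1, 2, 3).map(squareMethod))   // prints List(1, 4, 9)
    println(lifted(5))                         // prints 25
  }
}
```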

Interoperability with Object Oriented Programming

As discussed above, each function in Scala is an object, which makes it easy to combine with imperative or object-oriented code. For instance, consider our square function from before:

This function takes an integer as a parameter and returns an integer. It can also be called like a regular method. To the programmer it looks like a method, while behind the scenes it is actually a function object.
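A small sketch of this interplay: the call square(4) is shorthand for square.apply(4), and the function object can be handed to an ordinary class like any other constructor argument. The Transformer class and the object name are hypothetical.

```scala
// A plain object-oriented class that receives a function as a constructor argument.
class Transformer(op: Int => Int) {
  def applyTo(values: List[Int]): List[Int] = values.map(op)
}

object SquareInterop {
  val square: Int => Int = x => x * x

  def main(args: Array[String]): Unit = {
    println(square(4))       // reads like a method call, prints 16
    println(square.apply(4)) // what the call above expands to, prints 16
    println(new Transformer(square).applyTo(List(1, 2, 3))) // prints List(1, 4, 9)
  }
}
```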

From JVM Bytecode to Machine Code

Finally, after all bytecode is generated for your Scala program, all that is left is for the Just-In-Time (JIT) compiler to translate the bytecode into machine code, which is then fed into memory for execution. The JIT compiler is one of the many functional components of the JVM.

As the name implies, the JIT compiler runs on demand, and it avoids recompiling code it has already translated to machine code. Since the JIT compiler is smart enough to recognize when code is already compiled, applications tend to run faster the longer they run. This also means that code blocks which do not run often receive less optimization.
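One informal way to observe this warm-up effect is to time the same method over several rounds. The sketch below is a rough illustration rather than a proper benchmark: the timings depend on the machine and the JVM, so no particular numbers should be expected, and the names are hypothetical.

```scala
object JitWarmup {
  // A small CPU-bound method that the JIT compiler can optimize once it becomes hot.
  def sumOfSquares(n: Int): Long = {
    var total = 0L
    var i = 1
    while (i <= n) {
      total += i.toLong * i
      i += 1
    }
    total
  }

  def main(args: Array[String]): Unit = {
    // Later rounds are often faster than the first ones, but this is not guaranteed.
    for (round <- 1 to 10) {
      val start = System.nanoTime()
      val result = sumOfSquares(1000000)
      val micros = (System.nanoTime() - start) / 1000
      println(s"round $round: $micros microseconds (result $result)")
    }
  }
}
```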

An important thing to note is that bytecode generation is the same across all platforms, whereas machine code generation is not. This is an important consideration when optimizing not just Scala code, but also code in other languages that use the JVM and its JIT compiler, such as Java and Groovy.


Alessandro Heres — Github
Tristen Sprainis — Github
Ayushi Priyadarshi — Github | LinkedIn