Exploring the applicability of Kotlin for data science.

Joep Klein Teeselink
Sogeti Data | Netherlands

--

“A.I. is the new electricity”, a comparison made by Andrew Ng¹ that we’ve all seen thrown around quite frequently. With this statement, Ng asserts that A.I. will drive the next technological revolution just as electricity did in the last century. With an expected year-over-year growth of 15.2% for the A.I. market, his claim is backed up by the forecasts².


With the increase in demand, it becomes increasingly important to shorten the time between development and deployment, a well-known struggle within the field of data science. “The valley of death”, as it has been dubbed, is the phenomenon where ML models never make it to production. Exploration of data and early model development are often done in Python, a language that lends itself well to quick-and-dirty use but is rarely picked for product development. This blog explores what Kotlin, a modern, concise and safe programming language³, can offer the world of data science.


Kotlin was created by JetBrains, a company best known for its IDEs such as IntelliJ IDEA and PyCharm. JetBrains took a leap of faith in developing Kotlin, which saw its first stable release in 2016. It is designed to interoperate fully with Java, meaning you can write Kotlin and Java within the same project and reference objects across the two languages. Creating a new programming language is a difficult game to play, as you have to bet on developers, and even more so companies, to adopt it. Overall, Kotlin has been well received, and with Google adopting it for Android, it has gained a lot of traction in the last few years. This is an interesting turn of events, as changing the programming language for something as gigantic as Android is no easy, let alone cheap, task. If Google made the switch, then why shouldn’t you at least give it a try?

Reading, analyzing and manipulating data in Kotlin

In Python, people are used to the interactive freedom that Jupyter Notebook provides. A Kotlin kernel can be installed on top of Jupyter that lets you explore data with the same interactivity while also granting access to the power that Kotlin provides. When it comes to raw manipulations such as mapping and filtering, Kotlin contains an easy-to-use collections API that is well explained by Shubham Panchal in his Kotlin for ML article. There are, however, some features Shubham did not touch upon that I want to highlight. One example is the windowed function, which gives users an intuitive way to do windowed calculations on any sequence:
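A minimal sketch of windowed in action, computing a three-point moving average over made-up temperature readings:

```kotlin
fun main() {
    val temperatures = listOf(12.0, 14.0, 13.0, 15.0, 17.0, 16.0, 18.0)

    // Slide a window of size 3 over the list, one step at a time,
    // and average each window.
    val movingAverage = temperatures
        .windowed(size = 3, step = 1) { window -> window.average() }

    println(movingAverage)  // [13.0, 14.0, 15.0, 16.0, 17.0]
}
```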

Additionally, Kotlin has a smart way to extend an existing class with additional functionality without having to inherit from it. In the following example, an even() function is added to IntArray to select all the even numbers:
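A sketch of such an extension function (the even() name and behavior are illustrative):

```kotlin
// Extend IntArray with an even() function without inheriting from it.
fun IntArray.even(): List<Int> = this.filter { it % 2 == 0 }

fun main() {
    val numbers = intArrayOf(1, 2, 3, 4, 5, 6, 7, 8)
    println(numbers.even())  // [2, 4, 6, 8]
}
```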

An important thing to note is that all these functions can be chained together and, on a Sequence, operate in a ‘lazy’ fashion: elements are only processed once a terminal operation asks for them, without building intermediate collections. This allows you to transform, filter and collect data in a very intuitive manner. Streams have long existed in Java, but the Java API is much more verbose. Here you can find more examples that outline how Kotlin simplifies these operations.
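A short example of such a lazily evaluated chain using asSequence():

```kotlin
fun main() {
    val words = listOf("kotlin", "is", "concise", "and", "safe")

    // With asSequence() the map and filter steps are not evaluated
    // until toList() asks for results; each element flows through the
    // whole chain without intermediate collections being built.
    val result = words.asSequence()
        .map { it.uppercase() }
        .filter { it.length > 3 }
        .toList()

    println(result)  // [KOTLIN, CONCISE, SAFE]
}
```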

Apart from the core components, many tools have been built with Kotlin that come even closer to the “Pythonic” syntax. Most notable are the dataframe and krangl libraries, which allow for quick data wrangling in a fashion similar to Python’s pandas and R’s dplyr. On top of this there is lets-plot, which helps users visualize the data. An example notebook shows how well krangl & lets-plot work together. All of these tools are further elaborated on in an article written by Rishit Dagli. Later in this article, krangl is used to read a dataframe to train a model.

Models are not products

We’ve all seen the posts about which tools to use and which to avoid, which are more efficient and which will make your life better. In my opinion, these discussions all assume that solutions have to be found within the bounds of Python. The truth is that many of these tools (or similar ones) can also be found in other programming languages. If you drop the restriction to a single language and instead explore multiple programming languages, you can more easily bridge the gap between building models and bringing them to production. When converting a model into a product, scalability, testing and refactoring become bigger issues than the model itself. This is why statically typed languages are usually preferred in software development: they are safer to use. In contrast, data scientists prefer languages that are easy to learn, flexible and offer a quick turnaround. Whereas static typing commonly comes with verbosity, Kotlin manages to stay concise. Furthermore, using the same language for model and product development allows you to open up and present the inner workings of a model, a big advantage given the recent push for transparency and explainable A.I.

Building and training models with Kotlin

TensorFlow has long been available in Java (and thus Kotlin) but has been made even more accessible through KotlinDL, a Kotlin-based deep learning library inspired by Keras. Getting started with KotlinDL is low effort and only requires adding the library to your environment. When working with Gradle, this is as simple as:
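A minimal build.gradle.kts fragment along these lines (the 0.3.0 version number is an assumption; use the latest release):

```kotlin
// build.gradle.kts
repositories {
    mavenCentral()
}

dependencies {
    // KotlinDL high-level API (0.3.0 used here as an example version)
    implementation("org.jetbrains.kotlinx:kotlin-deeplearning-api:0.3.0")
}
```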

The first example focuses on the Iris data loaded with krangl and starts with some data exploration similar to pandas.
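A sketch of that exploration, assuming the irisData dataframe that krangl ships with:

```kotlin
import krangl.irisData

fun main() {
    // krangl bundles the classic Iris dataset as a built-in DataFrame.
    val df = irisData

    df.head().print()   // first rows, comparable to pandas' df.head()
    df.schema()         // column names and types, like df.info()
    println(df.nrow)    // number of rows
}
```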

Next, build a simple model using KotlinDL’s Sequential model builder:
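A sketch of such a model, assuming KotlinDL's Sequential.of builder, four input features and three output classes for Iris; the layer sizes are arbitrary:

```kotlin
import org.jetbrains.kotlinx.dl.api.core.Sequential
import org.jetbrains.kotlinx.dl.api.core.activation.Activations
import org.jetbrains.kotlinx.dl.api.core.layer.core.Dense
import org.jetbrains.kotlinx.dl.api.core.layer.core.Input

// A small fully-connected network: 4 input features,
// two hidden layers, 3 output classes (one per Iris species).
val model = Sequential.of(
    Input(4),
    Dense(outputSize = 64, activation = Activations.Relu),
    Dense(outputSize = 64, activation = Activations.Relu),
    Dense(outputSize = 3, activation = Activations.Linear)
)
```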

The next step is to create a dataset from the krangl dataframe. Although training a model on data extracted from a dataframe is not recommended in production, it is interesting to see how separate tools can be used in combination. The extractX() function is passed as an initializer, which allows the data to be loaded dynamically during batch creation. With this, the data is stored on the heap and becomes rapidly accessible.
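A sketch under these assumptions — the extractX()/extractY() helpers and the R-style column names are illustrative, and OnHeapDataset.create receives the extractors as function references:

```kotlin
import krangl.irisData
import org.jetbrains.kotlinx.dl.dataset.OnHeapDataset

// Illustrative helper: pull the four feature columns out of the
// krangl dataframe into plain float arrays.
fun extractX(): Array<FloatArray> =
    irisData.rows.map { row ->
        floatArrayOf(
            (row["Sepal.Length"] as Double).toFloat(),
            (row["Sepal.Width"] as Double).toFloat(),
            (row["Petal.Length"] as Double).toFloat(),
            (row["Petal.Width"] as Double).toFloat()
        )
    }.toTypedArray()

// Illustrative helper: encode the species label as a class index.
fun extractY(): FloatArray =
    irisData.rows.map { row ->
        when (row["Species"]) {
            "setosa" -> 0f
            "versicolor" -> 1f
            else -> 2f
        }
    }.toFloatArray()

val dataset = OnHeapDataset.create(::extractX, ::extractY)
```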

Finally, the model needs to be trained. By wrapping all the training code in a use block, the model’s resources are closed afterwards regardless of whether an exception is thrown:
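A sketch of the training step, assuming the model and dataset from the previous steps and KotlinDL 0.3's compile/fit/evaluate API:

```kotlin
import org.jetbrains.kotlinx.dl.api.core.loss.Losses
import org.jetbrains.kotlinx.dl.api.core.metric.Metrics
import org.jetbrains.kotlinx.dl.api.core.optimizer.Adam

// model and dataset come from the previous steps; hold out 20% for testing.
val (train, test) = dataset.split(0.8)

model.use {
    it.compile(
        optimizer = Adam(),
        loss = Losses.SOFT_MAX_CROSS_ENTROPY_WITH_LOGITS,
        metric = Metrics.ACCURACY
    )
    it.fit(dataset = train, epochs = 100, batchSize = 16)

    val accuracy = it.evaluate(dataset = test).metrics[Metrics.ACCURACY]
    println("Test accuracy: $accuracy")
}
// When the use block exits, the model's resources are released,
// even if an exception was thrown.
```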

Transfer learning and saving & loading models

Since version 0.3.0, KotlinDL supports transfer learning, which works as follows. Instead of freezing layers of the pre-trained network and retraining only the last (fully-connected) part, the head is cut off and connected to a new model that can be trained in isolation. First, a Preprocessing pipeline is created that contains the pre-trained ONNX model in the transform tensor step, so the pipeline outputs the pre-trained model’s prediction as its last step. The freshly created top model is then trained on that output. Additionally, the input has to be reshaped so the ONNX model can accept the images. In this example, the cats & dogs image data is used. With KotlinDL’s FromFolders class, image data can be loaded and labeled based on the folder each image is stored in.

Next, a new model is designed for training that uses the output dimensions of the Preprocessing pipeline as input dimensions:
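A sketch of such a top model; the 1000-dimensional input is an assumption standing in for whatever shape the pre-trained model actually emits:

```kotlin
import org.jetbrains.kotlinx.dl.api.core.Sequential
import org.jetbrains.kotlinx.dl.api.core.activation.Activations
import org.jetbrains.kotlinx.dl.api.core.layer.core.Dense
import org.jetbrains.kotlinx.dl.api.core.layer.core.Input

// The input size must match the output of the Preprocessing pipeline,
// i.e. the pre-trained model's prediction vector (1000 assumed here).
val topModel = Sequential.of(
    Input(1000),
    Dense(outputSize = 128, activation = Activations.Relu),
    Dense(outputSize = 2, activation = Activations.Linear)  // cats vs dogs
)
```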

Finally, the topModel is trained with an OnFlyDataset based on the Preprocessing pipeline we just created. The OnFlyDataset keeps the data on disk and generates batches on the fly during training.

If you are curious about training progress, insights can be generated by attaching the CustomCallBack class in the compile() function:
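A sketch of what such a callback could look like; the class, method and field names here assume KotlinDL 0.3's Callback API:

```kotlin
import org.jetbrains.kotlinx.dl.api.core.callback.Callback
import org.jetbrains.kotlinx.dl.api.core.history.EpochTrainingEvent
import org.jetbrains.kotlinx.dl.api.core.history.TrainingHistory

// A simple callback that logs loss and metric after every epoch.
class CustomCallBack : Callback() {
    override fun onEpochEnd(epoch: Int, event: EpochTrainingEvent, logs: TrainingHistory) {
        println("Epoch $epoch: loss=${event.lossValue}, metric=${event.metricValue}")
    }
}

// Attached when compiling, e.g.:
// model.compile(optimizer = Adam(), loss = ..., metric = ..., callback = CustomCallBack())
```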

Additionally, it is possible to save and load models so you can skip retraining every time you boot your software. The next Gist shows how to add a saving method as an extension to the Sequential class, which allows us to call the saveModel() function on any model we build. Furthermore, an example of custom model loading is shown.
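A sketch of such an extension, assuming KotlinDL's save/WritingMode API and TensorFlowInferenceModel for loading; the saveModel()/loadModel() names are illustrative:

```kotlin
import java.io.File
import org.jetbrains.kotlinx.dl.api.core.Sequential
import org.jetbrains.kotlinx.dl.api.core.WritingMode
import org.jetbrains.kotlinx.dl.api.inference.TensorFlowInferenceModel

// Extension function: every Sequential model now has saveModel().
fun Sequential.saveModel(path: String) =
    this.save(File(path), writingMode = WritingMode.OVERRIDE)

// Load the saved model back for inference instead of retraining.
fun loadModel(path: String): TensorFlowInferenceModel =
    TensorFlowInferenceModel.load(File(path))
```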

Conclusion

Now that you can read data and create, train, save and load models, the code can be used directly in product development. Since we are working in Kotlin, this can be anything from server systems (using Spring) to Android applications. Data scientists often find themselves frustrated trying to figure out how to properly use ambiguous functions with little documentation. Kotlin really eases this pain by being statically typed and providing all the needed info inside the IDE. As shown in this article, there are many data science projects in the Kotlin ecosystem. Finally, KotlinDL proves to be a very promising Keras-like deep learning API that was easy to pick up after years of using TensorFlow and Keras in Python. It is actively developed and promises Android support in the future, so there is something exciting to look forward to.

References

[1]: Andrew Ng, January 25, 2017, Stanford MSx Future Forum

[2]: IDC, August 4, 2021, IDC Forecasts Companies to Spend Almost $342 Billion on AI Solutions in 2021, https://www.idc.com/getdoc.jsp?containerId=prUS48127321

[3]: JetBrains, https://kotlinlang.org/
