Testing polyglot libraries with GitHub Actions

Davide Fantuzzi
LARUS
4 min read · Mar 16, 2021

This is the story of how we managed to build a CI pipeline to test the official Neo4j Connector for Apache Spark with Scala, Java, and Python just by using GitHub Actions.

What are we working with?

My team at Larus developed the Neo4j Connector for Apache Spark, a library that allows Spark to write to and read from the Neo4j graph database.
If you don’t know Apache Spark, it’s a distributed computing framework and, among its many features, it’s also polyglot: Spark supports Scala, Java, Python (through the PySpark library), and R (through SparkR).

Spark has APIs that allow developers to write their own connectors to exchange data with it, but in the past the developer had to take care of the polyglot part, providing support for each language by developing separate libraries or resorting to some kind of trickery.

Starting from the Spark DataSource API V2, everything changed! Those APIs are simply amazing from this point of view: the developer only has to create one library written in Scala/Java, and Spark takes care of the polyglot part. Isn’t that amazing?

So are we done already?

Can we ever say we’re done when developing software, though?

We already have the whole connector covered with test suites in Scala, both unit and integration tests. But we are test freaks, and we believe in the good practice of testing everything!

What’s left to test?

In the Neo4j Connector we obviously rely heavily on Spark types to make everything work. The foundation of this feature consists of two type-mapping functions that convert a Spark type into a Neo4j type and vice versa.

Of course, when using different languages there is an additional level of mapping: from the language to Spark, from Spark to Neo4j, and back.

Lots of type conversions involved.

We have a test suite for testing these conversions from and to Java and Scala, but we were missing conversions for Python and R.

This is basically what we wanted to achieve:

Basic CI pipeline for polyglot libraries.

We didn’t forget about R. We just decided to start testing Python first. R will come next and will be an even bigger challenge!

Writing the tests was no big deal: we leveraged testcontainers for Python (a library that lets you use Docker containers in your tests) and got a nice integration test suite for our type mappings.

Highlights of our Python tests.

We inject our pre-packaged Scala library into Spark and then simply run our tests from the shell.

Well congrats for teaching us the obvious.

Is that it?

Of course not! Now comes the juicy part.

Since we are lazy, we wanted to automate the execution of the Python tests in the CI pipeline. Our previous CI didn’t give us that much flexibility, so we decided to switch to GitHub Actions.

GitHub Actions are sets of jobs and steps that can be triggered on any GitHub event, like a push or a pull request. You can execute commands, build, deploy, and many other things to handle your CI/CD pipelines with ease.

Among other things, this system lets you run tests on different environments. Exactly what we needed!

For Java and Scala everything was pretty straightforward.

“Light” version of our Java/Scala CI action.

One job to test them all.

Notice how GitHub Actions let you specify a version matrix. The job is executed once for each combination of the values in the matrix, which lets us test our library against all the supported versions of Spark, Scala, and Neo4j.
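For reference, a minimal version-matrix workflow can be sketched like this; the version numbers, action revisions, and Maven profiles are placeholders, not our exact configuration:

```yaml
name: CI
on: [push, pull_request]

jobs:
  java-scala-tests:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        spark: ["2.4.5", "3.0.1"]
        scala: ["2.11", "2.12"]
        neo4j: ["3.5", "4.0"]
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-java@v1
        with:
          java-version: "8"
      # One run of this job is spawned per matrix combination
      - name: Run unit and integration tests
        run: mvn test -Pscala-${{ matrix.scala }} -Pspark-${{ matrix.spark }} -Dneo4j.version=${{ matrix.neo4j }}
```

GitHub Actions also supports `exclude` entries in the matrix to drop invalid combinations (for example, a Scala version a given Spark release doesn’t support).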

What about Python?

The tricky part with the Python tests was injecting the correct JAR into Spark, so we simply package the JAR before running the tests.

“Light” version of our action file for the Python tests.

Once again we test against all the supported versions of Spark, Neo4j, and Python, and we pass them to the test script as arguments.
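That part of the workflow can be sketched roughly as follows; again, the versions, action revisions, and the script name are placeholders rather than our actual files:

```yaml
  python-tests:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python: ["3.7", "3.8"]
        spark: ["3.0.1"]
        neo4j: ["3.5", "4.0"]
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python }}
      # Package the Scala connector JAR first, so the Python tests can inject it
      - name: Package the connector
        run: mvn package -DskipTests
      # The matrix versions are forwarded to the test script as arguments
      - name: Run Python tests
        run: ./run-python-tests.sh ${{ matrix.spark }} ${{ matrix.neo4j }}
```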

We had to tweak our test script a bit to be able to leverage the version matrix.

Tweaked version of our test script.
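The gist of the tweak is just argument handling: the script accepts the versions coming from the matrix instead of hardcoding them. A hedged Python sketch, with hypothetical flag names:

```python
# Hypothetical sketch: read the Spark/Neo4j versions the CI matrix passes in,
# so the same script works both in the pipeline and locally (via defaults).
import argparse


def parse_versions(argv=None):
    """Parse the versions forwarded by the workflow matrix."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--spark-version", default="3.0.1")
    parser.add_argument("--neo4j-version", default="4.0")
    return parser.parse_args(argv)


def neo4j_image(version: str) -> str:
    """Pick the Neo4j Docker image tag the testcontainer should start."""
    return f"neo4j:{version}"
```

The workflow would then invoke the suite with something like `--spark-version 3.0.1 --neo4j-version 4.0`, and the tests pick the matching PySpark dependency and Neo4j Docker image.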

And here’s what a nice and safe PR looks like!

So nice to see 53 successful checks.

It’s also really easy to inspect the logs from the “Checks” tab on GitHub, without leaving the site.

This is the easiest way we found to implement tests for polyglot libraries! We hope this article will help those of you who had given up on testing multiple languages in the same CI pipeline without resorting to magic tricks.

You can find the full code of our GitHub Actions in our repository. And while you’re there, please give the Neo4j Connector for Apache Spark a try! Here’s a repository with everything you need to get started.
