The Scala Advantage in Data Engineering

Discover how Scala outperforms other languages in data engineering with Apache Spark, the ultimate big data processing engine.

Henri Happonen
5 min read · Sep 3, 2023
A street in London’s Fitzrovia — where every data engineer dreams of living!

Ever found yourself browsing through a Data Engineering job posting and wondered, “Why on earth do they want me to know Scala?” Or perhaps you’ve mused, “What even is Scala, anyway?”

Allow me to explain…

Birth of Scala

Scala is a high-level, statically typed, general-purpose programming language. Created by Martin Odersky and first released in 2004 in response to criticisms of Java, Scala compiles to Java bytecode and runs on the Java Virtual Machine (JVM). Because of this, Scala is fully interoperable with Java libraries, which can be referenced directly in code. Scala’s like that cool bilingual friend who effortlessly switches between languages.
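
That interoperability is not just a slogan. Here’s a minimal sketch (the object name is mine) showing plain Java standard-library classes used directly from Scala, with no wrappers or bindings:

```scala
// Java standard-library classes behave like native Scala classes.
import java.time.LocalDate
import java.util.UUID

object JavaInteropDemo extends App {
  val id: UUID         = UUID.randomUUID()  // a Java class, instantiated from Scala
  val today: LocalDate = LocalDate.now()    // another one
  println(s"Run $id started on $today")     // Scala string interpolation around Java values
}
```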

Scala’s mission? To eliminate boilerplate code. Java likes to babble on and on, but Scala believes in keeping it concise. And fewer lines of code mean quicker development and speedier deliveries. Scala even spices things up by blending functional programming concepts with Java’s object-oriented nature. It’s like the peanut butter and jelly of programming paradigms.
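
To get a feel for the difference, here’s a small, hypothetical example: one Scala case class doing the work that would take a page of Java.

```scala
// One line: constructor, fields, getters, equals, hashCode, toString and copy, all generated.
case class User(name: String, email: String)

// The Java equivalent needs a constructor, two private fields, two getters,
// plus hand-written (or IDE-generated) equals, hashCode and toString:
// easily 30+ lines of boilerplate for the same behaviour.

object BoilerplateDemo extends App {
  val u = User("Ada", "ada@example.com")
  println(u)                        // User(Ada,ada@example.com)
  println(u.copy(name = "Grace"))   // copy with one field changed, no setters needed
}
```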

While Scala might seem deceptively simple, it introduces some complex features that can leave even seasoned programmers scratching their heads. With fewer lines of code, you need to be better at working out what the code actually does instead of just following it line by line. It’s all about predicting what’s happening under the hood.
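
Implicits are the classic head-scratcher. A small sketch (entirely made up) of why Scala code can do more than it appears to say:

```scala
object ImplicitDemo extends App {
  // An implicit class quietly adds methods to a type it doesn't own.
  implicit class RichInt(val n: Int) extends AnyVal {
    def times(action: => Unit): Unit = (1 to n).foreach(_ => action)
  }

  // Nothing at the call site says where `times` comes from: the compiler
  // finds RichInt in scope and rewrites the call behind your back.
  3.times(println("hello"))
}
```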

So, there you have it, a sneak peek into the world of Scala. Let’s dive deeper.

It’s all about Spark

In the realm of data engineering, one framework stands out, and most Data Engineers nod in agreement: Apache Spark reigns supreme in the kingdom of big data processing (well, for now at least — data tech evolves faster than a chameleon at a disco). Think of it as the Gandalf of open-source frameworks, wielding its magic for massive data munching.

Spark dishes out parallel computing on a silver platter, served with a side of simplicity via a SQL-like interface. It’s so versatile that it can be used with multiple programming languages: Scala, Java, Python, and even R!
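
Here’s what that SQL-like interface looks like from the Scala side; a minimal sketch with made-up data, assuming Spark is on the classpath and running locally:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object SparkTaste extends App {
  val spark = SparkSession.builder()
    .appName("taste-of-spark")
    .master("local[*]")            // run locally, using every core
    .getOrCreate()
  import spark.implicits._

  val rides = Seq(("London", 3.2), ("London", 7.5), ("Helsinki", 5.1))
    .toDF("city", "distance_km")

  // Reads almost like SQL, but it's ordinary Scala and runs in parallel.
  rides.groupBy("city")
    .agg(avg("distance_km").as("avg_km"))
    .show()

  spark.stop()
}
```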

And, as you may have guessed, Spark’s codebase is crafted in Scala. But just because a library is conceived in one language, it doesn’t mean that the same language is the automatic go-to for writing applications that leverage it. Much of the internet’s foundational plumbing (routers, protocol implementations, servers, etc.) is written in C, yet you wouldn’t exactly declare C the undisputed champion for building modern, internet-savvy applications, would you?

Why Scala over other languages?

So, why would you choose Scala over other languages like Python, the friendly snake of the programming world, for your Apache Spark endeavours? While Python has a bustling online community, is as easy to learn as your favourite TikTok dance, is quick to write, and is packed with great libraries, Scala still holds its crown in the realm of data engineering.

Here are 7 reasons why:

  1. Speed. Because Spark is implemented in Scala, nearly all data moving between the JVM and Python must traverse a socket to a separate Python worker process, with the Py4J library bridging the two on the driver side. While this translation tango is relatively efficient, it’s not free. So, in the grand race of Spark code execution, Python finds itself just a smidge slower than its Scala and Java counterparts.
  2. Latest features. When new features are introduced in Spark, it’s generally Scala that gets the first slice of the cake. To stay on the cutting edge, you’ll want to speak the same JVM language as Spark.
  3. Prototyping. For those who love to tinker and experiment, Scala comes with a nifty tool called the REPL (Read-Eval-Print Loop). It allows for interactive, step-wise coding and debugging, without having to write and compile a complete program (there’s a quick taste of it after this list). Java tried to catch up with JShell in Java 9, but it wasn’t until Java 10 and the introduction of the reserved type name ‘var’ that it became more user-friendly. Still, if you’re using Spark 2 with Java 8, you miss out on the REPL fun. So, Scala scores half a point for this one, along with Python and R.
  4. Error messages. Scala doesn’t just talk the talk; it walks the walk when it comes to error messages. Python has a long history of playing hide and seek with accurate traceback logs using Spark, and you might sometimes find yourself reading Scala/Java logs anyway. Admittedly, since Spark 3.0, Python error messages have been greatly improved but JVM languages still shine in this department.
  5. UDFs and UDAFs. User-defined functions and user-defined aggregate functions are the superheroes of custom functionality in Spark. Scala’s got the upper hand here because data doesn’t need to leave the JVM for these functions to work efficiently (see the UDF sketch after this list). Python, with its non-JVM status, misses out on this performance boost.
  6. Dataset API. If you’re into type safety and compile-time assurance, Scala’s got your back with Spark’s Dataset API (a sketch follows the list too). Sorry, Python and R, but this party is exclusive to JVM languages, leaving you with just the DataFrame API (and RDDs if that’s your cup of tea).
  7. JARs. Both Scala and Java applications compile down to nifty little packages called Java Archive files, or simply JARs. A JAR bundles all the compiled classes for your application (and, if you build an assembly or ‘fat’ JAR, its dependencies too) into a single file. When submitting a Spark command to your cluster manager, you can just reference a single file name and a class to run your app. Did you catch that, Python enthusiasts? We’re talking about a one-file wonder! Once you’ve savoured this simplicity, you’ll think twice about wrestling with multiple Python files and dependency packages.
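
As promised, here’s roughly what that REPL experience (point 3) looks like; the output lines are illustrative and vary a little between Scala versions:

```scala
// $ scala          (or `spark-shell`, the same REPL with a SparkSession preloaded)
scala> val xs = List(1, 2, 3)
val xs: List[Int] = List(1, 2, 3)

scala> xs.map(_ * 2).sum
val res0: Int = 12
```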
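
And the UDF point (number 5) in code: a minimal sketch that assumes an existing SparkSession in scope as `spark`. Because the function is plain Scala, it executes inside the JVM right next to the data, with no round-trip to a separate process:

```scala
import org.apache.spark.sql.functions.udf

// assumes an existing SparkSession in scope as `spark`
import spark.implicits._

// A plain Scala function wrapped as a UDF: rows never leave the JVM.
val initials = udf((name: String) => name.split(" ").map(_.head).mkString("."))

val people = Seq("Ada Lovelace", "Grace Hopper").toDF("name")
people.withColumn("initials", initials($"name")).show()
```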
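
Finally, the Dataset API’s compile-time safety from point 6, sketched under the same `spark` assumption:

```scala
// assumes an existing SparkSession in scope as `spark`
import spark.implicits._

case class Ride(city: String, distanceKm: Double)

val rides = Seq(Ride("London", 3.2), Ride("Helsinki", 5.1)).toDS()

// `r` is a Ride, so a typo like `r.distanceXm` fails at compile time.
// With a DataFrame (or in Python), the same mistake only surfaces at runtime.
rides.filter(r => r.distanceKm > 4.0).show()
```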

But is Scala worth the extra effort?

Undoubtedly, Scala demands a bit more elbow grease. It involves mastering build tools, dealing with extra lines of code compared to Python, and carries a reputation for a steeper learning curve.

Consider this: when we talk about performance differences, we’re not discussing life-changing leaps. Unless you’re running data operations on the scale of tech giants like Meta, Apple, Netflix, or Google, optimising your data engineering operations to save a few percentage points in cloud computing costs might be like hunting mosquitoes with a bazooka. Smaller companies already fluent in Python shouldn’t switch to Scala just to shave off 5% of their data engineering costs unless they’re dealing with data the size of a small planet.

Conclusion

In the dynamic world of data engineering, choosing the right tool for the job can make all the difference. Scala, with its blend of conciseness and power, offers a unique edge when it comes to harnessing the full potential of Apache Spark. While Python has its merits and a flourishing community, Scala’s roots in Spark provide a compelling case for those looking to supercharge their big data processing endeavours.

So, whether you’re a seasoned coder or just embarking on your programming journey, keep Scala on your radar. It might just be the secret ingredient that elevates your data engineering game. Happy coding, and may your data always flow swiftly and smoothly!
