Serialization challenges with Spark and Scala — Part 2 — Now for something really challenging…

Tim Gent
Published in ONZO Technology
Sep 19, 2018

Following on from the introductory post on serialization with Spark, this post gets right into the thick of it with a tricky serialization example. I highly recommend attempting to get this working yourself first. You’ll learn a lot!

Each example steps through some ways you might try to debug the problems, eventually arriving at a working solution. Once again the code samples can be found on ONZO’s GitHub, and the numbering in this article should match up with the code there :)

9 — base example

**FAILS**

Now for some practice! This example is relatively complex and needs a few changes to work successfully. Can you figure out what they are? Kudos if so! The next few examples walk through a solution step by step, and some things you may try.

10 — make classes serializable

**FAILS**

One approach to serialization issues can be to make everything Serializable. However, in this case you will find it doesn’t solve the issue. You’ll find it easier (but not that easy..!) to spot why if you look at the complete examples. When Spark tries to serialize the classes, it finds references to testRdd and to the shouldBe method. Both belong to the test class (you can see the full code on GitHub), so the test class has to be serialized as well, and it is not serializable.
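
The same effect can be reproduced without Spark. The sketch below is not the code from the repository: `Spec`, `WithSparkMap` and `isSerializable` are hypothetical stand-ins, and plain Java serialization is used as an approximation of what Spark does when it ships a closure. It shows a class that is marked Serializable but still fails, because it reads a value from a non-serializable enclosing class.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object OuterReferenceDemo {
  // Returns true when plain Java serialization of `obj` succeeds.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  // Hypothetical stand-in for the test class: deliberately NOT Serializable.
  class Spec {
    val data = List(1, 2, 3)

    // Marked Serializable, but it reads `data` from the enclosing Spec,
    // so every instance carries a hidden reference to that Spec.
    class WithSparkMap extends Serializable {
      def total: Int = data.sum
    }
  }

  def main(args: Array[String]): Unit = {
    val spec = new Spec
    println(isSerializable(new spec.WithSparkMap)) // false
  }
}
```

Marking `WithSparkMap` as Serializable is not enough here: the hidden reference to `Spec` is serialized along with it, and that is the part that fails.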

11a — use anon function

**FAILS**

In order to debug this you might try simplifying things by replacing the WithFunction class with a simple anonymous function. However, in this case we still have a failure. Can you spot the issue now?

11b — use anon function, with enclosing

**PASSES**

Did you spot it? By enclosing the reduceInts method, the map function can now access everything it needs within that one closure, so there is no need to serialize the other classes!

12a — use function with def

**FAILS**
Taking small steps, we now replace the anonymous function with a function declared with a def. Again you will find this fails, but seeing why isn’t easy. It comes down to the intricacies of how def works. Essentially, a method defined with def contains an implicit reference to this, which in this case is an object that can’t be serialized. The differences between def and val are worth reading up on if this is new to you.

12b — use function with val

**PASSES**
Declaring the method with val works. A val method equates to a Function1 object, which is serializable, and doesn’t contain an implicit reference to this, stopping the attempted serialization of the Example object.
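
To see the def versus val difference in isolation, here is a minimal sketch that again uses plain Java serialization as a stand-in for Spark. The `Example` class, its member names and the `isSerializable` helper are assumptions made for illustration, not the code from the repository, and a class is used instead of an object to keep the sketch self-contained.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object DefVsValDemo {
  // Returns true when plain Java serialization of `obj` succeeds.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  // Hypothetical stand-in for Example: deliberately NOT Serializable.
  class Example {
    def addOneDef(i: Int): Int = i + 1             // a method, tied to `this`
    val addOneVal: Int => Int = (i: Int) => i + 1  // a standalone Function1 value

    // Using a method where a function is expected wraps it in a function
    // that calls this.addOneDef, so `this` is captured.
    def fromDef: Int => Int = addOneDef
    // The val is already a function and refers to nothing in Example.
    def fromVal: Int => Int = addOneVal
  }

  def main(args: Array[String]): Unit = {
    val example = new Example
    println(isSerializable(example.fromDef)) // false
    println(isSerializable(example.fromVal)) // true
  }
}
```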

12c — use function with val explained part 1

**FAILS**
This example serves to illustrate the point more clearly. Here the addOne function references the one value. As we saw earlier, this will cause the whole Example object to be serialized, which will fail.

**BONUS POINTS**
One helpful experiment to try here is to resolve this by making the Example object serializable.
You will note that you still get a serialization error. Can you see why? There are actually 2 reasons:
1) testRdd is referenced inside the WithSparkMap class, so Spark tries to serialize the whole Spec (please see the full code on GitHub, which will explain this more!)
2) The shouldBe method is also referenced, which again leads to Spark trying to serialize the whole Spec

12d — use function with val explained part 2

**PASSES**
As above, the best way to fix the issue is to reference values only in the more immediate scope. Here we have added oneEnc, a local copy of one, which prevents the serialization of the whole Example object.
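
The contrast between 12c and 12d can be sketched without Spark as follows. This is an illustration under assumptions rather than the repository code: `Example` is written as a non-serializable class, the method names are made up, and Java serialization stands in for Spark's closure shipping.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object LocalCopyDemo {
  // Returns true when plain Java serialization of `obj` succeeds.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  // Hypothetical stand-in for Example: deliberately NOT Serializable.
  class Example {
    val one = 1

    // Reads this.one inside the function body, so `this` is captured.
    def addOneFromField: Int => Int = (i: Int) => i + one

    // Copies the value into a local first; only that Int is captured.
    def addOneFromCopy: Int => Int = {
      val oneEnc = one
      (i: Int) => i + oneEnc
    }
  }

  def main(args: Array[String]): Unit = {
    val example = new Example
    println(isSerializable(example.addOneFromField)) // false
    println(isSerializable(example.addOneFromCopy))  // true
  }
}
```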

13 — back to the problem, no class params

**PASSES**
Let’s come back to the issue we originally had. Now that we understand a little more, we can bring our WithFunction class back in. To simplify things we've taken out the constructor parameter here. We're also using a val for the method rather than a def. No serialization issues now!

14 — back to the problem, with class params

**FAILS**
We’ve now added back in the class params. Can you spot why this fails? The plusOne function references num, which is outside of the immediate scope, again causing more objects to be serialized, which fails.

15a — back to the problem, with class params, and enclosing

**PASSES**
This is now a simple fix, and we can enclose the num value with encNum which resolves the last of our serialization issues. Finally, this is a complete working example that is equivalent to our first implementation that failed!

15b — adding some complexity e.g. 7b — testing understanding

**FAILS**
One more failing example! Can you see why the above fails?

The issue is that encNum won’t be evaluated until plusOne is actually called, which effectively happens within the map function. At that point the num value needs to be accessed, which means the containing object has to be serialized too, and that causes the failure here.
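
The difference between 15a and 15b comes down to where the copy is taken. The sketch below is an approximation of that idea, not the original listing: `WithFunction` is assumed to be a non-serializable class with a `num` parameter, the two val names are invented, and Java serialization stands in for Spark.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object EagerVsLazyCopyDemo {
  // Returns true when plain Java serialization of `obj` succeeds.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  // Hypothetical stand-in for WithFunction: deliberately NOT Serializable.
  class WithFunction(num: Int) {
    // Copy taken when the val is initialised, before the function exists.
    // The function captures only the Int in encNum.
    val plusOneEager: Int => Int = {
      val encNum = num
      (i: Int) => i + encNum
    }

    // Copy taken inside the function body, so it runs only when the
    // function is called and still has to read num from `this`.
    val plusOneLate: Int => Int = (i: Int) => {
      val encNum = num
      i + encNum
    }
  }

  def main(args: Array[String]): Unit = {
    val withFunction = new WithFunction(1)
    println(isSerializable(withFunction.plusOneEager)) // true
    println(isSerializable(withFunction.plusOneLate))  // false
  }
}
```

Both functions return the same results when called locally. They differ only in what they drag along when serialized.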

Conclusion

Hopefully these examples have made it a little clearer how serialization of functions works with Spark and Scala. Good luck with your Spark serialization challenges!
