Scala on AWS Lambda — Benchmarking Serialization and Deserialization

Thomas Bach · Jan 31, 2018

At Midas we decided early on to use Scala as the programming language in our back-end. Our services generally follow a micro-service approach. So, when it came to evaluating the different platforms these services could run on, it pretty much boiled down to AWS in general and AWS Lambda for our back-end in particular.

Just to be on the same page, here is how Wikipedia describes Lambda:

AWS Lambda is an event-driven, serverless computing platform provided by Amazon as a part of the Amazon Web Services. It is a compute service that runs code in response to events and automatically manages the compute resources required by that code.

This sums it up pretty well. At this point, the relevant part for us is: “runs code in response to events.” When we combine this with AWS API Gateway, the event is usually a call to a RESTful API. AWS will then fire up our handler on Lambda, pass it the arguments sent to the API and give the caller back whatever our function returns.
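Concretely, a do-nothing handler wired up this way could look like the following sketch (the class name EchoHandler is made up for illustration):

import com.amazonaws.services.lambda.runtime.{Context, RequestHandler}

// Minimal sketch: AWS deserializes the request body into the input type
// and serializes whatever handleRequest returns back to the caller.
class EchoHandler extends RequestHandler[String, String] {
  def handleRequest(input: String, context: Context): String =
    s"received: $input"
}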

How to get things in and out?

When running Scala on AWS Lambda we basically have two ways to get and return objects:

  1. either we let AWS handle serialization and deserialization via so-called Plain Old Java Objects (POJOs); or
  2. we serialize and deserialize the objects ourselves via our JSON library of choice, e.g. Jackson.

Both of these options have their pros and cons. In order to see these, let’s have a look at some code. Our functions shall have a pretty easy job: return the size of a list containing User objects. Here is the implementation of User:

import java.time.LocalDateTime

case class UserId(id: String)

case class User(
  id: UserId,
  phone: String,
  signupTimestamp: LocalDateTime
)

and here are the handlers:

import java.io.{InputStream, OutputStream}
import java.util.{List => JavaList}

import com.amazonaws.services.lambda.runtime.{Context, RequestHandler, RequestStreamHandler}
import com.fasterxml.jackson.core.`type`.TypeReference

import scala.collection.JavaConverters._

// AWS deserializes the request into the JavaList[UserPojo] for us.
class PojosDeserHandler
    extends RequestHandler[JavaList[UserPojo], Int] {
  def handleRequest(
    req: JavaList[UserPojo], c: Context
  ): Int = {
    req.asScala.map(UserPojo.toCaseClass).toList.size
  }
}

// With a stream handler, the JSON (de-)serialization is on us.
class JsonDeserHandler
    extends RequestStreamHandler {
  def handleRequest(
    is: InputStream, os: OutputStream, c: Context
  ): Unit = {
    // A TypeReference keeps the element type despite erasure;
    // classOf[List[User]] alone would lose it.
    val users = mapper.readValue(is, new TypeReference[List[User]] {})
    // Write the result back as JSON rather than as a single raw byte.
    os.write(mapper.writeValueAsBytes(users.size))
  }
}

The JSON route is pretty straightforward. You instantiate an ObjectMapper, read the InputStream coming from AWS, do the processing and write your result back to the OutputStream. It couldn't be easier, right?
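For reference, a minimal mapper setup might look like the following sketch; the Json object name and the use of the jackson-module-scala and jackson-datatype-jsr310 add-ons are assumptions about the setup, not taken from the original code:

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.datatype.jsr310.JavaTimeModule
import com.fasterxml.jackson.module.scala.DefaultScalaModule

object Json {
  // DefaultScalaModule handles Scala collections and case classes,
  // JavaTimeModule handles java.time types such as LocalDateTime.
  val mapper: ObjectMapper = new ObjectMapper()
    .registerModule(DefaultScalaModule)
    .registerModule(new JavaTimeModule)
}

The handlers above can then simply import Json.mapper.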

But now, what’s this UserPojo thing in the PojosDeserHandler? Well, AWS won't construct a proper Scala object for us. It actually assumes you are running Java, i.e. it expects a class following the Java Bean convention with a no-argument default constructor. Here it is:

import scala.beans.BeanProperty

class UserIdPojo {
  @BeanProperty
  var id: String = null
}

object UserIdPojo {
  implicit def toCaseClass(x: UserIdPojo): UserId =
    UserId(id = x.id)

  implicit def toPojo(x: UserId): UserIdPojo = {
    val u = new UserIdPojo
    u.id = x.id
    u
  }
}

class UserPojo {
  @BeanProperty
  var id: UserIdPojo = null
  @BeanProperty
  var phone: String = null
  @BeanProperty
  var signupTimestamp: LocalDateTimePojo = null
}

object UserPojo {
  // The implicit conversions on the companion objects kick in for the
  // id and signupTimestamp fields below.
  implicit def toCaseClass(x: UserPojo): User =
    User(
      id = x.id,
      phone = x.phone,
      signupTimestamp = x.signupTimestamp
    )

  implicit def fromCaseClass(x: User): UserPojo = {
    val u = new UserPojo()
    u.id = x.id
    u.phone = x.phone
    u.signupTimestamp = x.signupTimestamp
    u
  }
}

Yep, you got that right! That’s about 40 lines of ugly boiler-plate code. Plus, we need extra code when we want proper equals methods and a nicer toString. And even then we're not done! You see this LocalDateTimePojo lurking around in the code? That's additional boiler-plate we'll have to write in order to properly (de-)serialize LocalDateTime objects. (As of this writing, getting AWS to properly serialize and deserialize time and date objects is pretty hard. Neither java.time, used here, nor the de-facto standard Joda-Time instances worked to our satisfaction.) We could just as well code in Java, right? Just kidding!
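That class is omitted here, but following the same pattern as UserIdPojo it might look roughly like this; representing the timestamp as an ISO-8601 string is an assumption, not necessarily what we ended up with:

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

import scala.beans.BeanProperty

// Hypothetical wrapper: AWS sees a Bean with a String field, the
// implicits convert to and from LocalDateTime.
class LocalDateTimePojo {
  @BeanProperty
  var value: String = null
}

object LocalDateTimePojo {
  implicit def toLocalDateTime(x: LocalDateTimePojo): LocalDateTime =
    LocalDateTime.parse(x.value, DateTimeFormatter.ISO_LOCAL_DATE_TIME)

  implicit def fromLocalDateTime(x: LocalDateTime): LocalDateTimePojo = {
    val p = new LocalDateTimePojo
    p.value = x.format(DateTimeFormatter.ISO_LOCAL_DATE_TIME)
    p
  }
}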

What could possibly speak in favour of this solution? The fact that we can put more logic into the type system! Doesn’t def handleRequest(req: JavaList[UserPojo], c: Context): Int tell you so much more than def handleRequest(is: InputStream, os: OutputStream, c: Context): Unit?

Of course, this also implies that the compiler can properly type-check our system. What if the authors of the Scala collection library all of a sudden decide to let List.size return the element in the middle of the list? Yeah, I know: sounds silly. But when you are refactoring things deep down in your code base, chances are high that you are going to forget about the high-level handlers.

Anyway, neither option is optimal. So, how are we going to decide which route to follow? Benchmarks!

What and how

You’ve already seen the handlers to benchmark deserialization above. Here are the handlers to benchmark serialization:

// Same imports as for the deserialization handlers, plus ScalaCheck:
import org.scalacheck.Arbitrary.arbitrary
import org.scalacheck.Gen

class PojoSerHandler
    extends RequestHandler[Int, JavaList[UserPojo]] {
  def handleRequest(
    req: Int, context: Context
  ): JavaList[UserPojo] = {
    // Generate req random users (needs an implicit Arbitrary[User] in
    // scope, see below) and let AWS serialize the POJO list.
    val users = Gen.listOfN(req, arbitrary[User]).sample.get
    users.map(UserPojo.fromCaseClass).asJava
  }
}

class JsonSerHandler
    extends RequestStreamHandler {
  def handleRequest(
    is: InputStream, os: OutputStream, c: Context
  ): Unit = {
    val req = mapper.readValue(is, classOf[Int])
    val users = Gen.listOfN(req, arbitrary[User]).sample.get
    // Serialize the list ourselves with Jackson.
    os.write(mapper.writeValueAsBytes(users))
  }
}

It is kind of the other way around, right? We receive a number n, produce a list of n User instances and send it back. The Gen.listOfN and arbitrary methods come from ScalaCheck.
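For arbitrary[User] to compile, an implicit Arbitrary[User] has to be in scope. The original instances aren't shown here, but a minimal sketch could look like this; the concrete generators (UUID strings for ids, digit strings for phone numbers, timestamps near the epoch) are assumptions:

import java.time.{LocalDateTime, ZoneOffset}

import org.scalacheck.{Arbitrary, Gen}

object UserGenerators {
  implicit val arbUserId: Arbitrary[UserId] =
    Arbitrary(Gen.uuid.map(u => UserId(u.toString)))

  implicit val arbUser: Arbitrary[User] =
    Arbitrary(for {
      id    <- Arbitrary.arbitrary[UserId]
      phone <- Gen.numStr
      secs  <- Gen.choose(0L, 1000000000L)
    } yield User(id, phone, LocalDateTime.ofEpochSecond(secs, 0, ZoneOffset.UTC)))
}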

On the calling side, we mirror the handlers above and build the calls around APIGatewayClient.testInvokeMethod of the AWS SDK for Java. We wrap each call in System.nanoTime in order to measure the round-trip time, and repeat that 500 times for each n, where n is the size of the List[User]. We let n run from 10 up to 100,000 in the deserialization benchmark and up to 15,000 in the serialization benchmark. (Apparently AWS doesn't like very large requests.)

Additionally, a warm-up phase is executed in which each n is repeated five times.
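Stripped to its essence, the measurement loop might look like the sketch below; the client construction, the restApiId/resourceId parameters and the Benchmark object name are placeholders for a concrete deployment, not the original harness:

import com.amazonaws.services.apigateway.AmazonApiGatewayClientBuilder
import com.amazonaws.services.apigateway.model.TestInvokeMethodRequest

object Benchmark {
  private val client = AmazonApiGatewayClientBuilder.defaultClient()

  // Time a single test invocation of the deployed method.
  def roundTripNanos(restApiId: String, resourceId: String, body: String): Long = {
    val request = new TestInvokeMethodRequest()
      .withRestApiId(restApiId)
      .withResourceId(resourceId)
      .withHttpMethod("POST")
      .withBody(body)
    val start = System.nanoTime()
    client.testInvokeMethod(request)
    System.nanoTime() - start
  }

  // 500 timed round trips for one payload, as in the benchmark.
  def measure(restApiId: String, resourceId: String, body: String): Seq[Long] =
    Seq.fill(500)(roundTripNanos(restApiId, resourceId, body))
}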

Results

Let’s first have a look at deserialization. Here are the results plotted for JSON and POJO.

[Figures: deserialization round-trip times over n for the JSON and POJO handlers]

In both cases there is a peak for small n when looking at the maximums. This might be due to not letting the warm-up phase iterate often enough. The other numbers aren’t very surprising: the time simply grows with n. AWS is slightly faster, but not as much faster as you’d expect. You could probably even beat AWS with a library tailored to your needs.

Let’s look at serialization now. This is funky!

[Figures: serialization round-trip times over n for the JSON and POJO handlers]

Yep! You got that right! Amazon’s timings are so good that you can hardly plot them on the same scale as the JSON timings. Let’s have a look at a scatter plot:

[Figure: scatter plot of the POJO serialization timings]

This is strange, right? It doesn’t grow with n at all; it stays constant. Looks like Amazon did a pretty good job parallelizing the serialization of lists.

Conclusion

Are you going to send huge lists of objects over the wire? Use POJOs! Note that this only pays off when your lists hold more than 10,000 items; most projects won’t have that requirement.

Getting the POJOs right in all aspects is tiresome. There are quite a lot of corner cases to think about, especially when dealing with time and date objects. So, for small projects or prototypes you should probably stick to a JSON library you are already familiar with.

I’d recommend investigating POJOs for long-term projects with a fast-growing code base. Expressing logic via types is a huge benefit there, and AWS is blazingly fast when it comes to serializing your output.
