Serialization Techniques for Highly Scalable systems

When building a architecture(microservice or any other type) most common approach to send/share data objects are common messaging queue and HTTP APIs , which is exposed and consumed by each element of the project.

For highly performant traffic the need arises it to be fast, scalable and interoperable. Sending Java Serializable Object/JSON/XML values over the network seemed to be a great idea earlier, as most of the language provided encoding/decoding out of the box for each end of the communication stream but it is no longer viable for huge scale.

The thing is, that if you take a look at how long the process is or the size of the object, you will be shocked. It’s not uncommon for high traffic, always available & slightly complicated systems to spend most of their time on JSON parsing (except for I/O calls).

When comparing serialization formats, we can observe three quantitative values.

  • Serialization time
  • Deserialization time
  • Serialized size

Problem with Java Serialization

Let’s see step by step on how the object is serialized and de-serialized.

So when an object is serialized

  • First it writes out the serialization stream magic data i.e STREAM_MAGIC and STREAM_VERSION data (static data) so that JVM can deserialize it when it has to. The STRAM_MAGIC will be “AC ED” and the STREAM_VERSION will be the version of the JVM.
  • Then it writes out the metadata (description) of the class associated with an instance.It includes the length of the class, the name of the class, serialVersionUID (or serial version), various flags and the number of fields in this class.
  • Then it recursively writes out the metadata of the superclass until it finds java.lang.object.
  • Once it finishes writing the metadata information, it then starts with the actual data associated with the instance. But this time, it starts from the top most superclass.
  • Finally it writes the data of objects associated with the instance starting from metadata to actual content. So here it starts writing the metadata for the class Contain and then the contents of it as usual recursively.

Here it can be easily deduced that a small object when serialized sometimes grows 20X times than the original object. And then transferring this serialized object will in fact take more time.

Problem with Externalizable/XML/JSON Serialization

Java Externalizable interface(or Hadoop Writable works on similar lines) provides readExternal(ObjectInput in) and writeExternal(ObjectOutput out) to control what is serialized and what is not. Thus it provides much needed control how/what part of object is serialized and deserialized.

Json/XML advantage is that it is human readable and services providing data can be directly consumed on browser.

XML being strongly type, provides human and machine readability but it is quite slow, is insecure, consumes a lot of space on the disk and it works on public members and public classes and not on the private or internal classes.

First and foremost, in JSON has no error handling for JSON calls. Another major drawback of JSON is that it can be quite dangerous if used with untrusted services or untrusted browsers, because a JSON service returns a JSON response wrapped in a function call, which will be executed by the browser if it will be used with untrusted browser it can be hacked, this makes the hosting Web Application Vulnerable to a variety of attacks.

Why Protobuf/FlatBuffers/Thrift/Avro etc ?

Tools like FlatBuffer/Protobuf/Thrift/Avro have quite a few advantages over java serialization/json/xml :

  • smaller in size(3 to 10 times at least)
  • faster (20 to 100 times faster)
  • typed
  • offer a common interface for all your services, you can just update your .proto or .thrift or .avro and share those with the services using it, as long as there is a library for that language
  • self-describing (i.e there is no way to tell the names, meaning, or full data types of fields without an external specification)
  • handles versioning issues of classes and its members
  • easy language interoperability

Bottom Line

What system you prefer is more or less a matter of taste in most cases. I’d make the decision based on the question

How variable the data are that I have to deal with?
How much performance/traffic is required ?
Do I need a loosely typed system like Avro or strongly typed system like flatbuffer ?
Does it require to be human readable ?
Do I need versioning? backward compatibility ?

If it makes sense to use a strongly typed system, I’d go that way, because I like it very much to get informed about errors and mistakes early.

However, if there is a need for very flexible data structures, it may make more sense to go the other road.

So you can see the choice depends on many factors, many times json/xml serialization do the trick but when you talk about dealing millions of API calls and high scalability/interoperability of systems you have to look for something more robust system.