On serialization: what it is, and why it’s needed. (part 1)

Jan 22, 2020 · 4 min read

Introduction

Serialization wasn’t explicitly covered as a key topic at Flatiron, but it is a fairly critical operation in software engineering, and most programming languages appear to provide tools to facilitate it, so it seems worth delving into.

Outside the development world, to serialize can mean: a) to broadcast, transmit or publish something (e.g. a story) sequentially at scheduled or regular intervals, or b) to arrange something in a sequence. In the context of digital data storage, serialization can be summarized as the process of converting the content or value of an object into a storable format, or a byte stream, so that it can be saved to a file or local storage, or transmitted across a network. Deserialization, quite predictably, is the process of reversing this change (serialized byte stream/storable format → object).

Breaking this down further, a byte stream is a sequence of bytes, each of which is composed of 8 bits (0s and 1s). This is machine-readable data at the lowest level, and is essentially what program instructions and their accompanying data are ultimately translated into by compilers. (This is quite a fundamental component of Computer Science: for further reading see ‘language compilers’, ‘assembly language’, ‘machine code’ if interested.) If the serialized object is sent elsewhere, it is with the expectation that it will be reconstructed in a different architectural environment or via the resources of a different programming language. Therefore, to preserve the contents of the object, most deserializing tools operate linearly; that is, they process or ‘read’ the whole object from beginning to end.
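To make the object → byte stream → object round trip concrete, here is a minimal sketch using Java’s built-in object streams. The class names (RoundTripDemo, Point) and helper methods are made up for illustration, and the bytes are kept in memory rather than sent anywhere. Note that Java’s native format is only readable by Java; a cross-language exchange would use a format such as JSON or XML instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical demo class; all names here are illustrative only.
class RoundTripDemo {

    // A small value object that opts in to Java serialization.
    static class Point implements Serializable {
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    // Serialization: object -> byte stream (held in memory here).
    static byte[] toBytes(Object obj) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        }
        return buffer.toByteArray();
    }

    // Deserialization: byte stream -> object, read from beginning to end.
    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = toBytes(new Point(3, 4));
        System.out.println("Serialized length in bytes: " + bytes.length);

        Point copy = (Point) fromBytes(bytes);
        System.out.println("Restored: (" + copy.x + ", " + copy.y + ")");
    }
}
```

The restored Point is a new object with the same field values as the original, not the original object itself.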

Serialization formats

Most of us are very familiar with files ending in ‘.xml’ or ‘.json’, and will probably have read from or written to such files. My final project at Flatiron involved constructing an SVG world map by reading geographical data from a ‘.json’ file, for example. Formats such as XML and JSON are human-readable, text-based serialization formats, unlike Binary XML, for example, which cannot be read in a plain text editor. YAML is another, arguably more legible format: it uses indentation to express structure and allows data types to be tagged, among other features.

If you find yourself having to work with large-scale scientific data, such as numerical ocean or satellite data, you may encounter HDF, netCDF and GRIB: formats developed specifically to handle (binary) serialization of volumetric data.

Serialization in use

A number of languages and platforms (Ruby, PHP, Python, Objective-C, and .NET, among others) provide built-in serialization. For those that don’t, libraries are available to facilitate it. In Haskell, it is selectively enabled: types that derive the ‘Read’ or ‘Show’ type classes have access to it, and there are specific Haskell libraries that provide more efficient, high-speed serialization. Notably, C and C++ don’t offer it natively, but serialization frameworks such as Cereal, S11n and Boost.Serialization can be incorporated to extend functionality.

Incorporating serialization in Java is often touted as a seamless operation, since it is enabled automatically when a class implements the java.io.Serializable interface (such classes become subtypes of Serializable, but the interface itself has no methods; it only marks the class as serializable). A serializable class may then optionally define methods with particular names and signatures (such as writeObject and readObject) to customize the serialization process. It is important to note, however, a similarity in serialization across languages: serialization tools do not care about access modifiers such as ‘private’, and should therefore be used carefully if such modifiers have been employed.
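The point about access modifiers can be checked with a short sketch. Everything here (PrivateFieldDemo, Account, the sample PIN) is hypothetical and only meant to show that a private field still ends up in the serialized bytes, where a short ASCII string is stored in a form that can be spotted by simply scanning the stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; all names and values are illustrative only.
class PrivateFieldDemo {

    static class Account implements Serializable {
        // 'private' restricts access in code, but not what gets serialized.
        private final String pin;

        Account(String pin) {
            this.pin = pin;
        }
    }

    // Serialize any object into an in-memory byte array.
    static byte[] serialize(Object obj) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        }
        return buffer.toByteArray();
    }

    // Crude check: decode the raw bytes one-to-one and search for the text.
    static boolean containsText(byte[] bytes, String text) {
        return new String(bytes, StandardCharsets.ISO_8859_1).contains(text);
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = serialize(new Account("pin-4711"));
        System.out.println("Private value visible in stream: "
                + containsText(bytes, "pin-4711"));
    }
}
```

Anyone who can read the file or network stream can therefore read the value, regardless of how the field was declared in the class.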

A second rule is that static or transient variables won’t be serialized. Static variables do not belong to any instance of the class (they are shared by all instances) and are considered class-level variables. They are initialized when the class loads, before any object of the class is created and before any static method of the class executes. Variables that need not be included in the serialization process can be marked as ‘transient’. The serializer will ignore the value of a variable marked as transient, and on deserialization the variable is given the default value for its data type instead (for example, null for an object reference and 0 for an int). Typically, ‘transient’ is used to mark variables containing confidential or sensitive data (see example below).

transient variable example: sourced from geeksforgeeks.org
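As a stand-in illustration, here is an independent minimal sketch (it is not the geeksforgeeks.org example) of how static and transient fields behave. The class and field names (TransientDemo, User, company, password, loginCount) are invented for this sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical demo class; all names and values are illustrative only.
class TransientDemo {

    static class User implements Serializable {
        static String company = "Acme";   // class-level: never written to the stream
        String name;                      // ordinary field: serialized
        transient String password;        // skipped: comes back as null
        transient int loginCount;         // skipped: comes back as 0

        User(String name, String password, int loginCount) {
            this.name = name;
            this.password = password;
            this.loginCount = loginCount;
        }
    }

    static byte[] toBytes(User user) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(user);
        }
        return buffer.toByteArray();
    }

    static User fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (User) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = toBytes(new User("alice", "hunter2", 5));

        // Change the static field after serializing: the stream knows nothing about it.
        User.company = "Changed";

        User copy = fromBytes(bytes);
        System.out.println("name       = " + copy.name);        // restored from the stream
        System.out.println("password   = " + copy.password);    // default for a reference type
        System.out.println("loginCount = " + copy.loginCount);  // default for an int
        System.out.println("company    = " + User.company);     // whatever the class currently holds
    }
}
```

The static field reflects the current state of the class rather than anything stored in the stream, which is why it reads ‘Changed’ after deserialization in this sketch.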

The following example (in Java) hopefully illustrates the above in context.

snippet sourced from Naresh Joshi’s blog

So here, a new instance of the class Employee (or a new Employee object) called ‘empObj’ has been created. Calling the serialize method on it creates a new instance of Java’s FileOutputStream class (called ‘fos’ here), which creates and names the empty file into which the contents of ‘empObj’ will be written. The new instance of ObjectOutputStream (called ‘oos’ here) wraps the FileOutputStream object. Finally, the writeObject method, called on ‘oos’, causes the contents of ‘empObj’ to be written to the file ‘data.obj’.

In summary: new FileOutputStream object (create and name the file to write to) → new ObjectOutputStream object wrapping the FileOutputStream (prepare the file to be written to) → writeObject method called on the ObjectOutputStream with the object to be written (write to the file).
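The steps above can be sketched roughly as follows. This is a reconstruction from the description, not Naresh Joshi’s original snippet: the names ‘empObj’, ‘fos’, ‘oos’ and ‘data.obj’ come from the text, while the Employee fields, the wrapper class FileSerializationDemo and the deserialize method are assumptions added for illustration.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical wrapper class; only the variable names follow the text above.
class FileSerializationDemo {

    // The fields of Employee are assumed for this sketch.
    static class Employee implements Serializable {
        String firstName;
        String lastName;

        Employee(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }
    }

    static void serialize(Employee empObj, String fileName) throws IOException {
        // fos creates (or truncates) the file; oos wraps it and writes the object.
        try (FileOutputStream fos = new FileOutputStream(fileName);
             ObjectOutputStream oos = new ObjectOutputStream(fos)) {
            oos.writeObject(empObj);
        }
    }

    // The reverse direction: read the file from start to end and rebuild the object.
    static Employee deserialize(String fileName) throws IOException, ClassNotFoundException {
        try (FileInputStream fis = new FileInputStream(fileName);
             ObjectInputStream ois = new ObjectInputStream(fis)) {
            return (Employee) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Employee empObj = new Employee("Ada", "Lovelace");
        serialize(empObj, "data.obj");

        Employee restored = deserialize("data.obj");
        System.out.println(restored.firstName + " " + restored.lastName);
    }
}
```

The try-with-resources blocks close both streams automatically, even if writing or reading fails partway through.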

This is just one (relatively simple) example of how serialization works in Java. A second post will cover more complex methods, customization of serialization logic, what serialVersionUID does, and of course the Externalizable interface.

Written by code_with_zeal