When Telcos and Big Data engineers share good practices …

Engineering Big Data Streams

I recently came across a very insightful talk by Jonathan Winandy at OpenWorldForum: “Data Encoding and Metadata for Streams”. Sorry, the talk was in French, but I will comment on it in English for you. The English deck, however, is here.

Jonathan is a “Data Pipeline Engineer”. He helps his customers stop losing data by building appropriate distributed architectures, at scale.

About Data Streams

Data streams are a hot topic today. They are becoming strategic when it comes to collecting and storing click streams, app logs, or connected-device information.

Data streams are an abstract data structure with two very simple operations: “append” and “read at”. Once data is pushed into the stream at a certain position, it can no longer be modified. This makes streams very different from traditional data management architectures, like key-value stores or databases. But they are also definitely different from queues or messaging solutions (e.g. RabbitMQ).
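As a minimal sketch (hypothetical names, not any real system's API), an append-only stream exposing just “append” and “read at” might look like this:

```python
class Stream:
    """A minimal append-only stream: values can be appended and
    read back by position, but never modified in place."""

    def __init__(self):
        self._log = []

    def append(self, value) -> int:
        # Append-only: returns the position (offset) where the value landed.
        self._log.append(value)
        return len(self._log) - 1

    def read_at(self, position):
        # Reading is non-destructive: unlike a queue, any consumer
        # can re-read any position, any number of times.
        return self._log[position]


s = Stream()
offset = s.append("click:/home")
s.append("click:/cart")
print(s.read_at(offset))  # click:/home
```

The absence of an “update” or “delete” operation is the whole point: it is what lets many heterogeneous consumers read the same stream independently, each keeping track of its own position.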

Regarding data stream systems, think for example of Apache Kafka (and the fascinating story behind it).

Binary Encoding and Metadata

Jonathan’s talk focuses on data encoding in streams, i.e. how to convey data over the wire. Several techniques are used today: various kinds of Java serialization, or the undisputed champion, JSON. Here are some typical encoding schemes pointed out by Jonathan:

Unfortunately, classical encodings are neither CPU- nor memory-efficient. And they are not generic enough to allow the stream to be consumed by heterogeneous systems along the path. What’s more, they are not optimized for easy parsing. This really starts causing problems when building systems at scale.

Look, for instance, at how inefficient JSON is (39 bytes on the wire for 10 bytes of useful data). The corresponding Avro encoding, by contrast, is much more compact, and efficient to parse.
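To get a feel for the overhead (with a made-up record, not the exact figures from the talk), compare a JSON payload with a schema-based binary packing in the spirit of Avro, sketched here with the standard `struct` module:

```python
import json
import struct

# Hypothetical record: a user id (int32) and a temperature reading (float64).
record = {"id": 12345, "temp": 21.5}

# JSON: field names and punctuation travel with every single message.
json_bytes = json.dumps(record, separators=(",", ":")).encode("utf-8")

# Schema-based binary (the Avro idea, sketched with struct): the schema
# "id: int32, temp: float64" lives outside the payload, so only the raw
# values go on the wire.
binary_bytes = struct.pack("<id", record["id"], record["temp"])

print(len(json_bytes), len(binary_bytes))  # 24 12
```

Half the bytes, and the binary form can be decoded at fixed offsets instead of being character-parsed — which is exactly the “optimized for easy parsing” property the talk insists on.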

Telecom industry

For me, this stuff about data streams and powerful encoding rings a bell. I spent 15 years in the telecom industry. Efficient encoding of datagrams is something the wireless telecom scene knows very well.

And it has, for years.

Indeed, spectrum is a scarce and incredibly costly resource. Not a single extra bit of information should be transmitted, and this forces telco engineers to remain lean.

Look, for example, at the good old 2G GSM protocol, standardized in the 90s. The signalling messages used to handle radio resources, mobility, and calls are specified in the norm GSM 04.07 (§11). By nature, they are bit-centric and compact, not at all character-centric. We can read there:

Messages are bit strings of variable length, formally a succession of a finite, possibly null, number of bits (i.e., elements of the set {"0", "1"}), with a beginning and an end. Considered as messages, these bit strings follow some structure (the syntax), enabling to organise bits in information pieces of a different meaning level

Here, for example, is how the phone number is encoded in a 2G cell phone message. Note that each digit is encoded with 4 bits, the minimal number of bits needed (BCD encoding).
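The packing can be sketched in a few lines — a simplified illustration of GSM-style semi-octet BCD packing, not a full norm-compliant encoder:

```python
def encode_bcd(number: str) -> bytes:
    """GSM-style semi-octet packing: two digits per byte, first digit
    in the low nibble; 0xF pads an odd-length number."""
    digits = [int(d) for d in number]
    if len(digits) % 2:
        digits.append(0xF)  # filler nibble for odd-length numbers
    return bytes(lo | (hi << 4) for lo, hi in zip(digits[0::2], digits[1::2]))


print(encode_bcd("0612345678").hex())  # 6021436587
```

Ten digits fit in 5 bytes, versus 10 bytes in ASCII or far more in JSON: exactly the kind of frugality the spectrum constraint imposes.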

Bit-centricity also applies to the 3G UMTS protocol, but with a different encoding scheme known as ASN.1. As stated in the norm UMTS TS 25.331:

The bits in the ASN.1 bit string shall represent the semantics of the functional IE definition in decreasing order of bit significance;

- with the first (or leftmost) bit in the bit string representing the most significant bit; and

- with the last (or rightmost) bit in the bit string representing the least significant bit

In ASN.1, messages are described with a standardized syntax, independent of the way information is represented by their consumers and producers. Once formally described, they can be encoded into packed bit streams through normalized rules. Here is an online ASN.1 compiler where you can encode / decode messages interactively.
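To make the notation concrete, here is a tiny hand-written ASN.1 module — a hypothetical example of mine, not taken from the 3GPP norms:

```
-- A made-up module defining one message type.
Example DEFINITIONS AUTOMATIC TAGS ::= BEGIN

    Measurement ::= SEQUENCE {
        id      INTEGER (0..65535),  -- the value range lets the encoder
        urgent  BOOLEAN              -- pack "id" into just 16 bits
    }

END
```

The definition says nothing about bytes on the wire: the same `Measurement` can be serialized with different rule sets, and the packed rules exploit the declared constraints (like the `0..65535` range) to emit the minimal number of bits.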

ASN.1 is also used in other domains, like cryptography, to encode X.509 certificates.

It’s definitely great to take a cross-industry look, and to see engineers facing the same kinds of problems at different times, and finding creative solutions.

--

Christophe Bourguignat

Data enthusiast #BigData #DataScience #MachineLearning #FrenchData #Kaggle