Playing Together with Elixir Binaries-Strings :)

Play seriously to win over.

This article covers things you'll encounter while working with strings and raw bytes, explained with real, situational examples. I designed the images to focus on what we are talking about. Hope you like them.

Elixir Version

All the examples used in this article were executed in iex using the following combination of Elixir and Erlang/OTP.

Elixir version

Gentle Intro

In one of my projects I did a lot of packet parsing on raw binaries: slicing by header lengths, and decoding and encoding 16-, 32-, and 64-bit values. So, I thought I would share the experience.

I hope you already know the difference between a bitstring, a binary, a bit, and a byte. If true, do: skip the following screenshot, else: have a glance at it.

8 bits = 1 byte
“Every binary is a bitstring, but not every bitstring is a binary.”

In Elixir, a binary is written with <<>>. Of course, everybody knows that.

iex(8)> data = <<"hello">>
"hello"
iex(9)> is_binary data
true
iex(10)> is_bitstring data
true
iex(11)> data2 = <<1,2,3::4>>
<<1, 2, 3::size(4)>>
iex(12)> is_bitstring data2
true
iex(13)> is_binary data2
false

What makes a binary different from a bitstring?

If the number of bits is a multiple of 8, we call the bitstring a binary.

Consider the following example.

<<1,2,3::4>>

In the line above, we did not specify the number of bits for 1 and 2, but we did for 3. In Elixir, if the size is not given, an integer segment uses the default of 8 bits. So, <<1,2,3::4>> is equal to <<1::8, 2::8, 3::4>>, which is 8 + 8 + 4 = 20 bits of data. We cannot call it a binary, because 20 is not a multiple of 8.

Have a look at the following representation.

bitstring and binary
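To check the arithmetic yourself, here is a small sketch using only the Kernel functions bit_size/1, byte_size/1, and is_binary/1. The variable name data is just for illustration.

```elixir
data = <<1, 2, 3::4>>

IO.inspect(bit_size(data))          # 20
IO.inspect(byte_size(data))         # 3 (rounded up to whole bytes)
IO.inspect(rem(bit_size(data), 8))  # 4, so this is not a binary
IO.inspect(is_binary(data))         # false

# Pad with 4 more bits and the same data becomes a 24-bit binary.
IO.inspect(is_binary(<<data::bitstring, 0::4>>))  # true
```

Note that byte_size/1 rounds up for bitstrings, so bit_size/1 is the one to trust when the data is not byte-aligned.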

Raw bytes and Understanding Elixir representation

Strings in Elixir are binaries (UTF-8 encoded ones, to be exact). Sorry for repeating the same statement again and again, but I have to. Even if someone wakes you up in the middle of the night and asks, you are supposed to say that.

Consider the word hello. Each of its letters is an ASCII character and takes 8 bits. So, the total byte_size of the word hello is 5.

iex> byte_size "hello"
5
iex> String.graphemes "hello"
["h", "e", "l", "l", "o"]
iex> String.valid? <<35>>
true
iex> <<35>>
"#" # valid string

The ASCII (American Standard Code for Information Interchange) code for # is 35. The binary representation of 35 is 100011, which is 6 bits of data.

Here, << 35 >> tells Elixir to use 8 bits for 35. So, 00100011 is the 8-bit form of 35. If you write << 35::6 >> instead, you get a 6-bit bitstring. It is not a binary, so it is raw data and not a valid string.

iex> <<35::6>>
<<35::size(6)>>
iex> String.valid?(<<35::6>>)
false
iex> String.valid?(<<35::8>>)
true

Understanding Elixir Representation

Consider the following lines of code

iex> match?("#", <<35>>)
true
iex> match? "#", <<0::1, 0::1, 1::1, 0::1, 0::1, 0::1, 1::1, 1::1>>
true
iex> match? <<35>>, <<0::1, 0::1, 1::1, 0::1, 0::1, 0::1,1::1,1::1>>
true

Here, we are literally splitting << 35::8 >> into its individual bits: <<0::1, 0::1, 1::1, 0::1, 0::1, 0::1, 1::1, 1::1>>
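If typing eight segments by hand feels tedious, a bitstring generator can pull the bits out for you. This is a minimal sketch using a for comprehension plus Integer.digits/2 and Integer.undigits/2 from the standard library. The variable name bits is my own.

```elixir
# Walk over "#" one bit at a time.
bits = for <<bit::1 <- "#">>, do: bit
IO.inspect(bits)                        # [0, 0, 1, 0, 0, 0, 1, 1]

# Integer.digits/2 gives the same bits without the two leading zeros.
IO.inspect(Integer.digits(35, 2))       # [1, 0, 0, 0, 1, 1]

# And back again: the eight bits are the number 35.
IO.inspect(Integer.undigits(bits, 2))   # 35
```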

Back End Story of Learning

When I was learning the basics of programming in Elixir, I used to turn the pages without reading whenever I saw the symbols <<>>. These symbols were a nightmare when I was a kid, relative to Elixir. Learning them felt like hitting a mountain with your head at a speed of 200. Just imagine.

OK! Stories apart. Once you have a clear picture in your mind of what raw bytes and valid strings are, you'll climb the mountain with ease.

Programmers deal with raw bytes far more often than with strings, especially those who do a lot of parsing.

Programmers count in bytes of memory, not in string length.

Remember the previous line. We'll talk about it in depth later in the article.

Extracting Sub-String

This is a real-world situation.

Extracting a String of Known Length

If you know the exact length of the string and the position you want to extract from, you can go with the following approach.

Using binary_part for raw bytes

When you work on a real-world project, you'll run into situations that involve raw bytes of data. I suggest you learn as much as possible before working with them.

iex> binary_part("hello medium", 6, 6)
"medium"

binary_part(binary, start, length) extracts length bytes of the binary, beginning at the byte offset start. It is used for splitting raw bytes of data.

When the length is negative and within bounds, it extracts the bytes that come before start, reading backwards from start instead of forwards.

Things to remember.

→Here, start cannot be negative.

→Here, binaries are zero-indexed, which means binary_part("hello", 1, 1) returns e, not h. To get h, you have to use binary_part("hello", 0, 1). Hope you understood what zero-indexed means.

→start and start + length must both stay within 0 and the byte_size of the binary. Otherwise, it raises an ArgumentError exception.
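Here is a short sketch of the zero-indexing and negative-length points above, using the same "hello medium" string. Nothing here is new API, only binary_part/3 and byte_size/1.

```elixir
str = "hello medium"

# Zero-indexed: offset 0 is the first byte.
IO.inspect(binary_part("hello", 0, 1))   # "h"
IO.inspect(binary_part("hello", 1, 1))   # "e"

# Negative length: the 6 bytes that end at offset 12 (the end of the binary).
IO.inspect(binary_part(str, byte_size(str), -6))   # "medium"

# The same bytes, read forwards from offset 6.
IO.inspect(binary_part(str, 6, 6))       # "medium"
```

The negative form is handy when you know how many bytes a trailer takes but not how long the whole binary is.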

Using binary_part in Guard clause

binary_part/3 is allowed in guard clauses as well.

Example: Packet Parsing

For example, say you are parsing packets like $admin#medium#worlds#best#blog and $user#blackode#a#medium#writer. You are asked to write a definition that receives a packet and tells one kind of packet from another.

You could do this by splitting the packet with String.split(packet, "#") and using the if macro to check the first element. But that takes more code. Instead, you can use binary_part in a guard clause, like the following.

defmodule Parser do
  def parse(packet) when binary_part(packet, 1, 5) == "admin" do
    IO.puts "Admin Packet !"
  end
  ...
end
using binary_part in guard clause

Check out the execution screenshot

execution screen shot
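A slightly fuller sketch of the same idea, for both packet kinds from the example. The module name PacketParser, the returned atoms, and the catch-all clause are my own assumptions for illustration, not part of the original Parser module.

```elixir
defmodule PacketParser do
  # Hypothetical module: packets look like "$admin#..." or "$user#...".

  # Offset 0 is the "$", so the packet type starts at offset 1.
  def parse(packet) when binary_part(packet, 1, 5) == "admin", do: :admin
  def parse(packet) when binary_part(packet, 1, 4) == "user", do: :user

  # Catch-all. It is also reached when binary_part/3 fails inside a guard,
  # because an error in a guard makes that clause not match instead of raising.
  def parse(_packet), do: :unknown
end

IO.inspect(PacketParser.parse("$admin#medium#worlds#best#blog"))  # :admin
IO.inspect(PacketParser.parse("$user#blackode#a#medium#writer"))  # :user
IO.inspect(PacketParser.parse("$a"))                              # :unknown
```

Without the catch-all clause, a packet that is too short would end in a FunctionClauseError, since no clause would match.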

=================Warning=================

As I already mentioned in the things to remember section, if either the start or the length is out of bounds, binary_part raises an ArgumentError exception.

Here, start is out of bounds
Here, length is out of bounds
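If you would rather get a tagged result than an exception, one option is to wrap the call. This is only a sketch: SafeBinary and its part/3 function are hypothetical names of mine, not something from Elixir's standard library.

```elixir
defmodule SafeBinary do
  # Hypothetical helper: wraps binary_part/3 so that out-of-bounds
  # arguments return :error instead of raising ArgumentError.
  def part(bin, start, len) do
    {:ok, binary_part(bin, start, len)}
  rescue
    ArgumentError -> :error
  end
end

IO.inspect(SafeBinary.part("hello", 0, 5))   # {:ok, "hello"}
IO.inspect(SafeBinary.part("hello", 6, 1))   # :error (start is past the end)
IO.inspect(SafeBinary.part("hello", 3, 5))   # :error (start + length is past the end)
```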

Extracting a String of Unknown Length

If you don't know the length of the substring, binary_part gets awkward, because you have to work the length out first. This is where binary pattern matching with <<>> comes in handy.

Situation
You are asked to extract the part of the string from position 6 to the end of the string.

A string in Elixir is a binary, so its size in bits is always a multiple of 8. In other words, if the bit_size of a bitstring is divisible by 8, we call that bitstring a binary.

As we discussed in the intro section, each ASCII letter in a string takes 8 bits, that is, 1 byte. So, to skip the first 6 letters you have to skip 6x8 bits.
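A minimal sketch of that skip, assuming the same "hello medium" string used earlier. Both forms use standard bitstring segment options; the variable names are mine.

```elixir
str = "hello medium"

# Skip 6 bytes and bind everything after them.
<<_skip::binary-size(6), rest::binary>> = str
IO.inspect(rest)        # "medium"

# The same skip written in bits: 6 * 8 = 48.
<<_skip::size(48), rest_bits::binary>> = str
IO.inspect(rest_bits)   # "medium"
```

Notice that we never had to say how long the rest is. The ::binary segment at the end simply takes whatever remains.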

Extracting the First Letter from a String

Situation
Extract the currency symbol from the string “$500”.

This can be achieved in many ways.

String.first

iex> string = "$500"
"$500"
iex> string |> String.first
"$"

Pattern Matching

iex> string = "$500"
"$500"
iex> <<first::8,_rest::binary>> = string
"$500"
iex> <<first>>
"$"
iex> first
36 # the ASCII code (code point) of $
iex> <<35>>
"#"
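One caveat with first::8: it grabs exactly one byte, which is only the whole symbol when the symbol is ASCII. Here is a sketch with "€500" as an assumed input (my example, not from the situation above) that uses the ::utf8 segment type instead.

```elixir
string = "€500"

# ::8 grabs only the first of the three bytes that make up "€".
<<first_byte::8, _rest::binary>> = string
IO.inspect(first_byte)                      # 226
IO.inspect(String.valid?(<<first_byte>>))   # false

# ::utf8 grabs the whole code point, however many bytes it takes.
<<first::utf8, rest::binary>> = string
IO.inspect(first)             # 8364
IO.inspect(<<first::utf8>>)   # "€"
IO.inspect(rest)              # "500"
```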

String.split

Not recommended in this situation, but it is good to know that the option exists.

As we know, it splits the string on the given pattern. If the pattern is "", it splits between every grapheme and gives a slightly surprising result.

iex> string = "$500"
"$500"
iex> string |> String.split("")
["", "$", "5", "0", "0", ""]

Note: there is no space between the quotes in "".

If you look closely, it added an extra "" at the head and at the tail. You have to trim them by passing the option trim: true.

iex> string = "$500"
"$500"
iex> string |> String.split("", trim: true) |> hd
"$"

String.slice

iex> String.slice "$500", 0, 1
"$"
iex> String.slice "$500", -4, 1
"$"

String.slice [ VS ] binary_part

As we know, both take the arguments (str, start, len) and return a substring starting at the offset start with length len.

I kept wondering why there would be two functions with such similar functionality. So, I started checking out the things that differentiate them.

Out-of-Bounds Arguments

When start and len are out of bounds, binary_part raises an ArgumentError, because it is designed for raw bytes. String.slice does not raise. It works in terms of String.length and returns whatever falls inside the string.

Let’s check that.

iex(14)> str = "hello medium"
"hello medium"
iex(15)> String.slice str, 6, 10
"medium"
iex(16)> binary_part str, 6, 10
** (ArgumentError) argument error
    :erlang.binary_part("hello medium", 6, 10)

Here, only 6 letters remain after position 6, but we tried to extract a substring of len 10. So, binary_part raised an error, while String.slice returned the substring from index 6 to the end of the string. Hope you got the point.

Raw Bytes and Graphemes

In String.slice(str, start, len), start is an index into the graphemes, whereas in binary_part it is an index into the bytes.

The following example makes it clearer.

iex> str = "hełło" 
"hełło"
iex> String.length str
5
iex> byte_size str
7
iex> String.graphemes str
["h", "e", "ł", "ł", "o"]

I hope you understand what I mean by graphemes. The number of graphemes in str is 5, but its byte_size is 7, because each ł takes 2 bytes. That is where these functions differ from each other.

byte_size/1 counts the underlying raw bytes, and String.length/1 counts graphemes (the characters you see).

String.slice works on Unicode graphemes, and binary_part works on bytes.

In general, reach for binary_part when you are dealing with raw bytes.

Internal Representation of String (Raw Bytes)

You can see the binary representation of any string with a little hack: join the string with <<0>>. The null byte makes the result unprintable, so iex shows the raw bytes instead of the text.

iex> str = "hełło"
"hełło"
iex> raw = str <> <<0>>
<<104, 101, 197, 130, 197, 130, 111, 0>>
iex> String.slice raw, 2, 3
"łło"
iex> binary_part raw, 2, 3
<<197, 130, 197>>
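If you prefer not to append a byte, inspect/2 and IO.inspect/2 accept a binaries: :as_binaries option that asks for the raw bytes directly. A minimal sketch:

```elixir
str = "hełło"

# Print the raw bytes without touching the string itself.
IO.inspect(str, binaries: :as_binaries)
# <<104, 101, 197, 130, 197, 130, 111>>

# The default still prints it as text.
IO.inspect(str)
# "hełło"
```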

Elixir also has a Base module that helps with encoding and decoding binaries in base 16, 32, and 64. Have a look at its documentation.
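A quick taste of the Base module, using only functions from Elixir's standard library:

```elixir
# Hex and Base64 encodings of the same five bytes.
IO.inspect(Base.encode16("hello"))          # "68656C6C6F"
IO.inspect(Base.encode64("hello"))          # "aGVsbG8="

# Decoding brings the original binary back.
IO.inspect(Base.decode16!("68656C6C6F"))    # "hello"

# Hex makes the two-byte "ł" (c5 82) easy to spot.
IO.inspect(Base.encode16("hełło", case: :lower))   # "6865c582c5826f"
```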

Hope you enjoyed playing with strings. Practice makes perfect. Try to parse an IPv4 packet based on its header length.
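If you want a starting point, here is one possible sketch, not a complete parser. The module name IPv4Sketch, the returned map, and the error atoms are my own assumptions. The field sizes follow the standard IPv4 header, where the header length (IHL) is counted in 32-bit words.

```elixir
defmodule IPv4Sketch do
  # Hypothetical module name. ihl is the header length in 32-bit words,
  # so the minimum of 5 means a 20-byte header with no options.
  def parse(<<4::4, ihl::4, _tos::8, total_length::16, _id::16, _flags::3, _frag::13,
              ttl::8, protocol::8, _checksum::16,
              src::binary-size(4), dst::binary-size(4), rest::binary>>)
      when ihl >= 5 do
    # Everything between the fixed 20 bytes and ihl * 4 bytes is options.
    options_size = (ihl - 5) * 4

    case rest do
      <<_options::binary-size(options_size), payload::binary>> ->
        {:ok,
         %{
           header_bytes: ihl * 4,
           total_length: total_length,
           ttl: ttl,
           protocol: protocol,
           src: src,
           dst: dst,
           payload: payload
         }}

      _ ->
        {:error, :truncated_options}
    end
  end

  def parse(_other), do: {:error, :not_ipv4}
end

# A hand-made packet: 20-byte header, TTL 64, protocol 6, payload "data".
packet =
  <<4::4, 5::4, 0, 24::16, 0::16, 0::3, 0::13, 64, 6, 0::16,
    192, 168, 0, 1, 10, 0, 0, 2, "data">>

IO.inspect(IPv4Sketch.parse(packet))
```

The header length is read in the function head and then reused as a size in the second match, which is exactly the "parse by header length" trick.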

If you found this helpful, please share it. Let others benefit from it too.

Sharing is Caring.

Happy Coding! Keep always smiling :)

if worth_clapping, do: CLAP, else: nil