LZ77 compression in JavaScript

Vincentcorbee
17 min read · Jan 15, 2023


When I was working on a library for creating PDFs in JavaScript, I wanted the library to be able to use compression filters in stream objects. One such filter that PDF supports is FlateDecode, which uses the deflate compression algorithm. Deflate is also what is used in gzip. So with fresh enthusiasm for the challenge ahead I looked up the RFC. There were a few parts in constructing the bit stream that stood out: Huffman coding and LZ77. Since I already knew how Huffman coding works (kinda), I set my sights on LZ77. So what is LZ77 and how do you implement it from scratch in JavaScript(ish)? Let’s find out.

Side note. The first fun fact I discovered is that deflate doesn’t use LZ77 compression at all… 💩 It uses a derivative of that algorithm named LZSS. So the end result is useless for deflate, but hey, we learned something. 🤓 The main difference is that while LZ77 outputs every item as a triplet in the form of [offset, length, character], LZSS outputs, depending on the length of the match, either a literal character or a pair in the form of [offset, length]. I understand if this doesn’t make sense now, but it will, I promise — hopefully. 🤞🏻
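To give a rough taste of the difference, on an input like ababa the two schemes would produce something along these lines (the exact LZSS output depends on the variant):

LZ77: [0, 0, a] [0, 0, b] [2, 2, a]
LZSS: a b [2, 3]

LZ77 pays for a full triplet at every step, while LZSS only pays for an [offset, length] pair when a match is long enough to be worth it.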

So what is LZ77?

LZ77 is a lossless compression algorithm created by Lempel and Ziv in 1977. The LZ77 algorithm compresses data by replacing duplicate sections of data with metadata in the form of a triplet [ offset, length, character ].

First some key terms need to be addressed:

coding position
This is the position in the input stream of the character currently being encoded.

search buffer
The search buffer is a history of processed characters with a fixed length in which the algorithm searches to match for duplicate segments.

lookahead buffer
The lookahead buffer is a fixed length set of characters after the current character into which the algorithm looks to find the longest match possible. So for example, if the current character being processed is an A, and after it come the characters B C D E: if A matches, the algorithm will try to match as many characters as it can in the lookahead buffer.

sliding window
The three parts above are called the sliding window. It is called sliding because it moves over the input stream as more characters are processed.

So how does this process work?

Let’s use the following as our input: abcbbcbaaaaaa

First let’s define the size of our sliding window. Let’s say we have a search buffer of 6 characters and a lookahead buffer of 5 characters.

Then the algorithm starts by moving the coding position to the start of the input. This means that our search buffer is empty, our current coding position points at character a and our lookahead buffer contains bcbbc.

Thus our state looks like this:

source
a b c b b c b a a a a a a

sliding window
a b c b b c
0 1 2 3 4 5

Now, since our search buffer is empty we can’t find any matches. So what do we output when we don’t have a match? We output a triplet in the form of: [0, 0, character]. This means the offset is zero and the length is zero.

So we have our first output: [0, 0, a]. 🥹

Our sliding window moves over one position and our state now looks like this:

 source
a b c b b c b a a a a a a

sliding window
a b c b b c b
-1 0 1 2 3 4 5

outputs
[0, 0, a]

Our coding position is now at b and our search buffer now contains the character a. So let’s repeat the process for b. We look in our search buffer and can’t find a match. So again we output: [0, 0, b] and our state now looks like this:

 source
a b c b b c b a a a a a a

sliding window
a b c b b c b a
-2 -1 0 1 2 3 4 5

outputs
[0, 0, a]
[0, 0, b]

Our coding position is now at c and our search buffer now contains the characters ab. So let us again repeat the process. We look in our search buffer and… still nothing, so we emit: [0, 0, c].

Our sliding window moves over one position and our state now looks like this:

 source
a b c b b c b a a a a a a

sliding window
a b c b b c b a a
-3 -2 -1 0 1 2 3 4 5

outputs
[0, 0, a]
[0, 0, b]
[0, 0, c]

Our coding position is now at b and our search buffer now contains the characters abc. So let us again repeat the process. We look in our search buffer and… we found a match at -2! Now we will continue looking in our lookahead buffer to try and find the longest match possible. The next character in our lookahead buffer is b. So we check for bb and… sadly, no match. Since we have a match we output [2, 1, b]: we go back 2 positions, take a length of one and emit the next character, which is also a b.

Now our state looks like this:

 source
a b c b b c b a a a a a a

sliding window
a b c b b c b a a a a
-5 -4 -3 -2 -1 0 1 2 3 4 5

outputs
[0, 0, a]
[0, 0, b]
[0, 0, c]
[2, 1, b]

Let’s repeat the process again. What can we match? Well, we have a match for cb in our search buffer at -3. So we will output [3, 2, a].

Our state now looks like this:

 source
a b c b b c b a a a a a a

sliding window
c b b c b a a a a a a
-6 -5 -4 -3 -2 -1 0 1 2 3 4

outputs
[0, 0, a]
[0, 0, b]
[0, 0, c]
[2, 1, b]
[3, 2, a]

Observe that since our search buffer has a length of six, we have lost some of our previously processed characters. Also, since we only have five unprocessed characters remaining, our lookahead buffer only contains four characters.

Let us again repeat the process. We take our character a and see that it matches at -1. We match a and reach the end of our search buffer. So do we repeat the process again for the last four characters? The answer in this scenario is no!

The reason is that the length of our match is allowed to extend beyond the end of our search buffer, overlapping characters that are themselves part of the match. This works because within the algorithm we get run-length encoding for free. What this means is that we can continue looking for matches in the lookahead buffer as long as the match repeats. In our case, we have a repeating pattern of a that starts at -1 and continues till the end of our search buffer. Thus although our search buffer ends, we can continue to process this repeating pattern and add it to the length.

So our length becomes 5, i.e. 5 times an a. Now our output looks like this: [1, 5, '']. We output an empty string for the character because we have processed all our characters.

Our finished state looks like this:

outputs
[0, 0, a]
[0, 0, b]
[0, 0, c]
[2, 1, b]
[3, 2, a]
[1, 5, '']

And we are done! 🎉

As you have seen, we even emit a triplet for unmatched characters. What this means is that if we take our first character a, which is 8 bits, we actually expand the original data, because we also need to store the offset and length. So in our example, for the first three characters alone we already used more bits than the actual size of the original data. So did we actually achieve compression? Well, that depends on how we store the offset and length. If we cram them into 16 bits, we need at least 24 bits per triplet. So if we look at our output, we need 6 * 24 = 144 bits. Our original data contained 8 * 13 = 104 bits. Hurray, we actually expanded our original data. 💩

It becomes clear that with this scheme, a match needs to cover at least three characters to break even. The LZSS scheme tries to do away with this overhead: it stores a match as just a length-distance pair, and only when storing this metadata takes less space than storing the original characters.
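Just to sketch the idea (this is not part of our implementation; the emitLZSS helper and the two byte pair cost are assumptions), an LZSS style emit decision could look something like this:

type LZSSToken = { literal: string } | { offset: number, length: number }

/* Hypothetical sketch, assuming a packed pair costs 2 bytes and a literal 1 byte */
function emitLZSS(offset: number, length: number, chars: string): LZSSToken[] {
  const pairCost = 2

  /* Only emit an [offset, length] pair when it is cheaper than the literals it replaces */
  if (length > pairCost) return [{ offset, length }]

  return [...chars].map((literal) => ({ literal }))
}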

Well, great, we compressed our data, but how do we get the original back?

Decompress

To decompress we convert each triplet back into a piece of the original data and add it to the result.

The outputs we produced in the compression phase were:

[0, 0, a]
[0, 0, b]
[0, 0, c]
[2, 1, b]
[3, 2, a]
[1, 5, '']

Since our first triplet always has an offset of 0 and a length of 0, we simply emit the character. So our first triplet produces a. Our decoded stream now becomes a. Our next triplet is [0, 0, b]. Again, since our offset is 0 and our length is 0, we emit the character. Our output now becomes ab. For our next triplet we do the same, so our output becomes abc.

When we look at our fourth triplet, we have an offset of 2 and a length of 1. Now how do we process this? Remember that our offset is the position in our search buffer. But you might ask, where is our search buffer? Well, that is all the data we have already decompressed. For our triplet [2, 1, b] we go back 2 positions, read one character and then emit a b, which results in bb. We have now decompressed abcbb. The next one is [3, 2, a], which results in cba. Our output becomes abcbbcba. Now for our last one, [1, 5, '']. If you remember, this one was special because its length was bigger than the match in the search buffer. So how do we process this one? We go back one position, which gives us an a. Then we copy that character over five times, which becomes aaaaa, so our output becomes abcbbcbaaaaaa.

And if we compare it to our input from the compression phase, we can see that it matches. So now we have compressed and decompressed data with the LZ77 algorithm. Our next step is to figure out how to implement this in JavaScript. 🤔

Now let’s implement it in JavaScript

You can type along or clone the repo from here: https://github.com/vincentcorbee/LZ77

I’d like to point out that this is not a comprehensive implementation. It is intended for learning about the algorithm and how we can go about implementing it in JavaScript. I try to be as descriptive as possible in my function and variable naming, so hopefully the code itself can in large part be self-explanatory. Also, this implementation is based on the premise that every character in the input can be read as an unsigned 8 bit integer.

So let’s get typing. We are not going to install any dependencies for our actual implementation. But since we are going to be using TypeScript, we install typescript, ts-node and @types/node as dev dependencies. The following commands will set up the basic structure that we need for our project.

mkdir lz77 && cd lz77

yarn init -y

yarn add -D ts-node typescript @types/node

npx tsc --init

mkdir src && mkdir src/lib && mkdir src/samples && mkdir src/modules

touch src/lib/index.ts && touch src/index.ts && touch src/modules/index.ts && touch src/types.ts

First we set up our index files.

In index.ts add.

export * from './compress'
export * from './decompress'

In lib/index.ts add.

export * from './get-options'
export * from './get-match'
export * from './get-lookahead-buffer'
export * from './get-search-buffer'
export * from './create-array-buffer'

And finally in modules/index.ts add.

export * from './binary-reader'
export * from './binary-writer'

We start by creating compress.ts in our src folder with the following content.

import { getOptions, getMatch, getLookaheadBuffer, getSearchBuffer, createArrayBuffer } from "./lib";
import { LZ77Options, EncodedArray } from './types'

export function compress(input: string, options?: Partial<LZ77Options>) {
  const { searchBufferLength, lookaheadBufferLength } = getOptions(options)

  const encodedArray: EncodedArray = []
  const end = input.length - 1

  let searchBuffer = ''

  let codingPosition = 0

  while (codingPosition <= end) {
    const lookaheadBuffer = getLookaheadBuffer(input, codingPosition, lookaheadBufferLength)
    const [offset, length, matchedChars] = getMatch(searchBuffer, lookaheadBuffer)

    codingPosition += length

    const nonMatchingChar = input[codingPosition] ?? ''

    searchBuffer += matchedChars + nonMatchingChar

    codingPosition++

    searchBuffer = getSearchBuffer(searchBuffer, searchBufferLength)

    encodedArray.push([offset, length, nonMatchingChar])
  }

  return createArrayBuffer(encodedArray)
}

Our compression function takes two inputs: a string as our data to be compressed and an options object for the compression settings. The settings that can be adjusted are the lengths of the search buffer and the lookahead buffer. With these settings we can influence the compression ratio of our output. If we have a larger search and lookahead buffer, we will potentially get more compression, but the process will become slower. And vice versa: with smaller buffers we potentially get less compression but a faster process. We set these options based on the user input or default options.

We then set our encodedArray to an empty array that we will be filling with our encodings. Our end is set to the last index of our input. We set the search buffer to an empty string and our coding position is set to 0.

Next is our main loop. We continue this loop until we reach the end of our input. First we get our current lookahead buffer. Then we use the search buffer and lookahead buffer to get an encoding for our character; we get back the offset, length and matched characters. We then add the length to our coding position. The character that we will be outputting is the first non-matching character after the match, i.e. the character at our new coding position, or an empty string if we have reached the end of the input. After that we add all our processed characters to our search buffer and move our coding position one position over. We finally get a new search buffer and push our new encoding to the encoded array.

When all the data is processed, we create an array buffer and return it.
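Once all the helpers below are in place, running compress on our walkthrough input with the default options should pack the six triplets from before into 18 bytes, three per triplet (the import mirrors the test file we will write later):

import { compress } from "."

const buffer = compress('abcbbcbaaaaaa')

console.log(new Uint8Array(buffer))
// Uint8Array [0, 0, 97, 0, 0, 98, 0, 0, 99, 0, 33, 98, 0, 50, 97, 0, 21, 0]

How the triplets end up as these bytes is covered below when we create the array buffer.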

In types.ts add the following.

export type LZ77Options = {
  searchBufferLength: number
  lookaheadBufferLength: number
}

export type Encoding = [number, number, string]

export type Match = Encoding

export type EncodedArray = Encoding[]

In our lib folder create get-options.ts with the following content.

import { LZ77MaxSearchBufferLength, LZ77MaxLookaheadBufferLength } from "../constants";
import { LZ77Options } from "../types";

export function getOptions(options: Partial<LZ77Options> = {}): LZ77Options {
  const { searchBufferLength = 255, lookaheadBufferLength = 15 } = options

  return {
    searchBufferLength: Math.min(searchBufferLength, LZ77MaxSearchBufferLength),
    lookaheadBufferLength: Math.min(lookaheadBufferLength, LZ77MaxLookaheadBufferLength)
  }
}

Here we are returning an options object with user defined options or defaults. We are capping the length of both buffers because we are going to store our offset in 12 bits and our length in 4 bits. So our max search buffer length is 0xfff or 4095 and our max lookahead buffer length is 0xf or 15.

In our src folder we create constants.ts which will hold our max values.

export const LZ77MaxSearchBufferLength = 0xfff
export const LZ77MaxLookaheadBufferLength = 0xf

For our lookahead buffer we create get-lookahead-buffer.ts in our lib folder.

export function getLookaheadBuffer(
  input: string,
  codingPosition: number,
  lookaheadBufferLength: number,
) {
  return input.substring(codingPosition, codingPosition + lookaheadBufferLength)
}

Here we are returning a substring of our input data based on our coding position and our lookahead buffer length.
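For example, at coding position 5 in our walkthrough input with a lookahead buffer length of 5:

getLookaheadBuffer('abcbbcbaaaaaa', 5, 5) // 'cbaaa'

Note that the character at the coding position itself is included in the buffer.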

Next up the matching phase. This step is a little bit more involved. Create get-match.ts in our lib folder with the following content:

import { Match } from "../types"

export function getMatch(searchBuffer: string, lookaheadBuffer: string) {
  const [char] = lookaheadBuffer

  let offset = 0
  let lengthMatch = 0
  let matchedChars = searchBuffer.lastIndexOf(char) === -1 ? '' : char

  if (!matchedChars) return [offset, lengthMatch, matchedChars] as Match

  const searchBufferEnd = searchBuffer.length

  let indexLookaheadBuffer = lookaheadBuffer.length

  while (indexLookaheadBuffer > 0) {
    const chars = lookaheadBuffer.substring(0, indexLookaheadBuffer)

    const indexInSearchBuffer = searchBuffer.lastIndexOf(chars)

    if (indexInSearchBuffer > -1) {
      lengthMatch = chars.length

      matchedChars = chars

      offset = searchBufferEnd - indexInSearchBuffer

      /* Get the run length of the matched chars in the lookahead buffer */
      if (indexInSearchBuffer + chars.length === searchBufferEnd) {
        while (indexLookaheadBuffer <= lookaheadBuffer.length) {
          const remainingChars = lookaheadBuffer.substring(indexLookaheadBuffer)
          const match = remainingChars.indexOf(chars) === 0

          if (!match) break

          matchedChars += chars

          indexLookaheadBuffer += chars.length

          lengthMatch = matchedChars.length
        }
      }

      break
    }

    indexLookaheadBuffer--
  }

  return [offset, lengthMatch, matchedChars] as Match
}

So what’s going on here? Our function takes in the search buffer and the lookahead buffer. We set our offset and length of the match to 0. First we check if our first character matches. Remember that the character at our coding position is included in the lookahead buffer. If it matches, we add it to our matched characters. If we don’t have a match, we can stop and return [0, 0, ''].

Now if we matched our first character, i.e. we know it is in the search buffer, we can process the rest of our lookahead buffer. Our main loop runs as long as our index is bigger than 0 because we already processed the first character.

We start with the longest match we could possibly find and work our way down. In our search buffer we will try to find the last index of this match.

If we have a match, we set our length to the length of the matched characters and set our offset accordingly. The next thing we do is check if our match ends at the end of the search buffer. Why? If you recall, we can continue matching as long as our match repeats in the lookahead buffer, i.e. our run-length encoding. We have a simple while loop that continues adding characters to our match as long as we find them.

Now if we are done processing, we return our result as [offset, length match, matched characters].
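We can check this function against our walkthrough, both for a plain match and for the run length case:

getMatch('abcbb', 'cbaaa') // [3, 2, 'cb'], emitted as [3, 2, a]
getMatch('cbbcba', 'aaaaa') // [1, 5, 'aaaaa'], emitted as [1, 5, '']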

We still need to create the function that updates our search buffer. In our lib folder create get-search-buffer.ts with the following content:

export function getSearchBuffer(searchBuffer: string, searchBufferLength: number) {
  const currentLengthSearchBuffer = searchBuffer.length

  /*
    Move the search buffer n positions over based on the difference
    between the current length of the search buffer and the max length of our search buffer
  */
  if (currentLengthSearchBuffer >= searchBufferLength) return searchBuffer.substring(currentLengthSearchBuffer - searchBufferLength)

  return searchBuffer
}

What we do here is simply return our search buffer if it hasn’t exceeded the search buffer length; otherwise we remove n positions from the start and return the new buffer.
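For example, with the search buffer length of 6 from our walkthrough:

getSearchBuffer('abcbbcba', 6) // 'cbbcba'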

The last part is converting our encoded array to an array buffer. Because we are outputting binary data, and will also be consuming binary data, we will create a binary reader and a binary writer.

In our modules folder create binary-reader.ts with the following content:

export class BinaryReader {
  protected pos: number
  protected bitCount: number

  view: DataView

  constructor(arrayBuffer: ArrayBuffer) {
    this.view = new DataView(arrayBuffer)
    this.pos = 0
    this.bitCount = 0
  }

  protected getData(type = 'Uint8') {
    if (this.view.byteLength > this.pos) {
      // @ts-ignore
      return this.view[`get${type}`](this.pos++)
    }

    throw new RangeError()
  }

  get buffer() {
    return this.view.buffer
  }

  getBytePosition() {
    return this.pos
  }

  seek(pos: number) {
    const oldPos = this.pos

    this.pos = pos

    return oldPos
  }

  peak(index = this.pos + 1) {
    if (this.view.byteLength > index && index > -1) {
      return this.view.getUint8(index)
    }

    return null
  }

  getUint16() {
    return (this.getData() << 8) | this.getData()
  }

  getUint8() {
    return this.getData()
  }

  getOffsetLength() {
    const data = this.getUint16()

    return [data >>> 4, data & 0xf]
  }

  getCharacter() {
    const uint8 = this.getData()

    return uint8 ? String.fromCharCode(uint8) : ''
  }
}
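The two methods that matter most for decompression are getOffsetLength, which reads 16 bits and splits them into the 12 bit offset and the 4 bit length, and getCharacter, which maps the byte 0 back to an empty string. For example, reading back one packed triplet:

const reader = new BinaryReader(new Uint8Array([0, 50, 97]).buffer)

reader.getOffsetLength() // [3, 2], because 0x0032 >>> 4 is 3 and 0x0032 & 0xf is 2
reader.getCharacter() // 'a'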

Next in the same folder create binary-writer.ts with the following content:

import { BinaryReader } from './binary-reader'

export class BinaryWriter extends BinaryReader {
  constructor(length: number) {
    super(new ArrayBuffer(length))
  }

  private setData(data: number, type = 'Uint8') {
    let advance = 0

    switch (type) {
      case 'Uint16':
        advance = 2
        break
      default:
        advance = 1
    }

    /* Only write when the full value fits in the remaining buffer */
    if (this.view.byteLength >= this.pos + advance) {
      // @ts-ignore
      this.view[`set${type}`](this.pos, data)

      this.pos += advance
    }

    return this
  }

  setUint16(data: number) {
    return this.setData(data, 'Uint16')
  }

  setUint8(data: number) {
    return this.setData(data)
  }
}

Now for creating our array buffer we create create-array-buffer.ts in our lib folder with the following content:

import { BinaryWriter } from "../modules";
import { EncodedArray } from "../types";

export function createArrayBuffer(encodedData: EncodedArray) {
  const binaryWriter = new BinaryWriter(encodedData.length * 3)

  encodedData.forEach(([offset, length, character]) => {
    binaryWriter.setUint16(offset << 4 | length)
    binaryWriter.setUint8(character.charCodeAt(0))
  })

  return binaryWriter.buffer
}

This function takes in the encoded data and outputs an array buffer. First we create a new binary writer with the length of our encoded data * 3, because we store each triplet in 3 bytes (24 bits). Next we loop over all our encodings. First we store our offset and length as a uint16 integer: the offset goes in the most significant 12 bits and the length in the 4 least significant bits. After that we store our character as a uint8 integer. Note that for the final triplet the character can be the empty string; ''.charCodeAt(0) is NaN, which the DataView stores as the byte 0, and our reader’s getCharacter maps that 0 back to an empty string. When we have processed all our data we return our buffer.
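As a worked example, the triplet [3, 2, a] from our walkthrough is packed like this:

3 << 4 | 2 // 0x0032: offset 3 in the upper 12 bits, length 2 in the lower 4 bits
// stored big endian as the bytes 0x00 0x32, followed by 0x61 for 'a'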

That is all for the compression part.

Decompress

Next we need the ability to decompress our data. So in our src folder create decompress.ts with the following content.

import { BinaryReader } from "./modules"

export function decompress(buffer: ArrayBuffer) {
  const binaryReader = new BinaryReader(buffer)

  let output = ''

  while (binaryReader.peak() !== null) {
    const [offset, length] = binaryReader.getOffsetLength()
    const char = binaryReader.getCharacter()

    if (offset === 0 && length === 0) output += char
    else {
      const startIndex = output.length - offset

      /* How far the match extends past the data decompressed so far */
      const overflow = Math.max(startIndex + length - output.length, 0)

      const chars = output.substring(startIndex, startIndex + length)

      if (overflow) {
        /*
          The match overlaps itself, so repeat the matched characters.
          Our encoder only extends a match in whole repetitions,
          so this division always yields a whole number.
        */
        let runLength = length / (length - overflow)

        while (runLength > 0) {
          output += chars

          runLength--
        }
      } else {
        output += chars
      }

      output += char
    }
  }

  return output
}

Our function takes in our encoded data and returns, hopefully, the original data. The loop runs over the entire encoded data. If the offset and length of an entry are 0, we know this entry does not have a match, so we add the character to our output. If we do have a length and offset, we get the start index in our output and retrieve our characters. Recall that the output functions as our search buffer.

Since our data could be run-length encoded, we need to check if our length overflows our current output. If it does, we determine our run length and output additional characters for the length of our run. When we don’t have a run length, we simply output the characters.
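To make the overflow case concrete, take our last triplet [1, 5, ''] with abcbbcba decompressed so far: startIndex is 8 - 1 = 7, overflow is 7 + 5 - 8 = 4 and chars is 'a'. That gives a runLength of 5 / (5 - 4) = 5, so we append a five times and end up with abcbbcbaaaaaa.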

Now if all goes well we should have our original data back. So let’s test that out.

In our src folder create test.ts with the following content:

import assert from "assert"

import { compress, decompress } from "."
import { sampleOne as input } from "./samples"

const compressed = compress(input)

const decompressed = decompress(compressed)

assert(decompressed === input)

console.log(compressed.byteLength, input.length)

if (compressed.byteLength < input.length) console.log('🎉')
else console.log('💩')

In this file we write a simple test for our compression and decompression. We first compress and then decompress. After that we check if our decompressed data equals our original data. If that succeeds we have two options: we either have reduced our data size or we haven’t, and we show a fitting emoji accordingly. We have our options set to the defaults. You can play around with these and you should get different results.
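For example, to compress with smaller buffers, pass any values up to the caps from constants.ts:

const compressed = compress(input, { searchBufferLength: 64, lookaheadBufferLength: 8 })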

In the samples folder are six different samples. You can swap out sampleOne for any of the other samples. The samples can be found on GitHub.

So let’s run some tests. To run the test, use the following command in our root folder:

npx ts-node src/test.ts

Let’s run our test with sample one.

15 12
💩

Ah, too bad, no compression here. We expanded our output by 3 bytes, giving a ratio of 0.8.

Let’s try sample two.

21 45
🎉

Hooray, we have compression! 🥹 We have a compression ratio of 2.14, not bad.

Next sample three.

2823 3059
🎉

Ok, we still achieved some compression with a ratio of 1.08.

Let’s see what sample four gives us.

22374 24543
🎉

Looks similar to last result with a ratio of 1.1.

Let’s try sample five.

21 25
🎉

Meh, still some compression, but only a ratio of 1.19.

As you can see, with our default options we get some compression on most of our samples. If we increased our search buffer we could probably do better. It also depends on the nature of our source: with highly repetitive data we would achieve more compression. The reason LZ77 does not achieve that much compression in practice is that we have to emit a triplet for every piece of data. Other LZ schemes try to solve this. For example, LZSS emits a tuple in the form of [offset, length] only when emitting it would save space compared to emitting the original data.

Conclusion

So we have come to the end of our journey into LZ77 in JavaScript. Although useless for the deflate algorithm, I found it interesting to learn about the algorithm and implement it in JavaScript. I hope that if you have gotten this far, you feel you maybe learned something too.
