Methods with Losses: JPEG2000 and JBIG2
Previously published in the PDF Optimization In-depth Series:
- Introduction to the Optimization of Existing PDF files: Methods
- Lossless Methods: Optimization of Document Content
- Lossless Methods: Compression of Streams and Fonts
This article was originally published on the GdPicture blog.
Sometimes, when acceptable or necessary, it can be worthwhile to recompress the images already included in an existing PDF file to reduce the file size.
The PDF specification allows for seven compression schemes for such purposes. All of them can be used to compress images.
Among these different compression schemes, two offer particularly beneficial optimization opportunities: JPEG2000 for 24-bit color and 8-bit grayscale images and JBIG2 for bitonal images, usually black and white.
JPEG2000
JPEG2000 (JP2, JPX) is a standardized (ISO/IEC 15444) image compression scheme. It was designed as a successor to the existing JPEG standard and replaces JPEG's discrete cosine transform with a wavelet transform.
Pros:
- The compression is often much better than all other schemes for color images. Some sources mention a 20% improvement in compression efficiency over the JPEG standard.
- The scheme lets you adjust the compression ratio, which determines the quality of the resulting image.
- It offers impressive quality on colored text when used as part of MRC compression.
- It also provides an optional lossless compression mode. According to some sources, lossless JPEG2000 files are about half the size of the original.
Cons:
- The scheme is not available in PDF versions below 1.5. This rules it out for PDF/A-1a and PDF/A-1b, which some jurisdictions impose as the standard for exchange and retention.
- Decompression can be very slow, especially on very high-resolution images. This can degrade the user experience with some viewers.
- It provides little compression vs. quality benefit on small images, especially photos.
This method is suitable for colored photographs and structured documents scanned in color, like invoices, forms, etc.
It is not suitable for:
- Compression of transparency masks. The PDF specification requires the alpha channel of an image with transparency to be stored separately, in an image that acts as a mask.
- Images with flat colors like logos, lines, or geometric shapes with filling.
- Any other image where loss is not desired.
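The criteria above can be summed up as a small eligibility check. This is only a sketch: the `ImageInfo` class, its field names, and the `jpeg2000_is_suitable` function are hypothetical names chosen for illustration, and a real optimizer would have to detect properties such as flat colors by itself.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ImageInfo:
    """Hypothetical description of an image found in a PDF file."""
    bits_per_component: int      # 1 for bitonal, 8 for grayscale or color
    components: int              # 1 = grayscale, 3 = RGB
    is_mask: bool = False        # True for a transparency mask
    flat_colors: bool = False    # True for logos, lines, filled shapes
    loss_allowed: bool = True    # False when no loss is acceptable


def jpeg2000_is_suitable(image: ImageInfo, pdf_version: Tuple[int, int]) -> bool:
    """Return True when lossy JPEG2000 recompression looks like a reasonable choice."""
    # JPEG2000 is only available from PDF 1.5 onwards.
    if pdf_version < (1, 5):
        return False
    # Masks, flat-color artwork, and images that must stay intact are excluded.
    if image.is_mask or image.flat_colors or not image.loss_allowed:
        return False
    # Target 8-bit grayscale and 24-bit color images.
    return image.bits_per_component == 8 and image.components in (1, 3)


photo = ImageInfo(bits_per_component=8, components=3)
print(jpeg2000_is_suitable(photo, (1, 7)))  # color photo in a PDF 1.7 file
print(jpeg2000_is_suitable(photo, (1, 4)))  # same photo, PDF 1.4 target
```

With a PDF 1.4 target, which is the case for PDF/A-1, the check fails regardless of the image content.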
Here are some numbers for comparison.
JBIG2
JBIG2 is a standardized (ISO/IEC 14492) image compression algorithm for bitonal (black and white) images, based on the JBIG compression scheme.
If you want to know more, you can check our blog article here.
Pros:
- The compression can be either lossless or lossy by definition.
- The compression rate is often much higher than CCITT Group 4 compression, even in lossless mode.
- It can be used in PDF documents since version 1.4 of the PDF format.
- It is supported in all PDF/A versions.
Cons:
- Lossy encoding can produce very unexpected results.
This scheme is suitable for all types of bitonal images when the lossless encoding is used.
It is not suitable for lossy compression of sensitive data, where even the slightest substitution of symbols can produce unexpected problems.
Here are some numbers for comparison.
Issues with JBIG2
It is worth looking at the reasons for, and the consequences of, the unexpected results of lossy JBIG2 compression mentioned above.
In 2013 it was reported that the JBIG2 compression implementation in some Xerox scanners unexpectedly changed some characters to others when scanning documents. The cause was aggressive use of the high-level lossy compression.
Following this, the German Federal Office for Information Security (BSI) banned the use of pattern-matching algorithms, effective March 2015. The JBIG2 lossy compression algorithm uses this kind of technique. Although the ban does not name JBIG2 compression specifically, many publishers and service providers have decided to ban the JBIG2 encoder. Others choose to use JBIG2 only in lossless mode.
It seems that character substitution is the least desirable effect.
To date, no open-source library appears to produce a correct result, even when a high level of quality is requested.
It is therefore wise to make sure that the vendor developing the encoder can limit this substitution as much as possible, using heuristics or optical character recognition (as we do with our hyper-compression solution, which includes, but is not limited to, JBIG2 compression).
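To see why pattern matching can swap characters, here is a toy model of a symbol dictionary. It is not the JBIG2 codec: the 5x5 glyphs, the `hamming` and `encode` functions, and the `max_diff` threshold are all assumptions made for illustration. The idea it shows is that an encoder reusing an existing symbol whenever a new glyph is "close enough" will, with too loose a threshold, store an 8 as a 6.

```python
from typing import List, Tuple

Glyph = Tuple[str, ...]  # rows of '#' (black) and '.' (white) pixels


def hamming(a: Glyph, b: Glyph) -> int:
    """Count the pixels that differ between two same-sized glyph bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def encode(glyphs: List[Glyph], max_diff: int) -> Tuple[List[Glyph], List[int]]:
    """Build a symbol dictionary, reusing an entry when it differs by at most max_diff pixels."""
    dictionary: List[Glyph] = []
    indices: List[int] = []
    for glyph in glyphs:
        for i, symbol in enumerate(dictionary):
            if hamming(glyph, symbol) <= max_diff:
                indices.append(i)  # substitute the stored symbol for this glyph
                break
        else:
            dictionary.append(glyph)  # no close match: store a new symbol
            indices.append(len(dictionary) - 1)
    return dictionary, indices


SIX = (".###.", "#....", "####.", "#...#", ".###.")
EIGHT = (".###.", "#...#", ".###.", "#...#", ".###.")

# Exact matching only: both digits keep their own symbol.
print(encode([SIX, EIGHT, SIX], max_diff=0)[1])
# Loose matching: the 8 is replaced by the symbol stored for the 6.
print(encode([SIX, EIGHT, SIX], max_diff=2)[1])
```

In this toy example the two digits differ by only two pixels, so a threshold of 2 is enough to merge them. On a low-resolution scan of small print, the differences between real characters can be just as small.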
If the targeted standard does not allow JPEG2000 encoding, it can be worthwhile to chain JPEG and Deflate compression. The specification allows several filters to be applied in sequence to the same stream. This practice provides compression gains of up to 15%.
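A minimal sketch of this chaining, using only Python's standard `zlib` module, could look like the following. The function name `chain_flate_over_jpeg` is hypothetical, and the code assumes the image has already been encoded as JPEG. In a PDF stream dictionary, filters are listed in the order a reader applies them when decoding, so Deflate applied on top of JPEG data is declared as `[/FlateDecode /DCTDecode]`.

```python
import zlib
from typing import Tuple


def chain_flate_over_jpeg(jpeg_bytes: bytes) -> Tuple[bytes, str]:
    """Deflate already-encoded JPEG data and return the stream bytes with its /Filter entry."""
    deflated = zlib.compress(jpeg_bytes, 9)
    # Keep the extra filter only when it actually shrinks the stream.
    if len(deflated) < len(jpeg_bytes):
        return deflated, "/Filter [/FlateDecode /DCTDecode]"
    return jpeg_bytes, "/Filter /DCTDecode"


# Stand-in bytes for illustration only; this is not a valid JPEG image.
sample = b"\xff\xd8" + b"\x00" * 1000 + b"\xff\xd9"
data, filter_entry = chain_flate_over_jpeg(sample)
print(filter_entry, len(sample), len(data))
```

The actual gain depends on the image: JPEG data is already entropy-coded, so the second pass mostly helps with headers, metadata, and repetitive segments, which is why the sketch falls back to plain JPEG when Deflate does not help.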
In the next articles, we will talk about other methods with losses: resizing, color detection, and MRC compression.
See you next time,
Loïc & Elodie