Text Extraction Using PyMuPDF
PyMuPDF: Just another text extraction package?
There are many packages and products in the open source and the commercial market, which support text extraction from PDF documents in one way or another.
So why should you even bother to look at PyMuPDF?
This is what this article is about.
We will cover what differentiates PyMuPDF from other approaches and will show you first steps to get going.
PyMuPDF …
- is a product owned and maintained by Artifex. It is available under an open source, freeware license (GNU AGPL 3.0) as well as a commercial license.
- is a Python programming library, which provides convenient access to the C library MuPDF, also owned and maintained by Artifex under the same license models.
- has its homepage on GitHub and can be installed from PyPI.
- supports many (if not most) of MuPDF’s functions: text extraction is just one of dozens of features.
- offers text extraction that, like all of its features, is known for top performance and exceptional rendering quality.
- is not restricted to PDF documents, in contrast to other packages. Its API works in exactly the same way for all supported document types: apart from PDF, these include XPS, EPUB, HTML and more. We are not aware of any other package, freeware or commercial, that can offer this.
- provides integrated support for Tesseract’s OCR engine. In your script, you can dynamically determine whether OCR-ing the full document page or just some part of it is required, then invoke Tesseract and process its output together with the “regular” text.
What can go wrong in text extraction?
If you have ever worked with a text extraction tool, you have probably encountered at least one of the following pesky situations:
- Not the right (“natural” / expected) reading order.
- Unsupported / unreadable characters pop up, like here: ”The �ase �lass fo� P�MuPDF’s linkDest, …”.
- You as a human can read the page, but your program won’t produce any output.
PyMuPDF can support you in addressing all of these issues.
Depending on your needs, you can choose between basic extraction of plain text (which requires just one Python statement) and sophisticated access to each character’s position on the page, its writing direction, color, font size, font name and font properties.
Possible output formats range from plain text, through special formats like HTML or SVG, to detailed Python dictionaries (or JSON strings).
Using PyMuPDF text extraction
Extracting Plain Text
As with any Python package, you must first import PyMuPDF. This happens under the top-level name fitz:
In [1]: import fitz # import PyMuPDF
In [2]: doc = fitz.open("PyMuPDF.pdf") # open a supported document
In [3]: page = doc[0] # load the required page (0-based index)
In [4]: text = page.get_text() # extract plain text
In [5]: print(text) # process or print it:
PyMuPDF Documentation
Release 1.20.0
Artifex
Jun 20, 2022
For documents built in a straightforward way, this is all you need to do.
Because every document is also a sequence of pages in PyMuPDF, you can iterate over it with all of Python’s syntactical power. The following extracts the text of all document pages and concatenates it, using the form feed character (0x0C, i.e. chr(12)) as the page separator.
In [7]: all_text = ""
In [8]: for page in doc:
   ...:     all_text += page.get_text() + chr(12)
In [10]: # or shorter, with the even faster list comprehension:
In [11]: all_text = chr(12).join([page.get_text() for page in doc])
The above is extremely fast: expect execution times between 0.7 and less than 2 seconds for complete documents like Adobe’s PDF manuals (756 and 1,310 pages, respectively) or the Pandas manual with more than 3,000 pages.
The method is about three times faster than pdftotext (component of XPDF, the base library of Poppler) and 30 to 45 times (!) faster than popular pure Python packages like pdfminer or PyPDF2.
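Because the form feed character separates the pages in the combined string, the per-page text can be recovered later with a plain string split. Here is a minimal sketch of that idea. The page strings are made-up stand-ins for real `page.get_text()` results, so it runs without any document:

```python
FORM_FEED = chr(12)  # the same character as "\f" or "\x0c"

# Stand-ins for the strings that page.get_text() would return (made-up data).
page_texts = ["Title page\n", "Chapter 1\nSome text\n", "Index\n"]

# Join the pages exactly as in the list-comprehension variant above.
all_text = FORM_FEED.join(page_texts)

# Later: split the combined string back into one string per page.
recovered = all_text.split(FORM_FEED)

for page_number, text in enumerate(recovered):
    print(f"page {page_number}: {len(text)} characters")
```

Note that the loop variant appends a form feed after every page, including the last one, so splitting its result yields one extra, empty string at the end.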
If you suspect that the text in your document is not physically stored in reading sequence, simply use the “sort” parameter of the method: “page.get_text(sort=True)”. This will return the page’s text paragraphs arranged in the sequence “top-left to bottom-right” and should deliver satisfying results for many or most documents.
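To get a feeling for what a “top-left to bottom-right” arrangement means, here is a small illustration in plain Python. It is only a sketch of the idea, not PyMuPDF’s implementation: the sort key is our own assumption and the library’s ordering rule may differ in detail. The tuples are made-up data in the shape returned by `page.get_text("blocks")`:

```python
# Made-up block tuples in the shape returned by page.get_text("blocks"):
# (x0, y0, x1, y1, text, block_number, block_type)
blocks = [
    (300.0, 72.0, 540.0, 100.0, "right column, top", 0, 0),
    (72.0, 400.0, 280.0, 430.0, "left column, bottom", 1, 0),
    (72.0, 72.0, 280.0, 100.0, "left column, top", 2, 0),
]

# Assumed ordering rule for this sketch: vertical position first (y0),
# then horizontal position (x0).
ordered = sorted(blocks, key=lambda b: (b[1], b[0]))

for block in ordered:
    print(block[4])
```

A purely positional ordering like this interleaves the two columns, which is why multi-column pages often need some additional handling.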
You can also restrict extraction to certain areas of the page. For example, if pages have a two-column layout, you could define two rectangles representing those areas and then separately extract the corresponding text portions:
page_rect = page.rect # the full page rectangle
half_width = page_rect.width / 2  # compute half of the page width
left_rect = +page_rect  # prepare left page half: copy page rectangle
left_rect.x1 = half_width  # make it half as wide
right_rect = +page_rect  # prepare right page half: copy page rect
right_rect.x0 = half_width  # left border is middle of the page
# use those two rectangles as clip areas for the extractions:
left_text = page.get_text(sort=True, clip=left_rect)
right_text = page.get_text(sort=True, clip=right_rect)
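The same idea extends to more than two columns. The helper below is a hypothetical convenience function (it is not part of PyMuPDF) that only computes the column rectangles as plain tuples, so it can be tried without a document. The commented usage at the end assumes that the `clip` parameter accepts such a 4-tuple in place of a rectangle object:

```python
def column_rects(x0, y0, x1, y1, n_cols=2):
    """Split the rectangle (x0, y0, x1, y1) into n_cols equally wide columns.

    Returns a list of (x0, y0, x1, y1) tuples, ordered left to right.
    """
    if n_cols < 1:
        raise ValueError("n_cols must be at least 1")
    col_width = (x1 - x0) / n_cols
    return [
        (x0 + i * col_width, y0, x0 + (i + 1) * col_width, y1)
        for i in range(n_cols)
    ]


# A US Letter page measures 612 x 792 points.
rects = column_rects(0, 0, 612, 792, n_cols=3)
print(rects)

# Hypothetical use with a real page (assumes clip accepts a 4-tuple):
# r = page.rect
# texts = [page.get_text(sort=True, clip=c)
#          for c in column_rects(r.x0, r.y0, r.x1, r.y1, n_cols=3)]
```

Real layouts rarely have columns of exactly equal width, so in practice you would probably adjust the rectangles to the gutters of your document.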
Extracting Text with all Detail
The same method can also deliver detailed information alongside the extracted text, such as
- writing direction and writing mode (horizontal / vertical)
- color (RGB)
- font name and font properties
- position information (by single characters, lines and paragraphs)
- images
- automatic substitution of white spaces
- automatic hyphenation detection and handling
Let us pick a small example from the header page of PyMuPDF’s documentation PDF manual (download it from here):
In [1]: import fitz
In [2]: doc = fitz.open("PyMuPDF.pdf")
In [3]: page = doc[0]
In [4]: all_infos = page.get_text("dict", sort=True)
In [5]: from pprint import pprint  # needed for the pretty-printed output
In [6]: pprint(all_infos)
{'blocks': [{'bbox': (240.0, 88.94, 540.0, 388.94),
'bpc': 8,
'colorspace': 3,
'ext': 'png',
'height': 1200,
'image': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x04\xb0'
<< ... omitted data ... >>,
'number': 0,
'size': 107663,
'transform': (300.0, 0.0, -0.0, 300.0, 240.0, 88.94),
'type': 1,
'width': 1200,
'xres': 96,
'yres': 96},
{'bbox': (236.90, 396.92, 540.0, 432.41),
'lines': [{'bbox': (236.90, 396.92, 540.0, 432.41),
'dir': (1.0, 0.0),
'spans': [{'ascender': 1.125,
'bbox': (236.90, 396.92, 540.0, 432.41),
'color': 0,
'descender': -0.307,
'flags': 20,
'font': 'TeXGyreHeros-Bold',
'origin': (236.90, 424.80),
'size': 24.79,
'text': 'PyMuPDF Documentation'}],
'wmode': 0}],
'number': 1,
'type': 0},
{'bbox': (422.28, 433.36, 540.0, 457.98),
'lines': [{'bbox': (422.28, 433.36, 540.0, 457.98),
'dir': (1.0, 0.0),
'spans': [{'ascender': 1.123,
... and so on
The “dict” output option of “Page.get_text()” above returns a multi-level Python dictionary of dictionaries. A picture is worth a thousand words, so you may prefer to look at this image: https://pymupdf.readthedocs.io/en/latest/_images/img-textpage.png
- At the top level are “block” dictionaries, each of which represents either an image or a text paragraph. Like all lower-level dictionaries, a block contains its position on the page (the “bounding box” = “bbox”).
- A text block contains a list of “line” dictionaries.
- Apart from its bbox, a “line” dictionary contains the writing direction (“dir”: a tuple), the writing mode (“wmode”: either horizontal or vertical) and a list of text “span” dictionaries.
- A text span is a string of characters with identical properties in terms of font, font size and color. A line will have multiple spans only if its text has multiple font attributes or colors.
- Apart from its bbox, a span contains the insertion start point (called “origin”) and reports basic properties of the font it uses, like the font ascender and descender values. Consult this article for some background.
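The hierarchy described above can be walked with three nested loops. The following sketch does this on an abbreviated copy of the output shown earlier, so it runs without a document. The helper names `iter_spans` and `describe_flags` are our own, and the meaning of the individual “flags” bits is stated as we understand it from PyMuPDF’s documentation of font properties, so please verify it against the version you use:

```python
def describe_flags(flags):
    """Translate a span's 'flags' integer into readable font properties."""
    names = []
    if flags & 1:
        names.append("superscript")
    if flags & 2:
        names.append("italic")
    names.append("serifed" if flags & 4 else "sans")
    names.append("monospaced" if flags & 8 else "proportional")
    if flags & 16:
        names.append("bold")
    return names


def iter_spans(page_dict):
    """Yield every text span of a page.get_text("dict") result."""
    for block in page_dict["blocks"]:
        if block["type"] != 0:  # type 1 is an image block without lines
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                yield span


# Abbreviated copy of the output shown above (image data left out).
sample = {
    "blocks": [
        {"type": 1, "bbox": (240.0, 88.94, 540.0, 388.94), "ext": "png"},
        {
            "type": 0,
            "bbox": (236.90, 396.92, 540.0, 432.41),
            "lines": [{
                "bbox": (236.90, 396.92, 540.0, 432.41),
                "dir": (1.0, 0.0),
                "wmode": 0,
                "spans": [{
                    "bbox": (236.90, 396.92, 540.0, 432.41),
                    "color": 0,
                    "flags": 20,
                    "font": "TeXGyreHeros-Bold",
                    "size": 24.79,
                    "text": "PyMuPDF Documentation",
                }],
            }],
        },
    ]
}

for span in iter_spans(sample):
    rgb = f"#{span['color']:06x}"  # color is an sRGB integer (0xRRGGBB)
    print(span["text"], span["font"], span["size"], rgb,
          describe_flags(span["flags"]))
```

With a real document, you would pass the result of `page.get_text("dict")` to `iter_spans` instead of the sample dictionary.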
With the above plethora of information, it is possible to regenerate the page appearance with high fidelity. It is used by this example set of scripts, which can replace selected fonts of an existing PDF:
You don’t like the look of the Courier font in some technical document? Try to replace it with “Ubuntu Mono” using these scripts!
Extracting Special Output Formats
As mentioned above, use the method’s first positional parameter to request other text output formats. Create HTML pages (or similar formats like XHTML and XML) that can be displayed in your internet browser, like so:
text = page.get_text("html")
# the "with" statement closes the file even if writing fails
with open("page-%i.html" % page.number, "w", encoding="utf-8") as html_page:
    html_page.write(text)
Like the “dict” output above, HTML output also includes any images on the page.
Using PyMuPDF as a Module in Text Extraction
Invoking PyMuPDF as a Python module from the command line, via `python -m fitz gettext <options> …`, lets you avoid writing scripts in many cases.
It produces a text file whose content can be influenced by a number of parameters.
There are three output modes available:
- `fitz gettext -mode simple` — produces the output of `page.get_text()`.
- `fitz gettext -mode blocks` — produces the output of `page.get_text(sort=True)`.
- `fitz gettext -mode layout` — produces an output resembling the original page layout. Please look at this part of PyMuPDF’s documentation for details.
Other parameters let you select page ranges, the minimum font size and more.
Dynamic OCR
The primary intent of the PDF document format is to display text and other data.
Extracting text from a PDF, in contrast, is not guaranteed to always work: certain requirements must be met.
The most important requirement is the availability of data that translate the visual appearance of a character (its “glyph”) back to the original Unicode value. This information (the “character map”, CMAP) may be missing for an entire font, or only for certain characters within a font.
The absence of a CMAP prevents the extraction of text that is written in that font.
The only way to get around this hurdle is using Optical Character Recognition, OCR.
MuPDF and PyMuPDF both support programmatic invocation of the OCR tool Tesseract. Of the multiple possible uses of this interface, we pick the following situation and show a solution:
- Suppose some page text contains glyphs that have no valid Unicode value.
- (Py-) MuPDF text extraction will return the character `chr(0xFFFD)` for each such glyph (U+FFFD is Unicode’s replacement character, which stands in for an unknown or unrepresentable character).
- Whenever we encounter text containing this replacement character, we create a small temporary image of the area surrounding that text, let Tesseract recognize it and use the result instead.
The advantage of this approach is that OCR is only used where it is actually needed. For a page without this type of problem, neither execution speed nor extraction quality will suffer.
Please have a look at this demo script, which exactly follows the above recipe. Whenever it has to OCR a piece of text, it will report it like this (each � represents one `chr(0xFFFD)`):
before: 'binaries we generate – our decisions are ��u��t i�� i�to them. '
after: 'binaries we generate — our decisions are “burnt in” into them. '
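The detection step of the recipe above can be sketched in a few lines. This is not the demo script itself, just a minimal illustration under our own assumptions: the function name `spans_needing_ocr` is hypothetical, and the sample dictionary is made-up data in the shape of `page.get_text("dict")`. The actual OCR step is left out because it depends on the local Tesseract installation:

```python
REPLACEMENT = chr(0xFFFD)  # returned for glyphs without a valid Unicode value


def spans_needing_ocr(page_dict):
    """Return (bbox, text) for every span that contains U+FFFD."""
    hits = []
    for block in page_dict["blocks"]:
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line["spans"]:
                if REPLACEMENT in span["text"]:
                    hits.append((span["bbox"], span["text"]))
    return hits


# Made-up page dictionary in the shape of page.get_text("dict").
sample = {
    "blocks": [{
        "type": 0,
        "lines": [{
            "spans": [
                {"bbox": (72.0, 100.0, 300.0, 112.0),
                 "text": "our decisions are \ufffd\ufffdu\ufffd\ufffdt "
                         "i\ufffd\ufffd i\ufffdto them. "},
                {"bbox": (72.0, 114.0, 300.0, 126.0),
                 "text": "This span is fine."},
            ],
        }],
    }]
}

hits = spans_needing_ocr(sample)
for bbox, text in hits:
    n_bad = text.count(REPLACEMENT)
    print(f"needs OCR at {bbox}: {n_bad} unreadable characters")
```

Each reported bbox marks an area that could then be rendered as a small image and handed to Tesseract, while all other spans keep their regularly extracted text.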
Wrapping up
I hope you enjoyed this small glimpse into the many features of PyMuPDF.
There is a lot more to say about text handling alone, such as text searching and highlighting, or text manipulation using redactions.
I am planning to cover some of these topics in future articles.