How to Convert HTML Tables to PDF

PyMuPDF supports HTML syntax used to define tables

PyMuPDF
6 min read · Apr 7, 2023

Starting with version 1.22.0, PyMuPDF now supports a large portion of HTML syntax that is used to define tables.

This is a continuation of an earlier article introducing the basic concepts of PyMuPDF’s “Story” layout technique. We recommend reading that article first, as it provides the background for what we discuss here.

What are HTML Tables?

Tables in HTML are defined by the <table> tag. Other tags within a table definition are available to define column and row headers and each single table cell. Styling syntax can be used to influence the visual appearance of the table, just like any other part of the HTML text.

Tables are a convenient way to lay out grids of data on webpages. A traditional train timetable, for example, is a natural fit for a table layout.

In HTML code, here is a simple example of a table with a caption, three columns and four rows:

<table>
<tr>
<th scope="col">Player</th>
<th scope="col">Gloobles</th>
<th scope="col">Za'taak</th>
</tr>
<tr>
<th scope="row">TR-7</th>
<td>7</td>
<td class="text-right">4,569</td>
</tr>
<tr>
<th scope="row">Khiresh Odo</th>
<td>7</td>
<td class="text-right">7,223</td>
</tr>
<tr>
<th scope="row">Mia Oolong</th>
<td>9</td>
<td class="text-right">6,219</td>
</tr>
<caption>Alien football stars</caption>
</table>

The table has a header row (defining the column headers), and the first column is used to define row headers. If we feed this definition into a browser, the output does not look very appealing, but the browser faithfully displays a simple table:

HTML table with no styling — default browser output
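In practice, table markup like the above is often generated from data rather than typed by hand. The following is a minimal sketch of that idea using only the Python standard library. The helper name rows_to_html_table is hypothetical, and for brevity it omits the class="text-right" attribute used in the article's listing. It places the caption last, mirroring the listing above.

```python
from html import escape


def rows_to_html_table(col_headers, rows, caption=None):
    """Build a <table> string (hypothetical helper, not part of PyMuPDF).

    The first table row holds the column headers. In every data row,
    the first item becomes the row header, the rest become plain cells.
    """
    lines = ["<table>", "<tr>"]
    lines += [f'<th scope="col">{escape(str(h), quote=False)}</th>' for h in col_headers]
    lines.append("</tr>")
    for row_header, *cells in rows:
        lines.append("<tr>")
        lines.append(f'<th scope="row">{escape(str(row_header), quote=False)}</th>')
        lines += [f"<td>{escape(str(c), quote=False)}</td>" for c in cells]
        lines.append("</tr>")
    if caption:
        # Caption placed last, as in the article's listing.
        lines.append(f"<caption>{escape(caption, quote=False)}</caption>")
    lines.append("</table>")
    return "\n".join(lines)


html_table = rows_to_html_table(
    ["Player", "Gloobles", "Za'taak"],
    [("TR-7", 7, "4,569"), ("Khiresh Odo", 7, "7,223"), ("Mia Oolong", 9, "6,219")],
    caption="Alien football stars",
)
print(html_table)
```

Escaping the cell text keeps characters such as < or & in the data from being interpreted as markup.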

To improve the look of our table, we use styling syntax to influence things like font selection, text and cell background coloring or cell grid lines. This information can either be integrated in the HTML source directly or be given as a separate CSS (Cascading Style Sheets) source file.

For our demonstration purposes, we want a sans-serif font, colored backgrounds for the row and column headers, white column header text, and a border around every table cell. Also, for fun, we are going to put our table caption at the bottom and give it a green alien border. Here is the respective styling source:

/* use sans-serif for all text */
body {font-family: sans-serif;}

/* right-align eligible cell text */
.text-right {text-align: right;}

/* cell grids, padding */
td, th {border: 0.5px solid #00f; padding: 5px;}

/* in general use centered alignment */
td {text-align: center;}

/* col header colors */
th[scope='col'] {background-color: #696969; color: #fff;}

/* row header colors */
th[scope='row'] {background-color: #d7d9f2;}

/* caption appearance */
caption {
padding: 5px;
text-align: center;
border: 1px solid #0f0;
caption-side: bottom;
}

/* table appearance: its own border, no spacing between border lines */
table {border: 2px solid #aaa; border-spacing: 0;}

A typical browser output will then look like this:

Browser output. Note: The above has been adopted from this Mozilla example.

Using PyMuPDF as an “HTML Browser”

If we put the above definitions in an HTML file called “table.html”, the following script of fewer than 20 lines will output a PDF page that looks very much like the page generated by a browser:

import pymupdf
import pathlib

# read HTML file
HTML = pathlib.Path("table.html").read_bytes().decode()

story = pymupdf.Story(html=HTML) # interpret it by the Story object
writer = pymupdf.DocumentWriter("output.pdf")
mediabox = pymupdf.paper_rect("a6") # choose a small paper size
where = mediabox + (36, 36, -36, -36) # leave 1/2 inch borders

more = True
while more: # write on one or more PDF pages as required
    dev = writer.begin_page(mediabox) # tell the writer our page size
    more, filled = story.place(where) # compute layout
    story.draw(dev) # write to the page
    writer.end_page() # finish page
writer.close() # close the writer

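The above code uses the PyMuPDF Story class to interpret the HTML and defines the rectangle on each page that may receive content. It then places and draws the content page by page, until the Story reports that nothing is left to output.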

Here is the image of the generated PDF page:

PyMuPDF output
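The article shows the HTML and the CSS as two separate listings, so they need to be combined before the script can read “table.html”. The sketch below shows two possible ways, offered as assumptions rather than as the author's exact setup. The first embeds the CSS in a <style> tag. The second passes the CSS to the Story constructor through its user_css parameter. The function names build_html_document and render_with_user_css are hypothetical.

```python
import pathlib


def build_html_document(body_html, css):
    """Wrap an HTML fragment and its CSS into one self-contained document."""
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        f"<style>\n{css}\n</style>\n"
        "</head>\n<body>\n"
        f"{body_html}\n"
        "</body>\n</html>\n"
    )


def write_table_html(body_html, css, path="table.html"):
    """Save the combined document so the article's script can read it."""
    pathlib.Path(path).write_text(build_html_document(body_html, css), encoding="utf-8")


def render_with_user_css(body_html, css, out_path="output.pdf"):
    """Alternative: keep HTML and CSS apart and hand both to the Story."""
    import pymupdf  # imported lazily so the helpers above work without PyMuPDF

    story = pymupdf.Story(html=body_html, user_css=css)
    writer = pymupdf.DocumentWriter(out_path)
    mediabox = pymupdf.paper_rect("a6")
    where = mediabox + (36, 36, -36, -36)  # 1/2 inch borders, as in the article
    more = True
    while more:
        dev = writer.begin_page(mediabox)
        more, _ = story.place(where)
        story.draw(dev)
        writer.end_page()
    writer.close()


document = build_html_document("<table></table>", "td {text-align: center;}")
print(document)
```

Keeping the CSS separate can be handy when the same styling is applied to many generated tables.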

Things to Observe for PDF Output

When generating PDF pages from HTML source, PyMuPDF can act very much like an internet browser.

However, PDF is a paginated document format, which leads to some differences compared to website output. Here is an overview of the most important similarities and differences:

  • MuPDF automatically computes adequate relative column widths and line breaks in rows for a table. This algorithm cannot be influenced currently.
  • In contrast to web sites, page breaks may occur when outputting documents.
    - MuPDF will never split a table row across page breaks, but always position each row completely on one page.
    - If a table must be split across pages, its column header is currently not repeated on subsequent pages. There are, however, easy ways to deal with this using a callback function of the PDF output process. The function collects a list of cell positions on the pages. Post-processing the PDF with this data allows drawing the required table headers, gridlines and row shadows.
Repeating header rows and alternating row background
  • Table caption width is not connected to the table width but is computed relative to the width of the block element that contains the table, e.g. a <div> tag wrapping the table.
    - Wrapping the table in its own <div> is also the recommended way to control a table’s horizontal page position.
  • Styling information must be provided in a separate CSS source or inside the HTML’s <style> tag. Individual table cells cannot be styled directly. Instead, give the table cell an HTML class attribute and assign the styling to that class.
  • PyMuPDF supports an HTML template feature based on the HTML attribute id. This can conveniently be used to target specific HTML objects. Suppose our HTML contains:
    <th scope="col" id="player">Player</th>
    And our CSS contains:
    #player {background-color:#ff00ff;}
    Then that specific <th> object gets a magenta background. Note that ids are expected to be unique and should therefore not be duplicated in your HTML code.
  • Apart from text, table cells may also contain images (or a mixture of text and images). The automatic column width computation takes this into account.
Table cells containing images
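The callback approach mentioned in the list above can be sketched as follows. This is a minimal illustration and not the author's actual script. It relies on the Story method element_positions, which, to our understanding, should be called right after place() and hands a position object to a callback. We assume here that the table rows or cells of interest carry id attributes. The names PositionRecorder and place_and_record and the "page" key are hypothetical. The dry run at the end uses stand-in objects so that the recorder can be tried without PyMuPDF.

```python
from types import SimpleNamespace


class PositionRecorder:
    """Callable that collects the positions reported for elements with an id."""

    def __init__(self):
        self.positions = []

    def __call__(self, pos):
        # Skip elements without an id (plain headings, for example).
        if not getattr(pos, "id", None):
            return
        self.positions.append(
            {
                "page": pos.page,  # supplied by us through the args dictionary
                "id": pos.id,
                "rect": tuple(pos.rect),
                "open_close": getattr(pos, "open_close", None),
            }
        )


def place_and_record(story, writer, mediabox, where, recorder):
    """Same page loop as in the article, plus a position callback per page."""
    page = 0
    more = True
    while more:
        dev = writer.begin_page(mediabox)
        more, _ = story.place(where)
        # Report element positions for the page that was just laid out.
        story.element_positions(recorder, {"page": page})
        story.draw(dev)
        writer.end_page()
        page += 1
    return page  # number of pages written


# Dry run of the recorder with stand-in objects instead of a real Story.
recorder = PositionRecorder()
recorder(SimpleNamespace(id="row-1", rect=(36.0, 50.0, 260.0, 70.0), page=0, open_close=1))
recorder(SimpleNamespace(id=None, rect=(0.0, 0.0, 0.0, 0.0), page=0, open_close=1))
print(recorder.positions)
```

After writer.close(), the collected rectangles could drive a post-processing pass that reopens the PDF and draws repeated headers or alternating row backgrounds at the recorded positions.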

Conclusion

Table support in PyMuPDF gives you much more control over your PDF layout. It should also feel more natural: anyone with basic HTML and CSS experience will hopefully find things much more intuitive.

So, please try it out and get back to us with enhancements or feature requests — PyMuPDF is constantly evolving and your feedback helps us make the right choices for the next direction!

PyMuPDF is a large, full-featured document-handling Python package. Apart from its superior performance and top rendering quality, it is also known for its excellent documentation: the PDF version today has over 420 pages in Letter format — more than 70 of which are devoted to recipes in How-To format — certainly a worthwhile read.

Another knowledge source is the utilities repository. Whatever you plan to do when dealing with PDFs: you will probably find some example script there that gives you a start.

You can reach the devs on the #pymupdf Discord channel.

PyMuPDF is a high-performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.