How to automate creating reports using docx templates

Holistic AI Engineering
9 min read · Jan 26, 2023

Automating the creation of reports is a surprisingly challenging task. We explored three solutions:

  1. The first uses the FPDF library in Python to directly “draw” a PDF.
  2. The second uses Jinja, an HTML template and the WeasyPrint Python library to convert it into a PDF.
  3. The third, and the one we focus on in this article, uses the docxtpl Python library to fill a docx template, which is later converted into a PDF.

Problem to solve

From a range of inputs (radio buttons/checkbox/rich text & text inputs/a bunch of Pandas dataframes filled with data), we want to create an automated report.

What you need

  1. A docx template, with jinja2 style tags.
  2. A python environment
  3. Some dynamic data to fill your tags and create the final report.

Filling the template

To fill the template, we use the Python library docxtpl. As described in its documentation, it relies on two libraries:

  • python-docx for reading, writing and creating sub documents
  • jinja2 for managing tags inserted into the template docx

The tags in the template are variable names enclosed between “{{“ and “}}”, such as {{lottery_winner_number}}. jinja2 replaces those with whatever value is set in a dictionary (context in the snippet example below) that maps the variables to their values.

A simple example (which also inserts an image) is presented below:

from docxtpl import DocxTemplate, InlineImage
from docx.shared import Mm

img_path = "glass_onion.jpg"
doc = DocxTemplate('tpl.docx')
visual = InlineImage(doc, img_path, width=Mm(150))
context = {'company_name': 'Alpha',
           'owner': 'Miles Bron',
           'killer': 'Stop the spoiler',
           'visual': visual}
doc.render(context)
doc.save("script.docx")

Where for instance tpl.docx looks like this:

and the generated script.docx :

Image: Glass Onion: A Knives Out Mystery — Copyright Netflix

Dealing with tables

Let’s say we want to display a dynamic table, for instance a dataframe. Jinja2 syntax lets you write loops and conditional logic inside the template. You can also access Python variables if you pass them to the context object.
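The loop and conditional logic can be previewed outside Word with jinja2 alone. The sketch below renders plain text rather than a docx, and the two sample rows are illustrative data. In an actual docx template, the row loop would be written inside a table row with docxtpl's row tags, {%tr for s in suspects %} and {%tr endfor %}.

```python
from jinja2 import Template

# Illustrative sample rows, in the same shape as a list of row dictionaries
suspects = [
    {"full_name": "Duke Cody", "motive": "tbd"},
    {"full_name": "Claire Debella", "motive": "money"},
]

# A loop over the rows, with a condition on one of the fields
template = Template(
    "{% for s in suspects %}"
    "{{ s.full_name }}: "
    "{% if s.motive == 'tbd' %}motive unknown{% else %}{{ s.motive }}{% endif %}\n"
    "{% endfor %}"
)

preview = template.render(suspects=suspects)
print(preview)
```

Previewing the logic this way makes it easier to debug a loop before moving it into the Word template, where tag errors are harder to spot.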

Rendering a table with fixed columns

Imagine we have the following dataframe:

We can convert it into a list of dictionaries, where the column headers are the keys and each row is an item in that list.

suspects_list = df.to_dict(orient='records')
# ==> suspects_list now contains:
[{'full_name': 'Cassandra “Andi” Brand',
  'relationship': 'ex-business partner',
  'alibi': 'too cool for this',
  'motive': 'business fallout'},
 {'full_name': 'Duke Cody',
  'relationship': 'old friend',
  'alibi': 'live-streaming on Twitch',
  'motive': 'tbd'},
 {'full_name': 'Lionel Toussaint',
  'relationship': 'head scientist of company',
  'alibi': 'none yet',
  'motive': 'business fallout'},
 {'full_name': 'Claire Debella',
  'relationship': 'old friend',
  'alibi': 'at a political congress',
  'motive': 'money'}]

We can simply pass the variable suspects_list to the context variable like this:

from docxtpl import DocxTemplate

doc = DocxTemplate('tpl.docx')
context = {'suspects' : suspects_list}
doc.render(context)
doc.save("script.docx")

The associated template:

And the results:

Rendering any dataframe as table

Now say that the number of columns is also dynamic and we want to print any dataframe.

We can do:

from docxtpl import DocxTemplate

doc = DocxTemplate('tpl.docx')
context = {'df' : df}
doc.render(context)
doc.save("script.docx")

with the following docx template file:

The generated docx is:
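Passing the dataframe itself works, but one possible alternative (an assumption on our side, not what the snippet above does) is to hand the template plain Python lists, so that it does not depend on pandas methods. The helper below, records_to_table, is a hypothetical name; it takes the list of row dictionaries produced by to_dict(orient='records') and splits it into column labels and row values. The template could then loop over them with docxtpl's {%tc ... %} and {%tr ... %} table tags.

```python
def records_to_table(records):
    """Split a list of row dicts into column labels and lists of cell values."""
    if not records:
        return {"columns": [], "rows": []}
    # Column order follows the keys of the first record
    columns = list(records[0].keys())
    # Missing keys become empty cells so every row has the same length
    rows = [[record.get(col, "") for col in columns] for record in records]
    return {"columns": columns, "rows": rows}


sample = [
    {"full_name": "Duke Cody", "alibi": "live-streaming on Twitch"},
    {"full_name": "Lionel Toussaint", "alibi": "none yet"},
]
context = records_to_table(sample)
print(context["columns"])
print(context["rows"])
```

With a dataframe at hand, df.columns.tolist() and df.values.tolist() give the same two lists directly.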

Rich text rendering

Among our input fields, we have a rich text editor (which also supports images) that sends HTML data, which we need to render as docx.

docxtpl does not support insertion of HTML directly, but we can:

  • convert this HTML into a docx document using other libraries,
  • insert it as a subdoc in our master document using docxtpl.

HTML TO DOCX CONVERSION

We’ve explored two libraries which can be used: pypandoc and html2docx (installed from PyPI as htmldocx, hence the import name in the snippet below). The first is a wrapper for pandoc, a universal document converter. The second uses an HTML parser to read the HTML structure and the python-docx library to translate it into the docx format.


html = """
<p>We think the killer could be either:</p>
<ol>
<li>That person</li>
<li>But maybe that person too</li>
</ol>
<p>This story is kind of <em><u><strong>crazy</strong></u></em> don't you think?</p>
"""

# using pandoc
import pypandoc
pypandoc.convert_text(html,format='html', to='docx',outputfile="html-docx.docx")

# using html2docx
from htmldocx import HtmlToDocx
from docx import Document

sub_docx = Document()
new_parser = HtmlToDocx()
new_parser.add_html_to_document(html,sub_docx)
sub_docx.save("html-docx.docx")

In both cases, this is the content of html-docx.docx

INSERTING AS A SUBDOC

docxtpl allows inserting a subdoc at a given tag like this:

from docxtpl import DocxTemplate

doc = DocxTemplate('tpl.docx')
sub_doc = doc.new_subdoc("html-docx.docx")
context = {'company_name': 'Alpha',
           'owner': 'Miles Bron',
           'killer': 'Stop the spoiler',
           'rich_text_var': sub_doc}
doc.render(context)
doc.save("script.docx")

Where tpl.docx is:

And script.docx is:

Note the extra white space introduced by the HTML rendering. To avoid it, remove the blank space between {{rich_text_var}} and the previous paragraph in the template.

STREAMING DATA

The method above means that each time we need to insert rich text, we need to save a local temporary file containing the converted html text. To avoid this, we can stream the data and save it in memory instead. The full workflow using html2docx is:

import io
from docx import Document
from docxtpl import DocxTemplate
from htmldocx import HtmlToDocx

def get_subdoc(doc, raw_html):
    # convert the HTML into a python-docx document
    sub_docx = Document()
    new_parser = HtmlToDocx()
    new_parser.add_html_to_document(raw_html, sub_docx)
    # save the docx in memory
    subdoc_tmp = io.BytesIO()
    sub_docx.save(subdoc_tmp)
    subdoc_tmp.seek(0)  # rewind so the buffer is read from the start
    # create the docxtpl subdoc object
    subdoc = doc.new_subdoc(subdoc_tmp)
    return subdoc

# insert in the master docx (html is the string defined in the earlier snippet)
doc = DocxTemplate('tpl.docx')
context = {
    'company_name': 'Alpha',
    'owner': 'Miles Bron',
    'killer': 'Stop the spoiler',
    'rich_text_var': get_subdoc(doc, html)
}
doc.render(context)
doc.save("script.docx")

The script above should create the same output as before. Note that pypandoc does not support streaming for docx as the documentation states:

“It’s also possible to directly let pandoc write the output to a file. This is the only way to convert to some output formats (e.g. odt, docx, epub, epub3, pdf). In that case convert_*() will return an empty string.”
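Since a docx file is a zip archive, the in-memory pattern can be illustrated with the standard library alone. The sketch below is not docx-specific: build_archive_in_memory is a hypothetical helper, and its single entry is only a stand-in for the parts of a real document. What it shows is that a BytesIO buffer can be written and read like a file, and that it should be rewound before being handed to the next reader.

```python
import io
import zipfile


def build_archive_in_memory(files):
    """Write a zip archive into a BytesIO buffer instead of a file on disk."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    # After writing, the cursor sits at the end of the buffer;
    # rewind so that the next reader starts from the first byte.
    buffer.seek(0)
    return buffer


# Stand-in content, not a valid Word document
buffer = build_archive_in_memory({"word/document.xml": "<w:document/>"})

# The buffer can be opened again exactly like a file on disk
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
print(names)
```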

RENDERING IMAGES WITHIN THE RICH TEXT FIELD

We saw earlier how to insert an image at a tag’s location, but what if the image arrives inside the rich text, as part of the HTML? In our use case, the image is encoded as base64. In the raw HTML, it looks like this:

<img src='data:image/png;base64,iVBORw0KGgoAAA[...]kSuQmCC'> 

Where the very long string iVBORw0KGgoAAA[…]kSuQmCC is the base64-encoded image. To render this, we used a bit of a hack:

  1. Use a regular expression, regex = r'<img.*?>', to identify the image tags and extract their base64 payload.
  2. Replace each whole image tag with a jinja2 tag such as {{img0}}, {{img1}}, …
  3. Use docxtpl on the subdoc to replace those tags with the images.

The get_subdoc function used in the above section becomes:

import base64
import io
import re

from docx import Document
from docxtpl import DocxTemplate, InlineImage
from htmldocx import HtmlToDocx

def find_images(raw_html):
    """
    Takes the raw html, finds all the images and returns
    a list of image_metadata where image_metadata is a dict:
    {
        'id': int -> the id of the image (1st image -> id=0, 2nd -> id=1, etc.)
        'html_tag': str -> full string representing the HTML image tag.
            ex. "<img src='data:image/png;base64,iVBORw0KGgoAAA[...]kSuQmCC'>"
        'base64': str -> base64 encoding of the image
        'bytes': BytesIO object representing the image.
    }
    """
    regex = r'<img.*?>'
    images = []
    for cnt, match in enumerate(re.finditer(regex, raw_html)):
        image_metadata = {
            'id': cnt,
            'html_tag': match.group()
        }
        image_metadata['base64'] = extract_img_base64(image_metadata['html_tag'])
        image_metadata['bytes'] = decode_image(image_metadata['base64'])
        images.append(image_metadata)
    return images

def extract_img_base64(img_html):
    # Find the base64 encoding of the image in the image html tag.
    # Assumes the payload is directly followed by the closing quote and '>'.
    return img_html.split("base64,")[1].split('>')[0][:-1]

def decode_image(base64_img):
    # Convert the base64 encoded image to a BytesIO object
    img_bytes = base64.b64decode(base64_img)
    return io.BytesIO(img_bytes)

def get_subdoc(doc, raw_html):
    # check for images & replace them with jinja2 tags
    images = find_images(raw_html)
    for i, img in enumerate(images):
        raw_html = raw_html.replace(img['html_tag'], "{{img%s}}" % i)
    # create subdoc & parse
    sub_docx = Document()
    new_parser = HtmlToDocx()
    new_parser.add_html_to_document(raw_html, sub_docx)
    # save subdoc in memory
    subdoc_tmp = io.BytesIO()
    sub_docx.save(subdoc_tmp)
    subdoc_tmp.seek(0)
    # replace images in subdoc using docxtpl
    sub_docxtpl = DocxTemplate(subdoc_tmp)
    context = {}
    for i, img in enumerate(images):
        img_obj = InlineImage(sub_docxtpl, img['bytes'])
        context["img%s" % i] = img_obj
    sub_docxtpl.render(context)
    # write the rendered subdoc to a fresh in-memory buffer
    subdoc_tmp = io.BytesIO()
    sub_docxtpl.save(subdoc_tmp)
    subdoc_tmp.seek(0)
    # create subdoc object for use in master docxtpl
    sub_doc = doc.new_subdoc(subdoc_tmp)
    return sub_doc
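The split-based extraction above assumes that the base64 payload is immediately followed by the closing quote and the end of the tag. If the editor emits extra attributes after src, or a self-closing tag, that assumption may not hold. A possible stricter variant, sketched below with hypothetical names (IMG_DATA_URI, replace_images_with_tags), captures the payload with a named group and swaps each tag for its jinja2 placeholder in a single pass. It uses only the standard library and is not the code we ran in production.

```python
import base64
import io
import re

# Matches an <img> tag whose src is a base64 data URI, whatever the other attributes
IMG_DATA_URI = re.compile(
    r"""<img\b[^>]*?\bsrc=(['"])data:image/(?P<fmt>[\w.+-]+);base64,"""
    r"""(?P<data>[A-Za-z0-9+/=\s]+?)\1[^>]*>""",
    re.IGNORECASE,
)


def replace_images_with_tags(raw_html):
    """Return (html with {{imgN}} placeholders, list of BytesIO images)."""
    images = []

    def _swap(match):
        index = len(images)
        # Decode the captured payload and keep it in memory
        images.append(io.BytesIO(base64.b64decode(match.group("data"))))
        return "{{img%d}}" % index

    return IMG_DATA_URI.sub(_swap, raw_html), images


# Illustrative payload: arbitrary bytes, not a real picture
payload = base64.b64encode(b"not a real image").decode()
sample_html = "<p>Exhibit A:</p><img alt='x' src='data:image/png;base64,%s' width='10'/>" % payload
new_html, images = replace_images_with_tags(sample_html)
print(new_html)
```

The returned list keeps the images in placeholder order, so images[0] corresponds to {{img0}}, and so on.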

A note on formatting

Text inserted at a jinja2 tag location will take the formatting of the tag itself within the docx template, so most of the formatting can be done in the template directly.

Rich text gets inserted as a subdoc and will therefore follow default Word behaviour. For instance, if the whole template has justified text, the rich text will not be justified. However, this is easily fixed by taking advantage of the python-docx features on the subdoc itself. For instance, we could replace the last lines of the get_subdoc function above with this code to justify all paragraphs before returning sub_doc.

from docx.enum.text import WD_ALIGN_PARAGRAPH

sub_doc = doc.new_subdoc(subdoc_tmp)
for p in sub_doc.paragraphs:
    p.alignment = WD_ALIGN_PARAGRAPH.JUSTIFY

For bullet points, html2docx relies on pre-defined list style names inherited from the docx template. See repo code here. In effect, a list item is added with this line of code:

self.doc.add_paragraph(style=list_style) 

where list_style = ‘List Bullet’ or ‘List Number’. If these styles are not defined properly, the rich text format will be wrong. By default, they should be well defined, but if you experience any formatting issues, go check the styles in your docx template. For this, go to the little arrow on the bottom right of this panel in Home:

This will open this window. Scroll down to the desired style (e.g. List Number), hover over it, click the little arrow that appears on the right, then click Modify:

You can check that the style is what you’d expect there. Refer to Microsoft documentation for more guidance on this.

Converting to a PDF

Converting to PDF on a local Windows machine is easy, but most automated workflows run on Linux, and this is where things become more complex. Disclaimer: we haven’t found an efficient way of doing so in a Linux environment, but we quickly go over the options we explored.

In a Windows (or Mac) environment

The Python library docx2pdf performs that conversion. It requires Microsoft Office to be installed and uses COM on Windows and AppleScript (JXA) on macOS.

from docx2pdf import convert

convert("input.docx")
convert("input.docx", "output.pdf")

Disclaimer: I have not tested that solution on Mac.

In a Linux environment

The above library will not work in a Linux environment. We list below a few of the solutions we explored, but so far none of them has produced satisfactory results:

  • Using Libreoffice.

Note that this approach can mess up the formatting quite a bit, so we did not go ahead with it.

If you’re using a docker image, you can add the following to your Dockerfile:

RUN apt-get update && apt-get -y install libreoffice

Otherwise, just download LibreOffice by following the instructions here.

In python, you can then convert a docx to a pdf through a subprocess by doing:

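import subprocess

# convert into the current working directory and capture soffice's console output
output = subprocess.check_output(['soffice',
                                  '--convert-to',
                                  'pdf',
                                  docx_path])

# or write the PDF into a chosen directory (--outdir expects a directory, not a file name)
subprocess.call(['soffice',
                 '--headless',
                 '--convert-to',
                 'pdf',
                 '--outdir',
                 pdf_dir,
                 docx_path])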

This calls the soffice binary that is installed as part of LibreOffice, so make sure your system can find it. Note that --headless starts LibreOffice in “headless mode”, which allows using the application without a user interface.
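The call above can be wrapped in a small helper that fails early when the binary is missing and returns the path of the generated PDF. This is a minimal sketch under a few assumptions: build_soffice_command and convert_to_pdf are hypothetical names, the binary is looked up as soffice or libreoffice on the PATH, and the 120-second timeout is an arbitrary choice.

```python
import shutil
import subprocess
from pathlib import Path


def build_soffice_command(docx_path, out_dir, binary="soffice"):
    """Assemble the LibreOffice command line for a docx to PDF conversion."""
    return [binary, "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(docx_path)]


def convert_to_pdf(docx_path, out_dir, timeout=120):
    """Convert docx_path to PDF inside out_dir and return the PDF path."""
    binary = shutil.which("soffice") or shutil.which("libreoffice")
    if binary is None:
        raise FileNotFoundError("LibreOffice binary not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if LibreOffice exits with an error
    subprocess.run(build_soffice_command(docx_path, out_dir, binary),
                   check=True, timeout=timeout)
    # LibreOffice names the output after the input file, with a .pdf extension
    pdf_path = out_dir / (Path(docx_path).stem + ".pdf")
    if not pdf_path.exists():
        raise RuntimeError("Conversion finished but %s was not created" % pdf_path)
    return pdf_path


command = build_soffice_command("script.docx", "out")
print(command)
```

Keeping the command construction separate from the subprocess call makes it possible to check the arguments without LibreOffice installed.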

  • Using Pypandoc.

We did not use this solution because it does not preserve the formatting at all: the PDF is produced through LaTeX and takes on LaTeX formatting.

See in the documentation: By default, pandoc will use LaTeX to create the PDF, which requires that a LaTeX engine be installed (see --pdf-engine below).

import pypandoc

pypandoc.convert_file(docx_path, format='docx', to='pdf', outputfile=pdf_path)

  • Delegating to ChatGPT.

Other solutions to try are kindly provided by ChatGPT:

Formatting is not preserved for solutions 1 and 4, so we have our doubts about the other solutions, but we welcome your views and experiences on the subject!

Holistic AI is an AI risk management company that aims to empower enterprises to adopt and scale AI confidently. We have pioneered the field of AI risk management and have deep practical experience auditing AI systems, having reviewed over 100 enterprise AI projects covering more than 20k different algorithms. Our clients and partners include Fortune 500 corporations, SMEs, governments and regulators.

We’re hiring :)
