An Example of a Word Cloud with a Mask

A word cloud is a fairly traditional, perhaps already old-fashioned, way to depict the content of a text or a corpus (a set of texts). Nevertheless, it is still a good way to convey the general idea of a text. This form of communication can be improved further by shaping the word cloud as an image that itself evokes the subject of the text. This article demonstrates how to generate such a word cloud.

This word cloud was generated for an assignment whose goal was to assess Family Trust Deeds. The corpus consisted of a set of trust deed templates. Preprocessing consisted of cleaning leftover markup from the imported PDF and Word files. Part-of-speech filtering was then applied, keeping only nouns and verbs in the text. As a result, only the words that carry meaning in the trust deed domain were left to compose the word cloud.
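
The noun-and-verb filter described above can be sketched as follows. This is a minimal illustration, not the code used in the assignment. It assumes the tokens have already been tagged with Penn Treebank part-of-speech tags, in the (word, tag) format produced by taggers such as NLTK's `nltk.pos_tag`. The function name `keep_nouns_and_verbs` and the sample sentence are hypothetical.

```python
def keep_nouns_and_verbs(tagged_tokens):
    """Return the lower-cased words whose tag marks a noun (NN*) or a verb (VB*)."""
    kept = []
    for word, tag in tagged_tokens:
        # Penn Treebank noun tags start with "NN", verb tags with "VB".
        # isalpha() drops punctuation and leftover markup tokens.
        if tag.startswith(("NN", "VB")) and word.isalpha():
            kept.append(word.lower())
    return kept


# Hypothetical, hand-tagged sample sentence standing in for the real corpus.
tagged = [
    ("The", "DT"), ("trustee", "NN"), ("shall", "MD"), ("distribute", "VB"),
    ("the", "DT"), ("income", "NN"), ("annually", "RB"), (".", "."),
]

filtered_text = " ".join(keep_nouns_and_verbs(tagged))
print(filtered_text)  # trustee distribute income
```

Joining the surviving words back into one string gives a value that can be passed directly to a word cloud generator.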

The Python WordCloud API was used to generate the word cloud; however, an image was needed to serve as the mask onto which the cloud would be rendered. The image had to convey the feeling of a Family Trust without being too detailed; otherwise, the algorithm would not be able to render the words clearly within the borders of the shapes. The SVG image to the left was found at and could be used free of charge.
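
As a rough illustration of how a mask is interpreted: the WordCloud documentation states that pure white entries of the mask are treated as masked out, while all other entries are free to draw on. An off-white background (for example from anti-aliasing or compression) would therefore still attract words. The sketch below forces near-white pixels to pure white. It assumes the image has already been rasterised into an RGB NumPy array; the helper `whiten_background`, its `threshold` value, and the tiny synthetic image are hypothetical.

```python
import numpy as np


def whiten_background(rgb, threshold=245):
    """Return a copy of an RGB uint8 array with near-white pixels set to pure white."""
    cleaned = rgb.copy()
    # A pixel counts as near-white when all three channels reach the threshold.
    near_white = (rgb >= threshold).all(axis=-1)
    cleaned[near_white] = 255
    return cleaned


# Hypothetical stand-in for a rasterised mask image: an off-white background
# (value 250) with a coloured 2x2 square in the middle.
image = np.full((4, 4, 3), 250, dtype=np.uint8)
image[1:3, 1:3] = (30, 90, 160)

mask = whiten_background(image)

# Pixels that are not pure white are the ones where words may be placed.
drawable = ~(mask == 255).all(axis=-1)
print(drawable.sum())  # 4
```

The fewer fine details the drawable region has, the more room the layout algorithm has to fit whole words inside it.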

Using the mask above, the corpus produced the following word cloud image.

The code used to generate the image can be found in the Gist below. The variable filtered_text holds the corpus, already preprocessed as described above. The WordCloud constructor builds the object, while the wc.generate() method creates the image in memory. When the image is plotted, the wc.recolor() method uses the mask image to recolour the words so that the cloud takes on the colours of the mask.