A Deep Factorization of Style and Structure in Fonts

Chinmay Talegaonkar
Published in The Startup
May 4, 2020

EMNLP 2019 Oral Paper by N. Srivatsan et al.

Samples from Capitals64 dataset showing partially observed character sets for a variety of font styles.

Introduction

This paper tackles an interesting problem: disentangling the characters of a given font into their constituent font style and structure (content). Optical Character Recognition is a related problem, in which an algorithm must recognize characters regardless of their font or style. A more challenging problem is to generate the missing characters of a given font, or to generate entirely new font styles and their corresponding characters. A character rendered in a given font style is referred to as a glyph; for instance, the letter 'a' rendered in two different fonts gives two glyphs with the same content but different styles. This paper proposes a clever and powerful generative model as a solution to this challenging problem and demonstrates promising results. This blog post is an attempt to unfold the key insights of the paper and highlight some interesting results.

Font reconstruction as matrix completion

An interesting take on font reconstruction is to treat it as a matrix completion problem. Let xᵤᵥ be an entry in the matrix X, for a character (content) u and a font style v. Not all characters are available for all font styles, so some entries of the matrix are empty. The task of recovering the unseen glyphs of a given font style is called font reconstruction, and it is equivalent to filling in the missing entries of X. An interesting aspect of this work, however, is that it not only generalizes to unseen content instances but can also generate new font styles. This is possible because the generative model attempts to learn a manifold of font styles, i.e. a continuous vector space in which each element corresponds to a different font style. In a good font manifold, the distance between two vectors indicates the similarity between the two corresponding font styles. Learning such a manifold also allows for style transfer between different font embeddings. The generative model rests on the fundamental idea that the style of a glyph can be captured as a latent vector embedding zᵥ (where v is the font style of the glyph), inferred from glyphs of font style v. Each content instance (character) is represented as a learned embedding eᵤ (where u is a content instance). A given character shares the same content embedding across all fonts, while a given font style shares the same latent vector across all of its glyphs.
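To make the matrix-completion framing concrete, here is a toy NumPy sketch. The dictionary-of-observed-entries representation and all dimensions are assumptions for exposition only, not the paper's code.

```python
import numpy as np

# A toy view of font reconstruction as matrix completion (illustrative only).
# Rows index content (characters u), columns index font styles v; each observed
# entry X[(u, v)] is a glyph image, and many entries are missing.
num_chars, num_fonts, H, W = 26, 10, 64, 64
X = {}                                   # sparse "matrix" of observed glyphs
X[(0, 3)] = np.zeros((H, W))             # e.g. the glyph for 'a' in font number 3

# The model explains every entry with two factors:
content_dim, style_dim = 32, 128
e = np.random.randn(num_chars, content_dim)  # e_u: one embedding per character,
                                             #      shared across all fonts
z = np.random.randn(num_fonts, style_dim)    # z_v: one latent vector per font,
                                             #      shared across all characters

# Filling in a missing glyph x_uv then amounts to decoding the pair (e[u], z[v]);
# the decoder sketched later in this post plays the role of that decode step.
```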

How to train your dragon — Network architecture

Network Architecture (images taken from the paper)

The network comprises an encoder-decoder architecture and uses variational inference, as in VAEs, to disentangle a glyph into separate style and content representations. A projection loss based on the statistics of text images is used to obtain sharper font reconstructions.

Encoder Architecture

The encoder takes the glyph images of a single font style, each concatenated with its content embedding, and produces an embedding zᵥ for the corresponding font style v. Each input glyph is passed through three blocks, where each block consists of a convolutional layer, followed by a max pool with a stride of two, an instance norm, and a ReLU unit. Thus, after the three blocks, every glyph yields a vector. An element-wise max operation (similar to max pooling) is then performed across the vectors of all glyphs, producing a single embedding for the font. This embedding is passed through a few fully connected layers to obtain the final latent parameters, from which an embedding zᵥ is sampled for the input font style v. Randomly masking out some of the per-glyph representations, together with the max pooling operation, prevents the font embedding from depending on any particular character.
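Below is a minimal PyTorch-style sketch of such an encoder; the class name and all hyperparameters (64x64 glyphs, a 32-channel content embedding concatenated to each glyph, 64 hidden channels, a 128-dimensional latent) are illustrative assumptions rather than the paper's exact values.

```python
import torch
import torch.nn as nn

class FontEncoder(nn.Module):
    """Minimal sketch of the encoder described above (illustrative sizes only)."""

    def __init__(self, in_ch=1 + 32, hidden=64, latent_dim=128):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.MaxPool2d(kernel_size=2, stride=2),   # halves spatial resolution
                nn.InstanceNorm2d(c_out),
                nn.ReLU(),
            )
        # three blocks shared by every glyph of the font
        self.blocks = nn.Sequential(block(in_ch, hidden),
                                    block(hidden, hidden),
                                    block(hidden, hidden))
        self.to_mu = nn.Linear(hidden * 8 * 8, latent_dim)
        self.to_logvar = nn.Linear(hidden * 8 * 8, latent_dim)

    def forward(self, glyphs):
        # glyphs: (num_glyphs, in_ch, 64, 64) -- each glyph image concatenated
        # channel-wise with its broadcast content embedding
        h = self.blocks(glyphs).flatten(start_dim=1)     # one vector per glyph
        pooled, _ = h.max(dim=0)                         # element-wise max across glyphs
        mu, logvar = self.to_mu(pooled), self.to_logvar(pooled)
        z_v = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterized sample
        return z_v, mu, logvar

# Example: infer a style embedding from five observed glyphs of one font.
z_v, mu, logvar = FontEncoder()(torch.randn(5, 33, 64, 64))
```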

Decoder Architecture

The decoder architecture is built on the key idea that a glyph can be modeled in terms of

  1. A low-resolution character representation eᵤ for a content instance u, which is fed in at the start of the decoder since it coarsely determines the shape of the glyph.
  2. A high-resolution style embedding zᵥ for a font style v, which adds the complex stylistic features of v to the glyph.
General Overview of Decoder

The decoder for a given glyph consists of four blocks in sequence. Each block contains a transpose convolution, which upsamples the previous layer and halves its number of channels. Each transpose convolution is followed by an instance norm and a ReLU activation, which are in turn followed by two convolutional layers. The character embedding is first passed through an MLP and then reshaped into a tensor before being fed into the decoder. What follows the transpose convolution is a very interesting feature of this architecture: the convolutional filters of each block are not learned through backpropagation through the decoder, but are instead the output of a small multilayer perceptron whose input is the font latent variable zᵥ. Hence, the decoder's convolutional filters are parametrized by the encoder's font embedding zᵥ. These filters form the convolutional layers inserted between the transpose convolution layers, refining the output and adding font-specific details. To reconstruct the glyph for a given content instance u, the decoder takes eᵤ as input along with the inferred font style embedding zᵥ.
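The hypernetwork-style trick of generating filters from zᵥ can be sketched in PyTorch as follows; the class name, layer sizes, and the use of a single generated convolution per block (instead of the two described above) are assumptions made for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperConvBlock(nn.Module):
    """One decoder block whose convolutional filters are generated from the
    font embedding z_v by a small MLP (illustrative sketch, not the paper's
    exact architecture)."""

    def __init__(self, c_in, c_out, z_dim, k=3):
        super().__init__()
        # learned transpose convolution: upsamples 2x and halves the channels
        self.up = nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1)
        self.norm = nn.InstanceNorm2d(c_out)
        self.c_out, self.k = c_out, k
        # small MLP mapping z_v to a full set of convolutional filter weights
        self.filter_mlp = nn.Sequential(
            nn.Linear(z_dim, 256), nn.ReLU(),
            nn.Linear(256, c_out * c_out * k * k),
        )

    def forward(self, h, z_v):
        h = F.relu(self.norm(self.up(h)))                       # upsample the features
        w = self.filter_mlp(z_v).view(self.c_out, self.c_out, self.k, self.k)
        return F.relu(F.conv2d(h, w, padding=self.k // 2))      # font-conditioned conv

# Example: upsample 8x8 feature maps to 16x16, conditioned on a 128-dim z_v.
block = HyperConvBlock(c_in=128, c_out=64, z_dim=128)
out = block(torch.randn(1, 128, 8, 8), torch.randn(128))        # shape (1, 64, 16, 16)
```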

Asymmetrical inputs to the decoder

The content embeddings and the latent style embeddings are fed into different parts of the decoder asymmetrically, as described above and shown in the figure. At inference time, the test input consists of a random subset of glyphs of a font, which are fed to the encoder to infer the corresponding latent font embedding. This font embedding, along with the full set of content/character embeddings, is then fed to the decoder to generate all the glyphs of that font style.

Projected Loss

One of the most common loss functions for images is MSE, which arises from assuming an independent Gaussian output distribution for each pixel. Prior work shows that the value of a pixel in a text image is strongly dependent on its neighboring pixels; modeling text images or glyphs with heavy-tailed distributions, which better capture edges, is therefore a better fit and leads to sharper reconstructions.

Based on this insight, the paper computes the loss as follows:

  1. Transform the decoder output and the ground-truth glyph using a 2D DCT basis. The 2D DCT is an orthogonal basis, so a likelihood computed on the transformed variables corresponds to a likelihood in the original image domain as well. Computing a heavy-tailed loss over the frequency decomposition provided by the 2D DCT, instead of over raw pixel values, helps the decoder generate sharper images without resorting to an adversarial discriminator. Let the observed glyph be x, let the output of the decoder be y, and let f() be the function that applies the 2D DCT transform.
  2. Impose a heavy-tailed distribution, such as the Cauchy distribution, in the transformed space to model the loss function. The Cauchy distribution is a justified choice since images tend to be mostly smooth with occasional sharp variation in the form of edges, which makes the distribution of their frequency coefficients heavy-tailed. g() denotes the likelihood that is used as the loss function; a sketch of this loss is given below.
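For concreteness, here is a minimal NumPy/SciPy sketch of a Cauchy negative log-likelihood computed over 2D DCT coefficients; the function name, the scale parameter gamma, and the dropped normalization constants are assumptions, and the paper's exact form of g() may differ.

```python
import numpy as np
from scipy.fft import dctn

def projected_cauchy_loss(x, y, gamma=1.0):
    """Heavy-tailed reconstruction loss in a 2D DCT basis.

    x, y  : ground-truth glyph and decoder output, arrays of shape (H, W)
    gamma : Cauchy scale parameter (a hyperparameter assumed here)
    """
    fx = dctn(x, norm="ortho")          # f(x): orthonormal 2D DCT of the target
    fy = dctn(y, norm="ortho")          # f(y): orthonormal 2D DCT of the output
    residual = fx - fy
    # Negative log-likelihood of the residual under a Cauchy distribution,
    # summed over all frequency coefficients (additive constants dropped).
    return np.sum(np.log1p((residual / gamma) ** 2))

# Example: a noisy reconstruction of a sharp edge incurs a measurable penalty.
x = np.zeros((64, 64)); x[:, 32:] = 1.0
y = np.clip(x + 0.1 * np.random.randn(64, 64), 0, 1)
print(projected_cauchy_loss(x, y))
```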

Results and Interesting Observations

The results in the paper show that this approach outperforms existing approaches to font reconstruction. The authors present comparisons with GlyphNet, a GAN (generative adversarial network) based approach, and with a naive nearest-neighbors baseline. Nearest neighbors performs well on datasets where neighboring font styles are almost indistinguishable, but its ability to generalize to new or unseen fonts is severely limited.

Qualitative Comparison

Comparing reconstructions of the proposed method and the two baselines for unseen fonts

The above figure highlights the capabilities of the proposed method and the two baselines on the task of generalizing to unseen fonts. GlyphNet generally fails to produce fully enclosed letters or to match the texture, missing some of the subtleties. The nearest-neighbors outputs are accurate in content, but the retrieved font often does not match the style of the input. The proposed method picks up the stylistic subtleties much better than either baseline.

Traversing the Font Manifold

The proposed method can generate new fonts by interpolating between inferred font representations on the font manifold in an interpretable manner. To illustrate this, the authors take two fonts that belong to the same font family but differ in one property. The corresponding glyphs are passed through the encoder to obtain the latent font style embedding of each font (zᵣ, zₒ). A linear combination of these two embeddings can then be used to generate a new style embedding vector zₐ, e.g.

zₐ = l · zᵣ + (1 − l) · zₒ,

where l can be varied between 0 and 1. zₐ can then be given as input to the decoder to produce a new font with properties intermediate between those of zᵣ and zₒ. The figure below shows a result from the paper in which the model applies serifs, italicization, and boldness gradually while leaving the font unchanged in other respects. This property would be quite useful for the controlled generation of new fonts that differ from existing fonts in specific aspects.

Showing smoothness of the latent manifold. Linear combinations of the embedded font variants of the same font family correspond to outputs that have intermediate characteristics of the fonts chosen as inputs.
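For concreteness, the interpolation amounts to nothing more than the following toy NumPy sketch (the 128-dimensional embeddings and the five interpolation steps are arbitrary assumptions):

```python
import numpy as np

# Toy sketch of traversing the font manifold between two inferred style
# embeddings z_r and z_o.
z_r = np.random.randn(128)      # e.g. the regular variant of a font family
z_o = np.random.randn(128)      # e.g. the bold / italic / serif variant

for l in np.linspace(0.0, 1.0, 5):
    z_a = l * z_r + (1.0 - l) * z_o      # intermediate style embedding
    # z_a would then be fed to the decoder to render the interpolated glyphs
```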

Conclusion

The paper highlights an interesting approach to disentangling content and style in glyphs and to generalizing to unseen font styles and content instances. Learning a continuous manifold of font styles allows the generation of new fonts. The clever inductive biases built into the encoder and decoder help the network outperform the baselines on generalization to unseen fonts. An interesting extension of this idea would be to move from glyphs to higher-dimensional modalities, such as sentences or paragraphs that use multiple font styles at once.

Related Work

“Unsupervised Transcription of Historical Documents” Taylor Berg-Kirkpatrick, Greg Durrett, and Dan Klein. ACL, 2013

The paper uses a generative modeling approach to transcribe printing-press-era documents, and reports a 31% improvement over Google's OCR system using this approach. The generative model is inspired by the structure of the printing process and models the text and the noise process jointly. The text is modeled with a language model, while the rendering of each glyph is modeled with an inking model, a noise model, and a typesetting model for the region encompassed by the glyph. The generative model prints character images line by line. This work is related to the paper under review, as both papers focus on separating content/meaning from variable font types.

“Improved Typesetting Models for Historical OCR” Taylor Berg-Kirkpatrick and Dan Klein. ACL, 2014

This paper extends the above work by introducing richer typesetting models. One model breaks the independence assumption between the vertical offsets of neighboring glyphs, significantly reducing transcription error rates. Another model learns multiple font styles jointly, allowing accurate tracking of italic and non-italic portions of documents. The paper also introduces a much faster inference procedure (25x faster than the previous paper) and achieves a 22% reduction in error rate compared to state-of-the-art models on datasets of old newspapers. This work is related to the paper under review because historical OCR must deal with glyphs from different or unseen fonts, and the reviewed work is a step toward disentangling content from font style/typesetting in that setting.

“Toward Controlled Generation of Text” Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, Eric P. Xing, ICML (PMLR) 2017

There has been plenty of recent work in computer vision on controlled image and video generation; this work aims to bring the power of such generative models to the language domain. It tackles controlled, realistic text generation, where the attributes of the text are modeled using disentangled representations learned with VAEs. The proposed generative model combines variational autoencoders (VAEs) with attribute discriminators that induce a semantic structure on the generator, and it allows the VAE to leverage fake samples to generate more plausible and realistic examples. The model uses a differentiable approximation to discrete text samples, explicit constraints for independent attribute control, and efficient collaborative learning of generators and discriminators to learn interpretable representations and produce sentences with desired attributes such as sentiment and tense. This work is related to the paper under review, as it also learns disentangled, decoupled representations and then fuses them to generate text.

“Improved variational autoencoders for text modeling using dilated convolutions” Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. ICML, 2017

VAEs with LSTM decoders are generally observed to perform worse than much simpler LSTM language models, and this result is not well understood. One explanation is that the powerful LSTM decoder tends to ignore the conditioning information provided by the encoder. This paper uses a dilated CNN instead of an LSTM as the decoder of the VAE; changing the dilation of the decoder controls the window of context from previously generated words. The authors discover a trade-off between the contextual capacity of the decoder and the effective use of the encoding, and when this trade-off is carefully balanced, VAEs can outperform LSTM language models, giving the first positive language-modeling result with VAEs. This work is related to the paper under review, as it also encodes inputs into latent representations and then decodes (or modifies) them with a CNN-based decoder.

“Style Transfer from Non-Parallel Text by Cross-Alignment” Tianxiao Shen, Tao Lei, Regina Barzilay, Tommi Jaakkola. NeurIPS, 2017

This work focuses on style transfer in a non-parallel text setting, where separating the content of a sentence from its style-related aspects is a challenging problem. A shared latent content distribution is assumed across the different text corpora, and the paper introduces a refined alignment of sentence representations across corpora that is used to perform style transfer. An encoder takes a sentence and its original style indicator as input and produces a style-independent content representation; a style-dependent decoder then uses this representation to render the output sentence. The paper under review has goals similar to style transfer, as it seeks to learn a manifold of font styles that can be inferred from a small sample of glyphs, which makes this paper relevant. Style transfer in the language domain generally struggles to disambiguate style from content; in the paper under review, this separation is clearly defined, with content (the character) governing the coarse shape of a glyph and font style governing the finer-level features.

“A Deep Factorization of Style and Structure in Fonts” Nikita Srivatsan, Jonathan T. Barron, Dan Klein, Taylor Berg-Kirkpatrick. EMNLP, 2019

This is the original paper reviewed in this blog post. All images used in the post (except the equations) are taken from the original paper: https://arxiv.org/abs/1910.00748.

References

[1] Nasir Ahmed, T Natarajan, and Kamisetty R Rao. 1974. Discrete cosine transform. IEEE Transactions on Computers.

[2] Samaneh Azadi, Matthew Fisher, Vladimir G Kim, Zhaowen Wang, Eli Shechtman, and Trevor Darrell. 2018. Multi-content GAN for few-shot font style transfer. CVPR.

[3] Jonathan T. Barron. 2019. A general and adaptive robust loss function. CVPR.

[4] Taylor Berg-Kirkpatrick, Greg Durrett, and Dan Klein. 2013. Unsupervised transcription of historical documents. ACL.

[5] Taylor Berg-Kirkpatrick and Dan Klein. 2014. Improved typesetting models for historical OCR. ACL.

[6] Neill DF Campbell and Jan Kautz. 2014. Learning a manifold of fonts. ACM TOG.

[7] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. 2016. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136.

[8] Vincent Dumoulin and Francesco Visin. 2016. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285.

[9] David J. Field. 1987. Relations between the statistics of natural images and the response properties of cortical cells. JOSA A.

[10] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. 2016. Image style transfer using convolutional neural networks. CVPR.

[11] Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Controllable text generation. arXiv preprint arXiv:1703.00955, 7.

[12] Jinggang Huang and David Mumford. 1999. Statistics of natural images and models. CVPR.

[13] Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. ICLR.

[14] Diederik P Kingma and Max Welling. 2014. Autoencoding variational bayes. ICLR.

[15] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. ICML.

[16] Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. NeurIPS.

[17] Ilya O Tolstikhin, Sylvain Gelly, Olivier Bousquet, Carl-Johann Simon-Gabriel, and Bernhard Schölkopf. 2017. AdaGAN: Boosting generative models. NeurIPS.

[18] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. 2016. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.

[19] Laurens Van Der Maaten. 2013. Barnes-Hut-SNE. arXiv preprint arXiv:1301.3342.

[20] Hong-Jian Xue, Xinyu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen. 2017. Deep matrix factorization models for recommender systems. IJCAI.

[21] Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. 2017. Improved variational autoencoders for text modeling using dilated convolutions. ICML.

[22] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. ICCV.
