Generating Album Artwork from Lyrics

Nick Nieman
14 min read · Dec 19, 2018

By: Nick Amaya, Nick Nieman, Raghav Prakash, Farhan Rahman

ABSTRACT

From the Beatles’ iconic Abbey Road to Kanye West’s My Beautiful Dark Twisted Fantasy, album art has shaped pop culture over the last several decades. While creative professionals traditionally developed these popular, square images, innovative machine learning techniques are paving the way for a new type of art. The rise of GANs, or generative adversarial networks, in 2014 has allowed data scientists to generate organic content with fantastic results. In this project, we applied some of these new models to a dataset of album covers and their lyrics with the goal of generating new cover art for an album based solely on its lyrics. We trained our models on a large dataset of albums and lyrics we scraped from the web. While our results were not as successful as we had hoped, we’ve demonstrated that generating album art strictly from lyrics is not outside the realm of possibility, and appears theoretically feasible with AttnGANs.

INTRODUCTION

GOAL

In this project, we intended to generate convincing album covers given the lyrics of an album. Note that we were not attempting to recreate existing album art, but rather to draw on previous album artwork and create an organic image for the album. Since our models trained on existing lyric/album cover pairs, we also expected to be able to draw connections between lyrics and their respective album art. For example, generated heavy-rock album art may tend to be darker in shade with surreal imagery when compared to generated pop albums.

DATA

Fairly early on in this project, we knew that gathering a substantial dataset would be a challenging task. To summarize our data acquisition process, we began with a list of album and artist names from 2005–2018 scraped from Wikipedia’s “List of 20XX Albums”. We then used these album and artist names and the Spotify API to download album artwork and track names. With the Genius API and album/artist names, we were also able to get each track’s lyrics webpage. Once we had the link, we scraped the relevant Genius pages and saved both the album cover and scraped lyrics to our database.

METHOD/APPROACH

Our approach to this problem was straightforward: we began with a very simple model and added complexity only after successfully testing the previous one. To begin, we implemented a fully connected, one-layer network [1]. After successfully testing this model with the MNIST dataset [2], we modified the GAN to accept both random noise and a word embedding vector. Following another round of successful testing on MNIST, we implemented a DCGAN and trained/tested it on our albums dataset.

PERFORMANCE

Evaluating artificially generated images is a relatively new problem. Since the inception of GANs just four years ago, very few metrics have been developed to measure performance. Among the available evaluation metrics, the inception score (IS) and Fréchet inception distance (FID) are the two most widely used. While the IS is still seen as the industry standard, the FID is becoming more popular, as it addresses some of the issues associated with the IS. We score our models on both metrics, which are described in more detail in the background section.

BUSINESS APPLICATION

As stated earlier, our project attempts to augment the creative design process for album artwork. Whether generated images are directly used for album artwork or simply referenced for inspiration, we believe this could be a useful tool for creative professionals in many different fields. More generically, the idea of creating an image from text can be broadly applied to many industries. The justice system, for example, could utilize a text to image tool for depicting suspected criminals instead of relying on a sketch artist.

BACKGROUND

GANs

Understanding the architecture of a GAN is crucial for understanding the discussion of our implementation detailed in the models section. Be sure you have a solid understanding of the material below before continuing.

At its core, a GAN consists of two neural networks. The first network, called the generator, takes a random noise vector, Z, and generates images. The second network, called the discriminator, takes a sample of real images and a sample of fake images produced by the generator and attempts to classify each image as real or fake. The classification information is sent back to the generator, which is trying to fool the discriminator. The two networks take turns training in an iterative approach, where each iteration or epoch generally improves the accuracy of each network. As a result, the generator is able to produce more and more convincing images. See Figure 1 below for a visual representation of a generic GAN.

Figure 1. Generic GAN Architecture
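
To make this loop concrete, the sketch below shows a minimal generator, discriminator, and one alternating training step. It is written in modern TensorFlow/Keras style rather than the TF 1.x graph code we actually used, and the layer sizes and flattened 28 x 28 image shape are illustrative only.

```python
import tensorflow as tf

# Illustrative sizes: 100-dim noise vector, 28x28 grayscale images (flattened).
NOISE_DIM, IMG_DIM = 100, 28 * 28

# Generator: maps a random noise vector Z to a fake image in [-1, 1].
generator = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(NOISE_DIM,)),
    tf.keras.layers.Dense(IMG_DIM, activation="tanh"),
])

# Discriminator: outputs a logit for "this image is real".
discriminator = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(IMG_DIM,)),
    tf.keras.layers.Dense(1),
])

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(real_images):
    """One adversarial iteration: update the discriminator on real vs. fake
    images, then update the generator to fool the discriminator."""
    z = tf.random.normal([tf.shape(real_images)[0], NOISE_DIM])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_images = generator(z, training=True)
        real_logits = discriminator(real_images, training=True)
        fake_logits = discriminator(fake_images, training=True)
        # Discriminator: label real images 1 and fake images 0.
        d_loss = bce(tf.ones_like(real_logits), real_logits) + \
                 bce(tf.zeros_like(fake_logits), fake_logits)
        # Generator: try to make the discriminator label fakes as real.
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    return d_loss, g_loss
```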

EVALUATION METRICS

Before discussing the Inception Score (IS) and the Fréchet Inception Distance (FID), it’s worth mentioning the role of the objective function. The objective function is used during the GANs training process to evaluate the effectiveness of the generator and discriminator. While useful in training, the objective score fails to communicate any information about the quality or diversity of the generated images.

As mentioned previously, the IS is currently the standard for evaluating the performance of GANs. IS uses two criteria to measure performance: the quality of the generated images and the diversity of the images (i.e., how varied they are).

IS measures the quality of a sample of generated images by running them through an Inception-v3 model pre-trained on ImageNet and recording the conditional label distribution for each image. A conditional distribution concentrated on a single class (low entropy) indicates a realistic image. In other words, IS considers an image realistic if the Inception model confidently assigns it to one class.

In addition to quality, IS measures the diversity of a sample using the marginal label distribution, obtained by averaging those conditional distributions. A marginal distribution with high entropy (i.e., predictions spread across many classes) indicates a diverse sample. The scoring function combines the two criteria using the Kullback-Leibler (KL) divergence between the conditional and marginal distributions: a large divergence corresponds to a high IS, which is good.
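
For concreteness, the score reduces to IS = exp(E_x[KL(p(y|x) || p(y))]), which can be computed directly from the Inception model’s softmax outputs. The sketch below implements that formula (in practice the score is usually averaged over several splits of the sample, which we omit here):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, num_classes) softmax outputs of a pretrained Inception-v3
    for N generated images. Returns exp(mean KL(p(y|x) || p(y)))."""
    p_y = probs.mean(axis=0, keepdims=True)        # marginal distribution p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))                # higher is better
```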

Although IS is currently the industry standard, it has some shortcomings and is steadily losing ground to the FID. The FID improves on IS by directly comparing the statistics of generated samples to those of real samples. It uses an intermediate layer of an Inception-v3 network, pre-trained on ImageNet, to project the sample of generated images and the sample of real images into a lower-dimensional feature space. Once in that space, it computes the Fréchet distance between Gaussians fitted to the two feature distributions as its scoring metric. Consequently, a lower FID corresponds to more similar real and generated samples.
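
Concretely, the FID reduces to a closed-form Fréchet distance between two Gaussians fitted to those intermediate-layer features. A minimal sketch, assuming the feature matrices have already been extracted:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, gen_feats):
    """real_feats, gen_feats: (N, D) Inception-v3 activations for real and
    generated samples. Lower is better."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # numerical error can add tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * covmean))
```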

DATA

Our approach to data for this project was broken into two phases: acquisition and formatting. The following sections go into more detail about how we gathered data through API calls and scraping, removed extraneous information and formatting from the lyric files, and transformed the lyrics into word embeddings.

APPROACH

When initially analyzing our problem and intended goal, we quickly realized our model would need to be trained on a substantial list of lyrics and corresponding album artwork. After experimenting with a pre-assembled collection of album images [3] and a list compiled from Rolling Stone’s 500 Greatest Albums of All Time [4], we realized we needed to go a different route to obtain a larger, more robust dataset.

After researching further, we determined that scraping Wikipedia’s “List of 20XX Albums” would be the best course of action (2005–2018). Note that the Wikipedia page before 2005 formats the album list differently and requires a different scraper. With our scraper, we were able to get approximately 10,000 album and artist names.

Once we compiled a list of album names and artists, we utilized Spotify’s API to extract the album art and corresponding tracks for each album. With this information, we were able to interface with Genius’s API to access links to the lyric webpage of each song. Finally, we created a web scraper that pulls the lyrics from each webpage and stores them in a file. Figure 2 summarizes our data acquisition pipeline.

Figure 2. Final Pipeline for Data Acquisition

API LINKING

Referencing our fetch_data.ipynb file on our GitHub may clarify some of the points made below. To start, we utilized the SpotiPy library to interface with the Spotify API. Once we entered authentication information, we were able to extract album artwork and tracks using Spotify’s search endpoint with the album name and artist. Next, we fed the tracks and artist name into the Genius API, which returned a link to the Genius web page containing the lyrics for each track. Implementation details of the web scraping itself are discussed in the next section.
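
As a rough illustration of the Spotify half of this step, a SpotiPy query might look like the sketch below. The helper name and placeholder credentials are ours, not code from fetch_data.ipynb.

```python
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Placeholder credentials; real keys come from a Spotify developer account.
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id="YOUR_CLIENT_ID", client_secret="YOUR_CLIENT_SECRET"))

def album_art_and_tracks(album, artist):
    """Return (cover art URL, list of track names) for an album/artist pair."""
    results = sp.search(q=f"album:{album} artist:{artist}", type="album", limit=1)
    items = results["albums"]["items"]
    if not items:
        return None, []
    cover_url = items[0]["images"][0]["url"]              # largest cover image
    tracks = sp.album_tracks(items[0]["id"])["items"]
    return cover_url, [t["name"] for t in tracks]
```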

SCRAPING

With the song lyric links from the Genius API, we used the BeautifulSoup library to parse each page and extract the lyrics from the relevant HTML tag. We repeated this scraping process for all the songs in an album and saved them into a folder named after the album.
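
A minimal sketch of that scraping step is shown below. It assumes the 2018-era Genius page layout, where lyrics sat inside a div with class "lyrics"; the function name and that markup assumption are ours, and Genius has since changed its HTML.

```python
import requests
from bs4 import BeautifulSoup

def scrape_lyrics(url):
    """Fetch a Genius lyrics page and return its lyric text (2018-era markup)."""
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    lyrics_div = soup.find("div", class_="lyrics")
    return lyrics_div.get_text(separator="\n").strip() if lyrics_div else ""
```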

As for Wikipedia, our main task involved scraping album and artist names from multiple tables across several web pages (one for each year). Our scraper extracted the “Album” and “Artist” columns from each table on the web page. We did come across some corner cases where information from certain cells couldn’t be extracted. These cells were simply ignored.
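
For illustration, the same table extraction could also be done with pandas.read_html; our actual scraper used BeautifulSoup, so treat the following as an equivalent sketch rather than the code we ran.

```python
import pandas as pd

def scrape_album_list(year):
    """Pull (album, artist) pairs from every table on a 'List of <year> albums'
    Wikipedia page, assuming the 2005-2018 layout described above."""
    url = f"https://en.wikipedia.org/wiki/List_of_{year}_albums"
    pairs = []
    for table in pd.read_html(url):
        if {"Album", "Artist"}.issubset(table.columns):
            # Skip cells that could not be parsed, mirroring the corner cases above.
            subset = table[["Album", "Artist"]].dropna()
            pairs.extend(subset.itertuples(index=False, name=None))
    return pairs
```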

FORMATTING

Once we acquired a lyric text file for each song, we noticed that many of the files contained extraneous information and irrelevant text. For example, most songs contained annotations for verse 1, verse 2, chorus, etc. We generated a script to remove irrelevant text and formatting.
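
The cleanup boils down to stripping bracketed annotations. A minimal sketch of that idea (not our exact format_data.ipynb code) looks like:

```python
import re

def clean_lyrics(text):
    """Strip section annotations like [Verse 1] or [Chorus] and collapse
    the blank lines they leave behind."""
    text = re.sub(r"\[[^\]]*\]", "", text)      # remove [ ... ] annotations
    text = re.sub(r"\n{3,}", "\n\n", text)      # collapse runs of blank lines
    return text.strip()
```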

WORD EMBEDDINGS

Word embeddings play a crucial role in the development of our model since they represent the lyrics our models train on. In short, a word embedding maps a text input to a single vector representation of that text. Each element in the vector characterizes the text input in some way. Since each text input is mapped to a vector of the same length, word embeddings create a standardized text dataset a machine learning model can train on.

Some of the most popular embedding models include bag of words (BOW), Word2Vec, and Doc2Vec. Since both BOW and Word2vec fail to account for the context and relationship between words in a string, we chose to build our embedding model with Doc2Vec.

In short, Doc2Vec creates a fixed-length, numeric representation of a document, regardless of the document’s length. It is built on top of Word2Vec, adding an extra layer to account for the context in which words appear. Our implementation makes use of this model by first building the vocabulary with gensim’s document tagging. Next, we train the model on all of the lyrics and output the inferred vector for each album’s lyrics.
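
A short gensim sketch of this workflow is shown below; the toy corpus, vector size, and epoch count are illustrative placeholders rather than our exact training configuration.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Illustrative stand-in for the real corpus of scraped lyrics (album -> lyrics).
album_lyrics = {
    "Some Album": "dark road midnight rain thunder silence",
    "Another Album": "summer love dancing lights forever young",
}

# Tag each album's lyrics so Doc2Vec learns one vector per album.
documents = [TaggedDocument(words=lyrics.lower().split(), tags=[album])
             for album, lyrics in album_lyrics.items()]

model = Doc2Vec(vector_size=300, min_count=1, epochs=40)
model.build_vocab(documents)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

# Fixed-length embedding for one album's lyrics, used as the GAN's text input.
embedding = model.infer_vector(album_lyrics["Some Album"].lower().split())
```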

LEARNING/MODELING

Our goal was to generate an album cover given its lyrics. We understood that this meant we needed a generative model that can generate samples from the joint distribution of album covers and lyrics. We researched a couple of models, such as hidden Markov models and variational autoencoders, but ultimately landed on generative adversarial networks (GANs) due to the substantial amount of research and proven results behind them.

GENERATIVE ADVERSARIAL NETWORK (GAN)

We began by implementing a GAN on the MNIST handwritten digits dataset with a fully-connected, one-layer generator network and a similarly structured discriminator network. We implemented the model in TensorFlow and were successfully able to generate digits after about 10,000 epochs.

We began to port this model to our dataset by replacing the MNIST images with our album covers and the Z noise vector with our text embeddings; however, we stopped when we realized that the generator would have no way to generate different images for a given text embedding. For example, if two people input “the black, weathered road” into the model, the generated images would be identical, since there was no noise in the input to create variation.

CONDITIONAL GENERATIVE ADVERSARIAL NETWORK

After some research, we came across the conditional GAN, which models the conditional probability P(image | label), taking a noise vector Z, a label, and a real image as inputs. The architecture is shown in Figure 3.

Figure 3. Conditional GAN Architecture

Essentially, the model concatenates the inputs of the generator and discriminator with the label before feeding them into their respective models.

Mapping our text embeddings and images onto the conditional GAN, we were able to model P(album cover | text embedding). We added this new structure to our TensorFlow graph, ported the model over to our dataset of album covers and lyrics, and trained it for 10,000 epochs.
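
The wiring is easiest to see in code. Below is a minimal, fully connected sketch of the conditioning idea in Keras; the layer widths, embedding size, and flattened image size are illustrative, not our exact architecture.

```python
import tensorflow as tf

NOISE_DIM, EMBED_DIM, IMG_DIM = 100, 300, 256 * 256 * 3   # illustrative sizes

def build_conditional_generator():
    z = tf.keras.Input(shape=(NOISE_DIM,))
    text = tf.keras.Input(shape=(EMBED_DIM,))
    # Conditioning: concatenate the noise vector with the lyric embedding.
    h = tf.keras.layers.Concatenate()([z, text])
    h = tf.keras.layers.Dense(512, activation="relu")(h)
    img = tf.keras.layers.Dense(IMG_DIM, activation="tanh")(h)
    return tf.keras.Model([z, text], img)

def build_conditional_discriminator():
    img = tf.keras.Input(shape=(IMG_DIM,))
    text = tf.keras.Input(shape=(EMBED_DIM,))
    # The discriminator also sees the embedding, so it judges
    # (album cover, lyrics) pairs rather than covers alone.
    h = tf.keras.layers.Concatenate()([img, text])
    h = tf.keras.layers.Dense(512, activation="relu")(h)
    logit = tf.keras.layers.Dense(1)(h)
    return tf.keras.Model([img, text], logit)
```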

When training the model, we realized it was making insignificant progress. After some debugging, we discovered that the current network architecture was insufficient for our data. Our album covers were 256 x 256, which meant they needed 65,536 input nodes to maintain fully connected layers. As a result, the fully connected network architecture was taking too long to train. Reducing the size of our images would lose too much information, so we researched variations of our network structure capable of scaling to a larger input.

DEEP CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORK

Eventually, we landed on DCGANs, or deep convolutional generative adversarial networks (shown in Figure 4). These networks replace the fully connected, one-layer architectures of the generator and discriminator with a deconvolutional and a convolutional network, respectively. The advantage comes from the local connectivity built into convolutional layers.

Figure 4. DC-GAN Architecture

Local connectivity means a hidden node connects only to the closest N input nodes, where N << (number of input nodes). This allows the number of input nodes to grow without significantly affecting the overall computational complexity of the network. It also makes hidden nodes more localized: each hidden node only learns the features of the closest N input nodes, greatly reducing its dependence on other nodes and improving learning.
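
Put together, a conditional DCGAN generator and discriminator for 256 x 256 covers might look like the sketch below; the filter counts and kernel sizes are illustrative, not the exact configuration we trained.

```python
import tensorflow as tf

def build_dcgan_generator(noise_dim=100, embed_dim=300):
    """Deconvolutional generator: upsamples a (noise + lyric embedding)
    vector into a 256x256 RGB cover. Filter counts are illustrative."""
    z = tf.keras.Input(shape=(noise_dim,))
    text = tf.keras.Input(shape=(embed_dim,))
    h = tf.keras.layers.Concatenate()([z, text])
    h = tf.keras.layers.Dense(16 * 16 * 256, activation="relu")(h)
    h = tf.keras.layers.Reshape((16, 16, 256))(h)
    for filters in (128, 64, 32, 16):          # 16 -> 256 pixels in four doublings
        h = tf.keras.layers.Conv2DTranspose(filters, 5, strides=2,
                                            padding="same", activation="relu")(h)
    img = tf.keras.layers.Conv2D(3, 5, padding="same", activation="tanh")(h)
    return tf.keras.Model([z, text], img)

def build_dcgan_discriminator(embed_dim=300):
    """Convolutional discriminator: each filter sees only a small local patch
    of the cover, which is the locality property described above."""
    img = tf.keras.Input(shape=(256, 256, 3))
    text = tf.keras.Input(shape=(embed_dim,))
    h = img
    for filters in (16, 32, 64, 128):
        h = tf.keras.layers.Conv2D(filters, 5, strides=2, padding="same")(h)
        h = tf.keras.layers.LeakyReLU(0.2)(h)
    h = tf.keras.layers.Flatten()(h)
    h = tf.keras.layers.Concatenate()([h, text])
    logit = tf.keras.layers.Dense(1)(h)
    return tf.keras.Model([img, text], logit)
```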

This network was much more challenging to implement. A number of issues made training difficult, the most prominent being the infamous exploding gradient. After as few as 3 epochs of training, the gradient of our loss approached infinity.

We implemented a few techniques to try to fix this problem. First, we standardized our album cover images and text embeddings. Second, we added a regularization term to the loss, which penalized large weights. Third, when neither of the first two techniques worked, we clipped the gradients of our models to always be between -1 and 1. After applying all of these techniques, we were able to train our model.
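
All three fixes are small in code. The helpers below sketch them; the regularization weight is a placeholder, the clipping range matches the [-1, 1] clip described above, and they are meant to slot into a training step like the one sketched earlier.

```python
import tensorflow as tf

def standardize(images):
    """Scale uint8 covers from [0, 255] into [-1, 1] to match the tanh output."""
    return tf.cast(images, tf.float32) / 127.5 - 1.0

def l2_penalty(variables, weight=1e-4):
    """Regularization term added to the loss to discourage large weights."""
    return weight * tf.add_n([tf.nn.l2_loss(v) for v in variables])

def apply_clipped_gradients(optimizer, tape, loss, variables, clip=1.0):
    """Clip each gradient element to [-clip, clip] before the optimizer step,
    the last technique we added to stop the gradients from exploding."""
    grads = tape.gradient(loss, variables)
    grads = [tf.clip_by_value(g, -clip, clip) for g in grads]
    optimizer.apply_gradients(zip(grads, variables))
```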

Figure 5 shows a sample of our results. While they are not as clear as we would like, there are some clear improvements from the 0th epoch to the 800th epoch. The colors, for example, are less jittery and more solid, which reflects the way real images look.

Figure 5. DCGANs Image Generation Results Over Epochs

RESULTS

We evaluated our model using the IS and FID scores introduced above. Figures 6 and 7 show results from our baseline model when trained on the MNIST dataset. The IS consistently increases with the epoch count, and the FID score consistently decreases, both of which indicate the model is improving. Thus we can conclude that our baseline GAN is generating more realistic images over time.

Figure 6. Baseline Model Inception Score (MNIST)
Figure 7. Baseline Model FID Score (MNIST)

On the other hand, our DCGAN model’s results did not align with our expectations. In Figures 8 and 9, we see that the IS and FID scores fluctuate over epochs. Normally, we would conclude that the model fails to generate realistic images. In our case, we suspect the model is capable of generating realistic images, but fell short for a few reasons. First, we were only able to evaluate on a limited sample (25 images) due to memory constraints. Second, computational constraints restricted the training dataset to 600 albums, which is fairly small. Last, time constraints limited our training to 800 epochs, which is likely not enough to demonstrate noticeable improvement.

Figure 8. DCGAN Model Inception Score (Albums Dataset)
Figure 9. DCGAN Model FID Score (Albums Dataset)

FUTURE WORK

Although the quality of our results was not as strong as we would have hoped, we are in the process of making improvements by introducing a new state-of-the-art network structure that learns to associate word embeddings with specific areas of an image, producing a much sharper generated image. This model is known as an AttnGAN.

ATTENTIONAL GENERATIVE ADVERSARIAL NETWORK

AttnGANs, otherwise known as attentional generative adversarial networks, generate detailed images from given text fragments by introducing an architecture that stacks multiple DCGANs. The architecture is shown in Figure 10.

Figure 10. AttnGAN Architecture

At a high level, the AttnGAN model works as follows: the first DCGAN, denoted F_0 in Figure 10, takes the global text embedding and generates the overall structure of an image. For example, given “a red bird sitting on a branch”, it would generate a low-resolution image of a bird sitting on a branch. The subsequent DCGANs, F_1, F_2, and so on, then upsample the output of F_0, adding more detail with each pass. The resulting output is a much sharper and clearer image.

CONCLUSION

While it wasn’t apparent to us at the time, our initial goal of generating album art from lyrics was fairly ambitious. With our limited knowledge of GANs, we spent much of our time familiarizing ourselves with the GAN architecture, TensorFlow, and PyTorch. Although somewhat underwhelming, the images generated by our DCGAN hold promise. With a significantly larger training set and thousands of epochs (instead of hundreds), we expect the results from the DCGAN to improve significantly. Along the same lines, we will also continue working on porting our dataset to the AttnGAN library.

On the other hand, the data acquisition and processing aspect of our project went well. Our fetch_data.ipynb and format_data.ipynb scripts enable a streamlined workflow to download images and lyrics and transform them into word embeddings. We’ve made our dataset publicly available in our GitHub repository to enable further research into album- and lyric-related machine learning projects.

PROJECT LINKS

REFERENCES

[1] Carey, Owen. “Generative Adversarial Networks (GANs) — A Beginner’s Guide.” Towards Data Science, 13 Sept. 2018, https://towardsdatascience.com/generative-adversarial-networks-gans-a-beginners-guide-5b38eceece24.

[2] MNIST Handwritten Digit Database, Yann LeCun, Corinna Cortes and Chris Burges. http://yann.lecun.com/exdb/mnist/. Accessed 18 Dec. 2018.

[3] “One Million Audio Cover Images for Research.” Internet Archive, https://archive.org/details/audio-covers.

[4] “500 Greatest Albums of All Time.” Rolling Stone, 31 May 2012, https://www.rollingstone.com/music/music-lists/500-greatest-albums-of-all-time-156826/.

[5] Hui, Jonathan. “GAN — How to Measure GAN Performance?” Jonathan Hui, 18 June 2018, https://medium.com/@jonathan_hui/gan-how-to-measure-gan-performance-64b988c47732.

[6] Le, Quoc V., and Tomas Mikolov. “Distributed Representations of Sentences and Documents.” ArXiv:1405.4053 [Cs], May 2014. arXiv.org, http://arxiv.org/abs/1405.4053.

[7] Mirza, Mehdi, and Simon Osindero. “Conditional Generative Adversarial Nets.” ArXiv:1411.1784 [Cs, Stat], Nov. 2014. arXiv.org, http://arxiv.org/abs/1411.1784.

[8] Zeiler, Matthew D., and Rob Fergus. “Visualizing and Understanding Convolutional Networks.” ArXiv:1311.2901 [Cs], Nov. 2013. arXiv.org, http://arxiv.org/abs/1311.2901.

[9] Reed, Scott, et al. “Generative Adversarial Text to Image Synthesis.” ArXiv:1605.05396 [Cs], May 2016. arXiv.org, http://arxiv.org/abs/1605.05396.

[10] Xu, Tao, et al. “AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks.” ArXiv:1711.10485 [Cs], Nov. 2017. arXiv.org, http://arxiv.org/abs/1711.10485.
