My Neural Net Will Name Your Next Podcast

Naming your podcast is hard. Do you go descriptive? Shoot for a pun? Something obscure that begs an explanation? After finding… some success naming law firms based on Law360’s list of the 500 largest law firms, I thought it might be interesting to see if I could train a neural network to help me name a podcast we’re working on at the Internet Law and Policy Foundry. After a few weeks, a fair amount of frustration, and hours and hours of copying and pasting to collect more than 300,000 names from iTunes’ web directory, I generated thousands of unique, hilarious, bizarre podcast names for you to choose from.

You can read through the whole process below or, if you’d like, skip right to the Results at the bottom of the page. If you’d like to download the database I trained the network on, or any of the resulting lists I generated, check out the GitHub page below.

Getting the Data

Okay, a refresher: a neural net is only as smart as the data you train it on. In my previous post, I trained the neural net using only 500 names, which is why there’s so much randomness in those results. To start off this project I knew that I needed a long list of podcasts to train my neural net, or the results would be mostly noise.

This actually turned out to be a tougher problem than I expected. My initial search turned up Stitcher’s (and iTunes’) Top 100 Podcasts, and Podcast Chart’s list of the top 200 podcasts, but I couldn’t for the life of me figure out how to get iTunes’ entire podcast directory in a usable format. I trained the neural net on the top 200 names, and got some predictably poor results. You can see the problem the neural net had with the training data: with such a small list to train from, there are lots of nonsense words, misspellings, and repeats (can you tell how popular Game of Thrones is right now?).

Some of the initial results, after training using the Podcast Chart’s Top 200 list.

The Breakthrough and The Slog

What was most frustrating was that I knew that there were hundreds of thousands of podcasts on iTunes’ directory, but I couldn’t figure out a way to actually make that data useful. Copying and pasting from inside iTunes was impossible, and there didn’t appear to be any publicly accessible web version (and certainly no easily usable database) of their directory. Apple doesn’t make it easy, but I finally found the web version of the iTunes Podcast directory, organized by Topic, Letter, and Page Number.

So.. many.. pages. Fun fact: Did you know that out of all the categories listed, by far the most popular category (by sheer quantity) is Religion & Spirituality, with Music following far behind in second?

The good news was that I had finally found a source with 1000 times the number of podcasts to train my network. The bad news was that I had 1000 times the number of podcast names to add to my own database, and no easy way of scraping that data (note: I’m sure there’s an easy way of doing this. If you know how I should have been scraping these names, please tell me: in the comments, on Twitter, by email, anything works).

Not knowing any alternative, I gritted my teeth, set up a profile for my mouse to make copying and pasting and jumping around a spreadsheet moderately easier, and began the long, arduous task of painstakingly copying a page’s listings, pasting them into a spreadsheet, navigating to the next page, rinsing and repeating. For a few hours each day, for the next week or so, I would put on a podcast and slowly build up a usable database of iTunes’ podcast directory. If you’d like to use that data for your own project (please do! I shudder to think that all of that time is going to be wasted on this lark of a project…) you can download the .xlsx from GitHub here, and check out the rest of the files below.
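For anyone facing the same slog, the copy-and-paste step can probably be scripted. What follows is only a minimal sketch: it assumes each directory page lists its podcasts as ordinary links whose href contains "/podcast/", which is a guess about the markup and not something checked against Apple's pages. The class name, the function name, and the sample HTML are all made up for illustration. Fetching the pages themselves is left out on purpose.

```python
from html.parser import HTMLParser


class PodcastLinkParser(HTMLParser):
    """Collect the text of every link whose href contains a marker substring."""

    def __init__(self, href_marker="/podcast/"):
        super().__init__()
        # Assumption: listing links share this path fragment.
        self.href_marker = href_marker
        self.names = []
        self._in_listing_link = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if self.href_marker in href:
                self._in_listing_link = True
                self._buffer = []

    def handle_data(self, data):
        if self._in_listing_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_listing_link:
            name = "".join(self._buffer).strip()
            if name:
                self.names.append(name)
            self._in_listing_link = False


def extract_podcast_names(html_text, href_marker="/podcast/"):
    """Return the link text of every listing link found in one page of HTML."""
    parser = PodcastLinkParser(href_marker)
    parser.feed(html_text)
    parser.close()
    return parser.names


# Made-up page fragment standing in for one directory page.
sample_page = """
<ul>
  <li><a href="https://example.com/podcast/first-show/id1">First Show</a></li>
  <li><a href="https://example.com/podcast/second-show/id2">Second &amp; Third Show</a></li>
  <li><a href="https://example.com/genre/arts">Arts</a></li>
</ul>
"""
print(extract_podcast_names(sample_page))
```

Running the extractor over every topic, letter, and page and appending the results to one text file would replace the spreadsheet step, provided the real markup matches the assumption above.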

The Methodology

Generally, the methodology I followed to actually train and sample the neural net was identical to my previous post, following Jeffrey Thompson’s excellent tutorial for torch-rnn, but there were a few wrenches thrown into the process that bear some explanation.

First, and most annoyingly, I had to clean up the data before training the neural net. From what I could tell, torch-rnn has trouble training on anything that isn’t encoded in UTF-8 (or at least, that’s what I could gather based on the errors that kept getting spit out in the terminal). If you run into similar trouble with your dataset, I found the iconv utility useful: the command “iconv -c -f utf-8 -t ascii [input.txt] > [output.txt]” converts the file to plain ASCII, and the -c flag drops any characters that can’t be converted.
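If iconv isn't available, roughly the same cleanup can be done in a few lines of Python. This is a sketch of an equivalent, not the command I actually ran: the function name and the sample bytes are made up for illustration. Like the -c flag, it silently drops anything it can't represent instead of transliterating it, so "Café" becomes "Caf".

```python
def strip_non_ascii(raw_bytes):
    """Drop invalid UTF-8 byte sequences, then drop every non-ASCII character."""
    text = raw_bytes.decode("utf-8", errors="ignore")
    return text.encode("ascii", errors="ignore").decode("ascii")


# Made-up sample: an accented letter, an emoji, and one stray invalid byte.
raw = "Café Talk\nThe Daily ☕ Show\n".encode("utf-8") + b"\xff"
cleaned = strip_non_ascii(raw)
print(cleaned)
```

For a real file, read it in binary mode (open(path, "rb")), pass the bytes through the function, and write the result back out as text.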


With the data cleaned up, I had some success training the network, but sampling from it created lists like this:

Of course, what I realized is that the dataset I had trained the network on was organized alphabetically, so the network learned that the results it generated should also be grouped alphabetically (and by topic; you’ll notice that almost all of the generated names above are in the “Arts” category). This bit of insight into the process made me realize that if I wanted a good mix of interesting names in my samples I would also need to randomize the dataset before training the network. Adding a column in the spreadsheet, filling it down with RAND() and sorting by that column solved the problem.
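The same randomization can be done outside a spreadsheet. Here is a minimal sketch, assuming the names are already in a list with one podcast per entry; the function name and the sample names are illustrative only.

```python
import random


def shuffle_names(names, seed=None):
    """Return a shuffled copy of the names, leaving the input list untouched."""
    rng = random.Random(seed)
    shuffled = list(names)
    rng.shuffle(shuffled)
    return shuffled


# Made-up, alphabetically sorted sample standing in for the real dataset.
alphabetical = ["Art Talk", "Arts Hour", "Bear Talk", "Book Club", "Comedy Cast"]
mixed = shuffle_names(alphabetical, seed=42)
print("\n".join(mixed))
```

Passing a seed makes the shuffle repeatable, which helps if you want to retrain later on exactly the same ordering.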

Finally, after running the preprocess script on the resulting txt file to generate a .h5 and .json to train on, I ran the training script, walked away for a few hours and came back with a trained neural network ready to generate some results!
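To give a feel for what a preprocessing step like this does, here is a simplified illustration of character-level encoding: every distinct character gets an integer id, and the text becomes a list of those ids. This is not torch-rnn's actual script, and the function names and the starting index of 1 are assumptions made for the sketch.

```python
def build_vocab(text):
    """Map each distinct character to an integer id, starting at 1."""
    chars = sorted(set(text))
    token_to_idx = {ch: i + 1 for i, ch in enumerate(chars)}
    idx_to_token = {i: ch for ch, i in token_to_idx.items()}
    return token_to_idx, idx_to_token


def encode(text, token_to_idx):
    """Turn text into a list of integer ids, one per character."""
    return [token_to_idx[ch] for ch in text]


def decode(ids, idx_to_token):
    """Turn a list of integer ids back into text."""
    return "".join(idx_to_token[i] for i in ids)


# Tiny made-up corpus: one podcast name per line.
corpus = "Bear Talk\nThat Hour Show\n"
token_to_idx, idx_to_token = build_vocab(corpus)
encoded = encode(corpus, token_to_idx)
print(len(token_to_idx), "distinct characters")
print(decode(encoded, idx_to_token) == corpus)
```

Because the network only ever sees characters, never whole words, it can invent names that appear nowhere in the training data, misspellings included.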

The Results

Now for the fun part. As I covered in the previous post, sampling results from the neural network can be tweaked according to a number of variables, with “temperature” determining how novel or creative the results are. Low temperature means that the network is essentially not taking many risks, trying to generate data that closely matches the dataset it trained on. Conversely, high-temperature output is more likely to be gibberish, or at least very different from the training set.
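One common way temperature is implemented is to divide the network's raw scores by the temperature before turning them into probabilities. The sketch below shows that idea in isolation; the scores are hypothetical, the function names are mine, and I have not checked that torch-rnn does it in exactly this way.

```python
import math
import random


def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(scores, temperature, rng=random):
    """Pick one index at random, weighted by the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


# Hypothetical scores for three candidate next characters.
scores = [2.0, 1.0, 0.1]
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(scores, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature nearly all of the probability lands on the top-scoring character, so the output stays close to the training data. At high temperature the probabilities even out and unlikely characters get picked more often.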

I wanted to generate two groups of results with this project: one batch would be a low/medium/high temperature sampling of all podcasts, another would be a selective sampling of only the Technology, Government and Society categories, since those are the topics we’ll be covering in our show. The results are far too numerous to paste in Medium, but you can find text files of each on the GitHub page. I also pasted a sample of the results below, with some of my favorites highlighted.

Low Temperature — All Podcasts

Lots of Star Trek and Star Wars podcasts here, for some reason.

Medium Temperature — All Podcasts

*Glick in the Services is SO GOOD. I actually think this list generated the most useful, interesting, unique names out of all of the results. Super Shakespeare, Secret Britter Network Podcast, Learning of the Millennial are all great.

High Temperature — All Podcasts

I especially like “Fail / Comedy” and “Wizard Warious,” personally.

Low Temperature — Tech, Government and Society

I would absolutely listen to Bear Talk, Free the Podcast, Super The Content, and That Hour Show.

Medium Temperature — Tech, Government and Society

The Truth Sermons is a Good Podcast Name. All Law Podcast and What Talk Talk too!

High Temperature — Tech, Government and Society

Right now I’m leaning towards “RT. . .” and “Stories No!” as my next podcast name! What do you think? Man on Grid Future is also [fire emoji].

So, what now?

Yeah, that was my response too, after realizing that I had spent the better part of a week and a half on this project. Vacation and free time are a hell of a drug. Well, I think there are some takeaways here that are interesting.

I think the real nugget here is how important it is to have robust training data for your neural net. The difference between the results from the scraped iTunes names and those from the paltry Top 200 lists is remarkable. With the latter, all I really got was gobbledygook and variations of “Game of Thrones Cast.” With the full list of 300,000 names, I got some genuinely unique, interesting-sounding podcasts! Some of my favorites include “*Glick in the Services,” “RT…” and “Stories No!”, but there are literally thousands and thousands more.

Another takeaway from this experiment is that I think neural networks can be useful tools for brainstorming. On the GitHub page you can find massive lists generated from the dataset that can be fun to poke around in. I recommend ctrl+f’ing a word and seeing what interesting names you can find in the list. Who knows, you might find your next podcast, and you will likely find some of your current favorites as well (“This Week in Law,” the podcast I cohost occasionally on the network, shows up on that list, for example!).
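If ctrl+f gets tedious, the same poking around can be scripted. The sketch below searches a list of generated names for a word, and also separates the names the network invented from the ones it simply reproduced from its training data. The function names and the two short lists are made-up examples, not the real files.

```python
def find_names(names, word):
    """Return the names that contain the word, ignoring case."""
    needle = word.lower()
    return [n for n in names if needle in n.lower()]


def split_novel(generated, training):
    """Split generated names into new ones and ones copied from the training set."""
    seen = {t.strip().lower() for t in training}
    novel = [g for g in generated if g.strip().lower() not in seen]
    copied = [g for g in generated if g.strip().lower() in seen]
    return novel, copied


# Toy data for illustration only.
training_names = ["This Week in Law", "Game of Thrones Cast"]
generated_names = ["Bear Talk", "This Week in Law", "All Law Podcast", "Free the Podcast"]

print(find_names(generated_names, "law"))
novel, copied = split_novel(generated_names, training_names)
print(novel)
print(copied)
```

Filtering out the copied names is a quick way to make sure the name you fall in love with isn't already somebody else's show.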

If you like this kind of project, or if you’re interested in playing with neural networks yourself, I definitely recommend giving it a go. It’s not nearly as difficult as you may think, and it can produce some really interesting results. Again, I definitely recommend you check out Jeffrey Thompson’s tutorial. Let me know what your favorite generated show is below!