NLP for bots, and bots for NLP

Esther Seyffarth
8 min read · Nov 1, 2016


This post is a transcript of a talk I gave at an Open Source and Natural Language Processing workshop in Cambridge on November 10, 2016. My goal was to get the audience excited about twitterbots and the twitterbot community. The talk and this text are about some of the functions of twitterbots, the wonderful people who create bots, and what all of that has to do with the Open Source landscape. My slides are available here.

What’s a bot?

Technically, you could define a twitterbot as a program or agent that generates content and posts it to twitter automatically, following some schedule or reacting to some trigger. But that’s not the whole story! Twitterbots exist in a space that’s otherwise inhabited by humans and organizations, and the way this space is shared between bots and non-bots results in some interesting interactions.

The people who make twitterbots do so for all sorts of reasons. A twitterbot can be a tool for playing games with your friends:

The bot evaluates your replies and announces the winner after 8 minutes. Made by @muffinista.

Or it can be more serious, like a tool for political activism to raise awareness about a certain topic:

A bot’s output can also be actual, beautiful, pure poetry:

The creator of this beautiful piece of art (piece of artist?) is Allison Parrish.

A bot can be a parody of mainstream media:

One of the many wonderful bots made by Nora Reed.

A bot can provide sophisticated political commentary about ongoing events, such as the US presidential debates:

Also made by Nora Reed.

A bot can be designed specifically for self-care reasons:

Made by Sui Sea.

Or you could teach a bot to automatically create instances of corny phrases or old jokes after a specified pattern:

This one is by me.

It’s easy to make a bot — the tools are all there for you to use. I’ve made over 25 twitterbots so far, and each of them has taught me something. Some of the things I learned were about corpora and dictionary resources: how to use the Wordnik dictionary API, how to use WordNet, how to extract specific information from a word’s ConceptNet entry. Some of the things were more technical: how to represent data in various scenarios and for various needs, how to deal with weird Unicode symbols, and again and again, how to create the regular expression that does exactly what I want it to do.
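To give a flavour of the regex side of things, here's a small, purely illustrative example (the text and pattern are made up for this post) of the kind of extraction a wordplay bot often needs:

```python
import re

# Hypothetical example: a bot that riffs on "-ology" words might pull
# candidates out of raw text with a pattern like this.
text = "Her interests span musicology, botany, oology and cryptozoology."
ologies = re.findall(r"\b\w+ology\b", text)
print(ologies)  # ['musicology', 'oology', 'cryptozoology']
```

Each bot's pattern ends up slightly different, which is exactly how the practice accumulates.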

I solved this — can you? I bet you can after making your first 5 bots. Source: The 2013 MIT Mystery Hunt. This version retrieved from

And many of the things I’ve learned were related to linguistics: how do I find out how many syllables a word has, or whether it rhymes with another word? How do I ensure correct case, number and gender agreement in sentences my bot generates?

One of the most delightful things I’ve learned happened with a bot I created recently, Bot in World, where I had to think about whether an arbitrary Adjective Phrase has to go before or after the noun in a noun phrase.

This bot tweet lives in an adjective-phrase-first world.
And this one lives in a world followed-by-an-adjective-phrase.

The original point of this bot was to automate this sort of phrase in order to make you see new ways of dividing things in the world into two groups — for instance, things that lay eggs and things that don’t. When you, a human being, use this sort of phrase, it always seems easy. Automating this in a way that still sounds like language was more of a challenge than I had anticipated — and when I succeeded, I understood the English language a bit better than before.
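One very simple heuristic for the placement question — a sketch of the general idea, not necessarily the rule Bot in World actually uses — is that a bare adjective tends to go before the noun, while an adjective phrase with a complement follows it:

```python
def place_ap(noun: str, ap: str) -> str:
    """Naive placement rule: a single-word adjective phrase is prenominal
    ("a featherless world"), a multi-word one is postnominal
    ("a world covered in feathers"). Real English is messier than this."""
    if len(ap.split()) == 1:
        return f"a {ap} {noun}"
    return f"a {noun} {ap}"

print(place_ap("world", "featherless"))          # a featherless world
print(place_ap("world", "covered in feathers"))  # a world covered in feathers
```

Finding the cases where a rule like this breaks down is where the linguistic fun starts.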

This is also the reason why I love the concept of twitterbots as programming exercises, especially in the context of computational linguistics education. When you’re an undergrad student, it’s helpful to simplify things in your mind. It’s probably the first time in your life that you’re taught to think about language in a structured and scientific way, and it can be a bit overwhelming. Teaching a piece of software to use language in a way that resembles human usage is a good exercise in formalizing our intuitions about how language works. And the programming part itself doesn’t have to be difficult! If you prepare the part of the bot that posts messages to twitter and have your students focus on the generation of tweets, they will be able to create wonderful bots after less than half a year of programming classes.
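In practice, the division of labour in such a class might look like this (all names and texts here are illustrative; the real posting stub would call the twitter API):

```python
import random

# Instructor-provided stub: in a real course this would authenticate
# with twitter and post; here it just prints.
def post_tweet(text: str) -> None:
    print(f"[would tweet] {text}")

# The part the students write: generate one tweet's worth of text.
def generate_tweet() -> str:
    openers = ["Fun fact:", "Did you know?", "Reminder:"]
    facts = ["octopuses have three hearts", "honey never spoils"]
    return f"{random.choice(openers)} {random.choice(facts)}."

post_tweet(generate_tweet())
```

The students only ever touch `generate_tweet`, so all their attention goes into the language generation itself.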

How to be a #botALLY

I got into botmaking when it was slightly more of a niche hobby, and not the hype it has been for the past 12 months. I had had a very good education in programming and in linguistics, but I still wouldn’t have been able to make my own bots if it hadn’t been for the wonderful community that exists online and offline and does its best to welcome new members and help you get started.

The community came into being in 2012, around a bunch of people who discussed “non-human phenomenology” on a mailing list, which inspired some of them to create bots based on ideas from that discussion. The name #botALLY was probably coined in this tweet:

… which makes it sound pretty violent, but luckily that wasn’t the direction the community took from there. Instead, they built more art bots, exchanged ideas, created an IRC channel around that #botALLY theme, and then in 2014, opened a Slack channel.

Now, there are lots of delightful communities in the world, and I’ve been a participant in other, somewhat similar groups before, but I think a lot about what makes this particular one such a great experience. First, I have no doubt that having a Code of Conduct that is actually enforced by a team of moderators is a huge benefit. It leads to a diverse culture where everybody can feel welcome and valuable, and when conflicts arise — as they inevitably do sooner or later when lots of humans are involved — you can count on the moderating team to give you support if you need it, or to help you make amends if you’ve hurt someone else in any way. It’s not a Big Deal; most of the time, it’s as simple as a person using a term that has a racist connotation they’re not aware of (because, say, English isn’t their first language), others calling them out on it, and them going “oh, I didn’t know that, sorry for using that word, and thanks for letting me know”.

Another thing that I think goes a long way towards the #botALLY Slack being a friendly and welcoming place is that we’re not in it for the big AI money. These are artists, poets, philosophers, linguists — we’re not interested in “beating the Turing Test”, we’re interested in making fun little side projects and exchanging ideas about language, social media, human interactions, identity, bot ethics, and all sorts of things that you wouldn’t often find discussed in detail in an Artificial Intelligence textbook. This also means that we are not competition in any way; we help each other with programming questions, data source ideas, questions about linguistic intuitions… because we want to.

And that passion and helpfulness isn’t confined to the Slack channel, either. People regularly publish their libraries that they’ve developed as parts of bots on GitHub. For example, Allison Parrish has written a Python wrapper for the CMU Pronunciation Dictionary, called pronouncingpy, that you can use to find out which words rhyme, how many syllables a word has and where its lexical stress is, and so on.
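pronouncingpy answers those questions from the CMU Pronouncing Dictionary, so its results are dictionary-accurate. Just to illustrate what the syllable question involves, here is a rough, dictionary-free approximation (my own toy heuristic, not part of pronouncingpy) that counts groups of vowel letters:

```python
import re

def rough_syllables(word: str) -> int:
    """Very rough syllable estimate: count groups of vowel letters,
    with a crude adjustment for a trailing silent 'e'.
    A dictionary-based tool like pronouncingpy is far more reliable."""
    w = word.lower()
    count = len(re.findall(r"[aeiouy]+", w))
    if w.endswith("e") and count > 1 and not w.endswith("le"):
        count -= 1
    return max(count, 1)

for w in ["cat", "banana", "rhyme", "computer"]:
    print(w, rough_syllables(w))
```

The heuristic fails on plenty of words, which is precisely why a curated pronunciation dictionary is such a valuable resource for botmakers.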

The Corpora project is a general-purpose, crowd-sourced data-collection effort: a GitHub repository created by Darius Kazemi where you can get (or upload) all sorts of word lists and tidbits that are useful in text generation contexts, ranging from pizza toppings to the names and attributes of every existing Pokémon to dinosaurs. It currently has 100 contributors. Bot allies who take it upon themselves to create a machine-readable list of, say, all the bones in the human body often make the results of their labour available to others afterwards. On the GitHub page for corpora, Darius Kazemi explicitly states that he hopes the collection will be helpful for rapid prototyping of new bots and as a tool in teaching environments.
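The files in the repository are plain JSON — a short description plus one or more lists — so using one in a bot takes just a few lines. Here the data is inlined for illustration (the entries are made up; in practice you'd read a file from the repo):

```python
import json
import random

# A corpora-style file: a description plus a named list of items.
raw = """
{
  "description": "A few pizza toppings.",
  "pizzaToppings": ["mushrooms", "olives", "pineapple", "basil"]
}
"""
data = json.loads(raw)
topping = random.choice(data["pizzaToppings"])
print(f"Tonight's special: {topping} pizza")
```

That low barrier to entry is a big part of why the collection works so well for rapid prototyping.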

Other bot allies don’t work on language-centered tools for bot development, but on libraries that try to stay on top of recent twitter API changes: Jeremy Low, for example, works on the python-twitter package in an effort to make it easy for bot makers to access twitter in all sorts of ways; Bradley Momberger has made his twitter library for JavaScript publicly available; and Colin Mitchell has done the same with his twitter library for Ruby.

But one of everyone’s favourite tools in twitterbot making today is Kate Compton’s Tracery. Tracery is an “author-focused generative text tool” that works by letting you create phrase structure grammars in JSON notation. Kate Compton created Tracery in JavaScript, and it’s also been ported to several other languages, including Python (by Allison Parrish) and Ruby (by Eli Brody). What does “author-focused” mean? It means you can write the grammar, and Tracery will take care of generating random well-formed sequences according to the rules of your grammar. To me, it’s obvious that that’s basically the greatest thing ever: Imagine you’re teaching your first-year students about Phrase Structure Grammars, and you use Tracery to visualize the way your toy grammar works and to generate sentences that are possible with that grammar. I wish everybody knew about this! It’s a great demonstration of both the expressive power and the limits of PSGs.
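The core idea is compact enough to sketch in a few lines. This toy expander is not Tracery itself — the real library (and Allison Parrish's pytracery port) handles modifiers, saved actions and much more — but it conveys how a grammar of `#rule#` references unfolds into text:

```python
import random
import re

# A Tracery-style grammar: rule names map to lists of expansions, and
# "#name#" inside an expansion is replaced by a random choice from that rule.
grammar = {
    "origin": ["The #adj# #noun# #verb#."],
    "adj": ["tiny", "luminous", "forgotten"],
    "noun": ["robot", "library", "comet"],
    "verb": ["sings", "waits", "dreams"],
}

def flatten(symbol: str, grammar: dict) -> str:
    """Toy re-implementation of the core Tracery idea (not the real library):
    recursively replace #rule# references with random expansions."""
    text = random.choice(grammar[symbol])
    return re.sub(r"#(\w+)#", lambda m: flatten(m.group(1), grammar), text)

print(flatten("origin", grammar))
```

Every call produces a different well-formed sentence licensed by the grammar — exactly the behaviour you'd want to demonstrate to a class learning about phrase structure grammars.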

Of course, generating well-formed sequences is only one of the tasks you need to complete before you’ve successfully created a twitterbot. You also have to store the grammar somewhere and schedule your tweets to be sent automatically at a specified interval. Have I got news for you — bot allies have already invented a platform that does that for you too! It’s called CheapBotsDoneQuick, it’s developed and maintained by George Buckenham, and many of the most popular bots on twitter were made using it.

Not only is CBDQ itself open source, but it also provides the option of publishing your bot’s Tracery code on the CBDQ page. This makes it possible to learn from other people’s projects and find out how to use the tricky parts of Tracery — and when you’re excited about a bot, you can go and see exactly how it creates its content.

So, to sum up, here are some reasons why I think you and everyone else should be excited about twitterbots:

  • They help you (and your students) formalize intuitions about human language;
  • They help you practice programming tasks like corpus processing, usage of various APIs, string manipulation with or without regular expressions, and more, depending on the nature of your bot;
  • They help you meet cool people in a cool community built on mutual respect, shared excitement about bots, and interest in all sorts of human and non-human matters;
  • And finally, being excited about twitterbots often leads to involvement in open source projects and sharing the results of your work in a friendly community.

If you’re curious about bots now and have any questions on how to get started beyond what I’ve talked about here, don’t hesitate to ask!



Esther Seyffarth

PhD student in computational linguistics in Düsseldorf, Germany.