When I last posted some six or seven months ago I was just taking my first steps with Python, having moved on from shell scripting in bash. Since those first attempts I’ve found that I really enjoy using Python to achieve things I’d never have thought possible a year ago.
Towards the end of last year, after completing an overhaul of some scripts I was using for work, I decided to cast the net a bit wider for my next project. I wanted something that seemed ambitious but also achievable, and I settled on the idea of creating a Twitter bot.
I had read somewhere about Markov chains. Very basically: given a sample text to work from, the last one or two words of the sentence being constructed determine which words can come next, and one of those candidates is picked at random. It’s similar to the idea behind your phone’s predictive text software, and it seemed like a good way of generating content for the bot to post.
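To make the idea concrete, here is a toy sketch of a word-level Markov chain in plain Python. It is only an illustration of the general technique under my own assumptions: the function names `build_chain` and `generate` are made up, the two-word state size is just one common choice, and the sample headlines are placeholders rather than real input data.

```python
import random
from collections import defaultdict


def build_chain(headlines, state_size=2):
    """Map each run of `state_size` words to the words seen following it."""
    chain = defaultdict(list)
    for headline in headlines:
        # Pad with None markers so the chain knows where headlines start and end.
        words = [None] * state_size + headline.split() + [None]
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            chain[state].append(words[i + state_size])
    return chain


def generate(chain, state_size=2, rng=random):
    """Walk the chain from the start state until an end marker is drawn."""
    state = (None,) * state_size
    output = []
    while True:
        next_word = rng.choice(chain[state])
        if next_word is None:
            return " ".join(output)
        output.append(next_word)
        # Slide the window: drop the oldest word, add the one just chosen.
        state = state[1:] + (next_word,)


if __name__ == "__main__":
    # Placeholder headlines, invented purely for this example.
    sample = [
        "New phone can predict the weather",
        "New phone can charge in five minutes",
        "Robots can predict the future of work",
    ]
    print(generate(build_chain(sample)))
```

With so little input the output will mostly repeat the originals; the interesting mash-ups only appear once many headlines share word pairs.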
Using a Markov chain meant I would also need to gather input text for it to work on, so I added web-scraping to my to-learn list, which also featured:
- Interacting with APIs
- Configuring Twitter apps
- Writing modular code
- Using version control (for which I chose git)
- Working in a virtual environment
So, plenty to get my teeth into!
Step 1: Gather input text
I wanted my bot to generate fake tech news headlines. Suitable input texts for this were readily available from reddit in the form of post titles from tech-related subreddits. To collect these, I installed PRAW and registered my code as a reddit app.
Pulling the headlines from each subreddit is pretty simple. The JSON response also includes a large amount of metadata, most of which I don’t use! I do, however, make use of each post’s ID to ensure I only collect it once. Post IDs are stored in a log file, and the post titles are written to text files for later use.
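The de-duplication step can be sketched roughly like this. It is not my actual code: the function name `save_new_titles` and the one-ID-per-line log format are assumptions for illustration. It takes any iterable of `(post_id, title)` pairs, so it runs without a reddit connection; with PRAW those pairs could come from something like `(post.id, post.title)` for each post in `reddit.subreddit(name).hot(limit=...)`.

```python
from pathlib import Path


def save_new_titles(posts, log_path, titles_path):
    """Append unseen post titles to titles_path and log their IDs.

    `posts` is any iterable of (post_id, title) pairs.
    Returns the number of titles that were newly written.
    """
    log_path, titles_path = Path(log_path), Path(titles_path)
    # Assumed log format: one post ID per line.
    if log_path.exists():
        seen = set(log_path.read_text(encoding="utf-8").split())
    else:
        seen = set()
    added = 0
    with log_path.open("a", encoding="utf-8") as log, \
            titles_path.open("a", encoding="utf-8") as titles:
        for post_id, title in posts:
            if post_id in seen:
                continue  # already collected on an earlier run
            seen.add(post_id)
            log.write(post_id + "\n")
            titles.write(title.strip() + "\n")
            added += 1
    return added
```

Running it twice over overlapping batches only writes each title once, which is the whole point of keeping the ID log.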
It became clear quite early on that this collection of files could quickly become unwieldy. At the same time I decided that I wanted the bot’s output to be based on a rolling time window of headlines from reddit in order to seem vaguely topical. To achieve this, and to keep the file growth under control, I wrote a function that checks the text files’ ages against a given age limit and removes any that are too old. As a side effect of this, my learning list gained another entry: working with config files. I chose YAML for this, parsed by PyYAML.
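A minimal sketch of that clean-up function might look like the following. The name `remove_old_files`, the `.txt` extension, measuring age by modification time, and expressing the limit in days are all assumptions made for the example. Here the limit is passed in as an argument; in practice it would be read from the config file.

```python
import time
from pathlib import Path


def remove_old_files(directory, max_age_days, now=None):
    """Delete .txt files in `directory` last modified over max_age_days ago.

    Returns the names of the files that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds in a day
    removed = []
    for path in sorted(Path(directory).glob("*.txt")):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed
```

Calling something like `remove_old_files("headlines", max_age_days=7)` before each run would keep only the last week of input (the directory name and the seven-day window are placeholders).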
The config file came in very handy when it came to working with git. I wanted to upload my code to GitHub, but didn’t want to expose sensitive data such as API keys, passwords etc. All I had to do was store these away in my config file, which I added to my .gitignore file to avoid uploading it. Then I could safely upload the rest of my code to GitHub without having to worry about accidentally sharing anything I didn’t want to.
Step 2: Generate output text
Now that I had my base text, it was time to create some sample sentences. I found the Markovify library, which fitted my needs exactly. Using that, I was very quickly able to produce output that looked encouraging:
Twitter says it will ban diesel vehicles by 2025.
AI learns to predict the future of government?
Step 3: Send output to Twitter
Again, this proved to be simpler than I had anticipated. All I had to do was pip install a suitable package to work with the Twitter API. I chose Twython but there are plenty of others out there. I believe Tweepy is well thought of too, for example.
It wasn’t long before I had registered a Twitter account for the bot, configured app access, and sent out some tweets.
And now that my bot had Twitter access it meant I could also pull material from there to use in creating tweets, including photos.
I had anticipated that as the corpus of input texts grew the quality of the bot’s output would improve. This did happen to an extent. Some of the ‘stories’ it generated seemed distinctly plausible.
However, these serendipitous successes were sprinkled quite sparingly among a lot of… well, rubbish, to be quite frank. And that’s still the case now.
I am very happy to have completed what I set out to do, and feel that I learned a lot along the way, but there remain many ways in which the project can be improved. I completed this project in December 2016, but have continued coding and learning since then. Looking back now, even such a short time afterwards, I can see the following limitations:
- The code itself could be better. My functions are too large and do too much.
- There are no tests!
- Lots of the output is still garbage. The proportion of good-quality output needs to increase.
- The bot still lives on the command line of my home computer. It’s only active when I start it up manually.
I am working on addressing these and hope to have another post to share soon about how I overcame them, and how the bot is better as a result.
And finally, I will leave you with a few more of my favourite lines from the bot’s early testing:
Panasonic to launch megaphone that can help those dealing with anxiety!
The ‘just walk out technology’ of Amazon Go store bills everything to your posterior!
Google DeepMind could invent the next four years.
Mobile shopping, now in the distant future!
The world’s most expensive headphones might make you smarter at work — reason for termination without notice?
Yes Please! Richard Branson unveils plans to help refine systems that control autonomous planes.
Autonomous quadcopters may be caused by gut bacteria.
Climate change could drive a semi-autonomous car on public roads.
Aging is a DDoS attack?