AlphaGo Zero and Thoughts on Data
A new breakthrough from DeepMind has been in the news for a few days now, and many podcasts have issued episodes with great discussions of the new algorithm and what it means for the progress of Artificial Intelligence. The contents and mood of the episodes vary from excited expectations of AGI (Artificial General Intelligence) in the very near future as a direct result of the new algorithm to healthy scepticism saying it is just another small step forward. In the links below you can find the original blog post from the developers and two episodes featuring discussions of the new algorithm, from a16z and Wired UK. Personally, as much as I like the Wired UK podcast, I agree more with the thoughts and ideas expressed on a16z. But both are totally worth a listen, and I will not repeat all the clever and reasonable things that were said there and in other podcasts. Instead, I want to focus on one aspect, namely, training input.
I like the fact that, compared to previous approaches, next to no input data was needed to train this algorithm. Yes, the program knew the rules of Go and knew when it was losing or winning, but in a way it is a classical (although no doubt very clever) reinforcement learning approach to a very complex problem: the program plays against itself, and the win/loss signal is essentially its only teacher. So yeah, suddenly massive amounts of data were no longer necessary?
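To make that idea concrete, here is a minimal sketch of learning from self-play with nothing but the rules and a win/loss reward. This is tabular Q-learning on a tiny Nim-like game standing in for Go — everything here (the game, the hyperparameters) is invented for illustration and has nothing to do with DeepMind's actual architecture — but the point survives: no example games are fed in, yet the agent discovers the winning strategy on its own.

```python
import random

N = 5                  # toy game: 5 stones, take 1 or 2 per turn, whoever takes the last stone wins
ACTIONS = (1, 2)
# Q[s][a] = estimated value of taking a stones when s remain, from the mover's perspective
Q = {s: {a: 0.0 for a in ACTIONS if a <= s} for s in range(1, N + 1)}
alpha, eps = 0.5, 0.2  # learning rate and exploration rate (arbitrary toy values)

random.seed(0)
for _ in range(5000):  # self-play episodes: both "players" share the same table
    s = N
    while s > 0:
        acts = list(Q[s])
        a = random.choice(acts) if random.random() < eps else max(acts, key=Q[s].get)
        s2 = s - a
        if s2 == 0:
            target = 1.0                    # the only reward: the mover just won
        else:
            target = -max(Q[s2].values())   # zero-sum: opponent moves next from s2
        Q[s][a] += alpha * (target - Q[s][a])
        s = s2

# In this game the winning opening is to take 2 (leaving a multiple of 3):
print(max(Q[5], key=Q[5].get))
```

Run it and the table ends up preferring to take 2 stones from the opening position — the optimal move — even though no position was ever labelled for it.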
When I started studying computational linguistics in 2005, data, especially digitised texts, were scarce. A few newspapers had digitised their archives, and Project Gutenberg had been around since the 1970s, but all these corpora were too small for many NLP tasks, especially those involving machine learning. Just scraping the hell out of the Web was not going to do much good either, because one ended up with texts from different domains, written for different purposes — the resulting dataset would be too sparse to generalise well for, say, the English language in general, and too noisy to capture a specific domain, e.g., medicine or law. Hence one of the sayings around was “there is no data like more data”.
Indeed, if one wants to model such a complex system as human language, it seems that no amount of data will ever be enough. However, about a year after I started my undergrad, Twitter was founded and suddenly the Internet unleashed an endless stream of words on us, computational linguists (and everyone else). Tweets are not easy to work with — they are short, full of neologisms and typos, hashtags and links. But I remember how a clever fellow student of mine used a large Twitter corpus to train a model. He showed that the more data one used for training, the more accurate the predictions became, but the relationship was not linear: at some point, doubling the training dataset no longer doubled the model’s performance — the whole thing kind of saturated.
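That saturation is easy to reproduce in miniature. Here is a rough sketch, not his experiment: draw a training corpus from a toy Zipf-distributed “language” (the vocabulary and corpus sizes below are invented for illustration), keep doubling it, and measure what fraction of a held-out text consists of word types already seen in training.

```python
import random

random.seed(42)
# Toy "language": 10,000 word types whose frequencies follow a Zipf distribution,
# i.e. the word of rank r is proportionally 1/r likely. All numbers are made up.
V = 10_000
weights = [1.0 / rank for rank in range(1, V + 1)]

def sample_tokens(n):
    """Draw n tokens from the toy language."""
    return random.choices(range(V), weights=weights, k=n)

held_out = sample_tokens(5_000)          # the "test" text we want to model
coverages = []
for n in (1_000, 2_000, 4_000, 8_000, 16_000, 32_000):
    vocab = set(sample_tokens(n))        # word types seen in a training corpus of n tokens
    coverage = sum(tok in vocab for tok in held_out) / len(held_out)
    coverages.append(coverage)
    print(f"{n:>6} training tokens -> {coverage:.1%} of held-out tokens known")
```

The coverage keeps climbing, but each doubling of the corpus buys a smaller gain than the previous one — the long Zipfian tail of rare words means the curve flattens out, which is the same saturation my fellow student saw, only in a much cruder form.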
So data may very well be the new oil, but it is often crude oil. And just like oil, if you have tons of it and no technology to refine it, it will be of no use to you; it will clog your pipelines and break the whole system. And for some tasks you may need very little of it, or none at all!