The training corpus problem

Let’s imagine you have a cool idea for a startup. And of course in order for it to work you need to have a little bit of AI — machine learning of some sort. Great. Let’s start working and within a month you will have everything ready.

But what do you need to actually get machine learning working? You need a training corpus (or several corpora). Every machine learning algorithm needs a training corpus: a collection of information in which you have encoded what the AI should learn. If, for example, you have to categorize documents, you would need sample documents with their respective categories. Much like how people work: you give them examples and they make abstractions.
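
To make that concrete, here is a minimal sketch of what such a corpus can look like in code; the documents and categories are made up purely for illustration.

```python
# A toy training corpus for document categorization:
# each example pairs a raw document with the category the AI should learn from it.
training_corpus = [
    ("The parliament voted on the new budget today.", "politics"),
    ("Messi scored twice in the final minutes.",      "sports"),
    ("The central bank raised interest rates again.", "finance"),
]

for text, label in training_corpus:
    print(f"{label:>8}: {text}")
```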

So how can you get one? It depends on the kind of problem you have. If you do document classification, machine translation or syntactic sentence parsing, there are free corpora available for you. But what if you do something different? What if, for example, you want your system to learn the meaning of a word? A clever way to approach this problem is to gather tons of crossword clues and let the different clues for the same word teach you what that word means (collective discourse).
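
If your problem is one of those well-covered tasks, getting started can be as simple as downloading a ready-made corpus. A small sketch, assuming scikit-learn is installed (it fetches the 20 newsgroups corpus on first use):

```python
from sklearn.datasets import fetch_20newsgroups

# A freely available document classification corpus:
# .data holds raw newsgroup posts, .target holds their category indices.
newsgroups = fetch_20newsgroups(
    subset="train",
    categories=["rec.sport.hockey", "talk.politics.misc"],
)

print(len(newsgroups.data), "documents")
print(newsgroups.target_names)    # the category labels
print(newsgroups.data[0][:200])   # a peek at the first document
```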

But sometimes you don’t have this kind of information available. What if you want to create a corpus for skill extraction from job postings? You don’t really have this information floating around. You will have to scrape the jobs, then annotate them yourself: go through all the documents and mark what is a skill. That’s not too bad, right? But then, what is a “skill”? If you have the sentence “I have good working knowledge of Java”, what do you want to get as a skill: “Java”, “knowledge of Java”, “working knowledge of Java” or “good working knowledge of Java”? Would you consider iOS a skill? And if yes, do you consider “working with iOS” and “writing apps for iOS” different skills? There are many edge cases to consider while annotating a corpus, and major decisions to make about what the goal of your corpus is. If you are writing a parser and need to get all the possible information from the corpus, you will consider everything to be a skill, including “played tennis in college”. But if you are trying to build candidate-job matching software, you may want to skip “outgoing” and “creative”.
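
To see how much these decisions matter, here is a hypothetical annotation of that Java sentence under two different definitions of “skill”; the offset-plus-label format is just one common way to store span annotations, not a prescribed one.

```python
sentence = "I have good working knowledge of Java"

# The same sentence under two hypothetical versions of the annotation guidelines.
# Spans are (start, end) character offsets into the sentence.
annotation_v1 = [(33, 37, "SKILL")]   # only "Java" counts as a skill
annotation_v2 = [(12, 37, "SKILL")]   # the full "working knowledge of Java" counts

for version, spans in [("v1", annotation_v1), ("v2", annotation_v2)]:
    for start, end, label in spans:
        print(version, label, "->", repr(sentence[start:end]))
```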

Once you have annotated your first batch of documents you will have a set of rules, which will be version 1 of your annotation guidelines. Don’t think it is final. Once you try it in practice you may realize you actually need “working with iOS” vs “writing for iOS” instead of just “iOS”. Then you will need to annotate more documents that cover other edge cases, and add more rules to your guidelines. Then you realize you actually don’t need 1 corpus; you need 50. Because a skill is a different concept in finance, IT, accounting and manual labor. And whenever a new document comes in, you will need to place it in one of these categories. So you will need yet another corpus, one for the categorization of documents.
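
A rough sketch of how those corpora could fit together; the classifier and extractor functions here are hypothetical stand-ins for models trained on the separate corpora described above.

```python
# Hypothetical per-domain skill extractors, each standing in for a model
# trained on its own domain-specific corpus.
def extract_it_skills(text):      return ["Java"] if "Java" in text else []
def extract_finance_skills(text): return ["Excel"] if "Excel" in text else []

DOMAIN_EXTRACTORS = {
    "it": extract_it_skills,
    "finance": extract_finance_skills,
}

def categorize_domain(text):
    # Stand-in for a classifier trained on the document-categorization corpus.
    return "it" if "Java" in text else "finance"

def extract_skills(job_posting):
    domain = categorize_domain(job_posting)          # corpus #1: categorization
    return DOMAIN_EXTRACTORS[domain](job_posting)    # corpora #2..N: per-domain skills

print(extract_skills("We need good working knowledge of Java."))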

But then you have another problem: your corpus is a work in progress and so is your code. And you have a lot of benchmarks that measure how well you have taught your AI. You have to keep track of these benchmarks, because the fact that your current benchmark is so different from the previous one could come down to any of the following:

  • The code changed.
  • The features changed.
  • More documents were added to the corpus.
  • The annotation guidelines changed.

So you have to keep track of which benchmark was run with which set of documents, which version of the code, and which version of the guidelines.
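
One low-tech way to do that is to record a bit of metadata with every benchmark run. A sketch, with made-up field names and values:

```python
from dataclasses import dataclass
from datetime import date

# One record per benchmark run, so a change in the score can be traced back
# to the code, the corpus, or the guidelines.
@dataclass
class BenchmarkRun:
    run_date: date
    code_version: str        # e.g. a git commit hash
    corpus_version: str      # which set of annotated documents was used
    guidelines_version: str  # which revision of the annotation guidelines
    score: float             # whatever metric you benchmark with

runs = [
    BenchmarkRun(date(2024, 1, 10), "a1b2c3d", "corpus-v3", "guidelines-v2", 0.71),
]
print(runs[0])
```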

Then there is the question of features, the features of the text you want to feed to the AI (if your AI works with text). For document categorization, every word may be a feature of the document. Then the algorithm will realize that “Obama” and “Putin” are used in political documents and “football” and “Messi” in sports documents. Then you can add a bunch of additional features, like the part of speech of these words, how long they are, whether they are at the beginning of the sentence or in the middle… There will be a ton of features and you may feel compelled to use them all.
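
A minimal sketch of the “every word is a feature” idea, assuming scikit-learn; the extra document-length feature at the end is only an example of the kind of hand-crafted feature you might add on top.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Obama met Putin to discuss the treaty.",
    "Messi leads the team into the football final.",
]

# Bag of words: every word in the vocabulary becomes one feature of a document.
vectorizer = CountVectorizer()
word_features = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # the vocabulary, i.e. the feature names
print(word_features.toarray())              # word counts per document

# An extra, hand-crafted feature: document length in words.
length_feature = [[len(doc.split())] for doc in docs]
print(length_feature)
```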

But some features would introduce noise, so you have to be careful. Try to understand the problem yourself. Machine learning does not do the thinking for you. What it does is give you the power to put into code your knowledge of how something should be learned.

You can try to approach it as if it were a problem in a language you don’t know. What would you need to categorize documents in Sanskrit? Do you need to understand the meaning of the words to do it? Or can you group the documents just by the common words you see in them, even without knowing what they mean?
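
Here is a sketch of that second approach, assuming scikit-learn: the documents are grouped purely by the surface words they share, with no notion of what those words mean.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "The election results surprised the parliament.",
    "The parliament debated the election law.",
    "The striker scored in the cup final.",
    "Fans celebrated the cup final victory.",
]

# Represent each document only by which words it shares with the others...
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# ...and group the documents by that overlap alone.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)
print(labels)   # e.g. [0, 0, 1, 1]: politics-like vs sports-like documents
```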

If you feel like trying machine learning “at home”, you should definitely do it. It is easier than ever now thanks to the many available tools. But mind the corpus problem: you almost never have a ready-to-use corpus, and that’s the hard part. Keep in mind that AI is not a substitute for thinking.