How I built an origin-destination classifier

Image source: http://culturefit.com/wp-content/uploads/2017/03/machine-learning-2-700x300.jpg

Ticket vending chatbot as my pet-project

In my spare time I’m building a ticket vending chatbot as a pet project. A project like this poses quite a few challenges, and one of them is data extraction. In this article I would like to explain how I solved the problem of classifying a station mentioned by the customer as the origin or the destination.

Let me start with some basic assumptions that frame the problem. All possible station names are known and stored as a list. Customers can ask for tickets in various ways, e.g.

„Hey I need a ticket to London” or „Please plan me a journey from Manchester to London”.

Information about stations can also arrive over several turns of a dialog, like:

Client: „Hey, I need a ticket to London”

Bot: „Sure, and where do you want to start your journey?”

Client: „Oh, at Manchester Airport”.

We can easily find station names in a sentence, but how do we „teach” our bot to recognise them as an origin or destination?

Machine learning

Probably your first idea was something like „Oh c’mon! You can do it with a few IF statements”. My thought was exactly the same at the very beginning. After a few IFs I realised that the rules were getting too tangled to keep the code clean and easy to read. It was way easier to provide a lot of example sentences, define the expected result for each of them and let the bot find the rules by itself. Oh wait… isn’t that machine learning? :)

There are two main challenges when you want to build any classifier. First: where can I get training data from? Second: how do I pick a classifier and define the feature set?

So, where can I get data from? Simple: each example is just a short sentence, so you can write them yourself. Boring? Sure! So take something that lets you speak instead of type (I picked the webkitSpeechRecognition tool), add some simple logic and an interface, and you can dictate sentences and tag them. This makes creating the training set quite fun: you get to talk to your computer!

Now, time for a more creative problem: how to pick the right classifier and feature set? For a beginner (like me), the best method is trial and error. The attempt that gave me quite good results turned out to be really simple. I picked the Naive Bayes classifier, more precisely the implementation from the NLTK library for Python (http://www.nltk.org/_modules/nltk/classify/naivebayes.html). A feature set for this classifier is a dictionary that maps feature names to boolean values, like:

{
    "feature_1": True,
    "feature_2": False,
    "feature_3": True,
}

This sample feature set should be interpreted like: „classified object has feature_1 and feature_3, but does not have feature_2”.

Now, how to choose the features? Trial and error again! Quite quickly I realised that the most natural way to tell whether a station is the origin or the destination is to look at the word before the station name. For instance, in „from London” the station is most probably the origin. So I decided to extract triplets from sentences: the word before the station name, the station name itself (replaced by the placeholder STATION) and the word after it, like „from_STATION_to” or „between_STATION_and”. If the station name is at the beginning or end of the sentence, the corresponding element of the triplet is blank, like „_STATION_to” or „to_STATION_”.
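The triplet extraction described above can be sketched roughly as follows. This is only a minimal illustration, not the code from my bot: the function name station_triplets, the regex-based tokenisation and the sample station list are assumptions made for this example.

```python
import re

STATION = "STATION"  # placeholder that replaces every known station name


def station_triplets(sentence, station_names):
    """Return one 'before_STATION_after' string per station mention, in order."""
    if not station_names:
        return []
    # Longest names first, so "Manchester Airport" wins over "Manchester".
    names = sorted(station_names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    masked = pattern.sub(STATION, sentence)
    # Keep word tokens only; punctuation is dropped, other words are lowercased.
    tokens = [
        token if token == STATION else token.lower()
        for token in re.findall(r"[\w']+", masked)
    ]
    triplets = []
    for i, token in enumerate(tokens):
        if token != STATION:
            continue
        before = tokens[i - 1] if i > 0 else ""
        after = tokens[i + 1] if i + 1 < len(tokens) else ""
        triplets.append(f"{before}_{STATION}_{after}")
    return triplets


stations = ["London", "Glasgow", "Manchester", "Manchester Airport"]  # sample list
print(station_triplets("Plan me a journey from London to Glasgow", stations))
# ['from_STATION_to', 'to_STATION_']
```

A sentence that contains only a station name, such as „Manchester”, yields the fully blank triplet „_STATION_”.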

That works great for a single-sentence input like „Plan me a journey from London to Glasgow”, but in a multi-turn conversation it is not so obvious. Imagine a conversation like:

Client: „Hey, I need a ticket to London”

Bot: „Sure, and where do you want to start your journey?”

Client: „Manchester”

In that case we have to implement a kind of „context”. The minimum knowledge we need is „do we already have an origin or a destination?”. So we can introduce two additional features: „has_origin” and „has_destination”. A sample of the labeled feature sets would look like:

[
    (
        {
            "has_origin": False,
            "has_destination": False,
            "from_STATION_to": True
        },
        "ORIGIN"
    ),
    (
        {
            "has_origin": True,
            "has_destination": False,
            "_STATION_": True
        },
        "DESTINATION"
    )
]
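Writing such structures by hand gets tedious, so they can be generated from tagged examples. Below is a minimal sketch of that step; the helper build_features and the rows in tagged_examples are hypothetical and exist only for illustration, they are not real training data.

```python
def build_features(triplet, has_origin, has_destination):
    """Combine one triplet with the two context flags into a feature dict."""
    return {
        "has_origin": has_origin,
        "has_destination": has_destination,
        triplet: True,
    }


# Hypothetical tagged examples: (triplet, has_origin, has_destination, label).
tagged_examples = [
    ("from_STATION_to", False, False, "ORIGIN"),
    ("to_STATION_", True, False, "DESTINATION"),
    ("_STATION_", True, False, "DESTINATION"),
    ("_STATION_", False, True, "ORIGIN"),
]

# A list of (feature dict, label) pairs, the same shape as the sample above.
features_set = [
    (build_features(triplet, has_origin, has_destination), label)
    for triplet, has_origin, has_destination, label in tagged_examples
]
```

Note how the bare „_STATION_” triplet appears twice with different labels: only the context flags tell the two cases apart.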

Train your classifier

Once we have gathered enough data and prepared the training feature set, training the classifier is quite simple:

import nltk
stations_classifier = nltk.NaiveBayesClassifier.train(features_set)

Done, we just trained a fully operable stations classifier. Now we can test it. Sample input for our classifier would be something like:

feature = {
    "has_origin": False,
    "has_destination": False,
    "from_STATION_to": True
}

and then calling:

stations_classifier.classify(feature)

should give us „ORIGIN”, the label we used in the training set. That’s it: we can now classify a pattern as an origin or a destination. Of course there is more code to write to automate the process (transforming text into feature sets, preparing training data, tagging station names in text and so on), but those are relatively easy tasks. So, enjoy your further work :)
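As one example of that remaining glue code, here is a rough sketch of how the „has_origin” and „has_destination” context could be carried across the turns of a dialog. Everything in it is an assumption for illustration: update_journey is a hypothetical helper, and StubClassifier is a hand-written stand-in that only mimics the classify() method, so the snippet runs without NLTK or training data. In the real bot the trained stations_classifier would take its place.

```python
class StubClassifier:
    """Hypothetical stand-in with the same classify() method as the NLTK classifier."""

    def classify(self, features):
        # If only one slot is still empty, assume the station fills it.
        if features.get("has_destination") and not features.get("has_origin"):
            return "ORIGIN"
        if features.get("has_origin") and not features.get("has_destination"):
            return "DESTINATION"
        # Otherwise look at the word before the station name.
        after_from = any(
            name.startswith("from_") for name, value in features.items() if value
        )
        return "ORIGIN" if after_from else "DESTINATION"


def update_journey(journey, mentions, classifier):
    """Classify each (station, triplet) mention and store it under its label."""
    for station, triplet in mentions:
        features = {
            "has_origin": "ORIGIN" in journey,
            "has_destination": "DESTINATION" in journey,
            triplet: True,
        }
        journey[classifier.classify(features)] = station
    return journey


classifier = StubClassifier()
journey = {}
# Turn 1: "Hey, I need a ticket to London"
update_journey(journey, [("London", "to_STATION_")], classifier)
# Turn 2: "Manchester"
update_journey(journey, [("Manchester", "_STATION_")], classifier)
print(journey)  # {'DESTINATION': 'London', 'ORIGIN': 'Manchester'}
```

The journey dictionary is the whole „context” here: its keys are recomputed into the two boolean features before every classification.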

Author: Rafał Orłowski