Open Sourcing Duckling, our probabilistic (date) parser
We’ve previously discussed ambiguity in natural language. What’s really fascinating is that even the simplest, seemingly most structured parts of natural language, like the way we humans describe dates and times, are surprisingly difficult to turn into structured data.
The wild world of temporal expressions in human language
All the following expressions describe the same point in time (at least in some contexts):
- “December 30th, at 3 in the afternoon”
- “The day before New Year’s Eve at 3pm”
- “At 1500 three weeks from now”
- “The last Tuesday of December at 3pm”
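To make “in some contexts” concrete, here is a minimal Python sketch that resolves each of the four expressions by hand and checks that they agree. The reference time (Tuesday, December 9, 2014) and the helper at_3pm are assumptions chosen for illustration; this is not how Duckling works internally.

```python
from datetime import datetime, timedelta
import calendar

# Assumed reference time (not from the post): Tuesday, December 9, 2014.
now = datetime(2014, 12, 9, 10, 0)

def at_3pm(day):
    """Hypothetical helper: keep the date, set the time to 15:00."""
    return day.replace(hour=15, minute=0, second=0, microsecond=0)

# "December 30th, at 3 in the afternoon"
explicit = datetime(now.year, 12, 30, 15, 0)

# "The day before New Year's Eve at 3pm"
before_nye = at_3pm(datetime(now.year, 12, 31) - timedelta(days=1))

# "At 1500 three weeks from now"
three_weeks = at_3pm(now + timedelta(weeks=3))

# "The last Tuesday of December at 3pm"
last_day = datetime(now.year, 12, 31)
offset = (last_day.weekday() - calendar.TUESDAY) % 7
last_tuesday = at_3pm(last_day - timedelta(days=offset))

candidates = [explicit, before_nye, three_weeks, last_tuesday]
all_equal = len(set(candidates)) == 1
```

Move the reference time by a single day and “three weeks from now” no longer lands on December 30th, which is exactly why the context matters.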
But wait… is it really equivalent to say 3pm and 1500? In the latter case, it seems that the speaker meant to be more precise. Is it OK to drop this information?
And what about “next Tuesday”? If today is Monday, is that tomorrow or in 8 days? When I say “last month”, is it the last full month or the last 30 days?
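One way to picture the problem is to have the parser return every plausible reading instead of picking one. The sketch below does that for the two questions above. The function names are hypothetical, and the choice that “next Tuesday” said on a Tuesday never means today is itself an assumption.

```python
from datetime import date, timedelta

TUESDAY = 1  # date.weekday() counts Monday as 0

def next_weekday_candidates(today, weekday):
    """Return both plausible readings of "next <weekday>"."""
    days_ahead = (weekday - today.weekday()) % 7
    if days_ahead == 0:
        days_ahead = 7  # assumption: never resolve to today
    nearest = today + timedelta(days=days_ahead)
    following = nearest + timedelta(days=7)
    return nearest, following

def last_month_candidates(today):
    """Return (start, end) pairs for "last full month" and "last 30 days"."""
    first_of_this_month = today.replace(day=1)
    end_prev = first_of_this_month - timedelta(days=1)
    full_month = (end_prev.replace(day=1), end_prev)
    rolling = (today - timedelta(days=30), today)
    return full_month, rolling

monday = date(2014, 12, 8)  # an arbitrary Monday, chosen for illustration
tomorrow, in_8_days = next_weekday_candidates(monday, TUESDAY)
full_month, rolling = last_month_candidates(monday)
```

Which candidate is right depends on the speaker and the situation, so the code cannot settle it on its own.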
A last example: “one month” looks like a well-defined duration. That is, until you try to normalize durations to seconds, and you realize that months last anywhere between 28 and 31 days! Even “one day” is difficult. Yes, a day can last between 23 and 25 hours, because of daylight saving time. Oh, and did I mention that at midnight at the end of 1927 in Shanghai, the clocks went back 5 minutes and 52 seconds? So “1927-12-31 23:54:08” actually happened twice there.
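The month problem is easy to see with Python’s standard calendar module. This is only an illustrative sketch: month_length_seconds is a hypothetical helper, and it assumes every day has exactly 24 hours, which the daylight saving remark above already shows is not always true.

```python
import calendar

SECONDS_PER_DAY = 24 * 60 * 60  # assumption: no clock changes during the day

def month_length_seconds(year, month):
    """Length of a given calendar month in seconds, ignoring clock changes."""
    days_in_month = calendar.monthrange(year, month)[1]
    return days_in_month * SECONDS_PER_DAY

# "One month" in seconds, for each month of an arbitrary non-leap year.
lengths = {month: month_length_seconds(2014, month) for month in range(1, 13)}
shortest = min(lengths.values())  # February: 28 days
longest = max(lengths.values())   # e.g. January: 31 days
```

The shortest and longest values differ by three full days, so “one month” cannot be turned into a fixed number of seconds without knowing which month is meant.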
There are hundreds of hard things like these, and the more you dig into this, believe me, the more you’ll encounter. But that’s out of the scope of this post.
At Wit.ai, the built-in entity most used by our community of developers is wit/datetime. So we had to work on this problem. From our past experiences with NLP, we knew that a fully rule-based approach was a recipe for disaster. Unfortunately (or not), humans are very bad at following strict (syntactic) rules. On the other hand, temporal expressions are quite regular and hierarchical compared to other aspects of language, so a fully machine-learned approach like the one we use in other parts of Wit.ai seemed difficult. So we started to design a hybrid system, based on both rules and examples: Duckling.
Open sourcing Duckling
Today, we are both happy and eager to share our approach with the community by open sourcing Duckling. Duckling is far from perfect, but we think it may help a few developers with similar problems. And as we wrote earlier, natural language is such a hard problem that we need to join forces!
We’ve been using Duckling in production for one year now, and while it’s still a very early-stage library, it parses hundreds of thousands of weird temporal expressions in five languages with a lot of success.
Moving forward, we’ll continue to open source more and more of Wit.ai. Please give us lots of feedback about Duckling, and of course, contribute!