Understanding of date and time for a chatbot in French

An evaluation pipeline: translation, entity extraction and datetime transformation

Published in Empathic Labs, May 18, 2020

Multilingual Appointment Chatbot is a project carried out in collaboration with Deeplink, a Swiss startup specialised in chatbot technologies. For one of its customers, Deeplink needs a chatbot that can detect whether a French text message sent by a human contains a date and a time, and respond appropriately. The chatbot will be used to book appointments with customers.

Several existing algorithms already solve this task, but they only work for English.

For my semester project at the HEIA-FR (Fribourg, Switzerland), my task was to compare several popular Natural Language Processing algorithms on French text translated into English by a translation service, and to find a way to convert the extracted dates into Python datetime objects. The goal was to see whether translating a sentence before sending it to popular date extraction algorithms is viable.

Technologies

During this project, the following technologies were used:

Translation service: Deepl API

NER algorithms: spaCy, stanza, flair

Datetime extraction libraries: dateparser, parsedatetime, timestring

Dataset creation

In order to test the translation, NLP algorithms and datetime extraction libraries, it was essential to have a sufficiently large dataset.

Deeplink created a service called Chatbotstrap, which provides tools to build a dataset through a poll made up of different scenarios. The poll can be shared with anyone simply by sending them the link.

Here is an example of a scenario:

25 people answered the poll, which gave us 421 sentences, 371 of which were validated.

Several types of date were found:

Date: like “Tuesday” or “the 5th of May”
Date & time: like “today at 5 p.m.”
Time: like “10 a.m.”
Range: like “tomorrow between 2 p.m. and 6 p.m.”
List: like “Thursday or Friday”

Translation

After getting the data from Chatbotstrap, I translated the whole French dataset into English with the help of Deepl.

The whole dataset was passed through the translation engine and every translation was verified manually.

After several rounds of translation tests, 89.8% of the dataset was considered correctly translated. The remaining 10.2% consists of translations that were not usable. For example, “à 16h” was translated as “at 4” without mentioning a.m. or p.m.
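The “à 16h” case can be caught automatically before manual review. The following is a minimal sketch, not the check used in the project: the function name `has_lost_meridiem` and both regular expressions are my own assumptions, and it only handles the compact French “16h” / “16h30” notation.

```python
import re

# French compact hour notation such as "16h" or "16h30" (assumption: other
# spellings like "16 heures" are not covered by this sketch).
FRENCH_HOUR = re.compile(r"\b(\d{1,2})\s?h(?:\d{2})?\b", re.IGNORECASE)

# An English a.m./p.m. marker directly following a digit: "4pm", "4 p.m.", "10 AM".
MERIDIEM = re.compile(r"\d\s?[ap]\.?\s?m\b", re.IGNORECASE)


def has_lost_meridiem(french: str, english: str) -> bool:
    """Return True when an afternoon hour in the French text has become ambiguous."""
    afternoon_hours = [int(h) for h in FRENCH_HOUR.findall(french) if int(h) >= 13]
    if not afternoon_hours:
        return False  # morning hours read the same on a 12-hour clock
    if MERIDIEM.search(english):
        return False  # the translation states a.m. or p.m. explicitly
    # The translation is still unambiguous if it kept the 24-hour value.
    return not any(re.search(rf"\b{h}\b", english) for h in afternoon_hours)


print(has_lost_meridiem("à 16h", "at 4"))       # True
print(has_lost_meridiem("à 16h", "at 4 p.m."))  # False
```

A flag like this only narrows down which translations to look at first; it does not replace the manual verification described above.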

NLP Algorithms

Each selected NLP algorithm goes through the same process, in order to compare which one finds the most correct dates and times in our translated dataset and which one is the fastest.
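To make the comparison concrete, here is a minimal sketch of a harness that measures both criteria for any extractor. It is an illustration only: the `evaluate` function and the `(text, expected)` sample format are assumptions, and it compares outputs by strict equality, whereas the results in this article were validated by hand.

```python
import time
from typing import Callable, Sequence, Tuple


def evaluate(extract: Callable[[str], object],
             samples: Sequence[Tuple[str, object]]) -> Tuple[float, float]:
    """Run `extract` on every sample and return (accuracy, elapsed_seconds)."""
    start = time.perf_counter()
    predictions = [extract(text) for text, _ in samples]
    elapsed = time.perf_counter() - start  # only the extraction itself is timed

    correct = sum(
        prediction == expected
        for prediction, (_, expected) in zip(predictions, samples)
    )
    accuracy = correct / len(samples) if samples else 0.0
    return accuracy, elapsed
```

Passing the same `samples` to each algorithm's extraction function keeps the comparison fair, since every candidate sees exactly the same translated sentences.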

spaCy

Using spaCy’s large model en_core_web_lg, 81.4% of the records are correct, and the NER process takes 2.5 seconds to run.
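As an illustration of this step, the sketch below keeps only the date and time entities from a spaCy document. It assumes the `DATE` and `TIME` labels used by spaCy's English models; the helper names `date_time_entities` and `load_and_extract` are mine, and the second one needs spaCy and the `en_core_web_lg` model to be installed.

```python
from typing import List, Tuple

# Entity labels of interest (assumption: the OntoNotes labels of the English models).
DATE_TIME_LABELS = {"DATE", "TIME"}


def date_time_entities(doc) -> List[Tuple[str, str]]:
    """Return (text, label) pairs for the date and time entities of a spaCy Doc."""
    return [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in DATE_TIME_LABELS]


def load_and_extract(text: str, model_name: str = "en_core_web_lg") -> List[Tuple[str, str]]:
    """Load the model and run NER on one sentence (requires spaCy and the model)."""
    import spacy  # imported lazily so the helper above works without spaCy installed

    nlp = spacy.load(model_name)
    return date_time_entities(nlp(text))
```

Calling `load_and_extract("I am available tomorrow between 2 p.m. and 6 p.m.")` would return whatever date and time spans the model finds; in a real run the model should be loaded once and reused for the whole dataset, since loading it dominates the cost of a single call.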

stanza

With stanza, 74.2% of the named entities are correctly recognised. The big difference is that it is slower than spaCy: the NER process takes about 35 seconds to run.

flair

After going through the entire process and validating the results, 58.9% of the named entities were correctly found. This is the worst result of the three algorithms. It is also the slowest, taking about 1 minute and 12 seconds to run the NER process.

Results

With 81.4% of named entities correctly found in 2.5 seconds, spaCy is the best NLP algorithm in our test. It beats the two others by far in terms of speed and accuracy. spaCy’s results are therefore kept for the next step of the pipeline.

Datetime extraction

After being translated and run through the NER process, the dataset had to pass through datetime extraction, the last step of the pipeline. The selected datetime extraction libraries are tested in the same way in order to find the most effective one. Each “spaCy valid” record is run through the three libraries and each result is verified.

dateparser

First, we imported the records validated in the spaCy step, because we want to work with the most valid data possible. Then we iterated through the dataset and used dateparser to extract datetime objects from the records. In total, 61% of the extracted datetime objects were correct.

parsedatetime

parsedatetime went through the same process. It has better results than dateparser: 68.9% of datetime objects are correct.

timestring

This library is the worst of this test: only 44.6% of datetime objects are correct.

Results

The winner of this test is parsedatetime, with 68.9% of correct datetime objects. Unfortunately, this library is not capable of range extraction, but with a bit of data manipulation it would not be difficult to add.

Conclusion

To summarise the results, we have:

Deepl with 89.8% of correct English translations
spaCy with 81.4% of correct date and time named entities
parsedatetime with 68.9% of correct datetime extraction

The following figure illustrates that 184 records out of the entire dataset (374 records in total) passed the entire pipeline. This means that for about 49% of the data, the text is correctly translated, the date and time entities are recognised and the datetime objects are extracted.
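The 49% figure is roughly what you get by chaining the three stages, since each one only works on what the previous one got right. The short calculation below is a sanity check under that assumption (each stage's score is treated as a rate on the records that survived the previous stage); the variable names are mine.

```python
import math

# Per-stage success rates reported in this article.
stage_accuracy = {
    "translation (Deepl)": 0.898,
    "entity recognition (spaCy)": 0.814,
    "datetime extraction (parsedatetime)": 0.689,
}

# If every stage only sees the survivors of the previous one, the rates multiply.
expected_yield = math.prod(stage_accuracy.values())

# Yield actually observed: 184 records out of 374 passed the whole pipeline.
observed_yield = 184 / 374

print(f"expected: {expected_yield:.1%}")  # expected: 50.4%
print(f"observed: {observed_yield:.1%}")  # observed: 49.2%
```

The two values are close, which shows how three individually decent stages compound into an end-to-end result near one in two.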

The main question is now: is the solution viable? The answer is probably no at this stage. The translation by Deepl and the date/time entity recognition by spaCy returned good results and parsedatetime wasn’t that bad, but all of them together give only 49% of correct results in the end. So only one out of two French texts yields a correct datetime object. This percentage is low, mostly because of texts containing date ranges. The solution needs improvements, especially for the datetime extraction.

parsedatetime gave good results for simple dates and lists of dates. The critical cases are ranges and lists of ranges, which it was not able to recognise correctly at all (0%). timestring was our hope for these, but it gave very poor results. The solution would probably be to detect ranges manually and find the start and end dates in order to feed them into the datetime extraction process separately. This process would surely give better results than the ones we have right now.
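Here is a minimal sketch of that idea, which was not implemented in the project. It assumes ranges are phrased as “between X and Y”, as in “tomorrow between 2 p.m. and 6 p.m.”; the names `split_range`, `extract_range` and `parse_with_parsedatetime` are hypothetical, and the last one needs the parsedatetime package.

```python
import re
from datetime import datetime
from typing import Callable, Optional, Tuple

# "<prefix> between <start> and <end>" (assumption: the only range phrasing handled).
RANGE_PATTERN = re.compile(
    r"^(?P<prefix>.*?)\bbetween\s+(?P<start>.+?)\s+and\s+(?P<end>.+)$",
    re.IGNORECASE,
)


def split_range(text: str) -> Optional[Tuple[str, str]]:
    """Split a range expression into two standalone date expressions."""
    match = RANGE_PATTERN.match(text.strip())
    if match is None:
        return None
    prefix = match.group("prefix").strip()
    # Repeat the shared prefix (e.g. "tomorrow") in front of both bounds.
    start = f"{prefix} {match.group('start')}".strip()
    end = f"{prefix} {match.group('end')}".strip()
    return start, end


def extract_range(text: str, parse: Callable[[str], Optional[datetime]]):
    """Parse both bounds separately with any single-date parser."""
    parts = split_range(text)
    if parts is None:
        return None
    start, end = parse(parts[0]), parse(parts[1])
    if start is None or end is None:
        return None
    return start, end


def parse_with_parsedatetime(text: str, now: Optional[datetime] = None):
    """Single-date parser backed by parsedatetime (requires the package)."""
    import parsedatetime  # imported lazily so the helpers above stay dependency-free

    parsed, status = parsedatetime.Calendar().parseDT(text, sourceTime=now)
    return parsed if status else None  # a status of 0 means nothing was parsed
```

With this split, `extract_range("tomorrow between 2 p.m. and 6 p.m.", parse_with_parsedatetime)` would hand “tomorrow 2 p.m.” and “tomorrow 6 p.m.” to parsedatetime one at a time, which is the kind of simple expression it handled well in the tests above. Lists of ranges would need an extra splitting step first.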

Telegram bot

The bot created for the project demonstrates the full pipeline, that is, translation, named entity recognition and datetime extraction. Users can access the bot from Telegram by typing its username in the search bar, and can start the conversation by typing “/start”.

You can try it at this link: t.me/ps6_multi_lang_bot
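The three steps chain together as sketched below. This is an outline of the flow rather than the bot's actual code: `run_pipeline`, `PipelineResult` and the three callables are assumed names, and each callable stands in for one stage (Deepl, spaCy, parsedatetime).

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, List, Optional


@dataclass
class PipelineResult:
    french: str                          # original user message
    english: str                         # output of the translation step
    entities: List[str]                  # date/time spans found by NER
    datetimes: List[Optional[datetime]]  # one entry per entity, None if parsing failed


def run_pipeline(
    french: str,
    translate: Callable[[str], str],
    find_entities: Callable[[str], List[str]],
    to_datetime: Callable[[str], Optional[datetime]],
) -> PipelineResult:
    """Translate, extract date/time entities, then convert each to a datetime."""
    english = translate(french)
    entities = find_entities(english)
    datetimes = [to_datetime(entity) for entity in entities]
    return PipelineResult(french, english, entities, datetimes)
```

Keeping the intermediate outputs in the result lets the bot show the user what each stage produced, not just the final datetime.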