A battle with language recognition
After my third year of studying at the Gdańsk University of Technology, I got an internship and then a position as a Java Developer at Jit Team. It feels great to finally be working on an actual project in an actual software company: a tool called WeJit for gathering and sharing information about the people who work in the company, or the “Internal Facebook”, as many people called it when I first started.
It was very helpful for me at the beginning, because whenever I needed to find someone at the office, I could first look them up on WeJit and see what they look like, what room they sit in, etc. And even though I know a lot more people around the office now, I still use it every day at work. It’s also a great tool for our HR team because it has many features dedicated to them, such as hiring new employees, managing their status and generating profiles of our developers in the form of resumes, so that our clients can easily get to know us.
Given that WeJit is geared towards international cooperation, and that the project is going to be open sourced, we decided to focus on the English version of the system and to discontinue both the Polish version of the UI and the content (employee resumes) written in Polish. The technical part of this task was not hard at all: change some JSON or properties files, hide the language picker in the GUI, etc.
So why am I even writing this if it was so simple, you might ask? Well, we live and work in Poland. Jit Team is a Polish company and the vast majority of people who work here are Polish. So some people still filled out their profiles in Polish. Moreover, some people had already filled in their profiles in Polish before we “blocked” the Polish version of the system. And as it turned out, adding warning signs or pop-ups did not solve the problem: content in Polish still kept popping up from time to time.
To solve this, we thought it might be a good idea to implement a mechanism that checks whether the text the user is trying to save is written in English. At first, I had very mixed feelings. It sounded like a cool task, something different to try to weave into our system. On the other hand, it seemed it might be very hard. From my university courses I remembered some methods that might come in handy (a Bayes classifier, maybe?). But would I manage to deliver such functionality in a sensible timeframe?
Then I asked myself: is it really necessary for me to try and reinvent the wheel? Maybe I should first try to find a library that might help me do this? If one existed, it would surely give better and quicker results than anything I could implement by hand. And sure enough, it did not take me long to find what I needed. A quick Google search for “java language detection library” directed me to this GitHub repo… and it seemed to have all I needed. It even had tens of language profiles built in!
I could just grab the library, conveniently packaged as a Maven dependency, and start detecting languages! The code was very easy to apply. First, the Maven dependency (in the pom.xml file):
Then the actual detecting:
The code above is relatively straightforward. We start by loading all of the language profiles that are built into the library (you can also pick which ones you would like to detect; refer to the README file in the GitHub repo). Then we build a language detector with the given language profiles and a TextObject with the text we want to analyse. The last step is to call the detect() method with our TextObject, which returns an Optional of the LdLocale class containing the detected language information. Be aware that the Optional we get is not the standard Java API class (java.util.Optional) but the one provided by Google’s Guava library (hopefully this will be fixed soon to support the standard Java API). Until then, a call such as java.util.Optional.ofNullable(detected.orNull()), where detected is the value returned by detect(), converts it to the standard type.
And there we go, we’ve performed our language detection. Task solved!?
The first tests came out very well. The detection seemed to work for all the samples I tried that had two or more words, so I used it, proud of myself for finding a fairly quick and accurate solution to the problem I was given. But when I handed it over to the QA guys for further validation, it turned out that my testing and research weren’t thorough enough (what a surprise). Many English sentences were being detected as Polish, some of which were:
“I am Joanna”, “I like boxing”, “I like to eat”…
I was aware that language recognition might not work that well with short sentences, but after trying it out on my own a couple of times I thought it would do just fine. But the more we tried it, the more I saw that it just didn’t.
So I decided to sit down and look for a better solution, this time with proper preparation. I gathered a sample set of randomly generated, well-formed English sentences and an equivalent Polish set. Then I built a small project that let me quickly evaluate a given detection library against the testing set.
The full source code of my evaluation scripts, along with sample data sets of sentences in English and Polish, is available at https://github.com/gajewa/LanguageDetection
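To give an idea of what such an evaluation boils down to, here is a minimal sketch. It is not the code from the repository above: the class and method names are made up for illustration, and the detector is a toy word-list stand-in rather than a real library. A real library call would be wrapped in the same kind of text-to-language-code function.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

/** Minimal accuracy check for a language detector (all names are hypothetical). */
final class DetectorEvaluation {

    // Toy vocabulary for the stand-in detector; a real library uses language profiles.
    private static final Set<String> POLISH_WORDS =
            new HashSet<>(Arrays.asList("jestem", "jest", "to", "mam"));

    /** Stand-in detector: reports "pl" if any word is in the toy list, otherwise "en". */
    static String toyDetect(String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (POLISH_WORDS.contains(word)) {
                return "pl";
            }
        }
        return "en";
    }

    /** Fraction of sentences for which the detector returns the expected language code. */
    static double accuracy(Function<String, String> detector,
                           List<String> sentences,
                           String expectedLanguage) {
        if (sentences.isEmpty()) {
            throw new IllegalArgumentException("no sentences to evaluate");
        }
        int correct = 0;
        for (String sentence : sentences) {
            if (expectedLanguage.equals(detector.apply(sentence))) {
                correct++;
            }
        }
        return (double) correct / sentences.size();
    }

    public static void main(String[] args) {
        List<String> english = Arrays.asList("I am Joanna", "I like boxing", "I like to eat");
        List<String> polish = Arrays.asList("Jestem Joanna", "To jest kot", "Mam kota");

        System.out.println("en accuracy: " + accuracy(DetectorEvaluation::toyDetect, english, "en"));
        System.out.println("pl accuracy: " + accuracy(DetectorEvaluation::toyDetect, polish, "pl"));
    }
}
```

The toy detector gets two of the three English sentences right: it trips over “I like to eat”, because “to” is also a Polish word. That is the same kind of short-sentence mistake that started this whole investigation.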
To pick candidate solutions for evaluation, I followed a Stack Overflow thread that covers the top language detection libraries. I decided to try two of the suggestions given there: Language Detection API and Apache Tika. Below, I describe how to use these tools, and after that I show the results of my evaluation.
The Language Detection API is a hosted service offering a wide range of access plans (including a free plan limited to 5000 requests per day). Sending texts to a third-party service may be problematic for some use cases. Still, I decided to give it a try to have a comparison with the other solutions. Judging by the project website, this API seems to use Google’s Compact Language Detector 2 (taken out of the Chromium source tree). Sadly, that framework is written in C++ and, to my knowledge, there is no mature, up-to-date wrapper for Java. Luckily, there is an open-source Java client library for the Language Detection API. To use it, we need to register an account on the project website (I used only the free plan), get an authentication key, and then import the corresponding dependency to start using the service in our project:
The detection code is even simpler than in the previous case:
As you can see, this way we also get information about whether the detection is reliable (a Boolean value) and a confidence rate (a double value, the higher the value the more confident we can be that the proper language was detected).
Now, last but not least: Apache Tika, a content detection and analysis framework that is supposed to detect and extract information from over a thousand different file types.
Its usage is very similar to that of the Language Detection API. The difference is that we don’t need an authentication key, and the confidence level is not a double value but one of three values: HIGH, MEDIUM or LOW.
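Since the two tools report confidence differently, the code that decides whether to accept a text needs its own rule for each result shape. Below is one possible sketch of such a rule. The types here are hypothetical mirrors of what the libraries return, not their actual classes, and treating only HIGH as trustworthy is my assumption rather than a recommendation from either project.

```java
/** Sketch of a single decision rule for two differently shaped detection results. */
final class EnglishCheck {

    /** Hypothetical mirror of a three-level confidence scale. */
    enum Confidence { HIGH, MEDIUM, LOW }

    /** Outcome of the check: UNSURE means the detector should not be trusted here. */
    enum Verdict { ENGLISH, OTHER, UNSURE }

    /** Rule for a result that carries a language code and a reliability flag. */
    static Verdict verdict(String languageCode, boolean reliable) {
        if (!reliable) {
            return Verdict.UNSURE;
        }
        return "en".equals(languageCode) ? Verdict.ENGLISH : Verdict.OTHER;
    }

    /** Rule for a result that carries a discrete level; only HIGH counts as reliable (assumption). */
    static Verdict verdict(String languageCode, Confidence confidence) {
        return verdict(languageCode, confidence == Confidence.HIGH);
    }

    public static void main(String[] args) {
        System.out.println(verdict("en", true));
        System.out.println(verdict("pl", Confidence.HIGH));
        System.out.println(verdict("pl", Confidence.LOW));
    }
}
```

Keeping a separate UNSURE outcome lets the calling code choose what to do with doubtful detections, for example showing a warning instead of blocking the save.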
The three solutions described above (Language Detection API, Apache Tika and the Optimaize language detector) were all tested using the same set of English sentences (230 lines) and Polish sentences (297 lines). The results of the evaluation are presented in the table below.
I provided the datasets along with my testing project in my GitHub repo. If you would like to try it out with your own data, just paste it into the en.txt or pl.txt resource files, or add a new one.
The only way I was able to get a wrong detection from the Language Detection API was to input a very short sentence made up of words that were not clearly from one language (e.g. “I am”, whereas “I am Joanna” worked fine).
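Because very short inputs are where detection breaks down, one option is to skip the check altogether below a certain length. The sketch below shows that idea. It is an illustration and not what WeJit actually does: the class name, the word threshold and the plain text-to-language-code function standing in for the detector are all my assumptions.

```java
import java.util.function.Function;

/** Sketch of a save-time check that does not judge very short texts (hypothetical names). */
final class SaveTimeLanguageGuard {

    private final Function<String, String> detector; // maps text to a language code, e.g. "en"
    private final int minWords;                      // texts with fewer words are not checked

    SaveTimeLanguageGuard(Function<String, String> detector, int minWords) {
        if (minWords < 1) {
            throw new IllegalArgumentException("minWords must be at least 1");
        }
        this.detector = detector;
        this.minWords = minWords;
    }

    /** Returns true when the text may be saved. */
    boolean allows(String text) {
        String trimmed = text == null ? "" : text.trim();
        if (trimmed.isEmpty()) {
            return true; // nothing to check
        }
        int words = trimmed.split("\\s+").length;
        if (words < minWords) {
            return true; // too short for a trustworthy detection, so let it through
        }
        return "en".equals(detector.apply(trimmed));
    }

    public static void main(String[] args) {
        // Stand-in detector that always answers "pl", used only to show the length rule.
        SaveTimeLanguageGuard guard = new SaveTimeLanguageGuard(text -> "pl", 3);
        System.out.println(guard.allows("I am"));
        System.out.println(guard.allows("I am Joanna"));
    }
}
```

The trade-off is that short Polish entries slip through unchecked, so the threshold should come from measurements on your own data rather than from a guess.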
As can be seen, the results obtained by the Language Detection API are better than those of the other two libraries. If using a third-party hosted service is possible in your case (taking into account privacy and pricing issues), you should probably go with that. On the other hand, if you need a local solution with reasonable accuracy, you should stick to Apache Tika.
Summing up, the language detection task, which initially seemed very challenging, turned out to be relatively easy to solve using ready-made libraries and APIs. The simple test framework provided here for validating language detection solutions allowed me to pick the most accurate one for our use case. If you need a good language detection solution for your project, I encourage you to use my test script as a starting point for making your choice. Good luck!