Automated Ontology Generation, Part 3: Software Toolkits and SDKs
In the previous posts (part 1, part 2) I discussed doing NLP (Natural Language Processing) on Upwork's set of user profiles and job posts for automated ontology generation. Here I'll talk about the software we used, the quirks we found, and the rocks we stumbled over. We decided to implement the software in Java. Java is definitely not the default choice for an NLP implementation. In this case, though, it is appropriate: the software we are writing (and, mostly, using) doesn't perform any matrix operations, and pretty much every software developer at Upwork can read and write Java code. Java also provides easy and convenient access to multithreading, enabling parallel processing.
Text analysis toolkit
We at Upwork use CoreNLP. I'd like to begin by saying that CoreNLP is a great toolkit. Without it, doing NLP in Java would've been dramatically more difficult. It's free and open source, which is enormously helpful. That being said, two things caused us serious pain: a lack of documentation for many non-trivial features and an apparent lack of testing at scale. Keep reading for the gory details.
We used POS (Part Of Speech) tagging with the MaxentTagger class as a baseline, the simplest case we could later improve upon. The class is thread-safe, and our application processed approximately 60K documents per 10 minutes across multiple threads; this is the baseline we will compare the other software against. As with any parallel processing task, you have to make sure the number of threads is appropriate for the hardware you are running on. Parsing consumes a lot of CPU cycles, so we set the number of parsing threads to the number of cores minus one.

Then we tried CoreNLP's Simple API to add dependency parsing. We liked its advantages and were unaffected by its disadvantages. It turns out dependency parsing with the Simple API is about three times slower than POS tagging alone with MaxentTagger right out of the gate: it processed only 20K documents in 10 minutes. We then discovered that the Simple API slows down progressively as it processes more documents. By the end of the first hour it was spending more than 10 seconds on a single document. That level of performance was clearly unacceptable, as Upwork has many hundreds of thousands of documents to process.
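As an aside on the thread sizing above, here is a minimal sketch; the class and method names are ours, not CoreNLP's:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParserPool {
    /** Parsing is CPU-bound, so reserve one core for the reader thread. */
    public static int parsingThreads() {
        return Math.max(1, Runtime.getRuntime().availableProcessors() - 1);
    }

    /** Fixed-size pool: more threads than cores would only add contention. */
    public static ExecutorService newParserPool() {
        return Executors.newFixedThreadPool(parsingThreads());
    }
}
```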
We moved on to the standard API. Its performance was initially higher than the Simple API's: 30K documents per 10 minutes. The performance also fell with time, but luckily the degradation wasn't as dramatic: after a few hours the application was still processing ~15K documents per 10 minutes on a single instance. We followed all the guidelines published at https://stanfordnlp.github.io/CoreNLP/memory-time.html. We also discovered that NER (Named Entity Recognition) is not an option at Upwork's scale: with NER configured as a parsing pipeline option, we managed to process only 5K documents in 10 minutes. Due to business requirements, we need to process at least 20K documents per 10 minutes on the same hardware used to obtain the results presented here.

The next obstacle, thankfully no longer performance-related, was the realization that we needed the SemanticGraph class to analyse dependencies (discover compound terms). With the Simple API dependency checking was easy and straightforward, which can't be said about the SemanticGraph one has to use with the standard API. The problem is the total lack of examples (at least we were unable to locate any) explaining the results one gets from the class's methods. I am not talking about data types here, but the content of the collections and the links between the returns of various methods. For example, there is a method getLeafVertices() returning a set of IndexedWords, but there is zero explanation of what these IndexedWords are. I suppose if you are an NLP scientist and text analysis is what you do day in, day out, you know these things by heart. But if you want to apply text analysis to a business problem and NLP isn't your bread and butter, the output isn't obvious. As far as usage goes, you are basically on your own. To help future explorers, here's the shape of the code handling compound nouns in our application:
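A minimal sketch of that logic, using CoreNLP's SemanticGraph and SemanticGraphEdge APIs (the wrapper class and method names are illustrative):

```java
import edu.stanford.nlp.ling.IndexedWord;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;

public class CompoundFinder {
    /**
     * Returns the governor of a "compound" relation for the given
     * dependent, or null if the word is not part of a compound.
     */
    public static IndexedWord compoundGovernor(SemanticGraph graph, IndexedWord dependent) {
        // A compound edge points from governor to dependent, so the
        // incoming edges of the dependent are the natural place to look.
        for (SemanticGraphEdge edge : graph.incomingEdgeList(dependent)) {
            if ("compound".equals(edge.getRelation().getShortName())) {
                return edge.getGovernor();
            }
        }
        // Belt and suspenders: also scan outgoing edges, since we found
        // no documentation confirming incoming edges alone are sufficient.
        for (SemanticGraphEdge edge : graph.outgoingEdgeList(dependent)) {
            if ("compound".equals(edge.getRelation().getShortName())) {
                return edge.getGovernor();
            }
        }
        return null;
    }
}
```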
The code looks for the governor of a compound relation (if any) given a dependent. It implements a "belt and suspenders" strategy: it checks both incoming and outgoing edges, because we couldn't locate any documentation explaining whether one group of edges is enough or under what circumstances both need to be checked.
Database / storage
We used the tried-and-true HSQLDB (an embeddable RDBMS) for intermediate and final storage of the results. It can be embedded or run as a server over the same data files, you can switch from in-memory tables to on-disk tables with one keyword, and it's fast, reliable, and well documented. We also discovered that the time the application spends writing to the database and doing database-related calculations is negligible compared to the time required to parse the documents: in our environment, write time was about 10%-20% of parse time. We also didn't use any HSQLDB-specific features, so it can be replaced with any other relational database without updating the code.
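For illustration, here is a minimal sketch of the embedded setup, assuming the HSQLDB JDBC driver is on the classpath (the table and file names are illustrative):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class TermStore {
    public static void main(String[] args) throws Exception {
        // Embedded mode: the database lives in local files, no server process.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hsqldb:file:termdb", "SA", "");
             Statement st = conn.createStatement()) {
            // CACHED tables live on disk; swap the keyword to MEMORY and
            // the same table is held entirely in RAM.
            st.execute("CREATE CACHED TABLE IF NOT EXISTS terms ("
                     + "term VARCHAR(256) PRIMARY KEY, freq BIGINT)");
        }
    }
}
```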
All documents (user profiles, job posts, catalog projects, etc.) at Upwork must be written in English. However, we discovered that approximately 1% of documents are either written in other languages, too short, or just junk. To filter out these documents we used Language Detector, which works surprisingly well. There's a limited number of languages available, but since our main goal was separating English from everything else, that doesn't bother us at all.
The Application Itself
We designed the application as a set of share-nothing executions. We partition the source data at the query level: each execution retrieves only the subset of source documents allocated to it. For example, if each source document has a unique identifier in the range R and we want to run 10 instances simultaneously, then each instance gets documents in one of the R/10 ranges, assuming the identifiers are evenly distributed. The first partition gets the range 0 to R/10, the second partition gets the range R/10 to 2*R/10, and so on.
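A sketch of the range computation, assuming identifiers run from 0 (inclusive) to R (exclusive); the class name is illustrative:

```java
public class Partitioner {
    /**
     * Returns [lo, hi) identifier bounds for one of n instances, assuming
     * identifiers are roughly evenly distributed in [0, r).
     */
    public static long[] range(long r, int n, int instance) {
        long lo = r * instance / n;        // multiply first to avoid rounding drift
        long hi = r * (instance + 1) / n;  // last instance's hi is exactly r
        return new long[] {lo, hi};
    }
}
```

Each instance then restricts its source query to its own slice, e.g. `WHERE id >= lo AND id < hi`, so consecutive partitions tile the identifier space with no gaps or overlaps.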
Each application instance has one thread reading the source documents, filtering out non-English and junk documents, and putting the remaining documents into a limited-capacity queue for the parsing threads. Each parsing thread also stores the terms in the HSQLDB database, incrementing running aggregates along the way.
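A stripped-down sketch of that reader/parser pipeline, with the actual language filtering and parsing replaced by stand-ins (all names here are illustrative):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class Pipeline {
    // Distinct sentinel object telling each parser thread to stop.
    private static final String POISON = new String("poison");

    /** One reader feeds a bounded queue; nParsers drain it. Returns the parse count. */
    public static int run(Iterable<String> docs, int nParsers) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100); // limited capacity
        AtomicInteger parsed = new AtomicInteger();

        Thread[] parsers = new Thread[nParsers];
        for (int i = 0; i < nParsers; i++) {
            parsers[i] = new Thread(() -> {
                try {
                    for (String doc; (doc = queue.take()) != POISON; ) {
                        parsed.incrementAndGet(); // parse + store terms would go here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            parsers[i].start();
        }

        for (String doc : docs) {
            if (!doc.isEmpty()) {  // stand-in for language/junk filtering
                queue.put(doc);    // blocks when the parsers fall behind
            }
        }
        for (int i = 0; i < nParsers; i++) queue.put(POISON); // one pill per parser
        for (Thread t : parsers) t.join();
        return parsed.get();
    }
}
```

The bounded queue is what keeps the single reader from outrunning the parsers: `put` blocks once the queue is full, so memory use stays flat regardless of how large the document set is.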
The only other consideration worth mentioning is that we wrote the SQL that stores the terms and performs all calculations in such a way as to avoid deadlocks between threads and between application instances.
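One standard way to achieve this, shown here as a sketch rather than our exact SQL, is to apply each batch of term updates in a single global order, so every writer acquires row locks in the same sequence:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DeadlockSafeBatch {
    /**
     * Two transactions that update term rows A then B and B then A can
     * deadlock. Sorting every batch by its key before issuing the
     * UPDATE statements makes all writers lock rows in the same order.
     */
    public static List<String> lockOrder(List<String> terms) {
        List<String> ordered = new ArrayList<>(terms);
        Collections.sort(ordered);
        return ordered;
    }
}
```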
The framework described in these three posts (part 1, part 2) sets us up for a very efficient start. We still need ontologists to review the results before updating the ontology, but the turnaround for updates is now much shorter and requires significantly less effort.