Highlights from spaCy IRL

Arne Lapõnin
6 min read · Jul 11, 2019


Me at the beautiful Heimathafen Neukölln

Since January, I have been doing an internship at Océ, a Netherlands-based company that develops and sells industrial-grade printers. I have been exploring records written by the technicians who maintain the machines. These records contain unstructured text, so extracting information from them requires natural language processing (NLP). Once I started exploring the available NLP tools, I was immediately drawn to spaCy due to its ease of use, strong community support, and understandable documentation. When I noticed that the creators of this wonderful library at Explosion were organising a conference in Berlin, I bought a ticket straight away.

spaCy IRL focused on two themes: the applications of NLP and spaCy in industry, and the future of spaCy.

Developments from the community

Sebastian Ruder has been at the forefront of applying transfer learning to NLP. In essence, transfer learning is about using knowledge from one domain to accomplish tasks in another. It is a natural fit for NLP, since many NLP tasks share common knowledge about language, and approaches such as ULMFiT, ELMo, the OpenAI Transformer, and BERT allow models to be pre-trained to capture deeper syntactic dependencies, hierarchical relations, and even sentiment. Language modelling as a pre-training task can improve the performance of downstream tasks such as text classification, translation, question answering, and coreference resolution. Ruder kept emphasizing the need to share these pre-trained language models through hubs, author-released checkpoints, or third-party libraries. He illustrated the necessity of sharing by bringing up a paper that points out that training a language model from scratch can produce more CO2 than a car does during its lifetime.

Ruder was followed by Giannis Daras, who gave an overview of his research on improving the architecture of language models to make them faster to train and more usable in real-world situations. I must admit that I am not knowledgeable enough to fully understand everything Ruder and Daras were talking about, but it is nice to see the field of NLP developing so quickly.

Sofie Van Landeghem gave an overview of a new feature she is developing for spaCy. Her work links textual mentions of real-world objects (named entities) to knowledge base concepts. The idea is to use data from Wikipedia and Wikidata to build a knowledge base, so that tagged entities such as “Ada Byron” and “Countess of Lovelace” all point to the single person Ada Lovelace. The work is complex, but the results seem promising, and the functionality should roll out with the next version of spaCy.
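The feature was still unreleased at the time, but a minimal knowledge base might look roughly like this, based on the API that later shipped in spaCy 2.2 (the entity ID is Ada Lovelace's Wikidata identifier; the frequencies, probabilities, and tiny vectors are placeholder values for illustration):

```python
import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.load("en_core_web_sm")

# Build a toy knowledge base; real entity vectors would be learned
# from Wikipedia text, and frequencies taken from corpus counts.
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)
kb.add_entity(entity="Q7259", freq=100, entity_vector=[1.0, 0.0, 0.0])

# Both aliases resolve to the same Wikidata concept, Q7259 (Ada Lovelace).
kb.add_alias(alias="Ada Lovelace", entities=["Q7259"], probabilities=[0.9])
kb.add_alias(alias="Countess of Lovelace", entities=["Q7259"], probabilities=[0.9])

print([c.entity_ for c in kb.get_candidates("Ada Lovelace")])  # ['Q7259']
```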

Great examples of the vibrancy of the spaCy community were the presentations from Guadalupe Romero and Mark Neumann. Romero, a native Spanish speaker, noticed that spaCy’s lemmatization capabilities for Spanish were not as good as the ones for English. One of the biggest drawbacks she found was that the Spanish lemmatizer could not take advantage of POS tags: the English lemmatizer can transform “meeting” into either “meeting” or “meet”, depending on whether the original word is a noun or a verb (see the short example below). Together with the creators of spaCy, she designed an approach that extends the English rule-based lemmatizer to Spanish and German. She ended up with a solution that takes advantage of POS and morphological rules and can be extended with additional rules and new languages.

Mark Neumann was working on biomedical data when he noticed that spaCy, being trained mostly on web data, is not able to handle highly specific medical text. He used annotated biomedical resources, such as the GENIA corpus and CRAFT, to create a layer on top of spaCy that can accomplish NLP tasks on scientific text with good accuracy. The tool Neumann created, scispaCy, is able to parse both web and scientific documents, achieves comparable results, and is in some ways ahead of the standard spaCy library. scispaCy already contains an implementation of named entity linking similar to the functionality Sofie Van Landeghem is working on.
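To see the POS-dependent behaviour Romero was describing, here is spaCy’s English lemmatizer at work (a minimal sketch using the small English model):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We were meeting before the meeting.")

for token in doc:
    print(f"{token.text:10} {token.pos_:6} {token.lemma_}")
# "meeting" as a VERB lemmatizes to "meet",
# while "meeting" as a NOUN stays "meeting".
```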

I was really impressed with these three presenters, especially Romero and Neumann. They saw a problem in the field and set out to solve it. Romero’s work on the Spanish lemmatizer is something every NLP practitioner can use today.

Applied NLP

Yoav Goldberg gave a presentation about the missing elements in NLP. He pointed out that even though academia is fascinated with deep learning and transfer learning, most of the industry is still using regular expressions to extract information. Goldberg is keen on bridging those two communities, and he presented some of the work he and his team have been doing. I was particularly taken by the problem of handling missing elements: how to make sure computers understand subtleties of language that humans grasp easily. For example, in the sentence “She just turned 50”, the number 50 is what is called a numeric fused-head. Currently, NLP tools aren’t able to understand that the number refers to an age. Goldberg pointed to other areas as well, such as the need for tighter human-machine collaboration in crafting NLP pipelines.
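A quick parse makes the gap visible: spaCy happily attaches the number to the verb, but nothing in the output says that 50 stands for an age (a minimal sketch; the exact dependency labels depend on the model):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("She just turned 50.")

for token in doc:
    print(token.text, token.pos_, token.dep_, "<-", token.head.text)
# "50" comes out as a bare numeral attached to "turned";
# the implicit head noun ("years", i.e. an age) is nowhere in the parse.
```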

I really appreciated that Goldberg pointed out how NLP academia and industry are currently dealing with very different problems, and that there is a need to bring them closer together. Goldberg envisions transparent and debuggable models born from human-machine cooperation. That would certainly alleviate some of the issues society has been facing with machine learning.

Peter Baumgartner presenting

Peter Baumgartner presented the lessons he has learned using NLP in industry. He gave great examples of the different types of clients a practitioner can encounter and their varying knowledge of the capabilities of NLP. All the different types are available here. He emphasized how critical it is to approach problems from a value-first perspective rather than a technology-first one. He also made great points about how project management in data science has become much more of a probabilistic process, and how important it is to keep your stakeholders informed about intermediate results, for example by presenting your experiment log.

Patrick Harrison from S&P Global is challenged daily by his clients’ interest in alternatives to traditional financial data. Nowadays more and more companies are interested in Environmental, Social, and Governance (ESG) metrics, which answer questions such as whether a company has targets to promote diversity and inclusion in its workplace, and hundreds more in this vein. These metrics are not standardised, so extracting them from promotional material, social media, and other sources takes a lot of work. It would be fascinating to see more details of that work; the company’s “100% precision and recall” guarantee seems like an insurmountable challenge.

At the end, the creators of spaCy gave an overview of how their company was born, how they earn money, and what the future directions of the library are. Some of the planned improvements made me cheer, such as static analysis of pipeline components. This could have saved me many debugging hours last week, when I was trying to use lemmas before they were generated.
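My bug was exactly the kind of ordering problem such static analysis could catch. In spaCy 2.x, the English lemmatizer relies on POS tags, so a custom component inserted before the tagger sees crude lookup lemmas instead of the POS-informed ones. A minimal sketch of the mistake (the component name and sentence are mine):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def lemma_consumer(doc):
    # Reads lemmas. If this runs before the tagger, token.pos_ is unset
    # and the lemmatizer falls back to lookup lemmas.
    print([token.lemma_ for token in doc])
    return doc

# Bug: the component is placed before the tagger in the pipeline,
# so the lemmas it consumes have not been properly generated yet.
nlp.add_pipe(lemma_consumer, name="lemma_consumer", before="tagger")
nlp("We were meeting before the meeting.")
```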

I liked that spaCy IRL was first and foremost a conference about applied NLP. I talked to a number of people who face the same kinds of problems in their everyday work as I do. Thanks to them, I got ideas for some of the issues that have been bugging me for weeks. I am thankful to the people at Explosion for organising the conference and for all their hard work on spaCy, which has been an important tool in my work.
