New Syntax API in Watson Natural Language Understanding

Pavan Tummala
Feb 8, 2019 · 4 min read

One of the fundamental tasks in the field of Natural Language Processing (NLP) involves breaking down content into its smallest possible units, understanding the meaning of each unit, and using that information to build higher order features such as Named Entity Recognition and Sentiment Analysis.
Watson Natural Language Understanding (NLU) applies a suite of such NLP tasks to support features like keyword extraction.

Today we are releasing these building blocks of language understanding as part of a new Syntax API within NLU. This is a free experimental feature with support for the English language at the moment. Support for other languages will be added in the coming months.

The Syntax API consists of four features: Tokens, Lemmas, Part of Speech and Sentence Boundaries. Let’s dive into the details of each of these features.

Tokens

Tokenization is the task of breaking text into units called tokens. A token is a sequence of characters that forms a semantically meaningful unit. Here’s an example sentence with its tokens.

"My email is abc1@example.org. What’s yours?"

Tokenization:
[My] [email] [is] [abc1@example.org] [.] [What] [’s] [yours] [?]

Applications of tokenization:
This is usually one of the first tasks performed in an NLP pipeline. Tokens can be used for part of speech tagging, dependency parsing, lemmatization and more. The quality of the higher order feature you are building will ultimately depend on how good your tokenizer for the language is.
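
To make the idea concrete, here is a minimal rule-based tokenizer sketch in Python. It is not how Watson NLU tokenizes text; the pattern and the function name tokenize are assumptions made for illustration, and they only cover the cases in the example above (email addresses, clitics such as 's, words and single punctuation marks).

```python
import re

# Illustrative pattern only (an assumption, not the NLU tokenizer).
# Alternatives are tried left to right, so the email rule wins over plain words.
TOKEN_PATTERN = re.compile(
    r"""
    [\w.+-]+@[\w-]+(?:\.[\w-]+)+   # email address kept as a single token
    | ['’]\w+                      # clitic split from its host word
    | \w+                          # ordinary word or number
    | [^\w\s]                      # any single punctuation character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the list of tokens found in text, in order."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(tokenize("My email is abc1@example.org. What's yours?"))
```

A production tokenizer has to handle far more cases (URLs, abbreviations, numbers with separators, language-specific rules), which is why tokenizer quality matters so much for everything built on top of it.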

Lemma

The words you find in a dictionary are lemmas: the base form, or root form of words. Think of it as normalizing a word.
Example:

“We are running several marketing campaigns in these markets.”

Lemmatization:
[We] [be] [run] [several] [marketing] [campaign] [in] [this] [market]

Notice that lemmatization keeps marketing as it is but reduces markets to market, resolving the subtle ambiguity between the two.

Applications of lemmatization:
Lemmatization (along with stemming) is commonly used in information retrieval systems or search engines while building the indexes. Words like documenting, engineering and communicating can be converted to their root forms (lemmas) before being added to the search index. At query time the text is normalized and compared with the index.

Other applications include building word clouds, normalizing words in different dialects (organise/organize, colour/color) and detecting spelling errors.
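
The search index use case can be sketched in a few lines of Python. This is a toy example under stated assumptions: the LEMMAS table is hand-written for the sample sentence (a real system would take lemmas from a lemmatizer), and the helper names lemmatize, build_index and search are hypothetical.

```python
from collections import defaultdict

# Hand-written lemma table for illustration only (an assumption).
LEMMAS = {
    "are": "be",
    "running": "run",
    "campaigns": "campaign",
    "these": "this",
    "markets": "market",
}


def lemmatize(word):
    """Lowercase the word and map it to its lemma if the table knows it."""
    word = word.lower()
    return LEMMAS.get(word, word)


def build_index(docs):
    """Build an inverted index from lemma to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[lemmatize(word.strip(".,?!"))].add(doc_id)
    return index


def search(index, query):
    """Normalize the query the same way, then return ids matching every term."""
    terms = [lemmatize(word.strip(".,?!")) for word in query.split()]
    hits = [index.get(term, set()) for term in terms]
    return set.intersection(*hits) if hits else set()


if __name__ == "__main__":
    docs = {
        1: "We are running several marketing campaigns in these markets.",
        2: "The market is closed.",
    }
    index = build_index(docs)
    print(search(index, "markets"))
```

Because both the documents and the query pass through the same normalization, a query for markets finds documents that only mention market, while marketing remains a separate index entry.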

Part of Speech

This is the task of tagging every token in a text with its part of speech, such as noun, verb or adjective. Watson NLU uses the Universal Parts-of-Speech scheme, which is consistent across languages. In almost all languages, certain words can mean different things depending on the context, and this is where part of speech tagging is very useful.

Example:

"I am on break. Don't break anything."

Part of Speech tagging:
[I = PRON] [am = AUX] [on = ADP] [break = NOUN] [. = PUNCT]
[Do = AUX] [n't = PART] [break = VERB] [anything = PRON] [. = PUNCT]

Notice that the two occurrences of the word break in the example above have different meanings, and part of speech tagging correctly assigns them different tags.

Applications of part of speech tagging:

Part of speech tagging has several applications. Some of the important ones include word sense disambiguation and understanding the intent of utterances within text-based and speech-based chatbots.
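
As a rough sketch of word sense disambiguation, the Python snippet below uses the tags from the break example to pick a sense. The tagged list is copied by hand from the example above rather than produced by a service call, and the SENSES table and the senses_of function are hypothetical names introduced for illustration.

```python
# (token, tag) pairs copied by hand from the example sentence.
TAGGED = [
    ("I", "PRON"), ("am", "AUX"), ("on", "ADP"), ("break", "NOUN"), (".", "PUNCT"),
    ("Do", "AUX"), ("n't", "PART"), ("break", "VERB"), ("anything", "PRON"), (".", "PUNCT"),
]

# Toy sense inventory keyed by (lemma, tag); an assumption for illustration.
SENSES = {
    ("break", "NOUN"): "a pause from work",
    ("break", "VERB"): "to damage something",
}


def senses_of(word, tagged):
    """Return one (tag, sense) pair per occurrence of word, in text order."""
    return [
        (tag, SENSES.get((token.lower(), tag), "unknown"))
        for token, tag in tagged
        if token.lower() == word
    ]


if __name__ == "__main__":
    for tag, sense in senses_of("break", TAGGED):
        print(tag, "->", sense)
```

The same surface form maps to two different senses only because the tagger distinguished the noun from the verb.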

Sentence Boundary Detection

This is the seemingly simple task of identifying the beginning and end of sentences in the text. Punctuation marks and other special characters can add complexity to this task.
Example:

"The price is $9.99. It was $19.99 last year."

Sentence boundaries:
"The price is $9.99."
"It was $19.99 last year."

Applications of sentence boundary detection:
Similar to tokenization, sentence boundary detection is an important initial step in building higher order features. For example, to determine the sentiment of a paragraph with multiple sentences, you first have to identify where individual sentences start and end.
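
A naive splitter shows why the task is only seemingly simple. The Python sketch below is an assumption-laden illustration, not the NLU implementation: it splits after a period, question mark or exclamation mark only when whitespace follows, which is enough to keep the decimal points in the prices intact. The function name split_sentences is hypothetical.

```python
import re

# Split at whitespace that directly follows sentence-ending punctuation.
# The period inside "$9.99" is followed by a digit, so it is not a boundary.
BOUNDARY = re.compile(r"(?<=[.?!])\s+")


def split_sentences(text):
    """Return the sentences in text using the naive punctuation rule above."""
    return [part for part in BOUNDARY.split(text.strip()) if part]


if __name__ == "__main__":
    for sentence in split_sentences("The price is $9.99. It was $19.99 last year."):
        print(sentence)
```

This rule breaks down quickly on abbreviations such as "Dr. Smith" or on quoted speech, which is where a trained sentence boundary detector earns its keep.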

Why use the Syntax API from NLU?

Our team continuously improves the underlying technology and quality of the API across languages, so your team can focus on the business needs instead of having to deal with the complexity of language models. You can read my previous post on effortless cloud-based natural language understanding for business here.

The Syntax API can be used in conjunction with any other NLU features such as Entities and Categories. Here’s a sample request JSON body that requests Entities and Part of Speech tagging.

{
  "text": "Be the change that you wish to see in the world. ― Mahatma Gandhi.",
  "features": {
    "entities": {},
    "syntax": {
      "sentences": false,
      "tokens": {
        "lemma": false,
        "part_of_speech": true
      }
    }
  }
}
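
If you assemble requests in code, a small helper can keep the feature flags readable. The Python sketch below only builds and serializes the body shown above; the function name build_syntax_request and its parameters are hypothetical, and sending the request (endpoint, credentials, API version) is deliberately left out because those details are not covered here.

```python
import json


def build_syntax_request(text, *, sentences=False, lemma=False,
                         part_of_speech=True, include_entities=True):
    """Build a request body that mirrors the sample JSON in this post."""
    features = {}
    if include_entities:
        # An empty object requests Entities with default options, as in the sample.
        features["entities"] = {}
    features["syntax"] = {
        "sentences": sentences,
        "tokens": {
            "lemma": lemma,
            "part_of_speech": part_of_speech,
        },
    }
    return {"text": text, "features": features}


if __name__ == "__main__":
    body = build_syntax_request(
        "Be the change that you wish to see in the world. ― Mahatma Gandhi."
    )
    # ensure_ascii=False keeps the non-ASCII dash readable in the output.
    print(json.dumps(body, indent=2, ensure_ascii=False))
```

Turning on lemmas or sentence boundaries is then a matter of passing lemma=True or sentences=True.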

Go ahead and take this free feature for a test ride. You may find the following resources useful in getting started.

Watson NLU Demo | Syntax API | Getting Started | NLU Product Page

Questions and comments are welcome. Thanks for reading.

Pavan Tummala

Written by

IBM Watson

AI Platform for the Enterprise
