30 Years of Devotion to Natural Language Processing Based on Concepts

Synced · Published in SyncedReview · Aug 15, 2017

Recently jiqizhixin.com interviewed Mr. Qiang Dong, chief scientist of Beijing YuZhi Language Understanding Technology Co. Ltd. Dong gave a detailed presentation of the company's NLP technology and demoed its YuZhi NLU platform. With HowNet, a well-known common-sense knowledge base, as its basic resource, the YuZhi NLU platform conducts its unique semantic analysis based on concepts rather than words. Processing complexity is thus reduced dramatically, and the system can easily be deployed to offline mobile or edge devices. After more than 30 years of hard work, HowNet has now come to the public as Beijing YuZhi Language Understanding Technology.

Mr. Qiang Dong

The Potential of Concept Computation

With convolutional neural networks (CNNs) and recurrent neural networks (RNNs) respectively, deep learning has reshaped computer vision and ignited natural language processing. A large number of research results and application cases can already be seen, but can the black box of deep learning really break through the bottleneck of natural language understanding? Having machines understand natural language has long been NLP's goal; Google's neural machine translation, for instance, uses a powerful encoder-decoder structure, an attention mechanism, and bidirectional LSTM networks. But does it really comprehend lexical properties and structure?

Some may think it does not matter whether the system comprehends or not; all that matters is whether it obtains good results. It is true that deep learning can achieve good results, but this processing, at the lexical level instead of the conceptual level, demands large quantities of tagged data, distributed training on GPUs, and great computing capacity. Only in this way can the trained model be complex enough to handle the mass of complex words and sentences. If a system can understand the properties and concept behind a word, then it can understand the sentence and its background knowledge at the concept level. Since one concept may represent many words, computation at the concept level will no doubt reduce computational complexity. From this point of view, YuZhi Technology's concept-based processing can help deep learning, enhance it, and bring better results.

Data Problem

Deep learning is a form of supervised learning, which needs huge amounts of tagged data. Deep learning converts meaning into vectors in a geometric space, and gradually learns complex geometric transformations that establish a mapping between two spaces. We therefore need a higher-dimensional space to capture all possible relations in the initial data, and inevitably we need large amounts of data tagging.

HowNet is a common-sense, general-domain knowledge base, so with tagging done only once, we can transfer this knowledge to other vertical tasks and scenarios. Furthermore, once tagged according to the knowledge network's framework, new vocabulary can be added to the database and exploited repeatedly. Utilizing HowNet can therefore help deep learning reduce its dependence on tagging.

Generalization Problem

Deep learning can be taken as a kind of local generalization: whenever new inputs are not identical to the data used in model training, the deep network's mapping from input to output encounters difficulties. To complete a task, we need an enormous number of samples of that task for training, and the resulting model will handle basically only that sort of task. When we improve present deep learning techniques, stacking more layers and adding more training data cannot deliver better generalization, because the scope within reach of such models is limited.

However, as the processor of the YuZhi NLU platform is based on HowNet, it possesses a very powerful generalization capability. Its conceptual processing, in the final analysis, is based on lexical sememes and their relationships (see details below), so the processing involves properties and background knowledge. We believe this can help improve generalization in deep learning. At present, simply by changing the way of processing, YuZhi Technology's Chinese word segmentation system can be directly applied to the tasks of word similarity and sentiment analysis.

Humans can adapt to a totally new, never-experienced situation with little or even no data. Abstraction and reasoning can be called the defining characteristics of human cognition. Deep learning can hardly generalize to this extent, because it is merely a mapping from input to output. Conceptual processing, by contrast, abstracts more easily to properties and reasons about the relationships of things. This type of generalization is what we should pay attention to.

Robustness Problem

Deep learning is robust to some degree. For example, neural machine translation output will not change substantially under small disturbances, but adversarial samples will change it. A deep learning model does not understand the properties and relations of its input samples; it only learns to map data through geometric transformations framed by humans, and this mapping is merely a simplified expression of the original model in our minds. So when the model is confronted by expressions it has never encoded before, robustness weakens. Conceptual processing based on HowNet enjoys better robustness, because the tree of each concept is definite: the tree form changes only when the concept changes. Random disturbance will not degrade the model's performance, nor lead to the defects exposed by adversarial samples.

In a word, the real success of deep learning is the ability to map between a sample space and an expected space under conditions of massive manual data tagging. If we can do this well, we may change every industry thoroughly, yet AI still has a long way to go to reach the human standard. YuZhi's conceptual processing based on HowNet can make up for the deficiencies of deep learning, bringing natural language processing closer to natural language understanding.

HowNet’s Structure and its Conceptual Processing

We now know from the above that conceptual processing has powerful potential; it can break many limitations of DL models in NLP. How should we convert the processing of words or sentences into conceptual processing? Based on HowNet, YuZhi expresses words or sentences as trees of sememes, and then carries out its processing. Next, we will explain the structural characteristics of HowNet, and how it describes words or concepts in tree form using sememes and relationships.

Structure of HowNet Knowledge Base

HowNet holds that knowledge is a system, one which contains relationships between concepts and relationships between the properties of concepts. Well-educated people master more concepts and more relationships between concepts and between properties of concepts. HowNet, a common-sense knowledge base, can thus be called a knowledge system: common sense is the subject of description, and relationships between concepts are built and described.

A sememe refers to the smallest basic semantic unit, one that cannot be reduced further, Mr. Qiang Dong said. For example, as a compound of many properties, “human” can be a very sophisticated concept, but we can also take it as one sememe. We suppose every concept can be decomposed into various sememes. At the same time, we also suppose a finite set of sememes from which an infinite set of concepts can be composed. As long as we can manage this finite sememe set, and utilize it to describe the relationships between concepts and properties, it will be possible to establish the knowledge system we expect.

Sememes, Mr. Qiang Dong added, are the basic units for describing HowNet. The set of sememes was summarized based on observation and statistics. For example, the Modern Chinese Dictionary uses around 2,000 Chinese characters to explain all its words and expressions. The set of sememes was established through meticulous examination of about 6,000 Chinese characters. Taking the “Event” class for instance, as many as 3,200 sememes were once extracted from Chinese characters (simple morphemes). After the necessary mergers, 1,700 sememes were derived for further classification, which finally resulted in about 800 sememes.

Note that up to this point, no polysyllabic (multi-character) Chinese words were involved. These 800-odd sememes then served as a tagging set for polysyllabic words. In the end, all 800 sememes were coded, with each sememe code expressed as a mnemonic symbol. For example, the word 「打开」 has one concept, “open something (a box) 打开一个东西(盒子)”, expressed by the sememe {open|打开}, and another concept, “turn on a light 打开一盏灯”, expressed by the sememe {TurnOn|启动}.
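To make this concrete, here is a toy lookup, in a hypothetical structure rather than HowNet's actual data format, showing how one word maps to multiple concepts, each expressed by a different sememe:

```python
# A toy word-to-sense lookup (hypothetical structure, not HowNet's format):
# one surface word, several concepts, each labeled by a different sememe.
SENSES = {
    "打开": [
        {"sememe": "open|打开",   "example": "打开一个盒子 (open a box)"},
        {"sememe": "TurnOn|启动", "example": "打开一盏灯 (turn on a light)"},
    ],
}

for sense in SENSES["打开"]:
    print(sense["sememe"], "->", sense["example"])
```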

Understanding what a sememe is, then, is not complex. The Longman English dictionary uses about 2,000 words to explain and define all of its vocabulary. HowNet similarly extracted about 2,100 sememes, which carry no ambiguity. By combining sememes and relationships, HowNet describes all concepts in a net structure.

Representation of Concepts

HowNet emphasizes the relationships between concepts and between the properties (attributes or features) of concepts. The network of these relationships is provided to the computer as a system, and we use this system to fulfill NLP tasks. In HowNet, a concept, that is, one sense of a word, is defined in a tree structure with sememe(s) and relationship(s).

For example, “hospital” is defined as DEF={InstitutePlace|场所:domain={medical|医},{doctor|医治:content={disease|疾病},location={~}}}. We can see from the definition that the word is described by its attributes in sememe form, and is structured by the relationships between those attributes in a hierarchy.
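As an illustration, the DEF above can be read as the following tree; this nested layout is our own rendering of the KDML string, not an official HowNet data structure:

```python
# Our reading of the "hospital" DEF as a nested structure, for illustration.
hospital = {
    "sememe": "InstitutePlace|场所",   # categorical (head) sememe
    "domain": "medical|医",
    "event": {
        "sememe": "doctor|医治",       # what happens at this place
        "content": "disease|疾病",
        "location": "~",               # "~" points back to the concept itself
    },
}
```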

The whole knowledge network is a structured conceptual system based on sememes. A sememe is an atomic semantic unit; a complicated concept is constructed from basic concepts and the relationships among them. The concept-defining language used by HowNet is called KDML (Knowledge Database Markup Language). This markup language solves the problem of representing the embedded structure of a concept. It is acknowledged that concepts and sememes are much more stable than words. Deep learning mostly works on words, and its popular word representation method is word embedding, typically word2vec. In DL, whether we use word2vec, weakly supervised pre-training such as autoencoding, or end-to-end supervision, the computational complexity and cost are far greater than computation on concepts.

How to Compute Concepts

How should we use HowNet to implement the tasks of word segmentation, relevancy computation, sentiment analysis, named-entity recognition, and so on? The computation of concepts differs greatly from machine learning. As similar words are much closer in concept space than as token words, the handling of concepts is much simpler. Generally, ML can be seen as a mapping from an input space to an output space, while in concept computation based on HowNet, the input space is first mapped to concepts, and the concepts are then mapped to the output space. A sample in concept space takes a definite standard form, which places it closer to related concepts. Now let's discuss the methods for the above-mentioned tasks.
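First, though, a minimal sketch of the two-stage mapping just described; the `WORD_TO_CONCEPT` lookup is a hypothetical stand-in for HowNet's word-to-concept mapping, not its real contents:

```python
# Two-stage mapping: input words are first normalized to concept identifiers,
# and downstream processing then operates on concepts instead of tokens.
WORD_TO_CONCEPT = {
    "doctor": "human:occupation=medical",
    "physician": "human:occupation=medical",   # different word, same concept
    "hospital": "InstitutePlace:domain=medical",
}

def to_concepts(tokens):
    """Map surface tokens to concepts; unknown tokens pass through unchanged."""
    return [WORD_TO_CONCEPT.get(t, t) for t in tokens]

# Many surface words collapse onto one concept, shrinking the input space:
print(to_concepts(["doctor", "physician"]))
```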

Word Segmentation

In Chinese word segmentation, we encounter two kinds of ambiguity: (1) combination ambiguity, i.e. whether two adjacent characters should be combined; (2) intersecting ambiguity. For example, the machine may segment “提高人民生活水平” as 提/高人/民生/活/水平, whereas the correct segmentation is 提高/人民/生活/水平. This phrase is an example of intersecting ambiguity.

However, with a common algorithm this is just a piece of cake, because the case can easily be handled by the simple principle of minimal segmentation. So this, in fact, is a type of so-called pseudo-ambiguity. The real difficulty lies in how to make the right choice according to the word list. In Chinese, words are frequently formed by combining characters; English presents similar cases, as multi-word expressions (MWEs) are likewise combined from words. Mr. Qiang Dong points out that the correct conception should be “combination” rather than “segmentation”: in other words, the machine should learn which character or word should be combined with its right neighbor(s).

In ML, segmentation uses CRFs, but the features of a traditional CRF had to be designed by humans, so a large amount of labor-intensive feature engineering was needed. In recent years, DL has brought brand-new solutions to many research problems. In Chinese segmentation, the neural network (NN) based method usually uses a “word vector + bidirectional LSTM + CRF” model, letting the NN learn features and reducing hand-coding to a minimum. This technology generally contains three processing layers.

First, the embedding layer represents discrete characters as vectors. Then the feature layer uses forward and backward LSTMs, capturing temporal dependencies while extracting useful textual features. Finally, the inference layer uses a CRF to segment words based on the extracted features. This model needs to express complex and varied words in vector form; is it possible to use HowNet to represent words before the next computation?
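For reference, a compact sketch of such a “character embedding + bidirectional LSTM + CRF” tagger (using BMES segmentation tags) could look as follows; it assumes the third-party pytorch-crf package, and all layer sizes are illustrative:

```python
# A minimal sketch of the three-layer segmentation model described above.
# Requires: pip install torch pytorch-crf
import torch.nn as nn
from torchcrf import CRF

class BiLSTMCRFSegmenter(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_tags=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)           # embedding layer
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2,
                            bidirectional=True, batch_first=True)  # feature layer
        self.proj = nn.Linear(hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)                 # inference layer

    def loss(self, chars, tags):
        emissions = self.proj(self.lstm(self.embed(chars))[0])
        return -self.crf(emissions, tags)    # negative log-likelihood

    def decode(self, chars):
        emissions = self.proj(self.lstm(self.embed(chars))[0])
        return self.crf.decode(emissions)    # best BMES tag sequence per sentence
```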

As mentioned before, Chinese word segmentation can actually be regarded as already complete, since each character in the text stands separated. The rest of the task is to combine, either into MWEs or into phrases. Mr. Qiang Dong said that for Chinese, HowNet can be taken as a word list.

First of all, we check whether the characters in the text match any combinations in the HowNet word list, and whether there is any ambiguity in the matching. We then keep all the possible ambiguous combinations and put them into the sentence or context for computation. Since every word and expression has its corresponding concept(s), we can determine whether the combinations form proper semantic collocations. When a proper collocation is found, the combination is settled. If no proper semantic collocation can be found, the next possibility is tried. The iteration over the whole sentence continues until all the proper semantic combinations have been settled.
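A hedged sketch of this combination-based procedure is given below; `WORDLIST` and `is_proper_collocation` are hypothetical stand-ins for HowNet's word list and its concept-level collocation test, and the greedy settlement strategy is ours for illustration, not necessarily YuZhi's:

```python
# Enumerate all word-list matches, then settle combinations left to right,
# preferring the longest combination whose collocation is judged proper.
WORDLIST = {"提高", "人民", "生活", "水平", "高人", "民生", "活", "提", "水", "平"}

def candidate_combinations(sentence, max_len=4):
    """All (start, end, word) spans whose characters combine into a listed word."""
    spans = []
    for i in range(len(sentence)):
        for j in range(i + 1, min(i + max_len, len(sentence)) + 1):
            if sentence[i:j] in WORDLIST:
                spans.append((i, j, sentence[i:j]))
    return spans

def is_proper_collocation(words):
    """Placeholder: in YuZhi's system this would check, concept by concept,
    whether the words' sememe trees form a proper semantic collocation."""
    return True  # hypothetical stub

def segment(sentence):
    result, i = [], 0
    while i < len(sentence):
        matches = [s for s in candidate_combinations(sentence) if s[0] == i]
        start, end, word = max(matches, key=lambda s: s[1]) if matches else (i, i + 1, sentence[i])
        if not is_proper_collocation(result + [word]):
            end, word = start + 1, sentence[start]  # fall back to a single character
        result.append(word)
        i = end
    return result

print(segment("提高人民生活水平"))  # ['提高', '人民', '生活', '水平']
```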

Computation of Similarity and Relevancy

YuZhi Technology has a special tool for relevancy computation, called the “Inference Machine”. In HowNet, the first type of relevancy among words and expressions is found through synonymy, synonym classes, antonyms, and converses; this type of relevancy is carried by the meanings themselves. The second type of relevancy rests in some way on common sense, such as that between “bank” and “fishing”. YuZhi Technology uses the “Inference Machine” to handle this type of relevancy.

First we try to find similar concepts along the corresponding sememe trees, then use the sememes to describe their possible relevancy. HowNet has its own interpreter for its KDML. HowNet does not use the bag-of-words mechanism; it uses a concept-based tool called the “Sense-Colony-Tester”. ML generally computes words' similarity rather than their relevancy: it considers the distribution of words and assumes that words in similar contexts are similar in meaning, so the semantic similarity between two words can be directly converted into a distance in vector space. However, ML methods rarely have algorithms to compute relevancy among words. It is difficult for those methods to find logical relations and dependency relations, and hence difficult to use relevancy in disambiguation.
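For contrast, the ML-style similarity just described reduces to a distance between context-distribution vectors, as in this toy example; the vectors are made up, and the score says nothing about common-sense relevancy:

```python
# Distributional similarity as a vector-space distance (toy embeddings).
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

bank = np.array([0.8, 0.1, 0.3])      # hypothetical embedding
fishing = np.array([0.2, 0.9, 0.4])   # hypothetical embedding
print(cosine(bank, fishing))  # a similarity score, but no notion of relevancy
```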

When Qiang Dong talked about YuZhi's similarity testing, he said: “If we insist on a similarity test between ‘doctor’ and ‘walk’, we will certainly find a very low similarity between the two words. This is because they do not share the same tree in the hierarchy.” Now let's take words of the same semantic class, e.g. ‘neurologist’ and ‘doctor’.

First of all, they share the same sememe, i.e. ‘human’. It is important that they share the same sememe, especially the same categorical sememe. Then we may check and compare the remaining sememes in the definitions. The two words, or concepts, share a few sememes, but they are by no means synonymous; the complexity of the latter is much higher than that of the former. Their definitions are as follows:

‘doctor’ :
DEF={human|人:HostOf={Occupation|职位},domain={medical|医},{doctor|医治:agent={~}}}

‘neurologist’ :
DEF={human|人:HostOf={Occupation|职位},domain={medical|医},{doctor|医治:agent={~},content={disease|疾病:scope={part|部件:PartPosition={nerve|络},whole={AnimalHuman|动物}}}}}

Here we can clearly see the embedded structure of concept definitions in HowNet. When doing similarity testing, we first check the distance between the two concepts' common categorical sememe and their common superordinate node, if they have one; the weighting depends on the distances.
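A toy version of such a similarity test, assuming the DEFs have already been parsed into a head (categorical) sememe plus a set of remaining sememes, might weight the shared head heavily and then score the overlap of the rest; this is an illustration, not HowNet's actual algorithm:

```python
# Sememe-based similarity sketch over hand-parsed versions of the DEFs above.
doctor = {"head": "human|人",
          "sememes": {"human|人", "Occupation|职位", "medical|医", "doctor|医治"}}
neurologist = {"head": "human|人",
               "sememes": {"human|人", "Occupation|职位", "medical|医",
                           "doctor|医治", "disease|疾病", "part|部件",
                           "nerve|络", "AnimalHuman|动物"}}

def similarity(a, b, head_weight=0.5):
    """Weight the shared categorical sememe heavily, then score the overlap
    of the remaining sememes (a Jaccard-style overlap, for illustration)."""
    head = head_weight if a["head"] == b["head"] else 0.0
    rest_a, rest_b = a["sememes"] - {a["head"]}, b["sememes"] - {b["head"]}
    overlap = len(rest_a & rest_b) / len(rest_a | rest_b) if rest_a | rest_b else 0.0
    return head + (1 - head_weight) * overlap

print(round(similarity(doctor, neurologist), 3))  # high, but well below 1.0
```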

The scheme of representing concepts in sememe trees contributes definitely to multilingual and cross-language processing, for similarity computing with HowNet is based on concepts instead of words.

Future

NLP’s goal is to achieve natural language understanding. YuZhi Technology considers that the results of NLP mainly rely on the employment of knowledge and the ways of processing in NLU.

Extensible HowNet Knowledge Base

Presently, YuZhi Technology provides common operations based on HowNet. As HowNet scales up, YuZhi will provide more APIs. On the basis of the general domain, knowledge bases in various vertical domains will be developed, and general-domain knowledge and specific-domain knowledge will be well matched; thus the problem of so-called “artificial stupidity” may give way.

Knowledge Base Integrated with Deep Learning

The basic conception of YuZhi Technology's future development is to merge deep learning with the core advantages of HowNet's knowledge system and its strength in NLU. Linguists can definitely do something useful in front of the “black box” of deep learning: they can help computer scientists understand language and knowledge in depth. It is believed that machine understanding will achieve a breakthrough only through the joint efforts of computer scientists and linguists.

HowNet represents an important direction. HowNet itself reveals a theory and method for constructing a knowledge system. We can apply this theory and method to ground general-domain and specialized-domain knowledge graphs. The basic method is to apply HowNet's systemic rules, using sememes to describe the relations between concepts and their features. The method features interconnection and receptivity, which will help in cross-domain knowledge representation.

Over the last 30 years, HowNet has provided research tools to academic institutions, more than 200 in total. Today HowNet is coming to the public as a startup.

YuZhi Technology is one of the rare platforms that provide comprehensive NLP tools. Its APIs will be free for academic research and for students.

Users can now access APIs such as the Chinese Text Analyzer, Word Similarity Tester, and Word Relevancy Tester. These APIs show the following prominent features:

  • A comprehensive and structured knowledge system can help deep learning attain more effective results. NLU based on concept computation is very powerful at extracting logical semantic relationships in beyond-sentence text or discourse; this proved true in its applications in the intelligence field a few years ago.
  • Affordable computation: independence from GPUs and a low barrier to entry for NLP applications make small AI equipment, edge devices, and offline facilities better choices for deployment.
  • Multilingual and cross-language applications, such as a game NPC (non-player character): a virtual AI serving players of different languages.
  • A rich supply of semantic information will benefit secondary development.

Author: Si Yuan
