Word Embeddings with POSPair
Hey there, I wanted to share a major development about the POSPair model. I have developed POSPair Word Embeddings for word representation, built by modifying Word2Vec according to the POSPair model.
The POSPair model works on a basic principle. A sentence consists of a set of words. Each word is categorized into a part of speech based on its function, and each part of speech explains how a word is used in a sentence. As a result, words in a sentence are only associated with specific other words. For example, an Adjective describes a Noun, and a Verb describes the action or state of a Noun; but an Adjective doesn’t describe a Verb, or vice versa (such a relation provides wrong values).
Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers.
Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.
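To make "close proximity" a bit more concrete, here is a minimal sketch of cosine similarity, the measure commonly used to compare word vectors. The three-dimensional vectors below are made up purely for illustration; they are not output from Word2Vec or POSPair.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, invented for this example only.
toy_vectors = {
    "fox": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "over": [0.1, 0.0, 0.9],
}

sim_fox_dog = cosine_similarity(toy_vectors["fox"], toy_vectors["dog"])
sim_fox_over = cosine_similarity(toy_vectors["fox"], toy_vectors["over"])
print(sim_fox_dog, sim_fox_over)
```

With these toy numbers, "fox" ends up far closer to "dog" than to "over", which is the kind of geometry a trained embedding is meant to produce from real contexts.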
Word2vec can utilize either of two model architectures to produce a distributed representation of words: continuous bag-of-words (CBOW) or continuous skip-gram.
In the continuous bag-of-words architecture, the model predicts the current word from a window of surrounding context words. The order of context words does not influence prediction (bag-of-words assumption).
Consider a simple sentence, “the quick brown fox jumps over the lazy dog”. It can be broken into (context_window, target_word) pairs; with a context window of size 1, we get examples like ([quick, fox], brown), ([the, brown], quick), ([the, dog], lazy) and so on.
The skip-gram model aims to predict the context from the target word. It typically inverts the contexts and targets, and tries to predict each context word from its target word. Hence the task becomes to predict the context [quick, fox] given the target word ‘brown’, or [the, brown] given the target word ‘quick’, and so on.
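The CBOW pairs for the example sentence can be generated with a few lines of Python. This is only a sketch of the pair-building step, not of Word2Vec's training code; the helper name cbow_pairs is my own, and words at the sentence edges simply get a one-sided context.

```python
def cbow_pairs(tokens, window=1):
    """Build (context_words, target_word) pairs for a list of tokens."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after
        pairs.append((left + right, target))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
cbow_examples = cbow_pairs(sentence, window=1)

for context, target in cbow_examples:
    print(context, "->", target)
```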
We simplify this further by breaking down each (target, context_words) pair into (target, context) pairs such that each context consists of only one word. Hence our dataset from earlier gets transformed into pairs like (brown, quick), (brown, fox), (quick, the), (quick, brown) and so on. The skip-gram architecture weighs nearby context words more heavily than more distant context words.
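The same sentence can be turned into skip-gram style (target, context) pairs as follows. Again, this is a sketch of the pair generation only; skipgram_pairs is a name I made up, and it treats every word in the window equally, without the distance weighting mentioned above.

```python
def skipgram_pairs(tokens, window=1):
    """Build (target, context) pairs, one context word per pair."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:                           # skip the target itself
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
skipgram_examples = skipgram_pairs(sentence, window=1)

for target, context in skipgram_examples:
    print((target, context))
```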
An alternative approach is dependency-based contexts, which derive contexts from the syntactic relations the word participates in.
In the POSPair model, (target, context) pairs are created only between words whose parts of speech are related in the right way. Only words of specific parts of speech are related to each other and provide a meaningful relation. All relations are one-sided. The POSPair model tends to capture more about the word itself: what other words are functionally similar?
Rather than taking all adjacent words to the left and right of the target word, the POSPair model only takes into account the words that are actually related and provide meaningful context. Usually, a larger window size may capture more domain or topical information, but it is weaker at finding similar words.
In CBOW and skip-gram, (target, context) pairs such as (quick, brown), (quick, the), (brown, quick), (brown, jumps) fail to provide any useful meaning and context to the target word.
For example, take (quick, brown). In the sentence, the words “quick” and “brown” both specify the word “fox”. Both behave as adjectives, and an adjective specifies a noun. The possible part-of-speech relationship pair is Noun-Adjective. “quick” doesn’t specify anything about, or relate to, “brown”.
Also, (target, context) pairs such as (quick, fox), (brown, fox), (jumps, fox) provide context and relation in the wrong direction to the target word.
For example, (quick, fox) provides the context word “fox” to the target word “quick”. But the proper relation is (fox, quick), which specifies “fox” as “quick”. An Adjective specifies a Noun; a Noun doesn’t specify anything about an Adjective. It may contribute to understanding or relate in some way, but semantically it fails to describe the word.
The same goes for dependency-based contexts: some (target, context) pairs such as (australian, scientist), (discovers, scientist) also provide context in the wrong direction to the target word.
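To contrast this with the POSPair idea, here is a rough sketch of how one-sided (target, context) pairs could be built from part-of-speech tags. It is an illustration under my own simplifying assumptions, not the actual POSPair implementation: the tags are assigned by hand, only the Noun-Adjective and Noun-Verb relations described above are covered, an adjective is attached to the nearest noun on its right, and a verb to the nearest noun on its left.

```python
# Hand-assigned tags for the example sentence (a real pipeline would use a POS tagger).
tagged_sentence = [
    ("the", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
    ("jumps", "VERB"), ("over", "ADP"), ("the", "DET"), ("lazy", "ADJ"),
    ("dog", "NOUN"),
]

def pos_pairs(tagged):
    """Build one-sided (noun, describing_word) pairs from tagged tokens."""
    pairs = []
    for i, (word, tag) in enumerate(tagged):
        if tag == "ADJ":
            # Assumption: an adjective describes the nearest noun to its right.
            noun = next((w for w, t in tagged[i + 1:] if t == "NOUN"), None)
        elif tag == "VERB":
            # Assumption: a verb describes the nearest noun to its left.
            noun = next((w for w, t in reversed(tagged[:i]) if t == "NOUN"), None)
        else:
            continue
        if noun is not None:
            pairs.append((noun, word))   # the noun is the target, never the context
    return pairs

pos_examples = pos_pairs(tagged_sentence)
print(pos_examples)
```

Under these assumptions the sentence yields (fox, quick), (fox, brown), (fox, jumps) and (dog, lazy), and never pairs like (quick, brown) or (quick, fox).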
In POSPair word embeddings, all arguments can be passed to the function just as with the Word2Vec function, except for window, min_count and corpus_file. There’s a separate function for the corpus file, named txtfileinput, which takes a .txt file as input. You can also change the sg argument value; the default is 1. After training, it returns a Word2Vec model as output, so you can perform all Word2Vec operations on POSPair word embeddings.
POSPair word embeddings are available on GitHub. You can start using them via pip. You can also clone the repository and contribute to it.
Get in touch at firstname.lastname@example.org
All rights reserved © 2018 Jim Macwan