Paraphrase Detection in Tamil with and without Word2vec

Paraphrase detection is the task of deciding whether two sentences or paragraphs mean the same thing or not. There are plenty of concepts and tools ready to implement or reuse for English, but for Dravidian languages such resources are rare. That is what drew us to paraphrase detection: we chose this project almost by accident because it was new and challenging, and when you search for tools related to Dravidian languages, you find very little.
The data-set was provided by Amrita University, Coimbatore, and contains around 5000 sentence pairs. It is organized into Sub Task-1 and Sub Task-2, with three classes:
- Paraphrase[P]
- Semi paraphrase[SP]
- Not a paraphrase[NP]
Our overall project can be divided into two parts:
- Syntactic classification
- Semantic classification
Tools:
- Python [for preprocessing and custom scripting]
- R [for some machine learning algorithms]
- XGBoost [a new one for us]
Shallow Parser:
A shallow parser is a lightweight parser used in NLP (Natural Language Processing) that analyses a given sentence by finding the POS (Part-of-Speech) tags of its words. We used the shallow parser from IIIT Hyderabad, which is available only through a web interface (an offline version exists, but it does not always work).
Syntactic Classification:
So what do we mean by syntactic classification? It is classification based on sentence structure, word repetition, and so on.

For example
`A lazy fox jumps over the lazy dog` and
`A lazy dog jumps over the lazy fox`
Computed with a "Bag of Words" model, these two sentences are declared paraphrases, even though their meanings differ.
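The fox/dog example can be checked in a couple of lines of Python. A bag-of-words model keeps only word counts, so word order (and hence meaning) is lost:

```python
from collections import Counter

def bag_of_words(sentence):
    # A bag of words keeps only word counts, discarding word order.
    return Counter(sentence.lower().split())

a = "A lazy fox jumps over the lazy dog"
b = "A lazy dog jumps over the lazy fox"
print(bag_of_words(a) == bag_of_words(b))  # True, despite different meanings
```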
Our approach passes each sentence through the shallow parser and extracts the POS-tagged values from the output files. A feature file is then constructed from the shallow-parser output, using 16 selected POS tags.
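As a rough sketch of the feature construction, the tagged output can be represented as (word, tag) pairs and turned into a count vector. The tag list and the Tamil tokens below are illustrative assumptions; the paper's actual 16 selected tags are not listed here.

```python
from collections import Counter

# Illustrative subset of POS tags; the paper selects 16 tags from the
# shallow parser's tagset, which are not reproduced here.
SELECTED_TAGS = ["NN", "NNP", "VM", "JJ", "RB", "PSP", "PRP", "QC"]

def pos_feature_vector(tagged_tokens, tags=SELECTED_TAGS):
    """Count how often each selected POS tag occurs in one sentence."""
    counts = Counter(tag for _, tag in tagged_tokens)
    return [counts.get(t, 0) for t in tags]

# Hypothetical shallow-parser output for a short sentence.
tagged = [("naan", "PRP"), ("palli", "NN"), ("sendren", "VM")]
print(pos_feature_vector(tagged))  # [1, 0, 1, 0, 0, 0, 1, 0]
```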
Semantic Classification:
Semantic classification means classifying sentences based on their meaning. So how do we compare sentences by the semantics of their words? One method is to use a WordNet to capture the depth of a word (but Tamil does not have a proper WordNet, and building one is a very complex process). Another is word2vec, a neural-network algorithm that represents words as mathematical vectors and measures the semantic distance between them. Other algorithms can be chosen depending on the environment, data-set, performance requirements, and so on.
KING − MAN + WOMAN = QUEEN?
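The famous analogy above can be illustrated with toy vectors. These values are hand-crafted purely for illustration; real word2vec vectors are learned from a large corpus.

```python
import numpy as np

# Hand-crafted toy "word vectors" for illustration only.
vecs = {
    "king":  np.array([0.9, 0.9, 0.1, 0.2]),
    "queen": np.array([0.9, 0.1, 0.9, 0.2]),
    "man":   np.array([0.1, 0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9, 0.1]),
    "boy":   np.array([0.1, 0.8, 0.1, 0.9]),
    "girl":  np.array([0.1, 0.1, 0.8, 0.9]),
}

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land closest to queen.
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cos(vecs[w], target))
print(best)  # queen
```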
Cosine similarities are then computed from the word2vec vectors for each pair of sentences.
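One common way to get a sentence vector from word2vec is to average the word vectors and then compare the two sentences by cosine similarity. A minimal sketch, assuming toy 2-dimensional vectors (the actual trained model and its dimensionality are not specified here):

```python
import numpy as np

def sentence_vector(words, word_vecs):
    """Average the word vectors of a sentence (a common word2vec baseline)."""
    return np.mean([word_vecs[w] for w in words if w in word_vecs], axis=0)

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors for illustration; real values come from a trained model.
word_vecs = {
    "dog":  np.array([0.9, 0.1]),
    "cat":  np.array([0.8, 0.3]),
    "runs": np.array([0.1, 0.9]),
}
s1 = sentence_vector(["dog", "runs"], word_vecs)
s2 = sentence_vector(["cat", "runs"], word_vecs)
print(round(cosine_similarity(s1, s2), 3))  # 0.99
```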


88.0% and 65.60%?:
After hearing about XGBoost [eXtreme Gradient Boosting], which performs very well and gives very good accuracy, I gave it a shot (see the results below).
Preprocessing:
Preprocessing is a step that takes place before applying the machine learning algorithm. Is it an important step? Yes, because it also helps in fine-tuning the accuracy.
About the images:
These are the cosine-similarity plots. Around the Y value of 1000 there is a clear separation in the plotted data between P and NP [and SP in the second image]. This indicates that binary classification has an advantage.
Some common preprocessing steps are:
- Word Tokenization
- Punctuation Removing
- Stemming or Lemmatization
- Sentence Tokenization
- Stop-words Removal
Only the first two of these were used in our paper.
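The two steps used in the paper, word tokenization and punctuation removal, are one-liners in Python. A minimal sketch (this handles ASCII punctuation; Tamil-specific symbols would need an extended table):

```python
import string

def preprocess(sentence):
    # Remove ASCII punctuation, then tokenize on whitespace.
    cleaned = sentence.translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

print(preprocess("Hello, world! This is a test."))
# ['Hello', 'world', 'This', 'is', 'a', 'test']
```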
Evaluation metrics:
Evaluation metrics in machine learning answer the question, "How good is your algorithm?"
Some of the evaluation metrics are:
- F1-score (used in our paper)
- Multiclass log loss
- Average Precision for binary classification
- Mutual information

# these values were generated using XGBoost

Task 1 confusion matrix:
[[207  48]
 [ 27 343]]

Task 2 confusion matrix:
[[180  37  45]
 [ 90  51  92]
 [ 13  24 343]]
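As a sanity check, the headline numbers can be recomputed directly from these confusion matrices. A minimal sketch using NumPy, assuming rows are true classes and columns are predicted classes (accuracy is the same either way):

```python
import numpy as np

# Confusion matrices reported above.
task1 = np.array([[207, 48],
                  [27, 343]])
task2 = np.array([[180, 37, 45],
                  [90, 51, 92],
                  [13, 24, 343]])

def accuracy(cm):
    # Correct predictions sit on the diagonal.
    return np.trace(cm) / cm.sum()

def macro_f1(cm):
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)   # column sums = predicted counts
    recall = tp / cm.sum(axis=1)      # row sums = true counts
    return float(np.mean(2 * precision * recall / (precision + recall)))

print(round(accuracy(task1), 3))  # 0.88
print(round(accuracy(task2), 3))  # 0.656
print(round(macro_f1(task1), 2))  # 0.87
```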
Accuracy:
From a confusion matrix, the accuracy of an algorithm can be easily calculated. Our results:
- Sub Task-1 SVM [accuracy: 0.73, F1-score: 0.72]
- Sub Task-2 SVM [accuracy: 0.59, F1-score: 0.53]
- Sub Task-1 ME [accuracy: 0.75, F1-score: 0.74]
- Sub Task-2 ME [accuracy: 0.61, F1-score: 0.56]
- Sub Task-1 GB [accuracy: 0.88, F1-score: 0.87] *
- Sub Task-2 GB [accuracy: 0.65, F1-score: 0.59] *
The first four are syntactic accuracies and the last two are semantic accuracies.
Conclusion
Even using algorithms like SVM and ME along with word2vec, the accuracy stayed under 80% (Sub Task-1) and 64% (Sub Task-2). Applications of paraphrase detection include plagiarism detection, text summarization, question-answering systems, etc.
To understand the data-set in more depth, it is best to read our paper.
The authors of this paper are R. Thangarajan (Professor), S. V. Kogilavani (Assistant Professor, SrG), A. Karthic, and Jawahar S, Kongu Engineering College.