# Finding duplicates

The final part of my second GSoC phase is about using NLP to find duplicate issues within a repository. To be honest, I’m just a learner and not very deep into this stuff. The algorithm I came up with is far from perfect, but it might be a good basis for a better GitMate in the near future.

#### Our requirements

• Cluster issues into groups of duplicates.
• Use the issue title, description and labels.
• The algorithm has to scale to thousands of issues; comparing every pair is not an option.

#### Tried approaches

1) Using word vectors

Construct word vectors over an n-dimensional vector space and group issues whose similarity exceeds a threshold.

```python
>>> def similarity(s1, s2):
...     start = time()
...     x, y = nlp(s1), nlp(s2)
...     r = x.similarity(y)
...     end = time()
...     print("time elapsed = " + str(end - start))
...     print("similarity ratio: " + str(r))
...
>>> similarity('Disable cache for tests', 'Use caching in production')
time elapsed = 0.002780914306640625
similarity ratio: 0.490852554719
>>> similarity('Disable cache for tests', 'Use cache in production')
time elapsed = 0.0010590553283691406
similarity ratio: 0.724472864444
```
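The thresholded grouping this approach relies on can be sketched without spaCy. Below is a minimal, self-contained illustration; the three-dimensional "document vectors", the greedy first-match grouping rule, and the 0.8 threshold are all made-up stand-ins for real word embeddings, not the actual GitMate implementation:

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def group_by_threshold(vectors, threshold):
    """Greedily put each vector into the first group whose representative
    (the group's first member) it is similar enough to."""
    groups = []  # each group is a list of indices into `vectors`
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine_similarity(vectors[group[0]], vec) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups

# Toy 3-dimensional vectors standing in for spaCy doc vectors
vectors = [
    [1.0, 0.1, 0.0],   # issue 0
    [0.9, 0.2, 0.1],   # issue 1: close to issue 0
    [0.0, 1.0, 0.9],   # issue 2: unrelated
]
print(group_by_threshold(vectors, threshold=0.8))  # → [[0, 1], [2]]
```

Note that greedy grouping keeps the work roughly linear in the number of groups, but the similarity call itself is still the expensive part when real embeddings are involved.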

Drawbacks
As you can see above, simply changing *caching* to *cache* shifts the similarity score considerably. And some repositories make issues follow a body template; in such cases the shared boilerplate dominates and the title stops being much of a distinguishing factor.

Calculating the similarity for every pair is not feasible performance-wise. What else could work?
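To see why pairwise comparison doesn't scale, count the comparisons: they grow quadratically with the number of issues. A back-of-the-envelope check, using 2,323 issues (the corpus size from the benchmark in the next section) and assuming roughly 1 ms per `similarity()` call, in line with the timings above:

```python
n = 2323                    # issue count from the benchmark below
pairs = n * (n - 1) // 2    # every unordered pair must be compared
seconds = pairs * 0.001     # assuming ~1 ms per similarity call

# ~2.7 million comparisons, roughly 45 minutes for a single repository
print(pairs, round(seconds / 60, 1))
```

And that cost is paid again every time a new batch of issues needs re-clustering.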

2) TF-IDF Vectorising


Construct a TF-IDF (Term Frequency — Inverse Document Frequency) matrix from the raw issue text, learning the vocabulary from the training set with a **fit** followed by a **transform** on a `TfidfVectorizer`.

```python
threshold = 0.8  # normalized after several observations

vect = TfidfVectorizer(min_df=0, ngram_range=(1, 5))
tfidf = vect.fit_transform(issue_texts)
matrix = tfidf_similarity(tfidf)

# Map issue_numbers with respective scores
results = [zip(issue_numbers, scores) for scores in matrix]

# Filter
results = [dict(filter(lambda r: r[1] >= threshold, row)) for row in results]

# Map (key=issue_number, value=result)
results = dict(zip(issue_numbers, results))
```
| Threshold | Time (in sec)     | Issue Count | Duplicates |
|-----------|-------------------|-------------|------------|
| 0.80      | 5.794461250305176 | 2323        | 44         |
| 0.85      | 5.536595821380615 | 2323        | 44         |
| 0.90      | 6.009936809539795 | 2323        | 6          |
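For readers without scikit-learn at hand, the same pipeline can be sketched in pure Python. Everything below is a toy stand-in, not the GitMate code: unigrams only, a hypothetical three-issue corpus, invented issue numbers, and an untuned threshold; sklearn's `TfidfVectorizer` additionally smooths the IDF, supports n-grams and L2-normalises each row.

```python
from collections import Counter
from math import log, sqrt

def tfidf_vectors(docs):
    """Build tf-idf vectors over a shared vocabulary (unigrams only)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    df = {t: sum(t in doc for doc in tokenized) for t in vocab}
    idf = {t: log(n / df[t]) + 1 for t in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[t] * idf[t] for t in vocab])
    return vectors

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical corpus and issue numbers
issue_texts = [
    "disable cache for tests",
    "disable cache for unit tests",
    "add dark theme support",
]
issue_numbers = [101, 102, 103]
threshold = 0.7  # toy value, not tuned

vecs = tfidf_vectors(issue_texts)
# For each issue, keep the other issues scoring above the threshold
duplicates = {
    issue_numbers[i]: [issue_numbers[j] for j in range(len(vecs))
                       if j != i and cosine(vecs[i], vecs[j]) >= threshold]
    for i in range(len(vecs))
}
print(duplicates)  # → {101: [102], 102: [101], 103: []}
```

The benchmark above is fast because the whole similarity matrix is computed with sparse matrix operations rather than one pair at a time.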

Drawbacks

While this approach is a rock-solid basis for duplicate identification, it has a critical drawback in repositories like [coala/coala](https://github.com/coala/coala) where issues follow a body template. Issues get falsely marked as similar because of the shared template text, leading to unnecessary false positives, and raising the threshold high enough to suppress them defeats the purpose.

3) Training a model

Using an ML-based algorithm, e.g. https://github.com/Glavin001/IssueBot, would be the most ideal solution to the problem. But it is a long way off and isn’t achievable in a short timeframe. In the long run, a good ML model should be used for identifying duplicate issues.