Finding duplicates.

Naveen Kumar Sangi
Jul 21, 2017

The final part of the second phase of my GSoC project is related to NLP for finding duplicate issues within a repository. To be honest, I’m just a learner and not much into this stuff. The algorithm I came up with is not even close to perfect, but it might be a good basis for a better GitMate in the near future.

Our requirements

  • Clustering issues based on duplicates.
  • Uses issue title, description and labels.
  • Algorithm has to be scalable to thousands of issues; comparing pairwise is not an option (a quick check below shows why).
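
The last point rules out brute force on its own; here’s a quick back-of-the-envelope check, using the issue count from the benchmarks later in this post:

# Pairwise comparison grows quadratically: n * (n - 1) / 2 unordered pairs
n = 2323  # issue count from the TF-IDF benchmarks below
print(n * (n - 1) // 2)  # 2697003 similarity computations for one repository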

Similar existing tools

Tried approaches

1) Using word vectors

Construct word vectors over an n-dimensional vector space and group issues based on a similarity threshold.

>>> import spacy
>>> from time import time
>>> nlp = spacy.load('en_core_web_md')  # assuming a model that ships word vectors
>>> def similarity(s1, s2):
...     start = time()
...     x, y = nlp(s1), nlp(s2)
...     r = x.similarity(y)
...     end = time()
...     print("time elapsed = " + str(end - start))
...     print("similarity ratio: " + str(r))
...
>>> similarity('Disable cache for tests', 'Use caching in production')
time elapsed = 0.002780914306640625
similarity ratio: 0.490852554719
>>> similarity('Disable cache for tests', 'Use cache in production')
time elapsed = 0.0010590553283691406
similarity ratio: 0.724472864444
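
The grouping step on top of this is straightforward; below is a minimal sketch of threshold-based grouping, assuming `nlp` is loaded as above (the 0.7 threshold is made up for illustration):

def group_duplicates(titles, threshold=0.7):
    """Naive O(n^2) grouping of issue titles by word-vector similarity."""
    docs = [nlp(title) for title in titles]
    groups = []
    for i, doc in enumerate(docs):
        # Collect every other title whose similarity crosses the threshold
        matches = [titles[j] for j, other in enumerate(docs)
                   if j != i and doc.similarity(other) >= threshold]
        groups.append((titles[i], matches))
    return groups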

Drawbacks
As the example above shows, merely changing cache to caching swings the similarity score considerably. Moreover, some issues follow a body template, in which case the title alone isn’t a strong enough signal.

Calculating the similarity for every pair is not feasible performance-wise. What else is possible?

2) TF-IDF Vectorising

Construct a TF-IDF (Term Frequency-Inverse Document Frequency) matrix from the raw issue text and learn the vocabulary from the training set with a **fit** followed by a **transform** on a `TfidfVectorizer`.


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

threshold = 0.8  # normalized after several observations
vect = TfidfVectorizer(min_df=0, ngram_range=(1, 5))
# issue_texts: raw title + description strings; issue_numbers: their numbers
tfidf = vect.fit_transform(issue_texts)
# Pairwise cosine similarity of the TF-IDF rows
matrix = cosine_similarity(tfidf)
# Map issue_numbers with respective scores
results = [zip(issue_numbers, scores) for scores in matrix]
# Filter out scores below the threshold
results = [dict(filter(lambda r: r[1] >= threshold, row))
           for row in results]
# Map (key=issue_number, value=result)
results = dict(zip(issue_numbers, results))

| Threshold | Time (in sec) | Issue Count | Duplicates |
|-----------|---------------|-------------|------------|
| 0.80 | 5.794461250305176 | 2323 | 44 |
| 0.85 | 5.536595821380615 | 2323 | 44 |
| 0.90 | 6.009936809539795 | 2323 | 6 |
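
Reading the output is then a plain dictionary lookup; here is a small usage sketch, with the issue number 1234 made up for illustration:

# Everything that scored >= threshold against issue 1234, itself included
candidates = results.get(1234, {})
duplicates = {n: score for n, score in candidates.items() if n != 1234}
print(duplicates)  # e.g. {2345: 0.83} (illustrative output)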

Drawbacks

This approach, while being a rock-solid base for duplicate identification, has a critical drawback in some repositories, like [coala/coala](https://github.com/coala/coala), where issues follow a certain body template. Falsely marking issues as similar based on their templated descriptions leads to unnecessary outcomes, and simply raising the threshold would defeat the purpose.
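
To see why, two otherwise unrelated issues that share a long body template still end up with a similarity score driven almost entirely by the template. A minimal sketch with a made-up template (real templates, like coala’s, are similarly verbose):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up issue template; the boilerplate headings dominate the body text
template = (
    "Overview Description Steps to Reproduce Actual Results "
    "Expected Results Reproducibility Additional Information "
    "System Information Operating System Python Version pip Version"
)
a = template + " crashes on startup"
b = template + " typo in the docs"

tfidf = TfidfVectorizer(min_df=0, ngram_range=(1, 5)).fit_transform([a, b])
print(cosine_similarity(tfidf)[0, 1])  # high, despite unrelated actual content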

3) Training a model

Using an ML-based algorithm, e.g. https://github.com/Glavin001/IssueBot. This would be the ideal solution to the problem, but it is a long way off and isn’t achievable in a short timeframe. In the long run, a good ML algorithm could be used to identify duplicate issues.

