Hacking similarity search with Python

bla bla
Jul 28, 2017 · 2 min read

Most people have searched for a file on their computer. On Windows, you can use the built-in file search. On Linux, you can use a tool like find or locate from the command line.

However, what would you do if you wanted to know which files are the most similar to a particular text-based file? For example, you might be looking for a configuration file whose filename and contents have both changed.

Perhaps you know the answer. I did not. I had this problem a few days ago, and after some Googling and trying out a few things like Anti-Twin, I decided that what I wanted did not exist yet. There are many options for finding copies, but few for finding the most similar files. I was actually surprised at this, as it seems like it should be a fairly common problem. Maybe my Googling is just bad, oh well…

So I created a dirty hack with my favourite dirty-hacking tool: Python. It (ab)uses a Natural Language Processing library to give a number to how similar two text files are. For the record, I know little about Python and nothing about NLP. (Right now it still has a bit of a bug in that it does everything twice *. Maybe I’ll fix it someday. Or not.) Update: fixed.

It is not really meant to be used, but maybe it will inspire a great mind to write something better than a Dirty Hack. Pretty please? It also shows what a few lines of Python code can do. Although I am still a beginner at Python, it has quickly become my favourite when I need functionality quickly. This one actually works, sort of.
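To give a rough idea of how little code the basic idea takes, here is a minimal sketch. It is not the script described above: instead of an NLP library it assumes a plain bag-of-words cosine similarity built from the standard library only, and the function names (word_counts, cosine_similarity, most_similar) are made up for this illustration.

```python
import math
import re
from collections import Counter
from pathlib import Path


def word_counts(text):
    """Lower-case the text and count each word (a crude bag of words)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity of two word-count vectors, between 0.0 and 1.0."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty file shares nothing with anything
    return dot / (norm_a * norm_b)


def most_similar(target, folder, top=5):
    """Rank the files under `folder` by similarity to `target`, best first."""
    target = Path(target)
    target_counts = word_counts(target.read_text(errors="ignore"))
    scores = []
    for path in Path(folder).rglob("*"):
        # Skip directories and the target file itself.
        if not path.is_file() or path.resolve() == target.resolve():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue  # unreadable file, just move on
        scores.append((cosine_similarity(target_counts, word_counts(text)), path))
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return scores[:top]
```

Calling most_similar("old.conf", "/etc") (both paths are placeholders) would return up to five (score, path) pairs, with the closest match first. Word counts ignore word order, so this is a blunt measure, but a renamed and lightly edited configuration file should still score well above unrelated files.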

* Has been tested on the Windows Subsystem for Linux. I had problems on Cygwin. It has not been tested on any other system.
