Hacking similarity search with Python

bla bla
2 min read · Jul 28, 2017


Most people have searched for a file on their computer. On Windows, you can use the built-in file search. On Linux, you can use a command-line tool like find or locate.

But what would you do if you wanted to know which files are most similar to a particular text-based file? For example, you might want to find a configuration file whose filename and contents have both changed.

Perhaps you know the answer. I did not. I had this problem a few days ago, and after some googling and trying out a few tools like Anti-Twin, I decided that what I wanted did not exist yet. There are many options for finding copies, but few for finding the most similar files. I was actually surprised by this, as it seems like it should be a fairly common problem. Maybe my googling is just bad, oh well…

So I created a dirty hack with my favourite dirty-hacking tool: Python. It (ab)uses a Natural Language Processing library to put a number on how similar two text files are. For the record, I know little about Python and nothing about NLP. (Right now it still has a bit of a bug in that it does everything twice *. Maybe I'll fix it someday. Or not. Update: fixed.)
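To give a feel for the idea, here is a minimal sketch of scoring text files against a target file. It is not the hack itself: instead of an NLP library it swaps in a plain bag-of-words cosine similarity built from the standard library, and the function names (`tokenize`, `cosine_similarity`, `rank_similar_files`) are made up for this illustration. Each file is compared to the target exactly once, so nothing gets done twice.

```python
import math
import re
from collections import Counter
from pathlib import Path


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"\w+", text.lower())


def cosine_similarity(text_a, text_b):
    """Return a score between 0.0 (no shared words) and 1.0 (same word counts)."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(n * n for n in counts_a.values()))
    norm_b = math.sqrt(sum(n * n for n in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty file is similar to nothing
    return dot / (norm_a * norm_b)


def rank_similar_files(target, directory):
    """Score every other file under `directory` against `target`, best first."""
    target = Path(target)
    target_text = target.read_text(errors="ignore")
    scores = []
    for path in Path(directory).rglob("*"):
        if not path.is_file() or path.resolve() == target.resolve():
            continue  # skip directories and the target itself
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue  # unreadable file, move on
        scores.append((cosine_similarity(target_text, text), path))
    return sorted(scores, key=lambda item: item[0], reverse=True)


# Hypothetical usage: show the ten files most similar to old.conf
# for score, path in rank_similar_files("old.conf", "/etc")[:10]:
#     print(f"{score:.3f}  {path}")
```

Word counts ignore word order, which is a crude measure, but for something like a renamed and slightly edited configuration file it may well be enough to push the right candidate to the top of the list.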

It is not really meant to be used, but maybe it will inspire a great mind to write something better than a Dirty Hack. Pretty please? It also shows what a few lines of Python code can do. Although I am still a beginner at Python, it has quickly become my favourite when I need functionality quickly. This one actually works, sort of.

* It has been tested on the Windows Subsystem for Linux. I had problems on Cygwin. It has not been tested on any other system.
