Simplest entropy model in source code: proving a refactoring with math

Tao-Sheng Chen
ShopBack Tech Blog
4 min read · Sep 1, 2018

--

This article introduces a simple way to prove, with math, that one of our engineers really did a talented piece of refactoring work.

The father of information theory, Claude Shannon, introduced a whole theory of information and communication that is still a foundation of today's Internet. Everything from compression to data transfer can be guided or inspired by that foundation. Shannon's studies also covered the early stages of AI and machine learning. He also connected thermodynamics to information theory (although at that time the term "information theory" apparently did not exist yet), and that opened a completely different path of thinking: "the whole world is actually all about information". This book might be a good starting point for those without a software technology background who want to learn a bit about information systems.

Entropy is a way to evaluate the "degree of disorder". Thermodynamics holds the original meaning of entropy, and Shannon leveraged entropy to make data exchange (or communication) easier to manage.

To show respect to those past geniuses, and to have an easy way to know whether our talented ShopBack Taiwan engineer (let us call him A) really did as good a job as a previous article explained (here), I'd like to check whether the chaos level of the system actually went down, a.k.a. lower entropy.

There is a lot of machine learning research on measuring entropy in text (too much, actually; for example: here, here). However, we can't evaluate source code as plain "text", since every programming language has a far more limited grammar than human language, which keeps the text-level chaos always low. We should instead evaluate the "interface" or "interaction" between functions or computing blocks (an instance of a class, a callback, or any other kind of module). In Node.js, the easiest way to identify such interactions is through imports. Let me name this the "Simplest Entropy Model".

In the Simplest Entropy Model, we use the same function Shannon did to count entropy:
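H(X) = -\sum_i P(x_i) \log_2 P(x_i)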

However, the most critical part is how we identify the xᵢ: we need to select the right granularity. As explained above, we can't use the text or the alphabet of a programming language. It can't be words either, which might be useful for a blog article or for judging whether a Twitter post makes any sense. Our model uses "imports from other modules" as the granularity.

Meaning: in this model, P(xᵢ) is the probability that a given module is imported by other modules.
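For example (with made-up numbers): if utils.js is imported by 5 out of 100 files in the repo, then P(utils) = 5/100 = 0.05, and it contributes -0.05 · log₂(0.05) ≈ 0.216 to the total entropy.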

Actually, we didn't really count the number of "links" in this simple model. It should be counted, but I am a bit lazy to do so. If you are still reading this article without falling asleep, and you feel this model is too naive and you have a much better way to do source code analysis, you might be exactly the kind of person we want to work with. We are still hiring; mail me if you want to have a further chat.

Obviously, we could simply write a script that does the following to get the entropy number from a repository (a rough sketch follows after the list):

(1) Check out a branch (or a tag).

(2) Count the import sources in every file, and also count the total number of Node.js files.

(3) Divide each module's import-link count by the total file number; each of these gives one P(xᵢ).

(4) Compute -Σ P(xᵢ) · log₂ P(xᵢ); that sum is the entropy.
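To make this concrete, here is a minimal sketch of such a script in plain Node.js. This is not the exact script we used: the file name, the regular expression for imports, the .js-only file filter, and the directory walk are all simplifying assumptions for illustration.

```javascript
// simplest-entropy.js (hypothetical name) — a minimal sketch of the model above.
// Assumptions: only relative imports between our own .js files count as "links",
// detected via a simple regex over `import ... from './x'` and `require('./x')`.
const fs = require('fs');
const path = require('path');

// Recursively collect all .js files under a directory, skipping node_modules.
function collectJsFiles(dir, files = []) {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    if (entry.name === 'node_modules' || entry.name === '.git') continue;
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) collectJsFiles(full, files);
    else if (entry.name.endsWith('.js')) files.push(full);
  }
  return files;
}

// Count how many import links point at each local module.
function countImportLinks(files) {
  const counts = new Map();
  const importRe = /(?:import\s+[^'"]*?from\s+|require\()\s*['"](\.[^'"]+)['"]/g;
  for (const file of files) {
    const src = fs.readFileSync(file, 'utf8');
    for (const match of src.matchAll(importRe)) {
      const target = path.resolve(path.dirname(file), match[1]);
      counts.set(target, (counts.get(target) || 0) + 1);
    }
  }
  return counts;
}

// Entropy = -Σ P(xi) * log2(P(xi)), with P(xi) = links(i) / total file count.
function entropy(repoDir) {
  const files = collectJsFiles(repoDir);
  const counts = countImportLinks(files);
  let h = 0;
  for (const links of counts.values()) {
    const p = links / files.length;
    h -= p * Math.log2(p);
  }
  return h;
}

console.log(entropy(process.argv[2] || '.'));
```

Check out each tag you care about (for example, git checkout 3.0.0), run node simplest-entropy.js on the repo, and compare the numbers across versions.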

For our ShopBack browser extension project, we know the engineers worked pretty hard to push features between version 1.x and 3.0.x. We also know that our talented engineer A tried to do more refactoring after the 3.0.0 release. With this model, we get a simple, high-level view of whether that refactoring really lowered the system's chaos.

The result: the entropy at the git tag 3.0.0 is 12.940505128626393, and up to that point the entropy had actually kept growing. For example, the entropy at 1.0.0-alpha.9 was only about 10.62.

However, after a long stretch of refactoring, the current master branch (roughly 3.3.0) sits at 10.53185135650777, which is much lower than the 3.0.0 release.

Hence we can simply say that A did a pretty good job of refactoring and made our product better!
