Visualization of Information Theory on Features (Part 2)

Laurae
Data Science & Design
10 min read · Aug 22, 2016

Laurae: This post is about visualizing continuous and discrete features using information theory. It was originally posted before Part 1, but for reading it comes as Part 2. No changes were made to the post, originally at Kaggle.

ololo wrote:

I used 5000+ different conditional inference forests to determine the potentially best buckets per variable

I also wonder how you did this? Is there an R/python package for that? Might the buckets be useful themselves, as features?

I used a home-made R script to create the potentially best buckets for each variable. I checked them manually one by one afterwards. There are variables that, when taken alone, can get to ~0.47 if you use their derived properties (such as their weight of evidence, etc.). This technique comes from the banking sector, specifically from credit scoring:

  • For each continuous variable, you determine “buckets” (or bins, it’s the same thing) using an algorithm of your preference that you feel suits the data best (conditional inference forests are really good at resisting overfitting, and are pretty conservative by nature); each bucket has at minimum 5% of all values
  • Each “bucket” has different properties, such as information value (IV), weight of evidence (WoE), rate of goodness (we call it positive rate or good credit rate), rate of badness (we call it negative rate or bad credit rate), and many other outputs.

In order to use multifactor dimensionality reduction, you need to discretize variables.

There are packages for discretization in R. There are many ways of discretizing continuous variables, so you must be careful about the hypotheses behind the method you choose.
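For illustration, here is a minimal sketch of this bucketing idea (not the exact script I used), assuming a data.frame df with a numeric column x and a 0/1 target column; a single conditional inference tree from partykit stands in for the forest:

```r
# Minimal sketch: propose buckets for one continuous variable with a
# conditional inference tree, enforcing the ">= 5% of all values" rule
# per bucket, then compute WoE / IV per bucket.
library(partykit)
library(dplyr)

n <- nrow(df)
tree <- ctree(factor(target) ~ x, data = df,
              control = ctree_control(minbucket = ceiling(0.05 * n)))

# Bucket membership = terminal node of the tree
df$bucket <- predict(tree, type = "node")

# Weight of evidence and information value per bucket
buckets <- df %>%
  group_by(bucket) %>%
  summarise(goods = sum(target == 1), bads = sum(target == 0)) %>%
  mutate(dist_good = goods / sum(goods),
         dist_bad  = bads / sum(bads),
         woe = log(dist_good / dist_bad),
         iv  = (dist_good - dist_bad) * woe)

sum(buckets$iv)   # total information value of the variable
```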

ololo wrote:

Might the buckets be useful themselves, as features?

I use them (although not every one). They can give a hefty boost to 0.45 very easily if you understand what they can give you. Even 0.445 is achievable using them.

ololo wrote:

@Laurae, I have some troubles trying to understand the plots. Could you please give a hint how to read them and what information is there?

The first series of graphs uses the Interaction Information value. I am using a three-variable case. If you do not know about information value, I recommend reading about the H-value (Mutual Information), then about the IG-value (Interaction Information).

But in short, it is a generalization of mutual information to more than two variables. I used three here, as it is still understandable (four looks like a black box).

IG(A; B; C), where C = the label, A = the variable, and B = the linked variable. We also have IG(A; B; C) = IG(B; A; C); C remains fixed because conditioning on C leads to conditional mutual information.

A nice Wikipedia image example:

IG(A; B; C) is the difference between I(A; B) when C is present (non-ignored) and when C is absent (ignored). Therefore, the influence of C on (A; B) is assessed, instead of a simple pairwise association.

Unlike the typical information value, interaction information can be both negative and positive. Also, its normalization is the information gain. The interpretation must be done this way:

IG(A; B; C) = I(A; B | C) - I(A; B) = “interaction minus correlation/dependency” (note “correlation”)

You can get all the required values using a multifactor dimensionality reduction combined attribute network. There is very specialized software for this (you would have to check whether R can do this, but I suspect it probably does not produce the graphs).
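If you only need the numeric values (not the visual network), here is a minimal sketch in R with the infotheo package, assuming the variables are discrete or have been discretized:

```r
# Minimal sketch: interaction information IG(A; B; C) = I(A; B | C) - I(A; B),
# computed with infotheo. Continuous variables are discretized first; the
# label is assumed to be already discrete (0/1).
library(infotheo)

interaction_information <- function(va, vb, label, nbins = 10) {
  va <- discretize(va, disc = "equalfreq", nbins = nbins)
  vb <- discretize(vb, disc = "equalfreq", nbins = nbins)
  condinformation(va, vb, label) - mutinformation(va, vb)
}

# Hypothetical usage on a data.frame df:
# interaction_information(df$v110, df$v112, df$target)
```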

Positive interaction information:

When IG(A; B; C) > 0, you have a positive interaction information: there is evidence for a non-linear interaction. More simply said, A and B together enhance the prediction of target.

Negative interaction information:

When IG(A; B; C) < 0, you have a negative interaction information: there is a redundancy of information between A and B (they provide the same information) when C is taken into account. More simply said, A and B do not give additional information for the prediction of target.

Null interaction information:

When IG(A; B; C) = 0, you have a null (or mixed) interaction information: there is evidence for either conditional independence, or a mixture of both synergy and redundancy (i.e., confusion).
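A quick synthetic sanity check of these sign conventions (a sketch with infotheo; values are in nats): an XOR-style target gives a positive interaction information, while a duplicated variable gives a negative one.

```r
# Sanity check of the sign conventions above (sketch):
library(infotheo)
set.seed(1)
a <- sample(0:1, 10000, replace = TRUE)
b <- sample(0:1, 10000, replace = TRUE)

y_xor <- as.integer(xor(a, b))   # target predictable only from A and B jointly
b_dup <- a                       # B carries exactly the same information as A
y_dup <- a                       # target fully explained by A (or B) alone

condinformation(a, b, y_xor) - mutinformation(a, b)          # > 0: synergy
condinformation(a, b_dup, y_dup) - mutinformation(a, b_dup)  # < 0: redundancy
```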

In the graphs, you have variables (vertices) and interaction information (edges between two variables). What is interesting is how the interaction information is spread, and where it is located.

To visualize all the information, there are different graph representations you can use. For MDR, the most usable representations are Fruchterman-Reingold, Kamada-Kawai, Self-Organizing Maps, and dendrograms. They represent the same data, but in different ways.
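As a rough substitute for the MDR graph view, you can sketch the same kind of network in R with igraph, assuming you have filled a symmetric matrix ii_matrix of IG(variable_i; variable_j; target) values (for example with the interaction_information() helper sketched earlier):

```r
# Sketch: draw the interaction-information network with igraph, using the
# Fruchterman-Reingold and Kamada-Kawai layouts mentioned above. Absolute
# values are taken because layout weights must be positive.
library(igraph)

g <- graph_from_adjacency_matrix(abs(ii_matrix), mode = "undirected",
                                 weighted = TRUE, diag = FALSE)
plot(g, layout = layout_with_fr(g))   # Fruchterman-Reingold
plot(g, layout = layout_with_kk(g))   # Kamada-Kawai
```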

For instance, if you take the first graph: here

You can clearly notice v110 leading a large edge that is set apart from every other variable in the original big cluster. It has IG(v110, v112, target) = -0.04% = -0.0004. It means that there is a very small redundancy of information between v110 and v112 conditionally on target. However, if you look at the dendrogram (here), you will notice v110 is in green. Green is the redundancy level (the level just before high redundancy).

When you look at v110 alone, you can see “1.98%”. It means that the Shannon entropy minus the conditional Shannon entropy is equal to 0.0198. As a formula: I(A; C) = H(A) - H(A|C), i.e. entropy minus conditional entropy, or the entropy percentage loss (a loss > 100% should not happen). You may lose a high %, but what remains evident is how it clusters variables apart from each other.
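That per-node figure can be approximated with infotheo too (a sketch; how exactly the MDR software normalizes it into a percentage is an assumption on my side):

```r
# Sketch of the per-node value: I(A; C) = H(A) - H(A|C), in nats by default
# (use natstobits() for bits). The column names are hypothetical.
library(infotheo)

a <- discretize(df$v110, disc = "equalfreq", nbins = 10)
entropy(a) - condentropy(a, df$target)   # same as mutinformation(a, df$target)
```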

Here is how you can interpret the dendrogram of a multifactor dimensionality reduction when using a combined attribute network:

Dendrogram levels:

  1. Red = high synergy
  2. Orange = synergy
  3. Gold = null
  4. Green = redundancy
  5. Blue = high redundancy

Dendrogram left-to-right interpretation:

  • Left = weakest interaction with target
  • Right = strongest interaction with target

The colors used are nearly identical in graphs and in dendrograms (a graph shows the details, hence you might lose some aggregate information). If any links are non-null, they are immediately apparent because their colors are non-gold (orange, red, green, blue).

As for the information in the graphs… it depends mainly on how you can understand and interpret the visualization. When you have ~115 variables assessed, you have over 1 million interactions to assess, and it would be painful to explain each one manually. What is important to read is, in short:

  • Visualization cutpoints (where can you make your own clusters?)
  • Non-gold colored edges (i.e., the links with the highest bivariate conditional effect; non-null interaction information -> synergy helping to predict target, or redundant information for predicting target?)
  • Are there endpoints? (variables at the outermost edges that do not allow a loop)
  • Where are the endpoints? (variables at the outermost edges and/or the furthest away from the largest visual “cluster”)
  • Where is the entropy loss localized? (highest node %, i.e. the variables with the highest univariate conditional effect)

Simply said, highest %:

  • On a node (a variable): taken alone, that variable predicts target at a higher rate than other variables with a lower % (for instance, 2% is very high)
  • On an edge (a link between two variables): taken together, the linked variables predict target better than they do separately (for instance, 2% is very high)

Cartesian product networks are another way of visualizing the data using different metrics. But they convey roughly the same underlying information.

Multifactor dimensionality reduction is used in genetics for gene-gene interactions. It is also used for assessing epistasis (gene-gene conditional interdependence/causal effect).

ToonRoge wrote:

Laurae wrote:

There are some extremely interesting categorical rules in the dataset…

For instance, assuming for 63692 cases: A = {v31 == A and target == 1} and B = {v3 == C}, you have:

  • A -> B @100% rate
  • (-A) -> B @92.62% rate
  • B -> A @57.60% rate
  • (-B) -> A @0% rate
  • A <-> B @58.98% rate

An algorithm being 100% confident on this is an interesting rule. There are also many 99% confident rules on categorical variables that imply the label we are looking for. I wonder if hardcoding all the possible rules from the data set could beat a single xgboost alone (and I think that’s potentially true when I look at the confidence level of all the rules on only categorical variables, but getting them in order requires a lot of manual entries -_-).

I don’t understand this post, but it sounds interesting. What does A -> B mean? And (-A) -> B? … When I look at the combinations of v31 and v3 I see nothing exceptional. If v31 = A, v3 = C, but this holds for both target = 0 and target = 1. There is some signal in v31 for sure, but nothing spectacular I would say.

What am I misinterpreting?

I am going to explain, line by line, the whole process of analyzing just the links I provided.

If you want to follow strictly the inference process, read the following parts separated by a line: 1 => 3 => 4 => 2 => 5 => Conclusion.

A -> B @100% rate

When you have A true (v31 == A and target == 1) then B is forcibly true. From there you know you are in a selective node v3 == C.

Inference #1: (v31 == A & target == 1) => (v3 == C)

Note: it does not tell you whether (v31 == A & target == 0) => (v3 == C) also holds. Check inference #3.

(-A) -> B @92.62% rate

When A is not true (either v31 != A or target != 1), B remains true at a 92.62% confidence rate. It means that (v3 == C) is also spread when using v31 and target as variables.

Using inference #3, you can derive inference #4: if v31 is not A, you should be able to predict (target == 1) at a potential 92.62% confidence rate.

B -> A @57.60% rate

When B is true (v3 == C), A is true at a 57.60% rate.

It means that when v3 is C, you have a higher-than-average confidence of having (v31 == A) and (target == 1).

Inference #2: there is a higher proportion of (v3 == C and target == 1) in the data set than the contrary.

(-B) -> A @0% rate

There cannot be a case where (v3 != C) leads to (v31 == A and target == 1). This was the first inference found.

Inference #3: using inference #1, we know that if (v31 == A), target remains a mobile (free) variable. However, you cannot have (v31 == A and v3 == A), nor (v31 == A and v3 == B). This confirms inference #2 (if it were not true, then inference #3 would be rejected).

A <-> B @58.98% rate

When A and B are both conditions for each other, it happens at a confidence rate of 58.98%.

It means all these three conditions together:

  • You need (v31 == A and target == 1) to have (v3 == C)
  • You need (v3 == C) to have (v31 == A and target == 1)
  • You have a confidence of 58.98% of it happening.

Conclusion: v3 == C has a good ability to segregate target. You also know that if v31 == A, you forcibly have v3 == C. The baseline 58.98% is a starting point if you assume the relation between (v31 == A and target == ???) and (v3 == C). More exactly, the complete inference tree becomes:

  • If (target == 0 and (v31 == A and v3 == C) ) then ~41% confidence of being true (target = 1 at 41%)
  • If (target == 1 and (v31 == A and v3 == C)) then ~59% confidence of being true (target = 1 at 59%)
  • If (target == 0 and (v31 != A or v3 != C)) then ~7% confidence of being true (target = 1 at 7%)
  • If (target == 1 and (v31 != A or v3 != C)) then ~93% confidence of being true (target = 1 at 93%)

i.e., when (v31 == A and v3 == C) does not hold, you have an extremely high chance of having target == 1, and when it holds you can only slightly differentiate whether target is 0 or 1 (small separation).
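For reference, a minimal sketch in base R of how these rates can be recomputed, assuming a data.frame df with columns v31, v3 and target (the A <-> B rate depends on the rule-mining tool's exact definition, so it is not reproduced here):

```r
# Sketch: recompute the rule rates and the inference-tree rates in base R.
A <- df$v31 == "A" & df$target == 1
B <- df$v3 == "C"

mean(B[A])    # A -> B     (reported as 100%)
mean(B[!A])   # (-A) -> B  (reported as 92.62%)
mean(A[B])    # B -> A     (reported as 57.60%)
mean(A[!B])   # (-B) -> A  (reported as 0%)

cond <- df$v31 == "A" & df$v3 == "C"
mean(df$target[cond] == 1)    # ~59% per the inference tree above
mean(df$target[!cond] == 1)   # ~93% per the inference tree above
```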

If you are using loose probabilities, you would use the B -> A confidence rate instead of the A <-> B confidence rate. It’s mostly up to human bias when converting (rules and their confidence rates) into probabilities.

A more manual analysis would be:

And if you are a probabilistic statistician, you would use these values in that leaf report as probabilities that you would compound with each other depending on the situation you are in, to create the final probability of target given v31 and v3.

The major issue is that a human cannot check all these variables one by one due to time constraints. For a three-way relation analysis against target, you are looking at an absurd 17K+ possibilities (edit: 51K+ possibilities for a three-way relation where all variables are mobile on each side). Imagine a four-way interaction ^^ Good thing that computers can find it for us :)

Might be of interest for some Kagglers (if they find it useful):
