Visualization of Information Theory on Features (Part 1)

Laurae
Data Science & Design
4 min read · Aug 22, 2016

Laurae: This post is about visualizing continuous and discrete features using information theory. It was originally written after Part 2, but in reading order it comes first as Part 1. Only the initial off-topic quotation was removed from the post, which was originally published on Kaggle.

Large post

Interpretation tutorial: HERE

I’m sharing some charts about the relations between variables. I used multifactor dimensionality reduction with interaction information (multi-way interactions, not two-way interactions). Very important in this case: it was run on continuous variables that I discretized in a supervised way, whenever I could discretize them (I used 5000+ different conditional inference forests to determine the potential best buckets per variable, and whether such buckets exist at all). The settings for the visualizations are the following:

  • Edge visibility threshold: -4.6e-4 (9%) where applicable. Using 100% would be pointless because every line would be drawn: see picture below
  • Node visibility threshold: 4.0e-4 (100%) — to show all values

[Picture: graph with the edge visibility threshold at 100% (link to large version)]

See here if you need to read about interaction information: Wikipedia about Interaction Information. tl;dr: a generalization of mutual information, expressed as a real-valued quantity that can be positive (association enhancement) or negative (association inhibition).
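To make the sign interpretation concrete, here is a minimal sketch of three-variable interaction information on toy data. It assumes the sign convention I(X;Y|Z) - I(X;Y), under which a positive value means the third variable enhances the association; the opposite convention also exists and flips the sign. The function names and the toy data are mine for illustration, not the tooling used for the charts.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Shannon entropy, in bits, of a list of hashable outcomes."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def interaction_information(x, y, z):
    """I(X;Y|Z) - I(X;Y), expanded into joint entropies (assumed sign convention)."""
    singles = entropy(x) + entropy(y) + entropy(z)
    pairs = (entropy(list(zip(x, y)))
             + entropy(list(zip(x, z)))
             + entropy(list(zip(y, z))))
    triple = entropy(list(zip(x, y, z)))
    return -singles + pairs - triple


# Two independent fair bits, listed exhaustively (hypothetical toy data).
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
xor_label = [a ^ b for a, b in zip(x, y)]

# XOR: x and y say nothing about each other until the label is known.
print(interaction_information(x, y, xor_label))        # positive: enhancement

# Three identical copies of one bit: the third variable is fully redundant.
bit = [0, 1]
print(interaction_information(bit, bit, bit))          # negative: inhibition
```

The XOR case gives +1 bit and the fully redundant case gives -1 bit, which matches the enhancement/inhibition reading above.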

All values are expressed as a percentage of the total. For 114 variables, you have 12882 potential ways to interact if looking at pairwise interactions (114 × 113, counting each pair in both directions). Hence, if interactions were spread evenly, you would be at ~7.8e-5 interaction on average per edge.
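The arithmetic behind these two numbers can be checked quickly. This is only a sanity check under the assumption that the 12882 figure counts ordered pairs; the variable names are illustrative.

```python
n_variables = 114

# Ordered pairs (A, B) with A != B: this reproduces the 12882 figure.
ordered_pairs = n_variables * (n_variables - 1)

# Unordered pairs, if each pair were counted only once.
unordered_pairs = ordered_pairs // 2

# Share of the total each edge would carry if interactions were uniform.
uniform_share = 1 / ordered_pairs

print(ordered_pairs, unordered_pairs, f"{uniform_share:.1e}")
# prints: 12882 6441 7.8e-05
```

Counting each pair once instead would give 6441 pairs and roughly double the per-edge baseline.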

Fruchterman-Reingold organization of variables: big1, big2

Relations as circle chart (more “human” perception of intertwining links): big. WARNING: please start at the right (v113)!

Relations using Kamada-Kawai organization: big

Relations using self-organizing maps: big

Dendrogram… because why not: big

Variables I could not discretize (because I could not find any two-way interactions with our label variable):

  • v18
  • v22 (arbitrary choice I made myself about that one)
  • v35
  • v42
  • v49
  • v54
  • v56
  • v67
  • v70
  • v77
  • v89
  • v105
  • v118
  • v120
  • v122
  • v124
  • v126

Output I got from a connected component analysis (using the thresholds):

  • Vertices: 91.0 -> some variables could not be plotted due to their uniqueness (who can help me find v50? I tried hard but I cannot find it, although it is in the inputs. It should be linked with v79 and v10, as they are both inter-linked when I look at the raw values)
  • Edges: 483.0 -> 483 links are plotted
  • Diameter: 7.0 -> no need to care about
  • Average number of neighbors: 10.6154 -> each variable has 10.6 neighbors on average
  • Density: 0.1179 -> not dense relations, very sparse
  • Centralization: 0.7487 -> big cluster
  • Heterogeneity: 1.2258 -> not homogeneous variable relations, obviously
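Two of these statistics follow directly from the vertex and edge counts, so they can be cross-checked by hand. The sketch below assumes a simple undirected graph (no self-loops, no duplicate edges); the function names are mine. Centralization and heterogeneity depend on the full degree distribution and cannot be recovered from the counts alone.

```python
def average_neighbors(n_vertices, n_edges):
    """Mean degree of a simple undirected graph: each edge touches two vertices."""
    return 2 * n_edges / n_vertices


def density(n_vertices, n_edges):
    """Fraction of all possible undirected edges that are present."""
    return 2 * n_edges / (n_vertices * (n_vertices - 1))


# Counts reported above: 91 vertices, 483 edges.
print(round(average_neighbors(91, 483), 4))  # 10.6154
print(round(density(91, 483), 4))            # 0.1179
```

Both values agree with the reported output, which suggests the analysis treated the graph as simple and undirected.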

Cartesian product network using the three-way interactions graph: Fruchterman-Reingold, Circle graph, Kamada-Kawai, Self-Organizing Map, Dendrogram

Settings:

  • Edge visibility threshold: 0.00154 (20%)
  • Node visibility threshold: 4.0e-4 (100%)

Connected components analysis:

  • Vertices: 86.0 -> a lot of nodes are missing
  • Edges: 115.0 -> 115 lines
  • Diameter: 4.0 -> to ignore
  • Average number of neighbors: 2.6744 -> each variable has 2.6744 neighbors on average
  • Density: 0.0315 -> extremely sparse
  • Centralization: 0.9332 -> extremely centralized
  • Heterogeneity: 3.309 -> not homogeneous at all

Fruchterman-Reingold:

Circle graph:

Kamada-Kawai graph:

Self Organizing Map:

Dendrogram:

N.B.: I personally thought v56 would be the center of the world of the Cartesian product network (123 outcomes), but this is not the case (v125, with 91 outcomes, took its spot).
