Published in Intuit Engineering

A New Intuit Open Source Release: RBHC — Recursive Binary Hierarchical Clustering

Intuit Software Engineer Ashwith Atluri is excited to launch RBHC (Recursive Binary Hierarchical Clustering), a Python library that uses machine learning to perform recursive binary hierarchical clustering of data. RBHC organizes clickstream data, or other data, into a hierarchy of clusters. It also provides statistics for each cluster and interactive visualizations of the clustering tree, rendered with the d3.js library and viewable in any web browser through a local web server.

RBHC logo

RBHC lets users input specially formatted data and get back a hierarchical cluster tree for analysis. By combining machine learning, comprehensive statistics for each sub-cluster, and interactive visualizations, RBHC gives anyone analyzing data, especially clickstream data, a way to derive useful inferences from it automatically.

Ashwith wrote RBHC primarily in Python. RBHC applies the K-Means algorithm to the initial dataset to create a binary partition, then uses the chi-square score to find the feature (event) most responsible for that partition. The resulting clusters are divided recursively in the same way until the cluster size reaches one or the silhouette score reaches the threshold value. Clustering results are available in several forms. Once the library has been imported and the clustering function called, RBHC returns a Python dictionary whose tree structure represents the hierarchical clusters and holds information about each cluster. The dictionary includes the following fields:

name = name of the cluster node (string); parent = name of its parent node (string); size = size of the cluster (integer); children = tree structure of the subtree (list); clusterCreated = whether clustering was successful (Boolean).
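The recursive partitioning and the dictionary it produces can be sketched roughly as follows. This is a minimal illustration built on scikit-learn, not RBHC's own code: the function names `build_tree` and `leaf_sizes`, the `min_silhouette` parameter and its 0.3 default, the child naming scheme, and the rule "stop when the split's silhouette score falls below the threshold" are all assumptions. The sketch also stops at clusters of fewer than three rows, because scikit-learn's silhouette score cannot be computed for a two-row split.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def build_tree(X, name="root", parent=None, min_silhouette=0.3, seed=0):
    """Recursively bisect the rows of X and return a nested dict.

    The dict keys mirror the fields listed above; everything else
    (names, threshold, stopping rule) is a hypothetical choice.
    """
    node = {"name": name, "parent": parent, "size": int(X.shape[0]),
            "children": [], "clusterCreated": False}

    # Too few rows, or all rows identical: nothing to split.
    if X.shape[0] < 3 or len(np.unique(X, axis=0)) < 2:
        return node

    # Binary partition with K-Means (k = 2).
    labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X)
    if len(set(labels)) < 2:
        return node

    # Assumed stopping rule: keep the split only if it is well separated.
    if silhouette_score(X, labels) < min_silhouette:
        return node

    node["clusterCreated"] = True
    for side in (0, 1):
        child = build_tree(X[labels == side], f"{name}-{side}", name,
                           min_silhouette, seed)
        node["children"].append(child)
    return node


def leaf_sizes(node):
    """Collect the sizes of all leaf clusters by walking the tree."""
    if not node["children"]:
        return [node["size"]]
    return [s for child in node["children"] for s in leaf_sizes(child)]


# Toy data: two tight groups of three points each.
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)
tree = build_tree(X)
```

On this toy input the root is split once into two groups of three points, and neither group is split again because its best two-way split scores below the assumed threshold.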

Statistics for each sub-cluster are also stored in a .json file with the following attributes:

ClusterId = identifier of a sub-cluster; L = level; G = number of clusters in that level, counted left to right; Size = size of the cluster; Primary feature cluster created by = name of the feature primarily responsible for forming this cluster; Features chi score = chi score of every feature in that cluster; Stats on cluster by each feature = statistics for each feature in this cluster; Ids = all instances that are part of the cluster, with names taken from the first column of the data file.
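To make the statistics above concrete, here is a rough sketch of how one such record could be assembled for one side of a binary partition. It scores features with scikit-learn's `chi2` function and is not RBHC's implementation: the function name `cluster_stats`, the snake_case keys, the use of per-feature means as the "stats", and the `demo-cluster` identifier are all hypothetical, and the input is assumed to be non-negative event counts.

```python
import json

import numpy as np
from sklearn.feature_selection import chi2


def cluster_stats(X, labels, side, feature_names, ids, cluster_id):
    """Build a JSON-serializable stats record for one side of a split.

    X holds non-negative event counts (rows = instances, columns = features);
    labels holds the 0/1 partition assigned to each row.
    """
    # One chi-square score per feature, measured against the partition.
    scores, _ = chi2(X, labels)
    scores = np.nan_to_num(scores)  # features that are all zero score NaN

    members = labels == side
    return {
        "cluster_id": cluster_id,
        "size": int(members.sum()),
        # Feature with the highest score is taken as the primary one.
        "primary_feature": feature_names[int(np.argmax(scores))],
        "feature_chi_scores": {
            name: float(score) for name, score in zip(feature_names, scores)
        },
        # Hypothetical choice of per-feature statistic: the mean count.
        "feature_means": {
            name: float(mean)
            for name, mean in zip(feature_names, X[members].mean(axis=0))
        },
        "ids": [i for i, is_member in zip(ids, members) if is_member],
    }


# Toy clickstream counts: only the "search" event separates the two groups.
X = np.array([[5, 1], [6, 1], [0, 1], [0, 1]])
labels = np.array([0, 0, 1, 1])
stats = cluster_stats(X, labels, side=0,
                      feature_names=["search", "login"],
                      ids=["u1", "u2", "u3", "u4"],
                      cluster_id="demo-cluster")
stats_json = json.dumps(stats, indent=2)
```

Here "login" has the same count in every row and therefore scores zero, so "search" is reported as the primary feature for the split.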

The hierarchical cluster tree is also rendered interactively with d3.js, which presents the information for each cluster visually.

Hierarchical tree expands
Clustering visualization using d3.js

Ashwith sees many potential use cases for RBHC, including understanding user clickstream behavior and grouping users with similar behavior. He credits Intuit’s commitment to open source as one of the primary reasons behind his decision to open source the project. “I felt that there was merit in open sourcing this project,” says Ashwith, “and Intuit’s processes for open sourcing helped me learn new skills.”
