Binning example on customer purchase history

Data Binning Explained

Mose Kabungo
7 min read · Feb 18, 2023

Background and Disclaimer: At the company where I work, we’ve segmented our customers into four classes based on their payment behavior. Typically, when segmenting customers in a business setting, three parameters are used: Recency (how recently a customer made a purchase), Frequency (how often a customer makes purchases), and Monetary (how much a customer spends on each purchase). However, in this case, I will only be using Frequency to explain the concept of binning. In the future, I will provide a more comprehensive explanation of customer segmentation.

But what is binning?

Data binning is the process of organizing your data into a finite number of intervals. These intervals are called “bins”.

Data binning, also known as “data bucketing”, is a data pre-processing technique used in machine learning and data mining to group continuous data values into discrete intervals or “bins.” Binning can serve multiple purposes, including reducing the effect of small errors by smoothing out fluctuations in the data, gaining insights about the data distribution, and converting real-valued features into discrete intervals (discretization). The resulting bin of each value can then be one-hot encoded when a model needs categorical inputs.
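To make the difference between discretizing a value and one-hot encoding it concrete, here is a minimal sketch. The helpers binIndex and oneHot are hypothetical names introduced only for this illustration, and the sketch assumes equal-width bins between a known minimum and maximum.

```typescript
// Hypothetical helper: map a value to the index of its equal-width bin.
// `min`, `max`, and `binCount` are assumed to describe the whole dataset.
function binIndex(value: number, min: number, max: number, binCount: number): number {
  const width = (max - min) / binCount
  if (width === 0) return 0 // all values identical: everything lands in bin 0
  const index = Math.floor((value - min) / width)
  // Clamp so that value === max falls into the last bin rather than past it.
  return Math.min(Math.max(index, 0), binCount - 1)
}

// Hypothetical helper: one-hot encode a bin index as a vector of 0s and 1s.
function oneHot(index: number, binCount: number): number[] {
  return Array.from({ length: binCount }, (_, i) => (i === index ? 1 : 0))
}

const idx = binIndex(188, 29, 394, 4)
console.log(idx, oneHot(idx, 4)) // prints 1 [ 0, 1, 0, 0 ]
```

Discretization is the first step (188 becomes bin 1); one-hot encoding is an optional second step applied to that bin index.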

There are two primary types of binning methods: equal-frequency binning and equal-width binning. Equal-frequency binning places roughly the same number of values in each bin, while equal-width binning gives every bin the same value range. Variations of these methods exist, but they are all essentially based on one of these two approaches. In this tutorial, the focus is on implementing a basic equal-width binning algorithm. Equal-width binning involves dividing the range of values in a dataset into a specified number of equally spaced intervals between the minimum and maximum values.

const width = (max - min) / number_of_bins
const bins_values = [min, min + width, min + 2 * width, ..., min + number_of_bins * width]
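The formula above can be turned into a small runnable sketch. The function name binEdges is an assumption made for this illustration; it returns the number_of_bins + 1 boundaries, the last of which equals the maximum.

```typescript
// Hypothetical helper: returns the binCount + 1 boundaries of equal-width bins.
function binEdges(min: number, max: number, binCount: number): number[] {
  if (!Number.isInteger(binCount) || binCount < 1) {
    throw new RangeError(`binCount must be a positive integer, got ${binCount}`)
  }
  const width = (max - min) / binCount
  // Edge i sits i widths above the minimum.
  return Array.from({ length: binCount + 1 }, (_, i) => min + i * width)
}

console.log(binEdges(29, 394, 4))
// prints [ 29, 120.25, 211.5, 302.75, 394 ]
```

With a minimum of 29, a maximum of 394, and four bins, the width is 91.25, which is the setup used in the example later in this article.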

Can you show an example?

To illustrate the concept of data binning, let us consider a practical example using TypeScript. Suppose we have a dataset that records the number of purchases made by customers within a specific time period. The dataset is represented in the following array:

const trxFreq = [229, 293, 394, 39, 39, 49, 93, 100, 29, 48, 188]

In this example, the values represent the number of purchases made by each customer. We can use data binning to group these values into discrete intervals, making it easier to analyze the data and gain insights about the purchasing behavior of our customers.

In order to analyze the dataset of customer purchase frequency, it is necessary to divide the data into four distinct bins labeled `D-Customers` through `A-Customers`. To accomplish this, we will first create a simple data structure to store our bins.

/**
 * Structure of each bin
 */
export interface Bin {
  /**
   * Bucket descriptor: the lower bound of the bin and an optional label.
   */
  item: {
    value: number
    label?: string
  }

  /**
   * Values in a bin
   */
  values: number[]
}

/**
 * Array type of bins
 */
export type Bins = Bin[]

To bin the customer purchase frequency data, we will create a function called toBins. This function takes an array of numbers (our dataset), the desired number of bins, and an optional array of labels. It implements our binning algorithm, grouping the dataset into the specified number of bins and returning an array of the binned data.

The signature of the toBins function might look something like this:

/**
 * Convert input dataset array of numbers into bins.
 *
 * @param data dataset array
 * @param binSize Number of bins
 * @param labels Optional labels for the bins
 */
export function toBins(data: number[], binSize: number, labels?: string[]) {
  // Implementation
}

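If the optional label array is not provided to the toBins function, each bin's label is left undefined and the bin is identified only by its value (its lower bound). When labels are provided, they should be ordered from the lowest bin to the highest. For example, if we are dividing the data into four bins, the bins might be described as follows: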

  • Bin 1: Lowest frequency purchases / Lucky customers
  • Bin 2: Low frequency purchases / Little interested customers
  • Bin 3: Moderate frequency purchases / Okay customers
  • Bin 4: High frequency purchases / Good customers

These labels are based on the frequency of purchases made by the customers, and can be used to provide insights into customer behavior and preferences.

const sampleCount = data.length

// Sort a copy of the data in ascending order (the caller's array is left untouched)
const sortedData = [...data].sort((lhs, rhs) => lhs - rhs)
// Find min and max samples
const minSample = sortedData[0]
const maxSample = sortedData[sampleCount - 1]

// Create equally spaced bins; item.value is the lower bound of each bin
const binWidth = (maxSample - minSample) / binSize
const bins: Bins = Array.from({ length: binSize }, (_, i) => ({
  item: {
    value: minSample + binWidth * i,
    label: labels?.[i]
  },
  // Each bin gets its own, initially empty, values array
  values: []
}))

To begin the binning process in the toBins function, we record the number of elements in our input dataset by taking the length of the array. Next, we sort the dataset in ascending order, and record the minimum and maximum values present in the dataset for use in calculating the bin width.

The bin width is calculated by taking the difference between the maximum and minimum values and dividing it by the number of desired bins. This width will be used to group values into their appropriate bins.

Finally, we create the array of bins. Each bin stores its lower bound, its optional label, and an empty list of values, which we will populate in the next step of the algorithm. By creating these empty bins, we establish the framework for the subsequent grouping of data into bins.
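A JavaScript detail worth knowing when pre-allocating bins like these: Array.prototype.fill called with an object stores the same object in every slot, and a shallow spread copy still shares any nested array. The short sketch below illustrates the difference; the variable names are made up for the demonstration.

```typescript
// fill() puts the SAME object into every slot.
const shared = Array(3).fill({ values: [] as number[] })
shared[0].values.push(1)
console.log(shared[2].values) // prints [ 1 ]

// A shallow copy via spread creates new objects, but `values` is still shared.
const spread = Array(3)
  .fill({ values: [] as number[] })
  .map((bin) => ({ ...bin }))
spread[0].values.push(1)
console.log(spread[2].values) // prints [ 1 ]

// Array.from runs the factory once per slot, so every bin owns its array.
const independent = Array.from({ length: 3 }, () => ({ values: [] as number[] }))
independent[0].values.push(1)
console.log(independent[2].values) // prints []
```

Sharing is harmless as long as each bin's values property is reassigned rather than pushed to, but creating one array per bin avoids the surprise entirely.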

// Search for the next chunk from the sorted array to add into the bin values
let minIndex = 0
let maxIndex = sampleCount - 1

for (let i = 0; i < binSize; i++) {
  // Skip values that fall below this bin's lower bound
  while (
    minIndex < sampleCount &&
    sortedData[minIndex] < bins[i].item.value
  ) {
    minIndex++
  }
  // Step back past values that lie above this bin's upper bound
  while (
    maxIndex >= 0 &&
    sortedData[maxIndex] > bins[i].item.value + binWidth
  ) {
    maxIndex--
  }

  // Slice items from the sorted array into the bin
  bins[i].values = sortedData.slice(minIndex, maxIndex + 1)

  // Get the maxIndex ready for the next iteration.
  maxIndex = sampleCount - 1
}

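In this part of the algorithm, we locate the minimum and maximum indexes for each bin by scanning the sorted data. minIndex advances past every value below the bin's lower bound (item.value), and maxIndex moves back from the end of the array past every value above the bin's upper bound (item.value + binWidth). We use these indexes to slice the sorted dataset and populate the bin with the corresponding values. Note that both bounds are inclusive, so a value that falls exactly on the boundary between two bins would be placed in both.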

It is important to note that this step requires the dataset to be sorted in ascending order in order to locate the appropriate indices. Failure to sort the dataset beforehand would result in incorrect binning.
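A related detail is that the sort must be numeric. Without a comparator, JavaScript's Array.prototype.sort compares elements as strings, which does not produce ascending numeric order. The sketch below uses the sample data from this article; the variable names lexicographic and ascending are chosen only for illustration.

```typescript
const trxFreq = [229, 293, 394, 39, 39, 49, 93, 100, 29, 48, 188]

// Default sort compares elements as strings, so "229" sorts before "29".
const lexicographic = [...trxFreq].sort()
console.log(lexicographic)
// prints [ 100, 188, 229, 29, 293, 39, 39, 394, 48, 49, 93 ]

// A numeric comparator yields true ascending order.
const ascending = [...trxFreq].sort((lhs, rhs) => lhs - rhs)
console.log(ascending)
// prints [ 29, 39, 39, 48, 49, 93, 100, 188, 229, 293, 394 ]

// Spreading into a new array first keeps the original input unchanged.
console.log(trxFreq[0]) // prints 229
```

Sorting a copy is a design choice: sort works in place, so sorting the input directly would silently reorder the caller's array.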

To see the algorithm in action, a fully functioning example is given below.


export interface Bin {
  item: {
    value: number
    label?: string
  }

  values: number[]
}

export type Bins = Bin[]

export function toBins(data: number[], binSize: number, labels?: string[]) {
  const sampleCount = data.length

  if (binSize >= sampleCount) {
    throw new RangeError(
      `Bin size ${binSize} should be less than sample size: ${sampleCount}`
    )
  }

  // Sort a copy of the data in ascending order (the caller's array is left untouched)
  const sortedData = [...data].sort((lhs, rhs) => lhs - rhs)
  // Find min and max samples
  const minSample = sortedData[0]
  const maxSample = sortedData[sampleCount - 1]

  // Create equally spaced bins; item.value is the lower bound of each bin
  const binWidth = (maxSample - minSample) / binSize
  const bins: Bins = Array.from({ length: binSize }, (_, i) => ({
    item: {
      value: minSample + binWidth * i,
      label: labels?.[i]
    },
    // Each bin gets its own, initially empty, values array
    values: []
  }))

  let minIndex = 0
  let maxIndex = sampleCount - 1

  for (let i = 0; i < binSize; i++) {
    // Skip values that fall below this bin's lower bound
    while (
      minIndex < sampleCount &&
      sortedData[minIndex] < bins[i].item.value
    ) {
      minIndex++
    }
    // Step back past values that lie above this bin's upper bound
    while (
      maxIndex >= 0 &&
      sortedData[maxIndex] > bins[i].item.value + binWidth
    ) {
      maxIndex--
    }

    // Slice a bucket
    bins[i].values = sortedData.slice(minIndex, maxIndex + 1)

    // Get the maxIndex ready for the next iteration.
    maxIndex = sampleCount - 1
  }

  return bins
}

Code in action…

Once the toBins function has been implemented, it can be invoked by client code to produce an array of Bin objects representing the grouped data. To do so, the client code must pass the input dataset, the desired number of bins, and optionally, an array of labels to the toBins function.

As an example, we can pass our sample transaction frequencies together with four labels:

const trxFreq = [229, 293, 394, 39, 39, 49, 93, 100, 29, 48, 188]
const labels = ['D-Customers', 'C-Customers', 'B-Customers', 'A-Customers']
const bins = toBins(trxFreq, labels.length, labels)

console.log(bins)
// Prints
// [
//   {
//     "item": { "value": 29, "label": "D-Customers" },
//     "values": [ 29, 39, 39, 48, 49, 93, 100 ]
//   },
//   {
//     "item": { "value": 120.25, "label": "C-Customers" },
//     "values": [ 188 ]
//   },
//   {
//     "item": { "value": 211.5, "label": "B-Customers" },
//     "values": [ 229, 293 ]
//   },
//   {
//     "item": { "value": 302.75, "label": "A-Customers" },
//     "values": [ 394 ]
//   }
// ]

The resulting array of Bin objects produced by the toBins function can be used to compute a histogram by plotting the length of the values array against the corresponding label for each bin.
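As a rough illustration of that idea, the sketch below renders a text histogram from bins shaped like the output above. The renderHistogram helper is a hypothetical name, and the sample bins are copied from the printed result rather than computed here.

```typescript
interface Bin {
  item: { value: number; label?: string }
  values: number[]
}

// Hypothetical helper: one text row per bin, bar length = number of values.
function renderHistogram(bins: Bin[]): string[] {
  return bins.map((bin) => {
    // Fall back to the bin's lower bound when no label was supplied.
    const label = bin.item.label ?? String(bin.item.value)
    const count = bin.values.length
    return `${label}: ${"#".repeat(count)} (${count})`
  })
}

// Bins copied from the printed output of the example.
const sampleBins: Bin[] = [
  { item: { value: 29, label: "D-Customers" }, values: [29, 39, 39, 48, 49, 93, 100] },
  { item: { value: 120.25, label: "C-Customers" }, values: [188] },
  { item: { value: 211.5, label: "B-Customers" }, values: [229, 293] },
  { item: { value: 302.75, label: "A-Customers" }, values: [394] }
]

console.log(renderHistogram(sampleBins).join("\n"))
// prints
// D-Customers: ####### (7)
// C-Customers: # (1)
// B-Customers: ## (2)
// A-Customers: # (1)
```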

It is worth noting that the algorithm we have implemented has a time complexity of O(n log n) due to the need to sort the input dataset. The space complexity is O(n + k), where n is the number of samples and k is the number of bins, because the values arrays of the bins together hold a copy of every sample.

Closing Notes and Takeaways:

To summarize, in this tutorial we have covered a simple equal-width binning algorithm and provided an example of how it can be used to group customers based on their purchasing frequency. Binning is a common data pre-processing technique that can be employed to reduce error, gain insight about a dataset, or convert real values into discrete intervals. However, there are some challenges associated with this technique, such as deciding the appropriate number of bins for a given dataset.
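There is no single right answer for the number of bins, but two common rules of thumb can serve as a starting point: Sturges' formula and the square-root choice. The sketch below is only illustrative; the function names are assumptions made for this example, and neither rule is used in the tutorial's own code.

```typescript
// Sturges' formula: k = ceil(log2(n)) + 1
function sturgesBinCount(sampleCount: number): number {
  if (sampleCount < 1) throw new RangeError("sampleCount must be at least 1")
  return Math.ceil(Math.log2(sampleCount)) + 1
}

// Square-root choice: k = ceil(sqrt(n))
function sqrtBinCount(sampleCount: number): number {
  if (sampleCount < 1) throw new RangeError("sampleCount must be at least 1")
  return Math.ceil(Math.sqrt(sampleCount))
}

// For the 11 samples used in this article:
console.log(sturgesBinCount(11)) // prints 5
console.log(sqrtBinCount(11)) // prints 4
```

In a business setting like the one described here, the number of bins is often dictated by the segments the company wants (four customer classes), so these rules are more useful for exploratory histograms.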

Depending on the specific use case, alternative techniques can be employed to organize a dataset into groups, such as clustering algorithms like k-means. It is important to carefully consider the specific requirements of a project in order to determine the most appropriate pre-processing technique to employ.

If you’re interested in learning more about binning or other data pre-processing techniques, be sure to explore my other articles, tutorials, and resources. Opt in to receive emails when I publish new articles. Stay tuned for more insights and practical tips on the latest trends in machine learning and data science!

Finally, here is the GitHub repo for this tutorial. Thank you!

Written by Mose Kabungo

Christian evangelist & Digital technologist
