Three Routes: Binning and Grand Totals

Simi Talkar
The Startup
Published in
7 min readSep 21, 2020

Relevant Scenario

This scenario was developed in response to a requirement to translate excel pivoting capabilities into Pandas code so that the automation developed could be applied on a large set of excel files. The raw data in the excel sheet primarily consists of contracts won in each fiscal year paired with their contract amounts. The requirement is to report on the number of contracts in each fiscal year that fall into “bins” or “buckets” of pre-defined contract amounts. Six bins of contract amounts were identified as bins of interest:

1. <10M
2. >=10M<25M
3. >=25M<50M
4. >=50M<75M
5. >=75M<100M
6. >=100M

Three techniques have been provided below which create the output desired using a relational database, Pandas and Power BI- with usage of DAX functions.

Data table

The data we will consider here is an excel sheet with three columns as seen in Fig 1. We would like to end up with a summarization shown in Fig 2.

Fig 1:

Fig 2:

Technique in SQL

PostgreSQL
We will load the data directly from the excel file into the database using Import/Export functionality offered by pgAdmin. See section Setup Data below for details. After the load, the fiscalyearcontract table created, contains the data from the excel sheet.

A single query can create the output with
1. Rows for each fiscal year, with count of contracts in each bin.
2. Row totals (total number of contracts for each fiscal year).
3. Column totals (total number of contracts for each bin of contract amount range) .
The query contains common table expressions (CTE) which allow us to build upon each sub-query layer to group the data as well as create summarization.

Postgres Query
with T as
(select fy,
sum(case when orig_amount < 10000000
then 1 else 0 end) as "<10M",
sum(case when orig_amount >= 10000000
and orig_amount < 25000000
then 1 else 0 end) as ">=10M<25M",
sum(case when orig_amount >= 25000000
and orig_amount < 50000000
then 1 else 0 end) as ">=25M<50M",
sum(case when orig_amount >= 50000000
and orig_amount < 75000000
then 1 else 0 end) as ">=50M<75M",
sum(case when orig_amount >= 75000000
and orig_amount < 100000000
then 1 else 0 end) as ">=75M<100M",
sum(case when orig_amount >= 100000000
then 1 else 0 end) as ">=100M"
from fiscalyearcontract
group by fy),
RT as (select CAST(fy as VARCHAR),
"<10M", ">=10M<25M",
">=25M<50M",
">=50M<75M",
">=75M<100M",
">=100M" ,
"<10M" + ">=10M<25M" + ">=25M<50M" + ">=50M<75M"
+ ">=75M<100M" + ">=100M" as "Row Total"
from T),

CT AS (select 'Column Total',
sum("<10M"),
sum(">=10M<25M"),
sum(">=25M<50M"),
sum(">=50M<75M"),
sum(">=75M<100M"),
sum(">=100M"),
sum( "Row Total")
from RT)

select * from RT
union
select * from CT
order by FY;

Breaking up the query, we see that the subquery T, creates a new set of columns along with the fiscal years (fy). We add a one to an aliased column group when the query encounters an orig_amount belonging to that column group and a zero if it does not.
Running just the T subquery will result in the table as seen below:

We can split up the construction of the row and column totals into two queries. The row total (RT) is simply the sum of the counts in each group and is selected as a new column along with the columns created in T.
The column totals are computed in a query (CT) and then UNIONed to the result of RT to append the totals as the last row in the result.

Technique in Power BI

Power BI offers grouping and binning that can be used for creating histograms and for conditional formatting as explained here : https://docs.microsoft.com/en-us/power-bi/create-reports/desktop-grouping-and-binning. The limitation we will overcome for this scenario is creating custom sized bins that are not uniform or unevenly sized with the help of DAX.

We begin by importing the excel file as a query (Fiscal Year Contract) into Power BI using Get Data-> Excel. We ensure that the datatype for Fiscal Year (fy) is Whole Number and for the contract amounts (orig_amount) is Decimal Number. Also for the field of fy we select Don’t Summarize in Column Tools.

We proceed to define and create a range table using DAX’s DATATABLE. This nifty function allows us to create a query table specifying the column names and data types and insert rows in it in one shot.

Contract Amount Groups = 
DATATABLE(
"Contract Amount Range", STRING,
"Sort Order", INTEGER,
"Start", INTEGER,
"End", INTEGER,
{
{"<10M", 1, 0, 10000000},
{">=10M<25M", 2, 10000000, 25000000},
{">=25M<50M", 3, 25000000, 50000000},
{">=50M<75M", 4, 50000000, 75000000},
{">=75M<100M", 5, 75000000, 100000000},
{">=100M", 6, 100000000, 10000000000000}
}
)

Now that we have our custom bins defined with a label and Start and End, a simple DAX can pick out the range that orig_amount in each row falls into. To store the range next to the orig_amount, create a calculated column in the Fiscal Year Contract query with the DAX shown below that works in two steps:

1. It narrows down the Contract Amount Groups query to one row based on which category the amount falls in.
2. It returns the range label using the VALUES function.
3. If you find yourself extending this pattern of custom grouping to other applications, the thing to watch out for is that the amount must fall into only one group and so the comparison operators ≥ and < are to be employed with care.

This sets up the queries we can now use to visualize the entire data by fiscal year.

In the Visualization tab, we select the matrix visualization. With this arrangement we end up with the desired result shown in the matrix below.

Technique in Pandas

Data can be imported into Pandas directly from an excel or CSV file using the read_excel and read_csv functions.
The code below shows intervals being created using IntervalIndex with the left side of the interval being closed (included value).

Pandas has a cut function that will apply these custom created bins on a column of values.

To display the bin labels along with the bins, we ‘zip’ the bins categories with the bin labels and create a dictionary that can be used to map the bin category that an amount belongs to.

Pandas’ groupby function can take a list of columns to group by. It returns a groupby object with details on each group. An aggregation function such as count when called on each group, returns the number of rows in that group for each of the columns. We thus get the number of contracts for each fiscal year and each bin category.

All that remains to be done to the above result is to reset the index so that FY and CATEGORY_LABELS are columns so that we can pivot it with index = FY, column labels = CATEGORY_LABELS and values= CONTRACT_BINS. The final output is seen below. With the specification of margins=True, row and column totals come included in the pivot table created.

Setup data

CSV files can be imported into PostgresSQL (pgAdmin) , Pandas and Power BI.

To load the data into the database from an Excel file.

1. Create a table with the column names for the data to be imported in the Postgres database.

CREATE TABLE public.fiscalyearcontract
(
contract_id character varying(12) COLLATE pg_catalog."default",
fy numeric(4,0),
orig_amount numeric(23,4)
)
TABLESPACE pg_default;

2. Open the Excel file and save it as a CSV file.
3. Clear the headings from the CSV file so that data begins from the first row.
4. Select the table created above in pgAdmin, and then under Tools, select Import/Export.
5. Toggle the switch over to Import, specify the full path to CSV file and click Ok

To capture CSV data into a Jupyter notebook:

import pandas as pd
df = pd.read_csv('ContractsFiscalYearAmountBins.csv')

To capture CSV data into a Power BI Query:
Under Get Data select Text/CSV

--

--

Simi Talkar
The Startup

Certified DS Associate (DP-100, DA-100), pursuing Masters in University Of Michigan’s Applied Data Science Program https://www.linkedin.com/in/simi-talkar/