Histograms from Scratch!

Shirley Liu
Feb 13, 2018

Hello!

Today I’m going to go over how to build a histogram from scratch in Python! Historically, I’ve always just used a built-in function to create plots and histograms. But how would you do that on your own? What was even more interesting to me was building a two-dimensional histogram from scratch! A great friend of mine helped me visualize the simple concept behind this, so I hope to share it with everyone today!

The Data

In my last tutorial, I showed you a histogram of male and female heights. The raw data behind it was a table where each row was a person and their height in inches. However, I wanted to bucket these heights. Sure, you can do this manually or use a built-in function, but what is actually going on in the backend of the program?

The Code

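import numpy as np
import pandas as pd

def histogram_classifier(X, T, B, xmin, xmax):
    # One array of B counts per gender
    HF = np.zeros(B).astype('int32')
    HM = np.zeros(B).astype('int32')
    # Offset by xmin, divide by the bin size, then truncate to get the bin number
    binindices = np.floor(B * (X - xmin) / (xmax - xmin)).astype('int64')
    # A height equal to xmax would land in bin B (out of range), so keep it in the last bin
    binindices = np.clip(binindices, 0, B - 1)
    for i, b in enumerate(binindices):
        if T[i] == 'Female':
            HF[b] += 1
        else:
            HM[b] += 1
    return [HF, HM]

# minimum and maximum are the smallest and largest heights in df['total_inches']
hist = pd.DataFrame(histogram_classifier(df['total_inches'], df['Gender'], 32, minimum, maximum))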
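If you want to sanity-check a from-scratch histogram, one option is to compare its counts against NumPy’s built-in np.histogram. Here is a minimal sketch of that idea. The helper name bin_counts and the heights are made up for illustration, and the sketch assumes truncation with np.floor plus a clip at the top edge, so treat it as one possible version rather than the exact code above.

```python
import numpy as np

def bin_counts(X, B, xmin, xmax):
    # Hypothetical helper: same offset / bin-size idea, for a single group.
    idx = np.floor(B * (np.asarray(X) - xmin) / (xmax - xmin)).astype('int64')
    idx = np.clip(idx, 0, B - 1)  # keep a value equal to xmax in the last bin
    counts = np.zeros(B, dtype='int32')
    for b in idx:
        counts[b] += 1
    return counts

heights = [50, 55, 57, 61, 63, 64, 69, 70]  # made-up heights in inches
mine = bin_counts(heights, 5, 50, 70)
reference, edges = np.histogram(heights, bins=5, range=(50, 70))

print(mine)       # [1 2 1 2 2]
print(reference)  # [1 2 1 2 2]
```

Both approaches agree here because np.histogram also treats every bin as closed on the left and open on the right, except the last bin, which includes the max.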

Let’s break this down!

X = my list of heights

T = the gender label for each height

B = number of bins I want

xmin/xmax = the min and max heights for my people

B * (X - xmin) / (xmax - xmin)

So the goal of bin indices is to work out which “bin” number to throw each data point in. If I wanted 5 bins, and my min = 50 and my max = 70, each of my bins would cover an interval of 4, right? If the first height in my data was 55, which bin would it fall into? The 0th, 1st, 2nd … 4th?

Because my min is actually 50, I need to offset the height by subtracting the min from my value: 55 - 50 = 5.

Which “bin” this data falls into is 5/4 = 1 with remainder 1, so it falls into the “1st” bin. Keep in mind that bin indices start with 0!

  1. “X - xmin” gives me the offset
  2. “(xmax - xmin) / B” gives me the bin size
  3. #1 / #2 gives me the bin number when I convert the result to an integer.
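The three steps above can be sketched with the numbers from the walkthrough (5 bins, min 50, max 70). The sample heights are made up, and the sketch assumes that “convert to an integer” means truncating with np.floor. The clip on the last line is also an assumption: it keeps a height equal to the max in the last bin instead of a bin that doesn’t exist.

```python
import numpy as np

# Values from the walkthrough: 5 bins spanning heights 50 to 70.
B, xmin, xmax = 5, 50, 70
heights = np.array([50, 55, 61, 69, 70])  # made-up sample heights

offset = heights - xmin          # step 1: distance above the minimum
bin_size = (xmax - xmin) / B     # step 2: width of each bin (4.0 here)
bin_index = np.floor(offset / bin_size).astype('int64')  # step 3: truncate to an integer

# A height equal to xmax computes to bin 5, which does not exist, so clip it into bin 4.
bin_index = np.clip(bin_index, 0, B - 1)

print(bin_index)  # [0 1 2 4 4]
```

The height of 55 lands in bin 1, matching the 5/4 arithmetic above.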

The code in my example counts males and females separately. The enumerate function gives me a tuple for each value: its position in my array (i) and its bin number (b).

For example, my first tuple would be (0, 1): my data point of 55 is the 0th value, and 1 is the bin number to throw that value in. For each row, I check whether that person is male or female in the table so I can create two separate histograms.

for i, b in enumerate(binindices):
    if T[i] == 'Female':
        HF[b] += 1
    else:
        HM[b] += 1
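To see the loop in action, here is a small self-contained sketch. The bin numbers and gender labels are made up for four imaginary people, and a plain Python list stands in for the gender column.

```python
import numpy as np

# Made-up bin numbers and gender labels for four people (illustrative only).
binindices = np.array([1, 0, 1, 4])
T = ['Female', 'Male', 'Male', 'Female']

B = 5
HF = np.zeros(B, dtype='int32')  # female counts per bin
HM = np.zeros(B, dtype='int32')  # male counts per bin

for i, b in enumerate(binindices):
    # i is the row position, b is the bin number for that row
    if T[i] == 'Female':
        HF[b] += 1
    else:
        HM[b] += 1

print(HF)  # [0 1 0 0 1]
print(HM)  # [1 1 0 0 0]
```

The first pass through the loop sees the tuple (0, 1), so the 0th person, a female, adds one to bin 1 of HF.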

That’s it! Let me know your thoughts and if anything doesn’t make sense!

Happy data-ing~

Shirley Liu

Data Analyst by day, artist by night. Striving towards creativity and happiness every day.