Sorting Cohort applicants into new Cohorts using Machine Learning

Parminder Singh
Published in Chingu
7 min read · Feb 25, 2017

Chingu cohorts have been a major success, but with more power comes greater responsibility

One day I was discussing with Chance Taken how cohort conversations fade as people get familiar with each other, and the idea of refreshing the cohorts, and automating the process, came up. New members were joining with every survey, and Chance was clearly going to have trouble sorting them all and placing them in cohorts that matched their skills.

I had recently learned Machine Learning, so it was time to finally apply it!

Most of the time, a Machine Learning problem is solved in the following steps:

  1. Formatting Data
  2. Filtering / Cleaning Data
  3. Examining the Final Data
  4. Selecting a Machine Learning Approach
  5. Applying Machine Learning and chilling as it does the hard work for you!

Formatting Data:

This was an easy step because most of the survey questions were compulsory, so there were no empty answers.
Also, as the data was collected in Google Sheets, it was already in a clean tabular form.

For manipulating the data I was going to use the Pandas library, which has an easy-to-use function pd.read_csv('filename') for reading CSV (Comma Separated Values) files, so I downloaded the survey as a CSV file. I soon found out that this could cause issues, as many answers contained a , in the submissions. So I re-downloaded the data in TSV (Tab Separated Values) format and told the function that the delimiter of the data is \t.

import pandas as pd
import numpy as np
import pickle
import re
filename = 'memesForLife.tsv'
dataframe = pd.read_csv(open(filename, "r"), delimiter='\t')

Pandas loads the file into memory as a DataFrame object, a tabular data structure much like a Google Sheet, which I can then use to manipulate the data in the next stages.
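Before going further, a quick sanity check never hurts (this little snippet is my addition, not part of the original workflow):

# Shape of the data and the column names, to confirm the survey loaded correctly
print(dataframe.shape)
print(dataframe.columns.tolist())
# Peek at the first few rows
print(dataframe.head())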

Filtering / Cleaning Data:

A lot of features in the data are insignificant for clustering. These need to be removed before moving forward, so that only the data useful for the predictions remains.

Pandas makes selecting columns easy using the .loc[rows, columns] property.

featuresToUse = [
    'Where are you on the FCC map?',
    'What is the UTC timezone for where you will be coding from?',
    "Please check the features you're MOST excited about"
]
# Keep only the given features in the dataframe
dataframe = dataframe.loc[:, featuresToUse]

Whew, now for another round of cleaning: this time I had to categorise the data properly, e.g. 2 - Basic Algorithms becomes 2 in the final dataframe. This was easy, as the data was consistent and I could just pick the first character of the string and convert it to an int.

# Loop over all data to clean it
for i in dataframe.index:
    # Progress in FCC: keep just the leading digit as an int
    dataframe.loc[i, featuresToUse[0]] = int(dataframe.loc[i, featuresToUse[0]][0])

Coding hours were in a different format: B - 10-19 becomes 1. For this I used a dictionary to map the first letter to a corresponding integer value.

categoryMapper = {
    'A': 0,
    'B': 1,
    'C': 2,
    'D': 3,
    'E': 4
}
for i in dataframe.index:
    dataframe.loc[i, featuresToUse[1]] = categoryMapper[dataframe.loc[i, featuresToUse[1]][0]]

Now for the “reason to join” feature, more complex handling was needed, because it was a multiple-choice question and all the selected answers were separated by commas.

In these cases it is best to treat each option as its own feature and mark it as 1 or 0, i.e. present or not present, instead of forming a linear representation where the 1st option becomes 1 and the 2nd becomes 2. The latter creates a false sense that option 1 is somehow more similar to option 2 than to the others, which is not true: all these options are completely independent of each other. So I was going to give each option a column of its own in the dataframe and mark it 0 or 1.

This problem had regex written all over it! So I imported the re library, the standard library for regex in Python. As the option could be anywhere in the string, I went for the re.search function, which matches anywhere in the string, instead of re.match, which only matches from the beginning.

re.search returns None if the pattern is not present and a match object otherwise, so this one-liner could check it all:

1 if re.search(option, dataString) is not None else 0

Next I needed all the possible options that could appear in the string, which I stored in a list:

options = [
    '1 - Being in a group of friendly coders who share my goals',
    '2 - Having access to team project experiences',
    '3 - Help when I get stuck on a coding problem',
    '4 - Having an "Accountability Buddy" to help me stay motivated',
    '5 - Getting out of my comfort zone',
    '6 - Video discussions topics and help sessions',
    '8 - Pair-Programming opportunities'
]

Now, during the loop, I can take the answer string and add each option as a new column of its own in the dataframe.

for i in dataframe.index:
    dataString = dataframe.loc[i, featuresToUse[2]]
    for option in options:
        # 1 if the option appears anywhere in the answer string, else 0
        dataframe.loc[i, option] = 1 if re.search(option, dataString) is not None else 0
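As an aside (this is an alternative I'm adding here, not what the original script did), pandas can build the same kind of 0/1 columns in one call with Series.str.get_dummies, assuming the selected options really are joined with ', ' in the answer string:

# One-hot encode the multiple-choice answers in a single call
dummies = dataframe[featuresToUse[2]].str.get_dummies(sep=', ')
dataframe = dataframe.join(dummies)

The regex loop above gives you more control over exactly which options end up as columns, though, so both approaches are fine.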

After all this is done and all the data is handled, we can drop the now-unnecessary “reason for joining” column. We also don’t need the data in DataFrame format anymore, so I convert it into a NumPy array using the DataFrame.values property.

dataframe = dataframe.drop([featuresToUse[2]], axis=1)
# Yes, I am a terrible person putting different datatypes
# in the same variable
dataframe = dataframe.values

axis=1 means that we are dropping columns and not rows.

Now our data is clean and much nicer looking than what we started with!
In the following image, each row contains the progress in FCC, the amount of coding per week, and a 0/1 flag for each option of the reason for joining Chingu cohorts, respectively.

Data cleaned? or made dirtier? :P

To keep my code modular and split across several files, I like to use the pickle library. What it does is save a complete variable to a file, which can then be loaded back in another file.

# Put the data in a data.pkl file; the writing mode is binary
# because that is how pickle works :P
pickle.dump(dataframe, open('data.pkl', 'wb'), pickle.HIGHEST_PROTOCOL)

OK, we've got our data in the proper format to feed to a clustering algorithm. Now comes the Machine Learning part.

Examining the Final Data:

This is also a short step in this case: we don’t have pre-labelled people, nor do we have data to train with. So the better approach is an unsupervised one.

Selecting a Machine Learning Approach:

Here we just have to find points that are close to each other across every feature, a.k.a. dimension. This can be done using KMeans clustering. So let’s import some raw material now.

# Cause why not?
import numpy as np
# Importing as algorithm so I can switch between algorithms easily
from sklearn.cluster import KMeans as algorithm
# Loading the pkl file in memory
import pickle
# For plotting data
import matplotlib.pyplot as plt
# Counting members in each cluster
from collections import Counter

Applying Machine Learning:

Le Wild Meme has appeared!

First, we load the data into a variable using pickle.load:

# Pass the Binary file to pickle and we get numpy array back
data = pickle.load(open('data.pkl', 'rb'))

Now we initialize the KMeans clustering object, which we aliased as algorithm. We pass 5 as the number of clusters, as we are making 5 cohorts out of the data.

cluster = algorithm(n_clusters = 5)
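One small note from my side (not in the original script): KMeans starts from randomly chosen centroids, so the cluster numbering can change between runs. If you want the same cohort assignment every time, you can fix the seed:

# Fixing the random seed makes the clustering reproducible between runs
cluster = algorithm(n_clusters=5, random_state=42)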

Now we pass the data to the object, let it fit, and get back the prediction, i.e. the cluster number for each person we passed in:

pred = cluster.fit_predict(data)

Here we go!

But this doesn’t look very good, right? Chance would die reading numbers like this. So I made another mapping object that converts these to animal names. I use an anonymous function / lambda to look up the mapping for each value in the prediction, and store the list of those mappings in temp.

cohortMapper = {
    1: 'Robin',
    0: 'Snake', # Muhahaha OCD!!!
    2: 'Rhino',
    3: 'Panda',
    4: 'Flamingo',
    5: 'Racoon'
}
temp = list(map(lambda x: cohortMapper[x], pred))

Better than numbers :P

Still looks ugly :P …
What if Chance missed one of these people? He would be sending a hell of a lot of wrong emails. So I asked Chance to add email addresses to the data, ran all the code on that data again, and added the emails to the result using zip.

result = list(zip(temp, emails))
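The emails variable isn't built anywhere in the snippets above; assuming the new survey column was named something like 'Email Address' (a hypothetical name on my part), it could have been saved off in the cleaning script before the dataframe was reduced to a NumPy array:

# Hypothetical sketch: keep the email column aside before dropping columns
# and converting to .values, so it can be zipped with the predictions later
emails = list(dataframe['Email Address'])

After that, result is a list of (cohort name, email) pairs that Chance can actually use.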

That’s all, folks!
Now let’s visualise the data a little bit more. Here is a 2D plot with the X-axis representing progress in FCC and the Y-axis representing coding hours per week. For the colours I pass a colorMapper object:

# How many people landed in each cluster
print(Counter(pred))

colorMapper = {
    1: 'red',
    0: 'blue', # Muhahaha OCD Triggered!
    2: 'green',
    3: 'cyan',
    4: 'yellow',
    5: 'black'
}
# Color each point by its predicted cluster
plt.scatter(data[:, 0], data[:, 1],
            c=list(map(lambda x: colorMapper[x], pred)))
plt.show()

Output:

Counter({2: 58, 4: 43, 1: 39, 0: 32, 3: 24})
Ta da!

The clusters are looking amazing, and the counter shows quite a nice distribution across categories. The rest of the job was done by Chance, and now we can make new cohorts from the data in seconds with Machine Learning!
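For completeness, here is a tiny sketch (my addition, not from the original script) of how those (cohort name, email) pairs could be grouped into ready-to-use member lists:

from collections import defaultdict

# Group the zipped (cohort, email) pairs into one list per cohort
cohorts = defaultdict(list)
for cohortName, email in result:
    cohorts[cohortName].append(email)

# e.g. print how many members landed in each cohort
for cohortName, members in cohorts.items():
    print(cohortName, len(members))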

Thanks for following along with this article, I hope it was useful :-)
Happy Coding!
