Steps Before Machine Learning

Looking at the EDA process for a Linear Regression Machine Learning Algorithm

Iftekher Mamun
5 min read · Aug 10, 2019

Recently, I started a project from the Kaggle competition on predicting molecular properties. This is my first time working with data of this size on my own, and I soon realized I was in over my head. The dataset is massive enough that I could not even push it to my GitHub repository through the regular command prompt. It was easier to add the data files to .gitignore and upload only my working files to GitHub for tracking.

Unlike my other projects, this one involved merging CSVs the way you would with SQL joins, only in pandas (it was easier than building a database just for querying), followed by some exploratory data analysis, which takes a while given the size of the dataset. In today’s post, I want to focus on the basic EDA I did, the code behind it, and the results.

The first step is always to load the necessary libraries. For now, I will list the libraries as I use them.

from IPython.display import display

data_path = '../Molecular_Properties'
# IPython's ! syntax runs a shell command; this captures every CSV path in the directory
files_names = !ls $data_path/*.csv
files_names

I import IPython's display helper and create a variable that holds the path to my data directory, which is where all the CSVs currently live. I then pass that variable to the ls command together with a pattern for any file ending in .csv: ls lists the files in a directory, and /*.csv restricts the match to CSV files inside the data path. The result is a list of the CSV file paths.
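If you are running this outside a notebook, the !ls magic will not be available. A plain-Python equivalent using glob might look like the sketch below (same directory path as above; the variable is kept as files_names so the later loop still works):

import glob
import os

data_path = '../Molecular_Properties'
# Collect every CSV path in the data directory, sorted for reproducibility
files_names = sorted(glob.glob(os.path.join(data_path, '*.csv')))
print(files_names)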

import pandas as pd

# Load each CSV into a dictionary keyed by its file name:
# split('/')[-1] strips the directory and [:-4] strips the '.csv' extension
data_dict = {}
for name in files_names:
    data_dict[name.split('/')[-1][:-4]] = pd.read_csv(name)

The code above imports pandas and loads each CSV into a dictionary, using the file name (with the path and the .csv extension stripped off) as the key. With the dataframes created, it is time to join them. Thankfully someone had already worked this out, so I will just paste the code:

df_complete = data_dict['train'].copy()
df_complete = df_complete.join(data_dict['potential_energy'].set_index('molecule_name'), on='molecule_name')
df_complete = df_complete.join(data_dict['dipole_moments'].set_index('molecule_name'), on='molecule_name', lsuffix='dipole_moments_')
df_complete = df_complete.join(data_dict['magnetic_shielding_tensors'].set_index(['molecule_name', 'atom_index']), on=['molecule_name', 'atom_index_0'], lsuffix='_atom0')
df_complete = df_complete.join(data_dict['magnetic_shielding_tensors'].set_index(['molecule_name', 'atom_index']), on=['molecule_name', 'atom_index_1'], lsuffix='_atom1')
df_complete = df_complete.join(data_dict['mulliken_charges'].set_index(['molecule_name', 'atom_index']), on=['molecule_name', 'atom_index_0'], lsuffix='_atom0')
df_complete = df_complete.join(data_dict['mulliken_charges'].set_index(['molecule_name', 'atom_index']), on=['molecule_name', 'atom_index_1'], lsuffix='_atom1')
df_complete = df_complete.join(data_dict['scalar_coupling_contributions'].set_index(['molecule_name', 'atom_index_0', 'atom_index_1']), on=['molecule_name', 'atom_index_0', 'atom_index_1'], rsuffix='_scc')
df_complete = df_complete.join(data_dict['structures'].set_index(['molecule_name', 'atom_index']), on=['molecule_name', 'atom_index_0'], lsuffix='_atom0_structure')
df_complete = df_complete.join(data_dict['structures'].set_index(['molecule_name', 'atom_index']), on=['molecule_name', 'atom_index_1'], lsuffix='_atom1_structure')
df = df_complete.drop(['id'], axis=1)

I noticed that the id column was just an extra index left over after all the joins, so I dropped it. Looking at the merged dataset, I realized it contained 4658147 rows × 42 columns. That is massive, and any attempt at statistics or modeling took far too long and crashed my kernel. This was frustrating, so I decided to take a subsample of my dataset instead.

df_subsample = df.sample(frac=0.01, random_state=1)
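A quick sanity check (a sketch, not part of the original code) confirms the size and rough memory footprint of the subsample:

# Confirm the subsample's shape and approximate memory usage
print(df_subsample.shape)
print(round(df_subsample.memory_usage(deep=True).sum() / 1e6, 1), 'MB')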

Now my subsample contains only 1% of the total data. That may not sound like a lot, but it still amounts to 46581 rows × 42 columns. Thankfully, this is much more manageable without resorting to cloud computing. Next, I wanted to do some visualization with seaborn:

import seaborn as sns
sns.pairplot(df_subsample)
This is only a small portion of the actual pairplot graph. As you can see, it is very hard to decipher what it says.

That wasn’t very helpful. So I decided to look at all the possible categorical variables with the loop below, and found five categorical variables in total:

# finding the categorical (object-dtype) columns in the dataset
for col in df.select_dtypes(include=[object]):
    print(df[col].value_counts(dropna=False), "\n\n")

The molecule_name column contained far too many unique values, so I decided to drop it, and atom_atom1_structure contained only H, so that was dropped as well:

df_dropped_sample = df_subsample.drop(columns=['molecule_name', 'atom_atom1_structure'])
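A quick way to double-check those two drop decisions (a hedged sketch, not from the original code) is to count the unique values directly:

# molecule_name has a huge number of unique values; atom_atom1_structure holds only 'H'
print(df_subsample[['molecule_name', 'atom_atom1_structure']].nunique())
print(df_subsample['atom_atom1_structure'].unique())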

After dropping two of the five categorical variables, I had to decide how to convert the remaining three into dummy variables. There are two main ways to go about it: LabelEncoder and one-hot encoding. LabelEncoder would map the unique values to 1, 2, 3 and so on, but the model may read those codes as an ordering when no such order exists. The second option, one-hot encoding, is more appropriate here even though it adds more columns to the dataset.
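To see why label encoding can mislead a linear model, here is a tiny hedged illustration (the toy dataframe is made up, with values modeled on the type column):

import pandas as pd

toy = pd.DataFrame({'type': ['1JHC', '2JHN', '3JHH']})

# Label encoding: each category becomes an integer code, which a linear model
# treats as an ordered, evenly spaced quantity
toy['type_label'] = toy['type'].astype('category').cat.codes

# One-hot encoding: one indicator column per category, no implied order
toy_onehot = pd.get_dummies(toy['type'], prefix='category')
print(toy.join(toy_onehot))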

df_dropped_sample['type'] = pd.Categorical(df_dropped_sample['type'])
dfDummies = pd.get_dummies(df_dropped_sample['type'], prefix='category')
df_dropped_sample = pd.concat([df_dropped_sample, dfDummies], axis=1)

df_dropped_sample['type_scc'] = pd.Categorical(df_dropped_sample['type_scc'])
dfDummies = pd.get_dummies(df_dropped_sample['type_scc'], prefix='category')
df_dropped_sample = pd.concat([df_dropped_sample, dfDummies], axis=1)

df_dropped_sample['atom'] = pd.Categorical(df_dropped_sample['atom'])
dfDummies = pd.get_dummies(df_dropped_sample['atom'], prefix='category')
df_dropped_sample = pd.concat([df_dropped_sample, dfDummies], axis=1)

The lines above convert the targeted categorical variables into dummy variables. I know I should have written a short loop instead of copy-pasting, but with only three columns it was quicker this way (a loop version is sketched after the next snippet). Next, I had to drop the original three categorical columns so they would not interfere with my dataframe:

df_dropped_sample = df_dropped_sample.drop(['type', 'type_scc', 'atom'], axis=1)
df_dropped_sample = df_dropped_sample[df_dropped_sample.columns].astype(float)

The second line converts all of the columns to floats; you could cast them to plain integers as well. This step is crucial when it comes time to visualize.
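For reference, the copy-pasted blocks plus the drop could be folded into one short loop. This is a sketch of an alternative, not meant to be run on top of the code above:

import pandas as pd

# One-hot encode each remaining categorical column, then drop the originals
for col in ['type', 'type_scc', 'atom']:
    dummies = pd.get_dummies(df_dropped_sample[col], prefix='category')
    df_dropped_sample = pd.concat([df_dropped_sample, dummies], axis=1)

df_dropped_sample = df_dropped_sample.drop(['type', 'type_scc', 'atom'], axis=1)
df_dropped_sample = df_dropped_sample.astype(float)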

With all of that finished, I decided to use seaborn again to look at a heatmap (remember to run %matplotlib inline beforehand):
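The exact call is missing from the post, but a heatmap of the raw values would presumably look something like this sketch (it can be slow on ~46k rows):

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# Heatmap of the raw (unscaled) values, not their correlations
plt.figure(figsize=(12, 8))
sns.heatmap(df_dropped_sample)
plt.show()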

heatmap of all the data

Wait, what am I even looking at? Where are the correlations? Why do the colors show barely any relationship? Are there no significant relationships between the variables? This brought me more questions than answers. So I decided to focus the visualization a bit more and look at the correlations instead, starting by slicing off the first column:

# leave out the first column before computing correlations
df_sample_viz = df_dropped_sample.iloc[:, 1:]

That showed that the majority of my features have little to no correlation at all. There was only one more step I could take: plot the correlation heatmap itself:

sns.heatmap(df_sample_viz.corr(), center=0, robust=True, square=True)

And look at that: the majority of the features are indeed independent of each other, and very few are correlated at 0.75 or above. Running the snippet below confirmed that the dataframe contains no significant correlations:

abs(df_sample_viz.corr()) > 0.70
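A slightly more readable variant (a sketch, not from the original post) pulls out only the feature pairs whose absolute correlation clears the threshold:

import numpy as np

# List feature pairs with |correlation| > 0.70, ignoring the trivial diagonal
corr = df_sample_viz.corr().abs()
mask = ~np.eye(len(corr), dtype=bool)
pairs = corr.where(mask).stack().sort_values(ascending=False)
print(pairs[pairs > 0.70])  # each pair appears twice because the matrix is symmetric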

So what has my data told me so far? I learned that there were categorical variables getting in the way of my visualizations. I learned that my features are largely uncorrelated, so deciding on dimensionality reduction will be tricky. I also learned that a Kaggle competition is no joke, and I need to work and think differently if I want to participate seriously in the future.

I will continue to work on this and update once I finish running the base model. Thanks for reading.
