An important disclaimer: if you are the kind of person who just reads and moves on, this is not for you. Make sure you have your favourite IDE, Jupyter Notebook, or Google Colab open to code along. Believe me, it’s very easy to read, but when the time comes to get our hands dirty, we feel lost. So there is no option other than practicing EDA. The more you practice, the more it becomes muscle memory.
ACCESSING THE DATA
Before we get into the details of EDA, let’s see how to address the data. What I mean is this: imagine your data is in an Excel sheet and you want to play with different rows, columns, and cells in it.
If you master this, EDA will be an easy task.
Please note that there are often multiple ways to do the same processing in Python, depending on which library we use. In this session I will walk you through loc and iloc for indexing and slicing data. Indexing and slicing basically mean getting a subset, or all, of your raw data.
Enough talking, let’s get to the code.
• loc (DataFrame.loc[rows, columns])
.loc is primarily label-based, but it can also be used with a Boolean array. We will use the pandas and NumPy libraries. We create random numbers with NumPy’s np.random.randn function, so the data frame (tabular data in pandas is called a DataFrame) will have 4 rows and 3 columns.
In the code below I’ve explicitly provided the index and column names for the data frame.
Please note that loc works on labels. Below is the code and its output.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4, 3), index=['1', '2', '3', '4'], columns=['A', 'B', 'C'])
print(df)

          A         B         C
1  0.591924 -0.137679  0.519796
2  1.053280  0.614351  0.797955
3 -0.858993  1.488301  1.037337
4  1.657133 -1.171512  1.270766
Now that we have our data in tabular format, it’s time to access it. Think of this as data in an Excel sheet.
If we need to access A4, below is the code. Here “4” is the row reference and “A” is the column reference.
print(df.loc["4","A"])
1.6571329061056486
If we need to access multiple rows and columns, A1 to B3, below is the code. Here “1”:“3” is the row reference and “A”:“B” is the column reference. Note that label slices in loc include both endpoints.
print(df.loc["1":"3","A":"B"])
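The two loc lookups above can be tried on a reproducible copy of the table. This is only a minimal sketch: it assumes we hard-code the rounded values printed earlier instead of calling np.random.randn, so the numbers are the same on every run. The last part illustrates that loc really does work on labels, since our row labels are strings.

```python
import pandas as pd

# Rounded values copied from the printed table (an assumption made for
# reproducibility; the original frame was generated randomly).
df = pd.DataFrame(
    [[0.591924, -0.137679, 0.519796],
     [1.053280, 0.614351, 0.797955],
     [-0.858993, 1.488301, 1.037337],
     [1.657133, -1.171512, 1.270766]],
    index=["1", "2", "3", "4"],
    columns=["A", "B", "C"],
)

# Single cell: row label "4", column label "A".
cell = df.loc["4", "A"]
print(cell)

# Label slices include both endpoints, so this returns rows "1" to "3"
# and columns "A" to "B" (3 rows x 2 columns).
block = df.loc["1":"3", "A":"B"]
print(block)

# The row labels are strings, so the integer 4 is not a valid label.
try:
    df.loc[4, "A"]
    int_label_found = True
except KeyError:
    int_label_found = False
print(int_label_found)
```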
If we need an entire row or column, or several of them, below is the code. Here I’m selecting every row using “:” and only one column, “A”.
print(df.loc[:,"A"])
1    0.591924
2    1.053280
3   -0.858993
4    1.657133
Name: A, dtype: float64
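The same “:” idea extends to several columns or to a whole row. Here is a hedged sketch, again assuming a hard-coded copy of the table printed earlier; the variable names col_a, cols_ac, and row_2 are just illustrative.

```python
import pandas as pd

# Rounded values copied from the printed table (assumed for reproducibility).
df = pd.DataFrame(
    [[0.591924, -0.137679, 0.519796],
     [1.053280, 0.614351, 0.797955],
     [-0.858993, 1.488301, 1.037337],
     [1.657133, -1.171512, 1.270766]],
    index=["1", "2", "3", "4"],
    columns=["A", "B", "C"],
)

# All rows, one column: the result is a Series named "A".
col_a = df.loc[:, "A"]

# All rows, a list of columns: the result is a DataFrame.
cols_ac = df.loc[:, ["A", "C"]]

# One row, all columns: the result is a Series named after the row label.
row_2 = df.loc["2", :]

print(col_a)
print(cols_ac)
print(row_2)
```

Passing a single label gives back a Series, while passing a list of labels keeps the result as a DataFrame.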
You can also apply a condition to get a Boolean result from the data frame. Applying a condition to row 1 returns True for every column where the condition is met; the result is a Series with Name: 1, dtype: bool.
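To make the Boolean idea concrete, here is a minimal sketch. The condition “greater than 0” is my assumption for illustration, and the table is a hard-coded copy of the values printed earlier. It also shows the other use mentioned at the start: passing a Boolean array to loc to keep only the matching rows.

```python
import pandas as pd

# Rounded values copied from the printed table (assumed for reproducibility).
df = pd.DataFrame(
    [[0.591924, -0.137679, 0.519796],
     [1.053280, 0.614351, 0.797955],
     [-0.858993, 1.488301, 1.037337],
     [1.657133, -1.171512, 1.270766]],
    index=["1", "2", "3", "4"],
    columns=["A", "B", "C"],
)

# Compare every value in row "1" with 0 (assumed condition).
# The result is a Boolean Series that keeps the row label as its name.
row_mask = df.loc["1"] > 0
print(row_mask)

# A Boolean Series can also be passed to loc to filter rows:
# keep only the rows where column A is positive.
positive_a = df.loc[df["A"] > 0]
print(positive_a)
```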
• iloc (DataFrame.iloc[rows, columns])
iloc works on the position of the data; the rest is the same as with loc. Please note that indexing in Python starts at 0. The snippets in this part mirror the loc examples above and use the same data frame, df.
If we need the value in column A of the 4th row, below is the code using iloc. Here, the 3 in “df.iloc[3,0]” is the position of the fourth row because, as I mentioned earlier, positions start from 0. And 0 is the first column in the data frame (df).
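A minimal sketch of that lookup, assuming a hard-coded copy of the table printed in the loc part so the result is reproducible:

```python
import pandas as pd

# Rounded values copied from the printed table (assumed for reproducibility).
df = pd.DataFrame(
    [[0.591924, -0.137679, 0.519796],
     [1.053280, 0.614351, 0.797955],
     [-0.858993, 1.488301, 1.037337],
     [1.657133, -1.171512, 1.270766]],
    index=["1", "2", "3", "4"],
    columns=["A", "B", "C"],
)

# Position 3 is the fourth row, position 0 is the first column.
cell_by_position = df.iloc[3, 0]
print(cell_by_position)

# The same cell addressed by label with loc.
cell_by_label = df.loc["4", "A"]
print(cell_by_position == cell_by_label)
```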
If we need to access multiple rows and columns, A1 to B3, below is the code. As I said, .iloc is primarily integer-position based (from 0 to length-1 of the axis). So you can read it as df.iloc[start:stop, start:stop], and in iloc the “stop” position is excluded. In the case of df.iloc[0:3, 0:2], the 0:3 returns rows 0 to 2 and, in the same way, the 0:2 returns columns 0 to 1.
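The slice can be sketched like this, again assuming a hard-coded copy of the table from the loc part. The comparison at the end is there to show the difference in endpoints: iloc excludes the stop position, while loc includes the stop label.

```python
import pandas as pd

# Rounded values copied from the printed table (assumed for reproducibility).
df = pd.DataFrame(
    [[0.591924, -0.137679, 0.519796],
     [1.053280, 0.614351, 0.797955],
     [-0.858993, 1.488301, 1.037337],
     [1.657133, -1.171512, 1.270766]],
    index=["1", "2", "3", "4"],
    columns=["A", "B", "C"],
)

# Rows at positions 0, 1, 2 and columns at positions 0, 1 (stop is excluded).
block_by_position = df.iloc[0:3, 0:2]
print(block_by_position)

# The equivalent label slice with loc, where both endpoints are included.
block_by_label = df.loc["1":"3", "A":"B"]
print(block_by_position.equals(block_by_label))
```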
I hope you like it.
No matter what books or blogs or courses or videos one learns from, when it comes to implementation everything can look like “Outside the Curriculum”.
The best way to learn is by doing! The best way to learn is by teaching what you have learned!
Never give up!
See you on LinkedIn!