Pandas: First Step Towards Data Science
I started learning Data Science like everyone else, by creating my first model using some machine learning technique. My first line of code was:
import pandas as pd
Apart from noticing the cuddly bear name, I didn’t pay much attention to this library, though I used it a lot while creating models. Soon I realized that I was underestimating the power of Pandas. It can do more than Kung-fu, and that is what we are going to learn through this series of articles, where I explore the Pandas library to gain skills that help us analyze data in depth.
In this article, we will understand:
- How to read data using Pandas?
- How is data stored?
- How can we access data?
What is Pandas?
Pandas is a Python library for data analysis and manipulation. In other words, Pandas revolves around data. The data we read through Pandas is most commonly in Comma Separated Values (CSV) format.
How to read data?
We use the read_csv() function to read a CSV file. It is the first line of code that most of us come across when we start using the Pandas library. Remember to import pandas before you start coding.
import pandas as pd
titanic_data = pd.read_csv("../Dataset/titanic.csv")
In this article we are going to use the Titanic dataset, which you can access from here. After reading the data with pd.read_csv(), we store it in a variable, titanic_data, which is of type DataFrame.
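To try read_csv() without downloading anything, here is a minimal sketch that reads from an in-memory string instead of a file. The variable names csv_text and sample are made up for illustration, and the two rows are a cut-down copy of values that appear in the outputs later in this article, not the full Titanic file.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk: a header line plus two data rows.
# Names that contain commas are wrapped in double quotes.
csv_text = """PassengerId,Survived,Pclass,Name
1,0,3,"Braund, Mr. Owen Harris"
3,1,3,"Heikkinen, Miss. Laina"
"""

# read_csv() accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))   # the result is a pandas DataFrame
print(sample.shape)   # (2, 4): two rows, four columns
```

Swapping io.StringIO(csv_text) for a path such as "../Dataset/titanic.csv" gives the same kind of object, just with more rows and columns.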
What is a DataFrame?
A DataFrame is a collection of data arranged in rows and columns. Technically, DataFrames are made up of individual Series. A Series is a one-dimensional list of values in which every value is paired with an index label. Let's understand this with some example code.
# We use pd.Series() to create a series in Pandas
>> colors = pd.Series(['Blue','Green'])
>> print(colors)
output:
0 Blue
1 Green
dtype: object
>> names_list = ['Ram','Shyam']
>> names = pd.Series(names_list)
>> print(names)
output:
0 Ram
1 Shyam
dtype: object
We pass a list to pd.Series(), which creates a series with an index. By default, the index starts at 0. However, we can replace it with labels of our own by passing them through the index parameter.
>> index = pd.Series(["One","Two"])
>> colors = pd.Series(['Blue','Green'],index = index)
>> print(colors)
output:
One Blue
Two Green
dtype: object
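Once a series has custom labels, those labels can be used to look values up. The short sketch below rebuilds the colors series with a plain list of labels (an equivalent shortcut, chosen here for brevity) and reads a value both by label and by position.

```python
import pandas as pd

colors = pd.Series(['Blue', 'Green'], index=['One', 'Two'])

# The labels are stored on the series itself
print(list(colors.index))   # ['One', 'Two']

# Look up a value by its label
print(colors['Two'])        # Green

# Look up the same value by its integer position
print(colors.iloc[1])       # Green
```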
Now, coming back to our definition, a DataFrame is a collection of individual Series. Let us use the colors and names series to create a DataFrame. Both series need the same index so that their rows line up, so we first recreate colors with the default index.
>> colors = pd.Series(['Blue','Green'])
>> df = pd.DataFrame({"Colors":colors,"Names":names})
>> print(df)
output:
Colors Names
0 Blue Ram
1 Green Shyam
We used pd.DataFrame() to create a DataFrame and passed a dictionary to it. The keys of this dictionary are the column names, and each value is the series holding the data for that column. So, from the above example, you can see that a DataFrame is nothing but a collection of series that share an index. We can also change the index of a DataFrame in the same manner as we did with a series.
>> index = pd.Series(["One","Two"])
>> colors = pd.Series(['Blue','Green'],index = index)
>> names = pd.Series(['Ram','Shyam'],index = index)
# Creating a Dataframe
>> data = pd.DataFrame({"Colors":colors,"Names":names},index=index)
>> print(data)
output:
Colors Names
One Blue Ram
Two Green Shyam
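One detail worth knowing here: when a DataFrame is built from several series, Pandas matches the values by index label, not by their order in the list. The sketch below is an illustration of that behaviour; it reuses the values from above but deliberately gives names its labels in reverse order.

```python
import pandas as pd

colors = pd.Series(['Blue', 'Green'], index=['One', 'Two'])
# Same labels as colors, but listed in the opposite order
names = pd.Series(['Ram', 'Shyam'], index=['Two', 'One'])

df = pd.DataFrame({"Colors": colors, "Names": names})

# Each row combines the values that share a label
print(df.loc['One', 'Colors'], df.loc['One', 'Names'])   # Blue Shyam
print(df.loc['Two', 'Colors'], df.loc['Two', 'Names'])   # Green Ram
```

This is why series that are meant to sit side by side should carry the same index. Labels that exist in only one of the series end up as missing values (NaN) in the other column.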
So far we have understood how to read CSV data and how this data is represented. Let's move on to how we can access it.
How to access data from DataFrames?
There are two ways to access data from a DataFrame:
- Column-wise
- Row-wise
Column-wise
First of all, let us check the columns in our Titanic data.
>> print(titanic_data.columns)
output:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
We can now access a column by its name in two ways: either by using the column name as a property of our dataset object, or by using the column name as an index of our dataset object. The advantage of using the column name as an index is that it also works for column names containing spaces, such as “First Name” or “Last Name”, which cannot be written as a property.
# Using column name as property
>> print(titanic_data.Name)
output:
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
....
Name: Name, Length: 891, dtype: object
# Using column name as index
>> print(titanic_data['Name'])
output:
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
....
Name: Name, Length: 891, dtype: object
>> print(titanic_data['Name'][0])
output:
Braund, Mr. Owen Harris
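To see why the index style is the safer habit, here is a small sketch on a made-up DataFrame. The people variable and its "count" column are hypothetical, chosen only to show two cases where property access does not give you the column.

```python
import pandas as pd

# Hypothetical data: one column name contains a space, the other
# ("count") has the same name as an existing DataFrame method
people = pd.DataFrame({"First Name": ["Ram", "Shyam"], "count": [1, 2]})

# Index style works for any column name
print(people["First Name"])

# people.First Name   <- not valid Python, so property style is impossible here

# Property style returns the DataFrame.count method, not the column
print(callable(people.count))    # True

# Index style returns the column itself
print(list(people["count"]))     # [1, 2]
```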
Row-wise
In order to access data row-wise, we use the loc and iloc indexers. Note that both are used with square brackets, not parentheses. Let's take a look at some examples to understand them.
# Using loc to display a row
>> print(titanic_data.loc[0])
output:
PassengerId 1
Survived 0
Pclass 3
Name Braund, Mr. Owen Harris
Sex male
Age 22
SibSp 1
Parch 0
Ticket A/5 21171
Fare 7.25
Cabin NaN
Embarked S
Name: 0, dtype: object
# Using iloc to display a row
>> print(titanic_data.iloc[0])
output: same as above
>> print(titanic_data.loc[0,'Name'])
output:
Braund, Mr. Owen Harris
>> print(titanic_data.iloc[0,3])
output: same as above
As we saw in the code above, we access rows using their index values, and to drill down to a specific value in a row we use either the column name or the column position. Remember that column positions, like row positions, start from 0, so the first column, “PassengerId”, is at position 0 and “Name” is at position 3. We also saw the difference between loc and iloc: both select the same data, but loc looks rows and columns up by label, while iloc looks them up by integer position.
We can also access more than one row at a time, with all or some of the columns. Let's see how.
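With the default index, row labels and row positions happen to be the same numbers, so the two indexers can look interchangeable. The sketch below rebuilds the small Colors and Names DataFrame from earlier with the labels "One" and "Two" to make the difference visible.

```python
import pandas as pd

data = pd.DataFrame(
    {"Colors": ["Blue", "Green"], "Names": ["Ram", "Shyam"]},
    index=["One", "Two"],
)

# loc uses labels: row "Two", column "Names"
print(data.loc["Two", "Names"])   # Shyam

# iloc uses positions: second row, second column
print(data.iloc[1, 1])            # Shyam

# loc does not understand positions when the labels are strings
try:
    data.loc[1]
except KeyError:
    print("loc[1] fails: no row is labelled 1")
```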
# To display whole dataset
>> print(titanic_data.loc[:]) # or titanic_data.iloc[:]
output:
PassengerId Survived Pclass .....
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
...
[891 rows x 12 columns]
# To display first four rows with Name and Ticket
# loc includes the end label 3, while iloc stops before the end position, hence :4
>> print(titanic_data.loc[:3,["Name","Ticket"]]) # or titanic_data.iloc[:4,[3,8]]
output:
Name Ticket
0 Braund, Mr. Owen Harris A/5 21171
1 Cumings, Mrs. John Bradley (Flor... PC 17599
2 Heikkinen, Miss. Laina STON/O2. 3101282
3 Futrelle, Mrs. Jacques Heath.... 113803
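Slices behave differently in the two indexers: a loc slice includes its end label, while an iloc slice stops just before its end position, like ordinary Python slicing. A minimal sketch on a made-up five-row DataFrame (the numbers variable and its values are hypothetical):

```python
import pandas as pd

# Hypothetical data with the default index 0..4
numbers = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

by_label = numbers.loc[:3]       # labels 0, 1, 2 and 3
by_position = numbers.iloc[:3]   # positions 0, 1 and 2

print(len(by_label))      # 4
print(len(by_position))   # 3
```

So selecting the first four rows takes loc[:3] but iloc[:4].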
I hope you now have an idea of how to use loc and iloc, and understand the difference between the two. With this we come to the end of this article. We will continue exploring the Pandas library in the second part, but till then keep practicing. Happy Coding!