Basic Python Pandas functions that will help you in data science
For the past couple of weeks, I’ve been working on data cleaning. This is a really easy task if you only need to clean a few records. It becomes hideous if you have to clean a thousand rows of data manually. That is where the great Pandas comes to the rescue.
You might wonder how such a cute fluffy animal can help you with this problem. It is not a real panda, but a Python package called Pandas. You can read more about what Python is here. Python has a lot of packages, and Pandas is one of them. This package focuses on data analysis. It works like Excel, but in Python.
Okay, let’s go to how to use Python Pandas. The first thing you need to do is install Python on your computer. You can read how to install Python here. You can check whether Python is installed correctly by typing this in your terminal:
python
Your terminal should show something like this:
The next step is to install Pandas. You can do that by simply typing this in your terminal:
pip install pandas
Now you should be ready to play with Python Pandas. To start using it, open your terminal and type:
python
Same as before, you should see something like this:
Then you need to import the Pandas package by typing:
import pandas as pd
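If you want a quick way to confirm that the import worked, here is a minimal sketch. It only prints the installed version, and the number you see will depend on your own installation.

```python
import pandas as pd

# If the import above did not fail, Pandas is installed.
# Printing the version string is a simple sanity check.
print(pd.__version__)
```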
If you want to read the official documentation of Python Pandas, you can click this link. Here are some of its basic functions:
1. Read CSV
You can download the CSV file here. Then go to the folder that contains the CSV file in your terminal, run python, import pandas again, and write this script:
dataframe = pd.read_csv('file_name.csv')
This loads your CSV data into the dataframe variable. You can view your data with these commands:
# if you want to show all rows
dataframe
# only show top 5 rows
dataframe.head()
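If you do not have a CSV file at hand, the same steps can be tried with a small made-up one. This is only a sketch: the columns name and age and their values are invented for illustration, and io.StringIO stands in for a real file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content, used here in place of a real file.
csv_text = "name,age\nAlice,30\nBob,25\nCara,41\n"

# read_csv accepts a file path or any file-like object.
dataframe = pd.read_csv(io.StringIO(csv_text))

print(dataframe.shape)    # (number of rows, number of columns)
print(dataframe.head(2))  # only the first 2 rows
```

With a real file, you would pass its name to pd.read_csv instead of the StringIO object.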
You can select which rows you want to see by using these commands:
# show only the first row (row labels start at 0 by default)
dataframe.loc[0]
# show a range of rows (the rows labelled 1 and 2)
dataframe.loc[range(1, 3)]
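Selecting rows by label and by position can be easy to mix up, so here is a small sketch that compares the two. The letter column is made up for illustration, and .iloc is the position-based counterpart of .loc.

```python
import pandas as pd

# Hypothetical dataframe with the default row labels 0, 1, 2, 3.
dataframe = pd.DataFrame({"letter": ["a", "b", "c", "d"]})

first_row = dataframe.iloc[0]      # first row, selected by position
same_row = dataframe.loc[0]        # same row here, because the labels start at 0
by_label = dataframe.loc[1:2]      # label slice: both ends are included
by_position = dataframe.iloc[1:3]  # position slice: the end is excluded

print(by_label)
print(by_position)
```

Both slices return the rows holding "b" and "c". The difference only starts to matter once the row labels are no longer 0, 1, 2 and so on, for example after filtering or sorting.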
2. Write CSV
dataframe.to_csv('new_file_name.csv', index=False)
This writes the dataframe to new_file_name.csv. Passing index=False leaves the row labels out of the file.
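To see what the index argument changes, here is a short sketch with made-up data. It relies on to_csv returning the CSV text as a string when no file name is given, so nothing is written to disk.

```python
import pandas as pd

# Hypothetical data, only for illustration.
dataframe = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})

# With no file name, to_csv returns the CSV text instead of writing a file.
with_index = dataframe.to_csv()
without_index = dataframe.to_csv(index=False)

print(with_index.splitlines()[0])     # header begins with an empty column for the row labels
print(without_index.splitlines()[0])  # header contains only the real columns
```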
3. Data Manipulation
This is the part that helped me with the data cleaning task. For this example, we are not using the CSV data. Instead, we are going to create a dataframe ourselves. To do that, we can type this command:
new_dataframe = pd.DataFrame({"integer_col": [1,2,3,4,5], "string_col": ['hello', 'my', 'beautiful', 'world', '!'], "float_col": [0.1, 0.2, 3.3, 4.5, 52.2348], "boolean_col": [True, False, True, True, False]})
After that, you should get this dataframe
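If you want to inspect the result yourself, a sketch like the one below prints the dataframe together with its size and the type of each column. It uses the same data as above, only spread over several lines for readability.

```python
import pandas as pd

new_dataframe = pd.DataFrame({
    "integer_col": [1, 2, 3, 4, 5],
    "string_col": ["hello", "my", "beautiful", "world", "!"],
    "float_col": [0.1, 0.2, 3.3, 4.5, 52.2348],
    "boolean_col": [True, False, True, True, False],
})

print(new_dataframe)         # the whole table
print(new_dataframe.shape)   # (number of rows, number of columns)
print(new_dataframe.dtypes)  # the data type Pandas picked for each column
```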
If you want to get only the rows where boolean_col is True, you can use this command:
new_dataframe.loc[(new_dataframe.boolean_col == True)]
And you will get this result
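It may help to see the filter in two steps. The sketch below uses a trimmed copy of the dataframe (only two of its columns) and a helper variable called mask, which is my own name for illustration.

```python
import pandas as pd

new_dataframe = pd.DataFrame({
    "integer_col": [1, 2, 3, 4, 5],
    "boolean_col": [True, False, True, True, False],
})

# The comparison builds a True/False mask with one value per row.
mask = new_dataframe["boolean_col"] == True
print(mask.tolist())

# .loc keeps only the rows where the mask is True.
true_rows = new_dataframe.loc[mask]
print(true_rows["integer_col"].tolist())

# For a column that is already boolean, the column itself can serve as the mask.
same_rows = new_dataframe.loc[new_dataframe["boolean_col"]]
```

The filtered rows keep their original labels (0, 2 and 3 here), which is worth remembering when you select by label afterwards.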
If you want to change the boolean_col value of the row with index label 4 (the fifth row) to True, you can simply do this:
new_dataframe.loc[4, 'boolean_col'] = True
So, when you look for the rows where boolean_col is True, you will get that row as well. You can also change multiple rows at once. Example:
new_dataframe.loc[(new_dataframe.boolean_col == True), 'string_col'] = "okay fine :("
By doing that your dataframe will become like this:
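Here is a self-contained sketch of the same kind of conditional update. Note one assumption: it starts from the original values of boolean_col, without the earlier change to the row labelled 4, so that row keeps its text here.

```python
import pandas as pd

new_dataframe = pd.DataFrame({
    "string_col": ["hello", "my", "beautiful", "world", "!"],
    "boolean_col": [True, False, True, True, False],
})

# The first argument of .loc picks the rows, the second picks the column.
mask = new_dataframe["boolean_col"]
new_dataframe.loc[mask, "string_col"] = "okay fine :("

print(new_dataframe["string_col"].tolist())
```

Only the rows where the mask is True are overwritten. The other rows are left as they were.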
If you want to set a null value, you need to import NumPy:
import numpy as np
After that you can set the null value by doing this:
new_dataframe.loc[4, 'boolean_col'] = np.nan
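Once a dataframe contains null values, you usually want to find them and decide what to do with them. The sketch below is one possible way, not the only one. As an assumption for illustration, it puts the null value in float_col instead of boolean_col, because a float column can hold NaN without changing its type.

```python
import numpy as np
import pandas as pd

new_dataframe = pd.DataFrame({
    "integer_col": [1, 2, 3, 4, 5],
    "float_col": [0.1, 0.2, 3.3, 4.5, 52.2348],
})

# Set a null value in the row labelled 4.
new_dataframe.loc[4, "float_col"] = np.nan

# isna() marks the missing cells; sum() counts them per column.
missing_per_column = new_dataframe.isna().sum()
print(missing_per_column)

# dropna() returns a copy without the rows that contain a missing value.
cleaned = new_dataframe.dropna()

# fillna() returns a copy with the missing values replaced.
filled = new_dataframe.fillna({"float_col": 0.0})
```

Both dropna() and fillna() leave new_dataframe itself unchanged, so you can compare the results before deciding which one to keep.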
There are still many other interesting features, but this should be enough to clean a basic CSV file. I hope this tutorial helps you.