Data Manipulation

Understanding Data Manipulation with Python — Pandas Library

Rashmi Duleesha

Published in

LinkIT

5 min read · Jun 19, 2020

Today I will be focusing on data manipulation with Python and why it is important for data science.

What is Data Manipulation?

So the first question is: what is data manipulation? Put simply, it is the process of changing and organizing data so that it is easier to read and analyze.

Let’s start with what data is. Data is a set of values that describes a particular thing, and by studying those values you can discover more about it. Data manipulation plays a major role in that discovery process because it helps you get more value out of the data.

Data manipulation is especially useful when you are dealing with datasets. You can reshape a dataset the way you need: edit, delete, or insert values. It also lets you prepare historical data, which can then be used to make future predictions.
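As a rough sketch of those three operations, here is how an edit, a delete, and an insert might look with the Pandas library covered in the rest of this article. The sales table and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical sales records, invented for this sketch
sales = pd.DataFrame({
    "item": ["pen", "book", "bag"],
    "price": [10, 50, 200],
})

# Edit: change the price in the row whose item is "book"
sales.loc[sales["item"] == "book", "price"] = 55

# Delete: keep every row except the one whose item is "bag"
sales = sales[sales["item"] != "bag"]

# Insert: add a new row by concatenating a one-row DataFrame
new_row = pd.DataFrame({"item": ["ruler"], "price": [15]})
sales = pd.concat([sales, new_row], ignore_index=True)

print(sales)
```

This is only one way to do each step. Pandas offers several alternatives, such as drop() for deleting rows by label.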

Understanding the Pandas Library

In this article, I aim to give a clear understanding of the data manipulation process with the Pandas library. Pandas is an essential Python package designed for data manipulation and analysis.

Pandas is especially helpful when you are dealing with large and complex datasets. It helps you perform matrix-style calculations, run queries and aggregations, find incorrect data or missing values, and visualize data. We analyze data in Pandas with two structures: Series and DataFrame.
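To make the point about missing values concrete, here is a minimal sketch. The readings table is an assumption made up for this example, and filling the gaps with the column mean is just one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings; np.nan marks the missing values
readings = pd.DataFrame({
    "sensor": ["a", "b", "c", "d"],
    "value": [1.5, np.nan, 3.0, np.nan],
})

# Count the missing values in each column
missing_per_column = readings.isna().sum()
print(missing_per_column)

# One possible fix: replace missing values with the column mean
filled = readings.fillna({"value": readings["value"].mean()})
print(filled)
```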

Series

A Series is a one-dimensional array object that can hold any data type. Unlike a Python list, a Series attaches an index label to every value, and you can choose those labels yourself.

Example 1:

import pandas as pd
List1 = [10, 20, 30, 40, 50]
indexes = [0, 1, 2, 3, 4]
# Pass the labels through the index keyword so each value gets one
series = pd.Series(List1, index=indexes)
print(series)

Output:

0    10
1    20
2    30
3    40
4    50
dtype: int64
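The index labels do not have to be numbers. As a small sketch, with subject names and marks invented for illustration, a Series can use strings as labels, and arithmetic applies to every value at once:

```python
import pandas as pd

# Hypothetical marks, labelled by subject instead of by number
marks = pd.Series([72, 85, 90], index=["maths", "science", "english"])

# Look up one value by its label
print(marks["science"])

# Arithmetic is applied to every value, and the labels are kept
bonus_marks = marks + 5
print(bonus_marks)
```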

A Series also acts like a dictionary: each index label works as a key that you can use to look up its value.

Example 2:

import pandas as pd
List1 = [10, 20, 30, 40, 50]
indexes = [5, 1, 2, 3, 4]
series = pd.Series(List1, index=indexes)
# 5 is an index label here, not a position, so this looks up the first value
print(series[5])

Output:

10
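The dictionary comparison goes a little further. Here is a short sketch, reusing the same values and labels, of a few dictionary-style operations that a Series supports:

```python
import pandas as pd

series = pd.Series([10, 20, 30, 40, 50], index=[5, 1, 2, 3, 4])

# "in" checks the index labels, just as it checks the keys of a dict
print(5 in series)
print(10 in series)  # 10 is a value, not a label

# get() returns a default instead of raising an error for a missing label
print(series.get(9, -1))

# A Series converts to a plain dictionary of label -> value
as_dict = series.to_dict()
print(as_dict)
```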

DataFrame

Let’s move on to DataFrames. In contrast to a Series, a DataFrame is a two-dimensional array object. In the real world, a dataset can arrive as a bunch of separate files, which makes the data hard to analyze. Therefore, we combine these files into one DataFrame to analyze the data more effectively.

This is why DataFrames are so important in the data manipulation process. Let’s understand the structure of a DataFrame. It looks like a table, with rows and columns.

Example:

import pandas as pd
# Each dictionary key becomes a column; each list holds that column's values
Employee_table = pd.DataFrame({
    "Id": [100, 101, 102, 103, 104],
    "Name": ['Kamal', 'Sunil', 'Wimal', 'Namal', 'Ranail'],
    "age": [23, 25, 23, 40, 27],
})
print(Employee_table)

Output:

    Id    Name  age
0  100   Kamal   23
1  101   Sunil   25
2  102   Wimal   23
3  103   Namal   40
4  104  Ranail   27
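Going back to the idea of combining several files into one DataFrame, here is a minimal sketch. To keep it self-contained, two in-memory CSV texts stand in for files on disk. With real files you would pass file paths to read_csv instead.

```python
import io
import pandas as pd

# Two in-memory CSV texts standing in for separate files (hypothetical data)
file_a = io.StringIO("Id,Name,age\n100,Kamal,23\n101,Sunil,25\n")
file_b = io.StringIO("Id,Name,age\n102,Wimal,23\n103,Namal,40\n")

# Read each source into its own DataFrame
frames = [pd.read_csv(f) for f in (file_a, file_b)]

# Stack them into one DataFrame and renumber the rows from 0
combined = pd.concat(frames, ignore_index=True)
print(combined)
```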

Converting Dictionary into DataFrame

For small tasks, a plain dictionary may be faster. But when you need to deal with more complex datasets, DataFrames are more useful. When a dictionary is converted, each key becomes a column name and its list of values becomes that column.

Example:

import pandas as pd

# Named data rather than dict, so the built-in dict type is not shadowed
data = {'Name': ['Kaml', 'Nimal'], 'ID': [100, 101]}

df = pd.DataFrame.from_dict(data)

print(df)

Output:

    Name   ID
0   Kaml  100
1  Nimal  101
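The keys do not have to become columns. As a sketch with made-up records, passing orient="index" to from_dict turns each key into a row label instead, and the columns argument names the columns:

```python
import pandas as pd

# Hypothetical records keyed by employee ID
records = {100: ["Kamal", 23], 101: ["Nimal", 25]}

# orient="index" makes each key a row label rather than a column name
df = pd.DataFrame.from_dict(records, orient="index", columns=["Name", "age"])
print(df)
```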

Basic statistical operations

1. info()

The info() function prints a summary of a DataFrame: the index range, the column names, the non-null count and data type of each column, and the memory usage.

Example:

import pandas as pd
Employee_table=pd.DataFrame({
    "Id": [100, 101, 102, 103, 104],
    "Name": ['Kamal', 'Sunil', 'Wimal', 'Namal', 'Ranail'],
    "age": [23, 25, 23, 40, 27],
    "sal": [100, 150, 100, 200, 150],

})
# info() prints the summary itself and returns None, so call it directly
# rather than wrapping it in print()
Employee_table.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Id      5 non-null      int64
 1   Name    5 non-null      object
 2   age     5 non-null      int64
 3   sal     5 non-null      int64
dtypes: int64(3), object(1)
memory usage: 204.0+ bytes

2. describe()

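describe() shows basic statistics for the numeric columns: count, mean, standard deviation, min, max, and the quartiles. The 50% row is the median. Non-numeric columns such as Name are left out by default.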

Example:

import pandas as pd
Employee_table=pd.DataFrame({
    "Id": [100, 101, 102, 103, 104],
    "Name": ['Kamal', 'Sunil', 'Wimal', 'Namal', 'Ranail'],
    "age": [23, 25, 23, 40, 27],
    "sal": [100, 150, 100, 200, 150],

})
print(Employee_table.describe())

Output:

               Id        age         sal
count    5.000000   5.000000    5.000000
mean   102.000000  27.600000  140.000000
std      1.581139   7.127412   41.833001
min    100.000000  23.000000  100.000000
25%    101.000000  23.000000  100.000000
50%    102.000000  25.000000  150.000000
75%    103.000000  27.000000  150.000000
max    104.000000  40.000000  200.000000
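Two small follow-ups may help here, sketched on the same table: the 50% row can be checked against median(), and include="all" asks describe() to summarize the text column as well.

```python
import pandas as pd

Employee_table = pd.DataFrame({
    "Id": [100, 101, 102, 103, 104],
    "Name": ['Kamal', 'Sunil', 'Wimal', 'Namal', 'Ranail'],
    "age": [23, 25, 23, 40, 27],
    "sal": [100, 150, 100, 200, 150],
})

stats = Employee_table.describe()

# The 50% row of describe() matches the median of the column
age_from_describe = stats.loc["50%", "age"]
age_median = Employee_table["age"].median()
print(age_from_describe, age_median)

# include="all" adds rows such as unique, top and freq for the Name column
full_stats = Employee_table.describe(include="all")
print(full_stats)
```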

3. loc and iloc

These indexers are used to select and filter data. Both are used with square brackets rather than called like functions. loc selects rows and columns by their labels, while iloc selects them by their integer positions, counting from 0.

Example: iloc

import pandas as pd
Employee_table=pd.DataFrame({
    "Id": [100, 101, 102, 103, 104],
    "Name": ['Kamal', 'Sunil', 'Wimal', 'Namal', 'Ranail'],
    "age": [23, 25, 23, 40, 27],
    "sal": [100, 150, 100, 200, 150],

})

print(Employee_table.iloc[0])

Output:

Id        100
Name    Kamal
age        23
sal       100
Name: 0, dtype: object
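iloc can do more than pick one row. As a short sketch on the same table, it also accepts slices and a row position together with column positions:

```python
import pandas as pd

Employee_table = pd.DataFrame({
    "Id": [100, 101, 102, 103, 104],
    "Name": ['Kamal', 'Sunil', 'Wimal', 'Namal', 'Ranail'],
    "age": [23, 25, 23, 40, 27],
    "sal": [100, 150, 100, 200, 150],
})

# Rows at positions 0 and 1 (the end of a position slice is excluded)
first_two = Employee_table.iloc[0:2]
print(first_two)

# A single cell: row position 3, column position 1 (the Name column)
cell = Employee_table.iloc[3, 1]
print(cell)

# Last row, but only the columns at positions 1 and 3 (Name and sal)
last_row_subset = Employee_table.iloc[-1, [1, 3]]
print(last_row_subset)
```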

Example 1: loc

import pandas as pd
Employee_table=pd.DataFrame([[2000, 3000], [6000, 5000], [2000, 3000]],
     index=['Kamal', 'Sunil', 'Nimal'],
     columns=['Gross_sal', 'net_sal'])
print(Employee_table)

Output:

       Gross_sal  net_sal
Kamal       2000     3000
Sunil       6000     5000
Nimal       2000     3000

Example 2:

print(Employee_table.loc[['Kamal']])

Output:

       Gross_sal  net_sal
Kamal       2000     3000
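loc also accepts a row label together with a column label, a slice of labels, or a condition. Here is a brief sketch on the same salary table:

```python
import pandas as pd

Employee_table = pd.DataFrame([[2000, 3000], [6000, 5000], [2000, 3000]],
     index=['Kamal', 'Sunil', 'Nimal'],
     columns=['Gross_sal', 'net_sal'])

# One value: row label "Sunil", column label "net_sal"
one_value = Employee_table.loc["Sunil", "net_sal"]
print(one_value)

# A label slice includes both ends, unlike a position slice with iloc
label_slice = Employee_table.loc["Kamal":"Sunil"]
print(label_slice)

# A condition keeps only the rows where it is True
high_gross = Employee_table.loc[Employee_table["Gross_sal"] > 2500]
print(high_gross)
```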

Performing aggregation functions

Example:

import pandas as pd
dataset1=pd.DataFrame([[100,15,4],[200,60,8],[500,10,5],[300,80,7]],
                      columns=['Set1','set2','set3'])
print(dataset1)

Output:

   Set1  set2  set3
0   100    15     4
1   200    60     8
2   500    10     5
3   300    80     7

Getting the sum and min with the agg() function

# agg() applies each named function to every column
summary = dataset1.agg(['sum', 'min'])
print(summary)

Output:

     Set1  set2  set3
sum  1100   165    24
min   100    10     4
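agg() does not have to apply the same functions to every column. As a sketch on the same dataset, a dictionary picks a function per column, and the axis argument switches to row-wise aggregation:

```python
import pandas as pd

dataset1 = pd.DataFrame([[100, 15, 4], [200, 60, 8], [500, 10, 5], [300, 80, 7]],
                        columns=['Set1', 'set2', 'set3'])

# A different function for each chosen column
per_column = dataset1.agg({"Set1": "mean", "set2": "max"})
print(per_column)

# axis="columns" aggregates across each row instead of down each column
row_totals = dataset1.agg("sum", axis="columns")
print(row_totals)
```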

The Pandas library is the heart of data manipulation in Python, and it has many more features. It can merge and join datasets, filter data on conditions, sort data in ascending or descending order, read files in CSV and other formats, and more.
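To give a taste of three of those features, here is a minimal sketch of merging, filtering, and sorting. The two small tables are assumptions made up for this example.

```python
import pandas as pd

# Two hypothetical tables that share an Id column
employees = pd.DataFrame({"Id": [100, 101, 102],
                          "Name": ["Kamal", "Sunil", "Wimal"]})
salaries = pd.DataFrame({"Id": [100, 101, 102],
                         "sal": [100, 150, 100]})

# Merge: join the two tables on their shared Id column
merged = employees.merge(salaries, on="Id")
print(merged)

# Filter: keep only the rows where the salary is above 100
well_paid = merged[merged["sal"] > 100]
print(well_paid)

# Sort: highest salary first, then by name in ascending order
by_salary = merged.sort_values(["sal", "Name"], ascending=[False, True])
print(by_salary)
```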

The Pandas library is also used in data analysis and data cleaning. Moreover, it acts as a high-level building block when you are working with datasets.

I hope you learned the basics of DataFrames and their main functions. When you are dealing with real-world scenarios, these small concepts help you understand complex datasets and find more effective ways to solve real-world problems.

Thanks for reading!

Written by Rashmi Duleesha