This Pandas skill will get you a job !!!

Published in

Lintang Gilang

9 min readNov 17, 2022

A data practitioner really needs to master various skills to complete the data analysis process effectively and efficiently. For this, various tools are needed with their respective functions. One of the tools most frequently used by data practitioners is Pandas, which is used in the Python programming language.

1. What is Pandas ?

what is pandas? Pandas is an open source and licensed Python library which is often used by data analysts and data scientists when exploring data. Pandas is used to create tables, modify dimension data, inspect data, and so on. The basic data structure in Pandas is named DataFrame, which makes it easier for us to read files with many types of formats such as .txt, .csv, and .tsv files. This feature will create tables and can also process data using operations such as join, different, group by, aggregation, and other techniques found in SQL.

A data practitioner really needs to master various skills to complete the data analysis process effectively and efficiently. For this, various tools are needed with their respective functions. One of the tools most frequently used by data practitioners is Pandas, which is used in the Python programming language. Pandas is based on another package called Numpy, which supports multidimensional arrays.

The Pandas library has two types of data structures for the latest version :
1. Series
2. Data Frame

What is Series? Series is like a one-dimensional array as well as a numpy array. Pandas series have indexes and we can control the index of each of those elements.

What is a Data frame? A data frame is a two-dimensional array with rows and columns. This data structure is the most standard way to store data in the form of tables/tabular data.

2. Import Pandas

If you’re going to use pandas, then you need to make sure it is included in your python environment. The way you do think is by importing pandas.

Importing pandas means bringing all of the pandas functionality to your finger tips in your python script or jupyter notebook. For best practice using pandas, we can also import numpy.

import pandas as pd
import numpy as np

Let’s break down what this statement means.

What does import pandas as pd mean?

Import = Bring this functionality or library to my python script
Pandas = The library you want to import, in this case, it’s pandas
As = The python nomenclature for creating as alias. This is a fancy way of taking a long word and referencing it as a short word
pd = The standard short name for referencing pandas

3. Try Data Frames

A DataFrame in Pandas is a 2-dimensional, labeled data structure which is similar to a SQL Table or a spreadsheet with columns and rows. Each column of a DataFrame can contain different data types.

To better understand what’s in the Pandas library, we start by creating a dictionary in python.

cars = {
  'Brand': ['Brio', 'Jazz', 'HRV', 'CRV', 'Civic', 'Mobilio', 'BRV', 'City'],
  'Price': [148, 265, 364, 515, 586, 217, 281, 333],
  'Status' : ['sold','sold','unsold','sold','unsold','unsold','sold','unsold'],
  'ProdYear' : [2020, 2019, 2021, 2020, 2022, 2021, 2023, 2022]
}

This dummy data — used car sales. Let’s convert the dictionary above into a pandas dataframe.

The dataframe in pandas is usually referred to as df

df = pd.DataFrame(cars)

4. Head and Tail

df.head()

used to display the top n data

df.head(n)

used to display the top n data

Tail

df.tail()

used to display the last 5 data

df.tail(n)

used to display the top last n data

5. Slicing

The indexing and attribute selection operators are great because they work just like they do in the rest of the Python ecosystem. If you are a beginner, this makes it easy to pick up and use. However, pandas has its own access operators, loc and iloc.

df.iloc[ row , column] → Index-based selection

This method is used when the index label of a data frame is something other than numeric series of 0, 1, 2, 3….n or in case the user doesn’t know the index label.

df.iloc[:,:]

df.iloc[0:7,:]

df.iloc[0:7,1:2]

df.loc[ row , column] → Label-based selection

Attribute access a group of rows and columns by label(s) or a boolean array in the given DataFrame.

df.loc[:,:]

df.loc[0:7,:]

df.loc[0:7,'Price':'Price']

6. Conditional selection

Another way of slicing is doing conditional filtering. For example, filter data with prices below 400 or sold status.

In SQL it’s called “where clause” and in excel it’s called “filtering table”.

df[ conditional_1 , (and/or), conditional_2, ... ]

df[df.Price <= 400]

df[df.Status == 'sold']

df[(df.Price >= 400) & (df.Status == 'unsold')]

df[(df.Price >= 400) | (df.Status == 'unsold')]

7. Descriptive Statistics

Descriptive statistics are a way of describing the features of a data set by producing summaries about a sample of the data. This is often described as a displayed summary of the data that describes the contents of the data. For example in the price column.

describe() Function gives the mean, std and IQR values. It excludes character column and calculate summary statistics only for numeric columns.

You can do a summary for all data or only part of it.

df.describe()

You can do a summary for all the data or just some of it. If only part of it then call the column name like this.

df.Price.count()
df.Price.mean()

This is a list of explanations for each output of the description function

DataFrame.count → Count number of non-NA/null observations.
DataFrame.max → Maximum of the values in the object.
DataFrame.min → Minimum of the values in the object.
DataFrame.mean → Mean of the values.
DataFrame.std → Standard deviation of the observations.
DataFrame.select_dtypes → Columns based on their dtype.

8. Case Expression

If you are familiar with case expressions in SQL or if-else in excel then pandas can also make the same thing by combining it with a lambda.

What is the case expression? CASE expressions are conditionals that pass conditions and return the value when the first condition is met (like an if-then-else statement). So once a condition is true, it will stop reading and return the result.

If none of the conditions are true, it returns the value in the ELSE clause.

If there is no ELSE section and neither condition is true, it returns NULL.

def fun(x):
    if x < 200:
        return "Low"
  
    elif x >= 200 and x<400:
        return "Normal"
  
    else:
        return "High"
    
df['type_price'] = df['Price'].apply(lambda x : fun(x))

df.head()

9. Grouping and Sorting

The GROUP BY statement groups rows that have the same values into summary rows, like "find the number of type_price in each production year".

The GROUP BY statement is often used with aggregate functions (COUNT(), MAX(), MIN(), SUM(), AVG()) to group the result-set by one or more columns.

In SQL it’s called “group by” and in excel it’s called “pivot table”.

# Make a summary of car sales data sold

df.groupby(['Status']).agg({'Status':'count'})

# Add previous data with average sales

df.groupby(['Status']).agg({'Status': 'count', 
                             'Price': 'mean'})

# Add previous data with max and min sales

df.groupby(['Status']).agg({'Status': 'count',
                             'Price': ['mean','max','min']})

# Reset indexes in pandas to have only one index and sort values by Prod Year

df.groupby(['Status','ProdYear']).agg({'Status': 'count',
   'Price': ['mean','max','min']}).reset_index().sort_values(by='ProdYear')

10. Combining Data

Lets create new data

id_cars = {'idcars': ['1','2','3','4','5','6','7','8'],
        'Brand': ['Brio','Jazz','HRV','CRV','Civic','Mobilio','BRV','City']
        }

df_id = pd.DataFrame(id_cars)
df_id['table'] = 'id_cars'

cars1 = {'idcars': ['1','3','5','7','2','4','6','8'],
        'Price': [148,252,300,467,536,205,251,355],
         'Year' : [2022,2021,2020,2022,2021,2020,2019,2020]
        }

df1 = pd.DataFrame(cars1)
df1['table'] = 'cars1'

cars2 = {'idcars': ['1','2','3','4','5','6','7','8'],
        'Price': [130,240,290,450,520,190,240,340],
         'Year' : [2021,2020,2019,2021,2020,2019,2018,2019]
        }

df2 = pd.DataFrame(cars2)
df2['table'] = 'cars2'

Pandas has three core methods for doing this. In order of increasing complexity, these are concat(), join(), and merge().

We will omit it and focus on the first two functions here concat() and merge(). Most of what merge() can do can also be done more simply with join().

concat() The simplest concatenation method is concat(). Given a list of elements, this function will concatenate those elements along an axis.

df = pd.concat([df1, df2])
df

merge() A Merge / join clause is used to combine rows from two or more tables, based on a related column between them. The joining requirement is the existence of a primary key and a foreign key so that the two tables are joined

The different types of Joints are as follows:

Inner join
Left join
Right join
Full join

df_new = pd.merge(df, df_id, on = 'idcars' , how = 'left' )
df_new.head()

if we look at the results of the join above, there are columns “table_x” and “table_y” because both tables have these columns. The way to handle it is in the second table (df_id) , delete the same column by changing the way it is called when doing a join.

df_new = pd.merge(df, df_id[['idcars','Brand']], on ='idcars', how ='left')
df_new.head()

11. Pandas visualization

The pandas library in python is primarily used for data analysis. It’s not a data visualization library but, we can create basic plots using Pandas. Pandas is very useful and practical if we want to make exploratory data analysis plots. We need to import a data visualization library other than Panda for tasks like matplotlib.

Chart visualization - pandas 1.5.1 documentation

We provide the basics in pandas to easily create decent looking plots. See the section for visualization libraries that…

pandas.pydata.org

import matplotlib.pyplot as plt

ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
ts = ts.cumsum()

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=list("ABCD"))
df = df.cumsum()

plt.figure();

df.plot();

plt.figure();

df.iloc[5].plot(kind="bar");

df2 = pd.DataFrame(np.random.rand(10, 4), columns=["a", "b", "c", "d"])

df2.plot.bar();

df4 = pd.DataFrame(
    {
        "a": np.random.randn(1000) + 1,
        "b": np.random.randn(1000),
        "c": np.random.randn(1000) - 1,
    },
    columns=["a", "b", "c"],
)


plt.figure();

df4.plot.hist(alpha=0.5);

Summary

To summarize, in this post we cover how to use the Pandas library.

What is Pandas ?
Import Pandas
Try Data Frames
Head and Tail
Slicing
Conditional selection
Descriptive Statistics
Case Expression
Grouping and Sorting
Combining Data
Pandas visualization

To understand pandas more you can visit pandas documentation

pandas documentation - pandas 1.5.1 documentation

The reference guide contains a detailed description of the pandas API. The reference describes how the methods work and…