Data Analysis With Python

Using Python for data analysis is a great way to answer questions faster in a fast-paced world.

7 min read · Mar 25, 2023

From businesses to healthcare, data is driving our understanding of the world and our ability to make informed decisions. Python, a powerful language for data analysis, provides a wide range of libraries and tools that enable us to extract insights from data and solve complex problems.

In this article, we will walk through a complete data analysis project using Python. The data collection and cleaning phases were already covered in the “Data Cleaning With SQL” article.

Now we move on to the exploration and analysis phases, which end with answering the CEO’s questions. By following along with this article, you will learn how to leverage Python’s data analysis capabilities to answer real questions about real business problems. So let’s dive in!

The Files

Here are the files to access all of the Python code that we’ll discuss here.

The Project Workflow

In every project, I start with the same framework. It is simple but efficient.

It goes like this:

Understand the problem
Collect the data
Clean the data
Analyze the data
Present Results
Document everything

Today we’ll focus on the last three steps of this workflow. The first three steps are covered in the article “Data Cleaning With SQL”.

First, we’ll take a look at the problem again.

1. Understanding the Problem

The CEO of the group of companies where I work has requested an analysis of all employees on the construction sites since January 2022. This group of companies includes a real estate developer and construction company, as well as separate administrative offices and construction sites that are treated as individual companies. The group operates in different cities in 3 states of Brazil.

In a meeting, some questions were defined to be answered:

Has the average salary decreased or increased since January 2022?
How effective is our HR program to reduce the gender gap?
How are our salaries distributed across the states?
How standardized is our pay policy across the states?
How experienced is our engineering team?
In what function groups do we spend the most?
What construction sites spent the most in salaries for the period?

He also asked for recommendations in case the analysis reveals anything the company should act on.

In this part of the project, I’ll run the second stage of the analysis: investigating a dataset on construction site employees and exploring ways of manipulating data with NumPy and Pandas, as well as visualizing it with Matplotlib and Seaborn.

We’re going to investigate the dataset that resulted from our first project on construction site employee salaries. By analyzing this dataset, we hope to answer the CEO’s questions, gain insights that can help us make informed decisions, and improve our understanding of employee compensation in the company.

2. Analyze the Data

2.1 Technical Preparation and Exploratory Analysis

First, we need to import the required Python libraries.

# I'll use this section to import every library that I'll need for the analysis.

import pandas as pd
import numpy as np
import seaborn as sns
import datetime
import os
import locale
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
from matplotlib.ticker import PercentFormatter
from matplotlib import dates as mdates
import scipy.stats as stats

Then we have to load the file to get the information needed:

# In Brazil, CSV files commonly use ";" as the delimiter, so pass sep=";" to read them.
# This is because decimal numbers in Brazil are written with a comma, not a dot. Example: USA = 1092.12, Brazil = 1092,12.

df_emp = pd.read_csv('df_employee.csv', sep=";")
df_emp.head()

Now we have to do some data exploration to see if any transformation is needed for the upcoming analysis.

Here we confirm the size of the data and then check the data types. This step is needed before we can run any calculations.
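A check like this could look roughly like the sketch below. The real df_employee.csv isn't reproduced in this article, so the small dataframe here is a made-up stand-in: the column names come from the code later in this article, while the values and the date format are assumptions.

```python
import pandas as pd

# Tiny stand-in for df_emp; the values are invented for illustration only.
df_emp = pd.DataFrame({
    "pay_month": ["2022-01-01", "2022-02-01", "2023-01-01"],
    "function_group": ["Engineering", "Administration", "Engineering"],
    "salary": ["5000,50", "3200,00", "5100,75"],  # decimal commas, read as text
})

# Size of the data: (rows, columns)
n_rows, n_cols = df_emp.shape
print(f"{n_rows} rows x {n_cols} columns")

# Data type of each column; text columns are not numeric yet
print(df_emp.dtypes)
```

If a numeric or date column shows up with a text dtype, it has to be converted before any math is done on it.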

The output shows that some columns need to be converted before we proceed.

After converting the salary to a float (decimal) type and the dates to a datetime type, we can move on.
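As a minimal sketch of that kind of conversion, assuming the salary was read as text with a decimal comma and the dates arrive as ISO-formatted strings (both are assumptions, and the sample values are invented):

```python
import pandas as pd

# Made-up stand-in data; the real file may use a different date format.
df_emp = pd.DataFrame({
    "pay_month": ["2022-01-01", "2022-02-01", "2023-01-01"],
    "salary": ["5000,50", "3200,00", "5100,75"],
})

# Swap the decimal comma for a dot, then cast the text to float.
df_emp["salary"] = (
    df_emp["salary"].str.replace(",", ".", regex=False).astype(float)
)

# Parse the date strings into datetime values (ISO format assumed).
df_emp["pay_month"] = pd.to_datetime(df_emp["pay_month"], format="%Y-%m-%d")

print(df_emp.dtypes)
```

Another option is to let pandas handle the decimal comma at load time by passing decimal="," to pd.read_csv, which avoids the string replacement step.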

2.2 Exploratory Data Analysis (EDA) and Visualizations

After the technical preparation, we proceed with some simple exploratory analysis to get things going.

Let’s explore the data to understand it even better than before.

First, we start with simple descriptive statistics about the dataset.
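One common way to get a first statistical summary in Pandas is describe(). The sketch below uses invented salary values and is only meant to show the shape of the output, not the real numbers from this project.

```python
import pandas as pd

# Invented salaries, just to show what the summary looks like.
df_emp = pd.DataFrame({"salary": [2000.0, 3000.0, 4000.0, 7000.0]})

# count, mean, std, min, quartiles, and max in a single call
summary = df_emp["salary"].describe()
print(summary)
```

A mean that sits well above the median (the 50% row) is a quick hint that a few high salaries are pulling the average up.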

Then we create a few more columns that we’ll need.
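The code further down relies on a month_year column next to pay_month. A possible way to derive it is sketched here; the exact definition used in the notebook may differ, and the extra year column is purely illustrative.

```python
import pandas as pd

# Made-up stand-in data with pay_month already parsed as datetime.
df_emp = pd.DataFrame({
    "pay_month": pd.to_datetime(["2022-01-01", "2022-02-01", "2023-01-01"]),
    "salary": [5000.5, 3200.0, 5100.75],
})

# Monthly period label such as 2022-01. Periods sort chronologically,
# so .max() returns the most recent month.
df_emp["month_year"] = df_emp["pay_month"].dt.to_period("M")

# Calendar year, handy for year-over-year comparisons (illustrative column).
df_emp["year"] = df_emp["pay_month"].dt.year

print(df_emp)
```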

This is going to be a very visualization-based analysis because the dataset is simple to understand. For this purpose, we’re going to create a lot of variables to use later in the analysis:

Averages
Time-based dataframes
Filtered dataframes
Grouped dataframes

# This code block defines some dataframes to use in the visualizations

# Create time-based dataframes (one per year)

df_emp_2022 = df_emp[df_emp['pay_month'].dt.year == 2022]
df_emp_2023 = df_emp[df_emp['pay_month'].dt.year == 2023]

# Current month
latest_month = df_emp["month_year"].max()
df_emp_today = df_emp[df_emp["month_year"] == latest_month]

# Averages
avg_salary = df_emp["salary"].mean()
avg_salary_2022 = df_emp_2022["salary"].mean()
avg_salary_2023 = df_emp_2023["salary"].mean()
last_month_avg_salary = df_emp_today['salary'].mean()

# Average salary by function group
average_salary_by_function_group = df_emp.groupby('function_group')['salary'].mean().round(2)

# Average salary by function group in 2022
average_salary_by_function_group_2022 = df_emp_2022.groupby('function_group')['salary'].mean().round(2)

# Average salary by function group in 2023
average_salary_by_function_group_2023 = df_emp_2023.groupby('function_group')['salary'].mean().round(2)

Now we can plot some charts to understand the data.

I start with histograms, distribution plots, and bar charts.

This code starts by creating the dataframe needed to plot the chart. Then the chart is plotted, showing the number of employees by category. In this example, you can also see an average line, which shows where each category stands relative to the rest of the data.
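A chart of this kind can be sketched as follows. The data is invented, the category is assumed to be function_group, and the styling is a guess rather than the notebook’s actual code. The Agg backend line is only there so the sketch runs without a display; drop it when working in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the latest month's employees.
df_emp_today = pd.DataFrame({
    "function_group": ["Engineering", "Administration", "Production",
                       "Production", "Production", "Engineering"],
})

# Dataframe needed for the chart: number of employees per category.
employees_by_group = (
    df_emp_today.groupby("function_group").size().sort_values(ascending=False)
)

# Average headcount across categories, drawn as a reference line.
avg_headcount = employees_by_group.mean()

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(employees_by_group.index, employees_by_group.values)
ax.axhline(avg_headcount, color="red", linestyle="--", label="Average")
ax.set_xlabel("Function group")
ax.set_ylabel("Number of employees")
ax.legend()
fig.tight_layout()
```

Categories whose bars rise above the dashed line have more employees than the average category, and the ones below it have fewer.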

Another example below showcases one of the questions asked by the CEO:

Has the average salary decreased or increased since January 2022?

It turns out that the average decreased, which answers the first question.
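The numbers behind a trend like this can be computed with a simple groupby. The sketch below uses invented salaries that happen to go down, purely to illustrate the calculation. Comparing only the first and last month is a rough measure, so the full monthly series is worth plotting as well.

```python
import pandas as pd

# Invented data: two employees per month, two months.
df_emp = pd.DataFrame({
    "pay_month": pd.to_datetime(
        ["2022-01-01", "2022-01-01", "2022-02-01", "2022-02-01"]
    ),
    "salary": [4000.0, 6000.0, 3000.0, 5000.0],
})

# Average salary for each month, in chronological order.
monthly_avg = df_emp.groupby("pay_month")["salary"].mean().sort_index()

# Difference between the last and the first month of the period.
change = monthly_avg.iloc[-1] - monthly_avg.iloc[0]
trend = "decreased" if change < 0 else "increased or stayed flat"

print(monthly_avg)
print(f"Average salary {trend} by {abs(change):.2f}")
```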

Plotting other charts and trying different combinations, we can see that the dataset also reveals things that no one was looking for.

The chart below gives us some information that wasn’t asked but needs to be addressed.

Looking at the chart we see that:

Tocantins has a higher-paid administration.
The salaries of production supervisors need further attention. They need to be standardized.
Distrito Federal has the highest-paid engineering team.

That’s the kind of recommendation the CEO was looking for us data nerds to give.
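A comparison like this usually comes from a table of average salary by state and function group. Here is a minimal sketch, assuming the dataset has a state column (that name is an assumption) and using invented salaries that do not reflect the company’s real figures.

```python
import pandas as pd

# Invented salaries; "state" is an assumed column name.
df_emp = pd.DataFrame({
    "state": ["Tocantins", "Tocantins",
              "Distrito Federal", "Distrito Federal"],
    "function_group": ["Administration", "Engineering",
                       "Administration", "Engineering"],
    "salary": [4000.0, 7000.0, 3000.0, 9000.0],
})

# Rows = function groups, columns = states, cells = average salary.
avg_by_state_group = df_emp.pivot_table(
    index="function_group", columns="state", values="salary", aggfunc="mean"
).round(2)

# Gap between the highest and lowest state average per function group.
# A large gap suggests the pay policy is not standardized for that group.
spread = avg_by_state_group.max(axis=1) - avg_by_state_group.min(axis=1)

print(avg_by_state_group)
print(spread.sort_values(ascending=False))
```

The same table can be passed to sns.heatmap or plotted as grouped bars to make the differences between states easier to spot.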

2.3 Results

The examples above are only the tip of the iceberg. This article does not show all of the calculations; you can see the full Python notebook here.

The analysis was delivered to the CEO and stakeholders as a presentation, which you can see in this PDF file.

All of the questions were answered as the outcome of this analysis, either in visual form or in the summary at the end of the file.