Analyse the distribution of ages — Python Data Analysis series part 1

José Fernando Costa
Published in Nerd For Tech
6 min read · Jan 25, 2021


Cover image (source)

After posting a handful of separate articles on data analysis with Python, I’ve decided to share some of the work I did on previous personal projects in the form of a proper series.

This “Python Data Analysis” series will consist of five articles tackling different data problems using the 2020 Stack Overflow Developer Survey results dataset. I will show you how to use pandas to overcome issues with numeric and categorical data, and then create nice visualizations with Plotly (Express) at the end.

Although I only show a Python script here, each article has its own Jupyter notebook with the same code and explanations. In part 5 we’ll take these notebooks and go through the process of uploading them to the cloud in an Azure Databricks workspace.

By the end of this series you will have grasped new concepts in data transformation and visualization, and you will have gotten started with a powerful tool for data science in the cloud.

If you want to run the code yourself, there are some prerequisites you need:

  • Install Python libraries: pandas, Plotly and Jupyter (the last one is for running the notebooks)
  • Download the 2020 Stack Overflow Developer Survey results dataset

For the code itself, you can either use the script provided in this article, or download the Jupyter notebooks from my GitHub repository. These notebooks are, essentially, a replacement for these articles as they contain the same code and explanations. The main difference is that the articles have links to each part of the series.

For these articles, I will start by explaining the objective, i.e., the end-goal and the data transformations needed to arrive at that solution. Afterwards, I will show the complete Python script. At that point, you can either read through it alone, or take a look at the script and continue reading the article for the explanations.

Also, please bear in mind this series is not necessarily aimed at complete beginners with pandas and data analysis with Python. Sure, the code is easier to follow when accompanied by the explanations, but I think you will get more out of this series if you already have some degree of familiarity with pandas and a visualization library such as Matplotlib or Plotly.

Finally, here are some handy links to navigate the series’ contents:

Without further ado, let’s get started with today’s demo.

Analyse the age of respondents

Preview of the data

For this first article, the objective is to plot a bar chart that shows the frequency of the respondents’ ages. For that, we need to filter out outliers and bad data points, as well as remove floating-point ages (yes, there are ages such as 15.5 years). There is also a line of code for removing blanks; it turned out not to be needed after applying the filter, but I included it to show you an option you might need in future work. The last step before the plot is to get the frequency of each age.

Age of Respondents

The first 8 lines of code are exactly the same for each of these demos. We are always going to import the same libraries and load the same dataset from the same location (of course, change the CSV location to match where you have it on your machine).

Since we are working with a single column of data completely devoid of context of the other questions and answers in the survey, we can remove all other columns and keep only the one for the ages (line 10).
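
As a minimal sketch of that column selection, the snippet below uses a small made-up DataFrame in place of the survey CSV. The `survey` name and every value other than the "Age" column name are assumptions for illustration, not the real data or the original script.

```python
import pandas as pd

# Hypothetical stand-in for the survey CSV; with the real file you would
# load your local copy with pd.read_csv instead.
survey = pd.DataFrame(
    {
        "Respondent": [1, 2, 3, 4],
        "Age": [25.0, None, 15.5, 120.0],
        "Country": ["Portugal", "Brazil", "Germany", "India"],
    }
)

# Double brackets keep the result a one-column DataFrame rather than a Series
data = survey[["Age"]]

print(data.columns.tolist())
```

When reading from a file, passing `usecols=["Age"]` to `pd.read_csv` is another way to load only that column.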

Lines 12 and 13 are used for some exploratory data analysis. The objective is to find the limits and outliers of the data, i.e., what responses to filter out.

Exploratory Data Analysis of the ages

The scatter plot shows that respondents close to 100 years or older may not exactly represent accurate data points. Furthermore, there is a threshold at which children will be too young to be replying to the survey.
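
A numeric summary can complement the scatter plot when looking for those limits. The sketch below is not part of the original script; it runs `describe()` on made-up ages, so the numbers it prints say nothing about the real survey.

```python
import pandas as pd

# Made-up ages for illustration; the real column comes from the survey CSV
data = pd.DataFrame({"Age": [1.0, 15.5, 22.0, 25.0, 31.0, 40.0, 99.0, None]})

# describe() skips missing values and reports count, mean, std, min,
# quartiles and max for the column
summary = data["Age"].describe()
print(summary)

# The extremes hint at where the outliers sit
youngest, oldest = summary["min"], summary["max"]
```

Comparing the minimum and maximum against the quartiles is a quick way to spot values that sit far from the bulk of the responses.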

As such, on line 15 we keep only respondents with ages between 10 and 75, inclusive.

data = data.query("(Age >= 10) and (Age <= 75)")
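
To see what this filter does on a small scale, here is a sketch on made-up ages. It also shows an equivalent boolean mask built with `Series.between`, which is my own alternative and not something the original script uses.

```python
import pandas as pd

# Made-up ages, including values on both bounds and a blank
data = pd.DataFrame({"Age": [5.0, 10.0, 15.5, 42.0, 75.0, 99.0, None]})

# Same filter as in the article
filtered_query = data.query("(Age >= 10) and (Age <= 75)")

# Equivalent boolean mask; between() includes both bounds by default
filtered_mask = data[data["Age"].between(10, 75)]

print(filtered_query["Age"].tolist())
```

Both versions keep 10 and 75, and both drop the blank, because a comparison against a missing value evaluates to False.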

With a second plot (line 18), we can see that the remaining ages now fall within a much narrower, more plausible range.

Exploratory Data Analysis of the ages after filter

The next step is to remove floating-point ages (lines 21 and 22). Generally, ages are recorded as integer values, so we’ll simply remove those floating-point values. Keep in mind removing data should always be done with careful consideration. In these articles we are taking the easy way out for the sake of simplicity, but on a real project there are other options such as rounding up/down, replacing by the average value, etc.

is_integer = lambda row: int(row["Age"]) == row["Age"]
data = data[data.apply(is_integer, axis="columns")]

The values in the column are represented as float, so if we go row by row, we can compare the integer representation with the current numerical representation. If they are the same, then the number is an integer (e.g. 23.0 or 45.0); otherwise the age was a floating-point number and we discard it (e.g. 15.5 is different from 15).
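
The same check can also be written without going row by row. The sketch below, on made-up ages, compares the article's row-wise version with a vectorized one based on the remainder of a division by 1; the vectorized form is an alternative I am suggesting, not what the original script does.

```python
import pandas as pd

# Made-up ages for illustration
data = pd.DataFrame({"Age": [15.5, 23.0, 45.0, 49.5]})

# Row-by-row check, as in the article
is_integer = lambda row: int(row["Age"]) == row["Age"]
row_wise = data[data.apply(is_integer, axis="columns")]

# Vectorized alternative: a whole number leaves no remainder when divided by 1
vectorized = data[data["Age"] % 1 == 0]

print(vectorized["Age"].tolist())
```

One practical difference: `int()` raises an error on a missing value, so the row-wise version relies on blanks having been filtered out beforehand, while the remainder comparison simply evaluates to False for them.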

Line 25 uses the dropna function to remove blank values even though there were no blanks in this column (they were removed when we filtered for the 10–75 range). My purpose in using the function is to show its potential. With this call

data = data.dropna(axis="rows", how="any", subset=["Age"])

you learn that you can remove blanks along either axis (rows or columns), that you can restrict the check to a subset of columns, and that you can choose when to drop: if the row or column has any blank value, or only if all of its values are blank.
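
Here is a small sketch of those options side by side. The two-column DataFrame is made up for illustration (the real `data` only has the "Age" column at this point), so treat it as a demonstration of the parameters rather than of the survey.

```python
import pandas as pd

# Made-up data with blanks in different places
data = pd.DataFrame(
    {
        "Age": [25.0, None, 31.0, None],
        "Country": ["Portugal", "Brazil", None, None],
    }
)

# Drop rows where "Age" is blank, ignoring blanks in other columns
by_subset = data.dropna(axis="rows", how="any", subset=["Age"])

# Drop rows with a blank in any column
any_blank = data.dropna(axis="rows", how="any")

# Drop rows only when every column is blank
all_blank = data.dropna(axis="rows", how="all")

print(len(by_subset), len(any_blank), len(all_blank))
```

With `subset=["Age"]` the third row survives even though its country is blank, which is the behaviour we want when only one column matters.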

The age frequencies are obtained on line 28 with value_counts.

age_counts = data["Age"].value_counts()

The method returns a pandas Series where the indices are the “Age” values, and the data points are the respective frequencies.
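
To make the shape of that Series concrete, here is a sketch on a handful of made-up ages. The `sort_index()` step is an optional extra of mine, useful if you want the result ordered by age rather than by frequency.

```python
import pandas as pd

# Made-up ages for illustration
data = pd.DataFrame({"Age": [25.0, 31.0, 25.0, 40.0, 25.0, 31.0]})

# Frequencies, sorted from most to least common by default
age_counts = data["Age"].value_counts()

# Re-order by age instead of by frequency
by_age = age_counts.sort_index()

print(by_age.to_dict())
```

The index holds the distinct ages and the values hold how many times each one appears, which is exactly the pairing a bar chart needs.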

We can pass this Series directly to Plotly and it is able to assign the ages and the frequencies to the correct dimension of the plot.

fig = px.bar(age_counts, title="Age of respondents")
fig.update_layout(
    xaxis_title="Age",
    yaxis_title="Frequency",
    title_x=0.5,
    showlegend=False,
)
fig.show()
Result

Conclusion

We’ve reached the end of the first part. “Ages” was a rather simple example, but it was fitting for the first part. We went through understanding the limits of the “real” data, cleaning it and finally obtaining the visual we wanted.

In part 2 we will use another column of numerical data, this time continuous. In other words, we’ll look at binning the data to create a histogram.

Before I leave you, here are some handy links for this series:
