SARAH ADIGWE
Dec 3, 2018

BASIC EXPLORATORY DATA ANALYSIS (EDA).

Hey guys! I recently had fun volunteering as a tutor at a three-day Python training for the second cohort of LAUTECH AI-Saturdays. The event was organized by the LAUTECH Data Science Community, and I helped teach Python programming to the attendees. We collected data on the participants over the three days, and afterwards I decided to do some basic exploratory data analysis (EDA) on it. I wanted to answer some basic questions like:

> Which department in the school had the most participants?
> Which set of students attended the most? Were they freshers, final year students, or graduates?

And also, I wanted to have fun! Fun for me is analyzing data. So, let’s walk through this exploration together.

The first thing I did was get the data. Then I created a Jupyter notebook for the experiment and imported the Python libraries I’d need for this analysis: Pandas, NumPy and Matplotlib.

NumPy provides a high-performance, multi-dimensional array along with the functions needed to manipulate such arrays. Pandas gives me useful functions for working with the CSV dataset, and Matplotlib is what I use for data visualization.

Importing the required libraries.
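A minimal sketch of what such an import cell usually looks like. The aliases `np`, `pd` and `plt` are the common community conventions, assumed here rather than taken from the notebook.

```python
# The three libraries named above, under their conventional aliases.
import numpy as np               # fast multi-dimensional arrays
import pandas as pd              # reading and manipulating the CSV data
import matplotlib.pyplot as plt  # plotting
```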

Next, I read in the dataset using the pandas read_csv() function and then displayed the first five rows of the data.

Image showing the first five rows.
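A rough sketch of this step. The real CSV file is not available here, so the example reads from an in-memory string instead. Every value in it is a made-up placeholder, and only the nine column names come from the description in this post.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV file: all rows are placeholders.
sample_csv = io.StringIO(
    "Username,Fullname,Phone Number,Department,Level,Availability,Day1,Day2,Day3\n"
    "a@example.com,Participant A,0000000001,CSE,400L,Yes,1.0,1.0,0.0\n"
    "b@example.com,Participant B,0000000002,CSE,500L,Yes,1.0,0.0,0.0\n"
    "c@example.com,Participant C,0000000003,EEE,400L,Yes,0.0,1.0,1.0\n"
    "d@example.com,Participant D,0000000004,cse,Graduate,Yes,1.0,1.0,1.0\n"
    "e@example.com,Participant E,0000000005,EEE,100L,Yes,1.0,0.0,1.0\n"
    "f@example.com,Participant F,0000000006,CSE,400L,Yes,1.0,1.0,1.0\n"
)

# With a real file you would pass its path instead of sample_csv.
# Reading phone numbers as strings keeps any leading zeros intact.
df = pd.read_csv(sample_csv, dtype={"Phone Number": str})

# head() returns the first five rows by default.
first_five = df.head()
print(first_five)
```

In a Jupyter notebook, ending the cell with `df.head()` renders the same rows as a formatted table.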

We can see that the dataset contains nine columns.

Username: Which surprisingly stores email.

Fullname: Name of the participants.

Phone Number: Phone Numbers of the participants.

Department: The current department of study.

Level: The current study level.

Availability: This column functions like an attendance marker.

Day1, Day2 and Day3: Each is 1.0 if the participant was present that day; otherwise it is 0.0.

Value counts for each level.

From the count result above, we can immediately see that 400L students had the highest attendance. Interesting, right?
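Counting how often each level appears can be sketched with the pandas `value_counts()` method. The `Level` column name comes from the dataset description above, while the rows below are placeholders chosen only to illustrate the call.

```python
import pandas as pd

# Placeholder rows; only the column name "Level" comes from the dataset.
df = pd.DataFrame(
    {"Level": ["400L", "400L", "400L", "500L", "500L", "Graduate", "graduate", "100L"]}
)

# value_counts() counts each distinct value and sorts from most to least frequent.
level_counts = df["Level"].value_counts()
print(level_counts)
```

Note that "Graduate" and "graduate" are counted separately, because the comparison is case sensitive.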

Let’s do a bar plot to make this more visual.
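One way to draw such a bar plot is with the plotting method built into pandas, which uses Matplotlib underneath. This is a sketch on placeholder data, and the labels and figure size are arbitrary choices.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder levels standing in for the real "Level" column.
levels = pd.Series(["400L", "400L", "400L", "500L", "500L", "100L"], name="Level")
level_counts = levels.value_counts()

# Draw one bar per level on an explicit Axes object.
fig, ax = plt.subplots(figsize=(8, 4))
level_counts.plot(kind="bar", ax=ax)
ax.set_xlabel("Level")
ax.set_ylabel("Number of participants")
ax.set_title("Participants per level")
fig.tight_layout()
```

In a Jupyter notebook the figure shows up under the cell; in a plain script you would add `plt.show()` or `fig.savefig(...)`.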

Now that’s better. But I noticed something else. It seems that we have several different labels for graduates. This is probably because each person filling in the form used a different text to identify themselves as a graduate. I decided to group this category of people into just one group called graduates.

Replacing the variants with one common label.
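A sketch of how the different graduate labels could be merged with the pandas `replace()` method. The variant spellings listed here are hypothetical, since the actual texts participants typed are not shown. Only the target label `graduates` comes from the text above.

```python
import pandas as pd

# Placeholder data; the graduate spellings are invented for illustration.
df = pd.DataFrame({"Level": ["400L", "Graduate", "graduate", "Grad", "500L", "400L"]})

# Hypothetical list of variants to fold into a single label.
graduate_variants = ["Graduate", "graduate", "Grad"]
df["Level"] = df["Level"].replace(graduate_variants, "graduates")

# Counting again now shows one combined "graduates" group.
level_counts = df["Level"].value_counts()
print(level_counts)
```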

After grouping, I counted the values again, and this time got a better result.

Visual Analysis of value count of each level.

And now we can clearly see that 400L students attended the most, followed by 500L students. Let’s make a pie chart of this information too.

Pie Chart
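A pie chart of the same counts can be sketched like this. The numbers are placeholders, and `autopct` is the Matplotlib option that prints each slice's percentage.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder counts standing in for the cleaned value counts.
level_counts = pd.Series({"400L": 3, "500L": 2, "graduates": 1}, name="Level")

fig, ax = plt.subplots(figsize=(5, 5))
# One wedge per level, each labelled with its share of the total.
level_counts.plot(kind="pie", ax=ax, autopct="%1.1f%%")
ax.set_ylabel("")  # drop the default axis label, which is just the Series name
ax.set_title("Share of participants per level")
```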

Following the same steps above, I decided to investigate the participants by their departments.

Well, as you can see, participants filled in this field differently. Some used upper case while others used lower case, and some used different titles for the same department. I decided to categorize these departments into groups by replacing similar departments with the same name.

Replacing the variants with one common name.
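The department cleanup might look something like the sketch below. It first strips whitespace and upper-cases every entry, which is an extra normalization step assumed here, and then maps the remaining variants to one name with `replace()`. The department spellings and the mapping are hypothetical examples, not the actual values from the form.

```python
import pandas as pd

# Placeholder entries showing mixed case, stray spaces and alternative titles.
df = pd.DataFrame(
    {
        "Department": [
            "CSE",
            "cse",
            "Computer Science",
            "computer science and engineering",
            "EEE",
            "Electrical Engineering",
            " Mech ",
        ]
    }
)

# Step 1: remove surrounding spaces and use one letter case everywhere.
df["Department"] = df["Department"].str.strip().str.upper()

# Step 2: map the remaining variants to a single name (hypothetical mapping).
department_map = {
    "COMPUTER SCIENCE": "CSE",
    "COMPUTER SCIENCE AND ENGINEERING": "CSE",
    "ELECTRICAL ENGINEERING": "EEE",
}
df["Department"] = df["Department"].replace(department_map)

department_counts = df["Department"].value_counts()
print(department_counts)
```

Normalizing the case first keeps the mapping short, since "cse" and "CSE" no longer need separate entries.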

After replacing the fields, I did a value count and got a better and cleaner result.

Value Counts for each department.

And it seems CSE (Computer Science) has the highest attendance. I kind of expected that since it’s a technology-related field, but I wanted to confirm my hypothesis.

Next, let’s make a bar chart to visualize this properly.

And then, a pie chart.

Pie Chart

And finally, a fancier, more colorful plot before I go.

Visual Analysis of value count of each department.
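One way to get a multi-colored bar chart is to call Matplotlib's `bar()` directly and pass one color per bar. This is only a sketch: the counts are placeholders and the colors are arbitrary picks from Matplotlib's named "tab" palette, not the ones used in the original plot.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder department counts.
department_counts = pd.Series({"CSE": 4, "EEE": 2, "MECH": 1})

# One named color per bar (arbitrary choice).
colors = ["tab:blue", "tab:orange", "tab:green"]

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(department_counts.index, department_counts.values, color=colors)
ax.set_xlabel("Department")
ax.set_ylabel("Number of participants")
ax.set_title("Participants per department")
fig.tight_layout()
```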

Well, I was able to answer the questions I raised before exploring this data and so decided to call it a day here.

I hope this was worth your time and that you learnt a thing or two. I would also like to urge participants of the AI Saturdays to make good use of all that they will learn in the upcoming weeks.

Well, bye for now.