Students Performance in Exams — Data Analysis

Srajan Gupta
Jan 20, 2019

Hi. So, this post is about data analysis. We will try to extract some insights about student performance from raw data.

>>> import pandas as pd
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> import seaborn as sns

Those are all the Python libraries we need. Next, we load the dataset, look at its first rows and check its shape.

>>> data = pd.read_csv('StudentsPerformance.csv')
>>> data.head()
>>> data.shape
(1000, 8)

The dataset has 1000 rows and 8 columns. Let us check whether it has any missing values.

>>> data.isnull().sum()
gender 0
race/ethnicity 0
parental level of education 0
lunch 0
test preparation course 0
math score 0
reading score 0
writing score 0
dtype: int64

So there are no missing values in the dataset.
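
If you only need a yes/no answer rather than the per-column table, the same check can be collapsed into a single value. This is a minimal sketch on a made-up toy frame (the values and the `toy` name are assumptions, not the real dataset), so that the missing entries are actually visible:

```python
import numpy as np
import pandas as pd

# Toy frame with two deliberately missing entries (made-up values).
toy = pd.DataFrame({
    "math score": [72, np.nan, 90],
    "lunch": ["standard", "standard", None],
})

# Per-column count of missing values, as in the post.
missing_per_col = toy.isnull().sum()

# Single True/False answer: is anything missing anywhere?
has_missing = bool(toy.isnull().values.any())

print(missing_per_col)
print(has_missing)
```

On the real dataset, `has_missing` should come out as `False`, which matches the table of zeros above.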

Let us check the data type of each column.

>>> data.dtypes
gender object
race/ethnicity object
parental level of education object
lunch object
test preparation course object
math score int64
reading score int64
writing score int64
dtype: object

Let us analyze the values of the columns and check whether they are numerical or categorical.

>>> data['gender'].value_counts()
female 518
male 482
Name: gender, dtype: int64
>>> data['parental level of education'].value_counts()
some college 226
associate's degree 222
high school 196
some high school 179
bachelor's degree 118
master's degree 59
Name: parental level of education, dtype: int64
>>> data['race/ethnicity'].value_counts()
group C 319
group D 262
group B 190
group E 140
group A 89
Name: race/ethnicity, dtype: int64
>>> data['lunch'].value_counts()
standard 645
free/reduced 355
Name: lunch, dtype: int64
>>> data['test preparation course'].value_counts()
none 642
completed 358
Name: test preparation course, dtype: int64

We can see from the above that the columns “gender”, “parental level of education”, “race/ethnicity”, “lunch” and “test preparation course” are categorical variables.

Since marks are always numerical, the columns “math score”, “reading score” and “writing score” are numerical variables.
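
The same split can also be read off the dtypes programmatically. Here is a small sketch using `select_dtypes`, run on a toy frame that borrows a few column names from the dataset (the values are made up, and treating every non-numeric column as categorical is an assumption that happens to hold for this data):

```python
import pandas as pd

# Toy frame reusing some of the dataset's column names (made-up values).
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "lunch": ["standard", "free/reduced", "standard"],
    "math score": [72, 47, 90],
    "reading score": [72, 57, 95],
})

# Columns with a numeric dtype are treated as numerical variables.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else (here: text columns) is treated as categorical.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(categorical_cols)
print(numerical_cols)
```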

Let us add the columns “total” and “average” to the dataset.

>>> data['total'] = data['math score'] + data['reading score'] + data['writing score']
>>> data['average'] = data['total'] / 3
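
An equivalent way to write this, which scales better if more subjects are added later, is to keep the score columns in a list and aggregate row-wise. A minimal sketch on made-up values (`toy` and `score_cols` are names introduced here for illustration):

```python
import pandas as pd

# Toy scores for three students (made-up values).
toy = pd.DataFrame({
    "math score": [72, 69, 90],
    "reading score": [72, 90, 95],
    "writing score": [74, 88, 93],
})

score_cols = ["math score", "reading score", "writing score"]

# axis=1 aggregates across the columns, giving one value per student.
toy["total"] = toy[score_cols].sum(axis=1)
toy["average"] = toy[score_cols].mean(axis=1)

print(toy)
```

Both versions should give the same numbers; the list-based one just avoids hard-coding the divisor 3.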

Let us see the distribution of the scores.

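>>> # distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
>>> sns.histplot(data['math score'], kde=True)
>>> sns.histplot(data['reading score'], kde=True)
>>> sns.histplot(data['writing score'], kde=True)
>>> sns.histplot(data['average'], kde=True)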
>>> sns.pairplot(data)

Let us compare the average score of the students across “race/ethnicity”, “parental level of education” and “test preparation course”.

>>> # recent seaborn versions require x and y to be passed as keyword arguments
>>> sns.barplot(x='race/ethnicity', y='average', data=data)
>>> sns.barplot(x='parental level of education', y='average', data=data)
>>> sns.barplot(x='test preparation course', y='average', data=data)
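
By default, the height of each bar in a seaborn bar plot is the mean of the values in that group, so the numbers behind the plots can be obtained with a plain `groupby`. A rough sketch on made-up data (the `toy` frame and its values are assumptions, not results from the real dataset):

```python
import pandas as pd

# Toy averages for four students (made-up values).
toy = pd.DataFrame({
    "test preparation course": ["none", "completed", "none", "completed"],
    "average": [60.0, 80.0, 70.0, 90.0],
})

# Mean of "average" within each group: the quantity the bar heights show.
group_means = toy.groupby("test preparation course")["average"].mean()

print(group_means)
```

Swapping in “race/ethnicity” or “parental level of education” as the grouping column gives the table for the other two plots.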

Let us analyze the data on the basis of the number of students who passed or failed each exam. A score below 40 counts as a fail.

>>> data['math_PassStatus'] = np.where(data['math score']<40, 'F', 'P')
>>> data['read_PassStatus'] = np.where(data['reading score']<40, 'F', 'P')
>>> data['write_PassStatus'] = np.where(data['writing score']<40, 'F', 'P')
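
The three lines above differ only in the column names, so the same idea can be written as a loop with the pass mark kept in one place. This is a sketch on made-up scores; the `PASS_MARK` constant, the `status_cols` mapping and the extra `overall_PassStatus` column are assumptions added for illustration and are not part of the original analysis:

```python
import numpy as np
import pandas as pd

PASS_MARK = 40  # same cutoff as in the post: below 40 is a fail

# Toy scores for three students (made-up values).
toy = pd.DataFrame({
    "math score": [35, 72, 40],
    "reading score": [50, 39, 90],
    "writing score": [45, 60, 88],
})

# Map each score column to the name of its pass-status column.
status_cols = {
    "math score": "math_PassStatus",
    "reading score": "read_PassStatus",
    "writing score": "write_PassStatus",
}

for score_col, status_col in status_cols.items():
    toy[status_col] = np.where(toy[score_col] < PASS_MARK, "F", "P")

# Hypothetical extra column: "P" only if the student passed every subject.
passed_all = (toy[list(status_cols.values())] == "P").all(axis=1)
toy["overall_PassStatus"] = np.where(passed_all, "P", "F")

print(toy)
```

Note that a score of exactly 40 counts as a pass, because the comparison is a strict `<`.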

The code below shows the number of students who passed or failed each subject.

>>> data['math_PassStatus'].value_counts()
P 960
F 40
Name: math_PassStatus, dtype: int64
>>> data['read_PassStatus'].value_counts()
P 974
F 26
Name: read_PassStatus, dtype: int64
>>> data['write_PassStatus'].value_counts()
P 968
F 32
Name: write_PassStatus, dtype: int64
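
Raw counts are easier to compare as percentages. `value_counts` accepts `normalize=True` for this; here is a minimal sketch on a made-up series of four results (the `status` series is an assumption, not the real column):

```python
import pandas as pd

# Toy pass/fail results for four students (made-up values).
status = pd.Series(["P", "P", "P", "F"], name="math_PassStatus")

# normalize=True returns fractions instead of counts; scale to percent.
pass_share = status.value_counts(normalize=True) * 100

print(pass_share)
```

With the counts shown above, this works out to a pass rate of 96.0% in math, 97.4% in reading and 96.8% in writing.
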
>>> p = sns.countplot(x='parental level of education', data = data, hue='math_PassStatus', palette='bright')
>>> _ = plt.setp(p.get_xticklabels(), rotation=90)
>>> p = sns.countplot(x='test preparation course', data = data, hue='math_PassStatus', palette='bright')
>>> _ = plt.setp(p.get_xticklabels(), rotation=90)
>>> p = sns.countplot(x='parental level of education', data = data, hue='read_PassStatus', palette='bright')
>>> _ = plt.setp(p.get_xticklabels(), rotation=90)
>>> p = sns.countplot(x='test preparation course', data = data, hue='read_PassStatus', palette='bright')
>>> _ = plt.setp(p.get_xticklabels(), rotation=90)
>>> p = sns.countplot(x='parental level of education', data = data, hue='write_PassStatus', palette='bright')
>>> _ = plt.setp(p.get_xticklabels(), rotation=90)
>>> p = sns.countplot(x='test preparation course', data = data, hue='write_PassStatus', palette='bright')
>>> _ = plt.setp(p.get_xticklabels(), rotation=90)
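
The count plots show these relationships visually; the same counts can be put in a table with `pd.crosstab`, which is handy when the fail bars are too small to read off the chart. A small sketch on made-up rows (the `toy` frame is an assumption, not the real data):

```python
import pandas as pd

# Toy rows: preparation course taken and math pass status (made-up values).
toy = pd.DataFrame({
    "test preparation course": ["none", "completed", "none", "none", "completed"],
    "math_PassStatus": ["F", "P", "P", "F", "P"],
})

# Rows: course status, columns: pass status, cells: number of students.
counts = pd.crosstab(toy["test preparation course"], toy["math_PassStatus"])

print(counts)
```

Using “parental level of education” for the rows, or `read_PassStatus` / `write_PassStatus` for the columns, gives the tables behind the other plots.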
