Students Performance in Exams — Data Analysis

Jan 20, 2019

Hi. This post is about data analysis. We will try to extract some insights about students' performance from raw data.

>>> import pandas as pd
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> import seaborn as sns

The lines above import all the required Python libraries.

>>> data = pd.read_csv('StudentsPerformance.csv')
>>> data.head()

>>> data.shape
(1000, 8)

Let us check if the data has some missing values.

>>> data.isnull().sum()
gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64

So there are no missing values in the dataset.

Let us check the datatype of all the column values.

>>> data.dtypes
gender                         object
race/ethnicity                 object
parental level of education    object
lunch                          object
test preparation course        object
math score                      int64
reading score                   int64
writing score                   int64
dtype: object

Let us analyze the values of the columns and check whether they are numerical or categorical.

>>> data['gender'].value_counts()
female    518
male      482
Name: gender, dtype: int64

>>> data['parental level of education'].value_counts()
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
master's degree        59
Name: parental level of education, dtype: int64

>>> data['race/ethnicity'].value_counts()
group C    319
group D    262
group B    190
group E    140
group A     89
Name: race/ethnicity, dtype: int64

>>> data['lunch'].value_counts()
standard        645
free/reduced    355
Name: lunch, dtype: int64

>>> data['test preparation course'].value_counts()
none         642
completed    358
Name: test preparation course, dtype: int64

We can see from the above that the columns “gender”, “parental level of education”, “race/ethnicity”, “lunch”, and “test preparation course” are categorical variables.

Since marks are always numerical, the columns “math score”, “reading score”, and “writing score” contain numerical values (the int64 dtypes above confirm this) and hence are numerical variables.
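
The same split can be done programmatically instead of by eye. Here is a minimal sketch, assuming a tiny made-up frame (`sample` is hypothetical, not the real dataset) that reuses a few of the column names from above:

```python
import pandas as pd

# Hypothetical three-row sample; the values are made up for illustration.
sample = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "lunch": ["standard", "free/reduced", "standard"],
    "math score": [72, 47, 90],
    "reading score": [72, 57, 95],
})

# Numeric columns are treated as numerical variables.
numerical_cols = sample.select_dtypes(include="number").columns.tolist()
# Everything that is not numeric is treated as categorical here.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numerical_cols)
print(categorical_cols)
```

Note that this only looks at how the values are stored. A column of integer codes (say, 1 to 5 for a group) would be reported as numerical even though it is really categorical, so a look at value_counts() is still worth it.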

Adding columns “total” and “average” to the dataset.

>>> data['total'] = data['math score'] + data['reading score'] + data['writing score']
>>> data['average'] = data['total'] / 3
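
As a side note, the same two columns can be computed row-wise with sum and mean, which avoids hard-coding the number of subjects. This is only a sketch on a made-up three-row frame (`sample` and `score_cols` are names introduced here for illustration):

```python
import pandas as pd

# Hypothetical scores; the values are made up for illustration.
sample = pd.DataFrame({
    "math score": [72, 69, 90],
    "reading score": [72, 90, 95],
    "writing score": [74, 88, 93],
})

score_cols = ["math score", "reading score", "writing score"]

# axis=1 means "across the columns of each row".
sample["total"] = sample[score_cols].sum(axis=1)
sample["average"] = sample[score_cols].mean(axis=1)

print(sample)
```

If another subject is added later, only score_cols has to change.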

Let us see the distribution of the scores.

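>>> # distplot is deprecated since seaborn 0.11; histplot with kde=True is the closest replacement
>>> sns.histplot(data['math score'], kde=True, stat='density')

>>> sns.histplot(data['reading score'], kde=True, stat='density')

>>> sns.histplot(data['writing score'], kde=True, stat='density')

>>> sns.histplot(data['average'], kde=True, stat='density')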

>>> sns.pairplot(data)

Let us analyze the average score of all the students on the basis of “race/ethnicity”, “parental level of education”, and “test preparation course”.

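>>> # x and y are passed by keyword; newer seaborn releases reject them as positional arguments
>>> sns.barplot(x='race/ethnicity', y='average', data=data)

>>> sns.barplot(x='parental level of education', y='average', data=data)

>>> sns.barplot(x='test preparation course', y='average', data=data)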
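
By default, the height of each bar in a seaborn bar plot is the mean of the group, so the same comparison can also be read as plain numbers with a groupby. Here is a minimal sketch on a made-up four-row frame (`sample` is hypothetical, not the real dataset):

```python
import pandas as pd

# Hypothetical averages; the values are made up for illustration.
sample = pd.DataFrame({
    "test preparation course": ["none", "completed", "none", "completed"],
    "average": [60.0, 80.0, 70.0, 90.0],
})

# Mean of "average" within each group, i.e. the bar heights.
group_means = sample.groupby("test preparation course")["average"].mean()

print(group_means)
```

The same call works for “race/ethnicity” and “parental level of education” by swapping the column name.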

Let us analyze the data on the basis of the number of students who failed or passed the exam. A score below 40 counts as a fail.

>>> data['math_PassStatus'] = np.where(data['math score']<40, 'F', 'P')
>>> data['read_PassStatus'] = np.where(data['reading score']<40, 'F', 'P')
>>> data['write_PassStatus'] = np.where(data['writing score']<40, 'F', 'P')
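
To make the cut-off explicit, here is a small sketch of how np.where labels scores around the pass mark of 40 used above. The scores are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical scores chosen to sit on both sides of the pass mark.
scores = pd.Series([39, 40, 41, 0, 100], name="math score")

# Where the condition is True the result is 'F', otherwise 'P'.
status = np.where(scores < 40, "F", "P")

print(status)
```

Because the condition is a strict "less than", a score of exactly 40 is labelled as a pass.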

The code below shows the number of students who passed or failed each subject.

>>> data['math_PassStatus'].value_counts()
P    960
F     40
Name: math_PassStatus, dtype: int64

>>> data['read_PassStatus'].value_counts()
P    974
F     26
Name: read_PassStatus, dtype: int64

>>> data['write_PassStatus'].value_counts()
P    968
F     32
Name: write_PassStatus, dtype: int64

>>> p = sns.countplot(x='parental level of education', data = data, hue='math_PassStatus', palette='bright')
>>> _ = plt.setp(p.get_xticklabels(), rotation=90)

>>> p = sns.countplot(x='test preparation course', data = data, hue='math_PassStatus', palette='bright')
>>> _ = plt.setp(p.get_xticklabels(), rotation=90)

>>> p = sns.countplot(x='parental level of education', data = data, hue='read_PassStatus', palette='bright')
>>> _ = plt.setp(p.get_xticklabels(), rotation=90)

>>> p = sns.countplot(x='test preparation course', data = data, hue='read_PassStatus', palette='bright')
>>> _ = plt.setp(p.get_xticklabels(), rotation=90)

>>> p = sns.countplot(x='parental level of education', data = data, hue='write_PassStatus', palette='bright')
>>> _ = plt.setp(p.get_xticklabels(), rotation=90)

>>> p = sns.countplot(x='test preparation course', data = data, hue='write_PassStatus', palette='bright')
>>> _ = plt.setp(p.get_xticklabels(), rotation=90)
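
The counts behind these plots can also be tabulated directly, which helps when the fail bars are too small to read. Here is a minimal sketch with pd.crosstab on a made-up five-row frame (`sample` is hypothetical, not the real dataset):

```python
import pandas as pd

# Hypothetical rows; the values are made up for illustration.
sample = pd.DataFrame({
    "test preparation course": ["none", "none", "completed", "completed", "none"],
    "math_PassStatus": ["F", "P", "P", "P", "P"],
})

# Rows are the groups, columns are the pass status, cells are student counts.
counts = pd.crosstab(sample["test preparation course"], sample["math_PassStatus"])

# normalize="index" turns each row into shares that sum to 1.
rates = pd.crosstab(
    sample["test preparation course"],
    sample["math_PassStatus"],
    normalize="index",
)

print(counts)
print(rates)
```

Since the groups have different sizes, the row-wise shares are usually easier to compare than the raw counts.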


Written by Srajan Gupta