School Brazilian Census: Exploratory Data Analysis (AED)

Published in

blog do zouza

5 min readJan 21, 2022

Contextualizing

School Census : it is a survey of statistical-educational data of national scope carried out annually. With it, INEP verifies from the number of enrollments and student performance to the infrastructure of schools and teaching functions. The data are provided by public and private schools and state and municipal education networks. ( source )

Exploratory Data Analysis : It is considered one of the crucial steps for database analysis. This means that there will be a process to organize, summarize, apply some calculations and visualize the data present in your analysis. In this way, summarizing the main features through visual methods. ( source )

Database and tools used

Database : 2020 School Census (available on 03/08/2021)
Originally containing 224,228 Schools and with 240 attributes, in the study we used only 74 attributes.

Tools : Databricks Community , using SQL and Python .
The data is stored in a Delta Lake and is consumed via SPARK, using SQL commands to generate BI (data visualization with Power BI) and Python for Exploratory analysis and Generation of a Machine Learning model.

Exploratory Data Analysis

Some numbers:

224,228 Schools in the 2020 School Census

181,279 Schools in Operation (we will consider only those in operation)
700 Federals
30,194 State
108,622 Municipal
41,763 Private

Students (49,369,568 Enrolled Students)

410,987 students in Federal Schools
15,708,601 students in State Schools
24,343,799 students in Municipal Schools
8,906,181 in Private Schools

We will deepen the analysis on the Number of Students (QtyStudents)

Looking at the Box Plot above and going into more detail:

Maximum = 819.5 (above this value are considered outliers )
Q3 = 365.0
Median = 158.0
Q1 = 62.0
Minimum = -392.5 (below this value are considered outliers )

We can observe outliers , which may not necessarily be errors, it may be that these Schools actually have this amount of students.

But for our analysis: I removed 18,760 schools that contained less than 25 students and 442 that had more than 2000 students . ( Although there are schools with less than 25 students and more than 2,000, I don’t know if all the ones under 5 are correct or a typo, as are all the ones with more than 2,000 students. )

Generating the Box Plot again , it is now something more visible and we can better analyze the data…

We observed that most Schools have between 88 and 399 students, with a median of 188 students per School.

Maximum = 865.5 (above this value are considered outliers )
Q3 = 399.0
Median = 188.0
Q1 = 88.0
Minimum = -378.5 (below this value are considered outliers )

Now, let’s take a look at the distribution of the Number of Students (QtyStudents) attribute

We observe :

Asymmetry with concentration on the left ( Positive Skew ): 1.94
Leptokurtic Kurtosis: 4.37

We can observe that the great concentration of schools between 25 and 60 students

Business Intelligence

Based on these numbers and previous analyzes shown, a Dashboard was created , where you can iteratively navigate these Schools, by Region, State, Municipality, Search by name or INEP code… Check it out…

Model using Machine Learning (Classification)

Machine Learning is a subgroup of AI (Artificial Intelligence) that uses statistical techniques to give computers the ability to learn from data without being explicitly programmed.

A suggestion for a Machine Learning Model is to use Clustering (group schools with similar characteristics, for example: TP_SITUACAO_FUNCIONAMENTO, TP_DEPENDENCIA and TP_LOCALIZACAO)

Now, let’s build a Machine Learning model , using Classification .

For Model:

A Target variable was created that informs whether the School is Public (Federal, State and Municipal) or Private .
Removed attributes related to School Dependency and other attributes that do not add to the model, they were:
ANO, INEP, ESCOLA, CO_MICRORREGIAO, CO_MESORREGIAO, CO_DISTRITO, CO_MUNICIPIO, CO_ORGAO_REGIONAL, TP_SITUACAO_FUNCIONAMENTO, DT_ANO_LETIVO_INICIO, DT_ANO_LETIVO_TERMINO, TP_DEPENDENCIA, TP_CATEGORIA_ESCOLA_PRIVADA, IN_CONVENIADA_PP, TP_CONVENIO_PODER_PUBLICO, IN_MANT_ESCOLA_PRIVADA_EMP, IN_MANT_ESCOLA_PRIVADA_ONG, IN_MANT_ESCOLA_PRIVADA_OSCIP, IN_MANT_ESCOLA_PRIV_ONG_OSCIP, IN_MANT_ESCOLA_PRIVADA_SIND, IN_MANT_ESCOLA_PRIVADA_SIST_S, IN_MANT_ESCOLA_PRIVADA_S_FINS, TP_REGULAMENTACAO, TP_RESPONSAVEL_REGULAMENTACAO.

Handled null values ( missing values )

Example: TP_OCUPACAO_PREDIO_ESCOLAR and IN_BANDA_LARGA

Algorithm used: Random Forest

Parameters : Sklearn default
Training: 70% of the data
Test: 30% of the data

Results obtained

The algorithm had a reasonable performance in classifying a School as Public or Private, based on some characteristics of this School ( more information in… )

Results

Obtaining an F1-score of 96% for public schools and 87% for private schools.

Model accuracy was 94%.