School Brazilian Census: Exploratory Data Analysis (AED)
Contextualizing
School Census : it is a survey of statistical-educational data of national scope carried out annually. With it, INEP verifies from the number of enrollments and student performance to the infrastructure of schools and teaching functions. The data are provided by public and private schools and state and municipal education networks. ( source )
Exploratory Data Analysis : It is considered one of the crucial steps for database analysis. This means that there will be a process to organize, summarize, apply some calculations and visualize the data present in your analysis. In this way, summarizing the main features through visual methods. ( source )
Database and tools used
Database : 2020 School Census (available on 03/08/2021)
Originally containing 224,228 Schools and with 240 attributes, in the study we used only 74 attributes.
Tools : Databricks Community , using SQL and Python .
The data is stored in a Delta Lake and is consumed via SPARK, using SQL commands to generate BI (data visualization with Power BI) and Python for Exploratory analysis and Generation of a Machine Learning model.
Exploratory Data Analysis
Some numbers:
224,228 Schools in the 2020 School Census
- 181,279 Schools in Operation (we will consider only those in operation)
- 700 Federals
- 30,194 State
- 108,622 Municipal
- 41,763 Private
Students (49,369,568 Enrolled Students)
- 410,987 students in Federal Schools
- 15,708,601 students in State Schools
- 24,343,799 students in Municipal Schools
- 8,906,181 in Private Schools
We will deepen the analysis on the Number of Students (QtyStudents)
Looking at the Box Plot above and going into more detail:
Maximum = 819.5 (above this value are considered outliers )
Q3 = 365.0
Median = 158.0
Q1 = 62.0
Minimum = -392.5 (below this value are considered outliers )
We can observe outliers , which may not necessarily be errors, it may be that these Schools actually have this amount of students.
But for our analysis: I removed 18,760 schools that contained less than 25 students and 442 that had more than 2000 students . ( Although there are schools with less than 25 students and more than 2,000, I don’t know if all the ones under 5 are correct or a typo, as are all the ones with more than 2,000 students. )
Generating the Box Plot again , it is now something more visible and we can better analyze the data…
We observed that most Schools have between 88 and 399 students, with a median of 188 students per School.
Maximum = 865.5 (above this value are considered outliers )
Q3 = 399.0
Median = 188.0
Q1 = 88.0
Minimum = -378.5 (below this value are considered outliers )
Now, let’s take a look at the distribution of the Number of Students (QtyStudents) attribute
We observe :
- Asymmetry with concentration on the left ( Positive Skew ): 1.94
- Leptokurtic Kurtosis: 4.37
We can observe that the great concentration of schools between 25 and 60 students
Business Intelligence
Based on these numbers and previous analyzes shown, a Dashboard was created , where you can iteratively navigate these Schools, by Region, State, Municipality, Search by name or INEP code… Check it out…
Model using Machine Learning (Classification)
Machine Learning is a subgroup of AI (Artificial Intelligence) that uses statistical techniques to give computers the ability to learn from data without being explicitly programmed.
A suggestion for a Machine Learning Model is to use Clustering (group schools with similar characteristics, for example: TP_SITUACAO_FUNCIONAMENTO, TP_DEPENDENCIA and TP_LOCALIZACAO)
Now, let’s build a Machine Learning model , using Classification .
For Model:
- A Target variable was created that informs whether the School is Public (Federal, State and Municipal) or Private .
- Removed attributes related to School Dependency and other attributes that do not add to the model, they were:
- ANO, INEP, ESCOLA, CO_MICRORREGIAO, CO_MESORREGIAO, CO_DISTRITO, CO_MUNICIPIO, CO_ORGAO_REGIONAL, TP_SITUACAO_FUNCIONAMENTO, DT_ANO_LETIVO_INICIO, DT_ANO_LETIVO_TERMINO, TP_DEPENDENCIA, TP_CATEGORIA_ESCOLA_PRIVADA, IN_CONVENIADA_PP, TP_CONVENIO_PODER_PUBLICO, IN_MANT_ESCOLA_PRIVADA_EMP, IN_MANT_ESCOLA_PRIVADA_ONG, IN_MANT_ESCOLA_PRIVADA_OSCIP, IN_MANT_ESCOLA_PRIV_ONG_OSCIP, IN_MANT_ESCOLA_PRIVADA_SIND, IN_MANT_ESCOLA_PRIVADA_SIST_S, IN_MANT_ESCOLA_PRIVADA_S_FINS, TP_REGULAMENTACAO, TP_RESPONSAVEL_REGULAMENTACAO.
Handled null values ( missing values )
- Example: TP_OCUPACAO_PREDIO_ESCOLAR and IN_BANDA_LARGA
Algorithm used: Random Forest
- Parameters : Sklearn default
- Training: 70% of the data
- Test: 30% of the data
Results obtained
- The algorithm had a reasonable performance in classifying a School as Public or Private, based on some characteristics of this School ( more information in… )
Results
Obtaining an F1-score of 96% for public schools and 87% for private schools.
Model accuracy was 94%.
Future works
Suggestions : Tuning of parameters, other Classifiers, new and more details of metrics and etc…