Improving geographic data using the Brazilian National Census

Bruno Belluomini
Published in Analytics Vidhya, Mar 24, 2020 (4 min read)
Photo by Adolfo Félix on Unsplash

Geolocation matters in many fields. Delivery services, insurers, and franchises all have a strong interest in understanding a region in depth.

Since Creditas offers Home Equity loans, location is an important factor that directly influences the value of the collateral, which affects the sum that can be lent.

Within the context of data science, knowing the ZIP code, neighborhood, city, and state isn’t enough. What matters more is finding other, more general information that conveys the characteristics of the specified location.

Let’s say that a machine learning model was trained with a feature that holds only the neighborhood name, on a dataset containing only examples from the state of São Paulo. If this model received a client from Rio de Janeiro as input, it wouldn’t know what to do with the neighborhood value “Ipanema”, and the result would be either an error or a wrong prediction.

On the other hand, if the model receives the number of inhabitants between 18 and 60 years old in that same neighborhood, the geographic factor generalizes, since the same measure can be computed for any other place.

Therefore, always try to use features such as average_income_per_capita or thefts_per_100_inhabitants instead of the name of the neighborhood or city, for example.
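To make the contrast concrete, here is a minimal sketch of what happens to a name-based feature when an unseen neighborhood arrives. The list of training neighborhoods is made up for illustration, and one-hot encoding is only one of several ways a model might consume a name.

```python
# Hypothetical neighborhoods seen during training (all in São Paulo).
SEEN_NEIGHBORHOODS = ["Pinheiros", "Itaim Bibi", "Moema"]


def one_hot(neighborhood):
    """Encode a neighborhood name against the training vocabulary."""
    return [1 if neighborhood == seen else 0 for seen in SEEN_NEIGHBORHOODS]


# A known name lights up one position; an unknown name becomes all zeros,
# so the model receives no information about the location at all.
print(one_hot("Itaim Bibi"))
print(one_hot("Ipanema"))
```

A numeric feature such as average_income_per_capita does not have this problem, because it has a meaningful value for any neighborhood, seen or unseen.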

“But where to get this data?” you ask yourself. There are some open data sources, like Geosampa for the city of São Paulo, but on a national scale, one of the best to use is the IBGE national census. Depending on the geographical location you are focusing on, the place where you get this information might vary; however, this is how I did it for Brazil.

About IBGE’s National Census

The IBGE National Census is the main statistical study of the Brazilian population. It is carried out every 10 years, most recently in 2010 as of this writing. In it, you can find information about literacy, age, urban infrastructure, etc.

The collected data is based on the smallest territorial unit, called a census sector. There were around 314 thousand census sectors mapped in Brazil’s 2010 Census. Each census sector has a unique id formatted as <UF><MMMMM><DD><SD><SSSS>, where:

UF – Unidade da Federação (Federation Unit)
MMMMM – Município (Municipality)
DD – Distrito (District)
SD – Subdistrito (Subdistrict)
SSSS – Setor (Sector)
Example of the organization of census sectors in Itaim Bibi (in light blue) in the city of São Paulo. The dark blue lines serve as boundaries between each sector. From https://censo2010.ibge.gov.br/sinopseporsetores/ (in Portuguese)
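The id layout described above can be split into its parts with plain string slicing. This is a minimal sketch: the SectorCode type and the parse_sector_code function are hypothetical names introduced here, and the field widths are taken from the pattern above (2 + 5 + 2 + 2 + 4 = 15 digits). The code is handled as a string so that leading zeros in any part are preserved.

```python
from typing import NamedTuple


class SectorCode(NamedTuple):
    """Parts of a 15-digit census sector id (hypothetical helper type)."""
    uf: str            # Unidade da Federação
    municipality: str  # Município
    district: str      # Distrito
    subdistrict: str   # Subdistrito
    sector: str        # Setor


def parse_sector_code(code: str) -> SectorCode:
    """Split a census sector id into <UF><MMMMM><DD><SD><SSSS>."""
    if len(code) != 15 or not code.isdigit():
        raise ValueError(f"expected a 15-digit sector code, got {code!r}")
    return SectorCode(
        uf=code[0:2],
        municipality=code[2:7],
        district=code[7:9],
        subdistrict=code[9:11],
        sector=code[11:15],
    )


print(parse_sector_code("355030835000017"))
```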

The database compiled by the 2010 IBGE Census is available from IBGE's download server under the following path:

Censos
└── Censo_Demografico_2010
    └── Resultados_do_Universo
        └── Agregados_por_Setores_Censitarios
            └── <bases_de_dados>

The data files use iso8859_15 encoding and ; as the field separator.
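With those two settings, a census file can be loaded with pandas. The sketch below is an assumption about how one might do it, not the original code from this post: read_census_file is a hypothetical helper, and the tiny sample file it reads contains made-up values so that the example runs on its own. Reading Cod_setor as a string avoids the 15-digit code being turned into a number.

```python
import os
import tempfile

import pandas as pd


def read_census_file(path):
    """Load one census data file (';' separated, iso8859_15 encoded)."""
    return pd.read_csv(
        path,
        sep=";",
        encoding="iso8859_15",
        dtype={"Cod_setor": str},  # keep the sector code as text
    )


# Made-up sample standing in for a real file from the census database.
sample = "Cod_setor;V001;V002\n355030835000017;10;20\n"

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    with open(sample_path, "w", encoding="iso8859_15") as handle:
        handle.write(sample)
    census = read_census_file(sample_path)

print(census)
```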

Enough talking — Let’s get down to business

Without further ado, let’s look at an example focusing on Creditas’ address:

Formatted example of the Dataframe above

Our objective is to attach the Census variables to this address. When you open the Census database, you will come across something similar to the table below:

Cod_setor is the code of the census sector, and V001, V002 and V003 are examples of the variables you will find. Their explanations are in the Census documentation (in Portuguese), which is available at the same link as the database.

So, the challenge is to turn the address string into the sector code 355030835000017 so that the variables can be looked up. The link between the two is a geospatial file called a Shapefile (.shp), which loads as a data frame of geographic features in a format similar to this:

The two columns that matter to us are CD_GEOCODI, the census code, and geometry, which contains a POLYGON object with the geographic coordinates (longitude and latitude, in that order) of the mapped sector.

The general idea is to convert the address into longitude and latitude, and then use the Shapefile to find the code of the sector that contains that point.

The census sector Shapefile (.shp) files can be downloaded from IBGE.

Converting the address to longitude and latitude

Several services can convert an address into latitude and longitude. In this example, we’ll use the Geocoding API from Google Maps, together with the googlemaps Python library.

To use this API, an authentication key is necessary. You can get one with a limited number of free requests here.
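A minimal sketch of the geocoding step is shown below. The geocode_address function is a hypothetical helper, not the original code from this post. It expects a client object with a geocode method, which is what googlemaps.Client provides, and reads the first match from the response. The key in the usage comment is a placeholder.

```python
def geocode_address(client, address):
    """Return (lat, lng) for the first geocoding match, or None if no match."""
    results = client.geocode(address)
    if not results:
        return None
    location = results[0]["geometry"]["location"]
    return location["lat"], location["lng"]


# Typical use (requires network access and a valid key):
#
#   import googlemaps
#   client = googlemaps.Client(key="YOUR_API_KEY")
#   coordinates = geocode_address(client, "<address>")
```

Passing the client in as an argument keeps the key out of the function and makes it easy to swap in another geocoding service that returns the same response shape.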

Obtaining the census sector’s code from the coordinates

Now that we have the coordinates of the location, we need to find out which census sector contains it. The next step is to convert the coordinates into a Point object so that they can be compared with the geometries in the Shapefile. We’re going to use Python’s shapely library to do this.
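The conversion can be sketched as follows. The to_point helper is a hypothetical name, and the polygon and coordinates are made up for illustration; they are not a real census sector. The detail worth noticing is the argument order: shapely works with (x, y), which means longitude first and latitude second.

```python
from shapely.geometry import Point, Polygon


def to_point(lat, lng):
    """Build a shapely Point from latitude/longitude (note the x, y order)."""
    return Point(lng, lat)


# Made-up square standing in for a census sector polygon (lng, lat pairs).
sector = Polygon([
    (-46.70, -23.60),
    (-46.68, -23.60),
    (-46.68, -23.58),
    (-46.70, -23.58),
])

point = to_point(lat=-23.59, lng=-46.69)
print(sector.contains(point))

# Swapping the order silently places the point somewhere else entirely.
swapped = Point(-23.59, -46.69)
print(sector.contains(swapped))
```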

Now, load the Shapefile of the census sectors using the geopandas library and match the location to the polygon of the sector that contains it.
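One way to do this matching is a spatial join, sketched below. This is an assumption about the approach rather than the original code from this post. The sectors frame is built in memory with made-up polygons and sector codes so that the example runs on its own; with real data it would come from gpd.read_file pointed at the .shp file. The CRS used here is also an assumption, and both layers need to share the same one. The predicate argument requires geopandas 0.10 or newer (older versions call it op).

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Stand-in for: sectors = gpd.read_file("<path to the census sectors .shp>")
sectors = gpd.GeoDataFrame(
    {"CD_GEOCODI": ["000000000000001", "000000000000002"]},  # made-up codes
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
    ],
    crs="EPSG:4326",
)

# Geocoded addresses as points, in (longitude, latitude) order.
addresses = gpd.GeoDataFrame(
    {"address": ["inside first", "inside second", "outside"]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(5, 5)],
    crs="EPSG:4326",
)

# Left join: keep every address, attach the sector whose polygon contains it.
matched = gpd.sjoin(addresses, sectors, how="left", predicate="within")
print(matched[["address", "CD_GEOCODI"]])
```

Addresses that fall outside every polygon keep a missing CD_GEOCODI, which is a useful signal that the geocoding result should be checked.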

Finally, the last step is to merge our dataset with the one containing the census variables.
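The merge can be sketched with pandas as below. Both frames are built by hand with made-up values, including a second, hypothetical sector code, so the numbers carry no meaning. The column names CD_GEOCODI and Cod_setor are the ones used by the Shapefile and the census files, and both must hold the code as a string for the keys to match.

```python
import pandas as pd

# Our dataset after the sector lookup (address text is a placeholder).
addresses = pd.DataFrame({
    "address": ["<address>"],
    "CD_GEOCODI": ["355030835000017"],
})

# Census variables per sector; the values here are made up.
census = pd.DataFrame({
    "Cod_setor": ["355030835000017", "355030835000018"],
    "V001": [100, 200],
    "V002": [1.5, 2.5],
})

# Left merge: keep every address and bring in the variables of its sector.
enriched = addresses.merge(
    census,
    left_on="CD_GEOCODI",
    right_on="Cod_setor",
    how="left",
)
print(enriched)
```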

The final result will look like this:

Reminder: the meaning of the V001, V002 and V003 variables can be found in the Census documentation (in Portuguese).

Good! You’ve just enriched your database with more informative address characteristics. Remember that you can always look for other data sources and apply the same concept so that, using geographic factors, you can perform analyses and predictions anywhere =D

Interested in working with us? We’re always looking for people passionate about technology to join our crew! You can check out our openings here.

About the author: Bruno Belluomini is a Data Scientist at Creditas with a degree in civil engineering who traded concrete for data.