Image from https://unsplash.com/@hal9001

Flying to competitivity

Enrique Alvarado
MCD-UNISON
Published in
6 min readDec 19, 2020

--

Relation between arrival-departures flights and competitive cities.

Data analisys based in study “Ciudades Resilientes” from IMCO.

Days ago I was hearing a note in a local media about a study called “Ciudades Resilitentes-Indice de competitividad Urbana”. In the interview, IMCO’s Investigator showed data about 73 cities in México and their competitivity rank and mentioned Hermosillo, as a fourth in general ranking in the 500K-1M; hard data, and interesting about the capability of the cities to attract and retain investment and talent that is traduced in wellness for their habitants.

In the present article, we’re going through data of this excellent investigation and their relation with arrivals and departure flights between all those cities with data found in datosmexico.org, to practice EDA and some skills acquired in feature engineering course.

Searching in IMCO site I found the source dataset of competitive cities, and scratching a little in https://datos.gob.mx/ I found data about all arrivals and departures between the study cities and more, so… let’s go to the data!

Getting the data

There’s 48 sheet inside the excel file from the IMCO site, the first one is the well explained contains of the rest, as well as the used methodology to construct the ranking of the cities.

I used the general ranking sheet in order to do simple, to demonstrate the cleaning of data, I’ll merge with another source.

Load the libraries that we’ll need:

Load the dataset from the excel file, the dataset to use came in well-formatted and shape, additionally I’ll load the population size from the same excel file, in a different sheet.

Name the population dataset columns

From the ranking dataset check the column names, doesnt have:

Cleaning Data

I’ll create a series from years to search in columns, create an array with the position of year changes and create an additional column for the replicate year:

Copy the year value in my new column in the correspondent position and fill the rest with fillna (ffill)

Drop empty values and rename columns with first header.

Rename the year column and drop duplicates to remove the repeated headers.

I found that city names in the first and second dataset are different for the same city, I use a not fancy method, but…..

Now I’ll merge with the population dataframe to get the cities size, regions and category, after that read the Flights dataset:

Flights dataset is an excel file contains data with all arrivals and departure flights between all cities with airports in Mexico, including international flights.

Remove totals, remove null and columns from years that I don’t need

I’ll create series from city names to leave only flights between the list of the ranking (there was international flights and to other cities)

Drop observations with no flights in all years to get at least one flight in any listed year, and transform to get a new column with year and flights of the year between two cities and remove again all obs with no flight in a year.

I’ll merge with the ranking data frame to get cities with their rank, population, and flights (departed and arrival) between cities.

There are some NaN values in the flights, they are real zero flight values so I’ll impute with zero, the result is my tidy dataset so I’ll save it in CSV file.

Analizing Data

With my tidy data let get some info about the data and some stats.

Let see in a graphic some more data, like the highest-ranked cities in 2010 categorized by city size.

Great!, Hermosillo is one of them… what about 2018?

Let's take a look at the same cities but along of years…

Now, obtain correlation and plot a graphic between my variables

Finally, plot the relation between the amount of flights by cities, the flights include departed and arrival between the cities.

As we can expect the bigger the city is, the biggest their flight's amount, but not necessarily this amount of flights means that the city is more competitive, small cities with a small quantity of flights like Hermosillo, appeared at the top of competitive cities with frequence.

Conclusion.

There’s a lot of companies, institution, and individuals that collect a big and Amazing amount of data, those persons most time don’t know the potential of that useful data to make decisions; in another hand there's a lot of data available and public, a combination of one of more of these databases can lead us to discover invaluable information about the data and their relation that we cannot easily see.

All data source, source code with tidy data, and dashboard with graphics is available:

Dataset IMCO : https://imco.org.mx/indice-de-competitividad-urbana-2020/

Flights Dataset : https://datos.gob.mx/busca

Source Github link : https://github.com/ealvaradoc72/ICU_Analisys

Dashboard: https://ealvaradoc72.github.io/ICU_Analisys/

--

--

Enrique Alvarado
MCD-UNISON

Data Science master student, Programmer, Tech & Data Passionate