I do care about my City Hall’s Data.

Felipe Bormann
The Data Experience
7 min read · Jun 14, 2016

I am a Computer Science undergraduate student from Recife, Pernambuco, Brazil, and Recife is a tech hub. I can’t tell my whole story in tech here, but to sum up:

  • Studied at a technical school of Game Development (which also had lectures on Design and Game Design).
  • Passed the entrance exam for university (ENEM).
  • Joined the biggest IT company in town (C.E.S.A.R.).
  • Struggled with the math behind the course, left the job and learned about the Data Science field.
  • Started research on Educational Data Analysis.

Then I found out about http://www.dados.recife.pe.gov.br, an open data initiative from my city. It hosts a dataset about SAMU, the public emergency health care service. The dataset is entirely in Portuguese, but you can download it there; it covers all the call requests made in 2015. I decided to analyze this data to improve my skills. I hope you enjoy the bit of descriptive analysis I’ve done, along with some questions I raised and tried to answer.

What is it?

The data I’ve analyzed (and am still analyzing) covers the emergency calls made to SAMU in 2015 (for those who don’t know, it’s the public health emergency system in Brazil). There were 70,011 calls!

You can find this data at this link. It is divided into the following datasets: “ambulances”, “neighborhood”, “districts”, “expertise”, “removes” and “calls”.

Our variable of focus is desistência (“give up on call”), which is filled in a posteriori. It is the variable I’ll later try to predict with a Machine Learning algorithm. The problem is binary, so its values are “give up” or “not give up”.
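A minimal sketch of how such a binary target could be derived, assuming give-ups are marked by the discard reason “DESISTENCIA DA SOLICITAÇÃO” (the value used in the snippets further down). The helper name `is_give_up`, the sample list and the “TROTE” value are made up for illustration:

```python
# Discard reason that marks a give-up (taken from the groupby snippet in this post)
GIVE_UP_REASON = "DESISTENCIA DA SOLICITAÇÃO"


def is_give_up(discard_reason):
    """Return 1 if the call was given up on, 0 otherwise (missing reasons count as 0)."""
    return 1 if discard_reason == GIVE_UP_REASON else 0


# Hypothetical discard reasons; None stands for a call that was not discarded
sample_reasons = ["DESISTENCIA DA SOLICITAÇÃO", None, "TROTE", "DESISTENCIA DA SOLICITAÇÃO"]
labels = [is_give_up(reason) for reason in sample_reasons]
print(labels)  # [1, 0, 0, 1]
```

Treating every other reason (and missing values) as “not give up” is a design choice of this sketch, not something the dataset dictates.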

Exploratory Analysis

Basically, an exploratory analysis is used to understand how the variables relate to each other and which insights I can get from them.

When I split my dataset by sex, I can already plot a simple bar graph with “some” result:

The difference in desistência between men and women is small: men gave up 14.5% of the time and women 13.3%.
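As a rough sketch of the computation behind percentages like these, here is a plain-Python version on made-up rows. The keys `sexo` and `desistencia` are assumptions for illustration, not necessarily the dataset’s real column names:

```python
from collections import defaultdict


def give_up_rate_by_group(rows, group_key, label_key):
    """Percentage of give-ups (label 1) inside each group."""
    totals = defaultdict(int)
    give_ups = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += 1
        give_ups[row[group_key]] += row[label_key]
    return {group: 100.0 * give_ups[group] / totals[group] for group in totals}


# Toy data, not taken from the SAMU dataset
toy_calls = [
    {"sexo": "MASCULINO", "desistencia": 1},
    {"sexo": "MASCULINO", "desistencia": 0},
    {"sexo": "FEMININO", "desistencia": 0},
    {"sexo": "FEMININO", "desistencia": 0},
]
rates = give_up_rate_by_group(toy_calls, "sexo", "desistencia")
print(rates)  # {'MASCULINO': 50.0, 'FEMININO': 0.0}
```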

After the variable Sex, I decided to play with age and started experimenting with age ranges: for example, how do the elderly behave? Then I wondered how to do the segmentation correctly. It was cool to find that there is a legal definition of elderly in Brazil:

In Brazil, an elderly person is anyone who has reached 60 or more years of age, man or woman, national or foreign, urban or rural, private-sector or public-service worker, free or incarcerated, working or retired, including pensioners and whatever their social condition (Martinez, 2005, p. 20).

Source: http://boilerdo.blogspot.com.br/2013/04/quem-pode-ser-considerado-idoso-nos.html

Another source: http://www.planalto.gov.br/ccivil_03/leis/l8842.htm
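A small sketch of how that 60-year threshold could be turned into a segmentation. The function name `age_group` and the labels are hypothetical; only the threshold comes from the definition above:

```python
import math

ELDERLY_MIN_AGE = 60  # threshold from the legal definition of elderly in Brazil


def age_group(age):
    """Classify an age as 'elderly', 'non-elderly' or 'unknown' (missing age)."""
    if age is None or (isinstance(age, float) and math.isnan(age)):
        return "unknown"
    return "elderly" if age >= ELDERLY_MIN_AGE else "non-elderly"


groups = [age_group(a) for a in [59, 60, 83, None, float("nan")]]
print(groups)  # ['non-elderly', 'elderly', 'elderly', 'unknown', 'unknown']
```

Missing ages get their own label so that they don’t silently fall into the “non-elderly” group.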

Data Transformation (and continuing the analysis)

There is a column called “solicitacao_data” (call date), so I thought about which questions I could answer based on time. Here goes the first one:

Which is the worst day and time to call SAMU?

Since it is a date, the first transformation I had to do was to create a column that, given a date, returns the day of the week. The distribution below is the result:
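A minimal sketch of that transformation in plain Python. It assumes the dates come as strings like `2015-01-02 23:15`; the helper name `weekday_name` is made up for this example:

```python
import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M"  # assumed format of the date strings
WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]


def weekday_name(date_string):
    """Return the English weekday name for a date string in DATE_FORMAT."""
    parsed = datetime.datetime.strptime(date_string, DATE_FORMAT)
    # datetime.weekday() returns 0 for Monday through 6 for Sunday
    return WEEKDAY_NAMES[parsed.weekday()]


print(weekday_name("2015-01-02 23:15"))  # Friday
```

Applied to every value of the date column, a function like this would produce the weekday column used for the grouping.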

And how did I answer the question from this section?

With the following code (#easy #butno), I calculated the number of minutes between the call and the arrival at the hospital, although I still think I have to look for outliers (my bad, I haven’t done it yet):

import datetime
import time

# Test: compute the waiting time until arrival at the hospital
tempo_1 = solicitacoes_2015["data_acionamento"][0]
tempo_2 = solicitacoes_2015["data_chegada"][0]
fmt = "%Y-%m-%d %H:%M"  # strptime format string (not a regex)
tempo_1_parsed = datetime.datetime.strptime(tempo_1, fmt)
tempo_2_parsed = datetime.datetime.strptime(tempo_2, fmt)
tp1_ts = time.mktime(tempo_1_parsed.timetuple())
tp2_ts = time.mktime(tempo_2_parsed.timetuple())
print(tempo_1)
print(tempo_2)
print(str(int(tp2_ts - tp1_ts) / 60) + " minutos")

# Same computation for every call; 9999 marks a missing value
index = 0
tempo_de_transporte = []
for data_acionamento in solicitacoes_2015["data_acionamento"]:
    if type(data_acionamento) is not float:  # NaN is considered a float
        data_acionamento_parser = datetime.datetime.strptime(data_acionamento, fmt)
        data_acionamento_ts = time.mktime(data_acionamento_parser.timetuple())
        if type(solicitacoes_2015["data_chegada"].iloc[index]) is not float:
            data_chegada_parser = datetime.datetime.strptime(solicitacoes_2015["data_chegada"].iloc[index], fmt)
            data_chegada_ts = time.mktime(data_chegada_parser.timetuple())
            tempo_de_transporte_instancia = int(data_chegada_ts - data_acionamento_ts) / 60
            tempo_de_transporte.append(tempo_de_transporte_instancia)
            print(str(tempo_de_transporte_instancia) + " minutos")
        else:
            tempo_de_transporte.append(9999)  # missing
    else:
        tempo_de_transporte.append(9999)  # missing
    index += 1

# Store the result in the column used by the later snippets
solicitacoes_2015["tempo_de_transporte_minutos"] = tempo_de_transporte
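The per-row logic above can also be packed into one small helper. This is only a sketch under the same assumptions (string timestamps in the `%Y-%m-%d %H:%M` format, missing values arriving as float NaN); the name `minutes_between` is made up, and it returns None instead of the 9999 marker:

```python
import datetime

FMT = "%Y-%m-%d %H:%M"


def minutes_between(start, end, fmt=FMT):
    """Minutes from start to end, or None when either timestamp is missing."""
    if not isinstance(start, str) or not isinstance(end, str):
        return None  # missing values show up as float NaN, not as strings
    start_dt = datetime.datetime.strptime(start, fmt)
    end_dt = datetime.datetime.strptime(end, fmt)
    return (end_dt - start_dt).total_seconds() / 60


print(minutes_between("2015-03-01 10:00", "2015-03-01 10:38"))  # 38.0
print(minutes_between("2015-03-01 23:50", "2015-03-02 00:20"))  # 30.0
print(minutes_between(float("nan"), "2015-03-01 10:38"))        # None
```

Subtracting the two datetime objects directly avoids going through `time.mktime`, which interprets the values in the machine’s local time zone, and a None result cannot be mistaken for a real duration in a later sum or average.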

To do something with this data, I had to use the groupby function which, as the name says, creates groups from the distinct values of a column.

# Grouping by day of the week: which day has the most give-ups?
solicitacoes_2015_por_dia = solicitacoes_2015.groupby("solicitacao_diadasemana")
for dia, solicitacoes in solicitacoes_2015_por_dia:
    print("No dia " + dia + "\n")
    qtd_desistencias = solicitacoes[solicitacoes["motivodescarte_descricao"] == "DESISTENCIA DA SOLICITAÇÃO"].shape[0]
    print(str(qtd_desistencias) + " Desistências")
    # Multiply by 100 so the printed value really is a percentage
    print(str(100 * qtd_desistencias / solicitacoes.shape[0]) + "% de desistências sobre todas as chamadas")
    print("\n")

I couldn’t work out the hour yet, but SAMU takes way longer on Fridays (Good #Party :/)

#TIP 1

If possible, not on Friday.

There are many possible reasons for this (which I can think of but can’t prove), and I haven’t got any other dataset to check them. It’s possible that everyone is going out on Friday night; it may be interesting to match the slowest hospitals with their locations. (Does somebody have a dataset about Recife traffic?)

Which days have the most calls to SAMU?

In decreasing order, the days with the most calls to SAMU were Sunday, Saturday and Monday.

On Sunday, the waiting time for SAMU was about 38 minutes on average (removing the calls marked “9999”, i.e. missing values), but that still includes the calls that were discarded. Here is the snippet to get the data:

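for dia, solicitacoes in solicitacoes_2015_por_dia:
    print("No dia " + dia + "\n")
    # Keep only the calls with a known transport time (9999 marks missing)
    tempos_validos = solicitacoes[solicitacoes["tempo_de_transporte_minutos"] != 9999]["tempo_de_transporte_minutos"]
    qtd_tempo_gasto = tempos_validos.sum()
    print(str(qtd_tempo_gasto) + " total no dia durante 2015")
    # Average over the valid calls only, not over every call of the day
    print(str(qtd_tempo_gasto / tempos_validos.shape[0]) + " de minutos em média")
    print("\n")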
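One detail worth checking in an average like this is the denominator: once the “9999” rows are dropped from the sum, they should be dropped from the count too. A tiny sketch on made-up numbers (the helper name `mean_wait_minutes` is hypothetical):

```python
MISSING = 9999  # marker used in this post for missing transport times


def mean_wait_minutes(wait_times, missing=MISSING):
    """Average of the valid wait times, or None if every value is missing."""
    valid = [t for t in wait_times if t != missing]
    if not valid:
        return None
    return sum(valid) / len(valid)


toy_waits = [30, 46, 9999, 38, 9999]  # toy values, not real data
print(mean_wait_minutes(toy_waits))   # 38.0
# Dividing the same sum by all 5 rows would give 22.8 instead
print(sum(t for t in toy_waits if t != MISSING) / len(toy_waits))  # 22.8
```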

These transformations were relatively easy to do and brought up some interesting things, for example our next question:

Which days have the most give-ups?

And again, the champion is Sunday, with 1800 give-ups, about 16.7% of all calls on Sunday. This means that SAMU has a margin of almost 17% of WASTED CALLS. Imagine some possible causes: someone else rescued the patient first, the patient was rescued by a private company (some people call both public and private services), or it was a “mock call” (when you call but you’re lying).

P.S.: A “mock call” is a crime, and there are still Homo sapiens who do it.

In the section “Quanto custa” (“How much it costs”) of the text above, you can read (it is in Portuguese) that the team and equipment of a single SAMU cost R$ 30 thousand! Imagine the cost of moving all of this for nothing 17% of the time.

#TIP 2

STAY CALM.

Analyze the situation: if someone else can get the patient help faster, DON’T CALL SAMU.

What are the major reasons to call?

Among all the patients who were helped, these are the top 5 reasons:

CAUSAS EXTERNAS    10378
NEUROLOGICA         2962
CARDIOLOGICA        2383
INFECÇÃO            2219
RESPIRATORIA        1960
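A ranking like this boils down to counting how often each reason appears and keeping the most frequent ones. A minimal sketch with the standard library, on a made-up list of reasons (the function name `top_reasons` is hypothetical):

```python
from collections import Counter


def top_reasons(reasons, n=5):
    """Return the n most frequent reasons as (reason, count) pairs, ignoring missing ones."""
    counts = Counter(reason for reason in reasons if reason is not None)
    return counts.most_common(n)


# Toy list, not the real data
toy_reasons = ["CAUSAS EXTERNAS", "NEUROLOGICA", "CAUSAS EXTERNAS", None,
               "CARDIOLOGICA", "CAUSAS EXTERNAS", "NEUROLOGICA"]
top_two = top_reasons(toy_reasons, n=2)
print(top_two)  # [('CAUSAS EXTERNAS', 3), ('NEUROLOGICA', 2)]
```

Running the same function on the reasons of one sex at a time would give a stratified ranking.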

I don’t know (yet) what causas externas (“external causes”) means. Neurologica, in my head (I haven’t interviewed anyone at SAMU to know more about it), must be head injuries and impacts from car accidents. The same goes for cardiologica: yes, a lot of people suffer from heart problems, mainly the elderly, as I could see from the data (next section).

If I stratify the data by sex, the top 5 changes:

Masculino:

CAUSAS EXTERNAS    1024
INFECÇÃO            753
CARDIOLOGICA        735
NEUROLOGICA         732
RESPIRATORIA        683

Feminino:

CAUSAS EXTERNAS     807
INFECÇÃO            605
CARDIOLOGICA        595
RESPIRATORIA        532
NEUROLOGICA         523

And the neighborhoods?

Part of the dataset is a file called “neighborhood_2015.csv”, which describes in detail the data in the column “bairrosaude_descricao”. I simply asked myself which neighborhood had the most calls:

To understand why, I took a look at how IBGE describes the CENTRO neighborhood. Its number of calls is far out of line with the other neighborhoods, but on the IBGE page about Recife there is no neighborhood called “Centro”. I still have to find out which “neighborhoods” CENTRO groups together.

I still think this variable will be of great value for the model I’m going to create.

For the second part

In the second part I’ll show more dataset transformations and the implementation of a few models to predict the “withdrawal” (desistência) variable.

I’ll also cover metrics like correlation and how to create a ROC curve (since it’s a binary problem).

I hope you liked the post and here are my social media links:

Youtube Channel debugasse: https://www.youtube.com/channel/UCey2da8VAlR--glrFCIggmA

Github: https://github.com/fbormann
