Visualising seasonal uncertainty on Brazilian 2018 presidential bid with Python and Google Trends

Wilame
Aug 31, 2018


From June 2020, I will no longer be using Medium to publish new stories. Please, visit my personal blog if you want to continue to read my articles: https://vallant.in.

Sunset at Aracaju — Brazilian city. Picture by Diógenes Santos. License: Attribution 2.0 Generic (CC BY 2.0).

This year, in October, Brazil will elect its new president. The problem is that this is one of the most uncertain general elections that Brazil has ever had for a variety of reasons.

The first main reason is that the candidate with the most voting intentions is in jail. Luiz Inácio Lula da Silva appears as the preferred candidate in several polls, but we still don’t know if he will be allowed to run.

The second is that the previously elected Brazilian president, Dilma Rousseff, was impeached in 2016.

The candidate with the second most voting intentions is Jair Bolsonaro, a former military officer. Bolsonaro is known for his speeches against women, homosexuals and black people and for his admiration of the military dictatorship.

The third candidate is Marina Silva, an ecologist and former Brazilian minister. Marina was a political ally of Luiz Inácio Lula da Silva.

Today, we’ll try to visualise interest in these three candidates using Google Trends data. Let’s see if we are able to identify a seasonal component in the data. You can download the csv file here: https://github.com/wilame/DataScience/blob/master/data/eleicoes.csv.

Disclaimer: this analysis has no statistical value. It should be considered only from an educational standpoint, so we can learn more about concepts such as trend and seasonality.

Let’s load our data, check some basic information on it and change the header for something more meaningful.
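A minimal sketch of this step, assuming pandas. The inline sample stands in for eleicoes.csv: its header, its values and the new column names (month, lula, bolsonaro, marina) are assumptions made for illustration, so the real file may look different.

```python
import io

import pandas as pd

# Hypothetical excerpt standing in for eleicoes.csv (header and values are invented).
raw_csv = io.StringIO(
    "Mês,Lula: (Brasil),Jair Bolsonaro: (Brasil),Marina Silva: (Brasil)\n"
    "2004-01,20,<1,<1\n"
    "2004-02,18,<1,1\n"
    "2004-03,22,1,<1\n"
)

df = pd.read_csv(raw_csv)

# Basic information: column types, non-null counts and the first rows.
df.info()
print(df.head())

# Replace the verbose header with short names (assumed, not from the original file).
df.columns = ["month", "lula", "bolsonaro", "marina"]
print(df.head())
```

With the real file you would pass its path to pd.read_csv instead of the in-memory sample.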

The first thing to notice in this data is that we have ‘<1’ entries. This happens because Google doesn’t show us absolute search counts. Instead, it scales interest to a value between 0 and 100, where 100 marks the highest popularity of a search topic in the period.

We can’t work with ‘<1’ values, so let’s treat them as 0. We’ll also convert the columns where these values are present to ‘int’, and transform the ‘month’ column into a ‘DateTime’ type.
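The cleaning step could be sketched as follows, again with a small invented frame and the assumed column names month, lula, bolsonaro and marina. Setting the month as the index is an extra convenience for later time series operations, not something the text requires.

```python
import pandas as pd

# Invented sample with the same quirks as the Google Trends export.
df = pd.DataFrame(
    {
        "month": ["2004-01", "2004-02", "2004-03"],
        "lula": [20, 18, 22],
        "bolsonaro": ["<1", "<1", "1"],
        "marina": ["<1", "1", "<1"],
    }
)

candidates = ["lula", "bolsonaro", "marina"]
for col in candidates:
    # Treat '<1' as 0, then convert the whole column to integers.
    df[col] = df[col].astype(str).replace("<1", "0").astype(int)

# Parse 'YYYY-MM' strings into timestamps (first day of each month).
df["month"] = pd.to_datetime(df["month"], format="%Y-%m")

# Optional: index by date so rolling windows and plots follow the calendar.
df = df.set_index("month")
print(df)
```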

Exploratory data analysis

First, let me explain something: Brazilian presidential elections happen every 4 years. In this data, we have Google search interest for the candidates covering the 2006, 2010, 2014 and (partially) 2018 election years.

Let’s plot a graph to see how interest in the candidates evolves from year to year.

Notice that Marina has some interest peaks in 2008, 2010, 2014, 2016 and in 2018. Bolsonaro has some small peaks throughout the dataset and very high peaks starting from 2016.

For Lula, we have a very high peak of interest in 2006 and much smaller peaks (though still higher than those of the other two candidates) throughout the dataset. Very large peaks are also present after 2016.

Now, it’s time to better understand the problem by analysing Brazilian politics.

Lula served two terms as Brazilian president. We are not going to discuss his biography here. You can read it on Wikipedia (https://en.wikipedia.org/wiki/Luiz_In%C3%A1cio_Lula_da_Silva). The fact is that in 2018 he was jailed on corruption charges. Some see his imprisonment as fair, while others see it as politically motivated. At the time of writing, the Brazilian courts have not yet decided whether he will be allowed to take part in the election.

Marina was a Brazilian minister who first ran for president in 2010. She ran again in 2014, and in 2018 she ranks third in voting intentions, behind Jair Bolsonaro.

Jair Bolsonaro is a Brazilian deputy who gets a lot of media coverage, mainly because of his speeches in favour of loosening gun laws and his attacks on minorities.

As we can see, these three figures have a very intense political life and remain very active all year round.

What we want to check here is whether interest in these names is seasonal, that is, whether it rises and falls with election years.

Analysing trends

We can sometimes find patterns in temporal data. One of these patterns is the trend, a sustained increase or decrease in the values of the series. The other is seasonality, a cycle that repeats at regular intervals.

The first plot suggests that interest in Bolsonaro and Lula may be increasing, but all the ups and downs make it hard to say so with confidence.

We can see the trend better by plotting it. To do so, we take a rolling average over a window of a given size (for each time point, you take the average of the points on either side of it). Since we want to see the trend in interest for each presidential candidate, let’s use a 4-year window, the interval between two elections.

Our goal is to smooth out noise and seasonality to better see the trend.
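A rough sketch of the rolling average, assuming monthly data, so that a 4-year window spans 48 points. The series is synthetic (a ramp plus noise) and the name lula is only a label. By default pandas averages the 48 points up to and including each date; passing center=True places the window on either side of the point instead.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: a slow upward ramp plus random noise (not real data).
months = pd.date_range("2004-01-01", periods=120, freq="MS")
rng = np.random.default_rng(0)
interest = pd.Series(
    np.linspace(5, 50, 120) + rng.normal(0, 5, 120), index=months, name="lula"
)

window = 48  # 4 years of monthly observations

# Trailing window: each value averages the current month and the 47 before it.
trend = interest.rolling(window).mean()

# Centered window: the months averaged lie on either side of each point.
centered = interest.rolling(window, center=True).mean()

print(trend.dropna().head())
print(centered.dropna().head())
```

Either way, the first and last stretches of the series have no value, because a full window is not available there.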

Now we can see what’s happening in terms of interest. It seems to be really increasing for both male candidates. Interest remains stable for Marina.

The problem here is that we don’t know anything about the interest ‘quality’. Google data doesn’t show if the searches about the candidates are positive or negative.

Seasonality

Seasonality is a way to see if data behaves more or less in the same way given a specific time frame. With seasonality, we check if there’s a repetitive nature in data.

To check it, we remove the trend component from the data. One way to do this is a technique called ‘differencing’, which looks at the difference between successive data points. We can do it using the pandas diff() method.
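As a small worked example of first-order differencing, with invented values and a label (marina) chosen only for illustration:

```python
import pandas as pd

# Six invented monthly values with an overall upward drift.
months = pd.date_range("2004-01-01", periods=6, freq="MS")
interest = pd.Series([10, 12, 15, 13, 18, 20], index=months, name="marina")

# First-order difference: each value minus the previous one.
diff = interest.diff()
print(diff)
# The first entry is NaN because it has no predecessor;
# the remaining entries are 2, 3, -2, 5 and 2.
```

The differenced series fluctuates around a roughly constant level, which is why differencing is a common way to take the trend out before looking for repeating cycles.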

Seasonality usually occurs in the form of repetition. Seasonal data looks like this:

Google Flu Trends

Our data doesn’t seem to have anything like this, except for Marina.

What we have seen so far is that:

  • There’s a true increase in interest in Lula and Bolsonaro.
  • This increase in interest seems to be something new and sudden.

Could this mean that we actually don’t have a true seasonal component here? For Bolsonaro, we see small peaks in 2004, 2005, 2008, 2011 and from 2014 on. I searched for news from those years to see what happened.

  • 2004: Bolsonaro says that Brazilian indigenous people are smelly and mannerless.
  • 2005: Bolsonaro defends torture during Brazilian dictatorship.
  • 2008: Search results show more videos than text. Bolsonaro tries to run for president of the Chamber of Deputies.
  • 2011: Bolsonaro says on TV that his kids would never marry a black woman because they were well raised. He also says he wouldn’t support gay pride because he doesn’t support bad behaviour.
  • 2014: Bolsonaro says he wouldn’t rape a fellow female deputy because she was too ugly.
  • 2016: Bolsonaro announces he will run for president in 2018.

For Lula, we see many peaks, mainly because he was president for 8 years. More recently, he has had a lot of bad press due to the corruption accusations and his incarceration.

Since there was a lot of noise in Lula’s case, I decided to analyse the seasonal component using a 2-year period. We see a huge peak in 2006, during his reelection, and a smaller one in 2014, the year of the previous elections, when he supported Dilma Rousseff’s reelection.

For Marina, the peaks correspond to 2010 and 2014, the years in which she ran for president.

Periodicity and Autocorrelation

A time series is periodic if it repeats itself at equally spaced intervals, say every n time units. To detect this kind of repetition, we need to look at autocorrelation. But first, let’s analyse periodicity in our data by checking the correlation of trends.
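One way to check how the series move together is a pairwise correlation matrix. The sketch below assumes pandas and uses synthetic data in which two series share an upward ramp and a third is flat; the column names are placeholders, and the numbers say nothing about the real candidates.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data (not the Google Trends values).
months = pd.date_range("2004-01-01", periods=120, freq="MS")
rng = np.random.default_rng(1)
ramp = np.linspace(0, 40, 120)

df = pd.DataFrame(
    {
        "lula": ramp + rng.normal(0, 3, 120),             # rising
        "bolsonaro": 0.5 * ramp + rng.normal(0, 3, 120),  # rising more slowly
        "marina": rng.normal(10, 3, 120),                 # flat
    },
    index=months,
)

# Pearson correlation between every pair of columns.
corr = df.corr()
print(corr.round(2))
```

Two series that both trend upwards will show a high correlation here even if their short-term movements are unrelated, which is why the trend has to be removed before comparing seasonal behaviour.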

What we see here is that all of the variables seem to be positively correlated (even though the correlation between some pairs of candidates, such as Marina and the others, is very small).

We also see a progressive interest in Lula and Bolsonaro that appears to grow somewhat together. Note that Lula presents himself as the candidate furthest to the left, while Bolsonaro presents himself as the candidate furthest to the right of the Brazilian political spectrum.

So the trends are positively correlated. What about the seasonality? Let’s plot the first-order differences of these time series and then compute their correlation, which is approximately the correlation of the seasonal components. Remember that removing the trend may reveal correlation in seasonality.

Let’s re-plot the first-order differences for the data together with the correlation of the seasonal component:
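As a rough sketch of that computation, assuming pandas and the same kind of synthetic data as before (two placeholder series sharing a ramp, with independent noise):

```python
import numpy as np
import pandas as pd

# Synthetic monthly data: a shared upward ramp with independent noise (not real data).
months = pd.date_range("2004-01-01", periods=120, freq="MS")
rng = np.random.default_rng(1)
ramp = np.linspace(0, 40, 120)

df = pd.DataFrame(
    {
        "lula": ramp + rng.normal(0, 3, 120),
        "bolsonaro": 0.5 * ramp + rng.normal(0, 3, 120),
    },
    index=months,
)

# Correlation of the raw series is dominated by the shared trend.
level_corr = df.corr().loc["lula", "bolsonaro"]

# Differencing removes the trend; what remains is the month-to-month movement.
diff_corr = df.diff().corr().loc["lula", "bolsonaro"]

print(f"correlation of levels:      {level_corr:.2f}")
print(f"correlation of differences: {diff_corr:.2f}")
```

In this synthetic case the month-to-month movements are independent by construction, so the correlation should drop sharply once the trend is gone.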

The correlation is positive, but almost meaningless.

It’s time to look at the autocorrelation to see whether our data really has a seasonal component. First, let’s understand better what autocorrelation means.

Autocorrelation is also known as serial correlation. It represents the correlation of a signal with a delayed copy of itself. Here we measure the similarity between observations in order to find repeating patterns, such as the presence of a periodic signal.

On the autocorrelation plot, it’s important to notice the dotted lines above and below the curve. They mark the significance bands: only autocorrelation values that fall outside them are statistically significant.
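To make the idea concrete, here is a sketch that computes the autocorrelation numerically with pandas’ Series.autocorr on a synthetic series with a built-in 12-month cycle. The threshold of 1.96 / sqrt(N) is the usual approximate 95% band for a series with no autocorrelation; it is an assumption about how the bands are drawn, not a value taken from the original plot.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series with a clear yearly cycle plus a little noise (not real data).
n = 144
months = pd.date_range("2004-01-01", periods=n, freq="MS")
rng = np.random.default_rng(2)
seasonal = pd.Series(
    10 * np.sin(2 * np.pi * np.arange(n) / 12) + rng.normal(0, 1, n),
    index=months,
)

# Correlation of the series with a copy of itself shifted by 1 to 48 months.
acf = pd.Series({lag: seasonal.autocorr(lag=lag) for lag in range(1, 49)})

# Approximate 95% significance band under the no-autocorrelation assumption.
band = 1.96 / np.sqrt(n)
significant = acf[acf.abs() > band]

print(acf.loc[[6, 12, 24]].round(2))
print(f"band: +/-{band:.2f}, significant lags: {len(significant)}")
```

A seasonal series shows strong positive autocorrelation at multiples of its period (12, 24, 36 months here) and strong negative autocorrelation half a period away. A series without seasonality stays mostly inside the band.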

In this case, we haven’t really found any seasonal component.

Conclusion

We were able to see a trend for at least two candidates, but no seasonal component was found.

I don’t claim to predict who will be the next Brazilian president just by looking at some Google Trends data, but it’s kind of fun to review certain time series concepts using a current and relevant topic.
