This article is the first part of a series of articles about time series analysis and the automated detection of pulses and trends.
Social media is a marvellous source of data for learning about our clients. The goal of Synthesio is to collect this data, enrich it and then provide dashboards to analyze it. But we rapidly came to the conclusion that users do not have the time to perform all the manipulations needed to spot insights.
The objective of the Headlines project at Synthesio is to provide a global overview of points of interest across all the variables of a dashboard in one click.
This project is divided into two parts: automatic pulse detection and automatic trend detection.
In this first part, we will cover the data collection and the first exploratory analysis.
In our case, the data is stored in Elasticsearch databases. Since we work on time series, we needed to aggregate the data. We chose three intervals to surface as many insights as possible: days, weeks and months.
Having these three granularities matters because each one reveals different types of insights and events. Some events are concentrated on a single day, such as Black Friday, which generates a lot of content on social media. Others, on the contrary, span several days or weeks, such as sales or media campaigns. In these cases, it is useful to be able to spot insights over periods such as weeks or months.
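To make the effect of granularity concrete, here is a minimal sketch that rolls a daily series up into weekly and monthly buckets. The numbers are made up for illustration (a flat 100 mentions per day with a single spike on Black Friday 2019), and the helper names `roll_up`, `week_start` and `month_start` are hypothetical, not part of the Headlines code.

```python
from collections import defaultdict
from datetime import date, timedelta


def roll_up(daily, bucket_of):
    """Sum daily counts into coarser buckets.

    `bucket_of` maps a date to the label of the bucket it belongs to.
    """
    buckets = defaultdict(int)
    for day, count in daily.items():
        buckets[bucket_of(day)] += count
    return dict(buckets)


def week_start(day):
    """Return the Monday of the week containing `day`."""
    return day - timedelta(days=day.weekday())


def month_start(day):
    """Return the first day of the month containing `day`."""
    return day.replace(day=1)


# Made-up data: 100 mentions per day over October and November 2019,
# with a one-day spike on Black Friday (29 November 2019).
first_day = date(2019, 10, 1)
daily = {first_day + timedelta(days=i): 100 for i in range(61)}
black_friday = date(2019, 11, 29)
daily[black_friday] = 1000

weekly = roll_up(daily, week_start)
monthly = roll_up(daily, month_start)

print("Black Friday, daily count:", daily[black_friday])
print("Black Friday week total:", weekly[week_start(black_friday)])
print("November total:", monthly[month_start(black_friday)])
```

With these invented numbers, the spike is ten times the usual daily level, but its week is only about twice a normal week (1500 against 700) and November is only about 1.26 times October (3900 against 3100). A one-day event is therefore easiest to spot on daily data, while the coarser buckets are better suited to events that last longer.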
We also let the user choose variables and metrics. For variables, users can choose between topics, countries, sentiment and site types (media, social media, etc.), for example.
As for metrics, they can choose among several options, such as volume of mentions, sum of interactions (comments, likes, etc.) or reach. The idea is to see, for example, whether many people are posting about a subject during a particular period, or whether some subjects generate a lot of interactions from users.
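As an illustration, the choice of interval, variable and metric could be translated into an Elasticsearch aggregation request along the following lines. This is only a sketch: the field names (`published_at`, `interactions`, `reach`), the `METRIC_FIELDS` mapping and the `build_aggregation` helper are assumptions made for the example, not the actual Synthesio schema. It relies on the standard `terms`, `date_histogram` and `sum` aggregations, and on the `calendar_interval` parameter available in recent Elasticsearch versions.

```python
import json

# Hypothetical mapping from a metric name to the indexed field to sum.
# None means "just count the documents" (volume of mentions).
METRIC_FIELDS = {"volume": None, "interactions": "interactions", "reach": "reach"}
INTERVALS = {"day", "week", "month"}


def build_aggregation(interval, variable, metric, date_field="published_at", top_n=10):
    """Build a request body: one time series per value of `variable`."""
    if interval not in INTERVALS:
        raise ValueError(f"unsupported interval: {interval}")
    if metric not in METRIC_FIELDS:
        raise ValueError(f"unsupported metric: {metric}")

    # One bucket per day, week or month.
    histogram = {
        "date_histogram": {"field": date_field, "calendar_interval": interval}
    }
    # For volume, the bucket's doc_count is enough; otherwise sum the metric field.
    metric_field = METRIC_FIELDS[metric]
    if metric_field is not None:
        histogram["aggs"] = {"value": {"sum": {"field": metric_field}}}

    return {
        "size": 0,  # only aggregations are needed, not the matching documents
        "aggs": {
            "by_variable": {
                "terms": {"field": variable, "size": top_n},
                "aggs": {"over_time": histogram},
            }
        },
    }


body = build_aggregation("week", "country", "interactions")
print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the body of a search request. Each bucket of `by_variable` then contains its own time series under `over_time`.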
Once we have chosen our dimensions and metrics, let’s see what the data looks like.
We can clearly see that the data looks different depending on the time period. Some data sets show strong seasonality while others do not. Generally speaking, monthly data is the smoothest. Daily data, on the contrary, is very unstable and presents many irregularities.
In the example above, which is based on a dashboard about Formula 1 drivers and teams such as Renault F1, Mercedes AMG, etc., we can clearly see a strong seasonality related to the “Grands Prix” every two weeks. This is a perfect example of how unique every dashboard is. Other dashboards, notably one about tea, have more mentions during winter but no weekly seasonality.
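One common way to check for this kind of seasonality numerically, rather than by eye, is to look at the autocorrelation of the series at different lags. The sketch below is not necessarily the method used in Headlines. It uses a synthetic daily series with a peak every 14 days, loosely mimicking the Grand Prix rhythm, and the function names `autocorrelation` and `dominant_period` are hypothetical.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    mean = sum(series) / len(series)
    centered = [x - mean for x in series]
    variance = sum(c * c for c in centered)
    if variance == 0:
        return 0.0  # a constant series has no seasonality to measure
    covariance = sum(
        centered[i] * centered[i + lag] for i in range(len(series) - lag)
    )
    return covariance / variance


def dominant_period(series, max_lag):
    """Return the lag (from 2 to max_lag) with the highest autocorrelation."""
    return max(range(2, max_lag + 1), key=lambda lag: autocorrelation(series, lag))


# Synthetic data: 12 weeks of daily counts, baseline of 100 mentions
# with a peak of 500 every 14 days.
daily = [500 if day % 14 == 0 else 100 for day in range(84)]

print("Dominant period in days:", dominant_period(daily, max_lag=30))
print("Autocorrelation at lag 14:", round(autocorrelation(daily, 14), 3))
print("Autocorrelation at lag 7:", round(autocorrelation(daily, 7), 3))
```

On this synthetic series the strongest autocorrelation is at a lag of 14 days, while the lag of 7 days is negatively correlated. A series such as the tea dashboard, with no weekly pattern, would not show such a pronounced peak at short lags.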
As we are a SaaS company, we need to build a scalable, robust and efficient model that performs well on different types of time series. The model needs to serve many clients at the same time, and it also needs to respond quickly to client queries.
From an organisational point of view, the pulse and trend detection module will be built as an API written in Python. As the majority of Synthesio services are currently written in Go, another API will be created to communicate with the front end.