
Twitter Data Analysis for the Lazy in Elastic Stack (Xbox VS PlayStation)

Maciej Szymczyk
Oct 18 · 6 min read

Twitter data can be obtained in many ways, but who wants to write code 😉, especially code that has to run 24/7? In Elastic Stack, you can easily collect and analyze data from Twitter. Logstash has an input plugin for collecting tweets. Kafka Connect, discussed in the previous story, also has this option, but Logstash can send data to many destinations (including Apache Kafka) and is easier to use.

In the article:

  • Saving a tweet stream to Elasticsearch in Logstash
  • Visualizations in Kibana (Xbox vs PlayStation)
  • Removing HTML tags from a keyword field with a normalizer

Elastic Stack Environment

All the necessary components are defined in a single docker-compose file. If you already have an Elasticsearch cluster, you only need Logstash.

Logstash Pipeline

To get the tokens and keys, you need a Twitter developer account and a Twitter application. Setting these up takes care of the “formalities”.

Configuration of the pipeline is very simple. The stream will contain tweets that match the words given in keywords. If you want more metadata, just set full_tweet to true.
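
As a rough illustration, the sketch below renders a minimal pipeline of this shape from Python. The option names (consumer_key, consumer_secret, oauth_token, oauth_token_secret, keywords, full_tweet, hosts, index) come from the Logstash twitter input and elasticsearch output plugins. The environment variable names, the Elasticsearch host and the keyword list are assumptions made for this example, not values taken from the repository.

```python
import os

# Logstash pipeline template: twitter input -> elasticsearch output.
# Doubled braces are literal braces in str.format templates.
PIPELINE_TEMPLATE = """input {{
  twitter {{
    consumer_key => "{consumer_key}"
    consumer_secret => "{consumer_secret}"
    oauth_token => "{oauth_token}"
    oauth_token_secret => "{oauth_token_secret}"
    keywords => [{keywords}]
    full_tweet => {full_tweet}
  }}
}}

output {{
  elasticsearch {{
    hosts => ["{es_host}"]
    index => "{index}"
  }}
}}
"""


def render_pipeline(keywords, full_tweet=False,
                    es_host="http://elasticsearch:9200",  # assumed host name
                    index="tweets", env=None):
    """Return the pipeline config as text; credentials are read from env."""
    env = os.environ if env is None else env
    quoted = ", ".join('"{}"'.format(word) for word in keywords)
    return PIPELINE_TEMPLATE.format(
        # Hypothetical variable names; placeholders are used when unset.
        consumer_key=env.get("TWITTER_CONSUMER_KEY", "<consumer key>"),
        consumer_secret=env.get("TWITTER_CONSUMER_SECRET", "<consumer secret>"),
        oauth_token=env.get("TWITTER_OAUTH_TOKEN", "<oauth token>"),
        oauth_token_secret=env.get("TWITTER_OAUTH_TOKEN_SECRET", "<oauth token secret>"),
        keywords=quoted,
        full_tweet="true" if full_tweet else "false",
        es_host=es_host,
        index=index,
    )


if __name__ == "__main__":
    # Example keyword list (an assumption for this sketch).
    print(render_pipeline(["xbox", "playstation", "ps5"]))
```

Keeping the credentials in environment variables means the rendered file can be regenerated per environment without committing secrets.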

Data

Run docker-compose up -d and after a while data starts to appear in the tweets index. At the time of writing I had collected roughly two days of data. The index weighs ~430 MB, which is not much. Maybe a different license would allow for a larger data stream. The visualizations in this article cover those two days.

The tweets index is already there. You have to add an Index Pattern to be able to use the data in Kibana.

Tag Cloud — Xbox vs PlayStation

A simple Tag Cloud with a Terms aggregation on hashtags.text.keyword. PS5 seems to prevail, but we will confirm that with other visualizations.

Line — Xbox vs PlayStation

In this case too, I have the impression that PlayStation is superior to Xbox. To be 100% sure, let’s try to group the hashtags. Some people use PS5, others use ps5 and it is the same product.

However, before we go any further, let’s check something. Does the order of the buckets matter? Of course. Here is what happens if I swap the order of the Date Histogram and Terms buckets.

To group hashtags we can use Filters aggregation. We will add a few more hashtags, intentionally omitting the less numerous ones. The syntax in the Filter field is KQL, which is Lucene on steroids.

The filters are hashtags.text.keyword: (PS5 OR ps5 OR PlayStation5 OR PlayStation) and hashtags.text.keyword: (XboxSeriesX OR Xbox OR XboxSeriesS OR xbox). Now we are sure that PlayStation has more publicity on Twitter.
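
For readers who prefer to query Elasticsearch directly, the same grouping can be sketched as a Filters aggregation request. This is an approximation of what the Kibana visualization does, not the request Kibana actually sends. The bucket names and the aggregation name consoles are made up for this example, and terms queries are used in place of the KQL expressions.

```python
import json

HASHTAG_FIELD = "hashtags.text.keyword"

# Hashtag variants grouped per console, as listed in the filters above.
HASHTAG_GROUPS = {
    "PlayStation": ["PS5", "ps5", "PlayStation5", "PlayStation"],
    "Xbox": ["XboxSeriesX", "Xbox", "XboxSeriesS", "xbox"],
}


def console_filters_aggregation(groups=HASHTAG_GROUPS):
    """Build a search body with one named filter bucket per console."""
    return {
        "size": 0,  # only the bucket counts are needed, not the hits
        "aggs": {
            "consoles": {  # hypothetical aggregation name
                "filters": {
                    "filters": {
                        name: {"terms": {HASHTAG_FIELD: tags}}
                        for name, tags in groups.items()
                    }
                }
            }
        },
    }


if __name__ == "__main__":
    # Printed in a form that can be pasted into Dev Tools in Kibana.
    print("GET tweets/_search")
    print(json.dumps(console_filters_aggregation(), indent=2))
```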

Timelion

XBOX VS PLAYSTATION

The same, and even more, can be done in Timelion. It is an interesting tool for visualizing time series. It differs from the previous visualizations in that it can show data from many sources in one place.

You have to get used to the syntax. Below is the code that generated the above graph.
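
A minimal sketch of what such an expression could look like, assembled in Python so the two series share one helper. The functions .es() and .label() and the parameters index, timefield and q are part of Timelion. The @timestamp time field and the exact hashtag lists are assumptions, and es_series is a hypothetical helper name.

```python
def es_series(label, hashtags, index="tweets", timefield="@timestamp"):
    """Return one Timelion series counting tweets with any of the hashtags."""
    query = "hashtags.text.keyword:({})".format(" OR ".join(hashtags))
    args = [
        "index={}".format(index),
        "timefield='{}'".format(timefield),
        "q='{}'".format(query),
    ]
    return ".es({}).label('{}')".format(", ".join(args), label)


def playstation_vs_xbox():
    """Comma-separated series are drawn on the same Timelion chart."""
    playstation = es_series(
        "PlayStation", ["PS5", "ps5", "PlayStation5", "PlayStation"])
    xbox = es_series(
        "Xbox", ["XboxSeriesX", "Xbox", "XboxSeriesS", "xbox"])
    return "{}, {}".format(playstation, xbox)


if __name__ == "__main__":
    # Paste the printed expression into the Timelion expression box.
    print(playstation_vs_xbox())
```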

Offset

Timelion lets you shift a series in time with the offset parameter. Below is a comparison of the number of PlayStation tweets with the previous day. There is little data, so the effect is not particularly thrilling.

Function Variability / Delta

Using the offset parameter and the subtract function, we can calculate how much the series changes from one day to the next.
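
Both ideas can be sketched as Timelion expressions built in Python. The offset parameter of .es() and the .subtract() function exist in Timelion. The one-day offset (-1d), the labels and the helper name es are choices made for this illustration.

```python
PS_QUERY = "hashtags.text.keyword:(PS5 OR ps5 OR PlayStation5 OR PlayStation)"


def es(query, offset=None, index="tweets"):
    """Return a Timelion .es() call, optionally shifted in time by offset."""
    args = ["index={}".format(index), "q='{}'".format(query)]
    if offset is not None:
        args.append("offset={}".format(offset))
    return ".es({})".format(", ".join(args))


def offset_comparison(query=PS_QUERY, offset="-1d"):
    """Current series next to the same series shifted by the offset."""
    return "{}.label('today'), {}.label('previous day')".format(
        es(query), es(query, offset=offset))


def delta(query=PS_QUERY, offset="-1d"):
    """Current series minus the shifted series, point by point."""
    return "{}.subtract({}).label('delta')".format(
        es(query), es(query, offset=offset))


if __name__ == "__main__":
    print(offset_comparison())
    print(delta())
```

A positive value in the delta series means more tweets than at the same time one day earlier.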

Pie Chart — Client types

Ugly Pie

Let’s identify the clients used to write tweets. It turns out to be a little tricky. The client field contains an HTML tag, which makes the chart hard to read.

Pretty Pie

Elasticsearch has a lot of text processing capabilities. The html_strip character filter removes HTML tags. Unfortunately, it does not help here: analyzers can only be used with text fields, and we need a keyword field, because that is the type we can aggregate on.

For keyword fields we can use normalizers. They work in a similar way to analyzers, but they always produce a single token.

Below is the code adding the normalizer to the tweets index. I could not use html_strip, so I used a regular expression. Changing analyzer settings in the index requires closing the index. You can use the following code snippets in Dev Tools in Kibana.
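
A sketch of the three Dev Tools requests involved (close, update settings, open), generated from Python. The pattern_replace character filter and custom normalizers are standard Elasticsearch features. The names strip_html_tags and client_normalizer and the regular expression itself are assumptions, since the original snippet is not reproduced here. The small strip_tags helper only mimics the effect of the pattern locally so it can be checked without a cluster.

```python
import json
import re

# Assumed pattern: anything between "<" and the next ">" is treated as a tag.
HTML_TAG_PATTERN = "<[^>]*>"

ANALYSIS_SETTINGS = {
    "analysis": {
        "char_filter": {
            "strip_html_tags": {  # hypothetical filter name
                "type": "pattern_replace",
                "pattern": HTML_TAG_PATTERN,
                "replacement": "",
            }
        },
        "normalizer": {
            "client_normalizer": {  # hypothetical normalizer name
                "type": "custom",
                "char_filter": ["strip_html_tags"],
            }
        },
    }
}


def dev_tools_requests(index="tweets"):
    """Analysis settings can only be changed while the index is closed."""
    return [
        ("POST", "{}/_close".format(index), None),
        ("PUT", "{}/_settings".format(index), ANALYSIS_SETTINGS),
        ("POST", "{}/_open".format(index), None),
    ]


def strip_tags(value):
    """Local stand-in for the char filter, for checking the pattern."""
    return re.sub(HTML_TAG_PATTERN, "", value)


if __name__ == "__main__":
    for method, path, body in dev_tools_requests():
        print(method, path)
        if body is not None:
            print(json.dumps(body, indent=2))
    sample = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
    print(strip_tags(sample))
```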

After adding the normalizer, we can update the mapping of the client property with a new sub-field named value that uses it.
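
The mapping update could look roughly like the request generated below. It assumes that client was dynamically mapped as a text field and that the normalizer is called client_normalizer, which is a name invented for this sketch. The sub-field name value follows the text above.

```python
import json


def client_mapping_update(normalizer="client_normalizer"):
    """Add a normalized keyword sub-field 'value' to the 'client' field."""
    return {
        "properties": {
            "client": {
                "type": "text",  # assumed existing type of the field
                "fields": {
                    "value": {
                        "type": "keyword",
                        "normalizer": normalizer,
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # Printed in a form that can be pasted into Dev Tools in Kibana.
    print("PUT tweets/_mapping")
    print(json.dumps(client_mapping_update(), indent=2))
```

With this mapping in place, the cleaned value is available for aggregations as client.value.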

Unfortunately, that is not all. Documents are processed only when they are added to the index (why couldn’t they call it a collection like in MongoDB? 😅), so the existing documents do not get the new field automatically. We can reindex them in place with the Update By Query mechanism.

The operation will return the task id. It may take a while if you have a lot of data. You can find the task by entering GET _cat/tasks?v
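
A sketch of the requests involved, assuming the update is started in the background with wait_for_completion=false, which is what makes Elasticsearch answer with a task id. The conflicts=proceed parameter and the sample response used below are illustrative assumptions. The node id and task number in the sample are made up.

```python
import json


def update_by_query_request(index="tweets"):
    """Re-process every document in place so new sub-fields get populated."""
    return "POST {}/_update_by_query?conflicts=proceed&wait_for_completion=false".format(index)


def extract_task_id(response_text):
    """Pull the task id out of the JSON response of a background request."""
    return json.loads(response_text)["task"]


def task_status_request(task_id):
    """Request that shows the progress of a single task."""
    return "GET _tasks/{}".format(task_id)


if __name__ == "__main__":
    print(update_by_query_request())
    # Hypothetical response body; a real one comes from Elasticsearch.
    sample_response = '{"task": "exampleNodeId:12345"}'
    print(task_status_request(extract_task_id(sample_response)))
```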

After refreshing the Index Pattern in Kibana, we can see a more readable chart. A similar number of users tweet from iPhone and Android. I was intrigued by the client Bot Xbox Series X.

What next?

I wanted to investigate Spark NLP, but I decided to build a Twitter stream first. I plan to use ready-made Spark NLP models to detect language, sentiment and other parameters, all using Spark Structured Streaming.

Repository

https://github.com/zorteran/wiadro-danych-twitter-elastic-stack

The Startup

Medium's largest active publication, followed by +719K people. Follow to join our community.

Maciej Szymczyk

Written by

Software Developer, Big Data Engineer, Blogger (https://wiadrodanych.pl), Amateur Cyclists & Triathlete, @maciej_szymczyk
