How to Capture and Store Tweets in Real Time with Apache Spark and Apache Kafka, Using Cloud Platforms such as Databricks and GCP (Part 1)
A Reproducible End-to-End Solution with Code and Tutorial
Hi everyone! This time I'd like to share an example of how to capture and store Twitter data in real time with Spark Streaming and Apache Kafka as open-source tools, using cloud platforms such as Databricks and Google Cloud Platform.
We'll use Spark to connect to Twitter, and through Kafka's Producer API we'll send the data to a Kafka topic.
Before jumping into development, it's important to understand some basic concepts:
Spark Streaming: It's an extension of the Apache Spark core API that handles data processing in near real time (micro-batches) in a scalable way. Spark Streaming can connect to different sources such as Apache Kafka, Apache Flume, Amazon Kinesis, Twitter and IoT sensors.
Apache Kafka: It's a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system. Kafka is generally used in real-time architectures that rely on streaming data to provide real-time analytics.
Let’s start!
1. Twitter App: to get tweets, we must register on Twitter Developers and create an application. Twitter will give us four important values that allow us to consume the information (the snippet right after this list shows where they go):
- Consumer Key (API Key)
- Consumer Secret (API Secret)
- Access Token
- Access Token Secret
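These four values are what we use to authenticate against the Twitter API from Spark. Assuming we go through the spark-streaming-twitter connector (which uses twitter4j underneath), a minimal sketch of how to hand them to the notebook is via JVM system properties, with placeholder values you must replace with your own:
// twitter4j reads its OAuth credentials from system properties
System.setProperty("twitter4j.oauth.consumerKey", "YOUR_CONSUMER_KEY")
System.setProperty("twitter4j.oauth.consumerSecret", "YOUR_CONSUMER_SECRET")
System.setProperty("twitter4j.oauth.accessToken", "YOUR_ACCESS_TOKEN")
System.setProperty("twitter4j.oauth.accessTokenSecret", "YOUR_ACCESS_TOKEN_SECRET")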
2. Spark Databricks: the Databricks platform allows us to create a free Spark cluster. We must sign up for Databricks and then create a Scala notebook where we'll write our code.
Before writing our code we must create a cluster and import two libraries: a Twitter library that lets us use the Twitter API from Spark, and a Kafka library that helps us connect to Apache Kafka.
Here is the documentation on how to import libraries using Maven: https://docs.databricks.com/user-guide/libraries.html
Create a cluster:
Then, we must select the option: Import Library
Then, we select Maven Coordinate as the source and write the name of the library: spark-streaming-kafka-0-10_2.11.
After that, we do the same for spark-streaming-twitter_2.11.
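For reference, full Maven coordinates look roughly like the two below; the group IDs and versions here are assumptions, so adjust them to the Spark version of your cluster (for Spark 2.x the Twitter connector is published by the Apache Bahir project):
org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0
org.apache.bahir:spark-streaming-twitter_2.11:2.2.0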
Now we only need to get our Kafka cluster up and running before we can execute our code.
3. Apache Kafka on Google Cloud Platform: to create the cluster we must follow these steps:
a) Create an instance in GCP and install Kafka, following the Kafka-Cloud tutorial.
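As a reference, an instance for a small test cluster can be created from the Cloud Shell with something like the command below; the name, zone, machine type and image are only example values, so pick whatever fits your project:
gcloud compute instances create NAME_VM --zone us-central1-a --machine-type n1-standard-1 --image-family ubuntu-1804-lts --image-project ubuntu-os-cloud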
b) Enable and create rules to expose ports 2181 (ZooKeeper) and 9092 (Kafka). From the SSH console of our virtual machine, enter the following commands (remember to change the variable NAME_VM):
gcloud compute firewall-rules create rule-allow-tcp-9092 --source-ranges 0.0.0.0/0 --target-tags allow-tcp-9092 --allow tcp:9092
gcloud compute firewall-rules create rule-allow-tcp-2181 --source-ranges 0.0.0.0/0 --target-tags allow-tcp-2181 --allow tcp:2181
gcloud compute instances add-tags NAME_VM --tags allow-tcp-2181
gcloud compute instances add-tags NAME_VM --tags allow-tcp-9092
c) Configure the Kafka properties file to expose its services. From the SSH console, add the following parameters to the configuration file server.properties:
vi /usr/local/kafka/config/server.properties
listeners=PLAINTEXT://your_internal_ip_vm:9092
advertised.listeners=PLAINTEXT://your_external_ip_vm:9092
d) Finally, restart the Kafka service so that the changes take effect.
sudo /usr/local/kafka/bin/kafka-server-stop.sh
sudo /usr/local/kafka/bin/kafka-server-start.sh -daemon /usr/local/kafka/config/server.properties
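The notebook below publishes to a topic called llamada. Depending on your broker configuration the topic may be auto-created on the first write; if not, you can create it by hand with something like the following (the --zookeeper flag assumes an older Kafka release; newer versions use --bootstrap-server instead):
sudo /usr/local/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic llamada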
Now we are ready to consume data from Twitter and store it in an Apache Kafka topic. Let's go!
In our notebook we enter the following code:
You can download the full notebook from: jmgcode
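The notebook boils down to three steps: create a StreamingContext, open a Twitter DStream, and push the text of each tweet to Kafka with a producer. Here is a minimal sketch of that flow; the 5-second batch interval, the filter keywords and the per-partition producer are assumptions of this sketch rather than details taken from the downloadable notebook:
import java.util.Properties
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Twitter credentials, as shown earlier (replace with your own values)
System.setProperty("twitter4j.oauth.consumerKey", "YOUR_CONSUMER_KEY")
System.setProperty("twitter4j.oauth.consumerSecret", "YOUR_CONSUMER_SECRET")
System.setProperty("twitter4j.oauth.accessToken", "YOUR_ACCESS_TOKEN")
System.setProperty("twitter4j.oauth.accessTokenSecret", "YOUR_ACCESS_TOKEN_SECRET")

val kafkaBrokers = "your_external_ip_vm:9092" // external IP of the GCP VM
val topic = "llamada"

// Micro-batches every 5 seconds, reusing the notebook's SparkContext (sc)
val ssc = new StreamingContext(sc, Seconds(5))

// DStream of tweets; the filter keywords are just an example
val tweets = TwitterUtils.createStream(ssc, None, Seq("spark", "kafka"))

// For each micro-batch, send the text of every tweet to the Kafka topic
tweets.map(_.getText).foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val props = new Properties()
    props.put("bootstrap.servers", kafkaBrokers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    partition.foreach(text => producer.send(new ProducerRecord[String, String](topic, text)))
    producer.close()
  }
}

ssc.start()
ssc.awaitTerminationOrTimeout(1000 * 60 * 10) // run for 10 minutes in this example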
To see the messages arriving in Kafka, you can run the following command within the GCP environment (via SSH):
sudo /usr/local/kafka/bin/kafka-console-consumer.sh --bootstrap-server your_internal_ip_vm:9092 --topic llamada
Within Databricks we can also follow the tweets as we process them:
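For example, a quick way to peek at what each micro-batch contains from the notebook itself is to add a print step to the DStream before starting the context (this is just an inspection aid and an assumption of this sketch, not part of the original notebook):
// Print the first elements of every micro-batch to the driver/cell output
tweets.map(_.getText).print()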
That's all for now. In the second part we'll see how to do some transformations on this information in real time and store the results in a database like Apache HBase, Cassandra or Apache Kudu.
See you in the next post :) ….