Build a Tweet analysis service on Oracle Cloud using Cassandra & Spring Data

Abhishek Gupta
Published in Oracle Developers
10 min read, Jan 9, 2018

Cloud native and microservices based systems can afford to be polyglot, i.e. a system can be built as a combination of multiple distributed services, each using a runtime/language suited to its needs. Typically, each service also has a dedicated data store. Again, this is specific to the service's needs: an RDBMS or a NoSQL database (or maybe NewSQL?)

One such member of the NoSQL family is Apache Cassandra, a distributed (masterless) database with a unique wide row data model. It's horizontally scalable and partitioned/sharded, with built-in support for multi data center replication, making it a compelling choice for always-on, resilient cloud based services

The good news is that Oracle Cloud now includes Data Hub Cloud Service which offers Cassandra as a managed solution

This blog introduces you to Oracle Data Hub Cloud Service with the help of a practical yet simple time series app (typical Cassandra use case) based on Twitter (and tweets of course!) where

  • the continuous stream of tweets (high velocity data) is consumed and persisted to Data Hub Cloud
  • and then the same is queried from Data Hub Cloud using another service

The above mentioned (micro) services are deployed to Oracle Application Container Cloud (a Polyglot, Cloud native application development platform) and they enjoy the native integration capability which this platform provides with Data Hub Cloud (more details later)

Oracle Data Hub Cloud Service primer

Here is a quick product overview of Oracle Data Hub Cloud before we dive into the nitty gritty

Oracle Data Hub Cloud is an umbrella service for a variety of data stores offered as a managed PaaS (platform-as-a-service) on Oracle Cloud. As you may have already figured out, at the time of writing, Oracle Data Hub Cloud includes open source Apache Cassandra, and it will be gradually extended to include other popular open source databases

Key goals

  • Consistent — same admin/management experience across multiple databases
  • Automated — the life cycle operations are completely automated
  • Full control — access (ssh) to the data and service

Some of its features include

  • Cluster Provisioning — it’s a simple, quick and flexible process where you choose the desired attributes including memory, storage, back up etc. and let the platform spin up the cluster for you
  • Complete life cycle management — Scale Up/Down (memory and storage), Scale In/Out (cluster nodes)
  • Patching/upgrade - this process is fully automated and comes with a rollback capability
  • Monitoring — a dedicated service console to access cluster/node metrics such as CPU, memory, storage, R/W latency, compactions etc
  • Flexibility — most of the operations can be executed via console, REST API and a CLI
  • Infrastructure choice — choose bare metal servers on Oracle Cloud Infrastructure or go with VM based Compute (OCI-Classic) — details here

Architecture overview

The app is available on GitHub

A diagram always helps, so here it is

High level architecture

As evident, the overall solution is pretty simple

  • Tweet Producer is a Java app which uses the Twitter streaming API to consume tweets and push them to the Cassandra cluster on Data Hub Cloud
  • The Tweet Query service defines a REST API and interacts with Cassandra to fetch tweet data

Solution

Let’s look at some of the relevant details of the solution

Tweet Producer app

  • It’s a Java app and uses twitter4j library to consume the tweet stream
  • Applies user defined filter criteria/terms to filter relevant tweets from the stream
  • Pushes the tweet data to Cassandra asynchronously
  • It provides a REST API to start/stop the app on demand e.g. /tweets/producer
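
To give a feel for the "asynchronous push" part, here is a minimal sketch in plain Java. It is not the app's actual code: the class name, the pushAsync method and the in-memory queue that stands in for Cassandra are all assumptions made for illustration, and no Twitter or Cassandra library is involved. The idea is simply that the thread consuming the stream hands each tweet to a worker and moves on.

```java
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Stand-in for the producer's write path; no Twitter or Cassandra calls here.
class AsyncPushSketch {

    // Hands the write to a worker thread and returns immediately.
    // "store" is a hypothetical in-memory substitute for the tweets table.
    static CompletableFuture<Void> pushAsync(String tweet,
                                             Queue<String> store,
                                             ExecutorService pool) {
        return CompletableFuture.runAsync(() -> store.add(tweet), pool);
    }

    public static void main(String[] args) {
        Queue<String> store = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            CompletableFuture<Void> first = pushAsync("tweet about cloud", store, pool);
            CompletableFuture<Void> second = pushAsync("tweet about nosql", store, pool);
            // A stream-consuming thread would not normally block like this;
            // joining here only makes the demo deterministic.
            CompletableFuture.allOf(first, second).join();
            System.out.println("stored " + store.size() + " tweets");
        } finally {
            pool.shutdown();
        }
    }
}
```

In a real producer the lambda would issue the write to Cassandra instead of adding to a queue, and failures would be handled on the returned future rather than ignored.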

Connectivity to the Cassandra instance on Data Hub Cloud is simplified by the Service Binding capability in Application Container Cloud (more info in next sub-section)

Tweet Query service

  • It's a basic Spring Boot app which leverages Spring Data and Spring Web
  • The Cassandra module in Spring Data is used to interact with Cassandra
  • spring-boot-starter-web module is used to expose a REST API to query tweet related info

Service Binding to Data Hub Cloud

Application Container Cloud provides out-of-the-box Service Binding for Data Hub Cloud. This gives your app a secure communication channel without you having to do anything explicitly

No port-related configuration is required at the database infrastructure level

Here is the documentation for this feature

Few words on the Cassandra data model

Here is what the table (to store tweets) looks like

Data model
  • tweeter — the twitter screen name e.g. abhi_tweeter
  • tweet — the tweet itself (string format)
  • tweet_id — the ID of the tweet e.g.
  • created - the timestamp of when the tweet was created
  • created_date - the date in text format e.g. 2018-01-01

This table is meant to store tweets in time series style — the primary key is designed keeping this requirement in mind. It consists of a single partition key and clustering columns

created_date is the partition key — it implies that

  • It is used to determine the partition in which a particular tweet will land
  • each partition will contain a day worth of tweets
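  • a query must restrict this column in its WHERE clause (unless you create a secondary index or add ALLOW FILTERING), i.e. the table is designed for searching all tweets for a particular day (you will see this in action later); once the partition key is fixed, the clustering columns can also be restricted, in the order they are declared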
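
Since created_date is plain text derived from the tweet's creation time, the producer has to compute it before writing a row. The snippet below is a rough sketch of that step using only java.time. The class and method names are hypothetical, and the use of UTC is an assumption (the post does not say which time zone the app uses).

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

// Hypothetical helper: not taken from the app's source code.
class TweetPartitionKey {

    // Maps a tweet's creation time (epoch milliseconds) to the text value
    // stored in created_date. UTC is an assumption made for this sketch.
    static String createdDate(long createdEpochMillis) {
        LocalDate day = Instant.ofEpochMilli(createdEpochMillis)
                .atZone(ZoneOffset.UTC)
                .toLocalDate();
        return day.toString(); // ISO-8601 format, yyyy-MM-dd
    }

    public static void main(String[] args) {
        // "created" values taken from the sample query output in this post
        System.out.println(createdDate(1515481616000L)); // 2018-01-09
        System.out.println(createdDate(1515481001000L)); // 2018-01-09
    }
}
```

Every tweet that maps to the same string ends up in the same partition, which is what makes "all tweets for a day" a single-partition query.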

created and tweeter are clustering columns

  • they determine how data is sorted on disk and returned in queries
  • since created column is specified before tweeter, the tweet time stamp will be used for sorting (i.e. the latest tweet first) followed by the twitter screen name (alphabetical order)
  • you can use the tweeter column in a WHERE clause (as well) by adding ALLOW FILTERING to the query
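
The ordering described above can be mimicked in memory with a Java Comparator. This is only an illustration: Cassandra keeps rows sorted on disk within each partition, and the Row class and method names here are made up for the sketch. String's natural ordering is used as a stand-in for how Cassandra orders text values, which matches for plain ASCII screen names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// In-memory illustration only; Cassandra does this sorting on disk.
class ClusteringOrderSketch {

    // Hypothetical row type holding just the two clustering columns.
    static final class Row {
        final long created;   // epoch milliseconds
        final String tweeter;

        Row(long created, String tweeter) {
            this.created = created;
            this.tweeter = tweeter;
        }
    }

    // created DESC first, then tweeter ASC as the tie-breaker.
    static final Comparator<Row> CLUSTERING_ORDER =
            Comparator.comparingLong((Row r) -> r.created).reversed()
                    .thenComparing((Row r) -> r.tweeter);

    // Returns a sorted copy, leaving the input list untouched.
    static List<Row> sortLikePartition(List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(CLUSTERING_ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row(1000L, "bob"),
                new Row(2000L, "carol"),
                new Row(2000L, "alice"));
        for (Row r : sortLikePartition(rows)) {
            System.out.println(r.created + " " + r.tweeter);
        }
    }
}
```

Running it lists the two rows with the later timestamp first (alice before carol), followed by bob, i.e. latest tweet first with the screen name breaking ties.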

Infrastructure setup

Let's quickly go through how to set up the foundation

  • Set up a Cassandra cluster using the Oracle Data Hub Cloud console and bootstrap Cassandra (keyspace and table)
  • Create a Twitter app which provides us with the required authentication tokens

Oracle Data Hub Cloud

Provision Cassandra cluster

Start by bootstrapping a new cluster — detailed documentation here

Cluster create options

It is also possible to do this using a CLI

The snapshot below shows a basic single node cluster running Cassandra 3.10.0

Oracle Data Hub Cloud Cassandra cluster

Create the keyspace and table

SSH into the Cassandra cluster node

Documentation available here

Fire up cqlsh

sudo su oracle — relevant information here

cqlsh -u admin `hostname`

Logged into cqlsh

Create the keyspace

CREATE KEYSPACE tweetspace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

Create table to store tweets

USE tweetspace;

CREATE TABLE tweets (
tweeter text,
tweet_id text,
tweet text,
created timestamp,
created_date text,
PRIMARY KEY ((created_date), created, tweeter)
) WITH CLUSTERING ORDER BY (created DESC);

Just to double check — desc tweetspace;

Create your Twitter app

You need to set up an app in Twitter in order to get the security tokens/keys which you will be using in the Tweet Producer service

Start by visiting https://apps.twitter.com/ and Create New App

Create a new Twitter app
Fill in the required details

Once you're done, check the Keys and Access Tokens section for the required info; you will use it during application deployment

Twitter Keys and Access Tokens

Build & deployment

Start by fetching the project from GitHub: git clone https://github.com/abhirockzz/accs-cassandra-twitter-timeseries-app

Build

Tweet producer app

  • cd accs-dhcs-cassandra-tweets-producer
  • mvn clean install

The build process will create accs-cassandra-tweets-producer-dist.zip in the target directory

Tweets query service

  • cd accs-dhcs-cassandra-tweets-api
  • mvn clean install

The build process will create accs-dhcs-cassandra-tweets-api-dist.zip in the target directory

Deployment a.k.a push to cloud

With Oracle Application Container Cloud, you have multiple options in terms of deploying your applications. This blog will leverage PSM CLI which is a powerful command line interface for managing Oracle Cloud services

Other deployment options include REST API, Oracle Developer Cloud and of course the console/UI

You can download and set up PSM CLI on your machine (using psm setup); details here

Deploy both the applications

  • Tweets producer

Update the deployment.json with your Twitter access tokens and Oracle Data Hub Cloud instance details

{
  "instances": 1,
  "memory": "2G",
  "environment": {
    "TWITTER_CONSUMER_KEY": "<as per your app>",
    "TWITTER_CONSUMER_SECRET": "<as per your app>",
    "TWITTER_ACCESS_TOKEN": "<as per your app>",
    "TWITTER_ACCESS_TOKEN_SECRET": "<as per your app>",
    "TWITTER_TRACKED_TERMS": "cloud,nosql"
  },
  "services": [
    {
      "type": "DHCS",
      "name": "<as per your instance>",
      "username": "<as per your instance>",
      "password": "<as per your instance>"
    }
  ]
}
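
The entries under "environment" reach the app as environment variables, so the tracked terms arrive as one comma-separated string. Here is a small sketch of how such a value could be turned into individual terms. The TrackedTerms class and its parse method are hypothetical, and the actual app may well read the variable differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the actual app may read this variable differently.
class TrackedTerms {

    // Splits a comma-separated value such as "cloud,nosql" into clean terms,
    // ignoring surrounding whitespace and empty entries.
    static List<String> parse(String raw) {
        List<String> terms = new ArrayList<>();
        if (raw == null) {
            return terms;
        }
        for (String part : raw.split(",")) {
            String term = part.trim();
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Falls back to the sample value from deployment.json when unset.
        String raw = System.getenv("TWITTER_TRACKED_TERMS");
        System.out.println(parse(raw != null ? raw : "cloud,nosql"));
    }
}
```

Trimming and skipping empty entries keeps a value like "cloud, nosql," from producing blank terms.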

Launch that zip into the cloud!

psm accs push -n TweetsProducer -r java -s hourly -m manifest.json -d deployment.json -p target/accs-cassandra-tweets-producer-dist.zip

  • Tweet query service

Update deployment.json for this service as well

{
  "instances": 1,
  "memory": "2G",
  "services": [
    {
      "type": "DHCS",
      "name": "<as per your instance>",
      "username": "<as per your instance>",
      "password": "<as per your instance>"
    }
  ]
}

... and deploy this app as well

psm accs push -n TweetsQueryService -r java -s hourly -m manifest.json -d deployment.json -p target/accs-dhcs-cassandra-tweets-api-dist.zip

Once executed, an asynchronous process is kicked off and the CLI returns its Job ID for you to track the application creation

After the apps are deployed, navigate to the Oracle Application Container Cloud applications page to confirm, and note down the application URLs

Tweets Producer and Query services deployed successfully

Here is a snapshot of the Service Binding (explained previously) and the environment variables (there are a couple more which haven’t been included here) created as a result

Oracle Data Hub Service Bindings

Test drive

Everything is set — it’s time to see things in action

Start the tweets producer app

curl -X GET <tweet_producer_app_url>/tweets/producer e.g. curl -X GET https://TweetsProducer-ocloud200.uscom-central-1.oraclecloud.com/tweets/producer

Wait for some time (a minute or so) for the producer to get into the act

If you check the logs in Application Container Cloud, you should see something similar to the following, indicating that the tweet stream processing has been initiated

INFO: started producer thread
Establishing connection.
Connection established.
Receiving status stream.

Use the Tweets Query service


{
  "tweeter": "KotlinAndroid_",
  "tweet": "The Complete Web Development Course - Build 15 Projects\n☞ https://t.co/j0QGTqrWRp\n#Kotlin #Java #Android #iOS\nHk8SWRIeEM https://t.co/GZ2qrWNGaI",
  "created": 1515481616000,
  "created_date": "2018-01-09",
  "tweet_id": "950624916172308480"
}
{
  "tweeter": "_openknowledge",
  "tweet": "RT @JAXenter: Enterprise Tales: Java EE 8 - ein politisches Release https://t.co/5JHvvZiTlz #javaee #javaee8 @mobileLarson #java #enterpris…",
  "created": 1515481083000,
  "created_date": "2018-01-09",
  "tweet_id": "950622682420269056"
}
{
  "tweeter": "AgrawalSadhuram",
  "tweet": "RT @lokshaktidaily: IndonesiaVsMalaysia:JavaPrincess acceptsSanatandharm: @Sanjeevarora64 @ghost22090440 @mallikarjun456 @narendrapjoshi @S…",
  "created": 1515481003000,
  "created_date": "2018-01-09",
  "tweet_id": "950622344669667328"
},
{
  "tweeter": "AgrawalSadhuram",
  "tweet": "https://t.co/UD2xTHaZ2d\nhttps://t.co/G2oCCwzW6A https://t.co/flqwPAjVqI",
  "created": 1515481001000,
  "created_date": "2018-01-09",
  "tweet_id": "950622336973066241"
}

You can run the same queries directly using cqlsh as well

Query using cqlsh on Data Hub Cloud
  • To stop the producer app — curl -X DELETE <accs_app_url>/tweets/producer and restart it again (same way you started it) when required
Stop the tweets producer app

Summary

I realize that this was a rather lengthy blog! Here is a quick recap of what was covered

  • Overview of Data Hub Cloud
  • Details of the Twitter based time series app built using Spring Boot, Spring Data, twitter4j along with some Cassandra data modelling background for our specific use case
  • Infrastructure setup, configuration and deployment to cloud
  • and finally, testing our end to end solution

Don’t forget to…

  • go through the Oracle Data Hub Cloud documentation for a deep dive
  • check out the tutorials for Oracle Application Container Cloud — there is something for every runtime!
  • other blogs on Application Container Cloud

Cheers!

The views expressed in this post are my own and do not necessarily reflect the views of Oracle.
