Twitter Data Analysis: Optimising Insertion Throughput With Batching

Ganeshwara Hananda
Vaticle
Nov 1, 2017

Grakn is the database for AI. It is a distributed knowledge base designed specifically to handle complex data in knowledge-oriented systems — a task for which traditional database technologies are not the best fit.

To ensure that their internal knowledge is the most up-to-date and relevant, AI systems are always hungry for newly updated data. Working seamlessly with streaming data is therefore useful for building knowledge-oriented systems. In this blog post, we will look at how to stream public tweets into Grakn’s distributed knowledge base.

Continuing Where We Left Off

This post is the last one of a three-part series covering how we can leverage Grakn for performing analysis on Twitter data.

Here, we will continue our work from part 2 and look at how we can optimise the throughput even further with batching.

Before we delve into what batching is, let’s just have a bit of recap of what we previously covered in this Twitter data series. Here’s a rundown of the first two posts:

Part 1: Using Grakn To Stream Twitter Data — In this post, we mainly look at two things: how to model and define a schema, and how to insert the actual data into the knowledge base

Part 2: Performing Aggregate Query On Twitter Data — Here we continue by looking at how we can perform aggregate queries in order to obtain meaningful information from the data set.

If you haven’t already, I recommend that you check out these posts. They cover basic concepts of working with Grakn and Twitter data, which will serve as the basis of the remainder of this post.

So, What Is Batching?

Batching is a technique for improving the throughput of data processing. It works by grouping units of work into batches in order to minimise the amount of associated “plumbing work”.

To understand batching more clearly, let’s walk through a concrete example involving HTTP calls:

Imagine we are trying to add new items into a database via HTTP calls. If we’re trying to add 10 items, it makes sense to send them in a batch of 10 items through a single HTTP call, rather than doing one call per item.

This is because the cost of the associated plumbing work (in this case, initiating an HTTP connection) is so high that paying it once per item would reduce the throughput by a significant margin.
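We can make the argument concrete with a back-of-envelope calculation. The numbers below (50 ms to open a connection, 2 ms per item) are made up purely for illustration; the point is that the fixed per-call overhead dominates when it is paid once per item.

```java
public class BatchingDemo {
    static final long CONNECTION_OVERHEAD_MS = 50; // hypothetical cost to open an HTTP connection
    static final long PER_ITEM_COST_MS = 2;        // hypothetical cost to process one item

    // Total time to send `items` items in batches of `batchSize`
    static long totalTimeMs(int items, int batchSize) {
        int calls = (items + batchSize - 1) / batchSize; // ceiling division: number of HTTP calls
        return calls * CONNECTION_OVERHEAD_MS + items * PER_ITEM_COST_MS;
    }

    public static void main(String[] args) {
        // One call per item: the connection overhead is paid 10 times.
        System.out.println("one call per item:  " + totalTimeMs(10, 1) + " ms");  // 520 ms
        // A single batch of 10: the overhead is paid once.
        System.out.println("single batch of 10: " + totalTimeMs(10, 10) + " ms"); // 70 ms
    }
}
```

With these assumed costs, batching cuts the total time by more than 7x, and the gap only widens as the per-call overhead grows relative to the per-item cost.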

That is how batching fundamentally works, and the technique applies to many different problems. In this post, we’re going to perform batch insertion in order to improve the throughput of Twitter data ingestion.

Enabling Batch Insertion

Fortunately, Grakn has batching support already built in.

Let’s update GraknTweetOntologyHelper::withGraknGraph() to expose a GraknTxType parameter. This way, we can choose between WRITE and BATCH depending on the circumstances.

public static void withGraknGraph(GraknSession session, GraknTxType type, Consumer<GraknGraph> fn) {
    GraknGraph graphWriter = session.open(type); // open a transaction of the given type
    fn.accept(graphWriter);
    graphWriter.commit();
}

Next, update the main method so that schema creation and data insertion each use the appropriate GraknTxType.

An important thing to note: BATCH can only be used for data insertion. Schema creation must always be done with WRITE.

public static void main(String[] args) {
    try (GraknSession session = Grakn.session(graphImplementation, keyspace)) {
        withGraknGraph(session, GraknTxType.WRITE, graknGraph -> initTweetOntology(graknGraph)); // initialise schema
        listenToTwitterStreamAsync(consumerKey, consumerSecret, accessToken, accessTokenSecret, (screenName, tweet) -> {
            withGraknGraph(session, GraknTxType.BATCH, graknGraph -> {
                insertUserTweet(graknGraph, screenName, tweet); // insert tweet
                Stream<Map.Entry<String, Long>> result = calculateTweetCountPerUser(graknGraph); // query
                prettyPrintQueryResult(result); // display
            });
        });
    }
}

That’s it! The change was minimal: all we needed to do was change the transaction type we supply.

Running The Application

Let’s build and run the application with:

$ mvn package
$ java -jar target/twitterexample-1.0-SNAPSHOT.jar

You will see a list of users along with the number of times each has tweeted since we started the application:

------
-- user <user-1> tweeted 2 time(s).
-- user <user-2> tweeted 1 time(s).
-- user <user-3> tweeted 1 time(s).
-- user <user-n> tweeted 1 time(s).
------

But it’s exactly the same as what we had built earlier, in part 2!

So What Has Changed?

Well, what we’ve done is an optimisation step for achieving higher throughput. While the external behaviour of our app hasn’t changed, it can now ingest data faster.

How much faster exactly? Well, we’re curious too! To be frank, we don’t yet have the numbers at hand. We’re still working on a benchmark, which will be published soon.

In big data, batching is an extremely valuable technique which should be considered at various stages of the processing pipeline in order to boost performance.

Specifically for our app, introducing batching makes a lot of sense — we want to be able to receive as much data as possible in the shortest amount of time.
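One common way to apply this inside a streaming pipeline is to buffer incoming items and hand them off in fixed-size batches, so the expensive step (such as opening and committing a BATCH transaction) runs once per batch rather than once per tweet. The sketch below is hypothetical — `TweetBatcher` is not part of the sample project — but it shows the shape of the idea:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: buffers incoming items and delivers them in fixed-size
// batches, so the per-batch cost (e.g. one transaction commit) is paid once
// per batch instead of once per item.
public class TweetBatcher<T> {
    private final int batchSize;
    private final Consumer<List<T>> onBatch;
    private final List<T> buffer = new ArrayList<>();

    public TweetBatcher(int batchSize, Consumer<List<T>> onBatch) {
        this.batchSize = batchSize;
        this.onBatch = onBatch;
    }

    // Add one item; deliver a batch as soon as the buffer fills up.
    public void add(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Deliver whatever is buffered (call this on shutdown so no items are lost).
    public void flush() {
        if (!buffer.isEmpty()) {
            onBatch.accept(new ArrayList<>(buffer)); // hand off a copy of the batch
            buffer.clear();
        }
    }
}
```

In our app, the stream callback could call `add()` per tweet and the batch consumer could insert all buffered tweets inside a single BATCH transaction; the right batch size is a trade-off between throughput and how quickly new tweets become queryable.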

Conclusion

This post concludes the Twitter Data Analysis series! Over the last few weeks, we’ve looked at how we can develop a very simple application with Grakn using the Java programming language.

We’ve chosen to work with Twitter data so that developers of any level can dive straight into Grakn.

In other words, working with Grakn is easy and we want you to know it!

We’ve looked at how to define a schema, how to insert data, and how to perform aggregate queries in order to get meaningful information out of the data. We’ve also looked at batching, which is an essential technique for working with data at scale.

You should now have enough knowledge to develop your first application with Grakn. But there’s more, and we encourage you to check out our docs for comprehensive information on working with the schema and the query language, a.k.a. Graql.

Have a look at part one and part two, in case you missed them. Also don’t forget, the sample project is always available for you to download and play with.

If you enjoyed this article, please do find the time to hit the recommend heart below, so others can find it too. Please get in touch if you have any questions or comments, either below or via our Community Slack channel.

Find out more from https://grakn.ai.

Feature Image credit: “Industrial — The Factory” by Simon & His Camera is licensed under CC BY 2.0
