In about two months, I will graduate from Georgia Tech’s Online Master of Science in Computer Science (OMSCS) program. I’ve been reflecting on the path that led me to join the program, and this post captures some of those thoughts. As I started writing, the post became a more general overview of my career, so if you’re reading just for the raw reasons, scroll down to the bottom.

Not knowing what I didn’t know

It was 2008 and I was in 11th grade. I had boldly signed up for my first Computer Science class, an intro class that was basically Java 101. I remember excitedly…

I recently finished the Functional Programming Principles in Scala course on Coursera. Having primarily programmed in an imperative language (Java), I enjoyed all the goodies that a functional language provides. Overall, I enjoyed the class very much, and I recommend it to anyone looking for an introduction to functional programming. In this post, I will go through some of my favorite features of Scala, many of which are missing from Java.

Pattern Matching

Pattern matching is a powerful feature in Scala with no easy equivalent in Java. Java has a switch statement to choose an execution path based on some equality…

I recently finished reading the popular Design Patterns book. As I was reading, I noticed that many of the examples used in the book were a bit difficult to parse. For several patterns, the authors used the example of building a WYSIWYG editor, something with which I don’t have much experience. Many of the examples in the “Known Uses” section of each chapter were also a little outdated. As a way to help me understand these patterns in detail, I wanted to create my own examples, modeled after real-world situations. …

Over the past few weeks, I have been working with Apache ZooKeeper for a certain project at work. ZooKeeper is a pretty cool technology and I thought I would document what I’ve learned so far, for others looking for a short introduction and for myself to look back on in the future.

What is ZooKeeper?

ZooKeeper is:

“A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services”.

Informally, ZooKeeper is just a service that distributed systems can rely on for coordination. It allows developers to focus on implementing the core application logic, without having to worry about also…

In celebration of Super Bowl Sunday, this post will examine how HBase can be used to model NFL play-by-play data.

We will be using data from the 2002 season through the 2012 season. The source of the data is CSV files provided by the awesome people at Advanced NFL Stats. Each file contains data for a single season, and the format of the data is as follows:


The gameid column contains the date of the game and the two teams competing. The description column contains a short natural language blurb about a certain play. The rest of the columns…

HBase is modeled as:

A “sparse, distributed, consistent, multi-dimensional, sorted map”

We will look at what each of these terms means below. HBase is based on Google’s BigTable and is currently an Apache top-level project. It provides random read/write access to data stored in HDFS (Hadoop Distributed File System), building on the capabilities that Hadoop and HDFS provide. In a future post, we will look at the architecture of how HBase stores data. This post will be more of a high-level introduction to the data model used by HBase.
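As a rough preview of that data model, here is a minimal Python sketch of a sparse, sorted, multi-dimensional map. It is a toy in-memory structure and not HBase or its client API. The class name ToyTable, its methods, and the sample row keys and values are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class ToyTable:
    """Toy in-memory model of a sorted, multi-dimensional map (not HBase itself)."""

    def __init__(self):
        # row key -> "family:qualifier" column -> timestamp -> value
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        # Sparse: only cells that are actually written take up space.
        self._rows[row][column][timestamp] = value

    def get(self, row, column):
        # Return the newest version of a cell, or None if it was never written.
        versions = self._rows.get(row, {}).get(column, {})
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        # Sorted: row keys come back in lexicographic order, stop_row exclusive.
        for row in sorted(self._rows):
            if start_row <= row < stop_row:
                yield row


table = ToyTable()
table.put("row-b", "info:name", 1, "old")
table.put("row-b", "info:name", 2, "new")
table.put("row-a", "info:city", 1, "Atlanta")

latest = table.get("row-b", "info:name")    # newest timestamp wins
rows = list(table.scan("row-a", "row-c"))   # range scan over sorted row keys
```

Each dimension of the map is one level of nesting, and a row only stores the columns that were written to it, which is what makes the map sparse. The distributed and consistent parts of the definition are not modeled here.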

We will start by looking at what each of the terms…

In this post, I will go through a demo of using Lucene’s simple API for indexing and searching Tweets. We will be indexing Tweets from the Sentiment140 Tweet corpus. This dataset provides the following data points for each Tweet:

  1. the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
  2. the id of the tweet (2087)
  3. the date of the tweet (Sat May 16 23:58:44 UTC 2009)
  4. the query (lyx). If there is no query, then this value is NO_QUERY.
  5. the user that tweeted (robotickilldozr)
  6. the text of the tweet (Lyx is cool)
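To make the six fields concrete, here is a small Python sketch that reads one row into a named record. It assumes the corpus is a comma-separated file with quoted fields in the order listed above. The sample row is assembled from the example values in the list, with a polarity of 4 picked for illustration, and the names Tweet and parse_tweets are hypothetical.

```python
import csv
import io
from typing import NamedTuple


class Tweet(NamedTuple):
    polarity: int   # 0 = negative, 2 = neutral, 4 = positive
    tweet_id: str
    date: str
    query: str      # NO_QUERY when there is no query
    user: str
    text: str


def parse_tweets(lines):
    """Yield one Tweet per CSV row, with fields in the order listed above."""
    for row in csv.reader(lines):
        polarity, tweet_id, date, query, user, text = row
        yield Tweet(int(polarity), tweet_id, date, query, user, text)


# Sample row built from the example values above (polarity is an assumption).
sample = io.StringIO(
    '"4","2087","Sat May 16 23:58:44 UTC 2009","lyx","robotickilldozr","Lyx is cool"\n'
)
tweets = list(parse_tweets(sample))
```

Using the csv module rather than splitting on commas keeps tweet text that itself contains commas in a single field.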

Here is an example of…

Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Lucene is a library that allows the user to index textual data (Word and PDF documents, emails, webpages, tweets, etc.). It allows you to add search capabilities to your application. There are two main steps that Lucene performs:

  1. Create an index of documents you want to search.
  2. Parse query, search index, return results.
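The two steps above can be sketched with a toy in-memory index in Python. This is not Lucene's API, only an illustration of the idea. The whitespace tokenizer, the rule that every query term must match, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict


def tokenize(text):
    # Hypothetical analyzer: lowercase and split on whitespace only.
    return text.lower().split()


def build_index(docs):
    """Step 1: map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Step 2: parse the query and return ids of docs containing every term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {1: "Lyx is cool", 2: "Lucene is a search library", 3: "search is cool"}
index = build_index(docs)
hits = search(index, "is cool")
```

Looking up a term is a single dictionary access no matter how many documents were indexed, which is the main reason to build the index up front instead of scanning every document per query.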


Lucene uses an inverted index (mapping of a term to its metadata). This metadata includes information about which…

Karthik Kumar

Software Engineer at LightStep
