3 things I learned playing with big streaming data from an API using AWS Kinesis Firehose

I analyze aviation-related data and try to make sense of it. Big time. Recently I was lucky: I was offered a 1 month free trial of Flightaware’s amazing data feed of live flight positions, aptly named Firehose.

In big data, we like to use the word firehose for streaming data (data = water). Have you ever tried to drink from a firehose? Well, that’s how it feels when you want to use streaming data.

My 3 main takeaways

I learned a lot through this exercise:

  1. Don’t try to drink from a Firehose directly, fill a glass or bucket and then drink from there.
  2. It is impossible not to spill something: losing some drops, or even buckets, is unavoidable. Choosing a fireman strong enough to hold the firehose is important too.
  3. You will not drink everything you collect. What you don’t drink, throw it away. Keeping it will ruin you.

I will come back to these lessons as I go through the exercise.

Setting up the stream

I am using Node.js to connect to the Flightaware firehose API (a socket in Node.js) and the AWS SDK to send the data to AWS Kinesis Firehose. AWS provides a good tutorial on setting up AWS Kinesis Firehose. My Node.js code can be downloaded here for those who are interested. In my case, each line, or record, is the position of an aircraft somewhere in the world at a given time.
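For those who just want the gist without downloading the code, here is a minimal sketch of the idea. The Flightaware host, port, initiation command and credentials are assumptions taken from their documentation, and the delivery stream name is a placeholder:

```javascript
// Minimal sketch: read lines from the Flightaware Firehose socket and forward
// each one to AWS Kinesis Firehose. Host, port, initiation command, credentials
// and the delivery stream name are assumptions / placeholders.
const tls = require('tls');
const readline = require('readline');
const AWS = require('aws-sdk');

const firehose = new AWS.Firehose({ region: 'eu-west-1' }); // assumed region

// Connect to the Flightaware feed over TLS.
const socket = tls.connect(1501, 'firehose.flightaware.com', () => {
  // Initiation command; replace with your own username and API key.
  socket.write('live username MY_USER password MY_API_KEY\n');
});

// Every line is one record: the position of one aircraft at one point in time.
const rl = readline.createInterface({ input: socket });
rl.on('line', (line) => {
  firehose.putRecord(
    {
      DeliveryStreamName: 'flight-positions', // hypothetical stream name
      Record: { Data: line + '\n' },          // keep the newline so records stay separable in S3
    },
    (err) => { if (err) console.error('putRecord failed:', err); }
  );
});

socket.on('error', (err) => console.error('socket error:', err));
```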

Using a cron job, I call this Node app every 5 minutes, collecting the previous 5 minutes of data, which I then push into AWS Kinesis Firehose. Easy.
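The cron entry itself is nothing special; something along these lines, with placeholder paths:

```
# Run the collector every 5 minutes (paths are placeholders)
*/5 * * * * /usr/bin/node /home/ubuntu/firehose-collector/index.js >> /var/log/firehose-collector.log 2>&1
```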

The working stream looks like this. You can see that I send between 7.5MB and 10MB of data to AWS Kinesis Firehose in every 5-minute period.

Working Firehose Stream

Store before drinking

As I said in my takeaways, you need time to analyze things, so you need to store them first. I use S3 to store the records as text files. AWS Kinesis Firehose buffers what you send it until it has around 5MB of data. It then creates a file, timestamps it, and puts it neatly into an S3 folder structure under year/month/day/hour.

Data collected on 23 October 2015 between 12:00 and 13:00 UTC looks like this for me:

Data stored of a given hour

Now that I have it stored, I can run analysis on it using Hadoop or R or whatever.
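To give an idea of what that looks like from the analysis side, here is a sketch that lists and reads back the files for one hour. The bucket name is hypothetical, and the prefix assumes the default year/month/day/hour layout described above:

```javascript
// Sketch: read back one hour of position records from S3.
// Bucket name is a placeholder; the prefix follows the YYYY/MM/DD/HH layout.
const AWS = require('aws-sdk');
const s3 = new AWS.S3({ region: 'eu-west-1' }); // assumed region

const Bucket = 'my-flight-positions'; // hypothetical bucket
const Prefix = '2015/10/23/12/';      // 23 October 2015, 12:00-13:00 UTC

s3.listObjects({ Bucket, Prefix }, (err, listing) => {
  if (err) return console.error(err);
  listing.Contents.forEach((obj) => {
    // Each object is a batch of newline-separated records (roughly 5MB each).
    s3.getObject({ Bucket, Key: obj.Key }, (err, file) => {
      if (err) return console.error(err);
      const records = file.Body.toString('utf8').split('\n').filter(Boolean);
      console.log(obj.Key, records.length, 'positions');
    });
  });
});
```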

Ignore the spilling

What I learned about sockets is that they can break, or stall. Like when you are on Skype or watching TV online: it can just stop. The same happens here. This means you will have holes in your data, and your analysis needs to take that into account.

There are tons of reasons for this, the most common being the sending server going down, or your own server being under-dimensioned.

That is what happened to me. Here is how it looked when it broke:

My broken stream during 19th October

In my case, the sending server was down for maintenance, so my app stalled, never exited, consumed more and more memory and eventually crashed my server, which did not have enough memory.

My fireman, the collecting server, was under-dimensioned. I actually had not checked how much memory my app needed to run. It turned out to use around 4GB, so I ended up using a t2.large EC2 instance with 8GB of memory. That’s the beauty of the cloud: totally scalable!

I also added a routine to my app that kills the process, as it turned out not to stop by itself.
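The kill routine is essentially a watchdog: if no data has arrived for a while, assume the stream has stalled and exit, so the next cron run starts fresh. A sketch of the idea, with an arbitrary 60-second limit:

```javascript
// Watchdog sketch: exit the process if the feed goes quiet for too long.
// The 60-second limit is an arbitrary assumption; tune it to your feed.
const STALL_LIMIT_MS = 60 * 1000;
let watchdog;

function resetWatchdog() {
  clearTimeout(watchdog);
  watchdog = setTimeout(() => {
    console.error('No data for 60s, assuming the stream stalled. Exiting.');
    process.exit(1); // cron will start the next run on schedule
  }, STALL_LIMIT_MS);
}

// Call resetWatchdog() every time a line arrives from the socket, e.g.:
// rl.on('line', (line) => { resetWatchdog(); /* forward to Kinesis Firehose */ });
resetWatchdog();
```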

Clean your stack

When working in the cloud, keeping an eye on your bill and forecasting it is important. In my case, I collect around 100MB of data per hour! At the current AWS pricing, I pay around $0.03 per GB per month on S3.

So, in 1 month, I will collect 100MB * 24 hours * 31 days = 74.4 GB of data, which will cost $2.23 per month to store. As I continue collecting, my S3 bill will therefore increase by $2.23 every month.
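The same back-of-the-envelope calculation in code, using the numbers above (pricing as of writing):

```javascript
// Back-of-the-envelope S3 cost, using the figures from the text.
const mbPerHour = 100;         // data collected per hour
const pricePerGbMonth = 0.03;  // USD per GB per month on S3 (pricing as of writing)

const gbPerMonth = (mbPerHour * 24 * 31) / 1000;        // 74.4 GB collected per month
const addedCostPerMonth = gbPerMonth * pricePerGbMonth; // ~$2.23 added to the monthly bill

// After n months of continuous collection, the monthly S3 bill is roughly:
const monthlyBillAfter = (n) => n * addedCostPerMonth;
console.log(gbPerMonth, addedCostPerMonth.toFixed(2), monthlyBillAfter(12).toFixed(2));
```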

So keeping the data will cost me money. Once I have analyzed the data and created value-added information, I will need to decide what to do with it.

As data analysts, we hate throwing data away, always thinking that it could be useful, salable, etc. But streaming data is different: it totally loses its value after some time, when it needs to be replaced by a more recent stream.

I try to apply the logic used by a former boss of mine:

What you have not touched for 6 months, throw it away.
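On S3, one way to automate that rule (my own addition, not part of the original setup) is a lifecycle expiration policy. A minimal sketch with the AWS SDK, using a placeholder bucket name and 180 days as roughly 6 months:

```javascript
// Sketch: S3 lifecycle rule that expires objects about 6 months after creation.
// Bucket name is a placeholder; 180 days approximates the "6 months" rule above.
const AWS = require('aws-sdk');
const s3 = new AWS.S3({ region: 'eu-west-1' }); // assumed region

s3.putBucketLifecycle(
  {
    Bucket: 'my-flight-positions', // hypothetical bucket
    LifecycleConfiguration: {
      Rules: [
        {
          ID: 'expire-old-positions',
          Prefix: '',                // apply to the whole bucket
          Status: 'Enabled',
          Expiration: { Days: 180 }, // ~6 months
        },
      ],
    },
  },
  (err) => { if (err) console.error('Failed to set lifecycle rule:', err); }
);
```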

This is just the beginning of my big data journey. Still a long way ahead.
