Building a Big Data Architecture that Scales

A few months ago I started building an AI Marketing Analytics tool at my new job. One premise of building an AI product is that you need to store a lot of data to train the models. In my case the data I needed could not be aggregated, so I had to store millions and millions of records, and they needed to be accessible in near real time.

Building a Big Data infrastructure is expensive, both computationally and monetarily. Given that we are a bootstrapped startup, I could not build my ideal infrastructure, which would combine a mix of Hadoop, Spark, and ElasticSearch. Running that stack has a huge cost associated with it. I needed an infrastructure that I could have up and running in a couple of weeks without much maintenance work. Anyone who has worked with Hadoop understands that managing a highly distributed Hadoop system is hard and takes a lot of time to keep up and running. My second option was using something like MongoDB for a micro beta, but let's face it, that wasn't going to last more than a week. The good news is that we live in the 21st century and AWS exists. Using AWS services is great because I don't have to worry much about server maintenance.

My final architecture of choice looks like this:

The reason I chose this infrastructure is that it scales, and it scales really fast. I don't have to worry much about scalability because this infrastructure scales by itself, and there is not much maintenance involved. The way it works is that I collect the data in several load-balanced collectors that clean and stream the data into a form I can crunch. From there the data is pushed into Kinesis Firehose. Firehose takes care of the distribution, both to my ElasticSearch cluster for indexing and to Redshift for running complex queries. Alongside that, I use MongoDB for authentication and other minor settings.
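To make the collector-to-Firehose hand-off concrete, here is a minimal sketch in Python with boto3. The stream name and the event fields are made-up examples, not my production values:

```python
import json
import boto3

# Firehose client; assumes AWS credentials are configured in the environment.
firehose = boto3.client("firehose", region_name="us-east-1")

def push_event(event: dict) -> None:
    """Send one cleaned marketing event into the delivery stream.

    Firehose then fans the data out to ElasticSearch and Redshift,
    so the collector never talks to those stores directly.
    """
    firehose.put_record(
        DeliveryStreamName="analytics-stream",  # hypothetical stream name
        # Newline-delimited JSON keeps records separable downstream.
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )

push_event({"campaign_id": 42, "clicks": 17, "source": "email"})
```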

On top of Redshift and ElasticSearch I run a microservices infrastructure with several PostgreSQL databases, leveraging different programming languages for different types of operations. All the microservices are registered in a discovery service that each instance can use to find the others and load balance between them. I also use that discovery service to monitor the services and keep track of my logs.
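As a rough sketch of the registration step, here's how an instance might register itself if the discovery service were Consul (I'm using Consul here purely for illustration); the service name, address, and health-check endpoint are all hypothetical:

```python
import requests

# Register this instance with the local Consul agent (assumed discovery service).
# Consul then exposes the service for lookup and runs the health check,
# which is how peers find each other and dead instances drop out of rotation.
registration = {
    "Name": "reporting-api",          # hypothetical service name
    "ID": "reporting-api-1",
    "Address": "10.0.1.15",
    "Port": 8080,
    "Check": {
        "HTTP": "http://10.0.1.15:8080/health",  # hypothetical health endpoint
        "Interval": "10s",
    },
}

resp = requests.put(
    "http://localhost:8500/v1/agent/service/register",
    json=registration,
)
resp.raise_for_status()
```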

The above architecture is not hard to build or maintain as long as you have the right resources. Before starting to build a Big Data architecture, you need to plan ahead. Some questions you have to ask yourself are the following:

  1. The first and most important question is how big you plan on going. If you are dealing with 1 or 2 GB of data, you might not need a really complex architecture. In fact, you might not need a big data architecture at all.
  2. The second question is about uptime: do you need to collect the data every second of the day? Is it mission critical? Some developers spend a lot of time and money building crazy infrastructure to work with data they could collect once a day.
  3. The third question is whether you can leverage aggregated data. If the answer is yes, you are in for a treat: embrace the world of Hadoop. Hadoop lets you aggregate vast amounts of data in a distributed architecture, and it is built from the ground up to scale.
  4. Do you need to index data (for search and other types of instant analytics)? If the answer is yes, look at ElasticSearch; see the sketch after this list.
  5. Do you need the data in real time? If yes, look at combining Spark with ElasticSearch and Hadoop. There has been a lot of work, especially from GitHub, on real-time indexing.
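To make point 4 concrete, here is a minimal sketch of indexing and searching a record with the official ElasticSearch Python client; the index name and document fields are invented for illustration:

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

# Connect to a local ElasticSearch node (swap in your cluster's endpoint).
es = Elasticsearch("http://localhost:9200")

# Hypothetical marketing event; once indexed it is searchable almost instantly,
# which is what enables the "instant analytics" use case in point 4.
doc = {
    "campaign_id": 42,
    "clicks": 17,
    "source": "email",
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

es.index(index="marketing-events", document=doc)

# Example query: all events for a given campaign.
hits = es.search(index="marketing-events", query={"term": {"campaign_id": 42}})
```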

If you liked this article, please click that green ♥ below so others can enjoy it. Also please ask questions or leave notes with any useful practices.