AWS Big Data Journey — from zero to hero

Fajri Abdillah
Serverless Indonesia
15 min read · Apr 5, 2018

We want to share our experience building a Big Data system with PHP & Laravel, with nobody on our team having prior Big Data experience, and scaling it to ~50 million requests per day.

To put it simply, we built Google Analytics for Mobile with additional features, such as Campaigns and Segments. In the end, the data is sent to our Ads Platform (RTB/DMP/DSP).

Let’s think from a partner’s perspective. Say we have Partner X. Partner X has mobile apps already published on the Google Play Store and Apple App Store, with 10,000 users. Partner X’s business is a convenience store chain with hundreds of branches across 3 countries: Indonesia, Malaysia, and Singapore.

They want to monetize their users and gain more insight into them, but they don’t want to use traditional ads. So they embed our MAAS (Mobile Analytics as a Service) SDK, add new terms & agreements (explaining that we capture data), publish a new version, and they are done. Partner X’s users won’t notice any change: no ads on their screens, no UI/UX changes. Partner X can log in to our MAAS Dashboard and see metrics such as Daily/Weekly/Monthly Active/Inactive Users, Total Users, Messages Sent/Opened/Read, etc.

Campaigns and Segments are the most important features in this project. Partner X can create a Campaign, for example: “We want to reach our Android users located in Bandung, Indonesia by Push Notification and give them a promotion code for a 10% discount”. From that information, we have the segment: Android users in Bandung, Indonesia. The Campaign itself is the promotion code, a 10% discount.

Not only that, the partner can schedule when the Campaign will be sent: once, daily, weekly, monthly, at an exact time, or via a trigger, for example when a device is detected by geofence, beacon, or dwelltime.

The picture below sums up what we did for 18 months. It is an example of a sample app + MAAS SDK push notification.

If you know the Sundanese language, you will find this funny :)

Stage 1 — Initial Version

Features :

  • Android / iOS SDK
  • Campaign & Segmentation
  • Channel: Push Notification

Stack :

  • Elastic Load Balancer (Classic)
  • Elastic Beanstalk — REST API (Lumen) / m4.large
  • Elastic Beanstalk — Dashboard (Yii) / t2.medium
  • EC2 for Push Notification (Node.js App) / t2.large
  • 1 RDS PostgreSQL / r3.xlarge
  • 1 Elasticache Redis / r3.large

Why Lumen? At that time, we needed a framework that most of our team members understood well, and one with the best performance among PHP micro-frameworks. Not only that, Lumen is based on Laravel, which is actively and well maintained. We believed this part would become the entry point: if it does not work well, we will not receive any data and will instead get tons of complaints, because our SDK would be slowing down the partner apps. We had 3 people working on this part.

Why Yii? Yii is great for creating a dashboard-like web application, as long as its features are used properly and best practices are followed. We had 2 people working on this part.

Why Node.js for Push Notification? Well, it was a management decision: one of our employees already had this experience, so it would be a great boost to the project.

Why PostgreSQL? Management said this project needed the capability to ingest any kind of data, so we decided our data had to be in a JSON format that is easy to query. We had to choose between MongoDB and PostgreSQL + JSONB. At that time, we could not find anyone in the company with MongoDB experience. We also did some research on PostgreSQL’s JSONB and GIN indexes; the performance is not bad compared to MongoDB. And we use PostGIS for the location features. To us, PostgreSQL is an RDBMS on steroids.
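
To give a feel for it, here is a minimal Laravel-style migration sketch; the table name, columns, and index are illustrative, not our actual schema.

```php
<?php

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Schema;

class CreateEventsTable extends Migration
{
    public function up()
    {
        Schema::create('events', function (Blueprint $table) {
            $table->bigIncrements('id');
            $table->string('event_type');
            $table->jsonb('payload'); // arbitrary event attributes sent by the SDK
            $table->timestamps();
        });

        // GIN index keeps containment queries fast,
        // e.g. WHERE payload @> '{"city": "Bandung"}'
        DB::statement('CREATE INDEX events_payload_gin ON events USING GIN (payload)');
    }

    public function down()
    {
        Schema::dropIfExists('events');
    }
}
```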

Why Redis? For our use case, Redis is used for:

  • Yii Query Cache and Metrics
  • Dwelltime features, by utilizing GET/SET, INCR/DECR, SETEX, SADD, SREM, SCARD, SMEMBERS, and LPUSH/RPOP (see the sketch after this list)
  • Internal use, to count how many hits each endpoint receives
  • Delayed jobs for sending campaigns
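
Here is a rough sketch of how those commands fit together for dwelltime, using Laravel’s Redis facade; the key names, TTLs, and the 30-minute threshold are illustrative.

```php
<?php

use Illuminate\Support\Facades\Redis;

// Called when the SDK reports that a device entered a geofence/beacon area.
function onEnterArea(string $deviceId, string $areaId): void
{
    // Track which devices are currently inside the area.
    Redis::sadd("area:{$areaId}:devices", $deviceId);

    // Remember when the device entered; expire the key in case we never see an exit.
    Redis::setex("dwell:{$areaId}:{$deviceId}", 86400, time());
}

// Called when the SDK reports that the device left the area.
function onExitArea(string $deviceId, string $areaId): void
{
    $enteredAt = Redis::get("dwell:{$areaId}:{$deviceId}");

    if ($enteredAt && (time() - (int) $enteredAt) >= 1800) {
        // The device stayed at least 30 minutes: queue a dwelltime campaign trigger.
        Redis::lpush('queue:dwelltime', json_encode([
            'device_id' => $deviceId,
            'area_id'   => $areaId,
            'duration'  => time() - (int) $enteredAt,
        ]));
    }

    Redis::srem("area:{$areaId}:devices", $deviceId);
    Redis::del("dwell:{$areaId}:{$deviceId}");
}
```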

The initial version was completed on Fri May 20 17:31:26 2016 +0700. It took about 4 months, and of course it did not work perfectly. We made PostgreSQL do everything: ingesting and storing data in jsonb with a GIN index, running OLTP queries, and running OLAP queries against the jsonb. At that time, this architecture worked as expected, mostly because we did not have many partners yet and the data was not big enough.

Node.js for sending push notifications was looking expensive, taking too much time & cost to maintain. In the next version, we would utilize SNS to send push notifications. The image below is the complete architecture for the initial version.

MAAS Stage 1 — Initial Version

Stage 2 — New Features & Improve Durability

New Features :

  • Geofence & Beacon
  • Dwelltime calculation

What is a Geofence? In our project, we only utilize static geofences. Josh Baker has created a gif that shows how a static geofence works.

Static Geofence. From http://tile38.com/

If you notice, there are 3 red points. For our use case, when a device crosses the area, the Partner can do something about it. The first red point we can call an event, let’s say “enter area”. The second red point is “dwelltime”. The third red point is “exit area”. The partner can then create a campaign for when devices enter the area, for example, “Hey, come to our store and use coupon code BLABLABLA to receive a 50% discount”. When a device hits an event (enter/exit/dwelltime), the SDK sends a trigger, and the device receives a Push Notification. To put it simply, this is our approach to macro location.
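
At its core, the detection is a point-in-polygon test. Since PostGIS was already in our stack, a server-side version of that check could look like the hypothetical query below; the table and column names are made up for illustration.

```php
<?php

use Illuminate\Support\Facades\DB;

// Hypothetical check: which geofences contain the reported device location?
// Assumes a `geofences` table with a PostGIS `area` geometry column (SRID 4326).
function geofencesContaining(float $lat, float $lng): array
{
    return DB::select(
        'SELECT id, name
           FROM geofences
          WHERE ST_Contains(area, ST_SetSRID(ST_MakePoint(?, ?), 4326))',
        [$lng, $lat] // PostGIS points are (longitude, latitude)
    );
}
```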

What is a Beacon? From Wikipedia, Bluetooth beacons are hardware transmitters — a class of Bluetooth Low Energy (LE) devices that broadcast their identifier to nearby portable electronic devices. For our use case, it works similarly to the geofence, but we use beacon proximity to detect devices. For example, Partner X has one huge store in some city, 1,000 square meters, with a section that sells Android phones. The Partner can create a campaign, for example, “Hey, want to upgrade your Android phone? Come to the phone section, show this code, and we will give you the best price for your used phone”. When a device detects the beacon (enter/exit/dwelltime), the SDK sends a trigger, and the device receives a Push Notification. To put it simply, this is our approach to micro location.

What is Dwelltime? Let’s continue the example above. When someone goes into a huge store, they usually stay 30 minutes, 1 hour, or more. When a device stays inside either a geofence or a beacon area for a particular amount of time, that is called dwelltime. The Partner can create a campaign, for example, “Hey, you have been in our store for 30 minutes, go to the information centre to get a free bottle of water”.

Added Stack :

  • EC2 for Laravel Queue Worker, using Redis as Broker (m4.large)
  • SNS for push notification

With this version, we added the Laravel queue. The main reason is that we don’t want to lose any data; it is fine if it is slow, as long as the data is not lost. We started with 1 listener for all events. Yes, that is not a good approach. For example, we have an auth event and a send_data event. In 1 minute, we receive 10 auth and 50 send_data events. If we use 1 listener, the response for the auth event will be very slow. The next week, we created several Laravel queue listeners and categorized them by event group. For example, the auth event needs to be executed quickly, so we create a listener with 10 workers and group it as high priority. The send_data event does not need to be executed immediately, so we can use a listener with a lower worker count and group it as low priority (a simplified Supervisor config is sketched below). Even though we set our Supervisor numprocs to 48, the workers did not scale out. Then we found the deliciousbrains blog.
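
As an illustration of the priority groups, a simplified supervisord config could look like this; the program names, queue names, and worker counts are made up, not our exact setup.

```ini
; Illustrative supervisord config: one program per priority group,
; numprocs controls how many parallel worker processes run.
[program:maas-worker-high]
command=php /var/www/artisan queue:work redis --queue=high --sleep=1 --tries=3
process_name=%(program_name)s_%(process_num)02d
numprocs=10
autostart=true
autorestart=true

[program:maas-worker-low]
command=php /var/www/artisan queue:work redis --queue=low --sleep=3 --tries=3
process_name=%(program_name)s_%(process_num)02d
numprocs=4
autostart=true
autorestart=true
```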

I asked the author because we had the same problem.

Dwelltime was the most challenging feature for us; we got it working thanks to the Laravel queue + Redis. It deserves a dedicated section to discuss further.

SNS for sending push notifications is a good approach: we don’t need to manage any servers, it is a managed service, and we pay only for what we use. Fortunately, we did not hit any SNS limitation before this project ended.
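
With the AWS SDK for PHP, publishing to a device’s platform endpoint is roughly a single call; the endpoint ARN, region, and payload below are placeholders.

```php
<?php

use Aws\Sns\SnsClient;

$sns = new SnsClient([
    'region'  => 'ap-northeast-1',
    'version' => 'latest',
]);

// Publish to the platform endpoint created from the device's push token.
// The endpoint ARN and payload are placeholders.
$sns->publish([
    'TargetArn'        => 'arn:aws:sns:ap-northeast-1:123456789012:endpoint/GCM/maas/example-endpoint-id',
    'MessageStructure' => 'json',
    'Message'          => json_encode([
        'default' => 'Use coupon code BLABLABLA to receive a 50% discount!',
        'GCM'     => json_encode([
            'data' => ['message' => 'Use coupon code BLABLABLA to receive a 50% discount!'],
        ]),
    ]),
]);
```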

The image below is the complete architecture for Stage 2 version.

MAAS Stage 2 — New Features & Improve Durability

Stage 3 — EC2 Worker & New RDS

Added Stack :

  • SQS + Elastic Beanstalk Worker Tier (m4.large & m3.medium)
  • New RDS to store send_data event (r3.large)

In this new version, we replaced Redis with SQS. With this architecture, we have no worries about scaling the workers that consume SQS messages. We can create a CloudWatch alarm to add 1 new instance to the worker pool when messages > 10000 and remove 1 instance from the worker pool when messages < 1000.

Please do remember, those numbers work for us, because 1000 messages in SQS can easily be consumed by 1 instance in a few seconds.

Not only did we gain more performance, but we also gained more durability. When something goes wrong with the code or storage, or an unknown error occurs, we can always retry by utilizing the SQS Dead Letter Queue.
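
One simple way to retry is to drain the dead letter queue back into the main queue with a small script like the sketch below; the queue URLs are placeholders.

```php
<?php

use Aws\Sqs\SqsClient;

$sqs = new SqsClient(['region' => 'ap-northeast-1', 'version' => 'latest']);

// Placeholder queue URLs.
$dlqUrl  = 'https://sqs.ap-northeast-1.amazonaws.com/123456789012/maas-worker-dlq';
$mainUrl = 'https://sqs.ap-northeast-1.amazonaws.com/123456789012/maas-worker';

while (true) {
    $result = $sqs->receiveMessage([
        'QueueUrl'            => $dlqUrl,
        'MaxNumberOfMessages' => 10,
        'WaitTimeSeconds'     => 5,
    ]);

    $messages = $result->get('Messages') ?? [];
    if (empty($messages)) {
        break; // the DLQ is empty, we are done
    }

    foreach ($messages as $message) {
        // Re-queue the body on the main queue, then remove it from the DLQ.
        $sqs->sendMessage(['QueueUrl' => $mainUrl, 'MessageBody' => $message['Body']]);
        $sqs->deleteMessage(['QueueUrl' => $dlqUrl, 'ReceiptHandle' => $message['ReceiptHandle']]);
    }
}
```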

After introducing SQS to our architecture, our SQS queue received 11.2 million messages. We forget how many instances and how much time were needed to process those messages — we’re sorry.

Big thanks to Denis Mysenko for creating this awesome package.

Our worker tier code is still the same; the only changes are in the routing. The worker has routes only for consuming the SQS messages.

Lumen Bootstrap file
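
The Elastic Beanstalk worker tier runs a local daemon (sqsd) that POSTs each SQS message to an HTTP path on the instance, so the worker’s routing boils down to something like the hypothetical route below; the path and handler are illustrative.

```php
<?php
// Illustrative routing for the worker environment (Lumen 5.x style);
// the path and handler are hypothetical, not our real ones.

$app->post('/worker/send-data', function (Illuminate\Http\Request $request) {
    // The SQS message body arrives as the raw HTTP request body.
    $event = json_decode($request->getContent(), true);

    // ... store the event (RDS, and later Firehose) ...

    // A 200 response tells the daemon the message was processed;
    // anything else puts it back on the queue (and eventually the DLQ).
    return response('', 200);
});
```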

Once we had no problem scaling the workers, we had a problem with RDS: our storage was almost full.

Only a few days to its maximum storage of 500 GB

We couldn’t just upgrade the RDS storage without downtime. Unfortunately, at that time, RDS Aurora was only available for MySQL. Our goal was not to lose any data, so our plan was to add a new RDS instance with bigger storage (2000 GB) and normal IOPS, then change the destination of the /send_data endpoint to the new RDS. Our Dashboard then consumed the data from that new RDS with an extra date condition, because our old data is not there.

The old RDS is still used for OLTP queries, like authentication, storing device tokens, etc.
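
In Laravel/Lumen terms, the new RDS is just a second database connection; here is a trimmed sketch of config/database.php, with the connection and env names made up.

```php
<?php
// config/database.php (trimmed sketch) — two PostgreSQL connections:
// the old RDS for OLTP, and the new, bigger RDS for /send_data.
return [
    'default' => 'pgsql',

    'connections' => [
        'pgsql' => [
            'driver'   => 'pgsql',
            'host'     => env('DB_HOST'),
            'database' => env('DB_DATABASE'),
            'username' => env('DB_USERNAME'),
            'password' => env('DB_PASSWORD'),
        ],
        'pgsql_send_data' => [
            'driver'   => 'pgsql',
            'host'     => env('SEND_DATA_DB_HOST'),
            'database' => env('SEND_DATA_DB_DATABASE'),
            'username' => env('SEND_DATA_DB_USERNAME'),
            'password' => env('SEND_DATA_DB_PASSWORD'),
        ],
    ],
];
```

Writes from the /send_data path then target DB::connection('pgsql_send_data') explicitly, while everything else keeps using the default connection.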

The image below is the complete architecture for Stage 3 version.

MAAS Stage 3 — SQS and Elastic Beanstalk Worker Tier

Stage 4 — Firehose & Redshift

Added Stack :

  • New SQS Worker
  • Firehose + Redshift

Redshift is an AWS product that specializes in OLAP queries, mostly used as a data warehouse. The challenge of utilizing Redshift in our architecture is the ETL part: our data is JSON, while Redshift data must be in columnar format, so we needed a specific worker to send data to Redshift. The worker does the transformation from JSON to column format. Not only that, we also needed to make big changes to the queries used by the Dashboard — it was really a big job.
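
The transformation itself is simple in principle: pick out the columns the Redshift table expects from the JSON document and emit one delimited row. A simplified sketch, with a made-up column list:

```php
<?php

// Flatten a JSON event into the fixed column order of the Redshift table.
// The column list here is illustrative, not our real schema.
function toRedshiftRow(array $event): string
{
    $columns = [
        $event['device_id']  ?? '',
        $event['event_type'] ?? '',
        $event['city']       ?? '',
        $event['created_at'] ?? '',
        json_encode($event['properties'] ?? []), // anything without its own column
    ];

    // One stray delimiter or newline can fail the whole COPY, so strip them out.
    $clean = array_map(function ($value) {
        return str_replace(['|', "\n"], ' ', (string) $value);
    }, $columns);

    return implode('|', $clean) . "\n";
}
```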

We made a mistake by doing a full ETL from RDS to Redshift while the result was not being used by anyone. It cost us development time & cloud provider cost.

We also designed the tables by adding a _{year}{month} suffix.

Separating the tables by year and month.

The drawback of this method is that we need to create a separate Firehose delivery stream and a separate Redshift table manually, because our worker automatically detects the current month & year and sends the data to the respective delivery stream. At one point, we completely forgot to do that, and as a result, our SQS queue piled up with too many messages. Our DLQ contained 7.5 million messages.
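
The routing in the worker is essentially: build the delivery stream name from the current year and month, then call PutRecord. A sketch, reusing the toRedshiftRow() helper from above; the stream name prefix is illustrative.

```php
<?php

use Aws\Firehose\FirehoseClient;

function sendToMonthlyStream(FirehoseClient $firehose, array $event): void
{
    // e.g. "send_data_201704"; the delivery stream (and matching Redshift table)
    // for the current month must already exist — the manual step we once forgot.
    $streamName = 'send_data_' . date('Ym');

    $firehose->putRecord([
        'DeliveryStreamName' => $streamName,
        'Record'             => ['Data' => toRedshiftRow($event)], // helper sketched above
    ]);
}
```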

Fortunately, we have SQS

Loading data into Redshift is made easy by using Firehose. Firehose creates a temporary S3 file, and then invokes the Redshift COPY command.

Please do remember, the ETL process before sending data to S3 is really important. Having 1 bad line in a random S3 CSV file will cause the whole COPY command to fail. Please make sure the data is clean.

We also screwed up with Redshift by using it for the wrong kind of query. If we remember correctly, it was an OLTP query searching for a string with LIKE.

A never-ending query; we had to kill it manually

At that time, Redshift was only available in 3 regions, and one of them was Ireland. Fortunately (and also unfortunately), most of our services are in Tokyo, because our users come from South East Asia. The latency from Tokyo to Ireland is not really high. The unfortunate part is the data transfer cost, because we send data to another region.

Our awesome CTO invited an AWS Solutions Architect from Singapore to come to our Bandung office to help us solve the Redshift problem.

The image below is the complete architecture for Stage 4 version.

MAAS Stage 4 — Firehose & Redshift

Stage 5 — Enter Serverless

Added Stack :

  • Kinesis Stream + Lambda via Zappa
  • EMR + Athena
  • Terraform

The story continues when our costs got too high. That being said, we needed to think about how we store the data. After we understood how to handle hot data, warm data, and cold data, we removed Redshift from our architecture; it cost too much while not giving extra value to the company.

We now planned to send raw data to S3, while the metrics for the dashboard come from Redis and PostgreSQL, pre-calculated by EMR and some workers. Our big mistake had been calculating the Dashboard metrics from the raw data table and naively assuming the query cache would work well.

We created 2 data pipelines: Realtime and Batch. We still use SQS for the Realtime Pipeline because we need the retry feature. On the other side, the Batch Pipeline utilizes a Kinesis Stream with Lambda consuming the messages. The combination of Kinesis Stream & Lambda is the best choice for us, because we don’t need to think about servers; especially when we split the Kinesis stream shards, Lambda takes care of that for us. We use Zappa to deploy our Lambda function.

It is really fortunate that we moved the raw data to S3, because a few days later we got a storage-full alert on RDS.

We are really doomed

When we were building the Batch Pipeline, we actually did double writes, just to make sure we were not losing any data. The consequence is that the cost doubled, and it took more time to validate & clean the data between S3 & RDS. That is fine, because we had a dedicated team working on this part.

The most entertaining part is when we applied Terraform to our architecture. It is really fun to do everything in code. The most important thing is that we can easily replicate the architecture and minimize human error. We had no time to terraform everything, but the data pipeline part is already in Terraform.
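
As a taste of what that looks like, here is a trimmed, illustrative Terraform sketch of the stream and its Lambda trigger; the names and sizes are made up, and the real pipeline has more pieces (Firehose, IAM roles, S3 buckets, and so on).

```hcl
# Illustrative only — names and sizes are made up.
resource "aws_kinesis_stream" "events" {
  name        = "maas-events"
  shard_count = 2
}

resource "aws_lambda_event_source_mapping" "events_consumer" {
  event_source_arn  = "${aws_kinesis_stream.events.arn}"
  function_name     = "maas-events-aggregator" # deployed separately via Zappa
  starting_position = "LATEST"
  batch_size        = 100
}
```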

The Kinesis Stream receives small chunks of events from our application. Lambda aggregates them, combines them into big chunks, then sends them to Firehose via PutRecordBatch. Our Kinesis Stream is located in the same region as our main application (Tokyo), while Firehose and S3 are located in North Virginia.
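
On the producer side, putting one small event on the stream from our PHP application is a single call with the AWS SDK for PHP; the stream name below is illustrative. The aggregation and the PutRecordBatch call happen in the Lambda function deployed with Zappa, which is not shown here.

```php
<?php

use Aws\Kinesis\KinesisClient;

$kinesis = new KinesisClient([
    'region'  => 'ap-northeast-1', // the stream sits next to the main application in Tokyo
    'version' => 'latest',
]);

function publishEvent(KinesisClient $kinesis, array $event): void
{
    $kinesis->putRecord([
        'StreamName'   => 'maas-events',                   // illustrative stream name
        'Data'         => json_encode($event),             // one small event per record
        'PartitionKey' => $event['device_id'] ?? uniqid(), // spreads records across shards
    ]);
}
```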

Please do remember, Kinesis Stream & Kinesis Firehose have limits. I wrote about how to “scale” Kinesis Firehose, and it works for us.

A few days after we implemented the Batch Pipeline, we were receiving 5 GB daily in S3 in CSV format. It is processed by the data team utilizing Apache Spark. Ridwan Fajar is creating a dedicated page about how this works.

We also utilize Athena for quick & simple queries on S3. The data team converts the CSV to ORC, so it is easy for us to query. Not long after Athena was released, Redshift Spectrum came out. We wanted to try it, but no luck*.

The highest traffic we received is ~50 million hits/day on Elastic Beanstalk (roughly 580 requests per second on average). We limit the instance count to a minimum of 6 and a maximum of 10, with instance type m4.xlarge. And please remember, that is a PHP application, the Lumen micro-framework with default Elastic Beanstalk settings. If we tuned everything or switched to a programming language faster than PHP, maybe we could cut the cost and improve our response time. If we had to choose one improvement at that time, we were seriously considering Swoole. But because it runs well and it is not broken, we don’t need to fix it; we can invest our time in building other parts of this project.

Not a bad number
Our S3 data, before the Bandung office was shut down and its employees let go

Four months after implementing this new architecture, our management ran into problems and wanted to close the Indonesia offices located in Bandung and Jakarta. We were doing great at cutting cloud provider costs by utilizing S3, Reserved Instances, and Spot Instances. But that still was not enough; there was too much hidden cost besides the cloud provider cost.

Why not a 100% serverless architecture? Our CTO planned to move everything to serverless, but at that time we did not have any scaling problem on the REST API part. We tried to calculate our API Gateway cost for 50 million hits/day, and the cost was too high: at $3.50 per million hits, 50 million hits a day is roughly $175 per day for the requests alone.

Why not try containers? If only we had more time. We wanted to try ECS because we had done some research on it, and after asking 3 AWS Solutions Architects, they all recommended that we move from Elastic Beanstalk to ECS. The important thing is that AWS manages the containers for us, and ECS itself is free.

The image below is the complete architecture for Stage 5 version.

MAAS Stage 5 — Enter Serverless

Lesson Learned

  1. If you have a database and are not planning to scale it up/down, then you should use Reserved Instances, not On-Demand.
  2. Try to utilize Spot Instances as much as your team can.
  3. Separate the AWS accounts for Development & Production. It helps with security and makes it easier to calculate costs.
  4. If having multiple AWS accounts is not possible, try to utilize AWS Tags to track your project's costs.
  5. Pay attention to REST API responses. Do not send unnecessary response data; it accumulates into your AWS data transfer bill ($0.09/GB). If an HTTP header alone is enough, go with that.
  6. Make sure the database you are using is the right tool for the right job.
  7. Having/building internal analytics for your application is really important, to make sure the features your team builds are actually used by clients.
  8. If you have a repeated task that takes more than 1 minute, automate it.
  9. If you are building a big project, especially from scratch, taking lots of screenshots/photos of the project is completely fine.
  10. If you are using Slack, you can create a bot to receive every message and export it to permanent storage.

Conclusion

This is the saddest part of our journey: we had not yet done everything we could, but management chose to shut down the operation. We built this project from zero (even minus) to (not yet) hero. None of us had experience with AWS, and the data we are talking about is not small and would keep growing as the number of clients increased.

We are really grateful for this experience; not every engineer gets the same opportunity. What we can do is share our little experience. We hope nobody makes the same mistakes we did.

This whole architecture is the “Input” part. We have other teams working on “Process” and “Output”. We will create a dedicated page for that.

Many thanks to the Team

This architecture was built by the team, so we have to say thank you to everyone involved in building it. Here are our friends who built this awesome architecture: our best CTO Elaia Raj, our best managers Toshi Firmansyah & Budi Satria, Tajhul Faijin Aliyudin, fajri abdillah, Wira Sakti, Rinaldi Guarsa & Ridwan Fajar.
