Bird's Eye View of the Big Data Landscape — Challenges and Solutions

Publishing mediums constantly generate multimedia digital content in many forms: text/documents, video, audio and other formats yet to disrupt the information industry. Consumption is only going to increase going forward, stacking up histories of user activity and interactions on various platforms. The analytics available on a product is one of the foremost reasons why publishers love it and why subscribers get a better experience with it.

The commonly used analytics such as views/impressions and conversions, as well as advanced applied analytics like a) social graph analysis (e.g. degree-of-separation algorithms useful for social networks like Facebook and LinkedIn) and b) purchase behaviour analysis (e.g. item-to-item collaborative filtering for product recommendations in e-commerce companies like Amazon and Flipkart), call for a big data platform that powers all of these requirements. I would like to highlight some pointers around the properties of such a system in this post.
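To make the purchase-behaviour example concrete, here is a toy sketch of item-to-item collaborative filtering over binary co-purchase data. The sample baskets, item names and the cosine formulation are my own illustrative assumptions, not tied to any particular product.

```java
import java.util.*;

// Toy item-to-item similarity from co-purchase data; the data is made up for illustration.
public class ItemToItemCF {
    public static void main(String[] args) {
        // userId -> set of purchased itemIds (hypothetical sample data)
        Map<String, Set<String>> purchases = Map.of(
                "u1", Set.of("book", "lamp"),
                "u2", Set.of("book", "lamp", "desk"),
                "u3", Set.of("book", "desk"));

        // Count, for every pair of items, how many users bought both.
        Map<String, Map<String, Integer>> coCounts = new HashMap<>();
        Map<String, Integer> itemCounts = new HashMap<>();
        for (Set<String> basket : purchases.values()) {
            for (String a : basket) {
                itemCounts.merge(a, 1, Integer::sum);
                for (String b : basket) {
                    if (!a.equals(b)) {
                        coCounts.computeIfAbsent(a, k -> new HashMap<>()).merge(b, 1, Integer::sum);
                    }
                }
            }
        }

        // Cosine similarity over binary purchase vectors: co(a,b) / sqrt(count(a) * count(b)).
        coCounts.forEach((a, neighbours) -> neighbours.forEach((b, co) -> {
            double sim = co / Math.sqrt(itemCounts.get(a) * (double) itemCounts.get(b));
            System.out.printf("sim(%s, %s) = %.2f%n", a, b, sim);
        }));
    }
}
```

A production system would compute these pairwise similarities offline over billions of interactions and serve the top-N neighbours of each item as recommendations.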

Reliability
A cluster is generally composed of commodity hardware in the cloud, chosen for good performance per unit of money. The expectation is that the cluster is resilient to hardware failures: misconfigured machines, bad disks and network failures are some things to look out for.
The workload on the cluster itself might be heterogeneous in nature, such as ad-hoc processing by an analyst or large model training by a data scientist.
The distribution of the data itself might be uneven, causing data skew problems that lead to performance bottlenecks. Bursty requests are also an issue, so throttling becomes an important factor in application design for responsiveness.
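As a minimal sketch of the throttling point, the snippet below is a token-bucket limiter that an ingestion endpoint could place in front of bursty clients; the class name, capacity and rate are hypothetical.

```java
// A minimal token-bucket throttle; names and numbers are illustrative, not from any specific system.
public final class TokenBucketThrottle {
    private final long capacity;        // maximum burst size in requests
    private final double refillPerNano; // steady-state rate, converted to tokens per nanosecond
    private double tokens;
    private long lastRefillNanos;

    public TokenBucketThrottle(long capacity, double requestsPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = requestsPerSecond / 1_000_000_000.0;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    /** Returns true if the caller may proceed, false if the request should be rejected or queued. */
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefillNanos) * refillPerNano);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

A request handler would call tryAcquire() for each incoming request and shed or queue the load whenever it returns false.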
 
Throughput
There are various classes of data storage options available: cloud object storage (Amazon S3, Google GCS), local storage such as EBS volumes in AWS, and RAM.
As throughput increases across these tiers, reliability goes down. The art is to find the right trade-off between reliability and throughput.

Scalability
Some considerations around scalability are the number of clusters and the number of nodes in a cluster. As the size of the data and of the processing grows, the cloud components used must be scaled horizontally to handle the workload and meet SLA requirements. The system should be able to bring workers up and track their health under the workload.
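One hedged sketch of a horizontal-scaling rule, assuming the backlog of pending tasks and the per-worker throughput are observable; the class and parameter names are made up for illustration.

```java
// A back-of-the-envelope horizontal-scaling rule; thresholds and names are assumptions.
public final class ScalePolicy {
    private final int minWorkers;
    private final int maxWorkers;
    private final double tasksPerWorkerPerMinute;

    public ScalePolicy(int minWorkers, int maxWorkers, double tasksPerWorkerPerMinute) {
        this.minWorkers = minWorkers;
        this.maxWorkers = maxWorkers;
        this.tasksPerWorkerPerMinute = tasksPerWorkerPerMinute;
    }

    /** Desired worker count so the current backlog drains within the SLA, clamped to cluster limits. */
    public int desiredWorkers(long backlogTasks, double slaMinutes) {
        int needed = (int) Math.ceil(backlogTasks / (tasksPerWorkerPerMinute * slaMinutes));
        return Math.max(minWorkers, Math.min(maxWorkers, needed));
    }
}
```

An autoscaler loop would call desiredWorkers(...) on a schedule and add or remove nodes to match the returned count.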

Availability
The system might become unavailable due to hardware failures as well as software upgrades, and managing this at scale is a daunting task. Zero-downtime guarantees can be provided only if rolling upgrades and rollbacks are provisioned for. It is recommended to have sufficient instrumentation, monitoring and alerting on the health of the system. Fast recovery from failures is expected: detect downtime and auto-recover so that the state of the system is preserved.
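A minimal sketch of the detect-and-recover idea, assuming workers emit periodic heartbeats; the timeout and the restart hook are placeholders for whatever orchestrator is in use.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal heartbeat tracking: workers report in periodically and anything silent past the
// timeout is treated as down and restarted. Names and the timeout are illustrative assumptions.
public class HeartbeatMonitor {
    private final Map<String, Instant> lastSeen = new ConcurrentHashMap<>();
    private final Duration timeout;

    public HeartbeatMonitor(Duration timeout) {
        this.timeout = timeout;
    }

    /** Called by each worker (or by the agent watching it) on every heartbeat. */
    public void heartbeat(String workerId) {
        lastSeen.put(workerId, Instant.now());
    }

    /** Periodically invoked by a scheduler: flags dead workers and triggers recovery. */
    public void sweep() {
        Instant cutoff = Instant.now().minus(timeout);
        lastSeen.forEach((workerId, seenAt) -> {
            if (seenAt.isBefore(cutoff)) {
                System.out.println("Worker " + workerId + " missed heartbeats; restarting");
                restart(workerId);
            }
        });
    }

    private void restart(String workerId) {
        // Placeholder: hook into your orchestrator here (e.g. relaunch the instance or container).
        lastSeen.put(workerId, Instant.now());
    }
}
```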

Simplicity
Simplicity is often overlooked because it is not a functional requirement of the system. It should be easy for developers to understand how to use the platform, and the platform should expose logs and profiling information to give better visibility when troubleshooting.

Lower cloud cost
Cloud services are usually provisioned on a pay-per-use model, so adding more and more components to the stack can pile up the cost. Cloud costs can be optimized with the help of the following:

1. Compute instances should not lie idle; pack as many workloads and jobs onto them as possible to get optimal resource utilisation (see the packing sketch after this list).
2. Elasticity — autoscaling compute and storage might help save costs while still providing system uptime guarantees.
3. Optimising cost using cloud features without sacrificing reliability, for example with spot instances on AWS.
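The packing sketch referenced in point 1: a first-fit-decreasing placement of job CPU demands onto fixed-size instances. The instance size and job list are invented numbers; real schedulers also weigh memory, IO and data locality.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// First-fit-decreasing packing of job CPU demands onto instances, to avoid paying for idle capacity.
public class JobPacker {
    public static void main(String[] args) {
        double instanceCpus = 16.0;                                           // hypothetical instance size
        List<Double> jobCpuDemands = new ArrayList<>(List.of(8.0, 6.0, 5.0, 4.0, 3.0, 2.0, 2.0, 1.0));

        jobCpuDemands.sort(Comparator.reverseOrder());        // biggest jobs first
        List<Double> instanceFreeCpu = new ArrayList<>();     // remaining CPU per provisioned instance

        for (double demand : jobCpuDemands) {
            boolean placed = false;
            for (int i = 0; i < instanceFreeCpu.size(); i++) {
                if (instanceFreeCpu.get(i) >= demand) {       // first instance with room wins
                    instanceFreeCpu.set(i, instanceFreeCpu.get(i) - demand);
                    placed = true;
                    break;
                }
            }
            if (!placed) {                                    // otherwise provision one more instance
                instanceFreeCpu.add(instanceCpus - demand);
            }
        }
        System.out.println("Instances needed: " + instanceFreeCpu.size());
    }
}
```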

Security
The workers should sit behind a firewall and be protected with network Access Control Lists. Encryption of data at rest might be needed for compliance and regulation (HIPAA, PCI). There is also a lot of temporary data generated by ETL jobs, and it must be encrypted as well, even in transit.
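As an example of encrypting data in transit, here is a sketch of a TLS-enabled Kafka producer writing temporary ETL records. The broker address, topic name, keystore paths and environment variables are placeholders, and certificates are assumed to be provisioned already.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Encrypting ETL data in transit with a TLS-enabled Kafka producer (placeholder names throughout).
public class EncryptedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9093");                        // TLS listener (assumed port)
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("security.protocol", "SSL");                                // encrypt traffic in transit
        props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks");
        props.put("ssl.truststore.password", System.getenv("TRUSTSTORE_PASSWORD")); // assumed to be set
        props.put("ssl.keystore.location", "/etc/kafka/client.keystore.jks");       // only for mutual TLS
        props.put("ssl.keystore.password", System.getenv("KEYSTORE_PASSWORD"));     // assumed to be set

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("etl-staging", "record-key", "temporary ETL payload"));
        }
    }
}
```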

Here is one possible solution that I have been able to design.

Event sourcing, with event records shaped like (old value, new value, event timestamp), can be used to build an analytics platform that is responsive to every system-generated event in realtime. Getting a stream of every event that has ever happened in your company could be a dream come true. Kafka can be used as an implementation of the Enterprise Service Bus design pattern to orchestrate the data between various systems; this mainly helps with replay-ability of the data. Eventually this data can be moved to warm storage for online analytics and to cold storage for batch processing.

There is also a realtime component to the analytics, which can be implemented using the Kappa architecture and Complex Event Processing (CEP) frameworks such as Flink and Storm, to name a few. Kafka (>0.11) supports CEP as well as effectively-once data delivery semantics. This would also involve a visualisation component, for which many technology choices are available, Grafana being one. Confluent/Kafka can also be used for stateless (filter) and stateful (aggregation, joins, windowing) processing of realtime data using the Kafka Streams API with effectively-once delivery semantics, and Kafka has the ability to scale horizontally based on workload.
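To illustrate the Kafka Streams part, here is a sketch of a stateful windowed aggregation (a five-minute count of page views) with the processing guarantee set to Kafka's transactional, effectively-once mode. The topic names, application id and window size are my assumptions, not from any particular deployment.

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class PageViewCounts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pageview-counts");    // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");     // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Effectively-once processing via Kafka transactions (use "exactly_once" on pre-2.5 clusters).
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, "exactly_once_v2");

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> views = builder.stream("page-views");         // assumed input topic

        views.groupByKey()                                                    // stateful: group by page id
             .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5))) // 5-minute tumbling windows
             .count()                                                         // windowed aggregation
             .toStream()
             .map((windowedKey, count) -> KeyValue.pair(
                     windowedKey.key() + "@" + windowedKey.window().startTime(), count.toString()))
             .to("page-view-counts");                                         // assumed output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The same topology style extends to the filters, joins and other aggregations mentioned above.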