A while ago I listened to a podcast with Ian Gorton, who talked about the principles governing systems that process vast amounts of data. Even though it has already been 4 years, I find many of his insights still hold true. In this post I summarize them and try to put them into a present-day technology context.
According to Ian, the key problems with Big Data are:
- Volume of the data: there is a lot of it; storing it and providing the resources to process it is a challenge.
- Velocity of the data: how fast it changes, i.e. the rate of change.
- Variety of data types which need to be analyzed.
- Veracity of the data: its integrity, quality, etc.
They all boil down to a single issue: very large, complex data sets are now available, and it is no longer possible to process them with traditional databases and data processing techniques.
In 2014 (when the talk was recorded) this was indeed true. These days (late 2018) there is one more caveat: Moore’s law. 4 years ago CPUs and memory were not as fast, nor as optimized for parallel processing, as they are today. A data set which might have been considered Big Data back then could now be handled by a single machine, and this boundary keeps moving up. This means that for some slower-moving organizations, moving to Big Data tech doesn’t make sense anymore: they can solve the problem by simply buying faster hardware for their “data server” instead of building a Hadoop/Spark cluster.
“The processing systems need to be distributed and horizontally scalable (i.e. you can add capacity by adding more hosts of the same type instead of building a faster CPU).”
If we define Big Data tech this way, we can formulate principles to follow while engineering large data processing systems:
- “You can’t scale your efforts and costs for building a big data system at the same rate that you scale the capacity of the system.” — if you estimate that within a year your system will be 4x bigger, you can’t expect to have a 4x bigger team.
- “The more complex the solution, the less it will scale” — choose your tech wisely. The more moving parts it has, the harder it is to understand how to make it work with 10/100/1000x more data.
- “When you start to scale things, statefulness is a real problem because it’s really hard to balance the load of the stateful objects across the available servers” — it is difficult to handle failures, because losing state and recreating it is hard. Using stateless objects is the way to go.
- “Failure is inevitable — redundancy and resilience to failure is key.” — you have to account for failure and be ready for problems with many parts of the system, often many of them at the same time.
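To illustrate the statelessness point above, here is a minimal sketch of a handler that keeps no state in the server process itself: each request carries a key, and all state lives in an external store, so any replica can serve any request and a crashed server loses nothing. The `SessionStore` class is a hypothetical dict-backed stand-in for something like Redis or DynamoDB.

```python
class SessionStore:
    """External state store (a plain dict here; Redis/DynamoDB in practice)."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


def handle_request(store, session_id, increment):
    """A stateless handler: all state flows through `store`, none lives here."""
    count = store.get(session_id, 0) + increment
    store.put(session_id, count)
    return count


store = SessionStore()
handle_request(store, "user-42", 1)
print(handle_request(store, "user-42", 1))  # any replica would see count == 2
```

Because `handle_request` holds nothing between calls, load balancing becomes trivial: requests can go to any instance, and failover is just pointing traffic elsewhere.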
In our cloud-enabled world, the technical challenges of scalability and resiliency to failure bring in a trade-off. Most of those problems can be solved by simply paying more money for cloud services. By “outsourcing” the system design and following the architecture templates outlined by a cloud provider, you can remain completely focused on using the data to solve actual problems. Unfortunately, this won’t work forever. Once you reach a certain size, you will need to rebuild your system with very specialized components, because off-the-shelf solutions won’t work anymore. You need to make a conscious decision: choose between building your own system slower, but with the certainty that it will scale well, or starting faster with cloud services.
Another concern, which quickly appears when developing a Big Data system, is making sure it works properly: testing.
“Hence, how do you know that your test cases and the actual code that you have built are going to work anymore? The answer is you do not, and you never will.”
If you are working with a Big Data system, you can never know how it will behave in production, because recreating the real conditions is too costly. This means that the only reliable and predictable way to build such a system is to introduce a feedback loop which will tell you as early as possible whether you have broken anything, which boils down to: continuous, in-depth monitoring of the infrastructure and using CI/CD together with techniques like blue/green deployment.
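The feedback loop described above can be sketched as a blue/green switch that only shifts traffic to the new (“green”) environment once its health checks pass, and otherwise keeps the old (“blue”) one live. The environments and the `healthy()` probe here are hypothetical stand-ins; in practice the probe would query real monitoring (error rates, latency percentiles, etc.).

```python
def healthy(env):
    """Stand-in for querying monitoring: error rate below a threshold."""
    return env.get("error_rate", 1.0) < 0.01


def deploy(live, candidate):
    """Return the environment that should receive production traffic."""
    if healthy(candidate):
        return candidate  # promote green to live
    return live           # keep blue; the bad release never sees users


blue = {"name": "blue", "error_rate": 0.002}
green = {"name": "green", "error_rate": 0.05}  # a bad release
print(deploy(blue, green)["name"])  # prints "blue": rollout is refused
```

The point is that monitoring is wired directly into the deployment decision, so a regression is caught by the pipeline rather than by users.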
Some closing thoughts:
- It is important to do proper capacity planning. If it looks too complicated or you think it doesn’t make sense yet, just try to estimate the data inflow over some future period (e.g. a year).
- A key factor in improving efficiency and cutting the costs of operating a large system is automation (e.g. instead of manually installing servers, write scripts which do it).
- Simplicity allows for a better understanding of what is happening in the system, which in turn leads to a better understanding of bottlenecks and how to avoid them.
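The capacity-planning point above can be as simple as a back-of-the-envelope estimate of data inflow a year out under an assumed growth rate. All the numbers below are made-up inputs; the point is the shape of the estimate, not the values.

```python
daily_inflow_gb = 50    # current ingestion per day (assumed)
monthly_growth = 0.10   # assumed 10% month-over-month growth
replication = 3         # copies kept for redundancy

total_gb = 0.0
inflow = daily_inflow_gb
for month in range(12):
    total_gb += inflow * 30  # approximate a month as 30 days
    inflow *= 1 + monthly_growth

print(f"raw data accumulated after a year: ~{total_gb / 1024:.1f} TB")
print(f"with {replication}x replication: ~{replication * total_gb / 1024:.1f} TB")
```

Even a crude estimate like this tells you whether a single beefy machine will still suffice in a year, or whether it is time to start planning for a distributed setup.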
Originally published at pkoperek.github.io on September 27, 2018.