Journey to the Cloud (1/3)

Bayu Satria Setiadi
2 min read · Nov 6, 2019


“What the hell is cloud computing?”
“Ok, now with this thing, we are going to get access to the computer out there, on the internet.”
“We already have a data center, why would we do that?”
“Just D.I.Y. with your own server, it’s more challenging and fun.”

That was me…10 years ago,
Just about a year after graduating from vocational high school,
Before the startup era,
Before K8s,
Before I learned about Big Data with its 3Vs,
Before I hit scalability issues.

“Wait…but Big Data is supposed to deal with that scalability, isn’t it?”

The story goes on…

Like any other corporation, we at Link Net already have data centers in place. As a company grows, there are more things to consider, such as risk, policy, security, capex/opex, and profits.

Dozens of apps and servers have been provisioned to fulfill our business needs. Each particular system generates its own data, and that’s where the problem arises. A big, heavy, single data warehouse machine served us well for years, until it hit its threshold. OK, now what should we do? “Oh, just scale it up! Add more RAM, faster disks, a CPU upgrade, etc.”

A single box has its limits…

Our traditional data warehouse

Scaling out a web app/service is easy: just spawn another server, put a load balancer in front of the cluster, and you’re done. Not a big deal. But scaling out an RDBMS is another story.

Of course, there are features like SQL Server clustering and Oracle RAC (Real Application Clusters) to achieve high availability and performance in terms of computing power. However, getting started with those kinds of clustering technologies can be difficult, and they are impractical for handling very large data for analytical purposes. They are RDBMSs, after all. Moreover, for some products, as the number of cores grows, the license cost increases linearly as well. We are talking about $15K–$47K per core.

Since 2015, a number of ML (machine learning) portfolios have been deployed to help some departments find unseen patterns in the data and make decisions with minimal human intervention. R was our first choice because it offers exemplary support for data wrangling, a rich collection of statistical packages, and a relatively gentle learning curve. It worked well for a year. Then the data grew exponentially, and performance suffered, especially when we needed to rebuild our models.

R is single-threaded by default, which means it only uses one core of your CPU. I’ve heard that parallelism is possible with R; put your comments down below, I’m eager to hear about your experience. Although we don’t use R anymore, it’s still useful for rapid prototyping.
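For the curious, here’s a minimal sketch of one common route I’ve seen, using the `parallel` package that ships with base R. To be clear, this is not something we ran in production, and `fit_one` is a made-up stand-in for an expensive model-rebuild step. Strictly speaking, this is process-level parallelism (forked workers) rather than true multithreading:

```r
library(parallel)

# Hypothetical stand-in for an expensive model-rebuild step.
fit_one <- function(seed) {
  set.seed(seed)
  df <- data.frame(x = rnorm(1e5))
  df$y <- 2 * df$x + rnorm(1e5)
  coef(lm(y ~ x, data = df))
}

# Sequential: runs on a single core.
fits <- lapply(1:8, fit_one)

# Parallel: one forked worker per core. Note that mclapply()
# relies on forking and is Unix-only; on Windows, use
# makeCluster() with parLapply() instead.
fits <- mclapply(1:8, fit_one, mc.cores = detectCores())
```

Of course, this only helps when the work splits into independent chunks, and it does nothing for data that has outgrown a single machine’s memory, which brings us back to the real problem.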

Again…scalability issues. Part 2

“We need to do something with this data architecture”
“Whatever it takes…”
