BigData Performance: The Hairline Metaphor

acceldata · Published in The Data Observer · Nov 13, 2018 · 2 min read

If you walked around Data Operations teams anywhere in the world, you would hear the same refrain: "all the metrics look fine, the graphs are refreshing, but we can't figure out what is wrong." And minutes, hours, or days later you'd hear it again: "something just broke, and we can't connect to the node/network/VPN…"

Sound familiar?

At Acceldata we call this the "Hairline Fracture": the symptoms are invisible to the naked eye, but the pain is unbearable. Under continued stress, the bone finally breaks, unable to bear any more weight.

BigData deployments behave much the same way: it is hard to judge a system's health from Grafana sparklines flashing past your eyes. The stresses come in several forms, such as:

JVM Pauses: A Kafka follower in the in-sync replica set (ISR) hits a long JVM pause and falls out of sync; before the pause ends, the leader of the topic partition goes down. A sequence like this is a classic cause of data loss.
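To make this concrete, here is a minimal detection sketch (ours, not any product's implementation): it scans a broker's GC log for stop-the-world pauses long enough to push the broker out of the ISR. It assumes the JVM runs with -XX:+PrintGCApplicationStoppedTime, and the 6-second threshold is purely illustrative.

```python
import re
import sys

# Assumption: the broker JVM logs lines such as
#   "Total time for which application threads were stopped: 6.1234567 seconds"
# (emitted when -XX:+PrintGCApplicationStoppedTime is enabled).
PAUSE_RE = re.compile(r"application threads were stopped: ([\d.]+) seconds")
# Hypothetical threshold, roughly matching a 6000 ms ZooKeeper session timeout.
THRESHOLD_SECONDS = 6.0


def risky_pauses(gc_log_path):
    """Yield stop-the-world pauses long enough to drop a broker out of the ISR."""
    with open(gc_log_path) as log:
        for line in log:
            match = PAUSE_RE.search(line)
            if match and float(match.group(1)) >= THRESHOLD_SECONDS:
                yield float(match.group(1))


if __name__ == "__main__":
    for pause in risky_pauses(sys.argv[1]):
        print(f"STW pause of {pause:.1f}s: broker likely fell out of sync")
```

On the topic side, settings such as acks=all, min.insync.replicas of 2 or more, and unclean.leader.election.enable=false shrink the window in which a pause-then-leader-crash sequence turns into actual data loss.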

Degrading Query Performance: The number of partitions has grown, there are too many files, far too many queries, and not enough compute or memory. Under these conditions, any operation is bound to fail beyond a certain scale. Such scenarios are very hard for operators to diagnose in real time because too many variables are changing at once.
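As a back-of-the-envelope illustration, the sketch below estimates average file size for one table path using the standard `hdfs dfs -count` command, whose output columns are DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME. The path and the 32 MB "small file" cut-off are assumptions, not recommendations.

```python
import subprocess

TABLE_PATH = "/warehouse/tablespace/managed/hive/sales"  # hypothetical table path
SMALL_FILE_BYTES = 32 * 1024 * 1024  # illustrative cut-off for "too many small files"

# `hdfs dfs -count <path>` prints: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
out = subprocess.check_output(["hdfs", "dfs", "-count", TABLE_PATH], text=True)
dirs, files, size_bytes = (int(field) for field in out.split()[:3])

avg_bytes = size_bytes / files if files else 0
print(f"{files} files under {dirs} directories, average file size {avg_bytes / 1e6:.1f} MB")
if files and avg_bytes < SMALL_FILE_BYTES:
    print("Lots of small files: planning overhead and NameNode pressure will "
          "degrade queries long before anything fails outright.")
```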

Container Allocation: Far too many queries have been submitted and there are no spare containers left for a newly submitted Tez job; a 20-second compute job now takes 340 seconds, wreaking havoc on every downstream job waiting for it to complete.
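A rough way to see this pressure from the outside is to poll the YARN ResourceManager's cluster-metrics REST endpoint, as in the sketch below. The ResourceManager host name is hypothetical; the endpoint and field names follow the standard /ws/v1/cluster/metrics response.

```python
import json
from urllib.request import urlopen

# Hypothetical ResourceManager address; the endpoint itself is the standard
# YARN cluster-metrics REST API.
RM_URL = "http://resourcemanager.example.com:8088/ws/v1/cluster/metrics"

with urlopen(RM_URL) as resp:
    metrics = json.load(resp)["clusterMetrics"]

print(f"pending apps:        {metrics['appsPending']}")
print(f"pending containers:  {metrics['containersPending']}")
print(f"available vcores:    {metrics['availableVirtualCores']}")
print(f"available memory MB: {metrics['availableMB']}")

if metrics["containersPending"] and metrics["availableVirtualCores"] == 0:
    print("No spare containers: new Tez DAGs will sit in the queue, and a "
          "20-second job can easily stretch into minutes.")
```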

Host Issues: One data partition or table is hot-spotting because the data layout is not ideal, and the node serving it soon becomes a point of failure: the disk gives up, or the network does. Every query reading from that node is stuck, and isolating the culprit takes a mammoth effort; node-level isolation of such issues is nasty by nature.
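Hot spots like this often show up as one node carrying a disproportionate share of reads. A toy sketch, with made-up per-node numbers and an illustrative "more than twice the cluster median" rule:

```python
from statistics import median

# Made-up reads-per-minute figures; in practice these would come from
# DataNode / RegionServer metrics.
reads_per_node = {
    "dn01": 1200,
    "dn02": 1150,
    "dn03": 1300,
    "dn04": 9800,  # the hot-spotted node
    "dn05": 1250,
}

cluster_median = median(reads_per_node.values())
for node, reads in sorted(reads_per_node.items()):
    if reads > 2 * cluster_median:
        print(f"{node}: {reads} reads/min (cluster median {cluster_median:.0f}) "
              "looks like a hot spot")
```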

Impending Failures: Jobs pile up in the queue because new business processes were on-boarded onto the Data Lake and capacity was not estimated accurately. Failure is now imminent, yet the operations team is unaware of the stress the system is under.
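Capacity surprises like this can often be caught with simple arithmetic before they become outages. Every number in the sketch below is an assumption chosen to illustrate the check, not a measurement from any real cluster.

```python
# Back-of-the-envelope daily capacity check with illustrative numbers.
nodes = 40
vcores_per_node = 16
available_vcore_hours = nodes * vcores_per_node * 24  # 15,360 vcore-hours per day

demand_vcore_hours = {
    "existing_pipelines": 9000,     # what the Data Lake already consumes
    "newly_onboarded_feeds": 4500,  # the new business processes
}
required = sum(demand_vcore_hours.values())

utilization = required / available_vcore_hours
print(f"Projected daily utilization: {utilization:.0%}")
if utilization > 0.8:
    print("Above 80% of daily vcore-hours: queues will back up long before "
          "any dashboard shows a component as down.")
```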

Any of the above is reason enough for failure. Acceldata identifies such hairline fractures accurately and cuts the noise out of your operations bay.

We’re building products that make data operations better.

If you are interested in helping us solve complex problems, drop us a line at hiring@acceldata.io.
