Becoming a pro

The key skill is to acknowledge how little you know

George Shuklin
OpsOps
3 min read · Mar 12, 2019


When one stumbles on a huge problem with subtle complications, it takes a real pro to have the resources and courage to acknowledge one's own incompetence and ineptitude.

I'm doing this now. I found a task where my frail math knowledge is completely inadequate to the challenge.

The problem

I need to compare two test results: one from a 'good' server and one from a 'bad' one. To clarify the task, I'll trim the problem down to a single value: the IO latency (for a test with a given iodepth/blocksize/span). I need to run my test against a herd of good servers and, having their measurements, be able to say whether a new server is 'good enough' or not. It's good enough if its latency is no worse than the reference (from the 'good' servers).

The benchmark tool is fio, and it can report many latency-related numbers.
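For instance, with --output-format=json fio will dump those numbers in machine-readable form. Here is a minimal sketch of pulling the completion-latency stats out of such a report (field names follow recent fio versions, which report nanoseconds under clat_ns; the file path and the choice of the "read" section are assumptions for illustration):

```python
import json

def read_clat_stats(path):
    """Extract completion-latency stats (in ns) from a fio JSON report.

    Layout matches recent fio versions run with --output-format=json;
    older versions report latency in usec under 'clat' instead of 'clat_ns'.
    """
    with open(path) as f:
        report = json.load(f)
    job = report["jobs"][0]["read"]   # or "write", depending on the test
    clat = job["clat_ns"]
    return {
        "mean": clat["mean"],
        "stddev": clat["stddev"],
        # keys are percentile labels like "99.000000", values are ns
        "percentiles": clat.get("percentile", {}),
    }
```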

Solutions I don’t want to use

Solution: Use reference.clat_ns.mean (the mean value from the reference servers). If new.clat_ns.mean > reference.clat_ns.mean, the server is bad.

Why not: If I do this, about 20% of the 'good' servers will have their value above the mean. How do I know that? Because I measured: the mean sits at roughly the 80th percentile of the good servers. Any random fluctuation will throw a good server into the bad zone. Moreover, even a 0.0001% increase above the mean will make a server 'bad'. That's not what I want. I want all good servers to be 'good' and all bad servers to be 'bad'.
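A toy simulation shows why a threshold at the mean must misclassify: for right-skewed data, a large fraction of perfectly good samples sits above the mean. The gamma shape and scale below are invented, chosen only to give a latency-like skew:

```python
import random
import statistics

random.seed(42)

# Toy model: latencies drawn from a right-skewed gamma distribution
# (shape < 1 gives a heavy tail, like long-tail IO latencies).
latencies = [random.gammavariate(0.5, 2e7) for _ in range(100_000)]

mean = statistics.fmean(latencies)
above = sum(x > mean for x in latencies) / len(latencies)

# A sizeable chunk of 'good' samples exceeds the mean,
# so the rule 'value > mean' flags them as bad.
print(f"fraction above mean: {above:.2f}")  # around 0.3, nowhere near 0
```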

Solution: Flag the server as bad if new.clat_ns.mean - new.clat_ns.stddev > reference.clat_ns.mean + reference.clat_ns.stddev.

Why not: Here's the hard part. I talked to a guy who knows statistics better than I do. He asked me whether this is really a normal distribution. We looked at the percentiles and saw this:

The mean is 16699589, which sits at about the 75th percentile. Moreover, the 1st and 10th percentiles are almost the same value. It's not a normal distribution. He proposed that it's a gamma distribution. The hard part here is to acknowledge that my knowledge is superficial: I can more or less reason about 'distributions' only by comparing the shapes of their graphs. I just don't know.
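For what it's worth, a crude sanity check of the gamma guess needs only the mean and the variance: a method-of-moments fit gives shape k = mean²/variance and scale θ = variance/mean, and the fitted percentiles could then be compared with fio's reported ones. A sketch on synthetic data with invented parameters (not a rigorous goodness-of-fit test):

```python
import random
import statistics

def gamma_moments_fit(samples):
    """Method-of-moments fit for a gamma distribution:
    shape k = mean^2 / variance, scale theta = variance / mean."""
    mean = statistics.fmean(samples)
    var = statistics.pvariance(samples)
    return mean * mean / var, var / mean

# Check the estimator on synthetic data with known parameters.
random.seed(7)
true_shape, true_scale = 2.0, 8e6
data = [random.gammavariate(true_shape, true_scale) for _ in range(200_000)]

shape, scale = gamma_moments_fit(data)
print(f"shape ~ {shape:.2f}, scale ~ {scale:.3g}")  # close to 2.0 and 8e6
```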

Solution: Take the 80th percentile of the new server and compare it with the 99th percentile of a good server. If it's larger, the server is 'bad'; if not, it's good.

Why not: Why 80? Why 99? Arbitrary numbers with no reasoning behind them. Moreover, it doesn't work in my real-life tests.
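For completeness, the rule itself would be a one-liner over fio's percentile table (the string keys like "99.000000" are how fio labels percentiles in its JSON output); the objection above stands regardless:

```python
def looks_bad(new_percentiles, ref_percentiles,
              new_pct="80.000000", ref_pct="99.000000"):
    """Rejected rule: flag the new server as bad when its 80th
    percentile exceeds the reference servers' 99th percentile.
    Keys follow fio's JSON percentile labels."""
    return new_percentiles[new_pct] > ref_percentiles[ref_pct]

ref = {"99.000000": 30_000_000}   # 30 ms at p99 on the good servers
new = {"80.000000": 45_000_000}   # 45 ms already at p80 on the new one
print(looks_bad(new, ref))        # True: worse than the reference tail
```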

Solution: Tweak the formula until it passes for all the available servers.
Why not: Will it work on the next (new) server? Who knows?

Self-abjection

I don't know statistics. I don't even know what words to use to describe my problem. I can more or less say that I want to be sure of the result with an N% error, but beyond that it gets shaky. 3-sigma? I've heard of that. Confidence intervals!
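For what it's worth, a confidence interval for the mean can be computed without any normality assumption via a percentile bootstrap: resample the data with replacement many times and take quantiles of the resampled means. This is a sketch of that one idea, not necessarily the right tool for the whole problem:

```python
import random
import statistics

def bootstrap_mean_ci(samples, confidence=0.95, n_resamples=1000, seed=0):
    """Percentile-bootstrap confidence interval for the mean.
    Makes no normality assumption, which suits skewed latency data."""
    rng = random.Random(seed)
    n = len(samples)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n))
        for _ in range(n_resamples)
    )
    alpha = (1 - confidence) / 2
    lo = means[int(alpha * n_resamples)]
    hi = means[int((1 - alpha) * n_resamples) - 1]
    return lo, hi

# Synthetic skewed 'latencies' with an invented gamma model (true mean 1.6e7).
random.seed(1)
latencies = [random.gammavariate(2.0, 8e6) for _ in range(2_000)]
lo, hi = bootstrap_mean_ci(latencies)
print(f"95% CI for the mean latency: [{lo:.3g}, {hi:.3g}] ns")
```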

Moreover, my measurements float a bit between runs. I need to do a few runs. How many runs should I do? And how can I detect that my statistical hypothesis is no longer true?

I can see the depth of my incompetence. The proper way would be to just go and learn all of it, from probability theory through statistics to (stuff I don't even know exists). I have no time. Normally I should finish today, or tomorrow before lunch. If I really want to be tenacious, I can stretch it to a week.

Can I do this in one week (while doing other things in the process)? I doubt it. What should I do?
