Hypothesis Testing: An overlooked everyday phenomenon

Aniruddha Mitra · Published in Analytics Vidhya · Dec 25, 2019

Before delving into the technicalities, let me drive home a point: we routinely make decisions on the basis of only a part of the scene we've just seen. For example, on seeing a portion of an X-Y graph that trends upward, we don't hesitate to conclude that Y grows with X. Once you've made that claim, statistics comes into play to support or discredit your opinion.

As statistics has entered the article, you can't do away with randomness, which is again a very natural yet often overlooked phenomenon in daily life. So, in view of randomness, the question becomes: is the upward trend you've just noticed purely a matter of chance, i.e. randomness, or is there really a trend? By this point, you should be convinced that whatever we see, we see it through a layer of randomness (true in all cases, if you dare to think about it: I measured my height to be 5'6''. But is it really? Am I sure that if I measure it the next thousand times, I'll get the same figure?).

Now, randomness can't be removed. Reality check! But the good news is that randomness can be modeled. The first time I heard the term 'model' I was taken aback. Let me clarify: it's nothing more than a mathematical equation. As we're dealing with randomness, a deterministic equation will not do (e.g. if you run at a speed of 10 km/hr, in half an hour you cover 5 km). Rather, here the model-equation gives us a measure of chance, known as 'probability', that the random variable of interest assumes values within a given range. The most boring yet most useful example: if we toss an unbiased coin, the outcome is masked with randomness, and we can say that the chance of getting 'heads' is half. This becomes the simplest model of randomness: P(H) = 0.5, P(T) = 0.5.
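To see this simplest model in action, here is a minimal Python sketch (the toss counts are my own choice, purely for illustration) that simulates coin tosses and checks how close the observed frequency of heads comes to the modeled probability of 0.5:

```python
import random

def simulate_tosses(n_tosses: int, p_head: float = 0.5) -> float:
    """Simulate n_tosses of a coin with P(H) = p_head; return the observed frequency of heads."""
    heads = sum(1 for _ in range(n_tosses) if random.random() < p_head)
    return heads / n_tosses

# The model says P(H) = 0.5; the simulation shows the randomness around it.
for n in (10, 100, 10_000):
    print(f"{n:>6} tosses -> observed P(H) = {simulate_tosses(n):.3f}")
```

Notice that even with the model fixed at 0.5, each run gives a slightly different frequency. That gap between the model and any single observation is exactly the randomness we keep talking about.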

You're not to blame if you find it silly. But the interesting point is that years of research have made it possible to model many interesting phenomena. For example, the number of boundaries a batsman hits in a given number of deliveries can be modeled quite well with a Poisson distribution. Let us not go into the modeling part now.
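As a quick taste of that claim, here is how one might draw from such a Poisson model. The rate of 1.2 boundaries per innings below is a made-up number for illustration, not from any real dataset:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical rate: on average 1.2 boundaries per innings (illustrative only).
lam = 1.2
simulated_innings = rng.poisson(lam=lam, size=10)

print("Boundaries hit in 10 simulated innings:", simulated_innings)
# Under the model, P(exactly k boundaries) = exp(-lam) * lam**k / k!
print("Modeled chance of a boundary-less innings:", round(np.exp(-lam), 3))
```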

So the point is: if we can decode the pattern of distortion that this mask produces over reality, i.e. if by modeling the randomness we can get a hint about how it distorts the real data, then we can check whether what we see is purely an effect of the mask of randomness or not.

Getting into hypothesis terms…

To conclude a statement, we take the strategy of a legal court: 'assumed innocent unless the evidence is strong enough to discredit that assumption and thus prove guilt'.
In our case, the initial assumption of innocence is the 'Null Hypothesis'. Its counterpart is the 'Alternate Hypothesis', which is (most of the time) the opposite of the Null: we set out to prove with evidence that the accused IS NOT innocent. If we fail to do so, the Null will have to be accepted. To connect this with our initial example, H_Null: 'In the graph, Y is not growing with X'. H_Alternate: 'It is NOT the case that Y is not growing with X', i.e. 'Y is growing with X'.
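To preview how such a pair eventually gets tested, here is a minimal sketch with made-up (X, Y) points. scipy's linregress reports a two-sided p-value for the null hypothesis that the slope is zero; since our alternate is one-sided ('Y grows with X'), we halve it when the fitted slope is positive:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)

# Made-up points with a mild upward trend plus noise (illustrative only).
x = np.arange(30, dtype=float)
y = 0.5 * x + rng.normal(scale=5.0, size=30)

res = stats.linregress(x, y)
# linregress tests H_Null: slope = 0 (two-sided); convert to our one-sided alternate.
one_sided_p = res.pvalue / 2 if res.slope > 0 else 1 - res.pvalue / 2
print(f"slope = {res.slope:.3f}, one-sided p-value = {one_sided_p:.4f}")
```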

Let's take a real-life example. Say I claim that during winter the productivity of manufacturing plants falls. Let's write the hypotheses.
H_Null = opposite of the claim ~ status quo ~ 'Productivity in winter is equal to that of summer'
H_Alternate = the claim ~ opposite of H_Null ~ 'Productivity in winter is less than that of summer'
Modeling the event:
Productivity = function(other deterministic, i.e. non-random, conditions (e.g. machine, material, manpower, etc.)) + Randomness
Assumption: everything except randomness remains constant.
Now I take counts of daily produced items multiple times over the span of summer and winter.
We know that randomness such as the above loosely follows a Normal distribution (I'm not saying how or why here). And a normal distribution is defined by its parameters: mean and variance.
Thus the above problem boils down to:
H_Null: Mean(Winter) = Mean(Summer), i.e. Mean(Winter) - Mean(Summer) = 0
H_Alternate: Mean(Winter) < Mean(Summer), i.e. Mean(Winter) - Mean(Summer) < 0
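A standard way to run this comparison is a one-sided two-sample t-test. The sketch below uses simulated daily production counts; the means, spread, and sample sizes are invented for illustration, and with real data you would plug in your own measurements:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Invented daily production counts, for illustration only.
summer = rng.normal(loc=500, scale=20, size=90)  # 90 summer days
winter = rng.normal(loc=485, scale=20, size=90)  # 90 winter days

# H_Null: Mean(Winter) = Mean(Summer); H_Alternate: Mean(Winter) < Mean(Summer).
t_stat, p_value = stats.ttest_ind(winter, summer, alternative="less")

print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")
if p_value < 0.01:
    print("Evidence discredits H_Null: winter productivity is lower.")
else:
    print("Fail to reject H_Null: the data are consistent with equal productivity.")
```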

Estimating the parameters:
To get the 'exact' value of the mean (or of any parameter), you would theoretically need an infinite number of data points. To come back to reality, we observe a few data points and estimate the parameter with some uncertainty.
With respect to the estimate and its attached uncertainty, we measure the 'probability of observing the data in hand given that the Null hypothesis is true'. Hence, in our case, after noting the data, we ask: 'If productivity in winter really were the same as in summer, what is the chance that I would get this data?'. If that probability is quite high, you have no choice but to accept H_Null, i.e. productivity doesn't vary. But suppose the probability comes out too low to be realistic (say < 0.01): the chance of getting this data given that H_Null is true is very low. Unrealistic. Yet I did get this data. Totally real. That means the 'given' condition I assumed, that 'H_Null is true', is not realistic. Hence, with the evidence of the data, we discredit the null hypothesis and accept the alternate one.
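One intuitive way to compute that chance without any distributional formula is a permutation test: if H_Null were true, the 'winter' and 'summer' labels would be interchangeable, so we shuffle them many times and count how often a mean difference as extreme as the observed one shows up by pure chance. A minimal sketch, reusing the simulated winter/summer arrays from the t-test snippet above:

```python
import numpy as np

def permutation_p_value(winter, summer, n_shuffles=10_000, seed=1):
    """Estimate P(difference as extreme as observed | H_Null) by shuffling labels."""
    rng = np.random.default_rng(seed)
    observed = winter.mean() - summer.mean()  # negative if winter is lower
    pooled = np.concatenate([winter, summer])
    count = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        fake_winter = pooled[:len(winter)]
        fake_summer = pooled[len(winter):]
        if fake_winter.mean() - fake_summer.mean() <= observed:
            count += 1
    return count / n_shuffles

# p = permutation_p_value(winter, summer)  # winter/summer from the t-test sketch
```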

Henceforth, whenever you make a claim, i.e. put forward your 'hypo'-thesis, check whether it can be discredited by the recipient on the basis of data.

By the way, the probability on the basis of which we accept or reject H_Null is called the p-value.

Now, please tell me the solution to the graph problem we discussed in the first paragraph.
