Variable sampling plan and Z score

Michael C.H. Wang · Published in GLInB · Jun 24, 2022 · 7 min read

Foreword

In previous posts, we discussed some analogies between ML and QA through an overhaul of the acceptance plan and the classifier: both are tools for decision making, so risks of misjudgment exist for each. We also covered the evaluation metrics developed for them, such as the OC curve for sampling plans and the ROC curve for classification, how to create a ROC curve by walking through the test results on a chart of true positive rate versus false positive rate, and how the relative location between the binary dataset populations and the decision threshold decides the success of a classifier. Readers can refer to this great Medium post for more detailed information about the ROC curve.

Finding Donors: Classification Project With PySpark | by Victor Roman | Towards Data Science

A recap of the attribute sampling plan

Now we return to the QA side, where we can create an OC curve with the steps below:

  1. Decide the AQL and its associated alpha risk (usually 5%);
  2. Decide the required LTPD and the accepted beta risk;
  3. Draw the OC curve by solving the formulae.

The result is the sample size n and the non-conforming number c that satisfy the beta risk requirement. The beta risk reduces as the number n increases (like moving from G-II to G-III in MIL-STD-105). A small search sketch follows the link below.

Source: https://testscience.org/plan-a-test/plan-a-reliability-test/construct-an-operating-curve/
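As a hedged sketch of step 3 (the function name find_plan and the search bound are my own; the risk points are the ones used later in this post), the following Python snippet searches for the smallest attribute plan (n, c) whose OC curve passes through both risk points:

```python
# Search for the smallest attribute plan (n, c) whose OC curve satisfies
# the alpha risk at the AQL and the beta risk at the LTPD.
from scipy.stats import binom

AQL, alpha = 0.01, 0.05   # producer's risk point
LTPD, beta = 0.046, 0.10  # consumer's risk point

def find_plan(max_n=2000):
    for n in range(1, max_n + 1):
        for c in range(n + 1):
            pa_aql = binom.cdf(c, n, AQL)    # P(accept) when p = AQL
            pa_ltpd = binom.cdf(c, n, LTPD)  # P(accept) when p = LTPD
            if pa_aql >= 1 - alpha and pa_ltpd <= beta:
                return n, c
    return None

print(find_plan())  # roughly n ≈ 170, c = 4 for these risk points
```

Note how large n is here; we will see below that a variables plan hits the same risk points with only about 21 samples.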

Variable sampling plan MIL-STD-414

So far, the OC curves we have discussed all relate to discrete or binary data, like OK/NG, True/False, etc. What if the sampling result is continuous or variable data? How do we judge the risk of the defect level from the sampled measurement results? That is what the specification "Variable sampling plan MIL-STD-414" solves.

Since there is no target non-conforming number c for variable data, the analogy to the attribute sampling plan changes: we find the sample size n together with either the so-called acceptance constant k (k-method) or the maximum allowable portion defective M (M-method) for comparison. From the terms themselves, it may not be obvious what "constant" or "portion" means, but like the non-conforming number c they are all dimensionless. Another important property, different from bare binary data, is that measured variable data usually form a symmetric "bell shape" distribution, which brings more information: where the distribution is located, how it spreads, and, together with the tolerance, how the defects are defined.

To achieve the same operating characteristic, a variables sampling plan therefore requires fewer samples than an attribute plan. One disadvantage of variables sampling plans is that they are based on the assumption that the measurements are normally distributed (at least for the plans available in published tables or through pre-written software).

One more issue needs to be clarified. Usually the specification is two-sided, so the result can be smaller than the lower limit, larger than the upper limit, or the lot can violate both sides simultaneously. Because of this non-linear bell-shape characteristic, a linear operator is not feasible, so the k-method cannot be used in situations where an upper and a lower limit exist at the same time. We can understand this better when we explore the k-method in the following sections.

For the mathematical derivation of the k-method, I pay tribute to the great book "An Introduction to Acceptance Sampling and SPC with R" by John Lawson; the material below comes from there. Please see the reference at https://bookdown.org/lawson/an_introduction_to_acceptance_sampling_and_spc_with_r26/variables-sampling-plans.html, where you can also find the details of the M-method. Here we only take the k-method as an example.

The k-method example starts like the attribute sampling plan, by formulating the alpha and beta risks for a continuous probability distribution as equations (3.1) and (3.2) below; the set of equations is then solved simultaneously to get the k and n corresponding to the specified alpha and beta risks.

[Figures: k-method sampling illustration and formulae (3.1) to (3.7) from Lawson's book]
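The formula images are not reproduced here, but the two results needed below can be restated. This is my reconstruction, consistent with the worked numbers in this post and using Φ⁻¹ for the standard normal quantile function; Lawson's exact notation may differ:

$$ n = \left( \frac{\Phi^{-1}(1-\alpha) + \Phi^{-1}(1-\beta)}{\Phi^{-1}(1-\mathrm{AQL}) - \Phi^{-1}(1-\mathrm{RQL})} \right)^{2} \quad (3.7) $$

$$ k = \Phi^{-1}(1-\mathrm{AQL}) - \frac{\Phi^{-1}(1-\alpha)}{\sqrt{n}} \quad (3.5) $$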

For example, in a case where AQL = 1% (0.01), RQL = 4.6% (0.046), α = 0.05 and β = 0.10, using formula (3.7) we get n = 20.8162 (how? see below), and using formula (3.5) with n rounded up to 21 we get k = 1.967411.
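These values can be reproduced with a few lines of Python (a sketch based on the reconstructed formulas above):

```python
# Compute the known-sigma variables plan (n, k) from the two risk points.
import math
from scipy.stats import norm

AQL, alpha = 0.01, 0.05
RQL, beta = 0.046, 0.10

z_aql = norm.ppf(1 - AQL)    # ≈ 2.3263
z_rql = norm.ppf(1 - RQL)    # ≈ 1.6849
z_a = norm.ppf(1 - alpha)    # ≈ 1.6449
z_b = norm.ppf(1 - beta)     # ≈ 1.2816

n = ((z_a + z_b) / (z_aql - z_rql)) ** 2   # formula (3.7)
k = z_aql - z_a / math.sqrt(math.ceil(n))  # formula (3.5), n rounded up

print(round(n, 4), round(k, 6))  # ≈ 20.8163 and ≈ 1.967413
```

Compare this n of about 21 with the attribute plan's n of roughly 170 earlier: the extra information carried by variable data pays off.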

Thus, conducting the sampling plan on a lot of material consists of the following steps:

  1. Take a random sample of n items from the lot
  2. Measure the critical characteristic x on each sampled item
  3. Calculate the mean measurement x̄
  4. Compare (x̄ − LSL)/σ to the acceptance constant k = 1.967411
  5. If (x̄ − LSL)/σ > k, accept the lot; otherwise reject the lot (see the sketch below).
[Table: type k, letter code B single variable sampling plan]
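Here is a minimal decision sketch for step 5, assuming a lower spec limit only and a known lot σ; the LSL, σ and measurement values are hypothetical, for illustration only:

```python
# k-method lot decision for a single lower spec limit with known sigma.
import numpy as np

LSL, sigma, k = 10.0, 0.5, 1.967411  # LSL and sigma are hypothetical

# Hypothetical n = 21 measurements of the critical characteristic x.
sample = np.array([11.2, 11.0, 10.9, 11.3, 11.1, 10.8, 11.0,
                   11.2, 11.1, 10.9, 11.0, 11.2, 11.3, 10.9,
                   11.1, 11.0, 10.8, 11.2, 11.1, 11.0, 10.9])

xbar = sample.mean()
z_stat = (xbar - LSL) / sigma  # ≈ 2.10 for this data
print("accept" if z_stat > k else "reject")  # 2.10 > k, so: accept
```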

What is a Z-score?

In the above section there is a question we haven't answered yet: what is the variable Z that was introduced but never explicitly specified? We used Z = (x̄ − μAQL)/(σ/√n) when computing k. In fact, we usually define Z = (x − μ)/σ, called the Z-score, where σ comes from the population. A very similar statistic for sampling distributions is the t-score from Student's t-distribution; the Z used in (3.4) above is of this sampling-distribution kind, with σ/√n as the standard deviation of the sample mean.

Z-score normalization refers to the process of normalizing every value in a dataset such that the mean of all of the values is 0 and the standard deviation is 1.

Z-score example by stats.zscore()

There are several ways to compute a Z-score. For example, we can use the Python SciPy library's stats.zscore() method, as in the example below.

[Figure: Python code example for stats.zscore(), with a histogram drawn by plt.hist(a)]
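Since the original code image is not reproduced here, the following is a sketch of what such an example looks like; the dataset a is hypothetical:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical dataset; the original post used its own example values.
a = np.random.default_rng(0).normal(loc=50, scale=5, size=200)

z = stats.zscore(a)       # elementwise (a - a.mean()) / a.std()
print(z.mean(), z.std())  # ≈ 0 and 1 after normalization

plt.hist(a)               # histogram of the raw data, as in the post
plt.show()
```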

Z scores by Z table

Alternatively, we can use a Z table, which is widely available on the internet. Z tables usually come in pairs: a negative Z-score table (see below) and a positive Z-score table. Which one to use depends on whether the Z-scores of interest sit on the left or the right side of the distribution mean.

Source: Z TABLE — Z TABLE

Example: Negative Z score table

The negative Z-score table covers the values on the left side of the mean, hence they are marked with negative scores. The values in the table represent the area under the bell curve to the left of z, which is no more than 0.5 (the positive Z-score table covers areas between 0.5 and 1). Based on the value of the area, the sign and first two digits of the Z value are found in the leftmost column, and the corresponding third digit along the top row.
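We can cross-check a few table entries with SciPy (a quick verification sketch):

```python
from scipy.stats import norm

# Area under the curve to the left of z, as read from the negative table.
for z in (-2.33, -1.68, -1.64):
    print(z, round(norm.cdf(z), 4))  # ≈ 0.0099, 0.0465, 0.0505

# Inverse lookup: which z leaves 1% of the area to its left?
print(round(norm.ppf(0.01), 2))      # -2.33
```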

And we can verify the same case where AQL = 1% (0.01), RQL = 4.6% (0.046), α = 0.05 and β = 0.10 by finding the corresponding Z values and calculating n with (3.7). Except for Z|1−β = Z|0.9 = 1.28, which comes from the positive Z-score table, we can find the other three individually below:

Z|0.01 = −2.33, Z|0.046 = −1.68, and Z|0.05 = −1.64, so n = ((−1.64 − 1.28)/(−2.33 − (−1.68)))² = (−2.92/−0.65)² ≈ 20.18, which we round up to the integer 21.

Source: negativeztable.png

SAS distribution functions to compute the defect rate from a Z-score

SAS can compute the cumulative probability of an observation from several different distributions of random variables. For example, the function 'probnorm(z)' computes the probability that an observation from the standard normal distribution is less than or equal to the observed value 'z'. If, for example, z = 1.96, then the value returned by probnorm(1.96) is approximately 0.975. If you want the probability of observing a z-statistic greater than z = 1.96, it equals:

1 − probnorm(1.96) = 1 − 0.975 = 0.025.

This is a ONE-SIDED p-value. If you want a TWO-SIDED p-value, then for the standard normal distribution, you would double the one-sided value. In other words,

2*(1 − probnorm(1.96)) = 2*(1 − 0.975) = 2 * 0.025 = 0.05.

[Figure: SAS code example for probnorm and its output]
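For readers without SAS, the same calculation in Python, where norm.cdf plays the role of probnorm:

```python
# Python equivalent of SAS probnorm(): the standard normal CDF.
from scipy.stats import norm

z = 1.96
p_le = norm.cdf(z)          # probnorm(1.96) ≈ 0.975
one_sided = 1 - p_le        # P(Z > 1.96) ≈ 0.025
two_sided = 2 * (1 - p_le)  # ≈ 0.05

print(round(p_le, 3), round(one_sided, 3), round(two_sided, 3))
```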

The mission of GLInB is to bring the most value through virtualization of your supply chain quality function, to fit the challenges of today's business environment.

Please visit us at http://www.glinb.com for more information.
