When industrial classes in statistical techniques began to be taught by those without degrees in statistics, it was inevitable that misunderstandings would abound and mythologies would proliferate. One of the things lost along the way was the secret foundation of statistical inference. This article will illustrate the importance of this overlooked foundation.
A naive approach to interpreting data is based on the idea that “Two numbers that are not the same are different!” With this approach, every value is exact and every change in value is interpreted as a signal. We only began to emerge from this stone-age approach to data analysis about 250 years ago, as scientists and engineers started measuring things repeatedly. As they did so, they discovered the problem of measurement error: Repeated measurements of the same thing would not yield the same result.
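A minimal simulation, not taken from the article, makes the point concrete: even when the thing being measured never changes, the readings do. The true value, error size, and number of readings below are arbitrary illustrative choices.

```python
import random

# Illustrative only: ten repeated measurements of a single item whose true
# value never changes, with a small random measurement error on each reading.
random.seed(1)
true_value = 25.00                      # the (fixed) quantity being measured
error_sd = 0.01                         # assumed measurement error, std. dev.

readings = [round(true_value + random.gauss(0, error_sd), 3) for _ in range(10)]

print(readings)                         # the ten readings are not all the same
print("spread:", min(readings), "to", max(readings))
```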
…
Comments
Do not pass go, do not collect $200
Great article. When I teach statistics, I first ask the class to rattle off every statistic and graph that they know of. I write them all on the board. And then, when we get done, I ask them which is the most important and what must be done first. Rarely will they guess an SPC/process behavior chart or time-series chart (I start there and then go forward). I make a huge ordeal of it to make sure it is drilled into their heads that they must plot the data serially and establish the idea of homogeneity (I don't call it that) before they do anything else. Do anything else, and they land on the "Go to Jail" square in Monopoly: "Do not pass Go, do not collect $200" until they have established homogeneity.
It infuriates me to see people coaching others to establish whether the data are normal at the outset, or whether they should transform the data to get an accurate baseline capability index. At the outset the likelihood is strong that there are multiple populations in the data, so of course it won't be normal. Duh. Wrong question, wrong time. So many problems are addressed simply by plotting the data serially. Thanks for the article!
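For readers who want to see what "plot the data serially and check for homogeneity" looks like in practice, here is a minimal sketch of the individuals (XmR) chart calculation. The data are made up, and the sketch uses the conventional screening constants for moving ranges of two (2.66 and 3.268); a real chart would also be plotted in time order.

```python
# Minimal XmR (individuals and moving range) sketch with made-up data.
# Natural process limits use the usual constants for moving ranges of two:
# X-bar +/- 2.66 * mR-bar, with an upper range limit of 3.268 * mR-bar.
data = [10.2, 9.8, 10.1, 10.4, 9.9, 10.0, 10.3, 12.5, 10.2, 9.7]

x_bar = sum(data) / len(data)
moving_ranges = [abs(b - a) for a, b in zip(data, data[1:])]
mr_bar = sum(moving_ranges) / len(moving_ranges)

unpl = x_bar + 2.66 * mr_bar        # upper natural process limit
lnpl = x_bar - 2.66 * mr_bar        # lower natural process limit
url = 3.268 * mr_bar                # upper range limit

print(f"natural process limits: {lnpl:.2f} to {unpl:.2f}")
for i, x in enumerate(data, start=1):
    if x > unpl or x < lnpl:
        print(f"point {i} ({x}) falls outside the limits: evidence of non-homogeneity")
for i, mr in enumerate(moving_ranges, start=2):
    if mr > url:
        print(f"moving range ending at point {i} ({mr:.1f}) exceeds the upper range limit")
```

The point of the exercise is the one made in the comment above: the chart asks whether the data behave as one homogeneous process before any further analysis is attempted.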
Homogeneity
A first-class paper from Don, as always. It is wonderful how he always finds new slants on the roots of quality, now some 80 years old.
I'd be interested in comments regarding the homogeneity of global temperature data. Despite the majority of temperature data over the past century being recorded to an accuracy of +/-0.5 deg C (http://www.srh.noaa.gov/ohx/dad/coop/EQUIPMENT.pdf), and over 90% of the data having measurement errors of more than 1.0 deg C (http://www.surfacestations.org/), it is claimed that global temperature is known to an accuracy of +/-0.001 deg C, based on P. Jones' paper (http://www.st-andrews.ac.uk/~rjsw/PalaeoPDFs/Jonesetal1997.pdf). It strikes me that if this were true, we could gain an accurate estimate of global temperature by having 7 billion people put their index fingers in the air. Intuitively, I would think that global temperatures are very non-homogeneous.
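The arithmetic behind such accuracy claims is worth spelling out. If n readings have independent errors with standard deviation sigma, the standard error of their average is sigma divided by the square root of n. The sketch below uses an illustrative sigma (not a figure from the Jones paper) to show how many independent readings that arithmetic requires, and why the result hinges on exactly the independence and homogeneity the comment questions.

```python
# Back-of-the-envelope only: how many independent readings are needed for
# the average to have a given standard error, assuming i.i.d. errors.
# The per-reading error below is illustrative, not taken from any paper.
sigma = 0.5       # deg C, assumed standard deviation of one reading's error
target = 0.001    # deg C, claimed accuracy of the average

n_needed = (sigma / target) ** 2
print(f"independent readings required: about {n_needed:,.0f}")   # ~250,000

# If the readings are not independent, or do not come from one homogeneous
# process, sigma / sqrt(n) understates the uncertainty of the average.
```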
Great Post: IID Assumptions
Don, this is a great contribution. The IID assumption is a big one when considering Shewhart's postulates from his 1931 book, especially number 2, which suggests ending the calculations of inference and beginning with the control chart instead:
Shewhart (1931) stated three postulates relating to control, which formed the rationale for the control chart:
Postulate 1 - All chance systems of causes are not alike in the sense that they enable us to predict the future in terms of the past.
Postulate 2 - Constant systems of chance causes do exist in nature (but not necessarily in a production process).
Postulate 3 - Assignable causes of variation may be found and eliminated.
As you know, based on these postulates, a process can be brought into a state of statistical control by finding assignable causes and eliminating them from the process.
The difficulty comes in judging from a set of data whether or not assignable causes are present. Thus, there is a need for the control chart. The examples were great.
Best regards,
Cliff Norman API
i.i.d. versus exchangeability
As usual, Don makes great points, and very clearly.
However, I'm not sure he does Shewhart justice. I.i.d. is a very troubled concept in this situation (see Barlow & Irony, 1992, "Foundations of Statistical Quality Control," or de Finetti, "Theory of Probability," vol. 1, p. 160). When we are collecting observations one by one, they can only be "independent" if we condition on the (unknown!) distribution from which they are drawn. That is because every observation yields additional information about the process and changes our expectation of its successor.
That is why Shewhart designed his charts to examine "exchangeability" rather than independence. Exchangeability is a more robust concept and forms the basis of very important and general theorems about making predictions from data, the representation theorems.
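A small numerical illustration of that distinction (mine, not Shewhart's or de Finetti's): in a Beta-Bernoulli mixture the observations are exchangeable, yet seeing the first outcome shifts the expectation of the second, so they are not independent.

```python
import random

# Exchangeable but not independent: draw an unknown success probability p
# from a Beta(2, 2) prior, then generate two 0/1 outcomes given that p.
# The pair is exchangeable, yet observing the first outcome changes the
# expectation of the second -- the point made in the comment above.
random.seed(0)
trials = 200_000
first_ones = second_ones = both_ones = 0

for _ in range(trials):
    p = random.betavariate(2, 2)
    x1 = 1 if random.random() < p else 0
    x2 = 1 if random.random() < p else 0
    first_ones += x1
    second_ones += x2
    both_ones += x1 * x2

print(f"P(X2=1)        ~ {second_ones / trials:.3f}")      # about 0.5
print(f"P(X2=1 | X1=1) ~ {both_ones / first_ones:.3f}")    # about 0.6
```

De Finetti's representation theorem says that any infinite exchangeable sequence of 0/1 outcomes can be represented as exactly this kind of mixture, which is what licenses predictions from past data.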
The concept of exchangeability comes from the third volume of W. E. Johnson's "Logic," published in 1924. I don't think Shewhart had read it, as it is not cited in SMVQC; Shewhart's work was independent.
Thanks for the References!
Cliff