If you’ve read the first two parts of this tale, you know it started when I published a post that involved transforming data for capability analysis. When an astute reader asked why Minitab didn’t seem to transform the data outside of the capability analysis, it revealed an oversight that invalidated the original analysis.
I removed the errant post. But to my surprise, John Borneman, the reader who helped me discover my error, continued looking at the original data. “I do have a day job, but I’m a data geek,” he explained to me. “Plus, doing this type of analysis ultimately helps me analyze data found in my real work!”
I want to share what Borneman did, because it’s a great example of how you can take an analysis that doesn’t work, ask a few more questions, and end up with an analysis that does work.
…
Comments
Or you can just watch it run
It's probably worth emphasizing that you're dealing with winning times. At around 20 horses per Kentucky Derby, every five years we have 100 or so horses that are slower than Secretariat.
How might non-winning times impact the analysis? Sham's second-place finish in 1973 is one of the fastest times in the Kentucky Derby; it just happened to come against Secretariat that day (and yes, non-winning times are not generally available).
The Kentucky Derby is billed as the most exciting two minutes in sports. Secretariat became the first horse to finish in less than two minutes, and still holds the record time for each of the Triple Crown races. How many track and field or swimming records from 1973 hold today?
The slowest winning time (1970) in the Belmont Stakes was turned in by High Echelon in the mud. Still, it's interesting to see that the slowest and fastest winning times (over the current length) are separated by just three years. Special causes, anyone? ;) The range of winning times is 10 seconds, with Secretariat 2 seconds faster than the next-best winning time. It's worth noting that the Belmont Stakes was held on a different track (Aqueduct, I think) from 1963-1968.
Take a few minutes to find some clips of Secretariat winning the Triple Crown events. At times it looks as though his portion of the recording has been set to fast-forward.
NT3327
Great learning
Eston, thanks for sharing your oversight and learnings with all of us. I went through all three posts and waited each day for the next one; this was a thriller!
Your point is well noted. I have faced this issue myself, due to a system's inability to track decimal points in effort data for software development.
I'm a big fan and love your posts.
Thank you again.
Prashant
India
Data
Hello, interesting post. You definitely need to look closely at everything. The issue can be seen in the first part of this series: a rounding problem that is evident in the individuals control chart. You can see where rounding to the nearest second gives that chart a step-type appearance. That was clue 1.

Plus, you are not dealing with homogeneous data: the first 11 points are above the average, so that's a special cause. You probably should not include those in the data set. It would be nice if you shared the original data somehow.

I copied the Belmont winning times since 1929 from Wikipedia, which rounds them to the nearest 0.1 seconds. The first eight points are above the average on the individuals chart. Including that data gives a p-value of 0.16 for normality; removing those points gives a p-value of almost 0.5. I did not check the other two races.

Thanks for the post.

Bill
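For anyone who wants to reproduce Bill's checks outside Minitab, here is a minimal Python sketch (NumPy/SciPy) of the two steps he describes: individuals-chart limits estimated from the average moving range, and a normality test run with and without the initial run of points. The `times` series below is randomly generated as a placeholder, and Shapiro-Wilk stands in for Minitab's default Anderson-Darling test, so none of the output should be read as Bill's actual numbers.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for the Belmont winning times (seconds);
# substitute the real series copied from Wikipedia before reading
# anything into the output.
rng = np.random.default_rng(1)
times = np.round(148 + rng.normal(0, 1.5, size=60), 1)

# Individuals (I) chart: center line and 3-sigma limits estimated
# from the average moving range (d2 = 1.128 for subgroups of 2).
center = times.mean()
mr_bar = np.abs(np.diff(times)).mean()
sigma_hat = mr_bar / 1.128
ucl, lcl = center + 3 * sigma_hat, center - 3 * sigma_hat
print(f"CL={center:.2f}  LCL={lcl:.2f}  UCL={ucl:.2f}")

# Length of any initial run above the center line (the kind of
# non-homogeneity Bill describes with his "first 11 points").
run = int(np.argmax(times <= center))
print(f"initial run above the center line: {run} points")

# Normality with and without that early run. Shapiro-Wilk gives a
# p-value directly; Minitab uses Anderson-Darling, so the numbers
# will not match Bill's exactly.
print("all points:  p =", stats.shapiro(times).pvalue)
print("run removed: p =", stats.shapiro(times[run:]).pvalue)
```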
Lack of Discrimination
Here's another error. The times are not ordinal; they are still ratio data. Ordinal data implies that the distances between values are not meaningful or equal, and that is patently not true here. A lack of discrimination does not change the form of the data: 120 seconds is still twice as long as 60 seconds, no matter how many decimal places you record.
It is also important to point out that rounding data does not meaningfully change its standard deviation. You can run a quick simulation in Excel to prove that out. Yes, the graphs look more "digital," but it doesn't diminish the power of using the data to perform almost any analysis you choose.
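The commenter suggests Excel, but the same simulation is a few lines of Python. This sketch uses arbitrary made-up parameters (a normal distribution loosely resembling race times) and simply compares the standard deviation of raw values against the same values rounded to whole seconds and to tenths:

```python
import numpy as np

# Compare the standard deviation of raw data to the same data
# rounded to two different increments. Parameters are arbitrary.
rng = np.random.default_rng(42)
raw = rng.normal(loc=148.0, scale=2.5, size=100_000)

for step, label in [(1.0, "nearest second"), (0.1, "nearest 0.1 s")]:
    rounded = np.round(raw / step) * step   # round to the given increment
    print(f"{label}: sd(raw)={raw.std(ddof=1):.4f}  "
          f"sd(rounded)={rounded.std(ddof=1):.4f}")
```

The rounded standard deviations come out nearly identical to the raw one, consistent with the comment's point (rounding adds only a small quantization variance, roughly the increment squared over 12).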
It took me a while to figure out why there was an LSL. It could have been clearer that the reason you were running a capability analysis was not to calculate a capability index but to integrate underneath the curve, and hence your rationale for transforming the data (which is probably the only legitimate reason to transform when the data is significantly skewed).
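For readers who want to see the mechanics of that "integrate under the curve" idea, here is a rough Python sketch with made-up data and an arbitrary LSL standing in for Secretariat's time. It uses a Box-Cox transform as one common choice (the original post may have used a different transformation in Minitab), fits a normal to the transformed values, and computes the area below the transformed LSL:

```python
import numpy as np
from scipy import stats, special

# Hypothetical skewed stand-in for winning times (seconds) and an
# arbitrary lower spec limit representing the record time.
rng = np.random.default_rng(7)
data = 144 + rng.gamma(shape=2.0, scale=2.0, size=80)
lsl = 144.0

# Transform toward normality, applying the same transform to the LSL.
transformed, lam = stats.boxcox(data)   # lambda estimated by MLE
lsl_t = special.boxcox(lsl, lam)

# Fit a normal to the transformed data and integrate below the LSL:
# the area under the curve estimates P(time < LSL).
mu, sd = transformed.mean(), transformed.std(ddof=1)
p_below = stats.norm.cdf(lsl_t, loc=mu, scale=sd)
print(f"lambda={lam:.3f}  P(time < LSL) approx {p_below:.6f}")
```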