Outliers are values that don’t “fit in” with the rest of the data. These extreme values are commonly considered a nuisance when we seek to summarize the data with our descriptive statistics. This article will show how to turn these nuisances into useful information.
The earliest statistical tests were ones for detecting outliers. The idea was that by deleting the outliers, we could compute “better” descriptive statistics for our data. As a result, we have generations of statisticians who have been taught to remove outliers prior to their analysis. After all, the theoretical underpinnings of our statistical computations don’t tell us how to deal with outliers. When outliers contaminate our statistics, they distort the model used to describe the data. Therefore, we commonly remove the outliers to polish up the data so we can obtain useful and appropriate models.
Our example will use the 100 values written in columns in Figure 1. These are the values obtained in the weekly weighings of a 10-gram standard known as NB10 at the National Bureau of Standards during 1963 and 1964. The values have been coded by dropping the first four digits so that a weight of 9.999591 grams is recorded as 591 micrograms.
The average for these 100 weighings is 595.41 micrograms, and the standard deviation statistic is 6.47 micrograms.
Figure 1: 100 weighings of NB10
So we have a known standard being weighed to a millionth of a gram by a master scale at the Bureau of Standards. Measurements rarely get any better than this, yet the histogram in Figure 2 shows some potential outliers.
Figure 2. Histogram for 100 NB10 values
Due to the nature of these data, the variation has to be measurement error, and due to the central limit theorem, the only reasonable model for measurement error is a normal distribution, so we might use our statistics to fit a normal curve to these data. A normal distribution with a mean of 595.4 micrograms and a standard deviation of 6.47 micrograms is shown with the histogram in Figure 3.
Figure 3. Histogram for 100 NB10 values with fitted normal distribution
Here the fitted model doesn’t fit. It doesn’t rise up enough to meet the central mound, nor does it stretch out enough to cover the tails of the histogram. The extreme values have inflated the standard deviation statistic and contaminated the model we’ve fitted to these data.
If we trim off the three highest values and the four lowest values, we get an average of 595.63 micrograms and a standard deviation statistic of 3.74 micrograms. Using these values, we get the model shown in Figure 4, which does a much better job of fitting the histogram of the central 93 values.
Figure 4. Histogram of 93 NB10 values with fitted normal distribution
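The trimming computation described above is easy to sketch in Python. The readings below are hypothetical stand-ins (the actual 100 NB10 values are in Figure 1), and this sketch trims the two highest and two lowest values rather than the three and four used in the article:

```python
import statistics

# Hypothetical readings standing in for the NB10 data (Figure 1).
values = [598, 595, 592, 600, 596, 594, 612, 597, 593, 580,
          595, 599, 596, 594, 598, 591, 597, 595, 602, 588]

# Descriptive statistics for the full data set.
full_mean = statistics.mean(values)
full_sd = statistics.stdev(values)

# Trim the extremes (here the two highest and two lowest values)
# and recompute. The extremes inflate the standard deviation far
# more than they shift the average.
trimmed = sorted(values)[2:-2]
trimmed_mean = statistics.mean(trimmed)
trimmed_sd = statistics.stdev(trimmed)
```

As with the NB10 values, the trimmed average barely moves while the trimmed standard deviation drops noticeably, which is exactly why trimming makes the fitted curve look better.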
Now we can feel good about these data. We’ve packaged them up neatly with a probability model that we can use in answering questions about this process. Unfortunately, this model doesn’t describe the past, nor does it predict the future. It’s simply the result of some mathematical manipulations.
When analyzing data, it’s the data that are the facts. Whatever we do, our analysis has to be reconciled with the data. And when we start deleting outliers to improve our model, we’re leaving the data behind and entering the realm of make-believe. When we add the outliers back into Figure 4, we see how our “improved” model still doesn’t fit these data.
Figure 5. Histogram of 100 NB10 values with the “improved” normal distribution
So here we have a standard that’s kept in a controlled atmosphere under glass jars when it’s not being weighed. This standard is repeatedly measured by the same two people using a master scale maintained by the Bureau of Standards. The variation in these readings has to be pure measurement error. If there ever was a dataset we should be able to model, this is it. Carl Friedrich Gauss and Pierre-Simon Laplace proved that the normal distribution is the correct distribution for measurement error, so we know we’re using the correct model. Yet we can’t get a model that fits the whole data set, whether we delete the outliers or use them in our computations.
The problem
The problem is in the assumption that we can fit a model to these data. All probability models are limiting conditions for an infinite sequence of independent and identically distributed random variables.
When we attempt to fit a model to any set of data, we’re implicitly assuming those data are observations of identically distributed random variables. And when this assumption is correct, our data will be homogeneous.
“When do we make this assumption?”
Whenever we use the descriptive statistics as the basis for fitting a model, or for some other statistical inference, we’re making an assumption of homogeneity for the original data. And outliers undermine this assumption of homogeneity. As we’ve seen above, outliers make a mess out of our efforts to fit a model to our data.
“Well, what can we do?”
Since the data, along with any outliers, are the reality with which we have to work, we need to change our approach. Rather than trying to fit a model or perform some statistical inference, we need to start with the question of whether the data are homogeneous.
Homogeneity
The best way to examine the data for homogeneity is to use a process behavior chart. Here, we’ll use a chart for individual values and moving ranges (an XmR chart). Computing limits based on the average of 595.43 and a median moving range of 4.0 micrograms, we get the XmR chart shown in Figure 6.
Figure 6. XmR chart for NB10 weights
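The limits in Figure 6 follow the standard computation for an X chart based on the median moving range: the natural process limits are the average plus or minus 3.145 times the median moving range, and the upper range limit for the mR chart is 3.865 times the median moving range. A minimal sketch using the summary values quoted above (average 595.43, median moving range 4.0):

```python
def xmr_limits_median(x_bar, median_mr):
    """Natural process limits for an X chart computed from the
    median moving range (scaling factor 3.145), plus the upper
    range limit for the mR chart (scaling factor 3.865)."""
    return {
        "X UNPL": x_bar + 3.145 * median_mr,  # upper natural process limit
        "X LNPL": x_bar - 3.145 * median_mr,  # lower natural process limit
        "mR URL": 3.865 * median_mr,          # upper range limit
    }

# Summary statistics from the text: average 595.43 micrograms,
# median moving range 4.0 micrograms.
limits = xmr_limits_median(595.43, 4.0)
# X limits of roughly 582.85 to 608.01 micrograms
```

Points outside these limits, like Weeks 36, 63, 85–88, and 94, are the outliers the chart flags as signals rather than noise.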
Looking back over this two-year period, we see that this measurement process at the Bureau of Standards was operated inconsistently. Week 36 is the first outlier. Week 63 is another. Weeks 85–88 show a high outlier followed by three low outliers. And Week 94 was way outside the limits. Additional runs starting in weeks 15 and 55 confirm the inconsistent nature of these measurements.
During weeks 1–35, the weights averaged 597.7 micrograms. After the first upset in weeks 37–62, the weights averaged 594.2 micrograms, 3.5 micrograms less than before. This difference is twice the difference we might have expected due to measurement errors alone. This process not only had momentary upsets, but it also suffered systematic changes.
So once again, the outliers mess up the story and complicate things. But this is because the underlying reality is messy. It’s better to tell a messy truth than a neat lie.
Outliers are pure gold
While outliers tend to complicate the story we seek to tell about the data, they also provide opportunities to change that story. The outliers and systematic changes that they herald will have a cause. Moreover, in order for this cause to create an outlier, it has to have an effect that’s large enough to rise up above the effects of all the other causes that affect your process. Thus, it’s reasonable to interpret outliers as signals of the presence of some dominant cause-and-effect relationship. As Walter Shewhart said, the outliers are too large to be attributed to chance. For this reason, they shouldn’t be dismissed as being of no consequence.
If we want to identify any such dominant cause-and-effect relationships that affect our process, then, as Aristotle taught, we should examine those points at which the process changes. As we discover these dominant causes and take steps to remove their effects from our process, our process will operate more predictably with fewer outliers and with less variation.
This age-old idea of studying the change points has a proven track record. It’s not about to become obsolete. But you have to have a way to identify the change points, and nothing does this better than a process behavior chart.
So, before you throw the data into the computer and start your statistical analysis, pause to listen to the voice of your process by using a process behavior chart. You can learn a lot by listening.
Good limits from bad data
“But don’t the outliers mess up the limits?”
Only to a slight extent. They don’t inflate the limits in the same way, or to the same extent, that they inflate the descriptive statistics. The computations for a process behavior chart are robust so that the technique can remain sensitive. They allow us to get good limits from bad data.
Figure 6 shows limits based on the median moving range of 4.0. More commonly, the average moving range is used with an XmR chart. For our data, the average moving range is 5.73 micrograms, resulting in the XmR chart of Figure 7.
Figure 7. XmR chart for NB10 weights using average moving range
Here, we still find six of the seven outliers and one of the runs beyond one sigma that we found in Figure 6. So while some of the details change, the overall story remains the same with either version of the chart.
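For limits based on the average moving range, the scaling factor is 2.660 rather than the 3.145 used with the median moving range. A short sketch comparing the two sets of limits, using the article’s summary statistics (average 595.43, average moving range 5.73, median moving range 4.0):

```python
X_BAR = 595.43  # average of the 100 weighings, in micrograms

# Limits from the average moving range (scaling factor 2.660).
avg_mr = 5.73
avg_limits = (X_BAR - 2.660 * avg_mr, X_BAR + 2.660 * avg_mr)

# Limits from the median moving range (scaling factor 3.145).
med_mr = 4.0
med_limits = (X_BAR - 3.145 * med_mr, X_BAR + 3.145 * med_mr)

# The median-based limits are tighter here because the large
# moving ranges created by the outliers inflate the average
# moving range more than they inflate the median.
assert med_limits[1] - med_limits[0] < avg_limits[1] - avg_limits[0]
```

The average-based limits work out to roughly 580.19 to 610.67 micrograms, slightly wider than the median-based limits of Figure 6, which is why Figure 7 flags one fewer outlier.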
While we generally prefer to use the average moving range because it’s more efficient, using a median moving range is appropriate when very large ranges inflate the average moving range.
A guideline on when to shift from using an average moving range to using a median moving range is to use the median when two-thirds or more of the moving ranges fall below the average. Here, 68 of the 99 moving ranges are smaller than the average moving range value of 5.73. So, by using the median moving range in Figure 6, we reduced the impact of the outliers upon the limits, resulting in a more sensitive chart.
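This guideline is straightforward to check in code. The moving ranges below are hypothetical, chosen so that a few large values inflate the average moving range, mimicking the NB10 situation:

```python
def prefer_median_mr(moving_ranges):
    """Guideline from the text: when two-thirds or more of the
    moving ranges fall below the average moving range, the
    median moving range gives more robust limits."""
    avg_mr = sum(moving_ranges) / len(moving_ranges)
    below = sum(1 for r in moving_ranges if r < avg_mr)
    return below / len(moving_ranges) >= 2 / 3

# Hypothetical moving ranges: mostly small, with two large
# values that pull the average up.
mrs = [2, 3, 1, 4, 2, 3, 2, 25, 3, 2, 1, 30]
print(prefer_median_mr(mrs))  # the large ranges trip the guideline
```

For the NB10 data, 68 of the 99 moving ranges fall below the average moving range, so the guideline points to the median moving range used in Figure 6.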
Summary
The whole operation of deleting outliers to obtain a better fit between our model and the data is based on computations that implicitly assume the data are homogeneous. When you have outliers, this assumption becomes questionable.
As soon as the assumption of homogeneity comes into question, any action which seeks to eliminate or minimize the effects of outliers is an action that raises questions about the integrity of the analysis. All such actions have to be justified by the context for the data. They can’t be justified on mathematical grounds alone.
“Are these data homogeneous?” must be the first question of any analysis. Process behavior charts provide the easiest way to address this question. Hence, any analysis that doesn’t begin by organizing the data in some rational manner and placing those data on a process behavior chart is inherently flawed.
Shewhart gave us an operational definition of how to transform outliers from a nuisance into pure gold. They signal process changes and upsets. Aristotle told us to study the changes and upsets to gain new knowledge. Generations of users of process behavior charts have proven that this approach works. The only question is whether you’ll learn how to use the wisdom of the ages to mine the gold.