Outliers And Statistics

Hi Jason,

It was an interesting point you made about statistics and outliers - and certainly one that rings true with the use of statistics in mainstream media etc.

However, one of my former lecturers (Jochen Trumpf - ENGN2226, Systems Analysis) would disagree with the suggestion that statistics is curve-fitting, and that outliers are dismissed and left out of models.

First, statistics is actually reasonably robust from an epistemology perspective. It states that you can never actually know the underlying distribution of a set of data - but you can get a sample of that data, albeit one distorted by measurement noise, that can provide a representation of the underlying distribution to a certain confidence level. I think both epistemology and quantum physics agree with statistics in the idea that you can’t take an observation without interfering with the actual state.

So measurement noise distorts every datum taken - the question is how much. There are a few basic rules used to analyse potential outliers.

  1. Usually, if you have a big set of data, you can estimate the mean and standard deviation of the noise reasonably accurately (law of large numbers). So you have one standard for judging potential outliers: a point might be more than 3 standard deviations away from the mean, and if the noise is roughly normal, about 99.7% of measurements should fall within that range. The threshold can vary depending on the tolerance required.

  2. It doesn’t fit the qualitative shape of the data.

  3. There’s a physical explanation for how this outlying point may have come about. This matters particularly in statistical process control, where the aim is to be able to conclude that something is going wrong with the process that produces the data.

If and only if all three criteria are satisfied, you can call it an outlier.
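
As a rough illustration of the first criterion only, here is a minimal Python sketch of the 3-standard-deviation check. The function name `flag_three_sigma`, the parameter `k`, and the sample data are all made up for the example, and it assumes the noise is roughly normal. It only marks candidates; criteria 2 and 3 still need human judgement.

```python
from statistics import mean, stdev

def flag_three_sigma(values, k=3.0):
    """Return the values lying more than k sample standard deviations
    from the sample mean (hypothetical helper, criterion 1 only)."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation; needs at least 2 values
    if s == 0:
        return []      # all values identical, nothing stands out
    return [x for x in values if abs(x - m) > k * s]

# Made-up data: 50 readings close to 10.0, plus one suspicious reading.
values = [10.0 + 0.1 * (i % 5 - 2) for i in range(50)] + [25.0]
flagged = flag_three_sigma(values)
print(flagged)
```

Note that the suspicious point itself inflates both the mean and the standard deviation, so with a small data set this check can miss a point that looks obviously wrong by eye. That is one more reason the rule is stated for big data sets.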

Of course, in reality, curve fitting goes on all the time!

Cheers, Ed McDonald


“statistics … states that …”

There are many competing theories of statistics. They disagree on fundamental principles, and on details about which calculations you should do and how. Different statisticians get very different answers from the same set of data. Or even the same statistician in different moods. Sorry — I know I keep saying this sort of thing — but it’s true.

I like the three criteria you mention above. It’s interesting (and good!) that you state it for a big set of data. In some fields (e.g. most types of psychology) data sets tend to be tiny.

You might be interested in this book I’m writing: Statistics Manuscript. It doesn’t have anything specifically about outliers, but it does talk about some of the other things you mention.

Oh, and by the way, I’m often very critical of statisticians, but not of statisticians who are careful about what they do, and the ANU has particularly excellent statisticians, so don’t take any of my criticisms to necessarily apply to people here.

Jason

orpeth.com