Outliers In Statistics

In today’s lecture, Jason mentioned that in statistics, whenever we (and by we, I mean people who do statistics) encounter points that don’t seem to fit the remainder of the data set, we simply ignore them. I know that this is a bit of a generalisation, but it really got me thinking about how outliers (and influential points) are treated in statistics.

From the statistics courses I have done, it actually depended on the lecturer’s view on leaving out such points. Often we will use a t-test to check whether a point (or group of points) is significant at the 5% level, but even if we find it not to be significant, we will often leave the point(s) in, because we might otherwise be excluding a meaningful part of the data set (and often we can’t afford to exclude data because the sample size is small). My lecturer for STAT3015 (Generalised Linear Modelling) hated the removal of data points simply because they did not seem to fit in with the remainder of the data set. He would even mark us down for removing data from a data set without justifying it well enough, and sometimes a purely statistical result was not a good enough reason. With a scientific data set, we have to be in communication with the scientist to determine whether there might be some hidden variable that explains the outliers/influential points, and to understand the design of the experiment and its underlying theory. Removing data points can remove vital parts of the story, and often the science might explain the apparent outliers/influential points.
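For readers who haven’t seen how a t-test gets used for this, one standard formalisation is the externally studentized residual test from regression diagnostics: refit the model without the suspect point, then ask whether that point is surprising under the refitted model. Here is a minimal sketch with made-up data (the simple linear model and the injected outlier are purely illustrative, not anyone’s actual data or method):

```python
import numpy as np
from scipy import stats

# Made-up data: y roughly linear in x, with one suspicious point injected.
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 2.0 * x + rng.normal(0, 1, 20)
y[10] += 8.0  # inject a candidate outlier

def studentized_outlier_test(x, y, i, alpha=0.05):
    """Externally studentized residual test for point i in a simple
    linear regression: refit without point i, then test whether y[i]
    is surprising under that fit at level alpha."""
    mask = np.ones(len(x), dtype=bool)
    mask[i] = False
    # Fit y = b0 + b1*x on the remaining points.
    X = np.column_stack([np.ones(mask.sum()), x[mask]])
    beta, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
    resid = y[mask] - X @ beta
    n, p = mask.sum(), 2
    s2 = resid @ resid / (n - p)  # residual variance estimate
    # Variance of the prediction error for a new observation at x[i].
    xi = np.array([1.0, x[i]])
    var_pred = s2 * (1.0 + xi @ np.linalg.inv(X.T @ X) @ xi)
    t_stat = (y[i] - xi @ beta) / np.sqrt(var_pred)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - p)
    return t_stat, abs(t_stat) > t_crit

t_stat, is_outlier = studentized_outlier_test(x, y, 10)
```

Note this only makes sense if the point to test was chosen in advance; scanning all points for the largest residual needs a multiplicity correction, which is one reason (as Jason notes below) that blindly t-testing outliers is dubious.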

Sorry if I misunderstood you, Jason - I just didn’t want people to get the idea that statisticians are a homogeneous rather than a heterogeneous group of people. There are many different ways in which statisticians tackle data analysis.

Sally Jones


Great points. I especially like your point that statistical methods depend on what you call “the underlying theory of the experiment” … although remember that in this course we’re mostly talking about observational data, not experimental data, which causes huge problems for some theories of statistics (but not for mine, bwa ha ha).

Yes, statisticians are heterogeneous. Especially in a good maths department like the one at the ANU; less so at some other universities and in applied science departments (in my experience).

Did someone really suggest using a t-test to decide whether to treat data points as outliers? That’s certainly not how t-tests were originally meant to be used. Hm. (I take your point that the lecturer in STAT3015 wasn’t saying you should always do that.)

See Statistics Manuscript if you want to read another 500 pages on this!

Jason


Great topic! Yeah, my experience is that you can never trust a statistic right away. I learned that no statistic has absolute meaning in itself – it is relative, depending on many factors. For example, you need to know how it was made (what algorithms were used, and which outliers were ignored and why) before you know what that particular statistic actually says. We have a saying in German – I don’t know if it exists in English, but translated literally it means: “Don’t trust any statistic that you haven’t faked yourself”. Of course that’s harsh, but I think it expresses very well how carefully you have to treat statistics.

During my astronomy labs last year I had to gather lots of data points (we had to do both experimental statistics and observational statistics). Our professor told us we should NEVER leave out any data points, as doing so would alter the whole meaning of our results. Nevertheless, reality showed that everybody just modified their results so that they looked the way they were supposed to look. My lab partner had already finished a maths degree, and he applied lots of tricks to get the result he wanted. So that’s just one example of the relativity of statistics.

On the other hand, I also used Student’s t-test last summer when I did an internship at the ÖWF (Austrian Space Forum). We needed to evaluate the data from our experiment (“Statistical evaluation of contamination vectors during EVAs”). Doing that, we realised that every dataset contained outliers. BUT we didn’t just ignore them. We tried to find an algorithm – a way to treat every dataset equally – that allowed us to leave them out. It was very important to us that we could justify, both to ourselves and in the bachelor thesis I eventually wrote about the experiment, WHY and HOW (= the algorithm) we could neglect those outliers. That’s another example – this time of a statistic that you could make sense of afterwards.

For those who are interested and can read German: http://blog.oewf.org/2011/09/labormarathons-fur-die-astrobiologie/

Julia Heuritsch

That’s great. Thanks, Julia.

Also, you get an extra mark for the first use of an umlaut this year.

Jason

Thanks for the link Julia! My German is a bit rusty (Ich habe vier Jahre Deutsch gelernt – I learned German for four years) but I think I got the general idea from that page :) And true, statistics can be manipulated so easily. One of my psychology friends has been telling me about a renowned psychologist who fudged his statistics on purpose so that they would tell the story he wanted. Statistics should be taken with a grain of salt.

And what was that expression about statistics in German? German has some great expressions - “Warum ist die Banane krumm?” (lit. “Why is the banana bent?”)

And that statistics manuscript looks interesting. I didn’t do the statistical inference course here at the ANU, but I’m sure I would learn something from it :P

Sally Jones


The German phrase is: “Traue keiner Statistik, die du nicht selbst gefälscht hast.” (“Trust no statistic that you haven’t faked yourself.”) - And actually I found out that it is said to be a quotation from the British politician Winston Churchill…

Julia Heuritsch


Ooh, what’s the statistical inference course?

Jason

The course is “Statistical Inference” (STAT3013) with Steven Stern and it runs in semester 2. http://studyat.anu.edu.au/courses/STAT3013;details.html :)

Sally Jones


Thanks.

Jason


Yes, when it comes to statistics it is very easy to distort the truth, as Sally has shown with the example from psychology. Although, in psychology’s defence, outliers are not simply ignored but are examined in the results, as they are important for the whole picture.

So I really think that the application of statistics varies from one discipline to another, and this must be kept in mind when examining any data. It may also be that the person who analysed the data employed an inappropriate method, and that is why the results do not add up.

This is not to say, however, that statistics should not be used, but rather that we need to examine the detail of statistical data thoroughly.

Bernadette


Yes, good point, thanks. Part of the variation in statistical methodology is due to different traditions in different disciplines. I’m not sure it’s very rational, though; often it’s just down to historical accidents – especially, who happened to be influential in each discipline when it first started using statistics.

Jason

Sorry Bernadette. I didn’t intend to bag psychology out there. The point I was trying to make is that scientists (from any discipline) are often affected by confirmation bias - playing with the statistics to get the results they want and ignoring any other possibility.

One extreme example that springs to mind is Sir Ronald Fisher (this story has been told many times in my statistics classes), a renowned figure in statistics and evolutionary biology. Nowadays it is known that smoking causes lung cancer, but Fisher strongly believed that it did not. The statistics Fisher provided in support of his hypothesis could not be verified. His confirmation bias is said to have stemmed from being a smoker himself and from a strong dislike of being proven wrong. Confirmation bias can be very polarising, as it closes the mind off to other possibilities.

Anyway, there are many issues with how people may use statistics, but at the same time it can be very useful as well (as long as it is used right) :P

Sally Jones


Yes, but nobody agrees about what “used right” means! I’m afraid it’s not just a matter of people who “know statistics” versus people who don’t. See my book, or I can recommend things by other people if you’re interested. (Even things by Fisher, or equally things by Fisher’s worst enemies, which make the same point!)

Jason
