I was asked to illustrate how outliers can affect the standard sample correlation coefficient, and to show how robust measures of correlation (association) can help when the analysis needs to be automated. The post may be of interest to readers with little background in statistics or data analysis.
Outliers and the correlation coefficient
Let’s begin with a dataset of highway death rates (per 100 million vehicle miles of travel) and maximum speed limits in 10 countries, published by Rivkin (1986). The data likely date from the time when the United States had a nationwide maximum speed limit of 55 miles per hour.
Among the questions that can be asked here are whether there is a correlation between speed limit and death rate and whether lower speed limits reduce the highway death rate.
It can be shown that the sample correlation coefficient between death rate and speed limit is 0.55. If Italy alone is removed, it drops to 0.098, and if Britain is then removed as well, it jumps to 0.70. By removing only Britain, one can push the sample correlation as high as 0.81, and, unlike in all the cases above, the coefficient is statistically significant.
The above shows that outliers can easily deflate or inflate the sample correlation coefficient. Usually, an outlier that is consistent with the trend of the vast majority of the data will inflate the correlation (see the bottom-right quadrant of Fig. 1 below), and an outlier that is not consistent with the rest of the data can substantially decrease the correlation (see the top-right quadrant).
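Both effects are easy to reproduce with simulated data. The snippet below uses synthetic numbers (not the Rivkin data): it adds a single trend-consistent outlier to an otherwise uncorrelated cloud, and a single inconsistent outlier to a strongly correlated one:

```r
set.seed(1)

## An uncorrelated cloud plus one outlier consistent with a trend:
## the single extra point inflates the correlation dramatically.
x <- rnorm(20)
y <- rnorm(20)
cor(x, y)                    # typically near zero
cor(c(x, 10), c(y, 10))      # pulled strongly towards 1

## A strongly correlated cloud plus one inconsistent outlier:
## the single extra point deflates the correlation.
u <- 1:20
v <- u + rnorm(20, sd = 1)
cor(u, v)                    # close to 1
cor(c(u, 20), c(v, -20))     # substantially reduced
```

One stray point out of twenty-one is enough to move the coefficient by several tenths in either direction.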
This high sensitivity of the sample correlation coefficient to outliers is well known and may complicate an analysis that could otherwise easily be automated. There are generally two strategies: 1) remove the outliers before applying the standard correlation coefficient, or 2) use measures of correlation and association that are robust to outliers. Of course, the two strategies can be combined, but when the analysis needs to be automated, using the robust measures alone is often the preferred approach.
It will be convenient to introduce robust alternatives to Pearson’s product-moment correlation coefficient using another dataset, one giving the average brain and body weights of 28 species of land animals. The R package MASS has a number of datasets useful for demonstrating various aspects of statistical methods; to see them all, just type data(package = "MASS").
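Assuming the dataset in question is Animals from MASS (28 species, body weight in kilograms, brain weight in grams), it can be loaded and plotted on both linear and log-log scales as follows:

```r
library(MASS)                  # provides the Animals dataset
data(Animals)

op <- par(mfrow = c(1, 2))
plot(brain ~ body, data = Animals,
     main = "Linear scale")    # dominated by a few huge animals
plot(brain ~ body, data = Animals, log = "xy",
     main = "Log-log scale")   # a clear linear trend appears
par(op)
```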
The linear plot on the left is not very informative because of a few extreme observations (most notably the dinosaurs, whose enormous bodies pair with small brains), but the log-log plot shows a rather strong correlation between the two weights, which can be readily quantified with R’s cor.test() function. Without resorting to attach(), there are three common ways to use the function with data:
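Again assuming the Animals data frame from MASS, the three calls might look like:

```r
library(MASS)

## 1) extract the columns explicitly
cor.test(Animals$body, Animals$brain)

## 2) evaluate the call inside the data frame with with()
with(Animals, cor.test(body, brain))

## 3) use the formula interface
cor.test(~ body + brain, data = Animals)
```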
The last approach uses the so-called formula interface, for which the function has a registered S3 method. All three calls give the same result, showing practically zero correlation and thereby contradicting the relationship clearly present on the log-log plot. The contradiction can be resolved by using measures of correlation (association) other than Pearson’s. The cor.test() function can also compute Spearman’s rank correlation rho and Kendall’s rank correlation tau; all that is required is overriding the default value of its method parameter:
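With the assumed Animals data, the rank-based variants might be computed as:

```r
library(MASS)

## Spearman's rho and Kendall's tau; exact = FALSE requests
## approximate p-values, avoiding the warning caused by ties
cor.test(~ body + brain, data = Animals,
         method = "spearman", exact = FALSE)
cor.test(~ body + brain, data = Animals,
         method = "kendall", exact = FALSE)
```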
The two results are now consistent with what the log-log plot suggests. The effect of exact = FALSE could also be achieved with the suppressWarnings() function, which drops the warnings about ties in the observations; ties prevent the calculation of exact p-values (so only approximations are returned), an issue that is of little interest to us here.
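For completeness, the suppressWarnings() alternative, again assuming the Animals data, would be:

```r
library(MASS)

## Hide the ties warning instead of requesting the
## approximate p-value explicitly via exact = FALSE
suppressWarnings(
  cor.test(~ body + brain, data = Animals, method = "spearman")
)
```

The difference is cosmetic: in both cases an approximate p-value is returned; suppressWarnings() merely hides the message saying so.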
To conclude, outliers can distort the output of classical statistical methods to the extent that wrong conclusions are drawn. There are generally two remedies: detect and exclude the outliers before applying standard methods, or use methods that are robust to outliers and other influential observations. A number of robust methods exist, for example for regression modelling, of which the calculation of sample correlation coefficients can be considered a special case. The second approach is often preferable for automated solutions, as correctly detecting and removing outliers is in most cases difficult.
Rivkin, D. J. (1986). Fifty-five mph speed limit is no safety guarantee. New York Times (Letters to the Editor), 25, p. 26.