I was asked to illustrate how outliers can affect the standard sample correlation coefficient and to show how robust measures of correlation (association) can help when the analysis needs to be automated. The post may be of interest to readers with little background in statistics or data analysis.

Outliers and the correlation coefficient

Let’s begin with a dataset showing highway death rates (per 100 million vehicle miles of travel) and maximum speed limits in 10 countries, as published by Rivkin (1986). The data likely date from the time when the United States had a nationwide maximum speed limit of 55 miles per hour.

cs<-c("Norway","United States","Finland","Britain","Denmark",
      "Canada","Japan","Australia","Netherlands","Italy")
dr<-c(3,3.3,3.4,3.5,4.1,4.3,4.7,4.9,5.1,6.1)
sl<-c(55,55,55,70,55,60,55,60,60,75)
df<-data.frame(death_rate=dr,speed_limit=sl)
rownames(df)<-cs
df
##               death_rate speed_limit
## Norway               3.0          55
## United States        3.3          55
## Finland              3.4          55
## Britain              3.5          70
## Denmark              4.1          55
## Canada               4.3          60
## Japan                4.7          55
## Australia            4.9          60
## Netherlands          5.1          60
## Italy                6.1          75

Among the questions that can be asked here are whether there is a correlation between speed limit and death rate and whether lower speed limits reduce the highway death rate.

It can be shown that the sample correlation coefficient for death rate and speed limit is 0.55

with(df,cor.test(death_rate,speed_limit))
## 
##  Pearson's product-moment correlation
## 
## data:  death_rate and speed_limit
## t = 1.8546, df = 8, p-value = 0.1008
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1241625  0.8756458
## sample estimates:
##     cor 
## 0.54833

but if Italy alone is removed, it drops to 0.098

with(df[-10,],cor.test(death_rate,speed_limit)) #or df[rownames(df)!="Italy",] 
## 
##  Pearson's product-moment correlation
## 
## data:  death_rate and speed_limit
## t = 0.26013, df = 7, p-value = 0.8022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6056285  0.7154765
## sample estimates:
##        cor 
## 0.09784921

and if Britain is then removed, it jumps to 0.70

with(df[-c(4,10),],cor.test(death_rate,speed_limit))
## 
##  Pearson's product-moment correlation
## 
## data:  death_rate and speed_limit
## t = 2.3869, df = 6, p-value = 0.05425
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.01332986  0.94019351
## sample estimates:
##       cor 
## 0.6978986

By removing Britain alone, one can push the sample correlation as high as 0.81, and unlike in all the cases above, the coefficient is statistically significant:

with(df[-4,],cor.test(death_rate,speed_limit)) 
## 
##  Pearson's product-moment correlation
## 
## data:  death_rate and speed_limit
## t = 3.7102, df = 7, p-value = 0.007553
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3267378 0.9594924
## sample estimates:
##       cor 
## 0.8141862

The above shows that outliers can easily deflate or inflate the sample correlation coefficient. Usually, an outlier that is consistent with the trend of the vast majority of the data will inflate the correlation (see the bottom-right quadrant of Fig. 1 below), and an outlier that is not consistent with the rest of the data can substantially decrease the correlation (see the top-right quadrant).

par(mfrow=c(2,2))
plot(df,main="all ten countries present: r=0.55")
plot(df[-10,],main="only Italy removed: r=0.098")
plot(df[-c(4,10),],main="Italy and Britain removed: r=0.70")
plot(df[-4,],main="only Britain removed: r=0.81")
Figure 1. The effect of outliers on the sample correlation
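
The effect can also be checked programmatically. As a minimal sketch (my own addition, not part of the original analysis), a leave-one-out loop recomputes the coefficient with each country dropped in turn, making influential observations easy to spot:

# leave-one-out check: recompute r with each country dropped in turn
loo<-sapply(seq_len(nrow(df)),function(i)
  with(df[-i,],cor(death_rate,speed_limit)))
names(loo)<-rownames(df)
round(loo,3)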

This high sensitivity of the sample correlation coefficient to outliers is well known and may complicate an analysis that otherwise could easily be automated. There are generally two strategies here: 1) remove the outliers prior to computing the standard correlation coefficient, and 2) use measures of correlation and association that are robust to outliers. Of course, the two strategies can always be combined, but when the analysis needs to be automated, using the robust measures alone would often be the preferred approach.
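
As a rough sketch of the first strategy (my own illustration: the cut-off of 3 is an arbitrary choice, and univariate screening like this can miss points that are unusual only in the bivariate sense), one could drop observations with large MAD-based robust z-scores before computing Pearson's r:

# strategy 1, sketched: drop points whose MAD-based z-score exceeds 3
rz<-function(x) abs(x-median(x))/mad(x)
keep<-with(df,rz(death_rate)<3 & rz(speed_limit)<3)
with(df[keep,],cor(death_rate,speed_limit))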

Robust alternatives

It will be convenient to introduce the robust alternatives to Pearson’s product-moment correlation coefficient using another dataset, one that gives average brain and body weights for 28 species of land animals. The R package MASS has a number of datasets useful for demonstrating various aspects of statistical methods; to see them all, just type data(package="MASS").

require(MASS)
par(mfrow=c(1,2),mar=c(2,4,2,2))
plot(Animals$body,Animals$brain)
plot(Animals$body,Animals$brain,log="xy")
Figure 2. Linear and log-log plots of the data

The linear plot on the left is not very informative because of a few extreme observations (the dinosaurs and the elephants), but the log-log plot shows a rather strong correlation between the two weights, which can be readily quantified with R’s cor.test() function. Without resorting to attach(), there are three common ways to apply the function to data:

cor.test(Animals$body,Animals$brain)
with(Animals, cor.test(body, brain))
cor.test(~body+brain,data=Animals)

The last approach uses the so-called formula interface, for which the function has a registered S3 method. All three give

## 
##  Pearson's product-moment correlation
## 
## data:  body and brain
## t = -0.027235, df = 26, p-value = 0.9785
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3776655  0.3684700
## sample estimates:
##          cor 
## -0.005341163

showing practically zero correlation, thereby contradicting the relationship clearly visible on the log-log plot. The contradiction can be resolved by using measures of correlation (association) other than Pearson’s. The cor.test() function can also compute Spearman’s rank correlation rho and Kendall’s rank correlation tau; all that is required is to override the default value of the method parameter:

cor.test(~body+brain,data=Animals, method ="spearman",exact=FALSE)
## 
##  Spearman's rank correlation rho
## 
## data:  body and brain
## S = 1036.6, p-value = 1.813e-05
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.7162994

and

cor.test(~body+brain,data=Animals, method ="kendall", exact=FALSE)
## 
##  Kendall's rank correlation tau
## 
## data:  body and brain
## z = 4.6042, p-value = 4.141e-06
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.6172191

The two results are now consistent with what the log-log plot suggests. Instead of setting exact=FALSE, one could wrap the calls in suppressWarnings() to silence the warnings about ties in the observations; ties prevent the calculation of exact p-values (so only approximations are returned), an issue of little interest to us here.
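
Returning to the speed-limit data, the rank-based measures also illustrate the robustness claim. As a minimal sketch (my own addition), one can recompute the correlation with Spearman’s rho across the same subsets used earlier and observe that it varies far less than Pearson’s r:

# Spearman's rho on the full data and on the same subsets as before
with(df,cor(death_rate,speed_limit,method="spearman"))
with(df[-10,],cor(death_rate,speed_limit,method="spearman"))
with(df[-c(4,10),],cor(death_rate,speed_limit,method="spearman"))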

Conclusion

To conclude, outliers can distort the output of classical statistical methods to the extent that wrong conclusions may be drawn. There are generally two approaches to remedy this problem: detect and then exclude the outliers from an analysis conducted with standard methods, or use methods that are robust to outliers and other influential observations. A number of robust methods exist, for example, for regression modelling, of which the calculation of sample correlation coefficients can be considered a special case. The second approach may often be preferred for automated solutions, as correctly detecting and removing outliers is in most cases difficult.
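
For instance, as a minimal sketch (my own illustration, not from the analysis above), a robust regression of the Animals data on the log-log scale can be fitted with rlm() from MASS, which downweights observations with large residuals instead of requiring them to be removed by hand:

require(MASS)
# robust fit of log brain weight on log body weight; points with large
# residuals (such as the dinosaurs) are downweighted rather than removed
fit<-rlm(log(brain)~log(body),data=Animals)
summary(fit)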

References

Rivkin, D. J. (1986). Fifty-five mph speed limit is no safety guarantee. New York Times (Letters to the Editor), 25, p. 26.
