A Brief Introduction to Correlation

By IDI Staff

The IDI Blog covers a broad spectrum of technical knowledge and ability.  Our topics range from introductory to advanced, pragmatic to philosophic.  This entry seeks to provide a quick refresher on correlation.

Correlation is an aspect of inferential statistics that seeks to determine whether or not a relationship exists between two or more variables.  For example, in examining output from a simulation of an industrial process with underlying random variables, do we observe a relationship between changes in queue capacity and process performance?

To explore this, the paired values can be plotted on a graph, designating one variable, such as queue capacity, as the x variable or independent variable, and the other variable, process performance, as the y variable or dependent variable. This graph is called a scatter plot.

After the scatter plot is drawn, the analyst examines it for a pattern. If there is a noticeable pattern, such as the points falling in an approximately straight line, then a relationship between the two variables may exist. The next step is to determine the strength of that relationship. To do this, the analyst calculates the correlation coefficient, r, which is computed from the paired data values and measures the strength of the linear relationship between the two variables.
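As a rough illustration of how r is computed from paired values, here is a minimal Python sketch of the standard Pearson formula. The function name pearson_r and the sample numbers are made up for this example; they are not taken from any real simulation.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for paired samples xs and ys."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two (x, y) pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Deviations of each value from its own mean
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sxy = sum(a * b for a, b in zip(dx, dy))  # co-variation of x and y
    sxx = sum(a * a for a in dx)              # variation of x
    syy = sum(b * b for b in dy)              # variation of y
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when either variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired observations (illustrative values only)
queue_capacity = [1, 2, 3, 4, 5]
performance = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson_r(queue_capacity, performance)
print(f"r = {r:.4f}")
```

Because the made-up points fall almost on an ascending straight line, the printed r comes out close to +1.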

There are several types of relationships that can exist between the x values and the y values.  These relationships can be identified by looking at the pattern of the points on the graphs.  Some example types of patterns are shown in a set of graphs.

A positive linear relationship exists when the points fall approximately in an ascending straight line from left to right; that is, as the x values increase the y values increase as well. A negative linear relationship exists when the points fall approximately in a descending straight line from left to right; that is, as the x values increase the y values decrease. An r close to zero means there is no discernible linear relationship, but it does not mean for certain that there is no relationship at all. You can have an r close to zero and still have a non-linear relationship.
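The point about non-linear relationships can be checked with a small Python sketch. The points below lie exactly on a unit circle, so x and y are tightly related, yet r comes out at essentially zero. The helper pearson_r and the choice of 360 evenly spaced points are assumptions made for this illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# 360 evenly spaced points on the unit circle x^2 + y^2 = 1
angles = [2 * math.pi * k / 360 for k in range(360)]
circle_x = [math.cos(a) for a in angles]
circle_y = [math.sin(a) for a in angles]

r_circle = pearson_r(circle_x, circle_y)
print(f"r for points on a circle: {r_circle:.6f}")
```

Every point satisfies the circle equation, but there is no straight-line trend for r to pick up.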

The correlation coefficient is calculated in such a way that when there is a strong positive linear relationship, its value will be close to positive one. When there exists a strong negative linear relationship, the value of the correlation coefficient will be close to negative one. When the value of the correlation coefficient is near zero, the linear relationship between the x variable and the y variable is weak or nonexistent. The value of r always lies between -1 and +1, inclusive.

OBTW, all of these graphs were generated in Excel using the NORMINV() and RAND() functions, using three linear equations and the equation describing a circle.
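For readers working outside Excel, a rough Python analog of NORMINV(RAND(), mean, standard_dev) is to pass a uniform random number through the inverse normal CDF. The sketch below adds that kind of noise to a straight line. The slope, intercept, noise level, and seed are hypothetical values chosen for illustration, not the ones used for the graphs in this entry.

```python
import random
from statistics import NormalDist

def norminv_rand(mean, sd, rng):
    """Rough analog of Excel's NORMINV(RAND(), mean, sd)."""
    u = rng.random()          # uniform on [0, 1), like RAND()
    while u == 0.0:           # inv_cdf needs 0 < p < 1, so redraw an exact zero
        u = rng.random()
    return NormalDist(mean, sd).inv_cdf(u)

rng = random.Random(42)       # fixed seed so the sketch is repeatable

# Hypothetical line y = 2x + 1 with normally distributed noise
slope, intercept, noise_sd = 2.0, 1.0, 0.5
xs = [i / 10 for i in range(100)]
ys = [slope * x + intercept + norminv_rand(0.0, noise_sd, rng) for x in xs]

for x, y in list(zip(xs, ys))[:5]:
    print(f"{x:.1f}\t{y:.3f}")
```

Raising noise_sd scatters the points further from the line, which pulls r away from +1.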

Given the reference to Excel, the reader might ask how to obtain correlation for a given set of data.  If your data exists in a table of x and y columns, then use CORREL().  The help menu will tell you that you need two arrays of identical dimensions, one for the x-values and one for the y.

If you want to look at it graphically, you won’t get r directly, but rather its cousin R-squared. For simple x, y data with a linear trend line this is, as you would expect, just the correlation coefficient squared. Right-click your data series and you will see an option to “Add Trendline…”. You can add a linear trend line and under Options select “Display R-squared”. Here are the same graphs now showing R-squared. Note that with random data, you will in practice never get r exactly equal to 1, -1, or 0.
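The link between R-squared and r can be checked numerically. The Python sketch below fits a least-squares straight line, computes R-squared as one minus the ratio of residual to total variation, and compares it with r squared. The function names and data values are assumptions made for this illustration; this is not a description of how Excel computes its trend line internally.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def linear_r_squared(xs, ys):
    """R-squared of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variation left over after the fit versus total variation in y
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Hypothetical data (illustrative values only)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
r_squared = linear_r_squared(xs, ys)
print(f"r = {r:.4f}, r^2 = {r * r:.4f}, R-squared = {r_squared:.4f}")
```

Note that squaring discards the sign, so R-squared alone cannot tell you whether the relationship is ascending or descending.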

This brief discussion does open up the larger question of what it means for things to be correlated. Does x cause y, does y cause x, or is there some hidden variable driving both?

For a funny webcomic addressing correlation versus causality – and much more – take a look at xkcd.com.

Reader beware: xkcd.com does occasionally cross over into mature topics.

Other web entries that pick up the discussion where we leave off:

http://www.socialresearchmethods.net/kb/statcorr.php

https://en.wikipedia.org/wiki/Correlation_and_dependence

And a recent news article that picks up causation vs correlation in a health care context:

http://www.huffingtonpost.com/dr-jonny-bowden/epidemiological-studies_b_3825141.html
