Correlation
Interpreting the "Correlation Coefficient"
("r" & "r²")
Definition:
"Correlation" is a quantitative index, a standard statistical measurement of
the degree of relationship or association between two sets of numbers (variables) to
describe how closely they track or are related to one another. The notion does not
necessarily imply causation since no direction of influence is known or can be assumed.
In fact, often both variables are "caused" by some other
independent variable(s) not being measured. The concept has been in wide use
across the sciences since the 1880s, when it was first popularized by Sir Francis Galton,
who used it to explain the relationship between children's heights and those of their parents.
It is perhaps the most widely used single analytic procedure in the behavioral
sciences.
Explanation:
Correlation is most commonly measured by a "Pearson Product Moment Correlation
Coefficient" (normally referred to in shorthand by the symbol "r").
It is calculated as a number ranging between -1.00 and +1.00. A measure of
+1.00 represents a perfect positive correlation, indicating that the two sets
of numbers form an identical pattern. (One example of a +1.00 correlation
coefficient might be a comparison of two sets of numbers where one set represents the
inches in height of a group of individuals while another set represents the centimeters in
height of the same group of individuals.) A measure of -1.00 represents perfect
negative correlation, indicating that the two sets of numbers form a perfect inverse
relationship. (One example of a -1.00 correlation coefficient might be a comparison
of two sets of numbers where one set represents the total number of dollars in your bank
account(s) on each day of the month and the other represents the cumulative number of dollars
spent from those accounts up to that day, assuming no deposits are made.) It is rare to find
correlations of +1.00 or -1.00 in social or educational research. A correlation of 0.00
means there is no linear relationship whatever between the variables. Pearson's
correlation coefficient is named after Karl Pearson, who further developed the concept and
its formal calculation in the 1890s, following the pioneering work of Galton.
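As a rough illustration of the calculation described above, the following Python sketch computes "r" from paired observations using the usual sums of deviations from the means. The function name pearson_r and the height values are hypothetical, chosen only to mirror the inches-and-centimeters example.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation for two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must be paired observations of equal length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two pairs of observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a variable with no variation has no defined correlation")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical heights in inches, and the same heights in centimeters.
heights_in = [60.0, 62.5, 65.0, 68.0, 71.5, 74.0]
heights_cm = [h * 2.54 for h in heights_in]

r_same = pearson_r(heights_in, heights_cm)
# Reversing the direction of one variable flips the sign of the coefficient.
r_inverse = pearson_r(heights_in, [-h for h in heights_cm])

print(round(r_same, 6), round(r_inverse, 6))
```

Apart from floating-point rounding, the two results should sit at the extremes of the scale: the first at the positive end and the second at the negative end.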
Limitations & Characteristics:
"Correlation" assumes that there is a linear relationship between the two
sets of numbers. If the relationship is curvilinear, "r"
will give misleading readings that substantially underestimate the strength of the
relationship.
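A small sketch may make the curvilinear case concrete. In the made-up data below, each y value is exactly the square of its x value, so the two variables are perfectly related, yet the linear coefficient comes out at essentially zero. The helper pearson_r and the numbers are illustrative assumptions, not part of the original handout.

```python
import math

def pearson_r(xs, ys):
    """Pearson "r" for paired observations (illustrative helper)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y is fully determined by x, but along a U-shaped curve.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curve = pearson_r(xs, ys)
print(f"r for the U-shaped data: {r_curve:.2f}")
```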
An easy way to check whether the relationship is linear is to plot a scatter
diagram and see if the "points" scatter in a more or less linear
direction. On a scatter diagram, the sign of the coefficient reflects the direction
(upward or downward) of the general pattern of points plotted, and its size reflects the
width of the ellipse that encloses those points. The narrower the ellipse, the stronger
the relationship and hence the larger the magnitude, or absolute value, of the coefficient.
Some analysts advise removing any "outlier" cases from consideration and
treating them a priori as aberrations, so that they do not bias the relationship remaining
among the more "normal" cases.
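To show why outliers attract this advice, here is a minimal sketch with invented numbers: five cases with no linear relationship, followed by the same five cases plus one extreme case. The helper pearson_r and all data values are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson "r" for paired observations (illustrative helper)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Five hypothetical "normal" cases with no linear trend.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 3, 1, 2]
r_without = pearson_r(xs, ys)

# The same cases plus a single extreme case far from the rest.
r_with = pearson_r(xs + [20], ys + [20])

print(f"without outlier: {r_without:.2f}   with outlier: {r_with:.2f}")
```

A single extreme case is enough here to move the coefficient from about zero into the "high" range, which is why a scatter diagram is worth inspecting before the coefficient is trusted.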
The two variables being correlated must always be paired observations for the same set
of individuals or objects, such as height and weight of a single group of individuals.
Each case to be included must be represented by a value in each variable.
The variables being correlated must be measured on an interval or ratio scale.
Categorical data cannot be properly measured with this tool. A
"point-biserial correlation coefficient" ("rpb") can
be used to compare interval or ratio data in one variable to dichotomous (two-category) data
in the other. [Other tools are available to measure the relationship between two nominal
variables.]
The homogeneity of the group can affect the correlations. If a group is
sufficiently homogeneous on either or both variables, the variance of that variable will
tend toward zero. In the extreme case, one would be, in effect, dividing by zero and the
formula becomes meaningless, because the variable will have been reduced to a constant. In other
words, there must be enough variation or heterogeneity in the scores to allow a
relationship to manifest itself.
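The divide-by-zero problem can be seen directly in a short sketch. The function below returns None, rather than a number, when either variable is constant. Its name, the choice to return None, and the sample scores are all assumptions made for illustration.

```python
import math

def pearson_r_or_none(xs, ys):
    """Return Pearson "r", or None when either variable has no variation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    denominator = math.sqrt(sxx * syy)
    if denominator == 0:
        # One variable is a constant, so the formula would divide by zero.
        return None
    return sxy / denominator

# Hypothetical group in which every member has the same test score.
test_scores = [70, 70, 70, 70, 70]
hours_studied = [2, 5, 1, 8, 4]

r_constant = pearson_r_or_none(test_scores, hours_studied)
print(r_constant)
```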
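While the number of observations used in the calculation does not influence the value
of the coefficient, it does affect how accurately the coefficient estimates the
relationship: a coefficient based on only a few observations is a less dependable estimate.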
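One common way to express this accuracy, not taken from the handout itself, is an approximate confidence interval based on Fisher's z transformation. The sketch below assumes that method and a 95% level; the function name fisher_interval and the sample sizes are hypothetical.

```python
import math

def fisher_interval(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a correlation (Fisher z method)."""
    if n <= 3:
        raise ValueError("the approximation needs more than three observations")
    z = math.atanh(r)             # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)   # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# The same coefficient of .50, observed in samples of different sizes.
for n in (10, 100, 1000):
    low, high = fisher_interval(0.50, n)
    print(f"n={n:4d}: r = .50, interval roughly {low:.2f} to {high:.2f}")
```

The coefficient is the same in each row, but the interval around it narrows as the number of observations grows.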
Typical Interpretation:
One old, classic interpretation of "r" uses five
easy "rules of thumb" to answer the question "When is a correlation
coefficient 'high' and when is it 'low'?" as follows:
"r" ranging from zero to about .20 may be regarded as indicating no or
negligible correlation.
"r" ranging from about .20 to .40 may be regarded as indicating a low
degree of correlation.
"r" ranging from about .40 to .60 may be regarded as indicating a
moderate degree of correlation.
"r" ranging from about .60 to .80 may be regarded as indicating a marked
degree of correlation.
"r" ranging from about .80 to 1.00 may be regarded as indicating high
correlation.
[A. Franzblau (1958), A Primer of Statistics for Non-Statisticians, Harcourt, Brace
& World. (Chap. 7)] Italics in original.
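The five rules of thumb above can be written as a small lookup, sketched here in Python. Two details are assumptions, since the source gives only approximate ranges: the rule is applied to the absolute value of "r", and a value falling exactly on a boundary is placed in the higher band. The function name franzblau_label is hypothetical.

```python
def franzblau_label(r):
    """Map a correlation coefficient to Franzblau's rule-of-thumb label."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1.00 and +1.00")
    size = abs(r)  # assumption: the rule uses the magnitude and ignores the sign
    # Assumption: each boundary value is assigned to the higher band.
    if size < 0.20:
        return "no or negligible"
    if size < 0.40:
        return "low"
    if size < 0.60:
        return "moderate"
    if size < 0.80:
        return "marked"
    return "high"

for r in (0.10, -0.35, 0.50, 0.72, -0.90):
    print(f"r = {r:+.2f}: {franzblau_label(r)} correlation")
```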
Other more recent scholars explain, simply, "as a rule of thumb, we can say that
correlations of less than .30 indicate little if any relationship between the
variables."
[See:
Hinkle, Wiersma, & Jurs (1988), Applied Statistics for the Behavioral Sciences,
2nd ed., Houghton Mifflin Co.]
Advanced Interpretation:
A more precise interpretation arising from the correlation coefficient is recommended
by some statisticians and requires one further calculation. If the "r" is
multiplied by itself, or "squared," the product, commonly known as "r²"
(read "r square"), will indicate approximately the percent of the variation in the
"dependent" variable that is associated with the "independent"
variable. Technically, "r²" is called the
"coefficient of determination." Thus, for example, a correlation
coefficient ("r") of .50 would yield a coefficient of
determination ("r²") of .25, so that in such a case
25% of the variation in the dependent variable might be considered as being associated
with the variation in the independent variable. The coefficient of determination
indicates the proportion of "shared" variance between the variables,
irrespective of causality. It is the proportion of the variance in one variable
associated with the variance of the other variable.
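The arithmetic, and the "shared variance" reading of it, can be checked with a short sketch. The first two lines repeat the .50 example from the text. The rest uses invented data and assumes a simple least-squares line to show that the squared coefficient matches the share of variation in one variable accounted for by the other.

```python
import math

# The worked example from the text: r = .50 gives a squared value of .25.
r_example = 0.50
r_squared_example = r_example ** 2

# Hypothetical paired data used to check the "shared variance" reading.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 1, 4, 3, 6, 5]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

r = sxy / math.sqrt(sxx * syy)

# Least-squares line predicting y from x.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
residual_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Share of the variation in y that the line accounts for.
explained_share = 1 - residual_ss / syy

print(f"r = {r:.3f}, r squared = {r ** 2:.3f}, explained share = {explained_share:.3f}")
```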
Predictive Validity:
With regard to predicting a dependent variable from the values of an independent
variable, Franzblau says "the relationship between the size of a correlation
coefficient and its predictive value is not a directly proportional one. The lower
correlation figures are of almost no value in prediction; the moderate ones are only
slightly better; the marked coefficients are somewhat but not very much better. Only
as we advance into the high correlation range do the predictive values rise to usable
levels... Coefficients below .40 do not yield a guess even 10% better than chance.
To yield a prediction which is 25% better than a chance or random guess, the
correlation must be at least .66; to be 50% better than chance, a correlation of at least
.86 is needed; to be 75% better than chance, the coefficient must rise as high as
.97." (ibid. p. 88).
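The quotation does not give the formula behind its figures. They appear consistent with the quantity 1 - sqrt(1 - r²), sometimes called the index of forecasting efficiency; treating that as the basis of Franzblau's numbers is an assumption, and the function name below is hypothetical.

```python
import math

def improvement_over_chance(r):
    """Proportional reduction in prediction error, 1 - sqrt(1 - r**2)."""
    return 1.0 - math.sqrt(1.0 - r ** 2)

# Coefficients mentioned in the quotation.
for r in (0.40, 0.66, 0.86, 0.97):
    print(f"r = {r:.2f}: about {improvement_over_chance(r):.0%} better than chance")
```

Under this assumption the results land close to the quoted thresholds of 25%, 50%, and 75%, and a coefficient of .40 stays below 10%.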
Savannah State University
Office of Institutional Research & Planning
Summer, 2002