Old Dominion University


John Ritz








OTED635


Information Sheet and Assignment Number 8

Methods of Linear Correlation

Objectives

Upon completion of this package, you will be able to:

  1. Explain the difference between statistical correlations and true causality.
  2. Evaluate different sets of data as to the appropriate type of correlation needed for analyzing the data.
  3. Compute Pearson Product Moment Correlations (r).
  4. Compute Spearman Rank Order Correlation (rho).
  5. Interpret the results of coefficients of correlation.

Reading Assignments

    Urdan, pp. 79-92.
    Lang and Heiss, pp. 30-31.
    Materials presented in this package.
    Turney & Robb, pp. 95-100.
    Isaac & Michael, pp. 176, 177, 202, 204.
    Tuckman, pp. 276-279, 288-289.

Evaluation

  1. Cite the difference between statistical correlations and true causality. 
  2. When and why would you employ Pearson's r? 
  3. When and why would you employ Spearman's rho? 
  4. Compute Pearson's r for the following data:

    Subject   X    Y
    S1        1    7
    S2        3    4
    S3        5    13
    S4        7    16
    S5        9    10
    S6        11   22
    S7        13   19

  5. Compute Spearman's Rho for the following data:

    Q Rank   Leadership Rank
    1        4
    2        2
    3        9
    4        1
    5        7
    6        10
    7        8
    8        13
    9        5
    10       3
    11       11
    12       6
    13       12
    14       15
    15       14

  6. Explain what the results of problems 4 and 5 indicate to you and their use in research.

    1. Problem 4 - Pearson's r

        H1: Workers who score well on a safety instruction test will also score well on a job satisfaction survey.

    2. Problem 5 - Spearman's Rho

        H1: Students with high quality point averages while in college will display leadership qualities on the job.


METHODS OF RESEARCH IN EDUCATION

OTED 635
Information Sheet

Methods of Linear Correlation

LINEAR CORRELATION

     When we speak of correlation, we are interested in the relation between two or more variables. One frequent assumption in psychology is that two variables have a linear relationship. This means that the relationship between the two variables can be portrayed as a straight line. This package will introduce you to two methods of linear correlation.
     Coefficients of correlation measure the degree of relationship between variables in terms of the change in those variables. As the days get progressively colder, the number of insects left in the air becomes smaller and smaller. If we let the temperature on a given day be X, and the number of flying insects on a given day be Y, we can take measurements of both X and Y over a period of several days and see if there is any statistical relationship between these measurements.
     The two types of correlations we are going to discuss are used for relating two variables, one X and one Y. Our data will consist of pairs of numbers, an X measurement and a Y measurement.

PEARSON'S r

     The most commonly computed statistical coefficient of correlation is r. The basic requirement of this method of correlation is that both sets of measurements be at least interval data. We do not use Pearson's r with ordinal numbers.

COMPUTATION OF r

     The Pearson product-moment correlation (r) is used to determine if there is a relationship between two sets of paired numbers. Generally, the paired numbers are (1) two different measures on each of several objects or persons, or (2) one measure on each of several pairs of objects or people where the pairing is based on some natural relationship, such as father to son, or on some initial matching according to one specific variable, such as IQ score.
     Assume that an experimenter wishes to determine whether there is a relationship between the grade point averages (GPAs) and the scores on a reading-comprehension test of fifteen college freshmen.
     The basic computational formula for the Pearson product-moment correlation is:

    r = [N(ΣXY) - (ΣX)(ΣY)] / √{[N(ΣX²) - (ΣX)²][N(ΣY²) - (ΣY)²]}

where: N = number of pairs of scores

    ΣXY = sum of the products of the paired scores
    ΣX = sum of scores on one variable
    ΣY = sum of scores on the other variable
    ΣX² = sum of the squared scores on the X variable
    ΣY² = sum of the squared scores on the Y variable

Step 1.     The scores must be paired in some meaningful way in order to use the Pearson r. In the present example, the two different scores -- reading comprehension and grade point average -- are paired and recorded for each of the fifteen students:

Student   Reading Score (X)   Freshman GPA (Y)
S1        38                  2.1
S2        54                  2.9
S3        43                  3.0
S4        45                  2.3
S5        50                  2.6
S6        61                  3.7
S7        57                  3.2
S8        25                  1.3
S9        36                  1.8
S10       39                  2.5
S11       48                  3.4
S12       46                  2.5
S13       44                  3.4
S14       39                  2.6
S15       48                  3.3

Step 2.     Multiply the two numbers in each pair; then add the products.

(38 x 2.1) + (54 x 2.9) +...+ (48 x 3.3) = 1889.4

Step 3.     Multiply the number obtained in Step 2 by N, the number of paired scores (15 in this example).

1889.4 x 15 = 28341

Step 4.     Square each number in the first column, and add the squared values.

38² + 54² + ... + 48² = 31327

Step 5.     Multiply the sum of Step 4 by the number of paired scores (N = 15 in this example).

31,327 x 15 = 469905

Step 6.     Add all the scores in the first column (in this example, the reading-comprehension scores).

38 + 54 + ... + 48 = 673

Step 7.     Square the value obtained in Step 6.

673² = 452929

Step 8.     Square each number in the second column, and add the squared values. (Note: The sum needed in Step 10 can be obtained at the same time, just as the sum for Step 6 can be accumulated while carrying out Step 4.)

2.1² + 2.9² + ... + 3.3² = 116

Step 9.     Multiply the result of Step 8 by the number of paired scores (N = 15 in this example).

116 x 15 = 1740

Step 10.     Add all the scores in the second column (in this example, the freshman GPA's).

2.1 + 2.9 + ... + 3.3 = 40.6

Step 11.     Square the value obtained in Step 10.

40.6² = 1648.36

Step 12.     Multiply the final value of Step 6 by that of Step 10.

673 x 40.6 = 27323.8

Step 13.     The numerator of r is now computed by subtracting the value obtained in Step 12 from that obtained in Step 3.

28341 - 27323.8 = 1017.2

Step 14.     Subtract the value of Step 7 from that of Step 5.

469905 - 452929 = 16976

Step 15.     Subtract the final value of Step 11 from that of Step 9.

1740 - 1648.36 = 91.64

Step 16.     Multiply the result of Step 14 by the result of Step 15.

16976 x 91.64 = 1555680.6

Step 17.     Take the square root of the result of Step 16.

√1555680.6 ≈ 1247.27

Step 18.    Divide the value of Step 13 by that of Step 17. This yields the value of the Pearson product-moment correlation.

r = 1017.2 ÷ 1247.27 ≈ +0.82
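The eighteen steps above amount to the raw-score formula applied to the fifteen students. The short Python sketch below is our own illustration, not part of the original package; the function name pearson_r is an assumption:

```python
import math

# Reading scores (X) and freshman GPAs (Y) for the fifteen students.
x = [38, 54, 43, 45, 50, 61, 57, 25, 36, 39, 48, 46, 44, 39, 48]
y = [2.1, 2.9, 3.0, 2.3, 2.6, 3.7, 3.2, 1.3, 1.8, 2.5, 3.4, 2.5, 3.4, 2.6, 3.3]

def pearson_r(x, y):
    n = len(x)
    sum_xy = sum(a * b for a, b in zip(x, y))       # Step 2
    sum_x, sum_y = sum(x), sum(y)                   # Steps 6 and 10
    sum_x2 = sum(a * a for a in x)                  # Step 4
    sum_y2 = sum(b * b for b in y)                  # Step 8
    numerator = n * sum_xy - sum_x * sum_y          # Steps 3, 12, 13
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) *
                            (n * sum_y2 - sum_y ** 2))  # Steps 5, 7, 9, 11, 14-17
    return numerator / denominator                  # Step 18

print(round(pearson_r(x, y), 2))  # 0.82
```

Hand computation and the sketch agree: the numerator is 1017.2, the denominator about 1247.27, and r is about +0.82.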

SPEARMAN's Rho

     Spearman's Rho is a correlation coefficient built on the same structure as Pearson's r, only for use with ordinal data. It is very useful for correlating rankings and other ordinal data.

COMPUTATION OF Rho

     Spearman's rho is used when an experimenter wishes to determine whether two sets of rank-ordered data are related.
     The experimenter first asked an experienced teacher to rank twenty children in her class according to what she believed their intelligence to be. Then the children were tested and their actual Wechsler IQ scores obtained. The data were as follows:

Child   Teacher's Ranking   IQ Score      Child   Teacher's Ranking   IQ Score
A       1                   116           K       11                  116
B       2                   111           L       12                  109
C       3                   97            M       13                  103
D       4                   122           N       14                  103
E       5                   116           O       15                  96
F       6                   105           P       16                  90
G       7                   108           Q       17                  134
H       8                   95            R       18                  87
I       9                   124           S       19                  96
J       10                  98            T       20                  91

The following formula for rho (ρ) describes the computational procedure:

    rho = 1 - (6Σd²) / (N(N² - 1))

where

    d = the difference between the two ranks for each pair
    N = number of pairs of scores

Step 1.     Rank the IQ scores so that both variables are ranked.

Note: When two children have the same IQ score, the same rank is given to each. But notice that the rank given is the mean value of the two ranks for the two tied scores. For example, Child M and Child N both have IQ scores of 103. These scores should fall into ranks 11 and 12, but since both children have the same IQ score, both are ranked as 11.5, and the next score (Child J, with an IQ score of 98) is ranked as 13.

Child   Teacher's Ranking   IQ Score   IQ Rank (by test score)
A       1                   116        5
B       2                   111        7
C       3                   97         14
D       4                   122        3
E       5                   116        5
F       6                   105        10
G       7                   108        9
H       8                   95         17
I       9                   124        2
J       10                  98         13
K       11                  116        5
L       12                  109        8
M       13                  103        11.5
N       14                  103        11.5
O       15                  96         15.5
P       16                  90         19
Q       17                  134        1
R       18                  87         20
S       19                  96         15.5
T       20                  91         18

Step 2.     Compute the difference between the two ranks for each child. The resulting value is called the d value. List these values in a column, making sure to note whether they are positive or negative.

Child   d value           Child   d value           Child   d value            Child   d value
A       -4    (1-5)       F       -4    (6-10)      K       +6    (11-5)       P       -3    (16-19)
B       -5    (2-7)       G       -2    (7-9)       L       +4    (12-8)       Q       +16   (17-1)
C       -11   (3-14)      H       -9    (8-17)      M       +1.5  (13-11.5)    R       -2    (18-20)
D       +1    (4-3)       I       +7    (9-2)       N       +2.5  (14-11.5)    S       +3.5  (19-15.5)
E       0     (5-5)       J       -3    (10-13)     O       -0.5  (15-15.5)    T       +2    (20-18)

Step 3.     Square all the d Values from Step 2, and add all the squared values.

(-4)² + (-5)² + . . . + 2² = 668

Step 4.     Multiply the result of step 3 by the number 6. (Note: The number 6 is always used, regardless of the number of ranks, etc., involved).

668 x 6 = 4008

Step 5.     Compute N(N² - 1). (In our example, N = 20.)

20(20² - 1) = 20(400 - 1) = 20(399) = 7980

Step 6.     Divide the result of Step 4 by the result of Step 5.

4008 ÷ 7980 ≈ 0.50

Step 7.     Subtract the final value of Step 6 from the number 1. (Note: The number 1 is also always used.) This yields the value of Spearman's rho. Be careful to record whether it is positive or negative.

rho = 1 - 0.50 = +0.50
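The entire procedure, including the averaging of tied ranks described in Step 1, can be sketched in Python. This is our own illustration, not part of the original package; the helper names average_ranks and spearman_rho are assumptions:

```python
def average_ranks(scores):
    # Rank scores from highest to lowest, giving tied scores the mean of
    # the rank positions they occupy (e.g., two scores tied for positions
    # 11 and 12 each receive rank 11.5).
    ordered = sorted(scores, reverse=True)
    ranks = []
    for s in scores:
        first = ordered.index(s) + 1           # first rank position of this score
        count = ordered.count(s)               # number of tied scores
        ranks.append(first + (count - 1) / 2)  # mean of the tied positions
    return ranks

def spearman_rho(x_ranks, y_scores):
    y_ranks = average_ranks(y_scores)
    n = len(x_ranks)
    d2 = sum((a - b) ** 2 for a, b in zip(x_ranks, y_ranks))  # Steps 2-3
    return 1 - (6 * d2) / (n * (n ** 2 - 1))                  # Steps 4-7

teacher = list(range(1, 21))  # teacher's rankings for children A-T
iq = [116, 111, 97, 122, 116, 105, 108, 95, 124, 98,
      116, 109, 103, 103, 96, 90, 134, 87, 96, 91]

print(round(spearman_rho(teacher, iq), 2))  # prints 0.5, i.e., rho ≈ +0.50
```

The sketch reproduces the hand computation: Σd² = 668, so rho = 1 - 4008/7980 ≈ +0.50.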

Here is an example of a direct relationship:

X Y
2 15
4 20
6 25 r = +1.00
8 30
10 35

Here we have a perfect set of data, and you can easily see the relationship. As one variable increases, the other also increases, or as one variable decreases, the other also decreases. This is what we call a direct relationship.
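As a quick check, applying the raw-score formula from the Pearson section to this small data set yields exactly +1.00. A minimal Python sketch (our own illustration, not part of the original package):

```python
import math

# The five pairs from the direct-relationship example.
x = [2, 4, 6, 8, 10]
y = [15, 20, 25, 30, 35]
n = len(x)

num = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
den = math.sqrt((n * sum(a * a for a in x) - sum(x) ** 2) *
                (n * sum(b * b for b in y) - sum(y) ** 2))
print(num / den)  # 1.0 — a perfect direct relationship
```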

A FURTHER NOTE ON MAGNITUDE

     It is difficult to say how large a correlation coefficient should be to be "meaningful". The strength of the relationship represented must be interpreted in the context of that relationship. In general, however, the student can follow this interpretation:

CORRELATION VALUE   APPROXIMATE MEANING
less than 0.20      Slight, almost negligible relationship
0.20 to 0.40        Low correlation; definite but small relationship
0.40 to 0.70        Moderate correlation; substantial relationship
0.70 to 0.90        High correlation; marked relationship
0.90 to 1.00        Very high correlation; very dependable relationship

CAUSALITY

     It is important for the student to realize at this time that when we figure correlation, we are figuring the statistical degree of relationship between variables. We are not, as you might be tempted to think, determining whether one variable causes or influences the other. A very common mistake in social science is to interpret a high correlation as meaning causality. Thus, if A and B are highly correlated, it does not necessarily follow that A causes B or that B causes A. A common explanation would be that a third variable, C, causes both A and B, thus making it look as if they are related. Correlation is extremely useful for finding relationships and predicting trends, but it is only the first step in trying to determine causality.

INTERPRETATION OF rho AND r

     Correlation coefficients give us two pieces of information concerning the relationship between two variables: the strength of that relationship and its direction.
     If computed correctly, the value of the correlation coefficient will always fall between -1 and +1. This is true for both rho and r. The strength of the relationship is shown by how large the coefficient is, that is, how close it is to plus or minus one. Here are some examples of "weak" coefficients:

-0.05
0.10
0.08
-0.12

Note that all of these coefficients are close to zero. This means that there is little statistical relationship between the two variables. Here are some examples of "strong" coefficients:

0.85
-0.94
-0.78
0.97

Note that all of these coefficients are close to one (positive or negative). Thus, we see that the size of the number (between zero and one) is an indication of the strength of the relationship. The sign of the coefficient has nothing to do with strength. The sign of the coefficient merely reflects whether the relationship is direct (positive sign) or indirect (negative sign).

Here is an example of an indirect relationship:

X Y
5 1
4 2
3 3 r = -1.00
2 4
1 5

Of course, not all sets of data are this obvious; we very seldom get a perfect relationship between groups of data. The important thing to notice is that as one variable gets larger, the other gets smaller. This is what we mean by an indirect relationship.

 

Critical Values of the Pearson Product Moment Correlation Coefficient

df = N - 2        Level of significance for one-tailed test
(degrees of       0.05     0.025    0.01     0.005    0.0005
freedom)          Level of significance for two-tailed test
                  0.10     0.05     0.02     0.01     0.001
1 .9877 .9969 .9995 .9999 1.0000
2 .9000 .9500 .9800 .9900 .9990
3 .8054 .8783 .9343 .9587 .9912
4 .7293 .8114 .8822 .9172 .9741
5 .6694 .7545 .8329 .8745 .9507
6 .6215 .7067 .7887 .8343 .9249
7 .5822 .6664 .7498 .7977 .8982
8 .5494 .6319 .7155 .7646 .8721
9 .5214 .6021 .6851 .7348 .8471
10 .4973 .5760 .6581 .7079 .8233
11 .4762 .5529 .6339 .6835 .8010
12 .4575 .5324 .6120 .6614 .7800
13 .4409 .5139 .5923 .6411 .7603
14 .4259 .4973 .5742 .6226 .7420
15 .4124 .4821 .5577 .6055 .7246
16 .4000 .4683 .5425 .5897 .7084
17 .3887 .4555 .5285 .5751 .6932
18 .3783 .4438 .5155 .5614 .6787
19 .3687 .4329 .5034 .5487 .6652
20 .3598 .4227 .4921 .5368 .6524
25 .3233 .3809 .4451 .4869 .5974
30 .2960 .3494 .4093 .4487 .5541
35 .2746 .3246 .3810 .4182 .5189
40 .2573 .3044 .3578 .3932 .4896
45 .2428 .2875 .3384 .3721 .4648
50 .2306 .2732 .3218 .3541 .4422
60 .2108 .2500 .2948 .3248 .4078
70 .1954 .2319 .2737 .3017 .3799
80 .1829 .2172 .2565 .2830 .3568
90 .1729 .2050 .2422 .2673 .3375
100 .1638 .1946 .2301 .2540 .3211

 

Table VI - Critical Values of rs, the Spearman Rank Correlation Coefficient

N Significance level (one-tailed test)
0.05 0.01
4 1.000
5 .900 1.000
6 .829 .943
7 .714 .893
8 .643 .833
9 .600 .783
10 .564 .746
12 .506 .712
14 .456 .645
16 .425 .601
18 .399 .564
20 .377 .534
22 .359 .508
24 .343 .485
26 .329 .465
28 .317 .448
30 .306 .432

CORRELATION

     Some problems in educational or psychological research necessitate the comparison of two sets of measures, such as test scores, in order to determine whether the measures show a relationship. That is, it is necessary to find out if there is a correlation between the variables. If the variables are found to rise or fall together in such a way that an increase in one is accompanied by an increase in the other or a decrease in one is accompanied by a decrease in the other, there is a positive correlation. The correlation is negative or inverse if an increase in one variable is accompanied by a decrease in the other. An example of a positive correlation is found in a comparison of height and weight. Generally speaking, taller people tend to be heavier than shorter people. An example of a negative relationship is the correlation between horsepower and gas mileage in automobiles: generally speaking, as horsepower increases, mileage decreases.
     Karl Pearson, a mathematician of great renown, developed a statistical procedure for obtaining a coefficient of correlation that is commonly employed today when a numerical value is needed to express the degree of relationship between variables. The symbol for a Pearson product-moment coefficient of correlation is r. When r is zero, no correlation exists between two variables, such as X and Y. When r is +1.00, there is a perfect positive correlation, and when it is -1.00, there is a perfect negative correlation. We shall discuss the interpretation of a correlation coefficient a little later in this chapter.
     The use of a scatter diagram or scattergram will permit the researcher to show graphically the relationship between two sets of measures, such as scores from Test X and Test Y. The scattergram also will reveal whether the relationship between the scores is linear (straight line) or nonlinear. The Pearson r applies only to linear relationships. A scattergram for the following sets of data appears in Figure 1.

Figure 1. A scattergram for scores on Test X and Test Y.

Student Number Score on Test X Score on Test Y
01 20 19
02 17 18
03 15 16
04 14 14
05 13 13
06 12 15
07 12 14
08 10 12
09 8 10
10 7 10
11 5 5
12 2 3

     In the scattergram a point has been positioned on the graph to show the corresponding values of X and Y for each of the 12 students. For example, student number 01 has a score of 20 for Test X and a score of 19 for Test Y. A single point was made on the graph to show where the two values intersect. It can be seen that the pattern of points is linear and could be represented by a line running from the lower left of the graph to the upper right. This tells us that the relationship is linear and positive. Had the pattern run from the upper left to the lower right the correlation would have been negative. Three examples of scattergrams for positive, negative, and zero correlation are shown in Figure 2.

Figure 2. Examples of scattergrams.

PRODUCT-MOMENT COEFFICIENT OF CORRELATION

The basic formula for finding the Pearson r is written

    r = Σxy / (N·Sx·Sy)

where

    x = X - X̄ (each X score's deviation from the mean of X)
    y = Y - Ȳ (each Y score's deviation from the mean of Y)
    N = the number of pairs of scores
    Σxy = the sum of the products of the deviations from the means of X and Y respectively
    Sx, Sy = the standard deviations of the X and Y scores

A variation of the basic formula is the following:

    r = Σxy / √((Σx²)(Σy²))

     Since the use of calculators and computers is becoming more and more widespread, the following raw-score formula is especially useful for computing r:

    r = [N(ΣXY) - (ΣX)(ΣY)] / √{[N(ΣX²) - (ΣX)²][N(ΣY²) - (ΣY)²]}

     We shall use the raw-score formula to demonstrate the calculation of a coefficient of correlation for the scores made by students selected at random from a larger group (population).

ENGLISH AND SOCIAL STUDIES ACHIEVEMENT TEST SCORES FOR TWELVE STUDENTS

ENGLISH TEST SCORES (X) SOCIAL STUDIES TEST SCORES (Y)
40 38
35 36
32 40
30 35
28 30
25 20
25 32
25 28
22 25
20 22
20 20
15 18

In order to compute r for the preceding data, we will need the following values:

  1. The number of pairs of scores, or N.
  2. The sum of the products of the pairs of scores, or ΣXY.
  3. The sum of the scores for Test X.
  4. The sum of the scores for Test Y.
  5. The sum of the squared scores for Test X.
  6. The sum of the squared scores for Test Y.

X     X²      Y     Y²      XY
40    1600    38    1444    1520
35    1225    36    1296    1260
32    1024    40    1600    1280
30    900     35    1225    1050
28    784     30    900     840
25    625     20    400     500
25    625     32    1024    800
25    625     28    784     700
22    484     25    625     550
20    400     22    484     440
20    400     20    400     400
15    225     18    324     270
ΣX=317   ΣX²=8917   ΣY=344   ΣY²=10,506   ΣXY=9610
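Substituting these sums into the raw-score formula can be checked with a few lines of Python. This is our own sketch, not part of the original text; the sums are taken from the table above:

```python
import math

n = 12
sum_x, sum_y = 317, 344        # ΣX, ΣY
sum_x2, sum_y2 = 8917, 10506   # ΣX², ΣY²
sum_xy = 9610                  # ΣXY

numerator = n * sum_xy - sum_x * sum_y              # 115320 - 109048 = 6272
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) *
                        (n * sum_y2 - sum_y ** 2))  # √(6515 × 7736)
r = numerator / denominator
print(round(r, 2))  # 0.88
```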

INTERPRETATION OF A CORRELATION COEFFICIENT

How should we interpret the r of .88 just computed?

First of all we must determine whether it is statistically significant. A formula can be used for the purpose, but it is simpler to use a table, such as Table D in the Appendix. In Table D the first column shows degrees of freedom (df). In the previous problem, df=10, because we find df by the formula N-2, where N equals the number of pairs of scores. After locating 10 in the df column, we can find the figure .576 in the 5 percent column and .708 in the 1 percent column. Our computed r of .88 exceeds both these values, so it is significant at the 1 percent level. We can conclude that this r represents an estimate of a population coefficient of correlation between the variables X and Y that is greater than zero.

Since we have accepted the coefficient of correlation as statistically significant, we can proceed with our interpretation. The interpretation of r must take into account two things: the sign of r and the size of it. The sign (positive or negative) tells us about the direction of the relationship. Our computed r of .88 is positive, so we know that the correlation is positive. It is relatively difficult, however, to interpret the size of a correlation coefficient. It should be emphasized that r's are not to be interpreted as percentages. Our r of .88 definitely does not imply that X and Y are correlated 88 percent of the time. Furthermore, we cannot say that an r of .88 shows twice as strong a relationship between two variables as an r of .44. Actually, .88 shows more than twice the relationship that .44 shows. If we square an r we can determine the amount of variation in a second variable that is associated with variation in the first. For example, if we square an r of .88 we get .77. Now we can say that 77 percent of the variation in our Test X scores is associated with the variation in Test Y scores, or vice versa.
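The squaring argument can be verified directly. A tiny Python check (our own illustration, not part of the original text):

```python
# Coefficient of determination: the proportion of variation in one
# variable associated with variation in the other is r squared.
for r in (0.88, 0.44):
    print(r, "->", round(r * r, 4))
# 0.88 -> 0.7744  (about 77 percent)
# 0.44 -> 0.1936  (about 19 percent — a quarter, not half, of the above)
```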

As a rough guide to providing qualitative descriptive terms for coefficients of correlation, the following system is suggested:

CORRELATION VALUE   APPROXIMATE MEANING
less than 0.20      Slight, almost negligible relationship
0.20 to 0.40        Low correlation; definite but small relationship
0.40 to 0.70        Moderate correlation; substantial relationship
0.70 to 0.90        High correlation; marked relationship
0.90 to 1.00        Very high correlation; very dependable relationship

It should be emphasized that correlation does not necessarily imply causation. In other words, even if a very high and statistically significant correlation coefficient is obtained, the assumption should not be made that X causes Y or that Y causes X. It may be that both variables are affected by another variable or variables and do not affect each other directly.