Chapter 4: Analysing the Data
Least-squares regression line
Regression generates what is called the "least-squares" regression line. The regression line takes the form Ŷ = a + b*X, where a and b are both constants, Ŷ (pronounced y-hat) is the predicted value of Y, and X is a specific value of the independent variable. Such a formula could be used to generate values of Ŷ for a given value of X. For example, suppose a = 10 and b = 7. If X is 5, then the formula produces a predicted value for Y of 45 (from 10 + 7*5). It turns out that with any two variables X and Y, there is one equation that produces the "best fit" linking X to Y. In other words, there exists one formula that will produce the best, or most accurate, predictions for Y given X. Any other equation would not fit as well and would predict Y with more error. That equation is called the least squares regression equation.
But how do we measure best? The criterion is called the least squares criterion and it looks like this:
$$SS_{error} = \sum (Y - \hat{Y})^2$$
You can imagine a formula that produces predictions for Y from each value of X in the data. Those predictions will usually differ from the actual value of Y that is being predicted (unless the Y values lie exactly on a straight line). If you square each difference and add up these squared differences across all the predictions, you get a number called the residual or error sum of squares (or SSerror). The formula above is simply the mathematical representation of SSerror. Regression generates a formula such that SSerror is as small as it can possibly be. Minimising this number (by using calculus) minimises the average squared error in prediction.
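To make the criterion concrete, here is a minimal sketch in Python. The five data points are made up for illustration (they are not the data behind Output 4.4), and the function names `ss_error` and `least_squares` are our own. The sketch computes SSerror for a given a and b, finds the least squares values using the standard textbook formulas, and checks that nudging either constant makes SSerror larger.

```python
# Hypothetical illustrative data (NOT the data analysed in this chapter).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]


def ss_error(a, b, xs, ys):
    """Sum of squared differences between each actual Y and its prediction a + b*X."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(xs, ys))


def least_squares(xs, ys):
    """Return (a, b) that minimise ss_error, using the standard closed-form formulas."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares of X, both about the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


a, b = least_squares(x, y)
best = ss_error(a, b, x, y)

# Move a or b slightly away from the least squares values and recompute SSerror.
nudged = [
    ss_error(a + da, b + db, x, y)
    for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]
]

print(round(a, 3), round(b, 3), round(best, 3))
print(all(value > best for value in nudged))
```

For these made-up points the line is Ŷ = 2.2 + 0.6*X, and every nudged line has a larger SSerror than the least squares line.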
It is possible to derive by hand computation the values of a and b that minimise SSerror. To do so, all you need to know is the mean of X, the mean of Y, the standard deviation of X, the standard deviation of Y, and the correlation between X and Y. Or if you have the original data, you can apply the formulas discussed in every statistics textbook. But computers do it all much more easily. Output 4.4 provides the printout from SPSS for a linear regression predicting scores on the creativity test from scores on the logical reasoning test.
Output 4.4. Regression… Linear… selecting Descriptive statistics
More detailed comments on Output 4.4
Descriptive statistics comes from selecting this option in the regression procedure. The most important information about a variable is its mean, standard deviation (sd), and N (the number of cases).
Correlations comes as part of selecting Descriptive statistics and is the same as we saw earlier. The correlation matrix describes how all the variables that are in the analysis are related to each other.
Variables Entered/Removed indicates which variables have been selected to do the predicting and also gives the name of the variable being predicted. You should check that these are what you intend them to be.
Model Summary summarises the main information from considering the regression line a "model" for fitting the data. The correlation, variability explained, and standard error are given here.
ANOVA gives the results of the significance tests on this model. Ignore all except the "Sums of Squares" information until a later chapter.
Coefficients gives the information we need to identify our regression line. The value for a (the constant or Y-intercept) is given under "B" as 5.231 and the value for b (the slope of the line) is also given under "B" as .654. The standardized Coefficient is the slope of the line when the logical reasoning scores and the creativity scores are transformed to z-scores before computing the regression line. The "t" and "Sig." can be ignored for now.
So the least squares regression formula is:
predicted creativity score = 5.231 + 0.654*reasoning test score
Someone who scored 10 on the logical reasoning test would be predicted to score 5.231+0.654*10 or about 11.77 on the creativity test. This is the best fitting equation because it minimises the sum of the squared differences between the predicted values and the actual values. That value, called SSerror, is displayed on the printout under "Sum of Squares" in the row labelled "Residual." So SSerror is 116.587. No other values of a and b would produce a smaller value for SSerror.
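The prediction above can be reproduced with a few lines of Python. This is only a sketch: the two constants are copied from the "B" column of Output 4.4, and the function name `predict_creativity` is our own.

```python
# Constants read from the "B" column of the Coefficients table in Output 4.4.
A = 5.231  # constant (Y-intercept)
B = 0.654  # slope for the logical reasoning score


def predict_creativity(reasoning_score):
    """Predicted creativity score for a given logical reasoning score."""
    return A + B * reasoning_score


predicted = predict_creativity(10)
print(round(predicted, 2))  # 11.77
```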
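The variability in creativity that can be predicted by knowing logical reasoning can be found by
$$r^2 = \frac{SS_{regression}}{SS_{total}} = 1 - \frac{SS_{error}}{SS_{total}}$$
where the sums of squares are taken from the ANOVA table.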
Notice the value for "Beta" under "Standardized Coefficients" is the same as the correlation between the two variables.
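The match between Beta and the correlation is not a coincidence: with a single predictor, the slope computed on z-scores always equals r. A minimal Python sketch, using made-up data (not the data behind Output 4.4) and helper names of our own, shows this:

```python
from statistics import mean, stdev

# Hypothetical illustrative data (NOT the data analysed in this chapter).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]


def z_scores(values):
    """Transform raw scores to z-scores using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def slope(xs, ys):
    """Least squares slope for predicting ys from xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
    sxx = sum((xi - mx) ** 2 for xi in xs)
    return sxy / sxx


def correlation(xs, ys):
    """Pearson correlation: average cross-product of z-scores, using N - 1."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(p * q for p, q in zip(zx, zy)) / (len(xs) - 1)


beta = slope(z_scores(x), z_scores(y))  # standardized slope
r = correlation(x, y)

print(round(beta, 4), round(r, 4))
```

Both printed values are the same, which is what the "Beta" column reports when there is only one predictor.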
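The slope of the regression line (symbolised by "B" in the output) can be computed from
$$b = r \, \frac{s_Y}{s_X}$$
where r is the correlation between X and Y, and s_X and s_Y are the standard deviations of X and Y.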
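As a rough illustration of computing the line from summary statistics alone, the sketch below uses the standard result that the slope is the correlation multiplied by the ratio of the standard deviation of Y to the standard deviation of X. The intercept formula, a = mean of Y minus b times mean of X, is also a standard result, although it is not spelled out in this chapter. The data are made up for illustration and are not the data behind Output 4.4.

```python
from statistics import mean, stdev

# Hypothetical illustrative data (NOT the data analysed in this chapter).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# The five summary statistics needed for hand computation.
mean_x, mean_y = mean(x), mean(y)
sd_x, sd_y = stdev(x), stdev(y)
r = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (
    (len(x) - 1) * sd_x * sd_y
)

# Slope from the correlation and the two standard deviations.
b = r * sd_y / sd_x
# Intercept: the least squares line passes through (mean of X, mean of Y).
a = mean_y - b * mean_x

print(round(a, 3), round(b, 3))
```

For these made-up points the result is a = 2.2 and b = 0.6, the same line that a direct least squares fit to the raw data would give.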
Plotting the regression line
Figure 4.10 The regression line "fitted to" the scatterplot of values shown in Figure 4.9.
This regression equation (or line of best fit) is depicted graphically in Figure 4.10. The value of b in the regression equation (also called the regression weight) is the "slope" of the regression line, when moving from the lower left to the upper right. The slope of the line is defined as the amount by which Y is predicted to increase with each one unit increase in X. So the regression weight for the logical reasoning score in this regression formula, b, can be thought of as the predicted difference in the creativity score associated with a one unit increase in X (logical reasoning score). In other words, two people who differ by 1 point on the logical reasoning test are predicted to differ by roughly .654 points on the creativity test. The slope is positive, which means that as you move to higher values of X, Ŷ also goes to higher values. As you go to lower values of X, Ŷ also goes to lower values. The height of the line over any value of X is the predicted value of Y for that X. So this regression line could be used as a rough way of finding Ŷ. Pick a value of X. Place your pen tip at that X value on the X axis, and then move the pen tip straight up to the regression line. Then, turn 90 degrees to the left, moving the pen straight over to the Y axis. The predicted value from the regression model is the Y value where your pen hits the Y axis.
© Copyright 2000 University of New England, Armidale, NSW, 2351. All rights reserved
Maintained by Dr Ian Price