# Regression

Simple linear regression tells us whether a linear model fits the data better than simply using the mean of the dependent variable. For the model Y = B0 + B1*X1, the regression tests whether the slope B1 is significantly different from zero. Suppose we want to determine whether the mean score on the first two tests is a good predictor of how people will do on the fourth test. First, define column 9 to be the mean of the first two tests, like this:

`MTB> RMEAN 'TEST1'-'TEST2' C9`

`MTB> NAME C9='TESTMEAN'`

Then we could run a regression with 'TEST4' as the dependent variable and our new variable 'TESTMEAN' as the independent variable. Type:

`MTB> REGRESS 'TEST4' 1 'TESTMEAN'`

This says to regress the dependent variable 'TEST4' on one independent variable, 'TESTMEAN'.
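As a rough illustration of what the RMEAN and REGRESS steps compute, here is a minimal Python sketch of row means followed by an ordinary least-squares fit with one predictor. The function names `row_means` and `fit_line` are hypothetical helpers, and the scores are made-up values chosen only to make the arithmetic easy to follow; they are not the class data.

```python
from statistics import mean


def row_means(*columns):
    """Mean across columns for each row, similar in spirit to RMEAN."""
    return [mean(values) for values in zip(*columns)]


def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for the model y = b0 + b1*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Made-up illustration scores (NOT the data used in this handout).
test1 = [60, 70, 80, 90]
test2 = [70, 80, 90, 100]
test4 = [75, 80, 85, 90]

testmean = row_means(test1, test2)
b0, b1 = fit_line(testmean, test4)
print(f"TEST4 = {b0:.1f} + {b1:.3f} TESTMEAN")
```

For the actual class data, Minitab's REGRESS command prints the following output.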

The regression equation is

`TEST4 = 59.2 + 0.278 TESTMEAN`

| Predictor | Coef | Stdev | t-ratio | p |
|---|---|---|---|---|
| Constant | 59.23 | 14.50 | 4.08 | 0.001 |
| TESTMEAN | 0.2778 | 0.1833 | 1.52 | 0.147 |

s = 10.66, R-sq = 11.3%, R-sq(adj) = 6.4%

Analysis of Variance

| SOURCE | DF | SS | MS | F | p |
|---|---|---|---|---|---|
| Regression | 1 | 260.8 | 260.8 | 2.30 | 0.147 |
| Error | 18 | 2045.0 | 113.6 | | |
| Total | 19 | 2305.8 | | | |

Unusual Observations

| Obs. | TESTMEAN | TEST4 | Fit | Stdev.Fit | Residual | St.Resid |
|---|---|---|---|---|---|---|
| 4 | 57.0 | 100.00 | 75.06 | 4.53 | 24.94 | 2.59R |
| 19 | 45.0 | 70.00 | 71.73 | 6.51 | -1.73 | -0.20 X |

R denotes an obs. with a large st.resid.

X denotes an obs. whose X value gives it large influence.

The output begins with the regression equation, followed by a table of coefficients with a t-ratio and p value for each term in the equation. Because our equation contains only one independent variable, the test for TESTMEAN gives the same result as the analysis of variance table: with a single predictor the F value is the square of the t-ratio (1.52 squared is about 2.30), and both have p = .147. In a multiple regression with more than one independent variable, this section becomes more important, because each t-ratio then tests the contribution of its variable given that all the other variables are already in the model.
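The link between the t-ratio and the F value can be checked directly from the numbers Minitab reports. The short Python sketch below only reworks figures copied from the output above; the variable names are illustrative.

```python
# Values copied from the Minitab output above.
coef, stdev = 0.2778, 0.1833           # TESTMEAN row: Coef and Stdev
ms_regression, ms_error = 260.8, 113.6  # MS column of the ANOVA table

t_ratio = coef / stdev                  # t-ratio = coefficient / its standard error
f_value = ms_regression / ms_error      # F = MS(Regression) / MS(Error)

print(f"t-ratio   = {t_ratio:.2f}")
print(f"t squared = {t_ratio ** 2:.2f}")
print(f"F         = {f_value:.2f}")
```

Up to rounding in the printed output, the squared t-ratio and the F value agree, which is why the two p values are identical here.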

The analysis of variance table gives the significance of the regression model as a whole. The p value is .147. Since this is not less than .05, we cannot conclude that the model helps us predict values of the dependent variable. In other words, the mean of the first two tests is not a good predictor of someone's score on the fourth test; R-sq shows that it accounts for only 11.3% of the variation in TEST4.
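The summary line (s, R-sq, R-sq(adj)) can be reproduced from the sums of squares in the analysis of variance table. The following Python sketch uses only numbers copied from the output above and the standard textbook formulas; the variable names are illustrative.

```python
import math

# Values copied from the analysis of variance table above.
ss_regression, ss_error, ss_total = 260.8, 2045.0, 2305.8
df_error, df_total = 18, 19

# Share of the total variation in TEST4 explained by the model.
r_sq = ss_regression / ss_total

# Adjusted R-sq compares mean squares instead of raw sums of squares.
r_sq_adj = 1 - (ss_error / df_error) / (ss_total / df_total)

# s is the square root of the error mean square.
s = math.sqrt(ss_error / df_error)

print(f"R-sq      = {100 * r_sq:.1f}%")
print(f"R-sq(adj) = {100 * r_sq_adj:.1f}%")
print(f"s         = {s:.2f}")
```

These match the 11.3%, 6.4%, and 10.66 that Minitab prints.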

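The regression procedure then lists unusual observations. These are cases that either fit the model poorly or have the potential to strongly affect the result. Observation 4, marked R, went from a mean of 57 on the first two tests to 100 on the fourth test, about 25 points above its fitted value of 75.06. Observation 19, marked X, went from a mean of 45 on the first two tests to 70 on the fourth test; its residual is small (-1.73), but its TESTMEAN value gives it large influence on the fitted line. Observation 4 in particular probably contributed to not finding a significant model.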
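To see where the Fit, Residual, and St.Resid columns come from, the Python sketch below recomputes them for the two flagged observations. It assumes the standard relations for simple regression (leverage = Stdev.Fit squared divided by the error mean square, and standardized residual = residual divided by the square root of the error mean square times one minus the leverage). The helper `diagnose` is a hypothetical name, and small differences from Minitab's figures come from the rounded coefficients.

```python
import math

# Values copied from the Minitab output above.
B0, B1 = 59.23, 0.2778   # Constant and TESTMEAN coefficients
MS_ERROR = 113.6         # MS(Error) from the ANOVA table

# obs number -> (TESTMEAN, TEST4, reported Stdev.Fit)
unusual = {4: (57.0, 100.0, 4.53), 19: (45.0, 70.0, 6.51)}


def diagnose(x, y, stdev_fit):
    """Return (fit, residual, leverage, standardized residual) for one case."""
    fit = B0 + B1 * x
    residual = y - fit
    leverage = stdev_fit ** 2 / MS_ERROR
    st_resid = residual / math.sqrt(MS_ERROR * (1 - leverage))
    return fit, residual, leverage, st_resid


results = {obs: diagnose(*values) for obs, values in unusual.items()}
for obs, (fit, residual, leverage, st_resid) in results.items():
    print(f"obs {obs}: fit={fit:.2f} residual={residual:.2f} "
          f"leverage={leverage:.2f} st.resid={st_resid:.2f}")
```

Observation 4 has the large standardized residual, while observation 19 has the larger leverage, which is consistent with the R and X flags.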