
Preliminaries

In order to use the DEC Alpha, you must first set up an account. Contact ITS Computer Consulting Center in ITTC 36 (Phone 319-273-5555) for an application.

To sign on, press <RETURN> until you get a list of computers. Choose 6 for the DEC Alpha (Acad). Then enter your username and password at the prompts.

If you are new to the DEC Alpha (Acad), you should read the document "Introduction to the DEC Alpha". It can be obtained from the Consulting Center.

Some useful DCL command strings

DCL stands for Digital Command Language. The DCL prompt is the dollar sign. From it you can enter DCL command strings and run programs such as MINITAB. DCL command strings are composed of commands, parameters, and qualifiers. A command is an instruction to the operating system. A parameter defines what the command is to operate on. A qualifier defines how that action will occur.

Here are some useful DCL command strings:

DIRECTORY -- list the files in your directory
PRINT <filespec> -- print a file to the printer
TYPE <filespec> -- display a file on the screen
LOOK <filespec> and TYPE/PAGE <filespec> -- display a file on the screen, a page at a time
DELETE <filespec> -- delete a file
PURGE -- delete all but the most recent version of your files
HELP -- invoke the help screens
COPY <filespec> <filespec> -- copy a file
ED <filespec> -- enter the EVE text editor
LOGOUT -- log off the system

All DCL commands may be abbreviated. You need only enough letters to make the command unique. Four characters are always sufficient.

Complete file specifications <filespec> have the following form:

device:[directory]filename.type;version

The device is a logical name for a disk pack. For faculty, this is FAC. For students, this is STU. The directory is your username, or a subdirectory you have created. Note that the brackets are required.

Both filenames and types can be up to 39 characters long. They may consist of letters, numbers, the underscore, the hyphen, and the dollar sign. File names may start with either a letter or a number, but not the underscore, hyphen, or dollar sign. Version numbers are not required in most instances. Usually, all that is needed is a filename and a filetype.

Entering and Exiting MINITAB

Minitab is an interactive package. To enter it, type:

$ MINITAB

You should see the following:

===================================================================
MINITAB Statistical Software, Standard Version
Release 9.1 for Open VMS
(C) Copyright 1992 Minitab Inc. - All Rights reserved
March 10, 1994 - UNI - Information Systems and Computing Services

Worksheet size: 7407661 cells

For information on:           Type:
_____________________         ______________________
How to use Minitab            HELP
Customer service              HELP OVERVIEW 14
Documentation                 HELP OVERVIEW 15
What's new in this release    NEWS
MTB>

"MTB>" is the Minitab prompt. From here you can type in any of the Minitab commands.

To exit Minitab, type "STOP", like this:

MTB> STOP


***Minitab Release 9.1 *** Minitab Inc. ***


Worksheet size: 7407661 cells


$

The dollar sign is the prompt for the VMS operating system. To sign off Acad, type "LOGOUT", or just "LO".

Saving your Work

Since MINITAB is an interactive package, results are shown on the screen. You may wish to make a hard copy of your results, for example to hand in for a class. The OUTFILE and NOOUTFILE commands in Minitab are used to do this. The OUTFILE command instructs Minitab to copy everything that appears on the screen into a file on your account. The NOOUTFILE command instructs Minitab to stop copying output to the file.

To start copying, type:

MTB> OUTFILE 'filename'

"filename" is the name of the file that you want to save your work under, for example "project1.lis". If a file with that name does not exist in your directory, a new file will be created. If it does exist, the new output will be appended to the existing file.

After you give the OUTFILE command, run the Minitab commands whose output you want to save. Then, to stop the copying, issue the NOOUTFILE command, like this:

MTB> NOOUTFILE

Upon returning to the dollar sign prompt, you may wish to check the contents of the file using the LOOK utility, like this:

$ look project1.lis {pressing "q" will return you to the dollar sign}

To print the output, use the PRINT command, like this:

$ print project1.lis {prints to BUS 19}

OR

$ locprint project1.lis {prints to a printer attached to a PC}

If you want to delete the file containing the output, use the DELETE command, like this:

$ delete project1.lis;*

Entering Data Directly into MINITAB

Minitab operates on columns of data, rather like a spreadsheet. These columns are labeled C1, C2, C3, C4, C5, etc. You can refer to a range of columns by using a dash between column numbers. For example, "C1-C5" refers to C1, C2, C3, C4, and C5.

To enter data from the keyboard of your terminal into a Minitab worksheet, enter the READ command. Each column of the data would contain a different piece of information, called a variable. Each row of the data would represent a different record or case. Often a case contains the information for one person, such as one person's responses to a survey, or one person's grades in a class.

At the Minitab prompt, type READ followed by the columns that you want to enter data into, such as "READ C1-C5", then press <RETURN>. Minitab will respond with a subprompt, DATA>. At this subprompt, enter the data for the first case and press <RETURN>. Another DATA> subprompt will appear, and you can enter the data for the second case. DATA> subprompts will continue to appear until you enter "END". At that point, Minitab will tell you the number of rows read, and you will get the MTB> prompt again. Suppose that we have the student ID and 4 test scores for each of 6 students. A typical sequence might look like this:

MTB> READ C1-C5
DATA> 444555 71 84 75 92
DATA> 222333 82 65 83 84
DATA> 483222 94 95 95 94
DATA> 666777 62 52 61 100
DATA> 456899 73 84 72 71
DATA> 333231 95 72 100 91
DATA> END
6 ROWS READ

If you have made a mistake and are still on the same data line, you can use the backspace key, and then retype the line correctly. If you are past the line the error is in, just keep typing. Then when you are all done you can fix the error. For example, if the 5th row of column 2 should be a 71, you could fix it by typing this:

MTB> LET C2(5)=71

Reading One Column at a Time

The SET command in MINITAB allows you to enter one column at a time. The syntax is this:

SET column

With the SET command, you can enter more than one number on a line. All the numbers will be placed in one column, until you type END. The following example reads a series of test scores into column 2.

MTB> SET C2
DATA> 71 82 94 62 71 95
DATA> END

Looking at the Data

To view the data, use the Minitab PRINT command. For example, to print columns 1 to 5 of the data, issue this command:

MTB> PRINT C1-C5

The data would then appear like this:

ROW      C1   C2   C3   C4   C5
  1  444555   71   84   75   92
  2  222333   82   65   83   84
  3  483222   94   95   95   94
  4  666777   62   52   61  100
  5  456899   71   84   72   71
  6  333231   95   72  100   91

Entering Data into a File

When you have a great deal of data, you may prefer to enter data directly into a file, then read the file into MINITAB.

To enter the file, from the DCL dollar sign prompt use the EDIT command, like this:

$ EDIT <filespec>

For example, to call the file "grades.dat", you would type:

$ edit grades.dat

Then you would enter the data. The simplest way to enter data is to leave a space between fields, since MINITAB can read this data without having to specify formatting.

You should have just numbers in your file, with no labels or other characters. You should make sure that the [End of File] marker in the editor is directly below the last line in the file, with no blank lines, and that the data lines start on the first line in the file.

Reading Data from a File

If the values in your file are separated by spaces, use the READ command with the following syntax:

MTB> READ 'filespec' columns

The following command reads five columns of data from the file 'GRADES.DAT' and puts them into the MINITAB worksheet.

MTB> read 'grades.dat' c1-c5

The SET command can also be used in reading from a DATA file, like this:

SET 'filename' column

Adding Labels to the Columns

You may wish to attach a more descriptive name to a column. A name in Minitab may be up to 8 characters long. The NAME command in Minitab assigns names to columns. The syntax is this:

NAME column= 'name' column='name'

The following example assigns the name STUDENT to column C1 and the names TEST1, TEST2, TEST3, and TEST4 to columns C2 through C5.

MTB> NAME C1='STUDENT' C2='TEST1' C3='TEST2' C4='TEST3'
MTB> NAME C5='TEST4'

The name of a column can be changed with another NAME command. To erase a name (but not the values in the column), assign a null name, like this:

MTB> NAME column=''

When referring to columns by their names, you must enclose the name in single quotes, like this:

PRINT 'STUDENT' 'TEST1'

Adding Rows

The INSERT command in MINITAB allows you to insert rows into a worksheet. The syntax of the INSERT command is the following:

INSERT ['filename'] [between row row] columns

To add rows to the end of the worksheet, just specify the columns, like this:

MTB> INSERT C1-C5
DATA> 345888 90 80 75 76
DATA> 577577 100 90 77 80
DATA> 344999 89 87 89 90
DATA> END

To add rows to the beginning of the worksheet, specify 0 and 1 as the rows to insert between, like this:

MTB> INSERT 0 1 C1-C5

To add rows to the middle of a worksheet, specify the rows to insert between. The following example would insert rows between rows 5 and 6 of the worksheet.

MTB> INSERT 5 6 C1-C5

Deleting Rows

The DELETE command in MINITAB deletes rows in a worksheet. The syntax is:

DELETE row cn-cn

To delete row 5 of the worksheet, type:

MTB> DELETE 5 c1-c5

Deleting Columns

The ERASE command in MINITAB can delete columns and constants in a worksheet. The syntax is:

ERASE column

To delete column 1 of the worksheet, type:

MTB> ERASE C1

Creating a new column

You can calculate new columns based on old ones in Minitab. The LET command in Minitab is used to calculate a column. The syntax is this:

MTB> LET column= arithmetic expression

The following example calculates the improvement from TEST1 to TEST4.

MTB> LET C6=C5-C2

Since C5 and C2 are named, we could have also done this:

MTB> NAME C6='IMPROV'
MTB> LET 'IMPROV'='TEST4'-'TEST1'

Arithmetic operators include +, -, * (for multiply), / (for divide), and ** (for exponentiation). Parentheses can also be used to clarify complex operations. The following example calculates column 7 as the square of column 6.

MTB> LET C7= C6**2

Rowwise statistical functions can also be used to compute a statistic across a series of columns, and to put the result in the corresponding locations of the new column. The following example computes C8 as the mean of C2 to C5.

MTB> RMEAN C2-C5 C8

More complex operations are possible. The following example would compute a weighted mean if the first three tests were worth 50% of the grade and the final test were worth 50% of the grade.

MTB> LET C8=(.50*((C2+C3+C4)/3))+(.50*C5)
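
As a cross-check outside MINITAB, the weighted-mean formula can be reproduced in a few lines of Python. This is only a sketch: the function name weighted_grade is made up for illustration, and the scores are the first student's row from the READ example.

```python
def weighted_grade(test1, test2, test3, final):
    """Mirror LET C8=(.50*((C2+C3+C4)/3))+(.50*C5) for a single row."""
    # The first three tests together count for half of the grade.
    first_three_mean = (test1 + test2 + test3) / 3
    # The final test counts for the other half.
    return 0.50 * first_three_mean + 0.50 * final

# First student's scores from the READ example: 71 84 75 92
grade = weighted_grade(71, 84, 75, 92)
print(round(grade, 2))
```

MINITAB applies the same arithmetic to every row of the worksheet at once; the Python function handles one row per call.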

The following are MINITAB rowwise statistics.

RCOUNT col-col resultcol -- gives the count of values in the row.
RN col-col resultcol -- gives the number of non-missing values across the row.
RNMISS col-col resultcol -- gives the number of missing values across the row.
RSUM col-col resultcol -- gives the sum of the values across the row.
RMEAN col-col resultcol -- gives the mean of the values across the row.
RSTDEV col-col resultcol -- gives the standard deviation of the values across the row.
RMEDIAN col-col resultcol -- gives the median of the values.
RMINIMUM col-col resultcol -- gives the minimum value across the row.
RMAXIMUM col-col resultcol -- gives the maximum value across the row.
RSSQ col-col resultcol -- gives the uncorrected sum of squares.

Computing Constants

Variables that begin with the letter "K" in MINITAB are considered constant rather than column variables. Constants are used when you want to store only one number rather than a series of numbers. The syntax is this:

MTB> LET constant= arithmetic expression

Your arithmetic expression must yield only one number, as opposed to a column of numbers, as the result. The following example puts the sum of column 6 into the constant K1.

MTB> LET K1= SUM(C6)

The following functions result in a single number, and are often associated with constant calculations.

COUNT(column) -- gives the number of items
N(column) -- gives the number of non-missing values
NMISS(column) -- gives the number of missing values
SUM(column) -- gives the sum of the values in a column
MEAN(column) -- gives the mean of the values in a column
STDEV(column) -- gives the standard deviation of values in a column
MEDIAN(column) -- gives the median of the values in a column
MINIMUM(column) -- gives the minimum value in a column
MAXIMUM(column) -- gives the maximum value in a column
SSQ(column) -- gives the uncorrected sum of squares of a column

Saving the worksheet in a MINITAB file

The SAVE command in Minitab saves the worksheet, along with names. MINITAB worksheets saved with the SAVE command can only be accessed through MINITAB, so do not attempt to look at these files at the DCL prompt or in the text editor. The syntax is this:

SAVE 'filename'

The default file extension is MTW. To save the worksheet as 'GRADES', you would enter:

MTB> SAVE 'GRADES'

The filename on your directory would be "GRADES.MTW".

Retrieving a saved worksheet

To retrieve a worksheet previously saved with the SAVE command, use the RETRIEVE command. The syntax is this:

RETRIEVE 'filename'

To retrieve the worksheet 'GRADES', enter:

MTB> RETRIEVE 'GRADES'

Producing Basic Statistics: Describing the Data

The DESCRIBE command prints descriptive statistics for each column. The statistics produced by this command are:

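N -- the number of nonmissing values in the column
N* -- the number of missing values in the column
MEAN -- the mean of the values in the column
MEDIAN -- the median value in the column
TRMEAN -- the "trimmed" mean of the column: removes the smallest 5% and the largest 5% of the values, then takes the mean of the rest
STDEV -- the standard deviation of the values in the column
SEMEAN -- the standard error of the mean
MIN -- the minimum value in the column
MAX -- the maximum value in the column
Q1 -- the first quartile of the column
Q3 -- the third quartile of the column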

To produce descriptive statistics of the data, type:

MTB> DESCRIBE C2-C6

Here are the results:

                N     MEAN   MEDIAN   TRMEAN    STDEV   SEMEAN
TEST1          20    78.00    77.50    78.89    14.62     3.27
TEST2          20    78.05    80.00    78.39    14.29     3.19
TEST3          20    77.15    76.50    77.11    12.19     2.73
TEST4          20    80.90    80.00    81.00    11.02     2.46
IMPROV         20     2.90     0.50     2.22    14.56     3.26

              MIN      MAX       Q1       Q3
TEST1       40.00   100.00    70.00    89.75
TEST2       50.00   100.00    66.25    89.50
TEST3       55.00   100.00    70.00    86.00
TEST4       60.00   100.00    70.25    90.75
IMPROV     -20.00    38.00    -3.50     8.00

Computing Confidence Intervals

Suppose that the 20 people in our study are a sample of a larger population. We might desire to infer things about the larger population from the sample. We might wish to estimate what the mean of the improvement would be if the whole population had taken these tests rather than our sample. Usually this estimate is presented as a range rather than an exact number. This range is called a confidence interval.

The TINTERVAL command in MINITAB computes confidence intervals for the mean on a column of data. This interval goes from:

mean(column) - t*(s/sqrt(n))

to:

mean(column) + t*(s/sqrt(n))

where s is the sample standard deviation, n is the sample size, and t is the value from the t table corresponding to the percent confidence desired and (n-1) degrees of freedom. The syntax of the command is:

TINTERVAL confidencepercent column

The following example computes a 95% confidence interval for the data in the column C6, the column where we calculated the improvement from test 1 to test 4.

MTB> TINTERVAL 95 C6

                N     MEAN    STDEV  SE MEAN    95.0 PERCENT C.I.
IMPROV         20     2.90    14.56     3.26    (  -3.92,    9.72)

Here the confidence interval goes from -3.92 to 9.72. We would say that we are 95% confident that the true mean of the improvement from test 1 to test 4 is between -3.92 and 9.72. Note that if a confidence interval for an improvement in test scores includes the value 0, a two-sided T test on the same data at the same confidence level will not allow us to conclude that there has been any improvement.
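
The interval formula can be checked by hand from the summary statistics. The following Python sketch is an illustration only: the function name t_interval is hypothetical, and the t value 2.093 (95% confidence, 19 degrees of freedom) is taken from a standard t table rather than from the MINITAB output.

```python
import math

def t_interval(mean, stdev, n, t_crit):
    """Return (lower, upper) = mean -/+ t * s / sqrt(n)."""
    half_width = t_crit * stdev / math.sqrt(n)
    return mean - half_width, mean + half_width

# Assumption: t-table value for 95% confidence and n-1 = 19 degrees of freedom.
T_CRIT_19 = 2.093

# MEAN, STDEV, and N for IMPROV as printed by MINITAB.
lower, upper = t_interval(2.90, 14.56, 20, T_CRIT_19)
print(f"({lower:.2f}, {upper:.2f})")
```

Because the mean and standard deviation are entered here already rounded to two decimals, the endpoints can differ from MINITAB's in the last digit.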

T Test on a column of Data

The MINITAB TTEST command is used to determine if the true mean of a column is different from a hypothesized mean. In the case of our column which contains the improvement from TEST 1 to TEST 4, we might wish to know if there actually has been an improvement from one test to the other. To test this, we could test whether the mean of the improvement column is different from zero. It is possible that scores actually went down from TEST 1 to TEST 4. A two-sided T test tests the null hypothesis that the mean improvement is zero against the alternative that it is different from zero, negative or positive. A one-sided test uses the alternative that the mean improvement is specifically positive, or specifically negative, depending upon the hypothesis.

The syntax of the T Test in MINITAB is the following.

TTEST hypothesized-mean column

A T Test on the improvement column, column 6, would be the following.

MTB> TTEST 0 C6

TEST OF MU = 0.00 VS MU N.E. 0.00

                N     MEAN    STDEV  SE MEAN        T  P VALUE
IMPROV         20     2.90    14.56     3.26     0.89     0.38

We see that our improvement variable was computed using 20 cases. The mean is 2.90 and the standard deviation is 14.56. The standard error of the mean, which is used in computing the T statistic, is 3.26. The calculated T value is .89. The P value determines the significance of the calculated T without having to refer to tables. We would reject the null hypothesis that the true mean of the improvement variable is zero in favor of the alternative hypothesis that the true mean is different from zero when the P value is less than .05, for a 5% alpha level (or 95% confidence level). Since the P value is .38, we do not reject the null hypothesis and cannot conclude that test scores changed from the first test to the fourth test. This is consistent with our confidence interval results.
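
The SE MEAN and T columns follow directly from N, MEAN, and STDEV. A minimal Python sketch of that arithmetic is shown here; the function name one_sample_t is hypothetical, and the P value is not reproduced because it requires the t distribution.

```python
import math

def one_sample_t(mean, stdev, n, hypothesized_mean=0.0):
    """Return (t statistic, standard error of the mean)."""
    se_mean = stdev / math.sqrt(n)
    t_value = (mean - hypothesized_mean) / se_mean
    return t_value, se_mean

# N, MEAN, and STDEV for IMPROV, testing against a hypothesized mean of 0.
t_value, se_mean = one_sample_t(2.90, 14.56, 20)
print(round(se_mean, 2), round(t_value, 2))
```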

Plotting a Single Column

A dotplot, a simple histogram-like display of the data, can be produced by using the DOTPLOT command in MINITAB. Type:

DOTPLOT column

For example, we might want to plot our improvement in test scores. Type:

MTB> DOTPLOT 'IMPROV'

        :    .      . .:.:..:.   .   .    .      .      .
    -+---------+---------+---------+---------+---------+-----IMPROV
   -24       -12         0        12        24        36

We see that the improvement scores fall on both sides of zero, indicating some people did worse on the fourth test than on the first one. Also, there are only a few people that gained a great deal from the first test to the fourth one: the gain for most was very small.

Plotting More than One Column

One column can be plotted against another column in a two dimensional scatterplot using the PLOT command. Type:

PLOT column column

To plot TEST4 on the Y axis and TEST1 on the X axis, type:

MTB> PLOT 'TEST4' 'TEST1'

[Character scatterplot: TEST4 on the Y axis (ticks at 60, 75, 90, 105) against TEST1 on the X axis (ticks at 36, 48, 60, 72, 84, 96). Each * marks one student; a 2 marks two students at the same position.]

We can see from the plot that those people who did well on the first test tended to do well on the fourth test, but the relationship is not perfectly linear and there is a lot of variation in the results.

Plots are very useful for spotting outliers. A perfect linear trend would mean that you could fit a straight line through every point in the data. This is usually not possible. Techniques such as regression analysis fit the line that minimizes the sum of the squared vertical distances of the points from the line.

In this data you can see several points that do not follow the overall trend. One person scored a 40 on Test 1 and a 70 on Test 4. Another person scored a 52 on Test 1 and a 100 on Test 4.

Correlation

The Pearson product moment correlation coefficient is often referred to simply as the correlation. It is a measure of the association of two variables and usually designated by the letter r. A correlation matrix between pairs of columns can be produced using the following syntax:

CORRELATION column,column,column

To produce a correlation matrix of Tests 1, 2, 3, and 4, type:

MTB> CORRELATION 'TEST1','TEST2','TEST3','TEST4'

This would be the result.

         TEST1   TEST2   TEST3
TEST2    0.704
TEST3    0.730   0.579
TEST4    0.382   0.237   0.657

Here we see that the correlation of Test 1 with Test 2 is .704, with Test 3 is .730, and with Test 4 is .382. The correlation of Test 2 with Test 3 is .579, and with Test 4 is .237. The correlation of Test 3 with Test 4 is .657. This indicates that the scores on Test 4 are not as closely linearly related to Test 1 and Test 2 as the other tests are to each other.

The closer a positive correlation is to 1, the more closely the two columns are linearly related; the closer it is to 0, the weaker the relationship. Likewise, the closer a negative correlation is to -1, the more closely the two columns are related (with one decreasing as the other increases); the closer it is to 0, the weaker the relationship.

The correlation of Test 1 with Test 2 is .704. This means that people who did well on Test 1 tended to do well on Test 2, and people who did poorly on Test 1 tended to do poorly on Test 2. Since the correlation is not exactly 1, this is not a perfect linear relationship. For example, a person who did poorly on Test 1 may have scored better than expected on Test 2.
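
For readers who want to see what the correlation coefficient measures, here is a small Python sketch of the Pearson formula. The function name pearson_r and the two score lists are hypothetical; they are not the 20-student data set used in this document.

```python
import math

def pearson_r(x, y):
    """Pearson product moment correlation of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores for five students (illustration only).
test_a = [60, 70, 80, 90, 100]
test_b = [65, 72, 78, 95, 90]
print(round(pearson_r(test_a, test_b), 3))
```

A column correlated with itself gives exactly 1, and a column correlated with its own negative gives -1, which matches the interpretation of the two extremes.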

Regression

Simple linear regression tells us whether using a linear model is a better fit of the data than just using the mean. A regression model Y=B0 + B1*X1 would test for whether the slope B1 is significantly different from zero. Let's say we wanted to determine if the mean score of the first two tests would be a good predictor of how people will do on the fourth test. Let's first define column 9 to be the mean of the first 2 tests, like this:

MTB> RMEAN 'TEST1'-'TEST2' C9
MTB> NAME C9='TESTMEAN'

Then we could run a regression with 'TEST4' as the dependent variable and our new variable 'TESTMEAN' as the independent variable. Type:

MTB> REGRESS 'TEST4' 1 'TESTMEAN'

This says to regress the dependent variable 'TEST4' on one independent variable, 'TESTMEAN'.

The regression equation is

TEST4 = 59.2 + 0.278 TESTMEAN

Predictor       Coef       Stdev   t-ratio        p
Constant       59.23       14.50      4.08    0.001
TESTMEAN      0.2778      0.1833      1.52    0.147

s = 10.66     R-sq = 11.3%     R-sq(adj) = 6.4%

Analysis of Variance

SOURCE        DF          SS          MS         F        p
Regression     1       260.8       260.8      2.30    0.147
Error         18      2045.0       113.6
Total         19      2305.8

Unusual Observations
Obs. TESTMEAN     TEST4       Fit Stdev.Fit  Residual  St.Resid
   4     57.0    100.00     75.06      4.53     24.94      2.59R
  19     45.0     70.00     71.73      6.51     -1.73     -0.20 X

R denotes an obs. with a large st.resid.
X denotes an obs. whose X value gives it large influence.

First we are given the regression equation. Then there is a section that contains t-ratios and p values for the variables in the equation. Our regression equation contains only one variable, so the test of its coefficient gives the same p value as the analysis of variance table (with one independent variable, the square of the t value equals the F value). However, if we had a multiple regression with more than one independent variable, this section becomes important. Each t value would then test the effect of adding that variable given that all the other variables are in the model.

The analysis of variance table tells us the significance of the regression model. The p value is .147. Since this is not less than .05, we cannot conclude that the model is helpful in allowing us to predict values of the dependent variable. In other words, knowing the mean of the first two tests is not a good measure of what someone's score for the fourth test will be.
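
The quantities in the regression output are linked by simple arithmetic, which the following Python sketch retraces from the printed sums of squares. The variable names are chosen for illustration; every number is copied from the REGRESS output, so small rounding differences are expected.

```python
# Values copied from the printed analysis of variance table.
ss_regression, df_regression = 260.8, 1
ss_error, df_error = 2045.0, 18
ss_total = 2305.8
t_ratio_slope = 1.52  # t-ratio printed for TESTMEAN

ms_regression = ss_regression / df_regression  # MS = SS / DF
ms_error = ss_error / df_error
f_value = ms_regression / ms_error             # F in the ANOVA table
r_squared = ss_regression / ss_total           # R-sq as a proportion
s = ms_error ** 0.5                            # "s", the residual standard deviation

print(round(f_value, 2), round(100 * r_squared, 1), round(s, 2))
# With a single predictor, the squared t-ratio is close to F.
print(round(t_ratio_slope ** 2, 2))
```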

The regression procedure then lists unusual observations. These are cases with a large standardized residual (marked R) or with an X value that gives them a large influence on the result (marked X). Observation 4 went from a mean of 57 on the first 2 tests to a 100 on the final. Observation 19 went from a mean of 45 on the first 2 tests to a 70 on the final. These cases probably contributed to not finding a significant model.

Using Categorical Data

Categorical data refers to data which is coded to represent some group. The numbers used to code the data are usually arbitrary and thus statistics that use the values of these numbers in calculations, like correlation, would not be appropriate. For example, suppose we wanted to add the sex of the student. We could code 1 as male, and 2 as female. We could enter a new column containing the data like this:

MTB> SET C7
DATA> 1 2 2 1 1 1 1 2 2 2 2 1 1 1 1 1 2 1 2 2
DATA> END

It is possible to have more than 2 categories. For example, the class the student is in might have four categories: 1= freshman, 2=sophomore, 3=junior, and 4=senior. To enter this data, type:

MTB> SET C8
DATA> 4 4 1 1 3 1 2 1 3 4 2 2 1 1 1 1 2 1 3 4
DATA> END

Then we could add labels to our new data. Type:

MTB> NAME C7='SEX' C8='CLASS'

Frequency Tables

When you have categorical data, you frequently desire to know how many of each group there are. A frequency table will provide this information. The TALLY command in MINITAB produces frequency tables. The following example tallies the 'SEX' column and the 'CLASS' column.

MTB> TALLY 'SEX' 'CLASS';
SUBC> ALL.

Here we specified the 'ALL' subcommand to produce cumulative counts, percents, and cumulative percents in our output.

The results would be as follows:

 sex  COUNT  CUMCNT  PERCENT  CUMPCT    class  COUNT  CUMCNT  PERCENT  CUMPCT
   1     11      11    55.00   55.00        1      9       9    45.00   45.00
   2      9      20    45.00  100.00        2      4      13    20.00   65.00
  N=     20                                 3      3      16    15.00   80.00
                                            4      4      20    20.00  100.00
                                           N=     20

Remember that we coded our sex variable to be 1 for males and 2 for females. The table shows that we have 11 males (55%) and 9 females (45%). For our class variable, we coded 1=freshman, 2=sophomore, 3=junior, and 4=senior. The table shows that we have 9 freshmen, 4 sophomores, 3 juniors, and 4 seniors. The cumulative numbers are also helpful for this variable: 80% of our sample are freshmen, sophomores, or juniors, and 65% are freshmen or sophomores.
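
The counts, cumulative counts, and percents in a frequency table are easy to reproduce by hand. The Python sketch below does so for the CLASS codes entered earlier with SET C8; the function name tally is hypothetical and only mimics the columns of the MINITAB output.

```python
from collections import Counter

def tally(values):
    """Return rows of (value, count, cum. count, percent, cum. percent)."""
    counts = Counter(values)
    total = len(values)
    rows, running = [], 0
    for value in sorted(counts):
        running += counts[value]
        rows.append((value, counts[value], running,
                     100 * counts[value] / total, 100 * running / total))
    return rows

# The CLASS codes from the SET C8 example (1=freshman ... 4=senior).
class_codes = [4, 4, 1, 1, 3, 1, 2, 1, 3, 4, 2, 2, 1, 1, 1, 1, 2, 1, 3, 4]
for row in tally(class_codes):
    print("%d %5d %6d %8.2f %8.2f" % row)
```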

Two-Dimensional Tables with Chi-Square

A two-dimensional table can cross one categorical variable by another categorical variable. This may be useful, for example, if you wish to know the breakdown of class for each sex. The following example produces a two-dimensional table of 'SEX' by 'CLASS', and requests a chi-square statistic on the table.

MTB> TABLE 'SEX' 'CLASS';
SUBC> CHISQ 3.

The "3" next to the Chi Square requests that the count, the expected count, and the standardized residual be put into each cell.

Here are the results:

 ROWS: sex     COLUMNS: class

            1        2        3        4      ALL

   1        7        2        1        1       11
         4.95     2.20     1.65     2.20    11.00
         0.92    -0.13    -0.51    -0.81       --

   2        2        2        2        3        9
         4.05     1.80     1.35     1.80     9.00
        -1.02     0.15     0.56     0.89       --

 ALL        9        4        3        4       20
         9.00     4.00     3.00     4.00    20.00
           --       --       --       --       --

CHI-SQUARE = 3.951 WITH D.F. = 3

 CELL CONTENTS --
          COUNT
          EXP FREQ
          STD RES

Here we see that 7 males are freshmen, 2 are sophomores, 1 is a junior, and 1 is a senior. 2 females are freshmen, 2 are sophomores, 2 are juniors, and 3 are seniors. The chi-square statistic tests whether there is a significant difference between the two sexes as to how they are distributed among the classes. The probability of chi-square is not given. If we looked up the chi-square critical value for an alpha of .05 and 3 degrees of freedom, we would find it to be 7.81. Since 3.951 is less than 7.81, we cannot conclude that there is a difference between the sexes.

The expected count and standardized residual are shown in each cell. This is helpful in determining which cells contributed most to the chi-square.
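
To make the cell contents concrete, the following Python sketch recomputes the expected counts, standardized residuals, and chi-square statistic from the observed counts. The function name chi_square is hypothetical; the formulas used are expected = row total * column total / N, standardized residual = (count - expected) / sqrt(expected), and chi-square = the sum of the squared standardized residuals.

```python
import math

def chi_square(observed):
    """Return (chi-square, degrees of freedom, standardized residuals)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    std_residuals = []
    for i, row in enumerate(observed):
        res_row = []
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            res_row.append((count - expected) / math.sqrt(expected))
            statistic += (count - expected) ** 2 / expected
        std_residuals.append(res_row)
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df, std_residuals

# Observed counts: rows are sex (1, 2), columns are class (1 to 4).
counts = [[7, 2, 1, 1],
          [2, 2, 2, 3]]
statistic, df, residuals = chi_square(counts)
print(round(statistic, 3), df)
```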

Two-Group T Test

A two-group T test asks whether the mean of a column for one group is significantly different from the mean of that column for another group.

Suppose we wanted to know if there was a difference between males and females as to how they did on Test 4. The following example would do a two group t-test to see if there is a difference between the means of the two groups.

MTB> TWOT 95 'TEST4', 'SEX'

The TWOT command in MINITAB assumes that the data values will be in one column, and the group codes will be in another column.

TWOSAMPLE T FOR TEST4
sex     N      MEAN     STDEV   SE MEAN
1      11      80.4      13.3      4.01
2       9     81.56      8.11      2.70

95 PCT CI FOR MU 1 - MU 2: (-11.45, 9.070)

TTEST MU 1 = MU 2 (VS NE): T= -0.25  P=0.81  DF= 16

The p value for the test is .81. Since this is not less than .05, we cannot conclude that there was any difference between how males and females did on test 4.
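
The T and DF values can be approximated from the group summaries. The Python sketch below assumes the unpooled (separate variances) form of the two-sample t test with the usual approximate degrees of freedom; that assumption is consistent with DF= 16 in the output but is not stated in it. The function name welch_t is hypothetical.

```python
import math

def welch_t(mean1, s1, n1, mean2, s2, n2):
    """Two-sample t statistic and approximate df without pooling variances."""
    v1 = s1 ** 2 / n1  # squared standard error of group 1's mean
    v2 = s2 ** 2 / n2  # squared standard error of group 2's mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# MEAN, STDEV, and N for each sex as printed by TWOT.
t_value, df = welch_t(80.4, 13.3, 11, 81.56, 8.11, 9)
print(round(t_value, 2), int(df))
```

Since the printed means and standard deviations are rounded, the t value computed this way can differ from MINITAB's in the second decimal place.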

Analysis of Variance

When there are more than two groups, an analysis of variance is used rather than a two-group t-test. The null hypothesis of analysis of variance would be that there is no difference in the mean of the dependent variable between any of the groups. Suppose we wanted to know if there was a difference between the class of the students as to how they did on test 4. The following example would do an analysis of variance to see if there is a difference in means.

MTB> ONEWAY 'TEST4','CLASS'

When using the ONEWAY command, the first column specified should be the dependent variable and the second column should be the "treatment" variable, the variable that contains the groups.

ANALYSIS OF VARIANCE ON TEST4
SOURCE     DF        SS        MS        F        p
class       3       140        47     0.34    0.793
ERROR      16      2166       135
TOTAL      19      2306

                                INDIVIDUAL 95 PCT CI'S FOR MEAN
                                BASED ON POOLED STDEV
LEVEL     N     MEAN    STDEV   --------+---------+---------+--------
    1     9    83.33    13.73                (-------*--------)
    2     4    77.75     7.14      (------------*-----------)
    3     3    77.00    11.27    (-------------*-------------)
    4     4    81.50     9.15          (------------*-----------)
                                --------+---------+---------+--------
POOLED STDEV =    11.63                70        80        90

The p value of the analysis of variance table is .793. If we desire a significance level of .05 , then we cannot conclude that there is any difference between the classes as to how they did on test 4. The confidence intervals provide a visual way of determining which groups are different, if any. If we had found significance in the analysis of variance, we then would have looked at the confidence intervals. We would conclude that two groups are different from each other if their confidence intervals did not overlap. More complicated tests to test differences in means are available.
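
The entries of the analysis of variance table are related by simple arithmetic: each MS is SS divided by DF, F is the ratio of the two mean squares, and the pooled standard deviation is the square root of the error mean square. A short Python sketch using the printed (rounded) values, with variable names chosen for illustration:

```python
# SS and DF copied from the ONEWAY table (already rounded by MINITAB).
ss_between, df_between = 140, 3
ss_error, df_error = 2166, 16

ms_between = ss_between / df_between  # mean square for class
ms_error = ss_error / df_error        # mean square for error
f_value = ms_between / ms_error       # F statistic
pooled_stdev = ms_error ** 0.5        # POOLED STDEV

print(round(ms_between), round(ms_error), round(f_value, 2), round(pooled_stdev, 1))
```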

Using Dummy Variables in Regression

An equivalent to using analysis of variance is to use what are called dummy variables in regression. Dummy variables are indicator variables. They are normally coded 0 or 1. In order to use our 'CLASS' variable as dummy variables in regression, we would need to create these dummy variables:

X1 = 1 if 'CLASS' is a 1 (freshman), 0 otherwise
X2 = 1 if 'CLASS' is a 2 (sophomore), 0 otherwise
X3 = 1 if 'CLASS' is a 3 (junior), 0 otherwise

It is necessary to have only 3 dummy variables even though there are 4 classes: if X1, X2, and X3 are all 0, then we have a senior.

The CODE statement in MINITAB allows us to create these variables from the existing 'CLASS' variable. Let's use columns 10,11, and 12 to store these new variables. The statements to create the new variables would be the following.

MTB> NAME C10= 'X1' C11= 'X2' C12= 'X3'
MTB> CODE (1)1 (2 3 4)0 'CLASS' 'X1'
MTB> CODE (2)1 (1 3 4)0 'CLASS' 'X2'
MTB> CODE (3)1 (1 2 4)0 'CLASS' 'X3'

The first code statement asks that we code the value 1 to a 1, and the values 2,3, and 4 to a 0, for the variable 'CLASS', and put the results in the variable 'X1'.
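
The same recoding can be pictured in Python. This is only an illustrative sketch with a hypothetical helper named dummy; it uses the CLASS codes entered earlier with SET C8.

```python
def dummy(values, level):
    """Return 1 where a value equals `level`, 0 otherwise."""
    return [1 if v == level else 0 for v in values]

# The CLASS codes from the SET C8 example (1=freshman ... 4=senior).
class_codes = [4, 4, 1, 1, 3, 1, 2, 1, 3, 4, 2, 2, 1, 1, 1, 1, 2, 1, 3, 4]
x1 = dummy(class_codes, 1)  # freshman indicator
x2 = dummy(class_codes, 2)  # sophomore indicator
x3 = dummy(class_codes, 3)  # junior indicator

# Seniors are the rows where all three dummy variables are 0.
seniors = sum(1 for a, b, c in zip(x1, x2, x3) if a + b + c == 0)
print(sum(x1), sum(x2), sum(x3), seniors)
```

The four totals agree with the class counts in the frequency table, which is a quick way to confirm that the dummy variables were coded correctly.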

Then we could run a regression. If our dependent variable is 'TEST4' our independent variables would be the dummy variables for 'CLASS'. The model would be this:

Y= B0 + B1*X1 + B2*X2 + B3*X3

To run the regression, type:

MTB> REGRESS 'TEST4' 3 'X1' 'X2' 'X3'

The regression equation is

TEST4 = 81.5 + 1.83 X1 - 3.75 X2 - 4.50 X3

Predictor       Coef       Stdev   t-ratio        p
Constant      81.500       5.817     14.01    0.000
X1             1.833       6.991      0.26    0.796
X2            -3.750       8.227     -0.46    0.655
X3            -4.500       8.886     -0.51    0.619

s = 11.63     R-sq = 6.1%     R-sq(adj) = 0.0%

Analysis of Variance

SOURCE        DF          SS          MS         F        p
Regression     3       140.1        46.7      0.34    0.793
Error         16      2165.8       135.4
Total         19      2305.8

SOURCE        DF      SEQ SS
X1             1        96.9
X2             1         8.4
X3             1        34.7

Unusual Observations
Obs.       X1     TEST4       Fit Stdev.Fit  Residual  St.Resid
  15     1.00     60.00     83.33      3.88    -23.33     -2.13R

R denotes an obs. with a large st.resid.

We can see that the analysis of variance table produced by regression is the same as that produced by the ONEWAY command.

Analysis of Covariance

Analysis of Covariance refers to having both continuous and categorical independent variables in a model. This can be accomplished in MINITAB using the regression command.

Let's try to do an analysis of covariance with TEST4 as the dependent variable, TESTMEAN as the continuous variable, and the person's sex as a categorical variable. We would first have to recode sex to be 1 for males, 0 for females. Type:

MTB> NAME C12='SEXDUM'
MTB> CODE (1)1 (2)0 'SEX' 'SEXDUM'

Then compute an interaction term:

MTB> NAME C13='INTERAC'
MTB> LET C13='SEXDUM'*'TESTMEAN'

Then run the regression model:

MTB> REGRESS 'TEST4' 3 'TESTMEAN' 'SEXDUM' 'INTERAC'

The regression equation is

TEST4 = 55.3 + 0.333 TESTMEAN + 8.4 SEXDUM - 0.118 INTERAC

Predictor       Coef       Stdev   t-ratio        p
Constant       55.32       17.95      3.08    0.007
TESTMEAN      0.3330      0.2230      1.49    0.155
SEXDUM          8.40       28.12      0.30    0.769
INTERAC      -0.1179      0.3552     -0.33    0.744

s = 11.05     R-sq = 15.3%     R-sq(adj) = 0.0%

Analysis of Variance

SOURCE        DF          SS          MS         F        p
Regression     3       353.1       117.7      0.96    0.434
Error         16      1952.7       122.0
Total         19      2305.8

SOURCE        DF      SEQ SS
TESTMEAN       1       336.6
SEXDUM         1         3.1
INTERAC        1        13.4

Unusual Observations
Obs. TESTMEAN     TEST4       Fit Stdev.Fit  Residual  St.Resid
   4       62    100.00     77.06      5.40     22.94      2.38R
  19       40     70.00     68.64      9.40      1.36      0.23 X

R denotes an obs. with a large st.resid.
X denotes an obs. whose X value gives it large influence.

The p value in the analysis of variance tells us that the overall model is not significant, since .434 is not less than .05. It is also useful to look at the p values for the individual variables. The p value of the interaction term is .74, the p value of the sex term is .76, and the p value of the test mean is .15. Since the interaction effect is not significant, it could be dropped from the model. The regression command could be run again with only the sex and test mean as independent variables, to see if a model without an interaction term might be significant. Notice that the R squared is a rather low 15%, so it is unlikely we will make this model significant by dropping a term.

