Data Analysis

 

Basic Concepts for Analyzing Data 


1) Consider the following data set:

54 41 47 46 41

59 46 60 49 34

35 25 54 46 22

These are the numbers of home runs Babe Ruth hit in each of his 15 seasons with the New York Yankees. Each piece of data in the data set is called an observation. As a list of numbers, the data tells us very little. It is helpful to distribute the data into separate groups called classes.

Classes | Data in the class          | Frequency
20-29   | 22, 25                     | 2
30-39   | 34, 35                     | 2
40-49   | 41, 41, 46, 46, 46, 47, 49 | 7
50-59   | 54, 54, 59                 | 3
60-69   | 60                         | 1

Note that each class has the same width and that each observation in the data set falls into one and only one class. The frequency of a class is the number of observations that lie in that class. The data can now be displayed using a histogram.

[Figure: histogram of the home run data, one bar per class]

Note that the values of the observations are graphed on the horizontal axis and the frequency of each class is graphed on the vertical axis. The histogram gives a picture of how the data is distributed. The higher the bar, the more observations are in the class.
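The grouping into classes can also be sketched in a few lines of Python. This is only an illustration: the names `home_runs`, `CLASS_WIDTH`, and `class_label` are made up for this example, and the row of asterisks stands in for a bar of the histogram.

```python
from collections import Counter

# Home run counts copied from the data set above.
home_runs = [54, 41, 47, 46, 41, 59, 46, 60, 49, 34, 35, 25, 54, 46, 22]

CLASS_WIDTH = 10  # assumption: every class spans ten values, e.g. 20-29


def class_label(observation):
    """Return the label of the one class that contains the observation."""
    lower = (observation // CLASS_WIDTH) * CLASS_WIDTH
    return f"{lower}-{lower + CLASS_WIDTH - 1}"


# The frequency of a class is the number of observations that fall in it.
frequencies = Counter(class_label(x) for x in home_runs)

# Print one row of asterisks per class: a sideways text histogram.
for label in sorted(frequencies):
    count = frequencies[label]
    print(f"{label}  {'*' * count}  ({count})")
```

The longest row of asterisks belongs to the class 40-49, which matches the tallest bar of the histogram.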

2) There are three different ways to determine the "center" or most "typical" value of a data set. They are the mean, median, and mode.

Consider the data set:  3, 5, 1, 14, and 7

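Mean – The average of the observations in a data set. To find it, add the observations and divide by the number of observations. For the data set above, the mean is (3 + 5 + 1 + 14 + 7) / 5 = 30 / 5 = 6.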

Median – The "middle" value of the data set. To find it, first put the data in order: 1, 3, 5, 7, 14. The median is the observation that lies in the middle, in this case 5. Now consider the data: 1, 3, 5, 7, 14, 23. With an even number of observations there is no single middle value, so the median is the average of the two middle observations, 5 and 7, which is 6.

Mode – The most frequent observation in the data set. The data set 3, 5, 1, 14, 7 has no mode since each observation occurs only once. The mode of the data set 3, 5, 7, 1, 5, 3, 5, 14 would be 5.
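The three measures of center can be checked with Python's standard `statistics` module. This is a small sketch. The helper `modes` is a made-up name for this example, and it follows the convention used here that a data set in which every observation occurs once has no mode.

```python
import statistics
from collections import Counter

data = [3, 5, 1, 14, 7]

mean = statistics.mean(data)      # sum of the observations divided by their count
median = statistics.median(data)  # middle value once the data is put in order

# With an even number of observations, the median is the average
# of the two middle values.
median_even = statistics.median([1, 3, 5, 7, 14, 23])


def modes(values):
    """Return the most frequent value(s), or an empty list if nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every observation occurs once: no mode
    return [value for value, count in counts.items() if count == highest]


print(mean, median, median_even)
print(modes(data))
print(modes([3, 5, 7, 1, 5, 3, 5, 14]))
```

Returning a list lets the helper also report a data set with more than one mode.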

3) Data that has a bell shape like that below is said to have a normal distribution. The normal distribution is very important and many sets of data encountered in the real world resemble a normal distribution when displayed graphically. For a normal distribution the mean, median, and mode are equal. The standard deviation (st dev) is a measure of how spread out the curve is around the mean. The larger the st dev the more spread out the curve.

The values –1, –2, –3 and 1, 2, 3 are called z-scores and stand for one, two, or three st dev below or above the mean, respectively. If the curve above had a mean of 55 and a st dev of 5, then we could label the curve with actual observation values instead of z-scores as follows:

We would say, for example, that 45 has a z-score of –2 since it lies two st dev below the mean, and 60 has a z-score of 1 since it lies one st dev above the mean. The mean has a z-score of zero since it is zero st dev from the mean. The z-score of an observation such as 42 is not so clear. It is clear, however, that 42 lies between 2 and 3 st dev below the mean, and thus it makes sense that the z-score should lie somewhere between –3 and –2. The exact value can be found using the following formula:

$$z = \frac{\text{observation} - \text{mean}}{\text{st dev}}$$

Thus the exact z-score for 42 will be:

$$z = \frac{42 - 55}{5} = -2.6$$

which is between –3 and –2 as we predicted. Thus 42 lies 2.6 st dev below the mean. Now suppose we would like to know what observation lies 1.7 st dev above the mean, that is, has a z-score of 1.7. Clearly the value must be somewhere between 60 and 65. It can be found using the following formula:

$$\text{observation} = \text{mean} + z \times \text{st dev}$$

Thus the observation will be:

$$\text{observation} = 55 + 1.7 \times 5 = 63.5$$

These two formulas can be used to find a z-score when given an observation and an observation when given a z-score.
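As a quick check, both conversions can be written as short Python functions. This is a sketch that assumes the example curve with a mean of 55 and a st dev of 5. The names `z_score` and `observation_from_z` are made up for this illustration.

```python
MEAN = 55    # mean of the example curve
ST_DEV = 5   # st dev of the example curve


def z_score(observation, mean=MEAN, st_dev=ST_DEV):
    """How many st dev the observation lies above (+) or below (-) the mean."""
    return (observation - mean) / st_dev


def observation_from_z(z, mean=MEAN, st_dev=ST_DEV):
    """The observation that lies z st dev away from the mean."""
    return mean + z * st_dev


print(z_score(42))              # z-score of the observation 42
print(observation_from_z(1.7))  # observation with a z-score of 1.7
```

Each function undoes the other, so converting 42 to a z-score and back returns 42.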

4) For normal distributions the empirical rule says:

68% of the observations lie between 1 st dev below the mean and 1 st dev above the mean.

95% of the observations lie between 2 st dev below the mean and 2 st dev above the mean.

99.7% of the observations lie between 3 st dev below the mean and 3 st dev above the mean.

Consider the example of a normal distribution with a mean of 55 and a st dev of 5.

 

 

The empirical rule states that 68% of the observations in the data set will lie between 50 and 60, 95% of the observations will lie between 45 and 65, and 99.7% of the observations will lie between 40 and 70.

Memorizing the empirical rule will give you a good intuition for how normal data is distributed. Suppose you saw an observation that lay 4 st dev above the mean. You should immediately realize that this is a very rare observation, since only 0.3% of the observations lie more than three st dev from the mean.
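The percentages in the empirical rule are rounded values, and they can be checked against an exact normal curve. The sketch below uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later) with the example mean of 55 and st dev of 5. The helper `share_within` is a made-up name for this illustration.

```python
from statistics import NormalDist

# Normal distribution from the example: mean 55, st dev 5.
dist = NormalDist(mu=55, sigma=5)


def share_within(k):
    """Fraction of observations within k st dev of the mean."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    # cdf(x) is the fraction of observations at or below x.
    return dist.cdf(upper) - dist.cdf(lower)


for k in (1, 2, 3):
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    print(f"between {lower:g} and {upper:g}: {share_within(k):.1%}")

# Fraction of observations more than 4 st dev from the mean (very rare).
print(f"outside 4 st dev: {1 - share_within(4):.4%}")
```

The printed shares agree with the empirical rule after rounding, and the last line shows how unusual an observation 4 st dev from the mean would be.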

 

Designed by John Neely, Math Center Coordinator, University of Northern Iowa, Spring 2004
Last revised 1/31/06