Chapter 4: Analysing the Data Part II : Descriptive Statistics

# Measures of central tendency

Measures of central tendency, or "location", attempt to quantify what we mean when we think of the "typical" or "average" score in a data set. The concept is extremely important and we encounter it frequently in daily life. For example, before purchasing a car we often want to know its average distance per litre of petrol. Or before accepting a job, you might want to know what a typical salary is for people in that position so you will know whether or not you are going to be paid what you are worth. Or, if you are a smoker, you might often think about how many cigarettes you smoke "on average" per day. Statistics geared toward measuring central tendency all focus on this concept of "typical" or "average." As we will see, we often ask questions in psychological science revolving around how groups differ from each other "on average". Answers to such questions tell us a lot about the phenomenon or process we are studying.

Mode. By far the simplest, but also the least widely used, measure of central tendency is the mode. The mode in a distribution of data is simply the score that occurs most frequently. In the distribution of sexual partners data, the mode is "1" because it is the most frequently occurring score in the data set. If you have had only one sexual partner in the last year, it would be reasonable therefore to say that you are fairly typical of UNE students (or at least of those students who responded to the question). Importantly, you can't necessarily claim that "most" UNE students had only one sexual partner last year. From the frequency distribution, notice that actually fewer than half of the respondents reported having only one sexual partner. So "most" students reported having something other than 1 sexual partner. Still, "1" was the most frequent single response to this question, and so it is the mode or modal response. In some cases, however, the claim that "most" scores equal the mode would be justified. For example, from Figure 3.4, you can see that the modal ethnic group in the U.S. in 1990 was "white" and "most" people living in the U.S. were "white."
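The distinction between "most frequent" and "most" can be checked with a few lines of Python. This is only a sketch: the scores below are made up for illustration (they are not the UNE survey data), and the variable names are our own.

```python
from collections import Counter

# Hypothetical responses (NOT the survey data), chosen so that the most
# frequent score is reported by fewer than half of the respondents.
scores = [0, 1, 1, 1, 1, 2, 2, 3, 4, 6]

counts = Counter(scores)
# most_common(1) returns a list holding the single (value, frequency) pair
# with the highest frequency.
mode_value, mode_count = counts.most_common(1)[0]
share = mode_count / len(scores)

print(f"mode = {mode_value}, reported by {mode_count} of {len(scores)} scores")
print(f"is the mode also a majority? {share > 0.5}")
```

Here the mode is the most frequent single response, yet the people who gave it are still a minority of the sample.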

Recall that one way of describing a distribution is in terms of the number of modes in the data. A unimodal distribution has one mode. In contrast, a bimodal distribution has two. Now this might seem odd to you. How can there be more than one "most frequently occurring" score in a data set? I suppose statisticians are a bit bizarre in this way. We would accept that a distribution is bimodal if it seems that more than one score or value "stands out" as occurring especially frequently in comparison to other values. But when the data are quantitative in nature, we'd also want to make sure that the two more frequently occurring scores are not too close to each other in value before we'd accept the distribution as one that could be described as "bimodal." So there is some subjectivity in the decision as to whether or not a distribution is best characterised as unimodal, bimodal, or multimodal.

Median. Technically, the median of a distribution is the value that cuts the distribution exactly in half, such that an equal number of scores are larger than that value as there are smaller than that value. The median is by definition what we call the 50th percentile. This is an ideal definition. Often a distribution can't be cut exactly in half in this way, but we can still define its median. Distributions of qualitative data do not have a median.

The median is most easily computed by sorting the data in the data set from smallest to largest. The median is the "middle" score in the distribution. Suppose we have the following scores in a data set: 5, 7, 6, 1, 8. Sorting the data, we have: 1, 5, 6, 7, 8. The "middle score" is 6, so the median is 6. Half of the (remaining) scores are larger than 6 and half of the (remaining) scores are smaller than 6.

To derive the median, use the following rule. First, compute (n+1)/2, where n is the number of data points. Here, there are 5, so n = 5. If (n+1)/2 is an integer, the median is the value that is in the (n+1)/2 location in the sorted distribution. Here, (n+1)/2 = 6/2 or 3, which is an integer. So the median is the 3rd score in the sorted distribution, which is 6. If (n+1)/2 is not an integer, then there is no "middle" score. In such a case, the median is defined as one half of the sum of the two data points that hold the two nearest locations to (n+1)/2. For example, suppose the data are 1, 4, 6, 5, 8, 0. The sorted distribution is 0, 1, 4, 5, 6, 8. n = 6, and (n+1)/2 = 7/2 = 3.5. This is not an integer. So the median is one half of the sum of the 3rd and 4th scores in the sorted distribution. The 3rd score is 4 and the 4th score is 5. One half of 4 + 5 is 9/2 or 4.5. So the median is 4.5. Here, notice that half of the scores are above 4.5 and half are below. In this case, the ideal definition is satisfied. Also, notice that the median may not be an actual value in the data set. Indeed, the median may not even be a possible value.
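The (n+1)/2 rule above can be written out as a short Python function. This is a minimal sketch; the function name median_by_rule is our own, and the two calls at the end reuse the two small data sets from the text.

```python
def median_by_rule(data):
    """Median via the (n+1)/2 location rule described in the text."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    if (n + 1) % 2 == 0:
        # (n+1)/2 is an integer: take the score at that 1-based location.
        location = (n + 1) // 2
        return ordered[location - 1]
    # (n+1)/2 is not an integer: average the scores at the two nearest
    # locations, one just below and one just above (n+1)/2.
    lower = n // 2      # 1-based location just below (n+1)/2
    upper = lower + 1   # 1-based location just above (n+1)/2
    return (ordered[lower - 1] + ordered[upper - 1]) / 2


print(median_by_rule([5, 7, 6, 1, 8]))     # 6
print(median_by_rule([1, 4, 6, 5, 8, 0]))  # 4.5
```

The subtraction of 1 when indexing is needed only because Python lists count from 0, whereas the rule counts locations from 1.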

The median number of sexual partners last year is 1. Here, n = 177, and (n+1)/2 = 178/2 = 89, an integer. So in the sorted distribution, the 89th data point is the median. In this case, the 89th score is a 1. Notice that this doesn't meet the ideal definition, but we still call it the median. It certainly isn't true that half of the people reported having fewer than 1 sexual partner, and half reported having more than 1. Violations of the ideal definition will occur when the median value occurs more than once in the distribution, which is true here. There are many "1"s in the data.
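To see how repeated values break the ideal definition, we can count the scores that fall below, at, and above the median. The sketch below uses a small made-up data set (not the survey data) and Python's standard statistics.median function.

```python
from statistics import median

# Hypothetical scores (NOT the survey data) in which one value repeats often.
scores = [0, 1, 1, 1, 1, 1, 2, 3, 6]

m = median(scores)
below = sum(1 for x in scores if x < m)
equal = sum(1 for x in scores if x == m)
above = sum(1 for x in scores if x > m)

print(f"median = {m}: {below} below, {equal} equal, {above} above")
```

Because the median value occurs several times, the counts below and above it are unequal, and neither is half of the data.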

Computing the median seems like a lot of work. But computers do it quite easily (see Output 4.2). In real life, you'd rarely have to compute the median by hand but there are some occasions where you might, so you should know how.

Mean. The mean, or "average", is the most widely used measure of central tendency. The mean is defined technically as the sum of all the data scores divided by n (the number of scores in the distribution). In a sample, we often symbolise the mean with a letter with a line over it. If the letter is "X", then the mean is symbolised as $\bar{X}$, pronounced "X-bar." If we use the letter X to represent the variable being measured, then symbolically, the mean is defined as

$$\bar{X} = \frac{\sum X}{n}$$

For example, using the data from above, where the n = 5 values of X were 5, 7, 6, 1, and 8, the mean is (5 + 7 + 6 + 1 + 8) / 5 = 5.4. The mean number of sexual partners reported by UNE students who responded to the question is, from Figure 4.1, (1 + 0 + 2 + 4 + . . . + 0 + 6 + 2 + 2)/ 177 = 1.864. Note that this is higher than both the mode and the median. In a positively skewed distribution, the mean will be higher than the median because its value will be dragged in the direction of the tail. Similarly, in a negatively skewed distribution, the mean will be dragged lower than the median because of the extremely small values in the left-hand tail. Distributions of qualitative data do not have a mean.
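The arithmetic for the small data set, and the way a tail drags the mean away from the median, can be sketched in Python. The first list is the five-score example from the text; the two skewed lists are made up for illustration and are not the survey data.

```python
from statistics import mean, median

# The five scores from the text: (5 + 7 + 6 + 1 + 8) / 5
small = [5, 7, 6, 1, 8]
print(mean(small))  # 5.4

# Hypothetical positively skewed scores: one extreme value in the right tail.
right_skewed = [0, 1, 1, 1, 2, 2, 3, 10]
print(mean(right_skewed), median(right_skewed))  # 2.5 1.5

# The mirror image (10 minus each score): one extreme value in the left tail.
left_skewed = [10 - x for x in right_skewed]
print(mean(left_skewed), median(left_skewed))    # 7.5 8.5
```

In both skewed lists the median stays near the bulk of the scores, while the single extreme score pulls the mean toward its tail.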

While probably not intuitively obvious, the mean has a very desirable property: it is the "best guess" for a score in the distribution, when we measure "best" as LEAST IN ERROR. This might seem especially odd because, in this case, no one would report 1.864 sexual partners, so if you guessed 1.864 for someone, you'd always be wrong! But if you measure how far off your guess would tend to be from the actual score that you are trying to guess, 1.864 would produce the smallest error in your guess. It is worth elaborating on this point because it is important. Suppose I put the data into a hat, and pulled the scores out of the hat one by one, and each time I ask you to guess the score I pulled out of the hat. After each guess, I record how far off your guess was, using the formula: error = actual score - guess. Repeating this procedure for all 177 scores, we can compute your mean error. Now, if you always guessed 1.864, your mean error would be, guess what? ZERO! Guessing any other single value every time would produce a mean error different from zero. Because of this, the mean is often used to characterise the "typical" value in a distribution. No other single number we could report would more accurately describe EVERY data point in the distribution.
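The hat game can be simulated on the small five-score data set from earlier in the chapter (the 177 survey scores are not reproduced here). This is only a sketch; the helper name mean_error is our own, and the guesses 6 and 5 are arbitrary alternatives chosen for comparison.

```python
def mean_error(scores, guess):
    """Average of (actual score - guess) when the same guess is made every time."""
    return sum(x - guess for x in scores) / len(scores)


scores = [5, 7, 6, 1, 8]
x_bar = sum(scores) / len(scores)  # the mean, 5.4

# Compare always guessing the mean with always guessing 6 or always guessing 5.
for guess in (x_bar, 6, 5):
    # Rounding hides tiny floating-point residue in the arithmetic.
    print(guess, round(mean_error(scores, guess), 10))
```

Always guessing the mean gives a mean error of zero (up to floating-point rounding), whereas always guessing 6 gives a mean error of -0.6 and always guessing 5 gives 0.4.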