Chapter 1: An Introduction to Analyzing Data
1.3: Measures of Center
Learning Objectives
Calculate the mode, median, and mean for a set of data.
Explain the conceptual differences between the mode, median, and mean.
Identify the symbols and know the formulas for sample and population means.
Describe how outliers can influence the mean and median.
Solve problems involving the mode, mean, and median.
Calculate a weighted mean.
Introduction
Once data are collected, it is useful to summarize the data set by identifying a value around which the data are centered or clustered. Many times we refer to this idea as an average, but we must be careful to specify what we are focusing on to call something 'average.'
Three commonly used measures of central tendency are the mode, the median, and the mean. This lesson examines each measure of center based on the primary characteristic it uses to determine "average-ness."
The Mode
The mode can be a useful measure of data when that data falls into a small number of categories. It is simply a measure of the most common number, or sometimes the most popular choice. The mode is an especially useful concept for categorical data sets at the nominal level, such as eye color or favorite ice-cream flavor, where it wouldn't make sense to talk about a mean or median. In a previous section, we referred to the data with the Galapagos tortoises and noted that the variable 'Climate Type' was such a measurement. For that example, the mode is the value 'humid'.
Example 1
Antoinette collected color data on the vehicles in the parking lot, as shown in the table:
| Color | Frequency |
|---|---|
| Blue | 3 |
| Green | 5 |
| Red | 4 |
| White | 3 |
| Black | 2 |
| Gray | 3 |
For this data, 'Green' is the mode because it is the data value that occurred most often.
Example 2
The students in a statistics class were asked to report the number of children that live in their house (including brothers and sisters temporarily away at college). The data are recorded below:
1, 3, 4, 3, 1, 2, 2, 2, 1, 2, 2, 3, 4, 5, 1, 2, 3, 2, 1, 2, 3, 6
The mode can also be used as a measure of center for quantitative data. In this example, the mode could be a useful statistic that would tell us something about the families of statistics students in our school. In this case, 2 is the mode, as it is the most frequently occurring number of children in the sample, telling us that more students in the class come from families where there are 2 children than from any other household size.
If there were seven 3-child households and seven 2-child households, we would say the data set has two modes, or the data set is bimodal. When a data set is described as being bimodal, it is clustered about two different modes. Technically, there can be more than two modes, but the more modes there are, the more trivial the mode becomes. In these cases, we would most likely search for a different statistic to describe the center of such data. You might encounter data sets with two or even three modes, but more than that would be unlikely unless you are working with very small sample sets.
If there is an equal number of each data value, the mode is not useful in helping us understand the data, and thus, we say the data set has no mode.
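As a rough illustration, the mode of the Example 2 data can be found by counting how often each value occurs. The sketch below uses Python's standard `collections.Counter`; the helper name `modes` is our own and is not part of the lesson. It returns every value tied for the highest count, and it returns an empty list when all values occur equally often, matching the "no mode" convention described above.

```python
from collections import Counter

def modes(data):
    """Return the most frequent value(s), or [] if every value is equally common."""
    if not data:
        return []
    counts = Counter(data)
    highest = max(counts.values())
    # If several different values all share the same count, the text's
    # convention is that the data set has no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(value for value, c in counts.items() if c == highest)

children = [1, 3, 4, 3, 1, 2, 2, 2, 1, 2, 2, 3, 4, 5, 1, 2, 3, 2, 1, 2, 3, 6]
print(modes(children))           # [2]
print(modes([2, 2, 3, 3, 5]))    # [2, 3]  (bimodal)
print(modes([1, 2, 3, 4]))       # []      (no mode)
```

The same function works for categorical data such as a list of vehicle colors, since it only counts values and never does arithmetic on them.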
The Mean
Another measure of central tendency is the arithmetic average, or mean. We will look at two interpretations of the mean that are often seen in elementary school curricula.
Leveling Out
One interpretation of the mean is "leveling out." The mean is the value that results from equal sharing of a set of values among n individuals.
Suppose five children each made towers out of blocks. Child A has 2 blocks, Child B has 4 blocks, Child C has 7 blocks, Child D has 3 blocks, and Child E has 4 blocks as shown.
If these children were asked to equally share their blocks, Child C would give some blocks to Child A and Child D.
Now, all five children have 4 blocks and the stacks are leveled. The number of blocks in each stack, 4, is the mean. Note that during this entire process of leveling the total number of blocks remained 20 and the number of stacks remained 5. The only difference was the ways the 20 blocks were distributed among the 5 stacks. The data began as {2, 4, 7, 3, 4} and ended as {4, 4, 4, 4, 4}.
Another way the children could have fairly shared the blocks would be to dump all the blocks into a box and then take turns picking blocks until they were all gone. This process mirrors the typical formula we all know for calculating the mean: add all the data values and divide the sum by the total number of data values.
Symbolically, the formula for the sample mean is
\(\overline{x}= \frac{\sum_{}^{} x_i}{n} = \frac{x_1+x_2+\ldots+x_n}{n}\) where \(x_i\) is the \(i^\text{th}\) data value of the sample and \(n\) is the sample size.
Statisticians use the symbol \(\overline{x}\) to represent the mean when \(x\) is the symbol for a single measurement. Read \(\overline{x}\) as “x bar.” The formula for finding the mean of a population is the same, but we denote that it was calculated through a census by the Greek letter, μ. The sample mean \(\overline{x}\) is a statistic because it is a measure of a sample, and μ is a parameter because it is a measure of a population. We say that \(\overline{x}\) is an estimate of μ.
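The formula can be checked against the block towers. The short sketch below (the function name `sample_mean` is our own) adds the data values and divides by the number of values, which is the "dump the blocks in a box and share them out" process in code form.

```python
def sample_mean(values):
    """Add all the data values and divide by how many there are."""
    return sum(values) / len(values)

blocks = [2, 4, 7, 3, 4]           # towers built by Children A through E
print(sum(blocks), len(blocks))    # 20 5
print(sample_mean(blocks))         # 4.0
```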
Balance Point
Another common interpretation of the mean is as the balance point of a distribution. To illustrate this physical interpretation, recall the data on the number of children per household from Example 2:
1, 3, 4, 3, 1, 2, 2, 2, 1, 2, 2, 3, 4, 5, 1, 2, 3, 2, 1, 2, 3, 6.
There are 22 students in this class, and the total number of children in all of their houses is 55, so the mean of this data is \(\overline{x}=\frac{55}{22}=2.5\). Here is a graph of that data:
Suppose we use snap cubes to make a physical model of the graph, using one cube to represent each student’s family and a row of six cubes at the bottom to hold them together, like this:
It turns out that the model that you created balances at 2.5. In the pictures below, you can see that a block placed at 2 causes the graph to tip right, while one placed at 3 causes the graph to tip left. However, if you place the block at 2.5, it balances perfectly!
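One way to see why the model balances at the mean is to measure how far each data value sits from 2.5. The sketch below, which is an illustration rather than part of the original lesson, adds up the distances on each side. The variable names `left_pull` and `right_pull` are our own.

```python
children = [1, 3, 4, 3, 1, 2, 2, 2, 1, 2, 2, 3, 4, 5, 1, 2, 3, 2, 1, 2, 3, 6]
mean = sum(children) / len(children)       # 55 / 22 = 2.5

# Signed distance of each value from the mean; negative means left of it.
deviations = [x - mean for x in children]
left_pull = sum(d for d in deviations if d < 0)
right_pull = sum(d for d in deviations if d > 0)

print(mean)                     # 2.5
print(left_pull, right_pull)    # -11.5 11.5
```

The total distance to the left of the mean equals the total distance to the right, so the deviations sum to zero. That is the numerical version of the cubes balancing.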
The Median
The median is simply the middle number in an ordered set of data.
Suppose a student took five statistics quizzes and received the following scores: 80, 94, 75, 96, 90. To find the median, you must put the data in order. The median will be the score that is in the middle. Placing the data in order from least to greatest yields: 75, 80, 90, 94, 96. The middle number in this case is the third score, or 90, so the median of this data is 90.
When there is an even number of data values, no single data value is in the middle. In this case, we take the mean of the two middle values.
Example 3
Find the median of the following quiz scores: 91, 83, 97, 89.
Place them in numeric order: 83, 89, 91, 97. The second and third numbers straddle the middle of this set. The mean of these two numbers is 90, so the median of the data is 90.
While it is easy to find the "middle" of a small set of numbers just by looking, this is a bit more trouble when the data set is large. It may be useful to have a guide to help. To find the median,
- Begin by listing the data in order from smallest to largest, or largest to smallest.
- If the number of data values \(n\) is odd, then the median will be the middle data value. To find the position of the median, round \(n/2\) up to the next whole number and count until you reach that position.
- If the number of data values \(n\) is even, there is no one middle value. Find the mean of the values in the data at the \((n/2)^\text{th}\) and \((n/2 +1)^\text{th}\) locations.
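The steps above can be sketched in Python. The function name `median_by_position` is our own, and the code follows the guide literally: order the data, then pick the middle position (odd \(n\)) or average the two middle positions (even \(n\)).

```python
import math

def median_by_position(data):
    ordered = sorted(data)                 # step 1: order the data
    n = len(ordered)
    if n % 2 == 1:
        position = math.ceil(n / 2)        # round n/2 up to the next whole number
        return ordered[position - 1]       # positions count from 1, list indexes from 0
    lower = ordered[n // 2 - 1]            # value in the (n/2)th position
    upper = ordered[n // 2]                # value in the (n/2 + 1)th position
    return (lower + upper) / 2

print(median_by_position([80, 94, 75, 96, 90]))   # 90
print(median_by_position([91, 83, 97, 89]))       # 90.0
```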
Example 4
Let's revisit the sample data about the number of children in the household from Example 2. Find the median.
1, 3, 4, 3, 1, 2, 2, 2, 1, 2, 2, 3, 4, 5, 1, 2, 3, 2, 1, 2, 3, 6
Begin by ordering the data: 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 6.
There is an even number of data values, \(n=22\). We need to find the mean of the numbers in the \(22/2=11^\text{th}\) position and the \(22/2+1=12^\text{th}\) position. Both are 2, so the median is \(\frac{2+2}{2}=2\).
Technology Note: Finding the Mean and Median on the TI-83/84 Calculator
1. Press [STAT] to open the Statistics menu. Choose 1: Edit on the menu to get a screen that allows you to enter the data in a column. Press [ENTER].
2. Enter the number of children data in L1. Be sure to press [ENTER] after each value.
3. Again, press [STAT] and use the right arrow to access the CALC menu.
4. Choose 1:1-Var Stats. You will have one of the two screens shown. If you have a screen like the one on the top, press [2ND] [1] for the list L1 and press [ENTER]. If you have a STAT WIZARD screen that looks like the one on the bottom, enter the name of the list by pressing [2ND] [1]. Leave FreqList blank. Choose Calculate and press [ENTER].
5. The calculator will display several values. We will learn about all these statistics later. The first screen shows the mean \(\overline{x}\), the sum of the data values \(\sum x\), and the number of data values \(n\). If you use the down arrow, you will also see the median (Med).
Example 5
This table shows the number of home runs for the past ten baseball games of the season. Find the median number of home runs.
| Number of Home Runs | Frequency |
|---|---|
| 4 | 1 |
| 5 | 2 |
| 6 | 0 |
| 7 | 2 |
| 8 | 4 |
| 9 | 1 |
Rather than list all the data, just work in the table. There are 10 data values, so \(n=10\), which is even. We need to find the data values in the \(10/2=5^\text{th}\) position and \(10/2+1=6^\text{th}\) position. The data are ordered, so just count through the frequencies to find the data values in these positions and average them. The median is \(\frac{7+8}{2}=7.5\) home runs.
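Counting through a frequency table can be sketched as a running total. In the illustration below, the helper names `value_at_position` and `median_from_table` are our own, and the table is assumed to be sorted by data value, as it is in this example.

```python
def value_at_position(table, position):
    """Return the data value at a 1-based position in (value, frequency) rows."""
    running_total = 0
    for value, frequency in table:
        running_total += frequency
        if position <= running_total:
            return value
    raise ValueError("position is larger than the number of data values")

def median_from_table(table):
    n = sum(frequency for _, frequency in table)
    if n % 2 == 1:
        return value_at_position(table, (n + 1) // 2)
    lower = value_at_position(table, n // 2)
    upper = value_at_position(table, n // 2 + 1)
    return (lower + upper) / 2

home_runs = [(4, 1), (5, 2), (6, 0), (7, 2), (8, 4), (9, 1)]
print(median_from_table(home_runs))   # 7.5
```

The running totals are 1, 3, 3, 5, 9, 10, so the 5th value is 7 and the 6th value is 8.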
The Mean vs. the Median
Both the mean and the median are important and widely used measures of center. Is one better to use than the other? It depends, as the following example shows.
Example 6
Suppose you got an 85 and a 93 on your first two statistics quizzes, but then you had a really bad day and got a 14 on your next quiz!
The mean of your three grades would be 64: \(\frac {85+93+14}{3}=64\)
The median of your three grades would be 85: \(14 \ \ \boxed{85} \ \ 93\)
Which is a better measure of your performance? The median does not change if the lowest grade is an 84, or if the lowest grade is a 14. However, when you add the three numbers to find the mean, the sum will be much smaller if the lowest grade is a 14.
The mean and the median are so different in this example because there is one grade that is extremely different from the rest of the data. In statistics, we call such an extreme value an outlier. The mean is affected by the presence of an outlier; however, the median is not. A statistic that is not affected by outliers is called resistant.
We say that the median is a resistant measure of center, and the mean is not resistant. In a sense, the median is able to resist the pull of a far away value, but the mean is drawn to such values. It cannot resist the influence of outlier values. As a result, when we have a data set that contains an outlier, it is often better to use the median to describe the center, rather than the mean.
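The effect can be checked numerically with the quiz grades from Example 6. This small sketch compares the two cases mentioned there, a lowest grade of 84 and a lowest grade of 14. The local `mean` helper is our own, and `median` comes from Python's standard `statistics` module.

```python
from statistics import median

def mean(values):
    return sum(values) / len(values)

for lowest in (84, 14):
    grades = [85, 93, lowest]
    print(f"lowest={lowest}: mean={mean(grades):.1f}, median={median(grades)}")
# lowest=84: mean=87.3, median=85
# lowest=14: mean=64.0, median=85
```

The median stays at 85 in both cases, while the mean drops by more than 23 points when the outlier is present.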
Example 7
In 2005, the CEO of Yahoo, Terry Semel, was paid almost $231,000,000. This is certainly not typical of what the average worker at Yahoo could expect to make. Instead of using the mean salary to describe how Yahoo pays its employees, it would be more appropriate to use the median salary of all the employees. The CEO's salary will have a big impact on the mean and inflate it to the point where it might no longer be representative.
You will often see medians used to describe the typical salary or the average value of houses in a neighborhood, as the presence of a very few extremely well-paid employees or expensive homes could make the mean appear misleadingly large.
The Trimmed Mean
Since the mean is not resistant to the effects of outliers, many students ask their teacher to “drop the lowest grade.” The argument is that everyone has a bad day, and one extreme grade that is not typical of the rest of their work should not have such a strong influence on their average. The problem is that this can work both ways; it could also be true that a student who is performing poorly most of the time could have a really good day (or even get lucky) and get one extremely high grade. We wouldn’t blame this student for not asking the teacher to drop the highest grade!
Attempting to more accurately describe a data set by removing the extreme values is referred to as "trimming the data." To be fair, a valid trimmed statistic must remove both the extreme maximum and minimum values. So, while some students might disapprove, to calculate a trimmed mean, you remove the maximum and minimum values, add the values that remain, and divide by the number of values that remain.
Example 8
Barron’s Profiles of American Colleges, 19th Edition, lists average class size for introductory lecture courses at each of the profiled institutions. A sample of 20 colleges and universities in California for introductory lecture courses resulted in the following:
14 20 20 20 20 23 25 30 30 30 35 35 35 40 40 42 50 50 80 80
The mean of all 20 data values is \(\overline{x}=\frac{\sum x}{n}=\frac{719}{20}\approx 36.0\) and the median is \(\frac{30+35}{2}=32.5\). The mean and the median are quite different.
To find a 5% trimmed mean, we need to remove the lowest 5% and highest 5% of data values and calculate the mean again. Find 5% of 20, or 1. Remove the one smallest data value and one largest data value from the sample. Then, calculate the mean again. (The median won't change.)
20 20 20 20 23 25 30 30 30 35 35 35 40 40 42 50 50 80
The trimmed mean is \(\overline{x}=\frac{\sum x}{n}=\frac{625}{18}\approx 34.7\). This new mean is closer to the median because we have trimmed the extreme values, which affect the mean more than they do the median.
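A trimmed mean can be sketched as follows. The function name `trimmed_mean` is our own, and rounding the trim count to the nearest whole number is an assumption made for this illustration; the example here only needs the case where 5% of 20 is exactly 1.

```python
def trimmed_mean(data, percent):
    """Drop `percent`% of the values from each end, then average what is left."""
    ordered = sorted(data)
    # Assumption: round the number of values to drop to a whole number.
    k = round(len(ordered) * percent / 100)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

class_sizes = [14, 20, 20, 20, 20, 23, 25, 30, 30, 30,
               35, 35, 35, 40, 40, 42, 50, 50, 80, 80]
print(round(trimmed_mean(class_sizes, 0), 2))   # 35.95
print(round(trimmed_mean(class_sizes, 5), 2))   # 34.72
```

With `percent=0` nothing is removed, so the result is the ordinary mean. With `percent=5` the values 14 and one 80 are dropped, leaving 18 values that sum to 625.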
Problem Solving with the Mean
Typically, we think about finding the mean from a set of data that is given or collected. But, there may also be situations where you know the mean and want to find a missing data value or even create the entire data set. We need to think about the formula for the mean flexibly.
We can rewrite the formula for the mean algebraically to solve for the sum of the data values:
\(\begin{align*} \overline{x}&= \frac{\sum_{}^{} x}{n}\\ n\cdot \overline{x}&=\sum_{}^{} x\\ \sum_{}^{} x &=n\cdot \overline{x} \end{align*}\).
This shows that the sum of all the data equals the product of the number of data values and the mean.
Example 9
The 28 students in Ms. Tracy's homeroom collected canned food items for the food drive. If the mean number of cans collected by each student was 12, how many cans of food did the class collect?
While we don't know how many cans any of the 28 individual students collected, we know that the mean was 12 cans. This tells us that \(n=28\) and \(\overline{x}=12\). Substituting into the formula that is solved for the sum of the data gives \(\sum_{}^{} x =n\cdot \overline{x}=28 \cdot 12=336\) cans.
Sometimes we will need to use a similar process to find a missing individual data value.
Example 10
To qualify for the bowling tournament, Mara must have a 180 bowling score average in her most recent 5 games. So far, Mara has bowled 4 games with scores of 170, 184, 160, and 195. What score must Mara get in her 5th game to have a 180 average?
Again, we know \(n=5\) and \(\overline{x}=180\). We also know four of the individual data values but not the fifth one: \(x_1=170, \ x_2=184,\ x_3=160, \ x_4=195, \text{ and } x_5=?\)
\(\begin{align*} \sum x &= n\cdot \overline{x}\\ 170+184+160+195+ x_5&=5\cdot180\\ 709+x_5&=900\\ x_5&=900-709 \\ x_5&=191 \end{align*}\)
Mara needs to score 191 to bring her bowling average to 180.
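The same reasoning can be written as a small helper. This is only a sketch: the function name `score_needed` and its parameters are our own, and it assumes exactly one value is missing. It uses the rearranged formula, sum of the data equals \(n\cdot \overline{x}\), and subtracts the scores already known.

```python
def score_needed(known_scores, total_count, target_mean):
    """Value the one missing score must take so all scores average target_mean."""
    if total_count - len(known_scores) != 1:
        raise ValueError("exactly one score must be missing")
    required_sum = total_count * target_mean     # sum of x = n * x-bar
    return required_sum - sum(known_scores)

mara = [170, 184, 160, 195]    # the four scores listed in the problem statement
print(score_needed(mara, total_count=5, target_mean=180))
```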
Example 11
Design a set of data to represent nine 10-point quiz scores where the mean is 7 and the median is 8.
The sum of the nine quiz scores must be \(9 \times 7=63\) points. There are an odd number of data values and the median is 8 so the middle value must be 8:
\(x_1 \quad x_2 \quad x_3 \quad x_4 \quad \mathbf{8} \quad x_6 \quad x_7 \quad x_8 \quad x_9\)
The sum of the missing data values must be \(63-8=55\). The x-values to the left of 8 must be 8 or less, and the x-values to the right of 8 must be 8 or more (but 10 or less). There are many possibilities. Here is one possible solution: 2, 5, 6, 7, 8, 8, 8, 9, 10.
Example 12
Mr. Henry gave the same science test to two classes. The morning class of 18 students scored a mean of 78. His afternoon class of 24 students scored a mean of 85. What is the mean score when the classes are combined into one group?
First, we calculate the sum of the scores for each class using the rearranged mean formula:
\(\displaystyle\sum_{morning}^{} x=n \cdot \overline{x} =18 \cdot 78=1404 \quad \quad \displaystyle\sum_{afternoon}^{} x=n \cdot \overline{x} =24 \cdot 85=2040 \)
Next, find the mean of the combined scores by adding the sums for the two classes into one sum and dividing by the total number of students in the two classes together:
\(\overline{x}_\text{combined}=\frac{\sum{x}}{n}=\frac{1404+2040}{18+24}=\frac{3444}{42}=82\)
When considered as one group, the mean score of the two classes is 82.
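A short sketch of this calculation is shown below. The function name `combined_mean` is our own. It rebuilds each group's sum from its size and mean, then divides the grand total by the total number of students.

```python
def combined_mean(groups):
    """groups: list of (n, mean) pairs, one pair per group."""
    total_sum = sum(n * group_mean for n, group_mean in groups)
    total_n = sum(n for n, _ in groups)
    return total_sum / total_n

classes = [(18, 78), (24, 85)]       # (class size, class mean)
print(combined_mean(classes))        # 82.0
print((78 + 85) / 2)                 # 81.5
```

Simply averaging the two class means gives 81.5, which is not the combined mean because the afternoon class has more students and so contributes more scores.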
The Weighted Mean
The weighted mean is a method of calculating the mean where instead of each data value contributing equally to the mean, some data values contribute more than others. This could be because they appear more often or because a decision was made to increase their importance (give them more weight).
One common type of "weight" to use is the frequency, which is the number of times each number is observed in the data. When we calculated the mean for the children living at home, we could have used a weighted mean calculation. The calculation would look like this:
\(\frac{(5)(1)+(8)(2)+(5)(3)+(2)(4)+(1)(5)+(1)(6)}{22}\)
The symbolic representation of this is \(\overline{x}_{weighted}=\frac{\sum_{}^{} f_ix_i}{\sum_{}^{} f_i}\), where \(x_i\) is the \(i^\text{th}\) data value and \(f_i\) is the frequency for that data value.
Example 13
The results on a 20-point multiple choice test are as follows: nine students scored 20 points, four students scored 19, three students scored 18, five students scored 17, and one student scored 12. What was the mean score for the class?
The scores were 20, 19, 18, 17, and 12. However, we cannot just average those scores because they occurred with different frequencies. The scores must be weighted by the number of times they each occurred. Using the formula for the weighted mean,
\(\begin{align*} \overline{x}_{weighted}&=\frac{\sum_{}^{} f_ix_i}{\sum_{}^{} f_i}\\ &=\frac{(9\times 20) + (4\times 19) + (3\times 18) + (5\times 17) + (1\times 12)}{9+4+3+5+1}\\ &=\frac{180+76+54+85+12}{22}\\ &=\frac{407}{22}\\ &=18.5 \end{align*}\)
The class mean was 18.5 points.
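The weighted mean formula translates directly into code. In this sketch the function name `weighted_mean` is our own, and the weights are the frequencies used in the calculation above.

```python
def weighted_mean(values, weights):
    """Sum of (weight * value) divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("each value needs exactly one weight")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

scores = [20, 19, 18, 17, 12]
students = [9, 4, 3, 5, 1]                 # how many students earned each score
print(weighted_mean(scores, students))     # 18.5
print(sum(scores) / len(scores))           # 17.2
```

The second line shows what happens if the five distinct scores are averaged without weights: the single score of 12 counts as much as the nine scores of 20, which understates the class mean.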
You can also weight data by a relative percentage. You are sure to have experienced this many times in calculating grades.
Example 14
Professor Henry uses a weighted grading system to determine his students' course grades in a history class:
- Tests - 50%
- Quizzes - 10%
- Research Paper - 15%
- Final Exam - 25%
Jessica has an 82% average on tests, a 90% average on quizzes, a 78% on her research paper, and a 72% on the final exam. What is Jessica's course average?
We will use the formula for a weighted mean where percentages are used as the weights instead of frequencies:
\(\begin{align*} \overline{x}_{weighted}&=\frac{\sum_{}^{} p_ix_i}{\sum_{}^{} p_i}\\ &=\frac{(50\%\times 82\%) + (10\%\times 90\%) + (15\%\times 78\%) + (25\%\times 72\%)}{50\%+10\%+15\%+25\%}\\ &=\frac{0.41+0.09+0.117+0.18}{100\%}\\ &=\frac{0.797}{100\%}\\ &=0.797 = 79.7\% \end{align*}\)
Jessica's course average is 79.7%.
Another common use of the weighted mean is a school GPA (grade point average). At the college level, students receive points for each grade they receive: A = 4, B = 3, C = 2, D = 1, and F = 0 points. However, the average of the grade points is based on the number of credits each course is worth.
Example 15
Here is Tiffany's grade report for the Fall semester. Find Tiffany's GPA.
| Subject | Credits | Grade |
|---|---|---|
| Precalculus | 5 | A |
| English | 3 | D |
| Biology | 4 | B |
| Biology Lab | 1 | C |
| Bowling | 1 | A |
We apply the formula for the weighted mean where the weights are the number of credits:
\(\begin{align*} \overline{x}_{weighted}&=\frac{\sum_{}^{} f_ix_i}{\sum_{}^{} f_i}\\ &=\frac{(5\times 4) + (3\times 1) + (4\times 3) + (1\times 2) + (1\times 4)}{5+3+4+1+1}\\ &=\frac{20+3+12+2+4}{14}\\ &=\frac{41}{14}\\ &\approx 2.93 \end{align*}\)
Tiffany's GPA for the Fall semester was 2.93.
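A GPA calculation adds one step to the weighted mean: converting letter grades to points. The sketch below uses the point scale given above; the names `GRADE_POINTS` and `gpa` are our own.

```python
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

def gpa(report):
    """report: list of (credits, letter_grade) pairs; credits are the weights."""
    weighted_points = sum(credits * GRADE_POINTS[grade] for credits, grade in report)
    total_credits = sum(credits for credits, _ in report)
    return weighted_points / total_credits

tiffany = [(5, "A"), (3, "D"), (4, "B"), (1, "C"), (1, "A")]
print(round(gpa(tiffany), 2))    # 2.93
```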
Technology Note: Finding a Weighted Mean on the TI-83/84 Calculator
If you have weights for data values, you can enter the frequency or weights in L2. When following the steps above for finding the mean, enter L1,L2 instead of L1. If you have the Stat Wizard Screen, enter L2 for the FreqList.
Lesson Summary
When examining a set of data, we use descriptive statistics to provide information about where the data are centered.
The mode is a measure of the most frequently occurring response in a data set and is most useful for categorical data and data measured at the nominal level.
The mean and median are two of the most commonly used measures of center. The sample mean, \(\overline{x}\), is the sum of the data values divided by the number of data values in the sample. The mean is the leveling out amount or the numerical balancing point for the data.
The median is the numeric middle of a data set. If there are an odd number of data points, the median will be one of the data values. If there is an even number of data values, the median is the mean of the middle two values.
An outlier is a number that has an extreme value when compared with most of the data. The median is resistant to an outlier. That is, the median is not affected much by the presence of outliers. The mean is not resistant to outliers. The median tends to be a more appropriate measure of center to use for data sets that contain outliers.
A weighted mean involves multiplying individual data values by their frequencies or percentages before adding them and then dividing by the total of the frequencies (weights). Weighted means are often used in combining data sets, determining grades, or calculating GPA.
Review Questions
1. Draw a sketch of blocks and explain how you can use the idea of leveling out to find the mean of each set of numbers below:
a. {3, 6, 4, 7}
b. {2, 1, 6, 5, 1, 6}
2. Jerome shows the data set {4, 6, 8, 10, 11, 12, 12} by placing counters on a number line. Jerome knows that the mean is 9 and shows the balance point is there.
a. Where can Jerome put one more counter so that the balance point will not change?
b. Where can Jerome put two more counters so that the balance point will not change?
c. Suppose that Jerome puts one counter at 5 and another counter at 12. Where can Jerome put a third counter so that the balance point will not change?
d. Suppose Jerome puts a counter at 2. Where could Jerome place two more counters so that the balance point will not change?
3. The following data represent the number of pop-up advertisements a user received while surfing the internet during the past month:
43 37 35 30 41 23 33 31 16 21
a. Calculate the mean and median number of these advertisements by hand.
b. Confirm the answers to part (a) using technology.
4. Find the mean, median, mode, and 10% trimmed mean of the following numbers. Which of them do you think gives the best average? Why?
15, 19, 15, 16, 11, 11, 18, 21, 165, 9, 11, 20, 16, 8, 17, 10, 12, 11, 16, 14.
5. The frequency table below shows the heights of a group of 8th graders.
| Height in cm | Frequency |
|---|---|
| 152 | 1 |
| 153 | 1 |
| 154 | 2 |
| 155 | 4 |
| 156 | 3 |
| 157 | 5 |
| 158 | 8 |
| 159 | 12 |
| 160 | 15 |
| 161 | 7 |
| 162 | 4 |
a. Calculate the mode, mean, and median by hand.
b. Confirm the answers to part (a) using technology.
6. In Lois’ 2nd grade class, all of the students are between 45 and 52 inches tall, except one boy, Lucas, who is 62 inches tall. Which of the following statements do you expect to be true about the heights of all of the students?
a. The mean height and the median height are about the same.
b. The mean height is greater than the median height.
c. The mean height is less than the median height.
7. Enrique has a 91, 87, and 95 for his statistics grades for the first three quarters. His mean grade for the year must be a 93 in order for him to be exempt from taking the final exam. Assuming grades are rounded following valid mathematical procedures, what is the lowest whole number grade he can get for the 4th quarter and still be exempt from taking the exam?
8. A math student scored 75, 70, 85, 90, and 100 on the first five tests he took. After he took his sixth math test, the average is now 85. What did he score on the sixth test?
9. The average time a man spent watching TV daily for the past week is 4 hours. If we remove one of these days, the average time he spent watching TV becomes 3.5 hours. How many hours did the man watch TV on the day we removed?
10. Create a set of 7 test scores (0 to 100 points) that have a mean of 82 and a median of 80.
11. Clarice rolled a number cube (labeled with the numbers 1 through 6) ten times and found the mean was 3 and the median was 3.5. Give one possible list of numbers she could have rolled.
12. The ages of the 8 children on a playground have a mean of 8.25 and a median of 8. There is no mode. Give a reasonable set of data for this situation.
13. Six people on an elevator have a mean weight of 110 pounds. Three more people with a mean weight of 150 pounds got on the elevator when the elevator stopped. What is the mean weight of all the people on the elevator?
14. For the month of April, a checking account has a balance of $700 for 24 days, $1,215 for 2 days, and $375 for 4 days. What is the average daily balance for April?
15. Mary ordered dinner for a party of 10 people. Three people ordered the $4.75 chicken dinner, two people ordered the $4.95 fish dinner, and five had the beef dinner at a cost of $6.75 each. What was the average cost of each dinner at Mary’s party?
16. Mrs. Morris weights her grades. Tests are worth 60%, quizzes are worth 15%, homework is 10% and notebooks are 5%. Mike has a test average of 90, a quiz average of 87, a homework average of 65, and a notebook average of 70. What is Mike's grade?
1.4: Measures of Position
Learning Objectives
Interpret the meaning of a percentile in context of a large data set.
Find the quartiles of a data set.
Describe a data set by presenting its five-number summary.
Identify outliers using the interquartile range (IQR).
Introduction
While measures of central tendency such as the mean, median, and mode are important, they do not tell the whole story. Suppose the mean score on a social studies exam is 80%. From this information, can we determine a range in which most people scored? The answer is no. It is possible that everyone scored close to 80% - such as between 78 and 82. Or, no one may have scored close to 80% with all students scoring higher than 95 or lower than 65.
Besides the center, measures of position and spread help paint a more precise picture of what is going on in the data. In this section, we will consider the measures of position and discuss measures of spread in the next one.
Percentiles
A measure of position may describe where a certain percentage of the data fall. The measure of position we consider here is a percentile.
A percentile is a statistic that divides the data into hundredths. The percentile of a data value identifies the percentage of the data that is less than or equal to the given data value. The most commonly used percentile is the median. The median is the halfway point in the data and 50% of the data fall at or below its value. Therefore, we can call the median the 50th percentile.
More generally, the pth percentile of a data set is a value such that, after the data are ordered from smallest to largest, at least p% of the data are at or below this value and at least (100−p)% of the data are at or above it. For example, the 40th percentile is the value for which 40% of the data are less than or equal to it. Percentiles are mostly used with very large populations and are expressed as integer values.
Example 1
To check a child’s physical development, pediatricians use height and weight charts that help them know how the child compares to children of the same age. A child whose height is at the 70th percentile is as tall as or taller than 70% of children of the same age.
A common application of percentiles is their use in standardized testing. If a student scores at the 75th percentile, that student scored as well as or better than 75% of all other students who took the test. Some colleges and universities use percentiles to determine admittance into their programs.
As a future educator, you should be able to distinguish between a raw score, a percent score, and a percentile. Let's say that you gave a 25-question test to your class of 20 fourth graders. Alex is one of your students. He answered 20 questions correctly. His raw score was 20. His percent score is \(20/25=80\%\) correct. Let's also say that 7 students scored the same as or lower than Alex. This means that Alex's score of 80% correct is located at the 35th percentile because he scored as well as or better than \(7/20=35\%\) of his classmates. A low percentile does not imply a bad performance. It implies that the performance was lower relative to his peers.
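The difference between a percent score and a percentile can be sketched in a few lines. The class scores below are hypothetical and were made up for illustration; only Alex's raw score of 20 out of 25 and the count of 7 scores at or below his come from the scenario above. The function name `percentile_rank` is also our own.

```python
def percentile_rank(scores, score):
    """Percent of scores that are less than or equal to `score`."""
    at_or_below = sum(1 for s in scores if s <= score)
    return 100 * at_or_below / len(scores)

# Hypothetical raw scores for a class of 20 (7 of them are 20 or lower).
class_scores = [14, 16, 17, 18, 19, 19, 20,
                21, 21, 22, 22, 22, 23, 23, 23, 24, 24, 24, 25, 25]
alex_raw = 20

print(100 * alex_raw / 25)                        # 80.0  (percent score)
print(percentile_rank(class_scores, alex_raw))    # 35.0  (percentile)
```

The percent score depends only on Alex's own answers, while the percentile depends on how everyone else in the class scored.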
Low percentiles always correspond to lower data values. High percentiles always correspond to higher data values. A percentile may correspond to a value judgment about whether it is "good" or "bad." The interpretation of whether a certain percentile is "good" or "bad" depends on the context of the situation to which the data applies. In some situations, a low percentile would be considered "good." In other contexts a high percentile might be considered "good." In many situations, there is no value judgment that applies.
Understanding how to interpret percentiles properly is important not only when describing data, but also when calculating probabilities in later chapters of this text. When writing the interpretation of a percentile in the context of the given data, the sentence should contain the following information:
- information about the context of the situation being considered
- the data value (value of the variable) that represents the percentile
- the percent of individuals or items with data values below the percentile
- the percent of individuals or items with data values above the percentile
Example 2
During a season, a player who scores 8 points per game is at the 40th percentile. Interpret the 40th percentile in the context of this situation.
Forty percent of players scored eight points or fewer. Sixty percent of players scored eight points or more. A higher percentile here is good because getting more points in a basketball game is desirable.
Example 3
For the 100-meter dash, the 80th percentile of finishing times was 11.5 seconds. Interpret the 80th percentile in the context of the situation.
Eighty percent of runners finished the race in 11.5 seconds or less. Twenty percent of runners finished the race in 11.5 seconds or more. A lower percentile is good because finishing a race more quickly is desirable.
Quartiles
Two very commonly used percentiles are the 25th and 75th percentiles. The 25th percentile, median, and 75th percentile divide the data into four quarters. Because of this, the 25th percentile is notated as \(Q_1\) and is called the first (lower) quartile, and the 75th percentile is notated as \(Q_3\) and is called the third (upper) quartile. The median is the second (middle) quartile and is sometimes referred to as \(Q_2\).
Example 4
Let's return to a previous data set, which is as follows:
1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 6
Recall that the median (50th percentile) is 2. You can think of the quartiles as the medians of the upper and lower halves of the data.
The lower quartile is \(Q_1=2\). One-fourth (25%) of the entire set of values are less than or equal to 2 and three-fourths (75%) of the values are greater than or equal to 2. The value \(Q_1=2\) is part of the data set in this example.
The upper quartile is \(Q_3=3\). Three-fourths (75%) of the values in the data set are less than or equal to 3. One-fourth (25%) of the values are greater than or equal to 3. The value \(Q_3=3\) is part of the data set in this example.
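As a quick check on these statements, the following Python sketch counts the share of values at or below and at or above each quartile for the data in this example. The helper name `share` is only an illustrative choice. Because the data contain many repeated values, the counted shares come out larger than 25% and 75%, so the quartile statements are best read as "at least one-fourth" and "at least three-fourths."

```python
# Data from Example 4, already in order.
data = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 6]


def share(values, keep):
    """Return the percent of values for which keep(value) is true."""
    return 100 * sum(1 for v in values if keep(v)) / len(values)


q1, q3 = 2, 3  # quartiles stated in the example

at_or_below_q1 = share(data, lambda v: v <= q1)
at_or_above_q1 = share(data, lambda v: v >= q1)
at_or_below_q3 = share(data, lambda v: v <= q3)
at_or_above_q3 = share(data, lambda v: v >= q3)

print(f"Q1 = {q1}: {at_or_below_q1:.1f}% at or below, {at_or_above_q1:.1f}% at or above")
print(f"Q3 = {q3}: {at_or_below_q3:.1f}% at or below, {at_or_above_q3:.1f}% at or above")
```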
In the previous example there were an odd number of values in each half. If there were an even number of values, then we would follow the procedure for medians and average the middle two values of each half. Look at the set of data below:
For this example, neither \(Q_1=77.5\) nor \(Q_3=93.5\) is part of the data set.
Five-Number Summary
One way to summarize a set of data is to give the lowest number, the highest number, and the middle number. In addition to these three numbers it is also useful to give \(Q_1\) and \(Q_3\). This list of five numbers gives a very concise summary of the data and is referred to as the five-number summary.
The five-number summary of a data set is its minimum value, \(Q_1\), median, \(Q_3\), and maximum.
Example 5
Find the five-number summary of the heights (in inches) of 9 females:
| 59 | 60 | 62 | 64 | 66 | 67 | 69 | 70 | 72 |
The data are already ordered. There is an odd number of data, \(n=9\). The median's location is at the 5th position:
\(\begin{align*} 59, \ 60, \ 62, \ 64, \boxed{66}, &\ 67, \ 69, \ 70, \ 72\\ \\ \text {Median} &=66 \end{align*}\)
To find \(Q_1\) and \(Q_3 \), find the median of the left half of data and the median of the right half of the data. \(Q_1\) and \(Q_3\) are each positioned between the 2nd and 3rd piece of data in each half as shown:
\(\begin{align*} 59,\boxed{60, \ 62}, \ 64, & \boxed{66}, \ 67,\boxed{69, \ 70},\ 72 \\ \\ Q_1=\frac{60+62}{2} =61 \ & \text{ and } \ Q_3=\frac{69+70}{2} =69.5 \end{align*}\)
The minimum value is 59 and the maximum value is 72. We report the five-number summary as {59, 61, 66, 69.5, 72}.
Example 6
Suppose a chain restaurant advertises that a typical number of french fries in a large order is 82. Roberta is a bit curious about this claim so she bought a large order of fries each day for the past 18 days and counted the number of fries in the orders. Her data are shown below. Find the five-number summary for the data in her sample.
| 80 | 72 | 77 | 80 | 90 | 85 | 93 | 52 | 84 | 87 | 80 | 86 | 92 | 88 | 67 | 86 | 66 | 77 |
First put the data in order and find the median. There are an even number of data values, \(n=18\), so the median will be between the data values in position #\((18/2)=9\) and position #\((18/2)+1=10\) as shown:
\(\begin{align*} 52, \ 66, \ 67, \ 72, \ 77, \ 77, \ 80, \ 80, \ & \boxed{80, \ 84}, 85, \ 86, \ 86, \ 87, \ 88, \ 90, \ 92, \ 93 \\ \\ \text{Median} &=\frac{80+84}{2}=82 \end{align*} \)
The values \(Q_1\) and \(Q_3\) will each be at position #\((9+1)/2=5\) in each half of the data:
\(\begin{align*} 52, \ 66, \ 67, \ 72, \boxed{77}, 77, \ 80, \ 80, & \boxed{80, \ 84}, \ 85, \ 86, \ 86, \boxed{87}, 88, \ 90, \ 92, \ 93 \\ \\ Q_1=77 \ &\text { and } \ Q_3=87 \end{align*} \)
The minimum number of fries was 52 and the maximum number of fries was 93. The five-number summary is {52, 77, 82, 87, 93}.
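The procedure used in Examples 5 and 6 can be sketched in Python as follows. This is a minimal illustration of the median-of-halves method described in this lesson, not a standard library routine. The function name `five_number_summary` is an assumption made for the example, and other software may use a different quartile convention and report slightly different values for \(Q_1\) and \(Q_3\).

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    # When n is odd, the middle value is left out of both halves.
    lower = ordered[:half]
    upper = ordered[n - half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


# Heights from Example 5 and french fry counts from Example 6.
heights = [59, 60, 62, 64, 66, 67, 69, 70, 72]
fries = [80, 72, 77, 80, 90, 85, 93, 52, 84, 87, 80, 86, 92, 88, 67, 86, 66, 77]

print(five_number_summary(heights))
print(five_number_summary(fries))
```

Run on the two data sets, the function reproduces the summaries found by hand: {59, 61, 66, 69.5, 72} for the heights and {52, 77, 82, 87, 93} for the fries.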
Technology Note: Finding Five-Number Summary on the TI 83/84 Calculator
You may recall seeing the five-number summary when finding the mean and median on the calculator in the previous section. Follow that same procedure and make sure you "arrow down" to see the five-number summary. The sequence of screenshots is shown below for the data in Example 4.
Test for Outliers using the Quartiles
Many data sets have values that are either extremely high or extremely low when compared to the rest of the data values. These values are called outliers. If a data value lies too far below or above the rest of the data, we consider it to be an outlier. One way to check for outliers is to use the quartiles of the five-number summary. Here are the steps:
- Order the data set and find the values \(Q_1\) and \(Q_3\) .
- Calculate the difference between \(Q_1\) and \(Q_3\). This difference is called the interquartile range: \(IQR=Q_3−Q_1\) .
- Find the step by multiplying 1.5 times the IQR: \(\text{Step}=1.5×IQR\) .
- Find \((Q_1−\text{Step})\) and \((Q_3+\text{Step})\) to locate the fence values.
- Any data that fall beyond these fences are considered outliers. That is, any value lower than \((Q_1−\text{Step})\) or any value higher than \((Q_3+\text{Step})\) is an outlier. There may be none, one, or more than one outlier.
Example 7
Determine if any of the values in the french fries data were outliers. We will follow the steps above using the results from Example 6.
- Recall that \(Q_1=77\) and \(Q_3=87\).
- \(IQR=Q_3−Q_1=87-77=10\).
- \(\text{Step}=1.5×IQR=1.5\times 10=15\)
- \(\text{Lower Fence} = Q_1-\text{Step} = 77-15=62\\ \text{Upper Fence} = Q_3+\text{Step} = 87+15=102\)
- Any values below 62 or above 102 will be considered outliers. Consulting the data, the day when Roberta got only 52 fries was an outlier. The remaining data all lie in the range created by the fences [62,102].
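The same steps can be carried out in a few lines of Python. This is only a sketch: it takes \(Q_1=77\) and \(Q_3=87\) from Example 6 as given rather than recomputing them, and the variable names are illustrative.

```python
# French fry counts from Example 6.
fries = [80, 72, 77, 80, 90, 85, 93, 52, 84, 87, 80, 86, 92, 88, 67, 86, 66, 77]

q1, q3 = 77, 87              # quartiles found in Example 6

iqr = q3 - q1                # interquartile range
step = 1.5 * iqr             # distance from each quartile to its fence
lower_fence = q1 - step
upper_fence = q3 + step

# Keep only the values that fall beyond either fence.
outliers = [x for x in fries if x < lower_fence or x > upper_fence]

print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
```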
Example 8
Based on past school spelling bee records, Ms. Lopez found the five-number summary of the number of words used in the bee before a winner was determined. Are there any outliers in the data?
| Minimum | Quartile 1 | Median | Quartile 3 | Maximum |
| 48 | 80 | 92 | 99 | 154 |
We will follow the steps above:
- \(Q_1=80\) and \(Q_3=99\).
- \(IQR=Q_3−Q_1=99-80=19\).
- \(\text{Step}=1.5×IQR=1.5\times 19=28.5\)
- \(\text{Lower Fence}=Q_1-\text{Step}=80-28.5=51.5 \\ \text{Upper Fence}=Q_3+\text{Step}=99+28.5=127.5\)
- Any values outside the fenced interval will be considered outliers. Consulting the five-number summary, the minimum value was 48, which is below the lower fence of 51.5, so the year when only 48 words were used is an outlier. The maximum value was 154, which is beyond the upper fence of 127.5, so the year when 154 words were used is also an outlier. There could be other outliers at either end, but they cannot be identified without knowing the full set of raw data.
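When only a five-number summary is available, the test reduces to comparing the minimum and maximum against the fences. A minimal Python sketch using the spelling bee summary is shown here. The function name `fence_check` and its parameter `k` (the 1.5 multiplier) are assumptions made for this illustration.

```python
def fence_check(minimum, q1, q3, maximum, k=1.5):
    """Return both fences and whether the minimum or maximum lies beyond them."""
    step = k * (q3 - q1)
    lower_fence = q1 - step
    upper_fence = q3 + step
    return lower_fence, upper_fence, minimum < lower_fence, maximum > upper_fence


# Five-number summary from Example 8 (the median is not needed for the test).
lower_fence, upper_fence, low_outlier, high_outlier = fence_check(48, 80, 99, 154)

print("Lower fence:", lower_fence, "minimum is an outlier:", low_outlier)
print("Upper fence:", upper_fence, "maximum is an outlier:", high_outlier)
```

A summary can only confirm that the most extreme value on each side is an outlier. Checking the remaining values still requires the raw data.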
Lesson Summary
The pth percentile is the data value that divides an ordered data set into two parts so that p% of the data are less than or equal to that data value and (100−p)% of the data are greater than or equal to that data value.
The quartiles and median are special percentiles:
- Quartile 1 is the value where 25% of the data is less than or equal to that value.
- Median is the value where 50% of the data is less than or equal to that value.
- Quartile 3 is the value where 75% of the data is less than or equal to that value.
The five-number summary is a set of five values that divides a data set into quarters: minimum, lower quartile, median, upper quartile, and maximum.
Quartiles are used to test a data set for outliers. Any data value that is less than \(Q_1-1.5(Q_3-Q_1)\) or greater than \(Q_3+1.5(Q_3-Q_1)\) is an outlier when compared to the rest of the data. The difference \(Q_3-Q_1\) is called the interquartile range (IQR).
Review Questions
1. Mina is waiting in line at the Department of Motor Vehicles (DMV). Her wait time of 32 minutes is the 85th percentile of wait times. Is that good or bad? Write a sentence interpreting the 85th percentile in the context of this situation.
2. On a 20-question math test, the 70th percentile for number of correct answers was 16.
a. What percent of the questions did a student scoring at the 70th percentile answer correctly?
b. What percent of students scored higher than 16 on this math test?
c. Is a higher or lower percentile more desirable in this context?
3. The numbers of coins that 12 randomly selected people had in their piggy banks are shown below. Find the five-number summary and test for outliers.
\(35 \quad 58 \quad 29 \quad 44 \quad 104 \quad 39 \quad 72 \quad 34 \quad 50 \quad 41 \quad 64 \quad 54\)
4. Use the five-number summary below to determine whether the data set from which it was calculated contains any outliers.
| Minimum | Quartile 1 | Median | Quartile 3 | Maximum |
| 5 | 12 | 14 | 16 | 24 |
5. The following data represent the average snowfall (in centimeters) for 18 Canadian cities for the month of January. Find the five-number summary and test for outliers.
| Name of City | Amount of Snow (cm) |
|---|---|
| Calgary | 123.4 |
| Charlottetown | 74.5 |
| Edmonton | 80.6 |
| Fredericton | 73.8 |
| Halifax | 64.0 |
| Labrador City | 110.4 |
| Moncton | 82.4 |
| Montreal | 63.6 |
| Ottawa | 48.9 |
| Quebec City | 53.8 |
| Regina | 35.9 |
| Saskatoon | 25.4 |
| St. John’s | 97.5 |
| Sydney | 44.2 |
| Toronto | 21.8 |
| Vancouver | 12.8 |
| Victoria | 8.3 |
| Winnipeg | 76.2 |
6. Firman’s Fitness Factory is a new gym that offers reasonably priced family packages. The following table represents the number of family packages sold during the opening month. Find the five-number summary and determine whether any values are outliers.
\(24 \quad 21 \quad 31 \quad 28 \quad 29\\ 27 \quad 22 \quad 27 \quad 30 \quad 32\\ 26 \quad 35 \quad 24 \quad 22 \quad 34\\ 30 \quad 28 \quad 24 \quad 32 \quad 27\\ 32 \quad 28 \quad 27 \quad 32 \quad 23\\ 20 \quad 32 \quad 28 \quad 32 \quad 34\)