Sampling distributions
This is a teacher-led classroom activity suitable for introductory statistics courses in higher education.
In a nutshell
What and why: This activity introduces students to the concept of sampling
distributions and point estimates, and to how the accuracy of point
estimates is affected by sample size. Concurrently, it allows students
to become aware of the existence of open datasets in their discipline,
and to practice using dataset documentation, downloading and importing
datasets. A firm grasp of the concept of sampling distributions forms the basis of
inferential statistics. This activity will help establish the
conceptual foundation for calculating confidence intervals. Point
estimates (and confidence intervals) are essential tools in any
researcher’s repertoire of statistical techniques. All students should
be able to read and understand data documentation, and to download data
from repositories.
Prerequisite knowledge: Students should be familiar with the concepts of populations and
samples, and with the idea that our knowledge of populations is typically
inferred from sample estimates. They should be able to
calculate a sample mean. The calculation of confidence intervals around
point estimates can be incorporated into this lesson if students have
not already been introduced to it.
Stuff you need: Each student group should have a computer and the analysis software you
prefer. You should have a teacher computer from which you can project.
Estimated time: The activity can be done in a single 45-50 minute session, but will probably work better with 70-90 minutes available, either in one longer session or across two shorter sessions.
Suggested intended learning outcomes
After completing this activity, the student should be able to:
Introductory statistics outcomes:
- Understand that sample characteristics vary around the true population parameters.
- Explain how the sampling variability of point estimates and the width of the confidence intervals around them change with the sample size.
- Calculate confidence intervals around estimated means.
Open Data outcomes:
- Be aware of the existence of open data sets.
- Understand the importance of clear data documentation.
- Download open data sets and import them into analysis software of choice.
Example dataset
You can use any medium or large dataset (say, at least 500 cases) for this activity. The full set of cases in the dataset will serve as the population, from which the students will draw random samples. It is probably a good idea to choose a dataset from your own discipline.
Suggested activity and instructions
This activity works best if your class can be divided into at least 4-5 groups.
Provide your students with instructions on how to locate and download your chosen dataset (or the part of it needed for this activity), and import it into your preferred analysis software. Guide them to keep only the chosen variable, plus a case ID variable if one exists, and to discard the variables that are not needed. Have them save the resulting spreadsheet file as “population data” or something similar.
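For classes that work in Python rather than a spreadsheet, the step of isolating the chosen variable might look like the minimal sketch below. The column names (`case_id`, `score`) and the inline CSV text are hypothetical stand-ins for whatever dataset you choose; in practice the text would be read from the downloaded file.

```python
import csv
import io

# Hypothetical CSV content standing in for a downloaded open dataset.
raw_csv = """case_id,age,score,region
1,34,52.0,north
2,29,47.5,south
3,41,61.0,east
"""

def isolate_variable(csv_text, id_column, variable):
    """Keep only the case ID and the chosen variable from each row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {id_column: row[id_column], variable: float(row[variable])}
        for row in reader
    ]

# "Population data": one record per case, all other columns discarded.
population_data = isolate_variable(raw_csv, "case_id", "score")
print(population_data)
```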
Point them toward the documentation, and have them peruse the bits of it that are relevant to understand how the chosen variable was coded. Allow some time for questions and clarification.
Next, explain that for this exercise, the full set of cases in the dataset represents the population, and that the activity will involve drawing random samples and attempting to estimate the population mean from those samples.
Guide your students through a prepared routine for drawing random samples from the complete “population” variable. You could do this by inserting a column of random values, sorting the data by this column, and keeping, say, the first 8 cases in the sorted spreadsheet. (For instructions on how to do this in Excel, see the linked video under ‘Ancillary files and supplements’, below.) Have them save the resulting reduced spreadsheet as “small sample” or something similar.
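The same random-column routine can be sketched in Python for classes that do not use a spreadsheet. This is only an illustration: the function name `draw_sample` and the simulated population of 500 values are assumptions made for the example, not part of the activity materials.

```python
import random

def draw_sample(population, n, seed=None):
    """Mimic the spreadsheet routine: give each case a random key,
    sort by that key, and keep the first n cases."""
    rng = random.Random(seed)
    keyed = [(rng.random(), value) for value in population]
    keyed.sort(key=lambda pair: pair[0])
    return [value for _, value in keyed[:n]]

# Hypothetical "population": 500 simulated values standing in for the
# chosen variable of a real dataset.
population_rng = random.Random(1)
population = [population_rng.gauss(50, 10) for _ in range(500)]

small_sample = draw_sample(population, 8, seed=42)
print(small_sample)
```

Fixing the seed makes a sample reproducible; giving each student group a different seed (or none) yields different samples from the same population.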
Next, have them calculate the sample mean and its confidence interval. If your students have not already been introduced to the idea of a confidence interval, briefly explain the idea and how it is calculated. Provide them with instructions and a routine/formula for doing the calculation, or ask them to do the calculation if they already know how. If you have only a few student groups, you can have each group draw another small sample and calculate its mean and CI in this manner from their population data file.
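As one possible routine, a 95% confidence interval for the mean can be computed as the sample mean plus or minus a critical t value times the standard error (the sample standard deviation divided by the square root of the sample size). The sketch below assumes this t-based interval; the sample values are made up for illustration, and the critical value 2.365 (95%, 7 degrees of freedom) applies only to samples of 8 cases and must be looked up again for other sample sizes.

```python
import math
import statistics

def mean_ci(sample, t_crit):
    """Return (mean, lower, upper) for a t-based confidence interval.

    t_crit is the critical t value for the desired confidence level
    and len(sample) - 1 degrees of freedom, taken from a t table.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    margin = t_crit * standard_error
    return mean, mean - margin, mean + margin

# Made-up sample of 8 cases, for illustration only.
sample = [52.0, 47.5, 61.0, 49.0, 55.5, 43.0, 58.0, 50.0]

T_CRIT_95_DF7 = 2.365  # two-sided 95% critical value, 7 degrees of freedom
mean, lower, upper = mean_ci(sample, T_CRIT_95_DF7)
print(f"mean = {mean:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```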
Next, open and project a file for displaying the distribution of sample estimates. (A template for Excel is provided under ‘Ancillary files and supplements’.) Enter the sample means and CIs from the student group samples.
Ideally, the exercise should be repeated, but with larger samples. This lets students see that sample means vary more around the population mean when samples are small than when they are large.
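To preview the pattern before class, a small simulation along the following lines can be run. It is a sketch under stated assumptions: the population is simulated rather than taken from a real dataset, and the sample sizes (8 and 64) and the number of repetitions (200) are arbitrary choices for illustration.

```python
import random
import statistics

def sample_means(population, n, repeats, rng):
    """Draw `repeats` random samples of size n (without replacement)
    and return the mean of each one."""
    return [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]

rng = random.Random(2024)

# Hypothetical population of 500 simulated cases.
population = [rng.gauss(50, 10) for _ in range(500)]
population_mean = statistics.mean(population)

small_means = sample_means(population, 8, 200, rng)
large_means = sample_means(population, 64, 200, rng)

# The standard deviation of the sample means shows how far individual
# estimates typically fall from the population mean.
spread_small = statistics.stdev(small_means)
spread_large = statistics.stdev(large_means)

print(f"population mean: {population_mean:.2f}")
print(f"spread of means, n = 8:  {spread_small:.2f}")
print(f"spread of means, n = 64: {spread_large:.2f}")
```

The spread of the means from the larger samples should come out clearly smaller, which is the contrast the repeated exercise is meant to make visible.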
Ancillary resources, files and supplements
- Template for point estimate and CI graph in Excel.