Name(s):
Sampling Distributions and Variability
Introduction to Statistics, Spring 2007, Tom Linton
Work in groups of size two or three, but each person
MUST collect their own samples (we need enough data to
see the patterns). You can turn in a single paper per group, but
everyone must create their own samples.
In today's activity, we will begin to examine relationships between a
population and samples drawn from that population. In particular we
will explore ties between the notions of "center", "spread", and
"proportion" (three of the most common measurements associated with
data). We've seen various ways to measure center and spread (today
we'll focus on the notions of mean and standard deviation for these
two), and proportions are one of the most common measures used to
analyze categorical variables (what proportion of college students
smoke, what proportion of registered voters will vote for a certain
candidate, what proportion of pop drinkers prefer diet pop, etc.). In
order to study these relationships, we will use an extremely small
population consisting of the words in the Gettysburg Address
(attributed to Abraham Lincoln, given November 19, 1863, on the
battlefield near Gettysburg, Pennsylvania).
Circle 10 representative words in the following passage. That is, try
to select words whose length (number of letters) captures the typical
length characteristics of the words in the entire passage. Keep in mind
that spread is one of the characteristics of word length (so you
probably shouldn't select words that all have the same length).
Four score and seven years ago our
fathers brought forth upon this continent a new nation, conceived in
liberty and dedicated to the proposition that all men are created
equal.
Now we are engaged in a great civil war, testing whether that nation or
any nation so conceived and so dedicated can long endure. We are met on
a great battlefield of that war. We have come to dedicate a portion of
that field as a final resting-place for those who here gave their lives
that that nation might live. It is altogether fitting and proper that
we should do this. But in a larger sense, we cannot dedicate, we cannot
consecrate, we cannot hallow this ground. The brave men, living and
dead who struggled here have consecrated it far above our poor power to
add or detract. The world will little note nor long remember what we
say here, but it can never forget what they did here.
It is for us the living rather to be dedicated here to the unfinished
work which they who fought here have thus far so nobly advanced. It is
rather for us to be here dedicated to the great task remaining before
us--that from these honored dead we take increased devotion to that
cause for which they gave the last full measure of devotion--that we
here highly resolve that these dead shall not have died in vain, that
this nation under God shall have a new birth of freedom, and that
government of the people, by the people, for the people shall not
perish from the earth.
- In relation to our population (the words in the passage above),
we are interested in the following variables. Decide if each of the
variables below is quantitative or categorical.
Length of word (number of letters)
Whether or not the word contains more than 4 letters (Yes, No)
- In the table below, record the information for each of the 10
representative words you circled in the Gettysburg Address.
| | 1 | 2 | 3 | 4 | 5 |
| Word | | | | | |
| Length | | | | | |
| > 4 Characters (Yes, No) | | | | | |

| | 6 | 7 | 8 | 9 | 10 |
| Word | | | | | |
| Length | | | | | |
| > 4 Characters (Yes, No) | | | | | |
- Ideally, we want our sample to be representative of our population,
that is, to have the same characteristics. Do you think your sample of
10 words listed in the table above is representative of the 268 words
in the Gettysburg Address? Explain.
- Construct a "dot plot" (like a bar chart) of the lengths of the
10 words in your sample and record it below. Also calculate the mean
word length for your sample and record it below.
Sample mean =
- Is the mean from question (4) a statistic or
a parameter?
- To display the distribution of a categorical variable for a
single sample, we can use a bar chart. The bar chart has one bar for
each category, and the height of each bar is the proportion of your
sample in that category. You can use the title of your bar chart to
help describe the variable. Let us refer to a word as Long if it has more than 4 letters,
and Short if it has 4 or fewer
letters. Make a bar chart of your sample for this categorical variable,
and calculate your sample's proportion of Long words.
- Add your sample mean and sample proportion of long words to the
class stem plots and the Excel spreadsheet.
- Is your proportion of long words from question (6) a statistic or
parameter?
- The mean length of all 268 words in our population is 4.29
letters. Is this number a parameter or statistic?
- There are 99 "Long" words in the population of 268 words. What
proportion of the words in the population are Long? Is this value a
parameter or statistic?
- Did everyone in your class obtain the same values for their
sample means and sample proportions?
- For the two variables (separate plots for each) sample mean and
sample proportion, make a stem plot, dot plot, or histogram that
includes all of the class data. Indicate on each of your plots the
value for the corresponding parameter (population mean = 4.29 letters,
population proportion of long words = 0.37). You can draw a vertical
line at the location of each parameter.
- For the collection of sample means, describe the shape and center
of this distribution.
- For the collection of sample proportions, describe the shape and
center of this distribution.
You have witnessed the fundamental principle of sampling variability: Values of
sample statistics vary when
one repeatedly takes samples from a population. Both of your plots
should indicate definite patterns to this variability. When we attempt
to study this variability, our "individuals" become the samples, and
the variables of interest are quantities like the sample's mean,
proportion, or standard deviation. In a sense, we are now interested in
looking at the population of "all possible samples" (of size 10) from
our original population of words in the Gettysburg Address. Because we
now have both parameters and statistics around, we use different
symbols for these quantities. Let us refer to our population mean as μ,
and our sample means as x̄. Similarly, we refer to the population
proportion as p, and the sample's proportions as p̂.
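If you would like to see this variability on a computer, the short Python sketch below may help. It is only an illustration under stated assumptions: to keep it short, it uses just the first sentence of the passage (30 words, punctuation removed) as a stand-in population, and the names `mean_length` and `proportion_long` and the seed value are made up for this example. It computes the stand-in population's μ and p once, then computes x̄ and p̂ for several random samples of size 10.

```python
import random
import statistics

# Stand-in population: ONLY the first sentence of the passage (30 words,
# punctuation removed). The activity's real population is all 268 words.
SENTENCE = (
    "Four score and seven years ago our fathers brought forth upon this "
    "continent a new nation conceived in liberty and dedicated to the "
    "proposition that all men are created equal"
)
words = SENTENCE.split()
lengths = [len(w) for w in words]


def mean_length(word_lengths):
    """Mean word length of a list of lengths (mu for a population, x-bar for a sample)."""
    return statistics.mean(word_lengths)


def proportion_long(word_lengths, cutoff=4):
    """Fraction of words with more than `cutoff` letters (p for a population, p-hat for a sample)."""
    return sum(1 for n in word_lengths if n > cutoff) / len(word_lengths)


# Parameters of the stand-in population: fixed numbers.
mu = mean_length(lengths)
p = proportion_long(lengths)

# Statistics from repeated random samples of size 10: these change from sample to sample.
rng = random.Random(2007)  # hypothetical seed, used only so the sketch is repeatable
for i in range(5):
    sample = rng.sample(lengths, 10)  # 10 distinct words, chosen without replacement
    print(f"sample {i + 1}: x-bar = {mean_length(sample):.2f}, "
          f"p-hat = {proportion_long(sample):.2f}")
print(f"stand-in population: mu = {mu:.2f}, p = {p:.2f}")
```

Because the stand-in is only one sentence, its parameters (μ ≈ 4.83, p ≈ 0.47) are not the same as the values for the full 268-word passage. The point is only that μ and p stay put while x̄ and p̂ move around from sample to sample.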
- Was your sample mean, x̄, above or below the population mean
μ = 4.29 letters? How many and what percentage of the class's x̄
values exceeded the population mean?
- Was your sample proportion of long words, p̂, above or below the
population proportion p = 0.37? How many and what percentage of the
class's p̂ values exceeded the population proportion?
- Past experience indicates that our sampling method (asking people
to "hand select" representative words from a passage) is biased. There
is a tendency for people to overestimate both the average word length
and the proportion of long words. By looking at your responses to
questions (12) to (16), write a sentence or two about whether or not
our results support this claim of bias.
- Rather than "hand selecting" our samples, we will now draw our
samples from our population using our calculator's randInt command
(do NOT seed your calculator, however). Each PERSON should select
one SRS of size 5 (use randInt(1,268,5) for this) and one SRS of
size 10 (using randInt(1,268,10)). Of course, we must remove
duplicates from each of our samples: if a label repeats, draw another
label to replace it. Shown below is our population with all of the
words labeled. Use your calculator to select your 2 samples, and
record the values in the tables below.
SRS of size 5
| Label | | | | | |
| Word | | | | | |
| Length | | | | | |
| > 4 Characters (Yes, No) | | | | | |
SRS of size 10
| Labels | | | | | | | | | | |
| Word | | | | | | | | | | |
| Length | | | | | | | | | | |
| > 4 Characters | | | | | | | | | | |
- For your sample of size 5, calculate the sample mean x̄ and the
sample proportion p̂. Record these values below and add these values
to the class data stem plots and Excel spreadsheet.
- For your sample of size 10, calculate the sample mean x̄ and the
sample proportion p̂. Record these values below and add these values
to the class data stem plots and Excel spreadsheet.
- Was your sample mean, x̄, for your sample of size 5 above or below
the population mean μ = 4.29 letters? How many and what percentage of
the class's x̄ values for SRS's of size 5 exceeded the population mean?
- Was your sample mean, x̄, for your sample of size 10 above or below
the population mean μ = 4.29 letters? How many and what percentage of
the class's x̄ values for SRS's of size 10 exceeded the population mean?
- Was your sample proportion of long words, p̂, for your SRS of size 5
above or below the population proportion p = 0.37? How many and what
percentage of the class's p̂ values for SRS's of size 5 exceeded the
population proportion?
- Was your sample proportion of long words, p̂, for your SRS of size 10
above or below the population proportion p = 0.37? How many and what
percentage of the class's p̂ values for SRS's of size 10 exceeded the
population proportion?
- Based on your last several answers, does it appear that using randInt to
select our samples is a biased sampling procedure?
- By looking at the class stem plots for sample means from SRS's of
size 5 and then size 10, does there appear to be a difference in the
spread of these 2 plots? If so, describe the difference (which one is
less spread out, and roughly how big of a difference is there).
- By looking at the class stem plots for sample proportions from
SRS's of size
5 and then size 10, does there appear to be a difference in the spread
of these 2 plots? If so, describe the difference (which one is less
spread out, and roughly how big of a difference is there).
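The comparison of SRS's of size 5 and size 10 can also be explored with a small simulation. The Python sketch below is one possible way to do it, not the method used in class. It relies only on two facts from this activity (268 words, 99 of them Long) and draws labels the way randInt(1,268,n) does, discarding repeats until the sample has n distinct labels. Which labels count as Long, the function names (`srs_labels`, `p_hat`, `simulate`), the seed, and the number of repetitions are all assumptions made for illustration.

```python
import random
import statistics

N = 268        # words in the population (from the activity)
N_LONG = 99    # words with more than 4 letters (from the activity)

# Assumption for illustration: labels 1..99 stand for the Long words.
# For a simple random sample it does not matter which labels are Long.
is_long = {label: label <= N_LONG for label in range(1, N + 1)}


def srs_labels(n, rng):
    """Mimic randInt(1, N, n) with duplicates removed: draw until n distinct labels."""
    chosen = []
    while len(chosen) < n:
        label = rng.randint(1, N)  # inclusive on both ends
        if label not in chosen:
            chosen.append(label)
    return chosen


def p_hat(labels):
    """Sample proportion of Long words among the chosen labels."""
    return sum(is_long[label] for label in labels) / len(labels)


def simulate(n, reps, rng):
    """Return the p-hat values from `reps` simple random samples of size n."""
    return [p_hat(srs_labels(n, rng)) for _ in range(reps)]


rng = random.Random(268)  # hypothetical seed, used only so the sketch is repeatable
props5 = simulate(5, 2000, rng)
props10 = simulate(10, 2000, rng)

for n, props in ((5, props5), (10, props10)):
    print(f"n = {n:2d}: mean of p-hat = {statistics.mean(props):.3f}, "
          f"SD of p-hat = {statistics.pstdev(props):.3f}")
print(f"population proportion p = {N_LONG / N:.3f}")
```

Compare the two SD values the sketch prints with the spreads you see in the class stem plots for sample proportions. The same idea would work for sample means if the dictionary stored each word's length instead of Long/Short.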
Here is a labeled version of the population.