Name(s)                                                               :
Sampling Distributions and Variability
Introduction to Statistics, Spring 2007, Tom Linton
Class data 10 AM, class data 2 PM

Work in groups of size two or three, but each person MUST collect their own samples (we need enough data to see the patterns). You can turn in a single paper per group, but everyone must create their own samples.
In today's activity, we will begin to examine relationships between a population and samples drawn from that population. In particular, we will explore ties between the notions of "center", "spread", and "proportion" (three of the most common measurements associated with data). We've seen various ways to measure center and spread (today we'll focus on the mean for center and the standard deviation for spread), and proportions are one of the most common measures used to analyze categorical variables (what proportion of college students smoke, what proportion of registered voters will vote for a certain candidate, what proportion of pop drinkers prefer diet pop, etc.). In order to study these relationships, we will use an extremely small population consisting of the words in the Gettysburg Address (delivered by Abraham Lincoln on November 19, 1863, on the battlefield near Gettysburg, Pennsylvania).

Circle 10 representative words in the following passage. That is, try to select words whose lengths (number of letters) capture the typical word-length characteristics of the entire passage. Keep in mind that spread is one of those characteristics, so you probably shouldn't select words that all have the same length.

Four score and seven years ago our fathers brought forth upon this continent a new nation, conceived in liberty and dedicated to the proposition that all men are created equal.

Now we are engaged in a great civil war, testing whether that nation or any nation so conceived and so dedicated can long endure. We are met on a great battlefield of that war. We have come to dedicate a portion of that field as a final resting-place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But in a larger sense, we cannot dedicate, we cannot consecrate, we cannot hallow this ground. The brave men, living and dead who struggled here have consecrated it far above our poor power to add or detract. The world will little note nor long remember what we say here, but it can never forget what they did here.

It is for us the living rather to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us--that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion--that we here highly resolve that these dead shall not have died in vain, that this nation under God shall have a new birth of freedom, and that government of the people, by the people, for the people shall not perish from the earth.

  1. In relation to our population (the words in the passage above), we are interested in the following variables. Decide if each of the variables below is quantitative or categorical.

              Length of word (number of letters)



              Whether or not (Yes, No) the word contains more than 4 letters




  2. In the table below, record the information for each of the 10 representative words you circled in the Gettysburg Address.


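                                  1          2          3          4          5
    Word                       ________   ________   ________   ________   ________
    Length                     ________   ________   ________   ________   ________
    > 4 Characters (Yes, No)   ________   ________   ________   ________   ________

                                  6          7          8          9          10
    Word                       ________   ________   ________   ________   ________
    Length                     ________   ________   ________   ________   ________
    > 4 Characters (Yes, No)   ________   ________   ________   ________   ________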













  3. Ideally, we want our sample to be representative of our population, that is, to have the same characteristics. Do you think your sample of 10 words listed in the table above is representative of the 268 words in the Gettysburg Address? Explain.

  4.  

     
     
     
     

     

  5. Construct a "dot plot" (like a bar chart) of the lengths of the 10 words in your sample and record it below. Also calculate the mean word-length for your sample and record this below.













    x-bar =



  6. Is the mean x-bar from question (5) a statistic or a parameter?





  7. To display the distribution of a categorical variable for a single sample, we can use a bar chart. The bar chart has one bar for each category, and the height of each bar is the proportion of your sample in that category. You can use the title of your bar chart to help describe the variable. Let us refer to a word as Long if it has more than 4 letters, and Short if it has 4 or fewer letters. Make a bar chart of your sample for this categorical variable, and calculate your sample's proportion of Long words.
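    If you would like to check your hand calculations of the sample mean and the sample proportion of Long words, the short Python sketch below may help. Python is not part of this handout; the function name summarize_sample and the ten example words are illustrative assumptions, not class data.

```python
def summarize_sample(words):
    """Return (x_bar, p_hat) for a list of words.

    x_bar is the mean word length; p_hat is the proportion of
    Long words, where Long means more than 4 letters.
    """
    lengths = [len(w) for w in words]
    x_bar = sum(lengths) / len(lengths)
    p_hat = sum(1 for n in lengths if n > 4) / len(lengths)
    return x_bar, p_hat


# Hypothetical sample of 10 words taken from the passage.
sample = ["score", "and", "nation", "a", "liberty",
          "that", "war", "field", "we", "consecrate"]
x_bar, p_hat = summarize_sample(sample)
print(f"x-bar = {x_bar:.2f}, p-hat = {p_hat:.2f}")
```

    Replace the example words with your own 10 words to compare against the values you computed by hand.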















  8. Add your sample mean and sample proportion of long words to the class stem plots and the Excel spreadsheet.

  9. Is your proportion of long words from question (7) a statistic or a parameter?




  10. The mean length of all 268 words in our population is 4.29 letters. Is this number a parameter or statistic?




  11. There are 99 "Long" words in the population of 268 words. What proportion of the words in the population are Long? Is this value a parameter or statistic?





  12. Did everyone in your class obtain the same values for their sample means and sample proportions?






  13. For the two variables (separate plots for each) sample mean and sample proportion, make a stem plot, dot plot, or histogram that includes all of the class data. Indicate on each of your plots the value for the corresponding parameter (population mean = 4.29 letters, population proportion of long words = 0.37). You can draw a vertical line at the location of each parameter.
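    One possible way to draw such a plot on a computer is sketched below in Python (not required for this activity). The function name dot_plot and the list class_means are assumptions made for illustration; the values in class_means are made up and are NOT our class data. Each value is rounded to one decimal place and shown as a row of stars, and the parameter is marked on its own row rather than with a vertical line.

```python
from collections import Counter


def dot_plot(values, parameter):
    """Return a text dot plot with one row per distinct value.

    Values are rounded to 1 decimal place; the parameter gets its
    own labeled row so its position among the data is visible.
    """
    counts = Counter(round(v, 1) for v in values)
    rows = [(v, "*" * c) for v, c in counts.items()]
    rows.append((parameter, "<-- parameter"))
    rows.sort(key=lambda row: row[0])
    return "\n".join(f"{v:5.2f} | {mark}" for v, mark in rows)


# Made-up sample means, for illustration only.
class_means = [4.1, 4.6, 5.2, 4.6, 5.0, 4.3, 5.2, 4.6]
print(dot_plot(class_means, 4.29))
```

    The same function can be used for the sample proportions by passing 0.37 as the parameter.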




























  14. For the collection of sample means, describe the shape and center of this distribution.







  15. For the collection of sample proportions, describe the shape and center of this distribution.




    You have witnessed the fundamental principle of sampling variability: values of sample statistics vary when one repeatedly takes samples from a population. Both of your plots should show definite patterns in this variability. When we study this variability, our "individuals" become the samples, and the variables of interest are quantities like the sample's mean, proportion, or standard deviation. In a sense, we are now interested in the population of "all possible samples" (of size 10) from our original population of words in the Gettysburg Address. Because we now have both parameters and statistics around, we use different symbols for these quantities. Let us refer to our population mean as μ, and our sample means as x-bar. Similarly, we refer to the population proportion as p, and the sample proportions as p-hat.
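    The idea of "all possible samples" can be explored by simulation. The Python sketch below is only an illustration under stated assumptions: to stay short it uses a toy population made of the 30 words in the first sentence of the passage (not the full 268 words), it draws random samples rather than hand-picked ones, and the seed and the number of repetitions are arbitrary choices. It computes the parameters μ and p for the toy population and then collects x-bar and p-hat from many samples of size 10.

```python
import random
import re
import statistics

# Toy population: only the first sentence of the passage (30 words).
# The handout's real population is all 268 words.
SENTENCE = ("Four score and seven years ago our fathers brought forth "
            "upon this continent a new nation, conceived in liberty and "
            "dedicated to the proposition that all men are created equal.")
lengths = [len(w) for w in re.findall(r"[A-Za-z]+", SENTENCE)]

# Parameters: they describe the whole (toy) population.
mu = sum(lengths) / len(lengths)
p = sum(n > 4 for n in lengths) / len(lengths)

rng = random.Random(2007)  # arbitrary seed so the run can be repeated
x_bars, p_hats = [], []
for _ in range(1000):
    sample = rng.sample(lengths, 10)          # 10 distinct words
    x_bars.append(sum(sample) / 10)           # statistic: sample mean
    p_hats.append(sum(n > 4 for n in sample) / 10)  # statistic: proportion

print(f"mu = {mu:.2f}, average of x-bars = {statistics.mean(x_bars):.2f}")
print(f"p  = {p:.2f}, average of p-hats = {statistics.mean(p_hats):.2f}")
print(f"smallest x-bar = {min(x_bars):.1f}, largest = {max(x_bars):.1f}")
```

    The individual x-bar and p-hat values differ from sample to sample, which is exactly the variability described above, while μ and p stay fixed.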

  16. Was your sample mean, x-bar, above or below the population mean μ = 4.29 letters? How many and what percentage of the class's x-bar values exceeded the population mean?




  17. Was your sample proportion of long words, p-hat, above or below the population proportion p = 0.37? How many and what percentage of the class's p-hat values exceeded the population proportion?




  18. Past experience indicates that our sampling method (asking people to "hand select" representative words from a passage) is biased: people tend to overestimate both the average word length and the proportion of long words. Using your responses to questions (13) to (17), write a sentence or two about whether or not our results support this claim of bias.










  19. Rather than "hand selecting" our samples, we will now draw them from our population using the calculator's randInt command (do NOT seed your calculator, however). Each PERSON should select one SRS of size 5 (use randInt(1,268,5)) and one SRS of size 10 (use randInt(1,268,10)). Because randInt can repeat a label, we must remove duplicates from each sample (draw additional labels as needed to reach the full sample size). The population is shown below with all of the words labeled. Use your calculator to select your 2 samples, and record the values in the tables below.
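    For comparison, here is a minimal Python sketch of the same label-drawing step (the calculator remains the tool for this activity). The function names srs_labels and randint_with_redraw are hypothetical. The first function draws distinct labels directly; the second mimics randInt and assumes that a repeated label is handled by simply drawing again.

```python
import random


def srs_labels(n, population_size=268, rng=random):
    """Return n distinct labels from 1..population_size, sorted."""
    return sorted(rng.sample(range(1, population_size + 1), n))


def randint_with_redraw(n, population_size=268, rng=random):
    """Mimic randInt(1, population_size, n), redrawing duplicates."""
    chosen = []
    while len(chosen) < n:
        label = rng.randint(1, population_size)  # inclusive at both ends
        if label not in chosen:                  # skip repeated labels
            chosen.append(label)
    return chosen


print("SRS of size 5: ", srs_labels(5))
print("SRS of size 10:", randint_with_redraw(10))
```

    No seed is set here, so each run (like each unseeded calculator) should give a different sample.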

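SRS of size 5
Label                      ______  ______  ______  ______  ______
Word                       ______  ______  ______  ______  ______
Length                     ______  ______  ______  ______  ______
> 4 Characters (Yes, No)   ______  ______  ______  ______  ______

SRS of size 10
Labels           ______  ______  ______  ______  ______  ______  ______  ______  ______  ______
Word             ______  ______  ______  ______  ______  ______  ______  ______  ______  ______
Length           ______  ______  ______  ______  ______  ______  ______  ______  ______  ______
> 4 Characters   ______  ______  ______  ______  ______  ______  ______  ______  ______  ______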





































     







  1. For your sample of size 5, calculate the sample mean x-bar and the sample proportion p-hat. Record these values below and add these values to the class data stem plots and Excel spreadsheet.







  2. For your sample of size 10, calculate the sample mean x-bar and the sample proportion p-hat. Record these values below and add these values to the class data stem plots and Excel spreadsheet.








  3. Was your sample mean, x-bar, for your sample of size 5 above or below the population mean μ = 4.29 letters? How many and what percentage of the class's x-bar values for SRS's of size 5 exceeded the population mean?















  4. Was your sample mean, x-bar, for your sample of size 10 above or below the population mean μ = 4.29 letters? How many and what percentage of the class's x-bar values for SRS's of size 10 exceeded the population mean?














  5. Was your sample proportion of long words, p-hat, for your SRS of size 5 above or below the population proportion p = 0.37? How many and what percentage of the class's p-hat values for SRS's of size 5 exceeded the population proportion?

















  6. Was your sample proportion of long words, p-hat, for your SRS of size 10 above or below the population proportion p = 0.37? How many and what percentage of the class's p-hat values for SRS's of size 10 exceeded the population proportion?

















  7. Based on your last several answers, does it appear that using randInt to select our samples is a biased sampling procedure?
















  8. By looking at the class stem plots for sample means from SRS's of size 5 and then size 10, does there appear to be a difference in the spread of these 2 plots? If so, describe the difference (which one is less spread out, and roughly how big of a difference is there).














  9. By looking at the class stem plots for sample proportions from SRS's of size 5 and then size 10, does there appear to be a difference in the spread of these 2 plots? If so, describe the difference (which one is less spread out, and roughly how big of a difference is there).
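    The effect of sample size on spread can also be examined by simulation. The Python sketch below is an illustration only: it again uses a toy population (the 30 words of the passage's first sentence, not all 268 words), and the helper name sample_means, the seed, and the 2000 repetitions are arbitrary assumptions. It compares the standard deviation of x-bar for SRS's of size 5 and size 10.

```python
import random
import re
import statistics

# Toy population: the first sentence of the passage (30 words).
SENTENCE = ("Four score and seven years ago our fathers brought forth "
            "upon this continent a new nation, conceived in liberty and "
            "dedicated to the proposition that all men are created equal.")
lengths = [len(w) for w in re.findall(r"[A-Za-z]+", SENTENCE)]


def sample_means(n, reps, rng):
    """Return the sample means of `reps` SRS's of size n."""
    return [sum(rng.sample(lengths, n)) / n for _ in range(reps)]


rng = random.Random(10)  # arbitrary seed so the run can be repeated
sd5 = statistics.stdev(sample_means(5, 2000, rng))
sd10 = statistics.stdev(sample_means(10, 2000, rng))

print(f"SD of x-bar, n = 5:  {sd5:.2f}")
print(f"SD of x-bar, n = 10: {sd10:.2f}")
print(f"ratio: {sd5 / sd10:.2f}")
```

    Compare the ratio printed here with the difference in spread you see in the class stem plots; with only one sample per person, the class plots will be much rougher than a simulation with thousands of samples.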



Here is a labeled version of the population.