NAMES:
Association Activity Introductory Statistics, Fall 2007, Tom Linton 
The goal of this activity is to explore relationships between two variables measured on the same individual. We will look at scatterplots and association (both its direction and its strength), and introduce the notions of explanatory and response variables.

Many interesting statistical relationships exist between pairs of variables. Did you know that shorter women are much more likely to have heart attacks than taller women? Doesn't it seem reasonable to assume that the age at which parents were married is related to the age at which their children wed? These are examples of relationships between two variables. In both cases, one of the variables seems to explain, or predict something about, the other variable, while the response of this other variable is what we're interested in. In the first example, a woman's height is used to explain the number of heart attacks she has. The number of heart attacks seems to respond to changes in height, and the frequency of heart attacks is what we're interested in learning about. In the second example, the age at which parents were married is being used to explain differences in the ages at which their children wed. The "wedding age" of the children is the variable whose responses we're most interested in.

A response variable measures an outcome of interest in a study or experiment. An explanatory variable explains, or influences changes in a response variable. For the examples above, frequency of heart attacks and the age at which children get married are the response variables, while height and the age at which parents wed are the explanatory variables. Most of the time it is straightforward to decide which variable is the response variable and which is the explanatory variable. Frequently we are trying to predict values of the response variable based on knowledge of the explanatory variable.
  1. Each situation below involves two variables. Decide which is the response variable and which is the explanatory variable. Sometimes, when neither variable stands out as explaining or influencing the other, and neither variable is obviously of higher interest than the other, both variables could play both roles. In this case, simply say "both".
    1. The fuel efficiency of a vehicle (in miles per gallon) and its speed (in mph).



    2. The number of hours of studying and the score a student receives on a statistics exam.



    3. The age of the husband, and the age of the wife on their wedding day.



    4. The width (in feet) of an executive's office and the number of years they have worked for a company.



We can display a relationship between two variables by making a scatterplot of the data. For each individual, we plot the pair (x, y), where
x = explanatory variable, y = response variable.
We then look for the overall pattern and for any striking deviations from that pattern.




To do this on the TI-83 or 84, you simply:
  1. Use the statistical editor to enter the values of the explanatory variable (x) into L1 and the values of the response variable (y) into L2. Make sure that each individual's y-value is in the same row as their x-value.
  2. On the [STATPLOT] menu, select plot 1; turn it on and select the first icon in the top row in the [TYPE] field (the scatterplot icon, see below).
  3. Set Xlist to L1 and Ylist to L2.
  4. Select a type of Mark (boxes work well).
  5. Press [ZOOM] [9: ZoomStat].
    The stat plot window
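If you want to see the same idea away from the calculator, here is a minimal Python sketch of a rough text scatterplot. It is only an illustration, not part of the TI procedure: the function name text_scatter and the temperature and sales numbers are made up for this example, and the sketch assumes the x-values are not all equal and the y-values are not all equal.

```python
def text_scatter(xs, ys, width=40, height=12):
    """Return a list of strings forming a rough text scatterplot.

    xs holds explanatory-variable values and ys holds response-variable
    values; the i-th entries of each list belong to the same individual.
    Assumes xs and ys each contain at least two different values.
    """
    if len(xs) != len(ys):
        raise ValueError("each x-value needs a matching y-value")
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each point so the data fill the whole grid,
        # similar in spirit to fitting the window to the data.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is the top of the plot
    return ["".join(line) for line in grid]


# Hypothetical data (not from this activity):
# x = afternoon temperature, y = cups of lemonade sold.
temps = [15, 18, 21, 24, 27, 30]
cups = [20, 28, 25, 40, 48, 55]
plot = text_scatter(temps, cups)
print("\n".join(plot))
```

Each star is one individual (one day), with x running left to right and y running bottom to top, just as on the calculator screen.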
  1. Here is some data relating x = the cost (in millions of dollars) to make a movie, and y = the total income (in millions of dollars) of that movie. Make a scatterplot of this data and copy it below, being sure to label the axes, and provide a decent indication of the scale on each axis. You may need to use the [WINDOW] menu to do this.

Cost to produce movie (millions of dollars):   55   42   17   30   43   19   22   13   26   35
Income of movie (millions of dollars):        150  123   68   93  100   10   20   15    5   35











You should notice that, in general, as the value of the explanatory variable (x = cost) gets bigger, so does the value of the response variable (y = income).
  1. While the general pattern is that movies costing more to produce also have higher incomes, there are several examples where this is not the case. Find a pair of data points (x1, y1) and (x2, y2) where it cost more to make movie 1 than it did to make movie 2 (so x1 > x2), but movie 2 had a higher income than movie 1 (so y2 > y1).





    The concept of association between variables is an example of a statistical tendency. Not every movie that costs more to produce ends up with a higher income, but movies that are expensive to make tend to have higher incomes.

    We say that a positive association exists when values of the response variable (y) tend to increase as values of the explanatory variable (x) increase. In general this means that large values of x are paired with large values of y, and small values of x are paired with small values of y. On a scatterplot, the data will have a tendency to flow from the lower left corner of the plot to the upper right corner.

    We say that a negative association exists when values of the response variable (y) tend to decrease as values of the explanatory variable (x) increase. In general this means that large values of x are paired with small values of y, and small values of x are paired with large values of y. On a scatterplot, the data will have a tendency to flow from the upper left corner of the plot to the lower right corner.

    Sometimes there is no association, or a zero association, meaning that small, medium, and large values of the explanatory variable all tend to occur with small, medium, and large values of the response variable. On a scatterplot, there is no general pattern; the data are simply "all over the place", or look like a "cloud" of points. Most of the time, common sense can be used to guess whether an association is positive, negative, or near zero.
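One way to make the idea of a "tendency" concrete is to compare individuals two at a time. The Python sketch below is an illustration only, using a simple pair-counting rule that this activity does not itself use: a pair of individuals "agrees" when the one with the larger x also has the larger y, and "disagrees" when the one with the larger x has the smaller y. The function names and the data are hypothetical.

```python
from itertools import combinations


def count_pairs(xs, ys):
    """Count pairs of individuals that agree or disagree in direction.

    Returns (agree, disagree). Pairs tied in x or in y are not counted.
    """
    agree = disagree = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            agree += 1       # larger x goes with larger y
        elif product < 0:
            disagree += 1    # larger x goes with smaller y
    return agree, disagree


def direction(xs, ys):
    """Label the tendency by majority vote of the pairs (a rough rule)."""
    agree, disagree = count_pairs(xs, ys)
    if agree > disagree:
        return "positive"
    if disagree > agree:
        return "negative"
    return "near zero"


# Hypothetical data: x = outdoor temperature, y = weekly heating cost.
outdoor_temp = [0, 5, 10, 15, 20]
heating_cost = [90, 70, 75, 40, 30]
print(count_pairs(outdoor_temp, heating_cost))
print(direction(outdoor_temp, heating_cost))
```

In this made-up data set, one pair goes against the overall pattern, yet the tendency is still clearly negative. That is what "tend to decrease" means: most pairs follow the pattern, not necessarily all of them.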
  1. Below are brief descriptions of an explanatory variable and a response variable. For each pair, using only the verbal description given, guess as to whether the association would be positive, negative or near zero, and then explain your guess. If you cannot guess, just guess near zero.

    Explanatory variable                      Response variable                        Association guess    Explanation
    Length of hair in inches                  Cost of last haircut in dollars
    Number of hours spent training            Errors made by employees
    Size of a fast-food sandwich in ounces    Calories in the fast-food sandwich
    Highway fuel efficiency (mpg)             Size (gallons) of gas tank
    Mean amount of pop consumed per week      Length of right big toe
    Distance from Des Moines to a city        Airfare from Des Moines to that city
  1. Associations come in a variety of strengths. Strong associations are ones where the pattern is followed by nearly all of the data points (you can predict quite accurately the value of y, knowing only the value of x). Weak associations have many pairs that don't fit the pattern (each x-value has a range of possible y-values that it may be paired with). Below are six scatterplots, for which you are to determine the direction (positive or negative) of the association, and its strength (strong, moderate, or weak). Use each scatterplot once to fill in the table below.

Association    weak    moderate    strong
Positive
Negative




We want to describe these linear associations with a number called the correlation. Some data pairs have scatterplots that are much closer to a line than other scatterplots. We want our correlation value to quickly indicate how well a straight line describes our data (is the scatterplot almost a straight line, close to a straight line, or far from a straight line). In addition, this numerical description of association should also indicate whether the data are positively associated or negatively associated. The correlation, known as r on the TI-calculators, is a number that is always between -1 and 1. Negative correlation values indicate negative association, while positive correlation values indicate positive associations. Correlations near 1 indicate almost perfectly linear data with a positive slope. Correlations near -1 indicate almost perfectly linear data with a negative slope. Correlations near zero indicate scatterplots that do not look much like straight lines. You cannot accurately predict the correlation from just a scatterplot, but you can make statements like "the correlation is slightly positive" or "the correlation is definitely negative, but not too close to -1".
Each of the scatterplots below shows a typical scatterplot for the correlation range on the left, and displays the actual value of the correlation r as well. These plots will serve as a guide to what scatterplots with given correlations look like. For the moment, simply look at the pictures and the corresponding values of the correlation r.

Close to 1 (0.8 to 1): r = 0.98
Medium positive (0.3 to 0.7): r = 0.68
Slightly positive (0.1 to 0.3): r = 0.25
Near zero (-0.1 to 0.1): r = -0.02
Slightly negative (-0.3 to -0.1): r = -0.3
Medium negative (-0.7 to -0.3): r = -0.7
Near -1 (-1 to -0.8): r = -0.98
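This activity does not give a formula for r, but if you are curious how such a number can be computed, here is a minimal Python sketch using the standard Pearson correlation formula. The function name correlation and all of the data are hypothetical and chosen only to show the range of values r can take; the sketch assumes neither list consists of a single repeated value.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation r of paired data (always between -1 and 1)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of products and squares of the deviations from the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data sets (not from this activity).
r_line_up = correlation([1, 2, 3, 4], [3, 5, 7, 9])        # exactly on a rising line
r_line_down = correlation([1, 2, 3, 4], [9, 7, 5, 3])      # exactly on a falling line
r_scattered = correlation([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])  # rising, with some scatter
print(r_line_up, r_line_down, round(r_scattered, 2))
```

Points lying exactly on a rising line give r = 1, points lying exactly on a falling line give r = -1, and the scattered but rising data land in between at about 0.8.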

  1. Shown below are scatterplots of various data pairs. On each plot, draw in a straight line that you think best describes the scatterplot and guess a value (one number) for the correlation r (think about whether the association is positive or negative, and how strong it is). Pay careful attention to whether your line increases or decreases. If you don't know where to draw the line, draw it "down the middle". Look at the examples above for help.

 
 

  1. To calculate correlations on the TI-83 or 84, a one-time setup is required. From your home screen (press [2nd][QUIT] to get there if you are not already on it), press [2nd][CATALOG] (near the number 0 key) and then [D] (the letters are printed in green, above and to the right of the normal key labels). Now scroll down with the down-arrow button and press [ENTER] when you get to the DiagnosticsOn line. This pastes the DiagnosticsOn command to your home screen. Press [ENTER] to execute that command. From this point on, your calculator will display the correlation r, and sometimes r² (the square of the correlation), whenever you ask it to perform a linear regression (see the third bullet below for details on how).
    To calculate the correlation for (X,Y) data pairs:
    • Enter the explanatory variable values (X) in L1;
    • Enter the response variable data (Y) in L2;
    • Run a linear regression by pressing [STAT], arrowing right to the [CALC] menu, and selecting [8: LinReg(a + bx)]. Be careful: several regression types have names similar to number 8. This pastes the linear regression command to your home screen. At this point, you can tell the calculator where the data are by entering the list name for your x-values followed by the list name for your y-values. We'll learn about linear regression in the next chapter. For now, think of running a linear regression as asking the calculator for a linear equation (straight line) that best fits the (X,Y) data pairs. All we want is the value of r that gets printed near the bottom of the screen. Some examples:

      EXAMPLE 1:  The command LinReg(a+bx) L1, L2 asks the calculator to find a linear equation for the x-values in L1 and the y-values in L2. Once you execute this command, the correlation will be printed on your screen.

      EXAMPLE 2:  The command LinReg(a+bx) L3, L2 asks the calculator to find a linear equation for the x-values in L3 and the y-values in L2. Once you execute this command, the correlation will be printed on your screen.
       

  2. Hopefully you still have the data for movie costs and incomes in your calculator. Calculate the correlation r for this data.
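For readers who want to check calculator output on a computer, the following Python sketch mimics what a command like LinReg(a+bx) L1, L2 reports: the intercept a and slope b of the least-squares line y = a + bx, along with r² and r. It is a sketch under stated assumptions, not the calculator's actual program: the function name lin_reg and the two lists are hypothetical, and the lists deliberately do not contain the movie data.

```python
from math import sqrt


def lin_reg(xs, ys):
    """Least-squares line y = a + b*x and correlation r for paired data.

    Returns (a, b, r). Assumes at least two pairs, and that neither
    list consists of a single repeated value.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    b = sxy / sxx            # slope of the best-fitting line
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    r = sxy / sqrt(sxx * syy)
    return a, b, r


# Hypothetical lists standing in for L1 (x-values) and L2 (y-values).
list1 = [2, 4, 6, 8]
list2 = [10, 7, 6, 1]
a, b, r = lin_reg(list1, list2)
print(f"a = {a:.4f}")
print(f"b = {b:.4f}")
print(f"r^2 = {r * r:.4f}")
print(f"r = {r:.4f}")
```

As on the calculator, the order of the lists matters: the first list supplies the x-values and the second supplies the y-values. Swapping them changes a and b, although r stays the same.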