name: title
class: middle, center, dark
---
class: light, center, middle, clear

# Correlation does not equal Causation

---
class: light

# Questions for this class

1. **What is Causation?** Why do we care about it?
2. **What is Correlation?** Why do we care about it?
3. **Why does correlation not equal causation?** ...lots of reasons

---
class: light

# Causation

Why does a ping pong ball move when you hit it?

<img src="figs-crump/correlation/Figures/Figures.001.png" width="75%" />

---
class: light

# Simple Physics

1. The energy from swinging the paddle gets transferred to the ping pong ball when they collide
2. The **force** in the paddle that is transferred to the ball **causes** the ball to move

---
class: light

# Measuring two things

What if we measured two things in the ping pong ball example?

1. The strength of the swing
2. The distance travelled by the ball

Would we expect a relationship between the two measurements? What would happen to the ball if we swung the paddle softly, with medium force, or hard?

---
class: light, center, middle, clear

<img src="figs-crump/correlation/Figures/Figures.002.png" width="665" />

---
class: light, center, middle, clear

<img src="figs-crump/correlation/Figures/Figures.003.png" width="665" />

---
class: light, center, middle, clear

<img src="figs-crump/correlation/Figures/Figures.004.png" width="665" />

---
class: light, center, middle, clear

<img src="figs-crump/correlation/Figures/Figures.005.png" width="665" />

---
class: dark

# What have we learned

1. We know hitting a ball can cause it to move

--

2. The action of the causal force can be measured by relating a measure of the cause (swing strength) to a measure of the outcome (distance travelled)

--

3. There is a positive relationship between the two measures: increasing swing strength is associated with longer distances

---
class: dark

# What have we learned cont.

4. We use the term **correlation** to describe the relationship between the two measures; in this case we found a **positive** correlation

--

5. We have seen that a causal force (swing strength) can produce a correlation between two measures

---
class: light, center, middle, clear

# What if we indirectly measured swing strength by asking the person to report how hard they thought the swing was?

---
class: light, center, middle, clear

<img src="figs-crump/correlation/Figures/Figures.006.png" width="665" />

---
class: light, center, middle, clear

# What if our x-axis was a random number chosen by somebody before each swing?

---
class: light, center, middle, clear

<img src="figs-crump/correlation/Figures/Figures.007.png" width="665" />

---
class: light

# Causation

1. Psychological science is interested in understanding the causes of psychological processes

--

2. We can measure change in a causal force, and measure change in an outcome (psychological process)

--

3. If the force causally changes the outcome, we expect a relationship or association between the force and the outcome

---
class: light

# Correlation

1. A measure of association or relationship between two variables

--

2. In general, correlations occur when change in one variable corresponds in some regular way to change in another variable
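---
class: light

# A quick illustration in Python

A minimal sketch of the ideas above, using made-up numbers: `distance` is built to depend on `swing_strength`, while `random_number` is unrelated to it. The variable names, sample size, and noise level are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed so the sketch is reproducible

swing_strength = rng.uniform(1, 10, size=50)               # how hard each swing was
distance = 2 * swing_strength + rng.normal(0, 2, size=50)  # distance depends on swing strength (plus noise)
random_number = rng.uniform(1, 10, size=50)                 # a number unrelated to the swing

np.corrcoef(swing_strength, distance)[0, 1]  # strong positive correlation
np.corrcoef(random_number, distance)[0, 1]   # close to zero
```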
---
class: light

- Check out http://tylervigen.com/discover for more examples!
---
class: light

# 3 kinds of correlation

<img src="figs-crump/correlation/3posnegrand-1.png" width="692" />

---
class: light

# Positive correlation

- **Increases** in the X variable are associated with **increases** in the Y variable

AND

- **Decreases** in the X variable are associated with **decreases** in the Y variable

---
class: light

# Negative correlation

- **Increases** in the X variable are associated with **decreases** in the Y variable

AND

- **Decreases** in the X variable are associated with **increases** in the Y variable

---
class: light

# Random (no correlation)

- **Increases** in the X variable are **NOT** associated with **increases** or **decreases** in the Y variable

AND

- **Decreases** in the X variable are **NOT** associated with **increases** or **decreases** in the Y variable

---
class: light, center, middle, clear

# Correlations can be weak or strong

---
class: light

# Increasing positive correlation

---
class: light

# Negative correlations

---
class: light

# Correlation Strength

- Super strong (Perfect): dots all line up, no exceptions
- Strong: clear pattern, not much variation in dots
- Medium: there is a pattern, but the dots have a lot of variation
- Weak: only a hint of a pattern, the dots have loads of variation
- None: dots are everywhere, no clear pattern

---
class: light

# Pearson's r

A single number that summarizes the strength of the correlation

- varies between -1 and 1
- 0 means no correlation
- 1 means perfect positive correlation
- (-1) means perfect negative correlation
- values in between indicate more or less strength

**How do we compute this number?**

---
class: light

# Formula for Pearson's r

<img src="figs-crump/correlation/cov.png" width="636" />

<img src="figs-crump/correlation/r.png" width="676" />

---
class: light

# Covariation

If you have two measures that can vary, then **co-variation** is how they vary together

- If the variation in X is associated with (predicts or corresponds to) variation in Y, then there is some co-variation
- If the variation in X is totally random (does not predict or correspond to) variation in Y, then there isn't co-variation, there's just a lot of variation

---
class: light

# Measuring covariation

1. We will look at two sets of measures taken from 10 fictitious people (how much chocolate they have, and how happy they are)
2. We show that multiplying the scores and summing them up can help us measure covariation
3. We then show how to compute covariation more formally using mean difference scores

---
class: light

# Two variables

<img src="figs-crump/correlation/cov1.png" width="80%" />

---
class: light

# Sum of products

<img src="figs-crump/correlation/cov2.png" width="80%" />

---
class: light

# Sum of products continued

1. multiply each pair of scores
2. sum them all up

- What would make the sum of products very large?
- What would make the sum of products very small?

---
class: light

# Biggest (XY) vs. smallest (AB)

<img src="figs-crump/correlation/cov3.png" width="80%" />

---
class: light

# Biggest vs. smallest sums of products

**Biggest sum**

- occurs when the smallest to largest numbers in X are perfectly lined up with the smallest to largest numbers in Y

**Smallest sum**

- occurs when the smallest to largest numbers in X are perfectly lined up with the LARGEST to SMALLEST numbers in Y

---
class: light

# Simulation demonstration

1. Take the numbers in X and Y
2. Randomly scramble the numbers in X
3. Randomly scramble the numbers in Y
4. Compute the sum of products
5. Repeat this many times to see whether the sum of products always stays between the smallest and biggest values we have already found for these numbers (a Python sketch follows on the next slide)
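---
class: light

# Python: scramble simulation (sketch)

A minimal sketch of the scrambling procedure above, assuming small toy values for X and Y (the same ones used in the later Python slides); the number of scrambles is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

x = np.array([1, 2, 3, 4, 5])
y = np.array([4, 3, 6, 7, 8])

sums = []
for _ in range(1000):                 # repeat the scramble many times
    x_s = rng.permutation(x)          # randomly scramble the numbers in X
    y_s = rng.permutation(y)          # randomly scramble the numbers in Y
    sums.append(np.sum(x_s * y_s))    # sum of products for this scramble

min(sums), max(sums)  # stays between the smallest and biggest possible sums
```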
---
class: light

# Simulation Results

<img src="figs-crump/correlation/3simsum-1.png" width="80%" />

---
class: light

# The problem

A sum of products can describe the covariation between two measurements, but it is not a convenient number.

- It is not readily interpretable: you need to know the possible range to know whether the observed value reflects positive, negative, or zero co-variation

---
class: light

# Covariation (the formula)

Notice that the formula for the covariation involves a sum of products, but the products are not of the raw scores, they are of the mean difference scores

<img src="figs-crump/correlation/cov.png" width="636" />

---
class: light

# Still a problem

Covariation can tell us about correlation in the data; it is a measure of correlation.

- It varies between negative and positive values, so we can know the direction of the correlation
- But we still need to know the possible range to interpret the strength
- **Solution**: transform the covariation so it is always between -1 and +1

---
class: light

# Computing Covariation

<img src="figs-crump/correlation/cov4.png" width="90%" />

---
class: light

# Back to Pearson's r

Pearson's r divides the covariance by the product of the standard deviations of each variable. This constrains the r-value to lie between -1 and 1

<img src="figs-crump/correlation/r.png" width="676" />

---
class: light

# Python: sum of products

```python
import numpy as np

x = np.array([1,2,3,4,5])
y = np.array([4,3,6,7,8])
x*y
```

```
array([ 4, 6, 18, 28, 40])
```

```python
np.sum(x*y)
```

```
96
```

---
class: light

# Python: covariation

```python
x = np.array([1,2,3,4,5])
y = np.array([4,3,6,7,8])
N = 5
np.sum((x-x.mean())*(y-y.mean()))/N
```

```
2.4
```

---
class: light

# Python: Pearson's r

```python
x = np.array([1,2,3,4,5])
y = np.array([4,3,6,7,8])
N = 5
covariation = np.sum((x-x.mean())*(y-y.mean()))/N
SD_x = np.sqrt(np.sum(((x-x.mean())**2))/N)
SD_y = np.sqrt(np.sum(((y-y.mean())**2))/N)
r = covariation/(SD_x*SD_y)
r
```

```
0.914991421995628
```

---
class: light

# SciPy's pearsonr() function

The `scipy.stats` module has a function to compute correlations called `pearsonr()`.

```python
import scipy.stats as stats

x = np.array([1,2,3,4,5])
y = np.array([4,3,6,7,8])
(r,p) = stats.pearsonr(x,y)
r
```

```
0.914991421995628
```

---
class: light

# Pearson's r (a different way)

1. Pearson's r gives us a value between -1 and 1 that is easier to interpret than covariation (which is not necessarily between -1 and 1).
2. The reason covariance is not between -1 and 1 is that the values for X and Y can have very different ranges

What happens when we transform the X and Y values so they are in the same range...

---
class: light

# Transforming Scores

Any set of scores can be transformed. E.g., to transform proportions (between 0 and 1) to percentages (between 0 and 100), we multiply the proportions by 100

```python
x = np.array([.1, .3, .5, .6, .8, .9])
x
```

```
array([0.1, 0.3, 0.5, 0.6, 0.8, 0.9])
```

```python
x*100
```

```
array([10., 30., 50., 60., 80., 90.])
```
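---
class: light

# Rescaling changes covariation, not r (sketch)

A quick check of the range point from the previous slides, reusing the toy x and y values: multiplying Y by 100 changes the covariation, but Pearson's r stays the same. The helper function here is just for illustration.

```python
import numpy as np
import scipy.stats as stats

x = np.array([1,2,3,4,5])
y = np.array([4,3,6,7,8])
N = 5

def covariation(a, b):
    # sum of products of mean difference scores, divided by N
    return np.sum((a - a.mean()) * (b - b.mean())) / N

covariation(x, y)        # 2.4
covariation(x, y*100)    # 240.0 -- depends on the scale of Y

stats.pearsonr(x, y)[0]      # 0.914991...
stats.pearsonr(x, y*100)[0]  # same r -- unaffected by rescaling
```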
---
class: light

# Difference score transformation

If we subtract the mean from each score, we center the data around 0 and preserve the differences between the numbers, but the ranges can be different.

```python
x = np.array([1,3,5,7,9])
x-np.mean(x)
```

```
array([-4., -2.,  0.,  2.,  4.])
```

```python
y = np.array([100,300,500,700,900])
y-np.mean(y)
```

```
array([-400., -200.,    0.,  200.,  400.])
```

---
class: light

# Z score transformation

If we divide the mean difference scores by their standard deviation, we center the data around 0 and make the ranges the same

```python
x = np.array([1,3,5,7,9])
(x-np.mean(x))/np.std(x, ddof=1)
```

```
array([-1.26491106, -0.63245553,  0.        ,  0.63245553,  1.26491106])
```

```python
y = np.array([100,300,500,700,900])
(y-np.mean(y)) / np.std(y, ddof=1)
```

```
array([-1.26491106, -0.63245553,  0.        ,  0.63245553,  1.26491106])
```

---
class: light

# Computing correlation from z-scores

```python
x = np.array([1,4,6,8,9])
y = np.array([200,500,100,800,900])
N = 5
z_x = (x-np.mean(x)) / np.std(x, ddof=1)
z_y = (y-np.mean(y)) / np.std(y, ddof=1)
covariation = np.sum(z_x*z_y)/(N-1)
covariation
```

```
0.7270768712447324
```

```python
(r,p) = stats.pearsonr(x,y)
r
```

```
0.7270768712447323
```

---
class: light

# Next class: Regression Lab!

---
class: light

Thanks to Matt Crump for some of the slides.