name: title
class: middle, center, dark
---
class: light, center, middle, clear

# What do lots of numbers look like?

---
class: light

# Lots of Numbers look like this

Like this

<div class=rtable> <table> <tbody> <tr> <td style="text-align:right;"> 52 </td> <td style="text-align:right;"> -23 </td> <td style="text-align:right;"> 57 </td> <td style="text-align:right;"> 91 </td> <td style="text-align:right;"> -42 </td> <td style="text-align:right;"> 34 </td> <td style="text-align:right;"> -59 </td> <td style="text-align:right;"> -50 </td> <td style="text-align:right;"> -80 </td> <td style="text-align:right;"> -71 </td> <td style="text-align:right;"> 35 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 57 </td> <td style="text-align:right;"> 49 </td> <td style="text-align:right;"> -14 </td> </tr> <tr> <td style="text-align:right;"> 100 </td> <td style="text-align:right;"> -48 </td> <td style="text-align:right;"> -49 </td> <td style="text-align:right;"> -25 </td> <td style="text-align:right;"> -75 </td> <td style="text-align:right;"> 81 </td> <td style="text-align:right;"> -69 </td> <td style="text-align:right;"> -5 </td> <td style="text-align:right;"> 79 </td> <td style="text-align:right;"> -85 </td> <td style="text-align:right;"> -5 </td> <td style="text-align:right;"> -69 </td> <td style="text-align:right;"> 98 </td> <td style="text-align:right;"> -11 </td> <td style="text-align:right;"> 89 </td> <td style="text-align:right;"> -24 </td> </tr> <tr> <td style="text-align:right;"> -55 </td> <td style="text-align:right;"> -14 </td> <td style="text-align:right;"> -51 </td> <td style="text-align:right;"> 49 </td> <td style="text-align:right;"> 74 </td> <td style="text-align:right;"> -71 </td> <td style="text-align:right;"> 91 </td> <td style="text-align:right;"> 77 </td> <td style="text-align:right;"> 68 </td> <td style="text-align:right;"> 29 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> -81 </td> <td 
style="text-align:right;"> 21 </td> <td style="text-align:right;"> 86 </td> <td style="text-align:right;"> -32 </td> <td style="text-align:right;"> 15 </td> </tr> <tr> <td style="text-align:right;"> 65 </td> <td style="text-align:right;"> -22 </td> <td style="text-align:right;"> 85 </td> <td style="text-align:right;"> -57 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 54 </td> <td style="text-align:right;"> 100 </td> <td style="text-align:right;"> 76 </td> <td style="text-align:right;"> -11 </td> <td style="text-align:right;"> 83 </td> <td style="text-align:right;"> -60 </td> <td style="text-align:right;"> 74 </td> <td style="text-align:right;"> -61 </td> <td style="text-align:right;"> 30 </td> <td style="text-align:right;"> 93 </td> <td style="text-align:right;"> -53 </td> </tr> <tr> <td style="text-align:right;"> 90 </td> <td style="text-align:right;"> 90 </td> <td style="text-align:right;"> -68 </td> <td style="text-align:right;"> 51 </td> <td style="text-align:right;"> 85 </td> <td style="text-align:right;"> -58 </td> <td style="text-align:right;"> -56 </td> <td style="text-align:right;"> 38 </td> <td style="text-align:right;"> -34 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 66 </td> <td style="text-align:right;"> -52 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> -10 </td> <td style="text-align:right;"> -34 </td> <td style="text-align:right;"> -42 </td> </tr> <tr> <td style="text-align:right;"> 99 </td> <td style="text-align:right;"> 24 </td> <td style="text-align:right;"> -30 </td> <td style="text-align:right;"> -1 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 46 </td> <td style="text-align:right;"> -11 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 69 </td> <td style="text-align:right;"> 67 </td> <td style="text-align:right;"> -17 </td> <td style="text-align:right;"> -48 
</td> <td style="text-align:right;"> 36 </td> <td style="text-align:right;"> -62 </td> <td style="text-align:right;"> -86 </td> </tr> <tr> <td style="text-align:right;"> -24 </td> <td style="text-align:right;"> -28 </td> <td style="text-align:right;"> -9 </td> <td style="text-align:right;"> -13 </td> <td style="text-align:right;"> 19 </td> <td style="text-align:right;"> -3 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 90 </td> <td style="text-align:right;"> -63 </td> <td style="text-align:right;"> -28 </td> <td style="text-align:right;"> -18 </td> <td style="text-align:right;"> 29 </td> <td style="text-align:right;"> 92 </td> <td style="text-align:right;"> 28 </td> <td style="text-align:right;"> -94 </td> <td style="text-align:right;"> -25 </td> </tr> <tr> <td style="text-align:right;"> 26 </td> <td style="text-align:right;"> 93 </td> <td style="text-align:right;"> 21 </td> <td style="text-align:right;"> 39 </td> <td style="text-align:right;"> -90 </td> <td style="text-align:right;"> 62 </td> <td style="text-align:right;"> -19 </td> <td style="text-align:right;"> 36 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> -27 </td> <td style="text-align:right;"> -67 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> -19 </td> <td style="text-align:right;"> -46 </td> <td style="text-align:right;"> 69 </td> <td style="text-align:right;"> 48 </td> </tr> <tr> <td style="text-align:right;"> -45 </td> <td style="text-align:right;"> 98 </td> <td style="text-align:right;"> -56 </td> <td style="text-align:right;"> -48 </td> <td style="text-align:right;"> 69 </td> <td style="text-align:right;"> 98 </td> <td style="text-align:right;"> 31 </td> <td style="text-align:right;"> -32 </td> <td style="text-align:right;"> 69 </td> <td style="text-align:right;"> 68 </td> <td style="text-align:right;"> -2 </td> <td style="text-align:right;"> -99 </td> <td style="text-align:right;"> 31 </td> <td 
style="text-align:right;"> 66 </td> <td style="text-align:right;"> 65 </td> <td style="text-align:right;"> -80 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 57 </td> <td style="text-align:right;"> -49 </td> <td style="text-align:right;"> 92 </td> <td style="text-align:right;"> 65 </td> <td style="text-align:right;"> -54 </td> <td style="text-align:right;"> -95 </td> <td style="text-align:right;"> -73 </td> <td style="text-align:right;"> -61 </td> <td style="text-align:right;"> -71 </td> <td style="text-align:right;"> -61 </td> <td style="text-align:right;"> 70 </td> <td style="text-align:right;"> 52 </td> <td style="text-align:right;"> -1 </td> <td style="text-align:right;"> 8 </td> </tr> </tbody> </table> </div> --- class: light # What can we say about them? We can see they aren't all the same. Not much else really. Looking at a bunch of numbers is hard work. <div class=rtable> <table> <tbody> <tr> <td style="text-align:right;"> 52 </td> <td style="text-align:right;"> -23 </td> <td style="text-align:right;"> 57 </td> <td style="text-align:right;"> 91 </td> <td style="text-align:right;"> -42 </td> <td style="text-align:right;"> 34 </td> <td style="text-align:right;"> -59 </td> <td style="text-align:right;"> -50 </td> <td style="text-align:right;"> -80 </td> <td style="text-align:right;"> -71 </td> <td style="text-align:right;"> 35 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 57 </td> <td style="text-align:right;"> 49 </td> <td style="text-align:right;"> -14 </td> </tr> <tr> <td style="text-align:right;"> 100 </td> <td style="text-align:right;"> -48 </td> <td style="text-align:right;"> -49 </td> <td style="text-align:right;"> -25 </td> <td style="text-align:right;"> -75 </td> <td style="text-align:right;"> 81 </td> <td style="text-align:right;"> -69 </td> <td style="text-align:right;"> -5 </td> <td 
style="text-align:right;"> 79 </td> <td style="text-align:right;"> -85 </td> <td style="text-align:right;"> -5 </td> <td style="text-align:right;"> -69 </td> <td style="text-align:right;"> 98 </td> <td style="text-align:right;"> -11 </td> <td style="text-align:right;"> 89 </td> <td style="text-align:right;"> -24 </td> </tr> <tr> <td style="text-align:right;"> -55 </td> <td style="text-align:right;"> -14 </td> <td style="text-align:right;"> -51 </td> <td style="text-align:right;"> 49 </td> <td style="text-align:right;"> 74 </td> <td style="text-align:right;"> -71 </td> <td style="text-align:right;"> 91 </td> <td style="text-align:right;"> 77 </td> <td style="text-align:right;"> 68 </td> <td style="text-align:right;"> 29 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> -81 </td> <td style="text-align:right;"> 21 </td> <td style="text-align:right;"> 86 </td> <td style="text-align:right;"> -32 </td> <td style="text-align:right;"> 15 </td> </tr> <tr> <td style="text-align:right;"> 65 </td> <td style="text-align:right;"> -22 </td> <td style="text-align:right;"> 85 </td> <td style="text-align:right;"> -57 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 54 </td> <td style="text-align:right;"> 100 </td> <td style="text-align:right;"> 76 </td> <td style="text-align:right;"> -11 </td> <td style="text-align:right;"> 83 </td> <td style="text-align:right;"> -60 </td> <td style="text-align:right;"> 74 </td> <td style="text-align:right;"> -61 </td> <td style="text-align:right;"> 30 </td> <td style="text-align:right;"> 93 </td> <td style="text-align:right;"> -53 </td> </tr> <tr> <td style="text-align:right;"> 90 </td> <td style="text-align:right;"> 90 </td> <td style="text-align:right;"> -68 </td> <td style="text-align:right;"> 51 </td> <td style="text-align:right;"> 85 </td> <td style="text-align:right;"> -58 </td> <td style="text-align:right;"> -56 </td> <td style="text-align:right;"> 38 </td> <td style="text-align:right;"> 
-34 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 66 </td> <td style="text-align:right;"> -52 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> -10 </td> <td style="text-align:right;"> -34 </td> <td style="text-align:right;"> -42 </td> </tr> <tr> <td style="text-align:right;"> 99 </td> <td style="text-align:right;"> 24 </td> <td style="text-align:right;"> -30 </td> <td style="text-align:right;"> -1 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 46 </td> <td style="text-align:right;"> -11 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 69 </td> <td style="text-align:right;"> 67 </td> <td style="text-align:right;"> -17 </td> <td style="text-align:right;"> -48 </td> <td style="text-align:right;"> 36 </td> <td style="text-align:right;"> -62 </td> <td style="text-align:right;"> -86 </td> </tr> <tr> <td style="text-align:right;"> -24 </td> <td style="text-align:right;"> -28 </td> <td style="text-align:right;"> -9 </td> <td style="text-align:right;"> -13 </td> <td style="text-align:right;"> 19 </td> <td style="text-align:right;"> -3 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 90 </td> <td style="text-align:right;"> -63 </td> <td style="text-align:right;"> -28 </td> <td style="text-align:right;"> -18 </td> <td style="text-align:right;"> 29 </td> <td style="text-align:right;"> 92 </td> <td style="text-align:right;"> 28 </td> <td style="text-align:right;"> -94 </td> <td style="text-align:right;"> -25 </td> </tr> <tr> <td style="text-align:right;"> 26 </td> <td style="text-align:right;"> 93 </td> <td style="text-align:right;"> 21 </td> <td style="text-align:right;"> 39 </td> <td style="text-align:right;"> -90 </td> <td style="text-align:right;"> 62 </td> <td style="text-align:right;"> -19 </td> <td style="text-align:right;"> 36 </td> <td style="text-align:right;"> 14 </td> <td 
style="text-align:right;"> -27 </td> <td style="text-align:right;"> -67 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> -19 </td> <td style="text-align:right;"> -46 </td> <td style="text-align:right;"> 69 </td> <td style="text-align:right;"> 48 </td> </tr> <tr> <td style="text-align:right;"> -45 </td> <td style="text-align:right;"> 98 </td> <td style="text-align:right;"> -56 </td> <td style="text-align:right;"> -48 </td> <td style="text-align:right;"> 69 </td> <td style="text-align:right;"> 98 </td> <td style="text-align:right;"> 31 </td> <td style="text-align:right;"> -32 </td> <td style="text-align:right;"> 69 </td> <td style="text-align:right;"> 68 </td> <td style="text-align:right;"> -2 </td> <td style="text-align:right;"> -99 </td> <td style="text-align:right;"> 31 </td> <td style="text-align:right;"> 66 </td> <td style="text-align:right;"> 65 </td> <td style="text-align:right;"> -80 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 57 </td> <td style="text-align:right;"> -49 </td> <td style="text-align:right;"> 92 </td> <td style="text-align:right;"> 65 </td> <td style="text-align:right;"> -54 </td> <td style="text-align:right;"> -95 </td> <td style="text-align:right;"> -73 </td> <td style="text-align:right;"> -61 </td> <td style="text-align:right;"> -71 </td> <td style="text-align:right;"> -61 </td> <td style="text-align:right;"> 70 </td> <td style="text-align:right;"> 52 </td> <td style="text-align:right;"> -1 </td> <td style="text-align:right;"> 8 </td> </tr> </tbody> </table> </div> --- class: light # Summary numbers It would be nice to reduce the big set of numbers down to a few numbers that we can look at and make sense of. **Sameness (Central Tendency)** - What are all the numbers close to? **Differentness (Variance)** - How different are the numbers? 
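To make the idea concrete, here is a minimal sketch (not part of the original slides) that boils 160 made-up numbers down to two summaries. The seeded generator and the names `numbers`, `central`, and `spread` are assumptions chosen for illustration; `np.mean` and `np.var` stand in for the sameness and differentness measures.

```python
import numpy as np

# Hypothetical data: 160 whole numbers between -100 and 100,
# seeded so the example gives the same numbers every run
rng = np.random.default_rng(seed=1)
numbers = rng.integers(-100, 100 + 1, size=160)

# One number for sameness, one number for differentness
central = np.mean(numbers)   # what the numbers are close to
spread = np.var(numbers)     # how different the numbers are

print(numbers.size, central, spread)
```

Two numbers are much easier to look at than 160, which is the point of a summary.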
---
class: light

# Descriptive Statistics

- Give us summaries of big sets of numbers
- Useful single numbers to look at
- They tell us about patterns of sameness and differentness

---
class: light, center, middle, clear

# Graph the numbers to get a better look

---
class: light

# Dot plot (unordered)

Graphing the numbers gives a quick and dirty sense of what they are like. Here are 200 numbers presented as dots

<img src="2-Descriptives_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---
class: light

# Dot plot (ordered)

Sorting the numbers from smallest to largest

<img src="2-Descriptives_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---
class: light

# Histograms

Histograms count up the numbers inside specific ranges

<img src="2-Descriptives_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---
class: light

# Histograms

Bars show you which bins have more or fewer numbers in their range

<img src="2-Descriptives_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---
class: light

# So what are these numbers like?

What single number would you say best describes most of these numbers?

<img src="2-Descriptives_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---
class: light

# Question

Is the red or blue value a better summary of all the numbers?

<img src="2-Descriptives_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---
class: light, center, middle, clear

# Measures of Central Tendency

---
class: light

# Central Tendency

1. **Central tendency** should describe what most of the data is like

--

2. We want our summary number to be most like the other numbers. We want it to be a **representative value**

--

3. There are **multiple measures** of central tendency

--

4. They have **different properties**

--

5. Some work better than others depending on the data

---
class: dark, center, middle, clear

# Mode

---
class: light

# Mode

The mode is the single most frequently occurring number

> 1 1 2 2 3 4 5 6 7 7 7 7 7

- The mode is 7 because 7 happens the most
- Find the mode by counting the occurrences of each number; the mode is the most frequently occurring number
- If there is a tie, then you have two or three or more modes (depends on how many different numbers tie)

---
class: light

# Finding the Mode in Python

We make 25 numbers. How do we get Python to find the mode?

```python
import numpy as np

# 25 random whole numbers from 1 to 10
a = np.random.randint(1, 10 + 1, 25)
# counts[k] is how many times the value k occurs in a
counts = np.bincount(a)
# value with the highest count (in a tie, argmax reports only the smallest)
mode = np.argmax(counts)
mode, counts[mode]
```

---
class: light

# Custom function for the mode in Python

You can always write your own function for the mode. This one is called `my_mode`

```python
def my_mode(array):
    # count the values in the array passed in, not a global variable
    counts = np.bincount(array)
    mode = np.argmax(counts)
    return mode, counts[mode]

a = np.random.randint(1, 10 + 1, 25)
my_mode(a)
```

---
class: light

# Thinking about the mode

1. Tells us the most frequent number(s)

--

2. Is it representative of all the numbers?

--

3. When would the mode be a good thing to know?

---
class: pink, center, middle, clear

# Median

---
class: light

# Median

The median is the middle number

> 1 1 2 2 3 4 **5** 6 7 7 7 7 7

- The median is 5 because it is the middle number
- Find the median by ordering the numbers from smallest to largest, then take the number in the middle

---
class: light

# Median (even number of numbers)

If there is an even number of numbers, find the two in the middle and take the value halfway between them

> 1 2 3 **4** **5** 6 7 8

- The median is 4.5 because 4.5 is halfway between the two middle numbers

---
class: light

# Finding the Median in Python

Put some numbers in a variable, then call `np.median()`.

```python
a = np.random.randint(1, 10 + 1, 12)
np.median(a)
```

---
class: light

# Thinking about the median

1. Tells us the number in the middle of the ordered numbers

--

2. Is it representative of all the numbers?

--

3. When would the median be a good thing to know?

---
class: light, center, middle, clear

# Mean

---
class: light

# Mean

The mean (also called the average) is the sum of the numbers, divided by the number of numbers

`\(\text{Mean} = \frac{\text{sum of numbers}}{\text{number of numbers}}\)`

> 1 1 2 2 3 4 5 6 7 7 7 7 7

- Sum = 1+1+2+2+3+4+5+6+7+7+7+7+7 = 59
- Number of numbers = 13
- Mean = 59/13 = 4.538462

---
class: light

# Mean

`\(\text{Mean} = \bar{X} = \frac{\sum_{i=1}^{i=N}{x_i}}{N}\)`

- `\(\bar{X}\)` = the mean (the bar symbolizes the mean)
- `\(\sum_{i=1}^{i=N}{x_i}\)` = summation notation
- `\(x\)` = all the numbers (1,2,3,4...)
- `\(i\)` = an index value that runs from the first number in x to the last, covering every number in between
- `\(N\)` = the number of numbers
- `\(\sum\)` = instruction to add up numbers

---
class: light

# Summation example

`\(x = 4,7,9\)`

`\(\sum_{i=1}^{i=N}{x_i}\)` = `\(x_{i=1} = 4\)` + `\(x_{i=2} = 7\)` + `\(x_{i=3} = 9\)`

`\(\sum_{i=1}^{i=N}{x_i} = 4 + 7 + 9 = 20\)`

---
class: light

# Mean in a table

<table> <thead> <tr> <th style="text-align:left;"> index </th> <th style="text-align:left;"> x </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 4 </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 7 </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 2 </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 9 </td> </tr> <tr> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 8 </td> </tr> <tr> <td style="text-align:left;"> Sum </td> <td style="text-align:left;"> 30 </td> </tr> <tr> <td style="text-align:left;"> N </td> <td style="text-align:left;"> 5 </td> </tr> <tr> <td style="text-align:left;"> Mean </td> <td style="text-align:left;"> 6 </td> </tr> </tbody> </table>

---
class: light

# The mean equally divides the sum

<table> <thead> <tr> <th style="text-align:left;"> index </th> <th 
style="text-align:left;"> x </th> <th style="text-align:left;"> equal_parts </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 6 </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 7 </td> <td style="text-align:left;"> 6 </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 6 </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> 6 </td> </tr> <tr> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 8 </td> <td style="text-align:left;"> 6 </td> </tr> <tr> <td style="text-align:left;"> Sum </td> <td style="text-align:left;"> 30 </td> <td style="text-align:left;"> 30 </td> </tr> <tr> <td style="text-align:left;"> N </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 5 </td> </tr> <tr> <td style="text-align:left;"> Mean </td> <td style="text-align:left;"> 6 </td> <td style="text-align:left;"> 6 </td> </tr> </tbody> </table> --- class: light # The mean is the balancing point .pull-left[ - deviation = score minus mean - sum of deviations will always equal zero ] .pull-right[ <table> <thead> <tr> <th style="text-align:left;"> index </th> <th style="text-align:left;"> x </th> <th style="text-align:left;"> deviations </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> -2 </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 7 </td> <td style="text-align:left;"> 1 </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> -4 </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> 3 </td> </tr> <tr> <td style="text-align:left;"> 5 </td> 
<td style="text-align:left;"> 8 </td> <td style="text-align:left;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Sum </td> <td style="text-align:left;"> 30 </td> <td style="text-align:left;"> 0 </td> </tr> <tr> <td style="text-align:left;"> N </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 5 </td> </tr> <tr> <td style="text-align:left;"> Mean </td> <td style="text-align:left;"> 6 </td> <td style="text-align:left;"> 0 </td> </tr> </tbody> </table>
]

---
class: light

# Finding the Mean in Python

Use the `np.mean()` function

```python
# make some numbers
a = np.random.randint(1, 10 + 1, 12)
np.mean(a)
```

---
class: light

# sum() and size

- `np.sum()` sums up the numbers
- `.size` counts up the number of numbers in the variable

```python
a = np.random.randint(1, 10 + 1, 12)
np.sum(a)
```

```python
a.size
```

---
class: light

# Mean = sum()/size

```python
a = np.random.randint(1, 10 + 1, 12)
np.sum(a) / a.size
```

---
class: light

# Thinking about the Mean

1. The mean divides the total sum into equal parts.

--

2. Is it representative of all the numbers?

--

3. When would the mean be a good thing to know?

---
class: light, center, middle, clear

# Do descriptive statistics for central tendency actually describe the data?

## It depends on the data

---
class: light

# Histogram shape: Bell-Shaped

Mean (Red), Median (Green), Mode (Blue)

<img src="2-Descriptives_files/figure-html/unnamed-chunk-24-1.png" width="450px" style="display: block; margin: auto;" />

---
class: light

# Right-skewed

<img src="2-Descriptives_files/figure-html/unnamed-chunk-25-1.png" width="450px" style="display: block; margin: auto;" />

---
class: light

# Outliers

Outliers are really big or really small values that are unusual compared to the rest of the data

<img src="2-Descriptives_files/figure-html/unnamed-chunk-26-1.png" width="400px" style="display: block; margin: auto;" />

---
class: light

# Mean, Median, and outliers

The mean is strongly influenced by outliers; the median much less so.
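A small sketch of this contrast, using made-up numbers rather than the data in the figure: five ordinary values, then the same values with one outlier of 2000 added. The array names are hypothetical.

```python
import numpy as np

# Made-up data for illustration: five ordinary values
scores = np.array([1, 2, 3, 4, 5])
# The same values plus one very large outlier
scores_with_outlier = np.append(scores, 2000)

mean_before = np.mean(scores)
median_before = np.median(scores)
mean_after = np.mean(scores_with_outlier)
median_after = np.median(scores_with_outlier)

print(mean_before, median_before)  # 3.0 3.0
print(mean_after, median_after)    # about 335.83 and 3.5
```

The outlier is included in the sum, so it drags the mean far from most of the data, while the median only moves to the next middle position.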
<img src="2-Descriptives_files/figure-html/unnamed-chunk-27-1.png" width="400px" style="display: block; margin: auto;" />

---
class: light

# Zooming in

The big number (2000) makes the mean really big, because it is included in the sum.

<img src="2-Descriptives_files/figure-html/unnamed-chunk-28-1.png" width="400px" style="display: block; margin: auto;" />

---
class: pink, center, middle, clear

# Always plot your data

---
class: light

# Big ideas

1. Descriptive statistics help us reduce a large pile of numbers to a few numbers that "describe the data"

--

2. Mode, median, and mean are descriptives for central tendency in the data (meant to represent what most of the numbers are like)

--

3. Measures of central tendency can be "off" by quite a bit depending on the shape of the data, so we need to look at the data to see if they are appropriate

---
class: light, center, middle, clear

# We already know what lots of numbers look like

---
class: light

# Lots of Numbers look like this

Like this

<div class=rtable> <table> <tbody> <tr> <td style="text-align:right;"> -57 </td> <td style="text-align:right;"> 27 </td> <td style="text-align:right;"> -16 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> -33 </td> <td style="text-align:right;"> -48 </td> <td style="text-align:right;"> 25 </td> <td style="text-align:right;"> -30 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 83 </td> <td style="text-align:right;"> 45 </td> <td style="text-align:right;"> 29 </td> <td style="text-align:right;"> -72 </td> <td style="text-align:right;"> 57 </td> <td style="text-align:right;"> 23 </td> <td style="text-align:right;"> -35 </td> </tr> <tr> <td style="text-align:right;"> 27 </td> <td style="text-align:right;"> 96 </td> <td style="text-align:right;"> 83 </td> <td style="text-align:right;"> -41 </td> <td style="text-align:right;"> 27 </td> <td style="text-align:right;"> -57 </td> <td style="text-align:right;"> -76 </td> <td style="text-align:right;"> 
60 </td> <td style="text-align:right;"> -13 </td> <td style="text-align:right;"> 83 </td> <td style="text-align:right;"> 38 </td> <td style="text-align:right;"> -75 </td> <td style="text-align:right;"> -58 </td> <td style="text-align:right;"> -30 </td> <td style="text-align:right;"> -58 </td> <td style="text-align:right;"> 16 </td> </tr> <tr> <td style="text-align:right;"> -63 </td> <td style="text-align:right;"> -84 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 52 </td> <td style="text-align:right;"> 75 </td> <td style="text-align:right;"> -58 </td> <td style="text-align:right;"> 88 </td> <td style="text-align:right;"> -51 </td> <td style="text-align:right;"> 54 </td> <td style="text-align:right;"> -73 </td> <td style="text-align:right;"> -50 </td> <td style="text-align:right;"> -80 </td> <td style="text-align:right;"> -19 </td> <td style="text-align:right;"> -9 </td> <td style="text-align:right;"> 36 </td> <td style="text-align:right;"> 26 </td> </tr> <tr> <td style="text-align:right;"> -47 </td> <td style="text-align:right;"> -94 </td> <td style="text-align:right;"> -4 </td> <td style="text-align:right;"> -60 </td> <td style="text-align:right;"> -69 </td> <td style="text-align:right;"> 74 </td> <td style="text-align:right;"> 32 </td> <td style="text-align:right;"> -7 </td> <td style="text-align:right;"> 98 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 31 </td> <td style="text-align:right;"> 38 </td> <td style="text-align:right;"> 38 </td> <td style="text-align:right;"> -46 </td> </tr> <tr> <td style="text-align:right;"> -53 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 44 </td> <td style="text-align:right;"> 22 </td> <td style="text-align:right;"> 98 </td> <td style="text-align:right;"> 32 </td> <td style="text-align:right;"> 70 </td> <td style="text-align:right;"> 92 </td> <td 
style="text-align:right;"> 90 </td> <td style="text-align:right;"> -41 </td> <td style="text-align:right;"> 52 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 62 </td> <td style="text-align:right;"> -65 </td> <td style="text-align:right;"> -93 </td> <td style="text-align:right;"> -26 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 97 </td> <td style="text-align:right;"> 17 </td> <td style="text-align:right;"> -59 </td> <td style="text-align:right;"> -88 </td> <td style="text-align:right;"> -61 </td> <td style="text-align:right;"> 25 </td> <td style="text-align:right;"> -84 </td> <td style="text-align:right;"> 72 </td> <td style="text-align:right;"> -22 </td> <td style="text-align:right;"> -51 </td> <td style="text-align:right;"> 99 </td> <td style="text-align:right;"> 41 </td> <td style="text-align:right;"> -37 </td> <td style="text-align:right;"> 21 </td> <td style="text-align:right;"> 65 </td> </tr> <tr> <td style="text-align:right;"> 71 </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> -68 </td> <td style="text-align:right;"> 92 </td> <td style="text-align:right;"> -82 </td> <td style="text-align:right;"> -20 </td> <td style="text-align:right;"> 33 </td> <td style="text-align:right;"> 95 </td> <td style="text-align:right;"> 95 </td> <td style="text-align:right;"> 26 </td> <td style="text-align:right;"> 33 </td> <td style="text-align:right;"> 60 </td> <td style="text-align:right;"> 39 </td> <td style="text-align:right;"> -29 </td> <td style="text-align:right;"> -62 </td> <td style="text-align:right;"> 18 </td> </tr> <tr> <td style="text-align:right;"> -68 </td> <td style="text-align:right;"> -12 </td> <td style="text-align:right;"> 51 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 73 </td> <td style="text-align:right;"> -29 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> -66 </td> <td style="text-align:right;"> 82 
</td> <td style="text-align:right;"> -72 </td> <td style="text-align:right;"> 26 </td> <td style="text-align:right;"> -52 </td> <td style="text-align:right;"> -98 </td> <td style="text-align:right;"> 59 </td> <td style="text-align:right;"> 48 </td> <td style="text-align:right;"> 34 </td> </tr> <tr> <td style="text-align:right;"> -35 </td> <td style="text-align:right;"> -95 </td> <td style="text-align:right;"> 39 </td> <td style="text-align:right;"> -80 </td> <td style="text-align:right;"> -39 </td> <td style="text-align:right;"> 61 </td> <td style="text-align:right;"> 96 </td> <td style="text-align:right;"> -56 </td> <td style="text-align:right;"> 74 </td> <td style="text-align:right;"> -95 </td> <td style="text-align:right;"> -30 </td> <td style="text-align:right;"> 75 </td> <td style="text-align:right;"> -12 </td> <td style="text-align:right;"> 98 </td> <td style="text-align:right;"> 58 </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:right;"> 47 </td> <td style="text-align:right;"> 47 </td> <td style="text-align:right;"> 24 </td> <td style="text-align:right;"> -46 </td> <td style="text-align:right;"> 39 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> -92 </td> <td style="text-align:right;"> 22 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> -88 </td> <td style="text-align:right;"> -35 </td> <td style="text-align:right;"> 52 </td> <td style="text-align:right;"> 64 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 36 </td> <td style="text-align:right;"> -40 </td> </tr> </tbody> </table> </div> --- class: light # Summary numbers We want to reduce the big set of numbers down to a few numbers that we can look at and make sense of. **Sameness (Central Tendency)** - What are all the numbers close to? (topic from last class) **Differentness (Variance)** - How different are the numbers? 
(topic for this class)

---
class: light, center, middle, clear

# Graph the numbers to get a better look at the differences

---
class: light

# Histogram

We can see the spread in the data: there are different numbers...

<img src="2b_variation_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---
class: light, center, middle, clear

# Range

---
class: light

# The Range

<img src="2b_variation_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---
class: light

# The Range

The range is two numbers.

- minimum value: the smallest number in the data
- maximum value: the largest number in the data

`1 4 3 6 5 7 6 8 7 6 9`

Range is (1,9)

- smallest number is 1
- largest number is 9

---
class: light

# min()

Use the `np.min()` function to find the smallest value in a variable in Python

```python
import numpy as np

a = np.random.randint(1, 10 + 1, 12)
np.min(a)
```

---
class: light

# max()

Use the `np.max()` function to find the largest value in a variable in Python

```python
a = np.random.randint(1, 10 + 1, 12)
np.max(a)
```

---
class: light

# Thinking about the range

- Pros: Great way to find out the largest possible difference
- Cons: The biggest possible difference is probably not representative of all the differences in the numbers

---
class: light

# Some numbers

Here are two sets of numbers. What is the range? Does it do a good job showing the average differences?

`1 5 6 5 4 5 6 5 4 5 6 5 4 100`

`1 2 1 2 1 1 1 1 2 2 2 2 1 2`

---
class: light

# Average differences

- It would be nice if we could find a way to measure the average amount of differences.
- This average could be a **representative** value that summarizes the differences between the numbers

---
class: light

# Average differences

What should the average difference for these numbers be?

` 1 2 1 2 1 1 1 1 2 2 2 2 1 2`

- All of the numbers are 1s or 2s.
- The difference between 1 and 2 is 1
- It seems the average difference should be 1 (+ or -)

---
class: light

# Differences between numbers

Consider these 10 numbers:

` 1 3 4 5 5 6 7 8 9 24`

- We can see there are some differences; the numbers are not all the same.
- We can measure the differences by finding the difference between each score and every other score
- e.g., 1-3 = -2, 1-4 = -3, etc.

---
class: light

# Difference scores

<img src="figs-crump/2bdifferences.png" width="1789" />

---
class: light

# Problem: The sum = 0

<img src="figs-crump/2bdifferences.png" width="1789" />

---
class: light

# Summarizing the difference scores

1. We can find the differences between scores

--

2. There are lots of difference scores

--

3. Even though we can see the difference scores have different values, we can't summarize them in the normal fashion

--

4. The sum adds up to 0...

--

5. How can we solve the problem?

---
class: light, center, middle, clear

# Difference scores from the mean

---
class: light

# Difference scores from the mean

Consider these numbers:

` 1 6 4 2 6 8`

1. We can compute the mean to describe the central tendency of the numbers
2. How far off is the mean for each number? This is the amount of error
3. The difference scores from the mean show how far off (different) each score is from the mean

difference score = `\(\bar{X} - x_{i}\)`

---
class: light

# Difference scores from the mean

<img src="figs-crump/2bdiffmean.png" width="1776" />

---
class: light

# sum = 0, Same problem...

<img src="figs-crump/2bdiffmean.png" width="1776" />

---
class: light

# Mean is the balancing point

1. The mean is the balancing point in the data

--

2. The distances of the scores below the mean add up to exactly the same total as the distances of the scores above the mean

--

3. 
Difference scores from the mean will always sum to 0

---
class: light

# The mean minimizes the deviations

The mean is the only number for which the sum of the deviations (difference scores) is exactly 0

`\(\sum_{i=1}^{i=N} (\bar{X} - x_{i}) = 0\)`

--

If the difference scores from the mean always sum to 0, how can we add up the differences and get something other than 0?

---
class: light, center, middle, clear

# Squared Deviations

---
class: light

# Squared deviations

<img src="figs-crump/2bsquared.png" width="1763" />

---
class: light

# Squared deviations

Why square the deviations (differences between the mean and each score)?

- Squaring converts all the negative numbers to positive numbers
- This allows us to sum them all up, and not get 0!

---
class: light, center, middle, clear

# SS (Sum of squared deviations)

---
class: light

# SS (sum of squared deviations)

<img src="figs-crump/2bsquaredSS.png" width="1763" />

---
class: light

# SS (sum of squared deviations)

The formula for the sum of squared deviations (SS, also called sum of squares) is:

`\(SS = \sum_{i=1}^{i=N} (\bar{X} - x_{i})^2\)`

---
class: light

# What next?

1. We've found a way to sum up the differences (SS)

--

2. We used the squared differences from the mean, and added them all up

--

3. How can we find the average? Remember, we want a single number that does a good job of representing the differences...

---
class: light, center, middle, clear

# Variance

---
class: light

# Variance = SS/N

<img src="figs-crump/2bsquaredV.png" width="1763" />

---
class: light

# Jargon

1. 
Learning statistics can be confusing because there are many new terms, and some of them refer to normal everyday concepts

Everyday words:

- Variability & Variance: The things aren't all the same, they have some variability

**Statistical Variance**: The average of the squared difference scores from the mean

---
class: light

# Variance

The average of the squared difference scores from the mean

`\(Variance = SS / N\)`

`\(Variance = \frac{\sum_{i=1}^{i=N}(\bar{X}-x_i)^2}{N}\)`

Usefulness

Pros: The variance provides us with one summary number about the average differences

Cons: We squared the differences, so the variance doesn't directly relate to the size of the original differences

---
class: light

# The variance is too big

<img src="figs-crump/2bsquaredV2.png" width="1763" />

---
class: light

# What to do?

1. We are searching for a summary number to represent the differences in our data.

--

2. The variance is too big because of squaring

--

3. What can we do to solve the problem, and bring our summary number back into the range of the actual differences?

---
class: light

# Square root the variance

1. Squaring numbers makes them bigger (e.g., `\(2^2 =4\)`)
2. Square rooting numbers brings them back down to their unsquared size (e.g., `\(\sqrt{2^2}=2\)`)
3. Let's square root the variance

---
class: light

# Square root the variance

<img src="figs-crump/2bsquaredV3.png" width="1763" />

---
class: light, center, middle, clear

# Standard Deviation

---
class: light

# Standard deviation = sqrt(variance)

When we took the square root of the variance, we computed something that has its own name: the **standard deviation**.
`\(\text{standard deviation} = \sqrt{\text{variance}}\)`

`\(\text{standard deviation} = \sqrt{\frac{SS}{N}}\)`

`\(\text{standard deviation} = \sqrt{ \frac{\sum_{i=1}^{i=N}(\bar{X}-x_i)^2}{N}}\)`

The standard deviation is a summary of the variability in the data that is on the same scale as the original differences

---
class: light

# Standard Deviation

<img src="figs-crump/2bsquaredV3.png" width="1763" />

---
class: light

# Populations vs samples

There are different formulas for the variance and standard deviation, depending on whether your data represent an entire population of scores, or just a sample (a subset of the population).

**Population**: Divide by N (this is what you will do for the quiz; we tell you to divide by N)

**Sample**: Divide by N-1 (this is what you will do when you are working with samples later in this class)

- The differences are explained in week 4 (distributions and sampling)

---
class: light, center, middle, clear

# Data sense with Descriptives

---
class: light

# What if?

- Someone told you they had some numbers with:

--

- Mean = 100, Standard Deviation = 25

--

- What would most of the numbers be like?
- What would be a good summary of the average differences in the data?
- What kind of numbers would you expect to see or not see?

---
class: light

# Animating the standard deviation

<img src="figs-crump/2banimSD.gif" style="display: block; margin: auto;" />

---
class: light, center, middle, clear

# Python tips

---
class: light

# Warning

- Python has functions for variance and standard deviation...
- **But**, some of them divide by N-1, not N: `statistics.variance()`, `statistics.stdev()`, and pandas `.var()` / `.std()` do, while NumPy's `np.var()` and `np.std()` divide by N by default
- We talk about what this means later in the course, and why some functions divide by N-1
- For now, don't use the built-in variance and standard deviation functions; compute each step yourself as shown on the next slides

---
class: light

# Python: Mean difference scores

```python
import numpy as np
x = np.array([8, 2, 6, 4, 6, 2, 4, 4])
print(np.mean(x) - x)
```

```
## [-3.5  2.5 -1.5  0.5 -1.5  2.5  0.5  0.5]
```

```python
print(np.sum(np.mean(x) - x))
```

```
## 0.0
```

---
class: light

# Python: Squared deviations

```python
import numpy as np
x = np.array([8, 2, 6, 4, 6, 2, 4, 4])
print((np.mean(x) - x) ** 2)
```

```
## [12.25  6.25  2.25  0.25  2.25  6.25  0.25  0.25]
```

---
class: light

# Python: SS, sum of squares

```python
import numpy as np
x = np.array([8, 2, 6, 4, 6, 2, 4, 4])
print(np.sum((np.mean(x) - x) ** 2))
```

```
## 30.0
```

More explicit code:

```python
import numpy as np
x = np.array([8, 2, 6, 4, 6, 2, 4, 4])
squared_deviations = (np.mean(x) - x) ** 2
SS = np.sum(squared_deviations)
print(SS)
```

```
## 30.0
```

---

# Python: Variance

```python
import numpy as np
x = np.array([8, 2, 6, 4, 6, 2, 4, 4])
squared_deviations = (np.mean(x) - x) ** 2
SS = np.sum(squared_deviations)
N = len(x)
variance = SS / N  # divide by N (population formula)
print(variance)
```

```
## 3.75
```

---

# Python: Standard Deviation

```python
import numpy as np
x = np.array([8, 2, 6, 4, 6, 2, 4, 4])
squared_deviations = (np.mean(x) - x) ** 2
SS = np.sum(squared_deviations)
N = len(x)
variance = SS / N
standard_deviation = np.sqrt(variance)
print(round(standard_deviation, 6))
```

```
## 1.936492
```

---
class: light

Thanks to Matt Crump (Brooklyn College) for some of the slides.
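
---
class: light

# Python: Putting the steps together

The steps covered in this deck (range, difference scores from the mean, squared deviations, SS, variance, standard deviation) can be collected into one helper. This is only a minimal sketch, not part of the original slides: the function name `describe_spread` is made up for illustration, and it assumes NumPy is installed. It divides by N, as in the population formulas used in this class, and then compares the result with `np.var()`, whose `ddof` argument sets the divisor to N - ddof.

```python
import numpy as np

def describe_spread(values):
    """Hypothetical helper: range, SS, population variance, and population SD."""
    x = np.asarray(values, dtype=float)
    deviations = np.mean(x) - x       # difference scores from the mean
    ss = np.sum(deviations ** 2)      # SS: sum of squared deviations
    variance = ss / x.size            # divide by N (population formula)
    return {
        "range": (float(np.min(x)), float(np.max(x))),
        "SS": float(ss),
        "variance": float(variance),
        "sd": float(np.sqrt(variance)),
    }

x = [8, 2, 6, 4, 6, 2, 4, 4]
result = describe_spread(x)
print(result["range"], result["SS"], result["variance"])

# ddof=0 divides by N; ddof=1 divides by N-1 (the sample formula)
print(np.isclose(result["variance"], np.var(x, ddof=0)))
print(np.isclose(result["SS"] / (len(x) - 1), np.var(x, ddof=1)))
```

Returning all of the intermediate values makes it easy to check each step against the hand calculation. Passing `ddof=1` gives the N-1 version described on the Populations vs samples slide.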