class: middle center hide-slide-number monash-bg-gray80

.info-box.w-50.bg-white[
These slides are best viewed in Chrome or Firefox and occasionally need to be refreshed if elements did not load properly. See <a href=lecture-06B.pdf>here for the PDF <i class="fas fa-file-pdf"></i></a>.
]

<br>

.white[Press the **right arrow** to progress to the next slide!]

---

class: title-slide
count: false
background-image: url("images/bg-12.png")

# .monash-blue[ETC5521: Exploratory Data Analysis]

<h1 class="monash-blue" style="font-size: 30pt!important;"></h1>

<br>

<h2 style="font-weight:900!important;">Exploring bivariate dependencies</h2>

.bottom_abs.width100[

Lecturer: *Di Cook*

<i class="fas fa-envelope"></i> ETC5521.Clayton-x@monash.edu

<i class="fas fa-calendar-alt"></i> Week 6 - Session 2

<br>
]

---

class: transition

# Numerical measures of association

---

# Correlation

- Correlation between variables `\(x_1\)` and `\(x_2\)`, with `\(n\)` observations in each.

`$$r = \frac{\sum_{i=1}^n (x_{i1}-\bar{x}_1)(x_{i2}-\bar{x}_2)}{\sqrt{\sum_{i=1}^n(x_{i1}-\bar{x}_1)^2\sum_{i=1}^n(x_{i2}-\bar{x}_2)^2}} = \frac{\mbox{covariance}(x_1, x_2)}{s_{x_1}s_{x_2}}$$`

- Test whether the population correlation could be 0, given the observed `\(r\)`, using a `\(t_{n-2}\)` distribution:

`$$t=\frac{r}{\sqrt{1-r^2}}\sqrt{n-2}$$`

---

.flex[
.item[
<img src="images/lecture-06B/unnamed-chunk-4-1.png" width="100%" style="display: block; margin: auto;" />
]
.item[

```r
cor(d1$x, d1$y)
```

```
## [1] 0.5228401
```

```r
cor.test(d1$x, d1$y)
```

```
##
## Pearson's product-moment correlation
##
## data: d1$x and d1$y
*## t = 8.6306, df = 198, p-value = 1.993e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4141406 0.6168362
## sample estimates:
## cor
## 0.5228401
```
]
]

---

.flex[
.item[
<img src="images/lecture-06B/unnamed-chunk-7-1.png" width="100%" style="display: block; margin: auto;" />
]
.item[

```r
cor(d2$x,
d2$y)
```

```
## [1] -0.04993755
```

```r
cor.test(d2$x, d2$y)
```

```
##
## Pearson's product-moment correlation
##
## data: d2$x and d2$y
*## t = -0.70356, df = 198, p-value = 0.4825
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.18738032 0.08942303
## sample estimates:
## cor
## -0.04993755
```
]
]

---

.flex[
.w-30[
<img src="images/lecture-06B/unnamed-chunk-10-1.png" width="100%" style="display: block; margin: auto;" />
]
.w-30[
All observations

```
## $estimate
*## cor
*## 0.2994041
##
## $statistic
## t
## 4.426682
##
*## $p.value
*## [1] 1.576086e-05
```
]
.w-5.white[
gap
]
.w-30[
Without outlier

```
## $estimate
*## cor
*## -0.01173776
##
## $statistic
## t
## -0.1651764
##
*## $p.value
*## [1] 0.8689737
```
]
]

---

# Perceiving correlation

.panelset[
.panel[.panel-name[🖼️]

.monash-orange2[Let's play a game:] Guess the correlation!

<br>

<img src="images/lecture-06B/simcor-1.png" width="70%" style="display: block; margin: auto;" />

]
.panel[.panel-name[answers]

<img src="images/lecture-06B/unnamed-chunk-13-1.png" width="40%" style="display: block; margin: auto;" />

<br>
<br>

Generally, people don't do very well at this task. Typically people under-estimate `\(r\)` from scatterplots, particularly when it is around 0.4-0.7. The variation perceived in a scatterplot does not change linearly with `\(r\)`.

When someone says .monash-blue2[*correlation is 0.5* it sounds impressive]. BUT when someone shows you a .monash-blue2[scatterplot of data that has correlation 0.5], you will say that's a .monash-blue2[weak relationship.]
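
One way to make this nonlinearity concrete is a small numerical sketch. It relies on a standard fact about least-squares lines rather than anything computed in these slides: the standard deviation of the residuals is `\(\sqrt{1-r^2}\)` times the standard deviation of `\(y\)`. The grid of `r` values below is chosen only for illustration.

```r
# Illustrative grid of correlations (an assumption, not data from the slides)
r <- c(0, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9)

# Residual spread around the least-squares line, as a fraction of sd(y)
resid_sd <- sqrt(1 - r^2)

data.frame(r = r, resid_sd = round(resid_sd, 2))
```

At `\(r = 0.5\)` the scatter around the line is still about 87% of the scatter with no association at all, which is consistent with such a plot looking weak.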
]
.panel[.panel-name[R]
.s400[

```r
library(tidyverse)
library(mvtnorm)   # rmvnorm()
library(gridExtra) # grid.arrange()

# Scatterplot of 500 bivariate normal draws with correlation r
simcor_plot <- function(r) {
  vc <- matrix(c(1, r, r, 1), ncol = 2, byrow = TRUE)
  d <- as_tibble(rmvnorm(500, sigma = vc))
  ggplot(d, aes(x = V1, y = V2)) +
    geom_point() +
    theme_void() +
    theme(
      aspect.ratio = 1,
      plot.background = element_rect(fill = "gray90")
    )
}

# Samples are drawn in the same order as the plots are created
set.seed(7777)
p1 <- simcor_plot(0)
p2 <- simcor_plot(0.4)
p3 <- simcor_plot(0.6)
p4 <- simcor_plot(0.8)
p5 <- simcor_plot(-0.2)
p6 <- simcor_plot(-0.5)
p7 <- simcor_plot(-0.7)
p8 <- simcor_plot(-0.9)
grid.arrange(p1, p2, p3, p4, p5, p6, p7,
p8, ncol = 4)
```
]
.scroll-sign[
<br>
]
]
]

---

# Robust correlation measures 1/2

- Spearman (based on ranks)
    - Sort each variable, and return rank (of actual value)
    - Compute correlation between ranks of each variable

.pull-left[

```
## # A tibble: 6 × 4
##       x     y    xr    yr
##   <dbl> <dbl> <dbl> <dbl>
## 1   0.7  -1.7     5     1
## 2   0.5   1.1     4     5
## 3  -0.6   0.3     2     3
## 4  -0.2  -0.9     3     2
## 5  -1.7   0.4     1     4
## 6  10    10       6     6
```
].pull-right[

```r
cor(df$x, df$y)
```

```
## [1] 0.935397
```

```r
cor(df$xr, df$yr)
```

```
## [1] 0.2
```

```r
cor(df$x, df$y, method = "spearman")
```

```
## [1] 0.2
```
]

---

# Robust correlation measures 2/2

- Kendall `\(\tau\)` (based on comparing pairs of observations)
    - Sort each variable, and return rank (of actual value)
    - For all pairs of observations `\((x_i, y_i), (x_j, y_j)\)`, determine if **concordant**, `\(x_i < x_j, y_i < y_j\)` or `\(x_i > x_j, y_i > y_j\)`, or **discordant**, `\(x_i < x_j, y_i > y_j\)` or `\(x_i > x_j, y_i < y_j\)`.

`$$\tau = \frac{n_c-n_d}{\frac12 n(n-1)}$$`

.pull-left[
<img src="images/lecture-06B/unnamed-chunk-17-1.png" width="70%" style="display: block; margin: auto;" />
]
.pull-right[

```r
cor(df$x, df$y)
```

```
## [1] 0.935397
```

```r
cor(df$x, df$y, method = "kendall")
```

```
## [1] 0.06666667
```
]

---

# Comparison of correlation measures

<table class="table lightable-classic" style='width: auto !important; margin-left: auto; margin-right: auto; font-family: "Arial Narrow", "Source Sans Pro", sans-serif; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;"> sample </th>
   <th style="text-align:right;"> corr </th>
   <th style="text-align:right;"> spearman </th>
   <th style="text-align:right;"> kendall </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> <img src="images/lecture-05B/diffscatter-1.png" height="100px"> </td>
   <td style="text-align:right;"> 0.523 </td>
   <td style="text-align:right;"> 0.512 </td>
   <td style="text-align:right;"> 0.355 </td>
  </tr>
  <tr>
   <td
style="text-align:left;"> <img src="images/lecture-05B/diffscatter-2.png" height="100px"> </td>
   <td style="text-align:right;"> -0.050 </td>
   <td style="text-align:right;"> -0.087 </td>
   <td style="text-align:right;"> -0.073 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> <img src="images/lecture-05B/diffscatter-3.png" height="100px"> </td>
   <td style="text-align:right;"> 0.299 </td>
   <td style="text-align:right;"> -0.023 </td>
   <td style="text-align:right;"> -0.014 </td>
  </tr>
</tbody>
</table>

---

class: transition middle

# Scatterplot case studies

---

# .orange[Case study] .bg-orange.circle[2] Movies

.panelset[
.panel[.panel-name[🖼️]

.flex[
.w-50[
<img src="images/lecture-06B/movies-1.png" width="80%" style="display: block; margin: auto;" />
]
.w-50[
- `votes`: Number of IMDB users who rated this movie
- `rating`: Average IMDB user rating

<br>
<br>

Describe the relationship between rating and votes.

]
]
]
.panel[.panel-name[learn]

.grid[
.item[
- Odd pattern, almost looks like an "r"
- No films with lots of votes and low rating
- No film with lots of votes has rating close to maximum possible: **barrier?**
- Films with very high ratings only have a few votes
- Generally, rating appears to increase as votes increase (it's hard to really read this with so few points though)
- A few films with really large number of votes: **outliers?** or just **skewness?**
- Films with few votes have ratings that span the range of the scale.
]
.item[

Would you say this is positive, linear, moderate? Or positive, non-linear, and moderate? Or weak? In some sense, these descriptions are meaningless here.

<br>

What about causation? association? outliers? clusters? gaps? barrier? conditional relationships?

<br>

.monash-blue2[These descriptors help to describe relationships generally, but it is important to convert them into the context of the (variables in the) data.]

.monash-orange2[BUT there is skewness in votes that needs fixing before assessing the relationship.]
]
]
]
.panel[.panel-name[R]

```r
ggplot(movies, aes(x = votes, y = rating)) +
  geom_point() +
  scale_y_continuous("rating", breaks = seq(0, 10, 2))
```

]
]
---

# .orange[Case study] .bg-orange.circle[2] Movies

.panelset[
.panel[.panel-name[🖼️]

.flex[
.w-50[
<img src="images/lecture-06B/logmovies-1.png" width="80%" style="display: block; margin: auto;" />
]
.w-50[
<br>
<br>
<br>
<br>

🤔 Something funny happens, right at 1000 votes

<br>
<br>

There is some positive association between the two variables, but only for films with a large number of votes.

]
]
]
.panel[.panel-name[R]

```r
ggplot(movies, aes(x = votes, y = rating)) +
  geom_point(alpha = 0.1) +
  geom_smooth(se = FALSE, colour = "orange", size = 2) +
  scale_x_log10() +
  scale_y_continuous("rating", breaks = seq(0, 10, 2))
```

*Note*: Used .monash-orange2[transparency] (because there is a lot of data) and a .monash-orange2[loess smooth] (because I am interested in assessing the trend between votes and rating).

<br>

Correlation between .monash-blue2[raw variables] is 0.1

<br>

and between .monash-blue2[transformed] `log(votes)` and `rating` is 0.07. Which more accurately reflects the relationship?

]
]

---

# .orange[Case study] .bg-orange.circle[3] Cars

.panelset[
.panel[.panel-name[🖼️]

.flex[
.w-50[
<img src="images/lecture-06B/cars-1.png" width="80%" style="display: block; margin: auto;" />
]
.w-50[
- `mpg`: Miles/(US) gallon
- `hp`: Gross horsepower

<br>
<br>

Describe the relationship between horsepower and mpg.
]
]
]
.panel[.panel-name[learn]

<br>
<br>

- negative: as horsepower increases, fuel efficiency gets worse
- nonlinear: at lower horsepower, the decrease in efficiency is steeper
- strong: very little variation between cars, looks fundamentally like a physics problem
- outlier: one car with high horsepower has unusually high efficiency

]
.panel[.panel-name[R]

```r
data(mtcars)
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_smooth(colour = "forestgreen", se = FALSE)
```

]
]

---

# .orange[Case study] .bg-orange.circle[3] Cars

.panelset[
.panel[.panel-name[🖼️]

.flex[
.w-50[
<img src="images/lecture-06B/logcars-1.png" width="80%" style="display: block; margin: auto;" />
]
.w-50[
- `mpg`: Miles/(US) gallon
- `hp`: Gross horsepower

<br>
<br>

Log transforming `mpg` linearised the relationship between horsepower and mpg.

<br>
<br>

.monash-green2[Need to also remove the outlier, because it is a little influential (swinging the line towards it).]

]
]
]
.panel[.panel-name[R]

```r
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
* scale_y_log10("log mpg") +
  geom_smooth(method = "lm", colour = "forestgreen", se = FALSE) +
  geom_smooth(data = filter(mtcars, hp < 300), method = "lm", colour = "orangered", se = FALSE, lty = 2)
```

Correlation between .monash-blue2[raw variables] is -0.78

<br>

and between .monash-blue2[transformed] `log(mpg)` and `hp` is -0.85. Which more accurately reflects the relationship?

]
]

---

class: transition middle

# Transformations for skewness, heteroskedasticity and linearising relationships, and to emphasize association

---

# Circle of transformations for linearising

.grid[
.item[
<img src="images/lecture-06B/circleoftrans-1.png" width="80%" style="display: block; margin: auto;" />
]
.item[

Remember the power ladder:

-1, 0, 1/3, 1/2, .monash-orange2[1], 2, 3, 4

<br>

1. Look at the shape of the relationship.
2. Imagine this to be a number plane, and depending on which quadrant the shape falls in, you either transform `\(x\)` or `\(y\)`, up or down the ladder: `+,+` both up; `+,-` x up, y down; `-,-` both down; `-,+` x down, y up

<br>

If there is heteroskedasticity, try transforming `\(y\)`; this may or may not help.

]
]

---

class: transition middle

# Scatterplot case studies

---

# .orange[Case study] .bg-orange.circle[4] Soils

.flex[
.w-50[
<img src="images/lecture-06B/baker-1.png" width="80%" style="display: block; margin: auto;" />
]
.w-50[

Interplay between skewness and association

Data is from a soil chemical analysis of a farm field in Iowa. Is there a relationship between Yield and Boron?

<br>

You can get a marginal plot of each variable added to the scatterplot using `ggMarginal`. This is useful for assessing the skewness in each variable.

<br>

Boron is right-skewed. Yield is left-skewed. With skewed distributions in marginal variables it is .monash-orange2[hard] to assess the relationship between the two. Make a transformation to fix this first.

]
]

---

# .orange[Case study] .bg-orange.circle[4] Soils

.flex[
.w-50[
<img src="images/lecture-06B/transfbaker-1.png" width="80%" style="display: block; margin: auto;" />
]
.w-50[
<br>
<br>

```r
p <- ggplot(
  baker,
  aes(x = B, y = Corn97BU^2)
*) +
  geom_point() +
  xlab("log Boron (ppm)") +
  ylab("Corn Yield^2 (bushels)") +
* scale_x_log10()
*ggMarginal(p, type = "density")
```

]
]

---

# .orange[Case study] .bg-orange.circle[4] Soils

.flex[
.w-50[
<img src="images/lecture-06B/bakeriron-1.png" width="80%" style="display: block; margin: auto;" />
]
.w-50[
<br>

Lurking variable?
<br>
<br>

```r
p <- ggplot(
  baker,
  aes(x = Fe, y = Corn97BU^2)
) +
  geom_density2d(colour = "orange") +
  geom_point() +
* xlab("Iron (ppm)") +
  ylab("Corn Yield^2 (bushels)")
ggMarginal(p, type = "density")
```

]
]

---

# .orange[Case study] .bg-orange.circle[4] Soils

.flex[
.w-40[
<img src="images/lecture-06B/bakerironca-1.png" width="100%" style="display: block; margin: auto;" />
]
.w-60[

Colour high calcium (>5200 ppm) values

.f5[

```r
ggplot(baker, aes(
  x = Fe, y = Corn97BU^2,
* colour = ifelse(Ca > 5200,
    "high", "low"
  )
*)) +
  geom_point() +
  xlab("Iron (ppm)") +
  ylab("Corn Yield^2 (bushels)") +
  scale_colour_brewer("", palette = "Dark2") +
  theme(
    aspect.ratio = 1,
    legend.position = "bottom",
    legend.direction = "horizontal"
  )
```

]

If calcium levels in the soil are high, yield is consistently high. If calcium levels are low, then there is a positive relationship between yield and iron, with higher iron leading to higher yields.

]
]

---

# .orange[Case study] .bg-orange.circle[5] COVID-19

.panelset[
.panel[.panel-name[🖼️]

<img src="images/lecture-06B/usacovid-1.png" width="60%" style="display: block; margin: auto;" />

]
.panel[.panel-name[info]

<br><br><br>

In a bubble plot, the size of each point is mapped to another variable.

This bubble plot shows the total COVID-19 case count (as of Aug 30, 2020) for every county in the USA, inspired by the [New York Times coverage](https://www.nytimes.com/news-event/coronavirus).
]
.panel[.panel-name[R]

```r
load("../data/nyt_covid.rda")
usa <- map_data("state")
ggplot() +
  geom_polygon(
    data = usa,
    aes(x = long, y = lat, group = group),
    fill = "grey90", colour = "white"
  ) +
  geom_point(
    data = nyt_county_total,
    aes(x = lon, y = lat, size = cases),
    colour = "red", shape = 1
  ) +
  geom_point(
    data = nyt_county_total,
    aes(x = lon, y = lat, size = cases),
    colour = "red", fill = "red", alpha = 0.1, shape = 16
  ) +
  scale_size("", range = c(1, 30)) +
  theme_map() +
  theme(legend.position = "none")
```

]
]

---

# Scales matter

.grid[
.item[
<br>
<br>

<img src="images/lecture-06B/unnamed-chunk-27-1.png" width="70%" style="display: block; margin: auto;" />
]
.item[
<br>
<br>

Where has COVID-19 hit the hardest?

<br>

Where are there more people?

<br>
<br>
<br>

This plot tells you NOTHING except where the population centres are in the USA. To understand relative incidence/risk, report COVID numbers relative to the population. For example, .monash-orange2[number of cases per 100,000 people].

]
]

---

class: transition middle

# Beyond quantitative variables

---

# When variables are not quantitative

> What do you do if the variables are not continuous/quantitative?

The type of variable determines the choice of mapping.

- Continuous and categorical `\(\longrightarrow\)` side-by-side boxplots, side-by-side density plots
- Both categorical `\(\longrightarrow\)` faceted bar charts, stacked bar charts, mosaic plots, double decker plots

<br>
<br>

> We'll see more examples soon.

---

class: transition middle

# Paradoxes

---

# Simpson's paradox

If there is an additional variable which, when used for conditioning, changes the association between the variables, you have a .monash-orange2[paradox] 🙃.
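
The effect of conditioning can be sketched with simulated data. This is a minimal illustration under assumptions made here, not the data behind the plots: the three groups, their centres, the within-group slope and the seed are all hypothetical.

```r
set.seed(2020) # arbitrary seed, only for reproducibility

# Hypothetical data: three groups whose centres rise together,
# but with a negative slope inside each group
n <- 100
centre <- rep(c(0, 3, 6), each = n)
g <- factor(rep(c("a", "b", "c"), each = n))
x <- centre + rnorm(3 * n)
y <- centre - 0.8 * (x - centre) + rnorm(3 * n, sd = 0.5)
d <- data.frame(x, y, g)

# Ignoring the group: positive association
overall <- cor(d$x, d$y)
overall

# Conditioning on the group: negative association within every group
within <- sapply(split(d, d$g), function(s) cor(s$x, s$y))
within
```

The sign of the correlation flips once the grouping variable is taken into account, which is the pattern a coloured scatterplot makes visible.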
.grid[
.item[
<img src="images/lecture-06B/scat-1.png" width="70%" style="display: block; margin: auto;" />
]
.item[
<img src="images/lecture-06B/scatcol-1.png" width="70%" style="display: block; margin: auto;" />
]
]

---

# Simpson's paradox: famous example

<br>

<img src="images/lecture-06B/berkeley-1.png" width="90%" style="display: block; margin: auto;" />

Did Berkeley .monash-orange2[discriminate] against female applicants?

.footnote[Example from Unwin (2015)]

---

# Simpson's paradox: famous example

<img src="images/lecture-06B/berkeleydd-1.png" width="100%" style="display: block; margin: auto;" />

Based on separately examining each department, there is .monash-orange2[no evidence of discrimination] against female applicants.

.footnote[Example from Unwin (2015)]

---

class: transition middle

# Is what you see really association?

---

# Checking association with visual inference

.panelset[
.panel[.panel-name[Soils]

<img src="images/lecture-06B/soils-lineup-1.png" width="60%" style="display: block; margin: auto;" />

]
.panel[.panel-name[R]

```r
ggplot(
  lineup(null_permute("Corn97BU"), baker, n = 12),
  aes(x = B, y = Corn97BU)
) +
  geom_point() +
  facet_wrap(~.sample, ncol = 4)
```

11 of the panels have had the association broken by permuting one variable. .monash-blue2[There is no association] in these data sets, and hence plots.

Does the data plot stand out as being different from the null (no association) plots?
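
What permuting does can be sketched in base R. This is a numerical companion to the lineup, not how the lineup itself is read, and it uses simulated stand-in data: the variables `x` and `y`, the effect size and the seed are hypothetical, not the soils data.

```r
set.seed(1) # arbitrary seed, only for reproducibility

# Hypothetical stand-in data with a real association
n <- 200
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

# Permuting y keeps both marginal distributions but breaks the pairing
y_null <- sample(y)

cor(x, y)      # association present in the data
cor(x, y_null) # one null data set

# Where does the observed correlation sit among many null data sets?
null_cors <- replicate(1000, cor(x, sample(y)))
p_value <- mean(abs(null_cors) >= abs(cor(x, y)))
p_value
```

A small proportion here plays the same role as the data plot standing out in the lineup: the observed association is unlike what permutation alone produces.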
]
.panel[.panel-name[Olympics]

<img src="images/lecture-06B/oly-lineup-1.png" width="60%" style="display: block; margin: auto;" />

]
.panel[.panel-name[R]

.f5[

```r
data(oly12, package = "VGAMdata")
oly12_sub <- oly12 %>%
  filter(Sport %in% c(
    "Swimming", "Archery",
    "Hockey", "Tennis"
  )) %>%
  filter(Sex == "F") %>%
  mutate(Sport = fct_drop(Sport), Sex = fct_drop(Sex))

ggplot(
  lineup(null_permute("Sport"), oly12_sub, n = 12),
  aes(x = Height, y = Weight, colour = Sport)
) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_colour_brewer("", palette = "Dark2") +
  facet_wrap(~.sample, ncol = 4) +
  theme(legend.position = "none")
```

]

11 of the panels have had the association broken by permuting the Sport label. .monash-blue2[There is no difference in the association between weight and height across sports] in these data sets, and hence plots.

Does the data plot stand out as being different from the null (no association difference between sports) plots?

]
]

---

# Resources

- Friendly and Denis "Milestones in History of Thematic Cartography, Statistical Graphics and Data Visualisation" available at http://www.datavis.ca/milestones/
- Unwin (2015) [Graphical Data Analysis with R](http://www.gradaanwr.net)
- Graphics using [ggplot2](https://ggplot2.tidyverse.org)
- Wilke (2019) Fundamentals of Data Visualization https://clauswilke.com/dataviz/

---

background-size: cover
class: title-slide
background-image: url("images/bg-12.png")

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.

.bottom_abs.width100[

Lecturer: *Di Cook*

<i class="fas fa-envelope"></i> ETC5521.Clayton-x@monash.edu

<i class="fas fa-calendar-alt"></i> Week 6 - Session 2

<br>
]