ETC5521: Exploratory Data Analysis

class: middle center hide-slide-number monash-bg-gray80

.info-box.w-50.bg-white[
These slides are best viewed in Chrome or Firefox and occasionally need to be refreshed if elements do not load properly. See <a href=lecture-06B.pdf>here for the PDF <i class="fas fa-file-pdf"></i></a>.
]

<br>

.white[Press the **right arrow** to progress to the next slide!]

---

class: title-slide
count: false
background-image: url("images/bg-12.png")

# .monash-blue[ETC5521: Exploratory Data Analysis]

<br>

<h2 style="font-weight:900!important;">Exploring bivariate dependencies</h2>

.bottom_abs.width100[

Lecturer: *Di Cook*

<i class="fas fa-envelope"></i>  ETC5521.Clayton-x@monash.edu

<i class="fas fa-calendar-alt"></i> Week 6 - Session 2

<br>

]

---
class: transition

# Numerical measures of association

---
# Correlation

- Correlation between variables `$x_1$` and `$x_2$`, with `$n$` observations in each.

`$$r = \frac{\sum_{i=1}^n (x_{i1}-\bar{x}_1)(x_{i2}-\bar{x}_2)}{\sqrt{\sum_{i=1}^n(x_{i1}-\bar{x}_1)^2\sum_{i=1}^n(x_{i2}-\bar{x}_2)^2}} = \frac{\mbox{covariance}(x_1, x_2)}{s_{x_1}s_{x_2}}$$`
- Test for statistical significance, whether population correlation could be 0 based on observed `$r$`, using a `$t_{n-2}$` distribution:

`$$t=\frac{r}{\sqrt{1-r^2}}\sqrt{n-2}$$`
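
As a rough check of the two formulas above, the sketch below computes `$r$` and the `$t$` statistic by hand and compares them with `cor()` and `cor.test()`. The simulated data, seed and sample size are assumptions for illustration only, not the data used in these slides.

```r
# Simulated data; the seed, sample size and slope are illustrative assumptions
set.seed(1)
n <- 50
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)

# Correlation from the definition (sums of cross-products and squares)
r <- sum((x1 - mean(x1)) * (x2 - mean(x2))) /
  sqrt(sum((x1 - mean(x1))^2) * sum((x2 - mean(x2))^2))

# t statistic and two-sided p-value from a t distribution with n - 2 df
t_stat <- r / sqrt(1 - r^2) * sqrt(n - 2)
p_val <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# Compare with the built-in functions
all.equal(r, cor(x1, x2))
all.equal(t_stat, unname(cor.test(x1, x2)$statistic))
```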
---

.flex[
.item[
<img src="images/lecture-06B/unnamed-chunk-4-1.png" width="100%" style="display: block; margin: auto;" />
]

.item[

```r
cor(d1$x, d1$y)
```

```
## [1] 0.5228401
```

```r
cor.test(d1$x, d1$y)
```

```
## 
## 	Pearson's product-moment correlation
## 
## data:  d1$x and d1$y
*## t = 8.6306, df = 198, p-value = 1.993e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4141406 0.6168362
## sample estimates:
##       cor 
## 0.5228401
```
]
]

---

.flex[
.item[
<img src="images/lecture-06B/unnamed-chunk-7-1.png" width="100%" style="display: block; margin: auto;" />
]

.item[

```r
cor(d2$x, d2$y)
```

```
## [1] -0.04993755
```

```r
cor.test(d2$x, d2$y)
```

```
## 
## 	Pearson's product-moment correlation
## 
## data:  d2$x and d2$y
*## t = -0.70356, df = 198, p-value = 0.4825
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.18738032  0.08942303
## sample estimates:
##         cor 
## -0.04993755
```
]
]

---

.flex[
.w-30[
<img src="images/lecture-06B/unnamed-chunk-10-1.png" width="100%" style="display: block; margin: auto;" />
]

.w-30[
All observations

```
## $estimate
*##       cor 
*## 0.2994041 
## 
## $statistic
##        t 
## 4.426682 
## 
*## $p.value
*## [1] 1.576086e-05
```
]

.w-5.white[
gap
]

.w-30[
Without outlier

```
## $estimate
*##         cor 
*## -0.01173776 
## 
## $statistic
##          t 
## -0.1651764 
## 
*## $p.value
*## [1] 0.8689737
```
]
]
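
The effect of a single outlier on `$r$` can be sketched with simulated data. Here 199 points with no association are combined with one point placed at (10, 10); the seed, sample size and outlier location are assumptions chosen for illustration, not the data plotted above.

```r
set.seed(2)
x <- rnorm(199)
y <- rnorm(199) # generated independently of x

# Append a single extreme observation
x_all <- c(x, 10)
y_all <- c(y, 10)

r_clean <- cor(x, y)       # without the outlier
r_all <- cor(x_all, y_all) # with the outlier
c(without_outlier = r_clean, all_observations = r_all)
```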

---
# Perceiving correlation

.panelset[
.panel[.panel-name[🖼️]

.monash-orange2[Let's play a game:] Guess the correlation!

<br>
<img src="images/lecture-06B/simcor-1.png" width="70%" style="display: block; margin: auto;" />
]
.panel[.panel-name[answers]

Generally, people don't do very well at this task. Typically people under-estimate `$r$` from scatterplots, particularly when it is around 0.4-0.7. The perceived variation in a scatterplot does not change linearly with `$r$`.

When someone says .monash-blue2[*correlation is 0.5* it sounds impressive]. BUT when someone shows you a .monash-blue2[scatterplot of data that has correlation 0.5], you will say that's a .monash-blue2[weak relationship.]
]
.panel[.panel-name[R]
.s400[

```r
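library(tidyverse) # as_tibble(), ggplot()
library(mvtnorm)   # rmvnorm()
library(gridExtra) # grid.arrange()
set.seed(7777)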
vc <- matrix(c(1, 0, 0, 1), ncol = 2, byrow = T)
d <- as_tibble(rmvnorm(500, sigma = vc))
p1 <- ggplot(d, aes(x = V1, y = V2)) +
  geom_point() +
  theme_void() +
  theme(
    aspect.ratio = 1,
    plot.background = element_rect(fill = "gray90")
  )
vc <- matrix(c(1, 0.4, 0.4, 1), ncol = 2, byrow = T)
d <- as_tibble(rmvnorm(500, sigma = vc))
p2 <- ggplot(d, aes(x = V1, y = V2)) +
  geom_point() +
  theme_void() +
  theme(
    aspect.ratio = 1,
    plot.background = element_rect(fill = "gray90")
  )
vc <- matrix(c(1, 0.6, 0.6, 1), ncol = 2, byrow = T)
d <- as_tibble(rmvnorm(500, sigma = vc))
p3 <- ggplot(d, aes(x = V1, y = V2)) +
  geom_point() +
  theme_void() +
  theme(
    aspect.ratio = 1,
    plot.background = element_rect(fill = "gray90")
  )
vc <- matrix(c(1, 0.8, 0.8, 1), ncol = 2, byrow = T)
d <- as_tibble(rmvnorm(500, sigma = vc))
p4 <- ggplot(d, aes(x = V1, y = V2)) +
  geom_point() +
  theme_void() +
  theme(
    aspect.ratio = 1,
    plot.background = element_rect(fill = "gray90")
  )
vc <- matrix(c(1, -0.2, -0.2, 1), ncol = 2, byrow = T)
d <- as_tibble(rmvnorm(500, sigma = vc))
p5 <- ggplot(d, aes(x = V1, y = V2)) +
  geom_point() +
  theme_void() +
  theme(
    aspect.ratio = 1,
    plot.background = element_rect(fill = "gray90")
  )
vc <- matrix(c(1, -0.5, -0.5, 1), ncol = 2, byrow = T)
d <- as_tibble(rmvnorm(500, sigma = vc))
p6 <- ggplot(d, aes(x = V1, y = V2)) +
  geom_point() +
  theme_void() +
  theme(
    aspect.ratio = 1,
    plot.background = element_rect(fill = "gray90")
  )
vc <- matrix(c(1, -0.7, -0.7, 1), ncol = 2, byrow = T)
d <- as_tibble(rmvnorm(500, sigma = vc))
p7 <- ggplot(d, aes(x = V1, y = V2)) +
  geom_point() +
  theme_void() +
  theme(
    aspect.ratio = 1,
    plot.background = element_rect(fill = "gray90")
  )
vc <- matrix(c(1, -0.9, -0.9, 1), ncol = 2, byrow = T)
d <- as_tibble(rmvnorm(500, sigma = vc))
p8 <- ggplot(d, aes(x = V1, y = V2)) +
  geom_point() +
  theme_void() +
  theme(
    aspect.ratio = 1,
    plot.background = element_rect(fill = "gray90")
  )
grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, ncol = 4)
```
]
.scroll-sign[
<br>
]
]
]
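
To see how closely a sample of 500 points tracks the correlation it was simulated from, the sketch below draws bivariate normal samples using base R only. The helper `sim_cor()` is a hypothetical name introduced here, and the construction `y = rho * x + sqrt(1 - rho^2) * z` is a swapped-in alternative to the `rmvnorm()` call in the R panel.

```r
set.seed(7777)

# Hypothetical helper: draw n points from a standard bivariate normal
# with correlation rho, and return the sample correlation
sim_cor <- function(rho, n = 500) {
  x <- rnorm(n)
  z <- rnorm(n)
  y <- rho * x + sqrt(1 - rho^2) * z
  cor(x, y)
}

rho <- c(0, 0.4, 0.6, 0.8, -0.2, -0.5, -0.7, -0.9)
r <- sapply(rho, sim_cor)
round(data.frame(target = rho, sample_r = r), 2)
```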

---
# Robust correlation measures 1/2

- Spearman (based on ranks)
    - Sort each variable, and return rank (of actual value)
    - Compute correlation between ranks of each variable

.pull-left[

```
## # A tibble: 6 × 4
##       x     y    xr    yr
##   <dbl> <dbl> <dbl> <dbl>
## 1   0.7  -1.7     5     1
## 2   0.5   1.1     4     5
## 3  -0.6   0.3     2     3
## 4  -0.2  -0.9     3     2
## 5  -1.7   0.4     1     4
## 6  10    10       6     6
```
].pull-right[

```r
cor(df$x, df$y)
```

```
## [1] 0.935397
```

```r
cor(df$xr, df$yr)
```

```
## [1] 0.2
```

```r
cor(df$x, df$y, method = "spearman")
```

```
## [1] 0.2
```

]
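
The two steps can be reproduced by hand. The sketch below re-enters the six observations from the table (rounded as printed, so the Pearson value may differ slightly from the output shown) and checks that correlating the ranks matches `method = "spearman"`.

```r
df <- data.frame(
  x = c(0.7, 0.5, -0.6, -0.2, -1.7, 10),
  y = c(-1.7, 1.1, 0.3, -0.9, 0.4, 10)
)

# Step 1: rank each variable
df$xr <- rank(df$x)
df$yr <- rank(df$y)

# Step 2: Pearson correlation of the ranks
r_ranks <- cor(df$xr, df$yr)

# Same value from the built-in method
r_spearman <- cor(df$x, df$y, method = "spearman")
c(ranks = r_ranks, spearman = r_spearman)
```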

---
# Robust correlation measures 2/2

- Kendall `$\tau$` (based on comparing pairs of observations)
    - Sort each variable, and return rank (of actual value)
    - For all pairs of observations `$(x_i, y_i), (x_j, y_j)$`, determine  if **concordant**, `$x_i < x_j, y_i < y_j$` or `$x_i > x_j, y_i > y_j$`, or **discordant**, `$x_i < x_j, y_i > y_j$` or `$x_i > x_j, y_i < y_j$`.

`$$\tau = \frac{n_c-n_d}{\frac12 n(n-1)}$$`

.pull-left[
<img src="images/lecture-06B/unnamed-chunk-17-1.png" width="70%" style="display: block; margin: auto;" />
]
.pull-right[

```r
cor(df$x, df$y)
```

```
## [1] 0.935397
```

```r
cor(df$x, df$y, method = "kendall")
```

```
## [1] 0.06666667
```

]
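
The pair counting can be sketched directly. This uses the same six observations as the Spearman slide, re-entered from the printed (rounded) table, and compares the hand count with `method = "kendall"`.

```r
x <- c(0.7, 0.5, -0.6, -0.2, -1.7, 10)
y <- c(-1.7, 1.1, 0.3, -0.9, 0.4, 10)
n <- length(x)

# All n(n-1)/2 pairs of observations, one pair per column
pairs <- combn(n, 2)
i <- pairs[1, ]
j <- pairs[2, ]

# +1 if x and y order the pair the same way (concordant), -1 if not (discordant)
s <- sign(x[i] - x[j]) * sign(y[i] - y[j])
n_c <- sum(s > 0)
n_d <- sum(s < 0)

tau <- (n_c - n_d) / (0.5 * n * (n - 1))
c(concordant = n_c, discordant = n_d, tau = tau)
```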

---
# Comparison of correlation measures

<table class="table lightable-classic" style='width: auto !important; margin-left: auto; margin-right: auto; font-family: "Arial Narrow", "Source Sans Pro", sans-serif; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;"> sample </th>
   <th style="text-align:right;"> corr </th>
   <th style="text-align:right;"> spearman </th>
   <th style="text-align:right;"> kendall </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> <img src="images/lecture-05B/diffscatter-1.png" height="100px"> </td>
   <td style="text-align:right;"> 0.523 </td>
   <td style="text-align:right;"> 0.512 </td>
   <td style="text-align:right;"> 0.355 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> <img src="images/lecture-05B/diffscatter-2.png" height="100px"> </td>
   <td style="text-align:right;"> -0.050 </td>
   <td style="text-align:right;"> -0.087 </td>
   <td style="text-align:right;"> -0.073 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> <img src="images/lecture-05B/diffscatter-3.png" height="100px"> </td>
   <td style="text-align:right;"> 0.299 </td>
   <td style="text-align:right;"> -0.023 </td>
   <td style="text-align:right;"> -0.014 </td>
  </tr>
</tbody>
</table>

---
class: transition middle

# Scatterplot case studies

---
#  .orange[Case study] .bg-orange.circle[2] Movies

.panelset[
.panel[.panel-name[🖼️]
.flex[
.w-50[
<img src="images/lecture-06B/movies-1.png" width="80%" style="display: block; margin: auto;" />
]
.w-50[
- `votes`: Number of IMDB users who rated this movie
- `rating`: Average IMDB user rating

<br>
<br>
Describe the relationship between rating and votes. 
]
]
]
.panel[.panel-name[learn]
.grid[
.item[
- Odd pattern, almost looks like an "r"
- No films with lots of votes and low rating
- No film with lots of votes has rating close to maximum possible: **barrier?**
- Films with very high ratings only have a few votes
- Generally, rating appears to increase as votes increase (it's hard to really read this with so few points though)
- A few films with really large number of votes: **outliers?** or just **skewness?**
- Films with few votes have ratings that span the range of the scale.
]
.item[
Would you say this is positive, linear, moderate?

Or positive, non-linear, and moderate? Or weak?

In some sense, these descriptions are meaningless, here.

<br>
What about causation? association? outliers?  clusters? gaps? barrier? conditional relationships?

<br>

.monash-blue2[These descriptors help to describe relationships generally, but it is important to translate them into the context of the (variables in the) data.]

.monash-orange2[BUT there is skewness in votes that needs fixing before assessing the relationship.]
]
]
]
.panel[.panel-name[R]

```r
ggplot(movies, aes(x = votes, y = rating)) +
  geom_point() +
  scale_y_continuous("rating", breaks = seq(0, 10, 2))
```
]
]

---
#  .orange[Case study] .bg-orange.circle[2] Movies

.panelset[
.panel[.panel-name[🖼️]
.flex[
.w-50[
<img src="images/lecture-06B/logmovies-1.png" width="80%" style="display: block; margin: auto;" />
]
.w-50[
<br>
<br>
<br>
<br>
🤔 Something funny happens, right at 1000 votes
<br>
<br>

Some positive association between two variables only for large number of votes.
]
]
]
.panel[.panel-name[R]

```r
ggplot(movies, aes(x = votes, y = rating)) +
  geom_point(alpha = 0.1) +
  geom_smooth(se = F, colour = "orange", size = 2) +
  scale_x_log10() +
  scale_y_continuous("rating", breaks = seq(0, 10, 2))
```

*Note*: Used .monash-orange2[transparency] (because there is a lot of data) and a .monash-orange2[loess smooth] (because I am interested in assessing the trend between votes and rating).

<br>

Correlation between .monash-blue2[raw variables] is 0.1 <br> and between .monash-blue2[transformed] `log(votes)` and `rating` is 0.07. Which more accurately reflects the relationship?

]
]

---
#  .orange[Case study] .bg-orange.circle[3] Cars

.panelset[
.panel[.panel-name[🖼️]
.flex[
.w-50[
<img src="images/lecture-06B/cars-1.png" width="80%" style="display: block; margin: auto;" />
]
.w-50[
- `mpg`: Miles/(US) gallon
- `hp`: Gross horsepower

<br>
<br>
Describe the relationship between horsepower and mpg. 
]
]
]
.panel[.panel-name[learn]

- negative: as horsepower increases, fuel efficiency gets worse
- nonlinear: at lower horsepower, the decrease in efficiency is steeper
- strong: very little variation between cars, looks fundamentally like a physics problem
- outlier: one car with high horsepower has unusually high efficiency

]
.panel[.panel-name[R]

```r
data(mtcars)
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_smooth(colour = "forestgreen", se = F)
```
]
]

---
#  .orange[Case study] .bg-orange.circle[3] Cars

.panelset[
.panel[.panel-name[🖼️]
.flex[
.w-50[
<img src="images/lecture-06B/logcars-1.png" width="80%" style="display: block; margin: auto;" />
]
.w-50[
- `mpg`: Miles/(US) gallon
- `hp`: Gross horsepower

<br>
<br>
Log transforming `mpg` linearised the relationship between horsepower and mpg.

.monash-green2[Need to also remove the outlier, because it is a little influential (swinging the line towards it).]
]
]
]
.panel[.panel-name[R]

```r
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
* scale_y_log10("log mpg") +
  geom_smooth(method = "lm", colour = "forestgreen", se = F) +
  geom_smooth(data = filter(mtcars, hp < 300), method = "lm", colour = "orangered", se = F, lty = 2)
```

Correlation between .monash-blue2[raw variables] is -0.78 <br> and between .monash-blue2[transformed] `log(mpg)` and `hp` is -0.85. Which more accurately reflects the relationship?

]
]
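
The correlations quoted above can be checked against the built-in `mtcars` data. This is a minimal sketch; the cut-off `hp < 300` for dropping the outlier follows the plotting code, and the last correlation is an extra comparison that is not quoted on the slide.

```r
data(mtcars)

# Raw variables
r_raw <- cor(mtcars$hp, mtcars$mpg)

# After log transforming mpg
r_log <- cor(mtcars$hp, log(mtcars$mpg))

# Log transformed mpg, with the high-horsepower outlier removed
no_out <- subset(mtcars, hp < 300)
r_log_no_out <- cor(no_out$hp, log(no_out$mpg))

round(c(raw = r_raw, log_mpg = r_log, log_mpg_no_outlier = r_log_no_out), 2)
```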
---
class: transition middle

# Transformations

for skewness, heteroskedasticity and linearising relationships, and to emphasize association

---
# Circle of transformations for linearising

.grid[
.item[ 
<img src="images/lecture-06B/circleoftrans-1.png" width="80%" style="display: block; margin: auto;" />
]
.item[
Remember the power ladder:

-1, 0, 1/3, 1/2, .monash-orange2[1], 2, 3, 4

<br>

1. Look at the shape of the relationship.
2. Imagine this to be a number plane, and depending on which quadrant the shape falls in, you either transform `$x$` or `$y$`, up or down the ladder: `+,+` both up; `+,-` x up, y down; `-,-` both down;  `-,+` x down, y up
    
<br>

If there is heteroskedasticity, try transforming `$y$`; this may or may not help.

]
]
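
The power ladder can be sketched as a small helper, where power 0 is taken to mean the log transformation. The function name `ladder()` is hypothetical, and using `mtcars` with the absolute correlation as a rough guide to linearity is an assumption for illustration; a plot remains the better check.

```r
# Hypothetical helper: move x to rung p of the power ladder (p = 0 means log)
ladder <- function(x, p) {
  if (p == 0) log(x) else x^p
}

powers <- c(-1, 0, 1/3, 1/2, 1, 2, 3, 4)

# Transform y = mpg along the ladder and record |r| with hp at each rung
abs_r <- sapply(powers, function(p) abs(cor(mtcars$hp, ladder(mtcars$mpg, p))))
round(setNames(abs_r, powers), 2)
```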

---
class: transition middle

# Scatterplot case studies

---
#  .orange[Case study] .bg-orange.circle[4] Soils

.flex[
.w-50[
<img src="images/lecture-06B/baker-1.png" width="80%" style="display: block; margin: auto;" />

]
.w-50[
Interplay between skewness and association

Data is from a soil chemical analysis of a farm field in Iowa. Is there a relationship between Yield and Boron?

<br>
You can get a marginal plot of each variable added to the scatterplot using `ggMarginal`. This is useful for assessing the skewness in each variable.

<br>
Boron is right-skewed and Yield is left-skewed. With skewed marginal distributions it is .monash-orange2[hard] to assess the relationship between the two. Transform the variables to fix the skewness first.
]
]
---
#  .orange[Case study] .bg-orange.circle[4] Soils

.flex[
.w-50[
<img src="images/lecture-06B/transfbaker-1.png" width="80%" style="display: block; margin: auto;" />
]
.w-50[

```r
p <- ggplot(
  baker,
  aes(x = B, y = Corn97BU^2)
*) +
  geom_point() +
  xlab("log Boron (ppm)") +
  ylab("Corn Yield^2 (bushels)") +
* scale_x_log10()
*ggMarginal(p, type = "density")
```

]
]

---
#  .orange[Case study] .bg-orange.circle[4] Soils

.flex[
.w-50[
<img src="images/lecture-06B/bakeriron-1.png" width="80%" style="display: block; margin: auto;" />
]
.w-50[

<br>
Lurking variable?

```r
p <- ggplot(
  baker,
  aes(x = Fe, y = Corn97BU^2)
) +
  geom_density2d(colour = "orange") +
  geom_point() +
* xlab("Iron (ppm)") +
  ylab("Corn Yield^2 (bushels)")
ggMarginal(p, type = "density")
```

]
]

---
#  .orange[Case study] .bg-orange.circle[4] Soils

.flex[
.w-40[
<img src="images/lecture-06B/bakerironca-1.png" width="100%" style="display: block; margin: auto;" />
]
.w-60[

Colour the high calcium (>5200 ppm) values

.f5[

```r
ggplot(baker, aes(
  x = Fe, y = Corn97BU^2,
* colour = ifelse(Ca > 5200,
    "high", "low"
  )
*)) +
  geom_point() +
  xlab("Iron (ppm)") +
  ylab("Corn Yield^2 (bushels)") +
  scale_colour_brewer("", palette = "Dark2") +
  theme(
    aspect.ratio = 1,
    legend.position = "bottom",
    legend.direction = "horizontal"
  )
```
]

If calcium levels in the soil are high, yield is consistently high. If calcium levels are low, then there is a positive relationship between yield and iron, with higher iron associated with higher yields.

]
]
---
#  .orange[Case study] .bg-orange.circle[5] COVID-19

.panelset[
.panel[.panel-name[🖼️]

]
.panel[.panel-name[info]

<br><br><br>
In a bubble plot, the size of each point is mapped to another variable.

This bubble plot shows the total count of COVID-19 cases (as of Aug 30, 2020) for every county in the USA, inspired by the [New York Times coverage](https://www.nytimes.com/news-event/coronavirus).

]
.panel[.panel-name[R]

```r
load("../data/nyt_covid.rda")
usa <- map_data("state")
ggplot() +
  geom_polygon(
    data = usa,
    aes(x = long, y = lat, group = group),
    fill = "grey90", colour = "white"
  ) +
  geom_point(
    data = nyt_county_total,
    aes(x = lon, y = lat, size = cases),
    colour = "red", shape = 1
  ) +
  geom_point(
    data = nyt_county_total,
    aes(x = lon, y = lat, size = cases),
    colour = "red", fill = "red", alpha = 0.1, shape = 16
  ) +
  scale_size("", range = c(1, 30)) +
  theme_map() +
  theme(legend.position = "none")
```
]
]

---
# Scales matter

.grid[
.item[
<br>
<br>

]
.item[
<br>
<br>
Where has COVID-19 hit the hardest?
<br>

Where are there more people?
<br>
<br>
<br>

This plot tells you NOTHING except where the population centres are in the USA. To understand relative incidence/risk, report COVID numbers relative to the population. For example, .monash-orange2[number of cases per 100,000 people].
]
]
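
Converting counts to a rate per 100,000 people is simple arithmetic. The two counties and their numbers below are made up for illustration and are not taken from the NYT data.

```r
# Hypothetical counties: a large one with many cases, a small one with few
counties <- data.frame(
  county = c("Large", "Small"),
  cases = c(5000, 300),
  population = c(1000000, 20000)
)

# Cases per 100,000 people
counties$cases_per_100k <- counties$cases / counties$population * 100000
counties
```

The small county has far fewer cases but three times the rate, which a bubble sized by raw counts would hide.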

---
class: transition middle

# Beyond quantitative variables

---
# When variables are not quantitative

> What do you do if the variables are not continuous/quantitative?

The type of variable determines the choice of mapping.

- Continuous and categorical `$\longrightarrow$` side-by-side boxplots, side-by-side density plots
- Both categorical `$\longrightarrow$` faceted bar charts, stacked bar charts, mosaic plots, double decker plots

> We'll see more examples soon.
---
class: transition middle

# Paradoxes

---
# Simpson's paradox

If there is an additional variable which, when used for conditioning, changes the association between the variables, you have a .monash-orange2[paradox] 🙃.

.grid[
.item[
<img src="images/lecture-06B/scat-1.png" width="70%" style="display: block; margin: auto;" />

]
.item[
<img src="images/lecture-06B/scatcol-1.png" width="70%" style="display: block; margin: auto;" />

]
]
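
A simulated example of such a reversal, in base R. The two groups, their centres and the within-group slope are assumptions chosen to make the effect obvious.

```r
set.seed(3)
n <- 100
g <- rep(c("A", "B"), each = n)
shift <- ifelse(g == "A", 0, 5) # group B sits up and to the right of group A

x <- rnorm(2 * n) + shift
y <- -0.7 * (x - shift) + rnorm(2 * n, sd = 0.5) + shift

# Ignoring the group: positive association
r_overall <- cor(x, y)

# Conditioning on the group: negative association within each group
r_by_group <- sapply(split(data.frame(x, y), g), function(d) cor(d$x, d$y))

round(c(overall = r_overall, r_by_group), 2)
```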

---
# Simpson's paradox: famous example

Did Berkeley .monash-orange2[discriminate] against female applicants?

.footnote[Example from Unwin (2015)]

---
# Simpson's paradox: famous example

Based on separately examining each department, there is .monash-orange2[no evidence of discrimination] against female applicants.

.footnote[Example from Unwin (2015)]
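
R ships with the `UCBAdmissions` table of 1973 Berkeley graduate admissions for the six largest departments. Assuming this is the data behind the example, the sketch below compares admission rates by gender overall and within each department.

```r
data(UCBAdmissions)

# Counts aggregated over departments: Admit x Gender
overall <- apply(UCBAdmissions, c(1, 2), sum)
overall_rate <- prop.table(overall, margin = 2)["Admitted", ]
round(overall_rate, 2)

# Admission rate within each department: Gender x Dept
dept_rate <- prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ]
round(dept_rate, 2)
```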

---
class: transition middle

# Is what you see really association?

---
# Checking association with visual inference

.panelset[
.panel[.panel-name[Soils]

]
.panel[.panel-name[R]

```r
ggplot(
  lineup(null_permute("Corn97BU"), baker, n = 12),
  aes(x = B, y = Corn97BU)
) +
  geom_point() +
  facet_wrap(~.sample, ncol = 4)
```

11 of the panels have had the association broken by permuting one variable. .monash-blue2[There is no association] in these data sets, and hence plots. Does the data plot stand out as being different from the null (no association) plots?

]
.panel[.panel-name[Olympics]

]
.panel[.panel-name[R]

.f5[

```r
data(oly12, package = "VGAMdata")
oly12_sub <- oly12 %>%
  filter(Sport %in% c(
    "Swimming", "Archery",
    "Hockey", "Tennis"
  )) %>%
  filter(Sex == "F") %>%
  mutate(Sport = fct_drop(Sport), Sex = fct_drop(Sex))

ggplot(
  lineup(null_permute("Sport"), oly12_sub, n = 12),
  aes(x = Height, y = Weight, colour = Sport)
) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_colour_brewer("", palette = "Dark2") +
  facet_wrap(~.sample, ncol = 4) +
  theme(legend.position = "none")
```
]

11 of the panels have had the association broken by permuting the Sport label. .monash-blue2[There is no difference in the association between weight and height across sports] in these data sets, and hence plots. Does the data plot stand out as being different from the null (no association difference between sports) plots?

]
]
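
The idea behind `null_permute()` can be sketched in base R: shuffling one variable breaks any association with the other while keeping both marginal distributions. The sketch uses `mtcars` rather than the soils or Olympics data, and summarises each null data set by its correlation instead of a plot; both are assumptions for illustration.

```r
set.seed(4)
obs_r <- cor(mtcars$hp, mtcars$mpg)

# 11 null data sets: permute mpg, keep hp fixed
null_r <- replicate(11, cor(mtcars$hp, sample(mtcars$mpg)))

# Does the observed value stand out from the nulls?
round(c(observed = obs_r, null_min = min(null_r), null_max = max(null_r)), 2)
abs(obs_r) > max(abs(null_r))
```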

---
# Resources

- Friendly and Denis "Milestones in History of Thematic Cartography, Statistical Graphics and Data Visualisation" available at http://www.datavis.ca/milestones/
- Unwin (2015) [Graphical Data Analysis with R](http://www.gradaanwr.net)
- Graphics using [ggplot2](https://ggplot2.tidyverse.org)
- Wilke (2019) Fundamentals of Data Visualization https://clauswilke.com/dataviz/

---

background-size: cover
class: title-slide
background-image: url("images/bg-12.png")

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.

.bottom_abs.width100[

Lecturer: *Di Cook*

<i class="fas fa-envelope"></i>  ETC5521.Clayton-x@monash.edu

<i class="fas fa-calendar-alt"></i> Week 6 - Session 2

<br>

]