ETC5521: Exploratory Data Analysis

.info-box.w-50.bg-white[
These slides are viewed best by Chrome or Firefox and occasionally need to be refreshed if elements did not load properly. See <a href=lecture-03B.pdf>here for the PDF </a>. 
]

---

# .monash-blue[ETC5521: Exploratory Data Analysis]

<h2 style="font-weight:900!important;">Initial data analysis and model diagnostics</h2>

.bottom_abs.width100[

Lecturer: *Di Cook*

ETC5521.Clayton-x@monash.edu

Week 3 - Session 2

]

---
# .circle.bg-black.white[2] Hypothesis Testing and Predictive Modeling.f4[Part 3/3]

.w-60[
- Hypothesis testing: usually make assumptions about the distribution of the data, and are formed relative to a parameter.
- Predictive modeling: form of the relationship, distribution of the errors. 
]

---

# Hypothesis testing in R .f4[REVIEW] .f5[Part 1/3]

.w-90[
- State the hypothesis (pair), e.g. `$H_o: \mu_1 = \mu_2$` vs `$H_a: \mu_1 < \mu_2$`. 
- Test statistic depends on _assumption_ about the distribution, e.g. 
 - `$t$`-test will assume that distributions are _normal_, or small departures from if we have a large sample. 
 - two-sample might assume both groups have the _same variance_ 
- Steps to complete: 
 - Compute the test statistic
 - Measure it against a standard distribution
 - If it is extreme, `$p$`-value is small, decision is to reject `$H_o$`
 - `$p$`-value is the probability of observing a value as large as this, or large, assuming `$H_o$` is true.
]

---
# .blue[Example] .circle.bg-blue.white[1] Checking variance and distribution assumption .f4[Part 1/2]

```r
data(sleep)
ggplot(sleep, aes(x=group, y=extra)) + 
  geom_boxplot() +
  geom_point(colour="orange")
```

<img src="images/lecture-03B/sleep-1.png" width="80%" style="display: block; margin: auto;" />
]
]
.pull-right[

```r
ggplot(sleep, aes(x=extra)) + 
  geom_density(fill="orange", colour="orange", alpha=0.6) + 
  geom_rug(outside = TRUE, colour="orange") +
  coord_cartesian(clip = "off") +
  facet_wrap(~group)
```

<img src="images/lecture-03B/unnamed-chunk-3-1.png" width="90%" style="display: block; margin: auto;" />
]
.footnote[Cushny, A. R. and Peebles, A. R. (1905) The action of optical isomers: II hyoscines. The Journal of Physiology 32, 501–510.]

---
# .blue[Example] .circle.bg-blue.white[1] Hypothesis test .f4[Part 2/2]

```r
tt <- with(sleep,
 t.test(extra[group == 1],
 extra[group == 2], 
 paired = TRUE))
```

```r
tt$estimate
```

```
## mean difference 
##           -1.58
```

```r
tt$null.value
```

```
## mean difference 
##               0
```
]

```r
tt$statistic
```

t 
-4.062128

```r
tt$p.value
```

[1] 0.00283289

```r
tt$conf.int
```

[1] -2.4598858 -0.7001142
attr(,"conf.level")
[1] 0.95
]
.footnote[Cushny, A. R. and Peebles, A. R. (1905) The action of optical isomers: II hyoscines. The Journal of Physiology 32, 501–510.]

---
# .blue[Example] .circle.bg-blue.white[2] Checking distribution assumption

```r
InsectSprays %>% 
  ggplot(aes(x=fct_reorder(spray, count), 
             y=count)) + 
  geom_jitter(width=0.1, height=0, colour="orange", size=3, alpha=0.8) +
  xlab("") 
```

]

```r
fm1 <- aov(count ~ spray, data = InsectSprays)
summary(fm1)
```

```
## Df Sum Sq Mean Sq F value Pr(>F) 
## spray 5 2669 533.8 34.7 <2e-16 ***
## Residuals 66 1015 15.4 
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

# Linear models in R .f4[REVIEW] .f5[Part 1/3]

```r
library(tidyverse)
library(broom)
glimpse(cars)
```

```
## Rows: 50
## Columns: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 16, 16, 17, 17, 17, 18, 18, 18, 18, 19, 19, 19, 20, 20, 20, 20, 20, 22, 23, 24, 24, 24…
## $ dist <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34, 46, 26, 36, 60, 80, 20, 26, 54, 32, 40, 32, 40, 50, 42, 56, 76, 84, 36, 46, 68, 32, 48, 52, 56, 64, 66, 54, 70, 92…
```
]
--

```r
ggplot(cars, aes(speed, dist)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)
```

<img src="images/lecture-03B/plot-cars-1.png" width="432" style="display: block; margin: auto;" />
]

---

# Linear models in R .f4[REVIEW] .f5[Part 2/3]

* We can fit linear models in R with the `lm` function:
```r
lm(dist ~ speed, data = cars)
```
is the same as
```r
lm(dist ~ 1 + speed, data = cars)
```

* The above model is mathematically written as 
`$$y_i = \beta_0 + \beta_1 x_i + e_i$$`
where <ul>
<li>\$y_i\$ and \$x_i\$ are the stopping distance (in ft) and speed (in mph), respectively, of the \$i\$-th car;</li> <li>\$\beta_0\$ and \$\beta_1\$ are intercept and slope, respectively; and</li>
<li>\$e_i\$ is the random error; usually assuming \$e_i \sim NID(0, \sigma^2)\$. </li>
</ul>

---

# Linear models in R .f4[REVIEW] .f5[Part 3/3]

.flex[
.w-20[
<img src="images/lecture-03B/plot-cars-1.png" width="216" style="display: block; margin: auto auto auto 0;" />

]
.w-80[

```r
fit <- lm(dist ~ 1 + speed, data = cars)
tidy(fit)
```

```
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -17.6 6.76 -2.60 1.23e- 2
## 2 speed 3.93 0.416 9.46 1.49e-12
```

```r
glance(fit)
```

```
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 0.651 0.644 15.4 89.6 1.49e-12 1 -207. 419. 425. 11354. 48 50
```
]

]

.w-80[
* *Assuming* this model is appropriate, .monash-blue[the stopping distance increases by about 4 ft for increase in speed by 1 mph.]
]

---

# .circle.bg-black.white[2] Model form .f4[Part 1/2]

.w-60[
* Say, we are interested in characterising the price of the diamond in terms of its carat.
<img src="images/lecture-03B/plot-diamonds-1.png" width="432" style="display: block; margin: auto;" />
* Looking at this plot, would you fit a linear model with formula

]

---

# .circle.bg-black.white[2] Model form .f4[Part 1/2]

.w-60[
* Say, we are interested in characterising the price of the diamond in terms of its carat.
<img src="images/lecture-03B/plot-diamonds-lm-1.png" width="432" style="display: block; margin: auto;" />
* Looking at this plot, would you fit a linear model with formula

]

---

# .circle.bg-black.white[2] Model form .f4[Part 2/2]

.flex[
.w-50[
<img src="images/lecture-03B/plot-diamonds-lm2-1.png" width="432" style="display: block; margin: auto;" />
]
.w-50[
* What about
<center>
<code style="color: orange">price ~ poly(carat, 2)</code>?
</center>
which is the same as fitting:

`$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + e_i.$$`

{{content}}
]
]

* Should the assumption for error distribution be modified if so?
{{content}}
--

* Should we make some transformation before modelling?
{{content}}
--

* Are there other candidate models?

---

# .circle.bg-black.white[2] Model form .f4[Part 2/2]

* Notice that there was _**no formal statistical inference**_ when trying to determine an appropriate model form.
--

* The goal of the main analysis is to characterise the price of a diamond by its carat. This may involve:
   * formal inference for model selection;
   * justification of the selected "final" model; and
   * fitting the final model.
--

* There may be in fact many, many models considered but discarded at the IDA stage. 
--

* These discarded models are hardly ever reported. Consequently, majority of reported statistics give a distorted view and it's important to remind yourself what might _**not**_ be reported.

---

# Model selection

.blockquote.w-70[
All models are _approximate_ and _tentative_; approximate in the sense that no model is exactly true and tentative in that they may be modified in the light of further data

]

]

---
class: transition

# Model diagnostics

---
# Residuals .f4[1/2]

.flex[
.w-70[
<img src="images/lecture-03B/plot-diamonds-resid-1.png" width="80%" style="display: block; margin: auto;" />
]
.w-30[

Residual = Observed - Fitted

Residual plot: Plot the residual against explanatory variable (or Fitted value)

---
# Alternative approach: linearise relationship

.flex[
.w-50[
<img src="images/lecture-03B/plot-diamonds-lm3-1.png" width="432" style="display: block; margin: auto;" />
]
.w-50[
<img src="images/lecture-03B/plot-diamonds-lm4-1.png" width="432" style="display: block; margin: auto;" />
]]

The .monash-orange2[log transformation of both variables] linearises the relationship, so that a simple linear model can be used, and also corrects the heteroskedasticity.

---
# Residuals .f4[2/2]

.flex[
.w-70[
<img src="images/lecture-03B/plot-diamonds-resid2-1.png" width="80%" style="display: block; margin: auto;" />
]
.w-30[

Which has the best residual plot?

]
]

---

Yet there is still very little emphasis of it in teaching and also at times in practice.

So don't forget to do IDA!

---

# Take away messages

<ul class="fa-ul">
{{content}}
</ul>

]
]

<li>Initial data analysis (IDA) is a model-focused exploration to support a confirmatory analysis with:
 <ul>
 <li> data description and collection </li>
 <li> data quality checking, and</li>
 <li> checking assumptions </li>
 <li> model fit without any formal statistical inference.</li>
 </ul>
</li>

<li>IDA may never see the limelight BUT it forms the foundation that the main analysis is built upon. Do it well!</li>

---

background-size: cover
class: title-slide
background-image: url("images/bg-12.png")

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a> This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.

.bottom_abs.width100[

Lecturer: *Di Cook*

ETC5521.Clayton-x@monash.edu

Week 3 - Session 2

]

Lecture materials originally developed by Dr Emi Tanaka