class: middle center hide-slide-number monash-bg-gray80 .info-box.w-50.bg-white[ These slides are viewed best by Chrome or Firefox and occasionally need to be refreshed if elements did not load properly. See <a href=lecture-03B.pdf>here for the PDF <i class="fas fa-file-pdf"></i></a>. ] <br> .white[Press the **right arrow** to progress to the next slide!] --- class: title-slide count: false background-image: url("images/bg-12.png") # .monash-blue[ETC5521: Exploratory Data Analysis] <h1 class="monash-blue" style="font-size: 30pt!important;"></h1> <br> <h2 style="font-weight:900!important;">Initial data analysis and model diagnostics</h2> .bottom_abs.width100[ Lecturer: *Di Cook* <i class="fas fa-envelope"></i> ETC5521.Clayton-x@monash.edu <i class="fas fa-calendar-alt"></i> Week 3 - Session 2 <br> ] --- # .circle.bg-black.white[2] Hypothesis Testing and Predictive Modeling.f4[Part 3/3] .w-60[ - Hypothesis testing: usually make assumptions about the distribution of the data, and are formed relative to a parameter. - Predictive modeling: form of the relationship, distribution of the errors. ] --- # Hypothesis testing in R .f4[REVIEW] .f5[Part 1/3] .w-90[ - State the hypothesis (pair), e.g. `\(H_o: \mu_1 = \mu_2\)` vs `\(H_a: \mu_1 < \mu_2\)`. - Test statistic depends on _assumption_ about the distribution, e.g. - `\(t\)`-test will assume that distributions are _normal_, or small departures from if we have a large sample. - two-sample might assume both groups have the _same variance_ - Steps to complete: - Compute the test statistic - Measure it against a standard distribution - If it is extreme, `\(p\)`-value is small, decision is to reject `\(H_o\)` - `\(p\)`-value is the probability of observing a value as large as this, or large, assuming `\(H_o\)` is true. ] --- # .blue[Example] .circle.bg-blue.white[1] Checking variance and distribution assumption .f4[Part 1/2] .pull-left[ .f4[ ```r data(sleep) ggplot(sleep, aes(x=group, y=extra)) + geom_boxplot() + geom_point(colour="orange") ``` <img src="images/lecture-03B/sleep-1.png" width="80%" style="display: block; margin: auto;" /> ] ] .pull-right[ ```r ggplot(sleep, aes(x=extra)) + geom_density(fill="orange", colour="orange", alpha=0.6) + geom_rug(outside = TRUE, colour="orange") + coord_cartesian(clip = "off") + facet_wrap(~group) ``` <img src="images/lecture-03B/unnamed-chunk-3-1.png" width="90%" style="display: block; margin: auto;" /> ] .footnote[Cushny, A. R. and Peebles, A. R. (1905) The action of optical isomers: II hyoscines. The Journal of Physiology 32, 501–510.] --- # .blue[Example] .circle.bg-blue.white[1] Hypothesis test .f4[Part 2/2] .pull-left[ ```r tt <- with(sleep, t.test(extra[group == 1], extra[group == 2], paired = TRUE)) ``` ```r tt$estimate ``` ``` ## mean difference ## -1.58 ``` ```r tt$null.value ``` ``` ## mean difference ## 0 ``` ] .pull-right[ ```r tt$statistic ``` t -4.062128 ```r tt$p.value ``` [1] 0.00283289 ```r tt$conf.int ``` [1] -2.4598858 -0.7001142 attr(,"conf.level") [1] 0.95 ] .footnote[Cushny, A. R. and Peebles, A. R. (1905) The action of optical isomers: II hyoscines. The Journal of Physiology 32, 501–510.] --- # .blue[Example] .circle.bg-blue.white[2] Checking distribution assumption .pull-left[ ```r InsectSprays %>% ggplot(aes(x=fct_reorder(spray, count), y=count)) + geom_jitter(width=0.1, height=0, colour="orange", size=3, alpha=0.8) + xlab("") ``` <img src="images/lecture-03B/unnamed-chunk-10-1.png" width="432" style="display: block; margin: auto;" /> .f4[Can you see any violations of normality? Or equal variance?] ] .pull-right[ ```r fm1 <- aov(count ~ spray, data = InsectSprays) summary(fm1) ``` ``` ## Df Sum Sq Mean Sq F value Pr(>F) ## spray 5 2669 533.8 34.7 <2e-16 *** ## Residuals 66 1015 15.4 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` .f4[Write down the hypothesis being tested. What would the decision be?] ] --- # Linear models in R .f4[REVIEW] .f5[Part 1/3] .f4[ ```r library(tidyverse) library(broom) glimpse(cars) ``` ``` ## Rows: 50 ## Columns: 2 ## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 16, 16, 17, 17, 17, 18, 18, 18, 18, 19, 19, 19, 20, 20, 20, 20, 20, 22, 23, 24, 24, 24… ## $ dist <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34, 46, 26, 36, 60, 80, 20, 26, 54, 32, 40, 32, 40, 50, 42, 56, 76, 84, 36, 46, 68, 32, 48, 52, 56, 64, 66, 54, 70, 92… ``` ] -- .f4[ ```r ggplot(cars, aes(speed, dist)) + geom_point() + geom_smooth(method = "lm", se = FALSE) ``` <img src="images/lecture-03B/plot-cars-1.png" width="432" style="display: block; margin: auto;" /> ] --- # Linear models in R .f4[REVIEW] .f5[Part 2/3] * We can fit linear models in R with the `lm` function: ```r lm(dist ~ speed, data = cars) ``` is the same as ```r lm(dist ~ 1 + speed, data = cars) ``` -- * The above model is mathematically written as `$$y_i = \beta_0 + \beta_1 x_i + e_i$$` where <ul> <li>\\(y_i\\) and \\(x_i\\) are the stopping distance (in ft) and speed (in mph), respectively, of the \\(i\\)-th car;</li> <li>\\(\beta_0\\) and \\(\beta_1\\) are intercept and slope, respectively; and</li> <li>\\(e_i\\) is the random error; usually assuming \\(e_i \sim NID(0, \sigma^2)\\). </li> </ul> --- # Linear models in R .f4[REVIEW] .f5[Part 3/3] .flex[ .w-20[ <img src="images/lecture-03B/plot-cars-1.png" width="216" style="display: block; margin: auto auto auto 0;" /> ] .w-80[ ```r fit <- lm(dist ~ 1 + speed, data = cars) tidy(fit) ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -17.6 6.76 -2.60 1.23e- 2 ## 2 speed 3.93 0.416 9.46 1.49e-12 ``` ```r glance(fit) ``` ``` ## # A tibble: 1 × 12 ## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> ## 1 0.651 0.644 15.4 89.6 1.49e-12 1 -207. 419. 425. 11354. 48 50 ``` ] ] -- .w-80[ * *Assuming* this model is appropriate, .monash-blue[the stopping distance increases by about 4 ft for increase in speed by 1 mph.] ] --- count: false # .circle.bg-black.white[2] Model form .f4[Part 1/2] .w-60[ * Say, we are interested in characterising the price of the diamond in terms of its carat. <img src="images/lecture-03B/plot-diamonds-1.png" width="432" style="display: block; margin: auto;" /> * Looking at this plot, would you fit a linear model with formula .center[ `price ~ 1 + carat`? ] ] --- # .circle.bg-black.white[2] Model form .f4[Part 1/2] .w-60[ * Say, we are interested in characterising the price of the diamond in terms of its carat. <img src="images/lecture-03B/plot-diamonds-lm-1.png" width="432" style="display: block; margin: auto;" /> * Looking at this plot, would you fit a linear model with formula .center[ `price ~ 1 + carat`? ] ] --- # .circle.bg-black.white[2] Model form .f4[Part 2/2] .flex[ .w-50[ <img src="images/lecture-03B/plot-diamonds-lm2-1.png" width="432" style="display: block; margin: auto;" /> ] .w-50[ * What about <center> <code style="color: orange">price ~ poly(carat, 2)</code>? </center> which is the same as fitting: `$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + e_i.$$` {{content}} ] ] -- * Should the assumption for error distribution be modified if so? {{content}} -- * Should we make some transformation before modelling? {{content}} -- * Are there other candidate models? --- # .circle.bg-black.white[2] Model form .f4[Part 2/2] * Notice that there was _**no formal statistical inference**_ when trying to determine an appropriate model form. -- * The goal of the main analysis is to characterise the price of a diamond by its carat. This may involve: * formal inference for model selection; * justification of the selected "final" model; and * fitting the final model. -- * There may be in fact many, many models considered but discarded at the IDA stage. -- * These discarded models are hardly ever reported. Consequently, majority of reported statistics give a distorted view and it's important to remind yourself what might _**not**_ be reported. --- # Model selection .blockquote.w-70[ All models are _approximate_ and _tentative_; approximate in the sense that no model is exactly true and tentative in that they may be modified in the light of further data .pull-right[—Chatfield (1985)] ] <br><br> -- .blockquote[ All models are wrong but some are useful .pull-right[—George Box] ] --- class: transition # Model diagnostics --- # Residuals .f4[1/2] .flex[ .w-70[ <img src="images/lecture-03B/plot-diamonds-resid-1.png" width="80%" style="display: block; margin: auto;" /> ] .w-30[ <br> Residual = Observed - Fitted <br><br> Residual plot: Plot the residual against explanatory variable (or Fitted value) .monash-orange2[Best] residual plot has not .monash-orange2[obvious pattern]. ] ] --- # Alternative approach: linearise relationship .flex[ .w-50[ <img src="images/lecture-03B/plot-diamonds-lm3-1.png" width="432" style="display: block; margin: auto;" /> ] .w-50[ <img src="images/lecture-03B/plot-diamonds-lm4-1.png" width="432" style="display: block; margin: auto;" /> ]] The .monash-orange2[log transformation of both variables] linearises the relationship, so that a simple linear model can be used, and also corrects the heteroskedasticity. --- # Residuals .f4[2/2] .flex[ .w-70[ <img src="images/lecture-03B/plot-diamonds-resid2-1.png" width="80%" style="display: block; margin: auto;" /> ] .w-30[ <br> Which has the best residual plot? ] ] --- class: middle center .blockquote[ "Teaching of Statistics should provide a more balanced blend of IDA and inference" .pull-right[Chatfield (1985)] ] -- <br><br> Yet there is still very little emphasis of it in teaching and also at times in practice. -- <br> So don't forget to do IDA! --- # Take away messages .flex[ .w-90.f2[ <ul class="fa-ul"> {{content}} </ul> ] ] -- <li><span class="fa-li"><i class="fas fa-paper-plane"></i></span><b><i>Initial data analysis</i></b> (IDA) is a model-focused exploration to support a confirmatory analysis with: <ul> <li><b><i> data description and collection</i></b> </li> <li><b><i> data quality checking, and</i></b></li> <li><b><i> checking assumptions</i></b> </li> <li><b><i> model fit</i></b> without any formal statistical inference.</li> </ul> </li> {{content}} -- <li><span class="fa-li"><i class="fas fa-paper-plane"></i></span>IDA may never see the limelight BUT it forms the foundation that the main analysis is built upon. Do it well!</li> --- background-size: cover class: title-slide background-image: url("images/bg-12.png") <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. .bottom_abs.width100[ Lecturer: *Di Cook* <i class="fas fa-envelope"></i> ETC5521.Clayton-x@monash.edu <i class="fas fa-calendar-alt"></i> Week 3 - Session 2 <br> ] <br><br>Lecture materials originally developed by Dr Emi Tanaka