class: middle center hide-slide-number monash-bg-gray80

.info-box.w-50.bg-white[
These slides are viewed best by Chrome or Firefox and occasionally need to be refreshed if elements did not load properly. See <a href=lecture-02A.pdf>here for the PDF <i class="fas fa-file-pdf"></i></a>.
]

<br>

.white[Press the **right arrow** to progress to the next slide!]

---
class: title-slide
count: false
background-image: url("images/bg-12.png")

# .monash-blue[ETC5521: Exploratory Data Analysis]

<h1 class="monash-blue" style="font-size: 30pt!important;"></h1>

<br>

<h2 style="font-weight:900!important;">Learning from history</h2>

.bottom_abs.width100[

Lecturer: *Di Cook*

<i class="fas fa-envelope"></i> ETC5521.Clayton-x@monash.edu

<i class="fas fa-calendar-alt"></i> Week 2 - Session 2

<br>
]

---
class: transition middle animated slideInLeft

## <i class="fas fa-pencil-alt"></i> <i class="fas fa-sticky-note"></i> Easy summaries -- numerical and graphical

---

# Hinges and 5-number summaries

.pull-left[
.font_smaller[
```
## [1] -3.2 -1.7 -0.4 0.1
## [5] 0.3 1.2 1.5 1.8
## [9] 2.4 3.0 4.3 6.4
## [13] 9.8
```
]

You know the median is the middle number. What's a hinge?

There are 13 data values here, provided already sorted. We are going to write them out, evenly spaced, in what Tukey named a down-up-down-up pattern.

.font_smaller[.monash-blue2[Median will be 7th, hinge will be 4th from each end.]]

]

.pull-right[
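A minimal sketch, assuming the 13 values are typed in by hand as `x` (a name chosen here for illustration): base R's `fivenum()` returns Tukey's five-number summary, and its second and fourth entries are the hinges.

```r
# The 13 sorted values from the left-hand column
x <- c(-3.2, -1.7, -0.4, 0.1, 0.3, 1.2, 1.5, 1.8, 2.4, 3.0, 4.3, 6.4, 9.8)

# minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)
print(five)

# Counting in from the ends gives the same hinges and median:
# 4th value, 7th value, and 4th from the top (the 10th)
by_counting <- sort(x)[c(4, 7, 10)]
print(by_counting)
```

For other sample sizes the hinges can differ slightly from the quartiles returned by `quantile()`.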
]

---

# Hinges and 5-number summary

.pull-left[
<img src="images/lecture-02B/canvas3-IMG.png" width="80%">
]

.pull-right[
<img src="images/lecture-02B/hinges.png" width="80%">
]

Hinges are alternatively known as Q1 and Q3.

---

# Box-and-whisker display

.pull-left[
<img src="images/lecture-02B/canvas3-IMG.png" width="80%">
]

.pull-right[
Starting with a 5-number summary

<img src="images/lecture-02B/5-number.png" width="80%">
]

---

# Box-and-whisker display

.pull-left[
Starting with a 5-number summary

<img src="images/lecture-02B/5-number.png" width="80%">
]

.pull-right[
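A rough sketch of the same step in base R, assuming the 13 values from the hinges slide are stored in `x` (our name, not from the slides): with `coef = 0`, `boxplot.stats()` runs the whiskers out to the extremes, so the display is built from exactly the 5-number summary.

```r
# The 13 sorted values used earlier
x <- c(-3.2, -1.7, -0.4, 0.1, 0.3, 1.2, 1.5, 1.8, 2.4, 3.0, 4.3, 6.4, 9.8)

# coef = 0: whiskers reach the smallest and largest values,
# and no individual points are singled out
bw <- boxplot.stats(x, coef = 0)
print(bw$stats)  # whisker end, hinge, median, hinge, whisker end

# range = 0 is the matching argument when drawing the display
boxplot(x, range = 0, horizontal = TRUE)
```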
]

---

# Identified end values

.pull-left[
<img src="images/lecture-02B/box_and_whisker.png" width="50%">

Why are some individual points singled out?
]

.pull-right[
<img src="images/lecture-02B/schematic.png" width="50%">

Rules for this one may be clearer?
]

---
class: motivator middle center

## 🙀 Isn't this imposing a belief?

---
class: middle center

.outline-text[## There is no excuse for failing to plot and look]

<br>
<br>

.font_smaller[Another Tukey wisdom drop]

---
background-image: \url(images/lecture-02B/schematic.png)
background-size: 20%
background-position: 99% 50%

# Fences and outside values

- H-spread: difference between the hinges (we would call this the inter-quartile range)
- step: 1.5 times the H-spread
- inner fences: 1 step outside the hinges
- outer fences: 2 steps outside the hinges
- the value at each end closest to, but still inside, the inner fence is "adjacent"
- values between an inner fence and its neighbouring outer fence are "outside"
- values beyond the outer fences are "far out"
- these rules produce a SCHEMATIC PLOT

---

# New statistics: trimeans

The number that comes closest to

`$$\frac{\text{lower hinge} + 2\times \text{median} + \text{upper hinge}}{4}$$`

is the **trimean**.

<br>
<br>

Think about trimmed means, where we might drop the highest and lowest 5% of observations.

---

# Letter value plots

.pull-left[
Why break the data into quarters? Why not eighths, sixteenths? k-number summaries?

What does a 7-number summary look like?

<img src="images/lecture-02B/7-number.png" width="80%">

.monash-orange2[How would you make an 11-number summary?]
]

.pull-right[
.font_smaller[
```r
library(ggplot2)  # provides ggplot() and the mpg data
library(lvplot)   # provides geom_lv()
p <- ggplot(mpg, aes(class, hwy))
*p + geom_lv(aes(fill=..LV..)) + scale_fill_brewer()
```

<img src="images/lecture-02B/lvplot-1.png" width="100%" style="display: block; margin: auto;" />
]
]

---
class: informative middle

## Box plots are ubiquitous today.

🐶 🐱 Mostly used to compare distributions across multiple subsets of the data.

Puts the emphasis on the
middle 50%
of observations, although variations can put emphasis on other aspects.

---
class: transition middle animated slideInLeft

## Easy re-expression

---

# Logs, square roots, reciprocals

.pull-left[
What do you need to know about logs?

- how to find good enough logs fast and easily
- that equal differences in logs correspond to equal ratios of raw values.

.font_smaller[(This means that wherever you find people using products or ratios--even in such things as price indexes--using logs--thus converting products to sums and ratios to differences--is likely to help.)]
]

--

.pull-right[
The most common transformations are logs, square roots, reciprocals, and reciprocals of square roots

<center>
-1, -1/2, +1/2, +1
</center>

What happened to ZERO?

--
It turns out that the role of a zero power is, for the purposes of re-expression, neatly filled by the logarithm.
]

---

## Re-express to symmetrize the distribution

<img src="images/lecture-02B/logs.png" width="50%">

---
class: center middle

## Power ladder

<br>
<br>
<br>
<i class="fas fa-arrow-left faa-passing animated-hover " style=" color:orangered;"></i>
fix RIGHT-skewed values
<br> <br> -2, -1, -1/2, 0 (log), 1/3, 1/2, .font_large[.monash-orange2[1]], 2, 3, 4 <br>
<i class="fas fa-arrow-right faa-passing-reverse animated-hover " style=" color:orangered;"></i>
fix LEFT-skewed values
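
---

# Power ladder: a sketch in R

A minimal sketch of stepping down the ladder, not taken from the slides: `ladder()` and `hinge_skew()` are hypothetical helpers, and the data are made-up right-skewed values. The measure compares the two hinge-to-median distances, so it is zero when the middle half is symmetric.

```r
# Hypothetical helper: re-express positive values x with power p.
# p = 0 stands for the log; negative powers are negated to keep the ordering.
ladder <- function(x, p) {
  if (p == 0) log(x) else sign(p) * x^p
}

# Hypothetical helper: hinge-based skewness, scaled by the H-spread.
hinge_skew <- function(x) {
  f <- fivenum(x)  # min, lower hinge, median, upper hinge, max
  ((f[4] - f[3]) - (f[3] - f[2])) / (f[4] - f[2])
}

# Made-up right-skewed data: each value doubles the previous one
x <- 2^(0:6)

# Walk down the ladder from the raw values (power 1) towards -1
powers <- c(1, 1/2, 0, -1/2, -1)
skew <- sapply(powers, function(p) hinge_skew(ladder(x, p)))
names(skew) <- c("1", "1/2", "0 (log)", "-1/2", "-1")
print(round(skew, 3))
```

Positive values indicate right skew and negative values left skew. For these doubling values the log step gives a measure of essentially zero, and going further down the ladder overshoots into left skew.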
---
class: middle center

.outline-text[## We now regard re-expression as a tool, something to let us do a better job of grasping. The grasping is done with the eye and the better job is through a more symmetric appearance.]

<br>
<br>
<br>

.font_smaller[Another Tukey wisdom drop]

---

# Linearising bivariate relationships

<img src="images/lecture-02B/linearise1.png" width="34%"> <img src="images/lecture-02B/linearise2.png" width="34%"> <img src="images/lecture-02B/linearise3.png" width="29%">

<br>

.monash-orange2[Surprising observation: The small fluctuations in later years]. Apparently these were traced to data collection errors or problems.

.monash-blue2[I think there is another possible reason. Do you?]

---

# Linearising bivariate relationships

<img src="images/lecture-02B/linearise4.png" width="32%"> <img src="images/lecture-02B/linearise5.png" width="32%"> <img src="images/lecture-02B/linearise6.png" width="32%">

<br>

See some fluctuations in the early years, too. .monash-blue2[Note that the log transformation couldn't linearise the relationship.]

---
class: informative center middle

.outline-text[

## Whatever the data, we can try to gain by straightening or by flattening.

<br>

## When we succeed in doing one or both, we almost always see more clearly what is going on.
]

---

# Rules and advice

.pull-left[
.font_medium2[
1.Graphics are friendly.<br>
2.Arithmetic often exists to make graphs possible.<br>
3..monash-orange2[Graphs force us to note the unexpected]; nothing could be more important.<br>
4.Different graphs show us quite different aspects of the same data.<br>
5.There is .monash-orange2[no more reason to expect one graph to "tell all"] than to expect one number to do the same.<br>
6."Plotting `\(y\)` against `\(x\)`" involves significant choices--how we express one or both variables can be crucial.<br>
]]

--

.pull-right[
.font_smaller[
7.The first step in penetrating plotting is to straighten out the dependence or point scatter as much as reasonable.<br>
8.Plotting `\(y^2\)`, `\(\sqrt{y}\)`, `\(\log(y)\)`, `\(-1/y\)` or the like instead of `\(y\)` is one plausible step to take in search of straightness.<br>
9.Plotting `\(x^2\)`, `\(\sqrt{x}\)`, `\(\log(x)\)`, `\(-1/x\)` or the like instead of `\(x\)` is another.<br>
10.Once the plot is straightened, we can usually gain much by flattening it, usually by plotting residuals.<br>
11.When plotting scatters, we may need to be careful about how we express `\(x\)` and `\(y\)` in order to avoid concealment by crowding.<br>
]]

---
class: middle
background-image: \url(https://vignette.wikia.nocookie.net/starwars/images/d/d6/Yoda_SWSB.png/revision/latest?cb=20150206140125)
background-size: cover

.monash-white[The book is a digest of] ⭐ .monash-white[tricks and treats] ⭐ .monash-white[of massaging numbers and drafting displays.]

.monash-white[Many of the tools have made it into today's analyses in various ways. Many have not.]

.monash-white[Notice the word developments too:] .monash-pink2[froots, fences]. .monash-white[Tukey brought you the word] .monash-pink2["software"].monash-white[!]

.monash-white[The temperament of the book is an inspiration for the mind-set for this unit. There is such delight in working with numbers!]
We love data!
---

# Resources

- [Wikipedia](https://en.wikipedia.org/wiki/Exploratory_data_analysis)
- John W. Tukey (1977) Exploratory data analysis
- Data coding using [`tidyverse` suite of R packages](https://www.tidyverse.org)
- Sketching canvases made using [`fabricerin`](https://ihaddadenfodil.com/post/fabricerin-a-tutorial/)
- Slides constructed with [`xaringan`](https://github.com/yihui/xaringan), [remark.js](https://remarkjs.com), [`knitr`](http://yihui.name/knitr), and [R Markdown](https://rmarkdown.rstudio.com).

---
background-size: cover
class: title-slide
background-image: url("images/bg-12.png")

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.

.bottom_abs.width100[

Lecturer: *Di Cook*

<i class="fas fa-envelope"></i> ETC5521.Clayton-x@monash.edu

<i class="fas fa-calendar-alt"></i> Week 2 - Session 2

<br>
]