class: middle center hide-slide-number monash-bg-gray80

.info-box.w-50.bg-white[
These slides are viewed best in Chrome or Firefox and occasionally need to be refreshed if elements did not load properly. See <a href=lecture-02A.pdf>here for the PDF <i class="fas fa-file-pdf"></i></a>.
]

<br>

.white[Press the **right arrow** to progress to the next slide!]

---

class: title-slide
count: false
background-image: url("images/bg-12.png")

# .monash-blue[ETC5521: Exploratory Data Analysis]

<h1 class="monash-blue" style="font-size: 30pt!important;"></h1>

<br>

<h2 style="font-weight:900!important;">Learning from history</h2>

.bottom_abs.width100[
Lecturer: *Di Cook*

<i class="fas fa-envelope"></i> ETC5521.Clayton-x@monash.edu

<i class="fas fa-calendar-alt"></i> Week 2 - Session 1

<br>
]

---

background-image: url("images/lecture-02A/tukey_cover.png")
background-size: 50%
background-position: 5% 15%

# Birth of EDA

.pull-right[
The field of exploratory data analysis came of age when this book appeared in 1977.

<br>

.monash-blue[*Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing (confirmatory data analysis); more emphasis needed to be placed on using data to suggest hypotheses to test.*]
]

---

# John W. Tukey

.pull-left[
<img src="https://upload.wikimedia.org/wikipedia/en/e/e9/John_Tukey.jpg" style="width: 400px; border-radius: 50%">
]

.pull-right[
- Born in 1915, in New Bedford, Massachusetts.
- Mum was a private tutor who home-schooled John. Dad was a Latin teacher.
- BA and MSc in Chemistry, and PhD in Mathematics.
- Awarded the National Medal of Science in 1973 by President Nixon.
- By some reports, his home-schooling was unorthodox and contributed to his thinking and working differently.
]

---

class: informative

# Taking a glimpse back in time is possible with the [American Statistical Association video lending library](https://www.youtube.com/watch?v=B7XoW2qiFUA).

<br>

We're going to watch John Tukey talking about exploring high-dimensional data with an amazing new computer in 1973, four years before the EDA book.

<br>
<i class="fas fa-lightbulb faa-float animated " style=" color:yellow;"></i>
.monash-pink2[Look out for these things:]
How Tukey's expertise is described (*trial and error learning*), and the computing equipment of the time.
.footnote[First 4.25 minutes of https://www.youtube.com/embed/B7XoW2qiFUA]

---

<iframe width="840" height="630" src="https://www.youtube.com/embed/B7XoW2qiFUA" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

---

.pull-left[
<img src="images/lecture-02A/tukey_cover.png" style="width: 400px; border-radius: 30%">
]

.pull-right[
<img src="images/lecture-02A/pencil_and_paper.png" width="80%">
]

---

# Setting the frame of mind

.overflow-scroll.h-80[.border-box[
This book is based on an important principle.

<br>

**It is important to understand what you CAN DO before you learn to measure how WELL you seem to have DONE it.**

<br>

Learning first what you can do will help you to work more easily and effectively.

<br>

This book is about exploratory data analysis, about looking at data to see what it seems to say. It concentrates on simple arithmetic and easy-to-draw pictures. It regards whatever appearances we have recognized as partial descriptions, and tries to look beneath them for new insights. Its concern is with appearance, not with confirmation.

<br>

**Examples, NOT case histories**

<br>

The book does not exist to make the case that exploratory data analysis is useful. Rather it exists to expose its readers and users to a considerable variety of techniques for looking more effectively at one's data. The examples are not intended to be complete case histories. Rather they show isolated techniques in action on real data. The emphasis is on general techniques, rather than specific problems.

<br>

A basic problem about any body of data is to make it more easily and effectively handleable by minds -- our minds, her mind, his mind. To this general end:

- anything that makes a simpler description possible makes the description more easily handleable.
- anything that looks below the previously described surface makes the description more effective.

<br>

So we shall always be glad (a) to simplify description and (b) to describe one layer deeper. In particular,

- to be able to say that we looked one layer deeper, and found nothing, is a definite step forward -- though not as far as to be able to say that we looked deeper and found thus-and-such.
- to be able to say that "if we change our point of view in the following way ... things are simpler" is always a gain -- though not quite so much as to be able to say "if we don't bother to change our point of view (some other) things are equally simple."

<br>

...

<br>

Consistent with this view, we believe, is a clear demand that pictures based on exploration of data should *force* their messages upon us. Pictures that emphasize what we already know -- "security blankets" to reassure us -- are frequently not worth the space they take. Pictures that have to be gone over with a reading glass to see the main point are wasteful of time and inadequate of effect. **The greatest value of a picture** is when it *forces* us to notice **what we never expected to see.**

<br>

<center> <b>Confirmation</b> </center>

<br>

The principles and procedures of what we call confirmatory data analysis are both widely used and one of the great intellectual products of our century. In their simplest form, these principles and procedures look at a sample -- and at what that sample has told us about the population from which it came -- and assess the precision with which our inference from sample to population is made. We can no longer get along without confirmatory data analysis.
<b>But we need not start with it.</b>

<br>

The best way to <b>understand what CAN be done is no longer</b> -- if it ever was -- <b>to ask what things could</b>, in the current state of our skill and techniques, <b>be confirmed</b> (positively or negatively). Even more understanding is <em>lost</em> if we consider each thing we can do to data <em>only</em> in terms of some set of very restrictive assumptions under which that thing is best possible -- assumptions we <em>know we CANNOT check in practice</em>.

<center> <b>Exploration AND confirmation</b> </center>

Once upon a time, statisticians only explored. Then they learned to confirm exactly -- to confirm a few things exactly, each under very specific circumstances. As they emphasized exact confirmation, their techniques inevitably became less flexible. The connection of the most used techniques with past insights was weakened. Anything to which confirmatory procedure was not explicitly attached was decried as "mere descriptive statistics", no matter how much we learned from it.

<br>

Today, the flexibility of (approximate) confirmation by the jackknife makes it relatively easy to ask, for almost any clearly specified exploration, "How far is it confirmed?"

<br>

**Today, exploratory and confirmatory can -- and should -- proceed side by side**. This book, of course, considers only exploratory techniques, leaving confirmatory techniques to other accounts.

<br>

<center> <b> About the problems </b> </center>

<br>

The teacher needs to be careful about assigning problems. Not too many, please. They are likely to take longer than you think. The number supplied is to accommodate diversity of interest, not to keep everybody busy.

<br>

Besides the length of our problems, both teacher and student need to realise that many problems do not have a single "right answer". There can be many ways to approach a body of data. Not all are equally good. For some bodies of data this may be clear, but for others we may not be able to tell from a single body of data which approach is preferred. Even several bodies of data about very similar situations may not be enough to show which approach should be preferred. Accordingly, it will often be quite reasonable for different analysts to reach somewhat different analyses.

<br>

Yet more -- to unlock the analysis of a body of data, to find the good way to approach it, may require a key, whose finding is a creative act. Not everyone can be expected to create the key to any one situation. And to continue to paraphrase Barnum, no one can be expected to create a key to each situation he or she meets.

<br>

**To learn about data analysis, it is right that each of us try many things that do not work** -- that we tackle more problems than we make expert analyses of. We often learn less from an expertly done analysis than from one where, by not trying something, we missed -- at least until we were told about it -- an opportunity to learn more. Each teacher needs to recognize this in grading and commenting on problems.

<br>

<center><b> Precision</b></center>

The teacher who heeds these words and admits that there need be *no one correct approach* may, I regret to contemplate, still want whatever is done to be digit perfect. (Under such a requirement, the writer should still be able to pass the course, but it is not clear whether she would get an "A".)
One does, from time to time, have to produce digit-perfect, carefully checked results, but forgiving techniques that are not too disturbed by unusual data are also, usually, *little disturbed by SMALL arithmetic errors*. The techniques we discuss here have been chosen to be forgiving. It is hoped, then, that small arithmetic errors will take little off the problem's grades, leaving severe penalties for larger errors, either of arithmetic or concept.
]]

---

# Outline

.pull-left[
.monash-orange2[1. Scratching down numbers]<br>
.monash-orange2[2. Schematic summary]<br>
.monash-orange2[3. Easy re-expression]<br>
4. Effective comparison<br>
5. Plots of relationship<br>
6. Straightening out plots (using three points)<br>
7. Smoothing sequences<br>
8. Parallel and wandering schematic plots<br>
9. Delineations of batches of points<br>
10. Using two-way analyses<br>
]

.pull-right[
11. Making two-way analyses<br>
12. Advanced fits<br>
13. Three-way fits<br>
14. Looking in two or more ways at batches of points<br>
15. Counted fractions<br>
16. Better smoothing<br>
17. Counts in bin after bin<br>
18. Product-ratio plots<br>
19. Shapes of distributions<br>
20. Mathematical distributions<br>
]

---

class: transition middle animated slideInLeft

## Here we go <i class="fas fa-pencil-alt"></i> <i class="fas fa-sticky-note"></i>

---

# Scratching down numbers

Prices of Chevrolets in the local used-car newspaper ads of 1968.

Stem-and-leaf plot: still seen in introductory statistics books.
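The R demo a few slides ahead calls `stem(chevrolets$prices)`, but the `chevrolets` object is never defined in these slides. Here is a minimal sketch for following along -- the 17 prices below are the ones listed in Tukey (1977), so treat the exact values as an assumption:

```r
library(tibble)

# Asking prices ($US) for used Chevrolets in 1968 newspaper ads,
# transcribed from Tukey (1977) -- treat the values as illustrative
chevrolets <- tibble(
  prices = c(250, 150, 795, 895, 695, 1699, 1499, 1099, 1693,
             1166, 688, 1333, 895, 1775, 895, 1895, 795)
)

stem(chevrolets$prices)
```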
---

.pull-left[
First stem-and-leaf, first digit on stem, second digit on leaf

<img src="images/lecture-02A/canvas1-IMG (5).png" width="90%">
]

--

.pull-right[
Order any leaves which need it, e.g. stem 6

<img src="images/lecture-02A/canvas1-IMG (6).png" width="90%">
]

--

<br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br>

.monash-pink2[*A benefit is that the numbers can be read off the plot, but the focus is still on the pattern. Also quantiles, like the median, can be computed easily.*]

---

.pull-left[
Shrink the stem

<img src="images/lecture-02A/canvas1-IMG (8).png" width="90%">
]

.pull-right[
Shrink the stem more

<img src="images/lecture-02A/canvas1-IMG (9).png" width="90%">
]

---

# And, in R ...

.font_small[
```r
stem(chevrolets$prices)
```

```
## 
## The decimal point is 3 digit(s) to the right of the |
## 
##   0 | 23
##   0 | 7788999
##   1 | 123
##   1 | 57789
```
]

---

# <i class="fas fa-bookmark"></i> Remember the tips data

.font_smaller2[
```r
tips <- read_csv("http://ggobi.org/book/data/tips.csv")
stem(tips$tip, scale=0.5, width=120)
```

```
## 
## The decimal point is at the |
## 
##    1 | 000001233334445555555555556666667777788889
##    2 | 000000000000000000000000000000000000000001122222223333555555555555556666677788899
##    3 | 00000000000000000000000011111112222222333344445555555555555666778889
##    4 | 0000000000001112233335777
##    5 | 00000000001122226799
##    6 | 05577
##    7 | 6
##    8 | 
##    9 | 0
##   10 | 0
```
]

---

# Refining the size

.pull-left[
<img src="images/lecture-02A/stem_stretched.png" width="90%">
]

.pull-right[
<img src="images/lecture-02A/stem_5line.png" width="80%">
]

---

.scroll-output.h-60[.dc_font_smaller2[
```r
stem(tips$tip, scale=2)
```

```
## 
## The decimal point is 1 digit(s) to the left of the |
## 
##    10 | 0000107
##    12 | 55526
##    14 | 44578000000000678
##    16 | 1346781356
##    18 | 032678
##    20 | 00000000000000000000000000000000011233598
##    22 | 0033440114
##    24 | 5700000000002456
##    26 | 01412455
##    28 | 382
##    30 | 00000000000000000000000267891245688
##    32 | 133557159
##    34 | 0188800000000015
##    36 | 0181566
##    38 | 2
##    40 | 0000000000006889
##    42 | 09004
##    44 | 0
##    46 | 713
##    48 | 
##    50 | 000000000074567
##    52 | 0
##    54 | 
##    56 | 05
##    58 | 52
##    60 | 0
##    62 | 
##    64 | 00
##    66 | 03
##    68 | 
##    70 | 
##    72 | 
##    74 | 8
##    76 | 
##    78 | 
##    80 | 
##    82 | 
##    84 | 
##    86 | 
##    88 | 
##    90 | 0
##    92 | 
##    94 | 
##    96 | 
##    98 | 
##   100 | 0
```
]]

---

class: informative middle

## 💠 Similar information to the histogram. Generally it is possible to also read off the numbers, and then easily calculate the median, Q1 or Q3. However, it's really designed for small data sets, and for pencil and paper.

---

# A different style of number scratching

.pull-left[
We know about

<img src="images/lecture-02A/tally.png" width="90%">

but it's too easy to

<img src="images/lecture-02A/tally_error.png" width="90%">

make a mistake
]

.pull-right[
Try this instead

<img src="images/lecture-02A/squares.png" width="90%">
]

---

# Count this data using the squares approach.
.pull-left[
```
##  [1] "Sun" "Sun" "Sun" "Sun"
##  [5] "Sun" "Sun" "Sun" "Sun"
##  [9] "Sun" "Sun" "Sun" "Sun"
## [13] "Sun" "Sun" "Sun" "Sun"
## [17] "Sun" "Sun" "Sun" "Sat"
## [21] "Sat" "Sat" "Sat" "Sat"
## [25] "Sat" "Sat" "Sat" "Sat"
## [29] "Sat" "Sat" "Sat" "Sat"
## [33] "Sat" "Sat" "Sat" "Sat"
## [37] "Sat" "Sat" "Sat" "Sat"
## [41] "Sat" "Sun" "Sun" "Sun"
## [45] "Sun" "Sun" "Sun" "Sun"
## [49] "Sun" "Sun" "Sun" "Sun"
## [53] "Sun" "Sun" "Sun" "Sun"
## [57] "Sat" "Sat" "Sat" "Sat"
## [61] "Sat" "Sat" "Sat" "Sat"
## [65] "Sat" "Sat" "Sat" "Sat"
```
]

.pull-right[
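Tally by hand with squares first. Then, as a cross-check, a minimal R sketch -- assuming the values shown at left are stored in a character vector called `days` (a hypothetical name):

```r
# Tabulate the Sat/Sun counts to compare
# against your squares tally
table(days)
```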
]

---

class: middle center

.info-box[
## What does it mean to "feel what the data are like?"
]

---

.pull-left[
<img src="images/lecture-02A/stem_alaska.png" width="90%">
]

.pull-right[
This is a stem-and-leaf plot of the height of the highest peak in each of the 50 US states.

<br>

The states roughly fall into three groups.

<br>

.font_smaller[.monash-blue2[It's not really surprising, but we can imagine this grouping. Alaska is in a group of its own, with a much higher peak. Then the Rocky Mountain states, California, Washington and Hawaii also have high peaks, and the rest of the states lump together.]]
]

---

class: middle

.info-box[
## Exploratory data analysis is detective work -- in the purest sense -- finding and revealing the clues.
]

---

# Resources

- [wikipedia](https://en.wikipedia.org/wiki/Exploratory_data_analysis)
- John W. Tukey (1977) *Exploratory Data Analysis*, Addison-Wesley
- Data coding using the [`tidyverse` suite of R packages](https://www.tidyverse.org)
- Sketching canvases made using [`fabricerin`](https://ihaddadenfodil.com/post/fabricerin-a-tutorial/)
- Slides constructed with [`xaringan`](https://github.com/yihui/xaringan), [remark.js](https://remarkjs.com), [`knitr`](http://yihui.name/knitr), and [R Markdown](https://rmarkdown.rstudio.com).

---

background-size: cover
class: title-slide
background-image: url("images/bg-12.png")

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.

.bottom_abs.width100[
Lecturer: *Di Cook*

<i class="fas fa-envelope"></i> ETC5521.Clayton-x@monash.edu

<i class="fas fa-calendar-alt"></i> Week 2 - Session 1

<br>
]