class: middle center hide-slide-number monash-bg-gray80 .info-box.w-50.bg-white[ These slides are viewed best by Chrome or Firefox and occasionally need to be refreshed if elements did not load properly. See <a href=lecture-08B.pdf>here for the PDF <i class="fas fa-file-pdf"></i></a>. ] <br> .white[Press the **right arrow** to progress to the next slide!] --- class: title-slide count: false background-image: url("images/bg-12.png") # .monash-blue[ETC5521: Exploratory Data Analysis] <h1 class="monash-blue" style="font-size: 30pt!important;"></h1> <br> <h2 style="font-weight:900!important;">Going beyond two variables, exploring high dimensions</h2> .bottom_abs.width100[ Lecturer: *Di Cook* <i class="fas fa-envelope"></i> ETC5521.Clayton-x@monash.edu <i class="fas fa-calendar-alt"></i> Week 8 - Session 2 <br> ] --- class: transition # This lecture linked brushing between plots<br> parallel rather than orthogonal axes<br> tours (rotations) through high-dimensions --- # .orange[Case study] .bg-orange.circle[4] Penguins .panelset[ .panel[.panel-name[🖼️] .flex[ .w-50[ Size measurements for adult foraging penguins near Palmer Station, Antarctica - species: a factor denoting penguin species (Adélie, Chinstrap and Gentoo) - bill_length_mm: a number denoting bill length (millimeters) - bill_depth_mm: a number denoting bill depth (millimeters) - flipper_length_mm: an integer denoting flipper length (millimeters) - body_mass_g:an integer denoting body mass (grams) ] .w-50[ <img src="images/lecture-08B/penguinsscat-1.png" width="80%" style="display: block; margin: auto;" /> ]] ] .panel[.panel-name[️R] ```r library(dplyr) library(palmerpenguins) penguins <- penguins %>% filter(!is.na(bill_length_mm)) %>% rename(bl = bill_length_mm, bd = bill_depth_mm, fl = flipper_length_mm, bm = body_mass_g) ggscatmat(penguins, columns=3:6, col="species", alpha=0.5) + scale_colour_ochre("", palette="nolan_ned") + theme(aspect.ratio=1, legend.position="bottom") ``` ] ] --- # Linking between plots .flex[ .w-60[
] .w-40[ If you have interactive plots you can investigate whether - an outlier in one or two variables is one of the extremes in other variables `\(\longrightarrow\)` then it is a multivariate outlier - clusters of observations in one plot are concentrated in a cluster in other variables `\(\longrightarrow\)` the data is multimodal in multivariate space, and there are likely sub-populations. ] ] </div> --- <center>
<i class="fas fa-wrench fa-3x faa-wrench animated-hover faa-slow " style=" color:#D93F00;"></i>
Your turn, .monash-blue[cut and paste the code] into your R console, and
select
regions of the resulting plot to examine where these points lie in other plots. </center> <br> <br> .font_medium[ ```r # Load the penguins data with code from previous slide library(tidyverse) library(tourr) library(plotly) highlight_key(penguins) %>% GGally::ggpairs(aes(color = species), columns = 3:6) %>% ggplotly() %>% highlight("plotly_selected") ``` ]
−
+
05
:
00
--- class: transition middle # History --- background-image: \url(https://senate.universityofcalifornia.edu/_files/images/in-memoriam/Newton-Carol.jpg) background-size: 20% background-position: 95% 5% <img src="images/week7/newton_brushing.png" width="750px"> .monash-purple2[First linked brushing across a set of scatterplots done by Carol Newton (1978) "Graphics from Alpha to Omega in Data Analysis" in Graphical Representation of Multivariate Data.] --- <center> <iframe width="896" height="504" src="https://www.youtube.com/embed/At-tPS-mFsE" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> </center> Brushing on a scatterplot matrix, Rick Becker (Becker and Cleveland, 1987 Brushing Scatterplots, Technometrics) --- class: transition # Parallel coordinate plots Another way to display multivariate display --- You can think about parallel coordinate plots like time series plots. Each variable is treated as a parallel axis. Observations are drawn as a line across the axes.
Median house price by date for a number of cities in Texas. --- .panelset[ .panel[.panel-name[learn] What can you learn? - lines that are parallel indicate positive linear association, possibly across many variables - lines that cross indicate negative linear association - lines that go up and down differently from any other lines indicate multivariate outliers - lines that go up and down together, but differently from other lines indicate multivariate clustering > You need to have an .monash-blue2[interactive] parallel coordinate plot for them to be effective for exploring data ] .panel[.panel-name[️R] ```r # load the `txhousing` dataset data(txhousing, package = "ggplot2") # declare `city` as the SQL 'query by' column tx <- highlight_key(txhousing, ~city) # initiate a plotly object base <- plot_ly(tx, color = I("black")) %>% group_by(city) # create a time series of median house price base %>% group_by(city) %>% add_lines(x = ~date, y = ~median) ``` ] ] --- <center>
<i class="fas fa-wrench fa-3x faa-wrench animated-hover faa-slow " style=" color:#D93F00;"></i>
Your turn, .monash-blue[cut and paste the code] into your R console, and
click
in the resulting plot to examine the line for an observation. </center> .s500[ ```r # Using the same penguins data subset as earlier in the slides library(shiny) library(plotly) library(tidyverse) ui <- fluidPage( plotlyOutput("parcoords"), verbatimTextOutput("data")) server <- function(input, output, session) { penguins_numeric <- penguins[,3:6] %>% na.omit() output$parcoords <- renderPlotly({ dims <- Map(function(x, y) { list(values = x, range = range(x), label = y) }, penguins_numeric, names(penguins_numeric), USE.NAMES = FALSE) plot_ly(type = 'parcoords', dimensions = dims, source = "pcoords") %>% layout(margin = list(r = 30)) %>% event_register("plotly_restyle") }) # maintain a collection of selection ranges # since each parcoord dimension is allowed to have multiple # selected ranges, this reactive values data structure is # allowed # list( # var1 = list(c(min1, max1), c(min2, max2), ...), # var2 = list(c(min1, max1)), # ... # ) ranges <- reactiveValues() observeEvent(event_data("plotly_restyle", source = "pcoords"), { d <- event_data("plotly_restyle", source = "pcoords") # what is the relevant dimension (i.e. variable)? dimension <- as.numeric(stringr::str_extract(names(d[[1]]), "[0-9]+")) # If the restyle isn't related to a dimension, exit early. if (!length(dimension)) return() # careful of the indexing in JS (0) versus R (1)! dimension_name <- names(penguins_numeric)[[dimension + 1]] # a given dimension can have multiple selected ranges # these will come in as 3D arrays, but a list of vectors # is nicer to work with info <- d[[1]][[1]] ranges[[dimension_name]] <- if (length(dim(info)) == 3) { lapply(seq_len(dim(info)[2]), function(i) info[,i,]) } else { list(as.numeric(info)) } }) ## filter to the rows that match the selection ranges penguins_selected <- reactive({ keep <- TRUE for (i in names(ranges)) { range_ <- ranges[[i]] keep_var <- FALSE for (j in seq_along(range_)) { rng <- range_[[j]] keep_var <- keep_var | dplyr::between(penguins[[i]], min(rng), max(rng)) } keep <- keep & keep_var } penguins[keep, ] }) output$data <- renderPrint({ tibble::as_tibble(penguins_selected()) }) } shinyApp(ui, server) ``` ]
−
+
05
:
00
--- class: transition # Dynamic graphics: tours Remember Tukey's PRIM-9?? This is the evolution of that, and allows us to look at all possible linear combination of variables. --- # Touring between pairs of variables .flex[ .w-50[ Dinosaurus dozen in wide form as multivariate data. Rotate one pair of variables into the other - special type of interpolation/animation called a *little tour* <center> <img src="images/week7/datasaurus_diagram.png" width="400px"> </center> ] .w-50[ <img src="images/week7/datasaurustour.gif" width="80%"> ] ] --- # Tours of multivariate data We look at .monash-orange2[combinations of variables], as well as the individual variables. The grand tour does this in an elegant way so that there's a chance of seeing all possible combinations of the variables, and it glues these together so that there is a smooth change from one to another. With an important note: that always when scatterplots (or higher dimensional combinations are shown) that the axes are always orthogonal. Using combinations of variables, it can be possible to see .monash-blue2[separations between groups, outliers, linear and non-linear dependencies] that were not visible in the single variable plots. <br><br><center> .monash-orange2[3D plots aren't enough. You need tours to find unusual multiple variable relationsips.] </center> --- # Our first grand tour .flex[ .w-50[ Here's the code ```r clrs <- ochre_pal( palette="nolan_ned")(3) col <- clrs[ as.numeric( penguins$species)] *animate_xy(penguins[,3:6], * col=col, * axes="off") ``` ] .w-50[ <center> <img src="images/week7/penguins2d.gif" width="80%"> </center> ] ] --- # .orange[Case study] .bg-orange.circle[4] Penguins What do you learn about this data? - clusters ✅ -- - outliers ✅ -- - linear dependence ✅ -- - elliptical clusters with slightly different shapes ✅ -- - separated elliptical clusters with slightly different shapes ✅ -- --- # What is a tour? .flex[ .w-50[ A grand tour is by definition a movie of low-dimensional projections constructed in such a way that it comes arbitrarily close to showing all possible low-dimensional projections; in other words, a grand tour is a space-filling curve in the manifold of low-dimensional projections of high-dimensional data spaces. ] .w-50[ `\({\mathbf x}_i \in R^p\)`, `\(i^{th}\)` data vector `\(F\)` is a `\(p\times d\)` orthonormal matrix, `\(F'F=I_d\)`, where `\(d\)` is the projection dimension. The projection of `\({\mathbf x_i}\)` onto `\(F\)` is `\({\mathbf y}_i=F'{\mathbf x}_i\)`. Tour is indexed by time, `\(F(t)\)`, where `\(t\in [a, z]\)`. Starting and target frame denoted as `\(F_a = F(a), F_z=F(t)\)`. The animation of the projected data is given by a path `\({\mathbf y}_i(t)=F'(t){\mathbf x}_i\)`. ] ] --- # Geodesic interpolation between planes .flex[ .w-50[ Tour is indexed by time, `\(F(t)\)`, where `\(t\in [a, z]\)`. Starting and target frame denoted as `\(F_a = F(a), F_z=F(t)\)`. The animation of the projected data is given by a path `\({\mathbf y}_i(t)=F'(t){\mathbf x}_i\)`. ] .w-50[ <img src="images/week7/geodesic.png" width="120%"> ] ] --- # Reading axes - interpretation .flex[ .w-50[ <br> <br> Length and direction of axes relative to the pattern of interest ] .w-50[ <img src="images/week7/reading_axes.png" width="100%"> ] ] --- # .orange[Case study] .bg-orange.circle[4] Penguins .flex[ .w-50[ <img src="images/lecture-08B/runthis13-1.png" width="90%" style="display: block; margin: auto;" /> Gentoo from others in contrast of fl, bd ] .w-50[ <img src="images/lecture-08B/runthis14-1.png" width="90%" style="display: block; margin: auto;" /> Chinstrap from others in contrast of bl, bm ] ] --- class: inverse middle left There may be multiple and different combinations of variables that reveal similar structure. ☹️ The tour can help to discover these, too. 😂 --- # Other tour types - .orange[guided]: follows the optimisation path for a projection pursuit index. - .orange[little]: interpolates between all variables. - .orange[local]: rocks back and forth from a given projection, so shows all possible projections within a radius. - .orange[dependence]: two independent 1D tours - .orange[frozen]: fixes some variable coefficients, others vary freely. - .orange[manual]: control coefficient of one variable, to examine the sensitivity of structure this variable. (In the .orange[spinifex] package) - .orange[slice]: use a section instead of a projection. --- # Guided tour .flex[ .w-50[ New target bases are chosen using a projection pursuit index function `$$\mathop{\text{maximize}}_{F} g(F'x) ~~~\text{ subject to } F \text{ being orthonormal}$$` ] .w-50[ - `holes`: This is an inverse Gaussian filter, which is optimised when there is not much data in the center of the projection, i.e. a "hole" or donut shape in 2D. - `central mass`: The opposite of holes, high density in the centre of the projection, and often "outliers" on the edges. - `LDA`/`PDA`: An index based on the linear discriminant dimension reduction (and penalised), optimised by projections where the named classes are most separated. ] ] --- # .orange[Case study] .bg-orange.circle[4] Penguins .panelset[ .panel[.panel-name[🖼️] .flex[ .w-50[ Grand <center> <img src="images/week7/penguins2d.gif" width="60%"> </center> Might accidentally see best separation ] .w-50[ Guided, using LDA index <center> <img src="images/week7/penguins2d_guided.gif" width="60%"> </center> Moves to the best separation ] ] ] .panel[.panel-name[R] ```r clrs <- ochre_pal( palette="nolan_ned")(3) col <- clrs[ as.numeric( penguins$species)] set.seed(20200622) render_gif(penguins[,3:6], guided_tour(lda_pp(penguins$species)), display_xy(col=col, axes="bottomleft"), "penguins2d_guided.gif", frames=17, width=300, height=300, loop=FALSE) ``` ```r animate_xy(penguins[,3:6], grand_tour(), axes = "bottomleft", col=col) animate_xy(penguins[,3:6], guided_tour(lda_pp(penguins$species)), axes = "bottomleft", col=col) best_proj <- matrix(c(0.940, 0.058, -0.253, 0.767, -0.083, -0.393, -0.211, -0.504), ncol=2, byrow=TRUE) ``` ] ] --- # Manual tour <br> <br> Control the coefficient of one variable, reduce it to zero, then increase it to 1, maintaining orthonormality --- # .orange[Case study] .bg-orange.circle[4] Penguins .panelset[ .panel[.panel-name[🖼️] .flex[ .w-50[ <br> <br> - start from best projection, given by projection pursuit - `bl` contribution controlled - if `bl` is removed from projection, Adelie and chinstrap are mixed - `bl` is important for Adelie ] .w-50[ <img src="images/week7/penguins_manual_bl.gif" width="70%"> ] ] ] .panel[.panel-name[R] <br> ```r mtour1 <- manual_tour(basis = best_proj, manip_var = 3) render_manual(penguins_s[,3:6], mtour1, "penguins_manual_fl.gif", col=col, dir = "images/manual1/") mtour2 <- manual_tour(basis = best_proj, manip_var = 1) render_manual(penguins_s[,3:6], mtour2, "penguins_manual_bl.gif", col=col, dir = "images/manual2") ``` ] ] --- # .orange[Case study] .bg-orange.circle[4] Penguins .flex[ .w-50[ <br> <br> - start from best projection, given by projection pursuit - `fl` contribution controlled - cluster less separated when fl is fully contributing - `fl` is important, in small amounts, for Gentoo ] .w-50[ <img src="images/week7/penguins_manual_fl.gif" width="70%"> ] ] --- # Resources - Wickham et al (2011). tourr: An R Package for Exploring Multivariate Data with Projections. http://www.jstatsoft.org/v40/i02/. - Cook and Laa (2023) Interactively exploring high-dimensional data and models in R, https://dicook.github.io/mulgar_book/ - Sievert (2019) Interactive web-based data visualization with R, plotly, and shiny, https://plotly-r.com - Horst et al (2020 https://allisonhorst.github.io/palmerpenguins/ - Gorman KB, Williams TD, Fraser WR (2014) [Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis).](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090081) PLoS ONE 9(3): e90081. doi:10.1371/journal.pone.0090081 --- background-size: cover class: title-slide background-image: url("images/bg-12.png") <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. .bottom_abs.width100[ Lecturer: *Di Cook* <i class="fas fa-envelope"></i> ETC5521.Clayton-x@monash.edu <i class="fas fa-calendar-alt"></i> Week 8 - Session 2 <br> ]