class: middle center hide-slide-number monash-bg-gray80 .info-box.w-50.bg-white[ These slides are viewed best by Chrome or Firefox and occasionally need to be refreshed if elements did not load properly. See <a href=lecture-10B.pdf>here for the PDF <i class="fas fa-file-pdf"></i></a>. ] <br> .white[Press the **right arrow** to progress to the next slide!] --- class: title-slide count: false background-image: url("images/bg-12.png") # .monash-blue[ETC5521: Exploratory Data Analysis] <h1 class="monash-blue" style="font-size: 30pt!important;"></h1> <br> <h2 style="font-weight:900!important;">Exploring data having a space and time context</h2> .bottom_abs.width100[ Lecturer: *Di Cook* <i class="fas fa-envelope"></i> ETC5521.Clayton-x@monash.edu <i class="fas fa-calendar-alt"></i> Week 10 - Session 2 <br> ] <style type="text/css"> .gray80 { color: #505050!important; font-weight: 300; } .bg-gray80 { background-color: #DCDCDC!important; } </style> --- class: transition # Spatial data --- # Spatial components .flex[ .w-50[ .info-box[Spatial data can be considered to have both .monash-orange2[trend] and .monash-orange2[error].] .monash-blue2[Trend] purely on spatial coordinates: expect .monash-blue2[north-south] trend in latitude (position of sun during the year), and possibly .monash-blue2[east-west] in longitude (earth rotation). .monash-orange2[Trend might be more complicated], localised ecosystems, or related to other factors like elevation. After trend is removed, the .monash-blue2[residuals (error)] are likely to have .monash-blue2[spatial dependence: closer sites are likely to have similar values]. ] .w-50[ <img src="images/lecture-10B/toy-spatial-1.png" width="80%" style="display: block; margin: auto;" /> ] ] --- # Check trend in longitude and latitude <img src="images/lecture-10B/unnamed-chunk-3-1.png" width="80%" style="display: block; margin: auto;" /> There is a trend in both directions, but it is stronger in the y (north-south) direction. --- # Trend + error <img src="images/lecture-10B/unnamed-chunk-4-1.png" width="85%" style="display: block; margin: auto;" /> Observed have trend + error. Note the apparent clustering in residuals is strong .monash-blue2[spatial dependence]. --- class: transition middle ## A flash back to the 1970s: Tukey's median polish This is a useful data scratching technique, particularly for spatial data, to remove complicated trends. --- # Median polish technique .pull-left[
] .pull-right[ .font_medium[ 1. Compute row medians, and the median of the row medians, called **row overall effect**. 2. Subtract each element in a row by its row median. 3. Subtract the row overall effect from each row median. 4. Do the same columns. Add the column overall effect to row overall effect. 5. Repeat 1-4 until negligible change occur with row or column medians. ] ] --- # Median polish technique .pull-left[ <img src="images/week10A/twoway-medpolish.png" width="100%"> ] .pull-right[ .s500[ ```r # check calculations x <- matrix(c(10, 8, 6, 4, 2, 8, 6, 4, 2, 4, 6, 4, 2, 4, 6, 4, 2, 4, 6, 8, 2, 4, 6, 8, 10), nrow=5, byrow=T) medpolish(x, maxiter = 1) ``` ``` ## 1: 42 ``` ``` ## ## Median Polish Results (Dataset: "x") ## ## Overall: 4 ## ## Row Effects: ## [1] 2 0 0 0 2 ## ## Column Effects: ## [1] 2 0 0 0 2 ## ## Residuals: ## [,1] [,2] [,3] [,4] [,5] ## [1,] 2 2 0 -2 -6 ## [2,] 2 2 0 -2 -2 ## [3,] 0 0 -2 0 0 ## [4,] -2 -2 0 2 2 ## [5,] -6 -2 0 2 2 ``` ] ] --- # Median polish technique .pull-left[ .s500[ ```r medpolish(x, maxiter = 5) ``` ``` ## 1: 42 ## Final: 42 ``` ``` ## ## Median Polish Results (Dataset: "x") ## ## Overall: 4 ## ## Row Effects: ## [1] 2 0 0 0 2 ## ## Column Effects: ## [1] 2 0 0 0 2 ## ## Residuals: ## [,1] [,2] [,3] [,4] [,5] ## [1,] 2 2 0 -2 -6 ## [2,] 2 2 0 -2 -2 ## [3,] 0 0 -2 0 0 ## [4,] -2 -2 0 2 2 ## [5,] -6 -2 0 2 2 ``` ] ] .pull-right[ Median polish is effectively fitting a model of this form: *overall effect + row effect + column effect* which can be written as: `$$y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}$$` <br> <br> Nice explanation by [Manny Gimond](https://mgimond.github.io/tukeyedar/articles/polish.html) ] --- # .orange[Case study] .bg-orange.circle[2] Soils .panelset[ .panel[.panel-name[plot] <img src="images/lecture-10B/med_polish-1.png" width="80%" style="display: block; margin: auto;" /> This is the Baker field data that we have seen before. The .monash-blue2[heatmap shows corn yield in a farm field] in Iowa. High values are yellow and low values are dark blue. The .monash-blue2[right-side heatmap shows the residuals] from median polish, and the row and column marginal effects. After a median polish, the values should look randomly distributed. ] .panel[.panel-name[R] .s500[ ```r baker <- read_csv(here::here("data/baker.csv")) p1 <- ggplot(baker, aes(x=X, y=Y, fill=Corn97BU)) + geom_tile() + scale_fill_viridis_c("") + theme_map() + theme(legend.position = "none") baker_mat <- baker %>% select(X, Y, Corn97BU) %>% pivot_wider(names_from = "Y", values_from = Corn97BU) baker_mat <- as.matrix(baker_mat) rownames(baker_mat) <- baker_mat[,1] baker_mat <- baker_mat[,-1] baker_mp <- medpolish(baker_mat[,-1], na.rm=TRUE) baker_mp$residuals <- cbind(as.numeric(rownames(baker_mp$residuals)), baker_mp$residuals) colnames(baker_mp$residuals)[1] <- "X" baker_mp_res <- baker_mp$residuals %>% as_tibble() %>% pivot_longer(cols=`48`:`336`, names_to = "Y", values_to = "Corn97BU") %>% mutate(Y = as.numeric(Y)) p2 <- ggplot(baker_mp_res, aes(x=X, y=Y, fill=Corn97BU)) + geom_tile() + scale_fill_viridis_c("", na.value="white") + theme_map() + theme(legend.position = "none") col_marg <- bind_rows(col=as.numeric(names(baker_mp$col)), Corn97BU = as.numeric(baker_mp$col)) row_marg <- bind_rows(row=as.numeric(names(baker_mp$row)), Corn97BU = as.numeric(baker_mp$row)) p3 <- ggplot(col_marg, aes(x = col, y = Corn97BU)) + xlab("") + ylab("") + geom_col() + theme_map() + coord_flip() p4 <- ggplot(row_marg, aes(x = row, y = Corn97BU)) + geom_col() + theme_map() + xlab("") + ylab("") design <- " ##11# 22334 22334 22334" p4 + p1 + p2 + p3 + plot_layout(design = design) ``` ] ] ] --- class: transition middle ## Spatial data needs maps Maps provide a familiar framework for spatial coordinates. For data analysis, you want fast to draw maps, not detailed maps. The important information from maps can be delivered with polygons. --- # Spatial polygon data
--- # Spatial polygon data .panelset[ .panel[.panel-name[plot] <img src="images/lecture-10B/mappolygon-1.png" width="100%" style="display: block; margin: auto;" /> Measured values (variables) associated with a spatial polygon. ] .panel[.panel-name[R] .s500.f5[ ```r oz <- world_map %>% filter(region == "Australia") %>% filter(lat > -50) m1 <- ggplot(oz, aes(x = long, y = lat)) + * geom_point(size=0.2) + coord_map() + ggtitle("Points") m2 <- ggplot(oz, aes(x = long, y = lat, * group = group)) + * geom_path() + coord_map() + ggtitle("Path") m3 <- ggplot(oz, aes(x = long, y = lat, * group = group)) + * geom_polygon(fill = "#607848", colour = "#184848") + coord_map() + ggtitle("Filled polygon") m1 + m2 + m3 ``` ] ] ] --- # `sf`: Simple spatial polygon objects in R .s400.f5[ ```r library(sf) nc <- st_read(system.file("shape/nc.shp", package="sf")) ``` ``` ## Reading layer `nc' from data source `/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/sf/shape/nc.shp' using driver `ESRI Shapefile' ## Simple feature collection with 100 features and 14 fields ## Geometry type: MULTIPOLYGON ## Dimension: XY ## Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965 ## Geodetic CRS: NAD27 ``` ```r nc %>% slice_head(n=5) ``` ``` ## Simple feature collection with 5 features and 14 fields ## Geometry type: MULTIPOLYGON ## Dimension: XY ## Bounding box: xmin: -81.74107 ymin: 36.07282 xmax: -75.77316 ymax: 36.58965 ## Geodetic CRS: NAD27 ## AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 SID79 NWBIR79 geometry ## 1 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1 10 1364 0 19 MULTIPOLYGON (((-81.47276 3... ## 2 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0 10 542 3 12 MULTIPOLYGON (((-81.23989 3... ## 3 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5 208 3616 6 260 MULTIPOLYGON (((-80.45634 3... ## 4 0.070 2.968 1831 1831 Currituck 37053 37053 27 508 1 123 830 2 145 MULTIPOLYGON (((-76.00897 3... ## 5 0.153 2.206 1832 1832 Northampton 37131 37131 66 1421 9 1066 1606 3 1197 MULTIPOLYGON (((-77.21767 3... ``` ] <br> Like the `cubble` object but more strictly a map object. Has a .monash-blue2[coordinate system] (projection), and bounding box. Supports .monash-blue2[technically accurate distance calculations] between coordinates (on a sphere). --- # `sf`: Simple spatial polygon objects in R .flex[ .w-40[ .s500.f5[ ```r nc_geom <- st_geometry(nc) nc_geom[[1]] %>% flatten() ``` ``` ## [[1]] ## [,1] [,2] ## [1,] -81.47276 36.23436 ## [2,] -81.54084 36.27251 ## [3,] -81.56198 36.27359 ## [4,] -81.63306 36.34069 ## [5,] -81.74107 36.39178 ## [6,] -81.69828 36.47178 ## [7,] -81.70280 36.51934 ## [8,] -81.67000 36.58965 ## [9,] -81.34530 36.57286 ## [10,] -81.34754 36.53791 ## [11,] -81.32478 36.51368 ## [12,] -81.31332 36.48070 ## [13,] -81.26624 36.43721 ## [14,] -81.26284 36.40504 ## [15,] -81.24069 36.37942 ## [16,] -81.23989 36.36536 ## [17,] -81.26424 36.35241 ## [18,] -81.32899 36.36350 ## [19,] -81.36137 36.35316 ## [20,] -81.36569 36.33905 ## [21,] -81.35413 36.29972 ## [22,] -81.36745 36.27870 ## [23,] -81.40639 36.28505 ## [24,] -81.41233 36.26729 ## [25,] -81.43104 36.26072 ## [26,] -81.45289 36.23959 ## [27,] -81.47276 36.23436 ``` ] ] .w-10.white[ - ] .w-50[ The geometry contains a list of spatial locations when connected in the right order can be used to draw the spatial polygon. ] ] --- class: transition middle ## Choropleth maps and cartograms and hexagon tiles --- # .orange[Case study] .bg-orange.circle[3] Thyroid cancer in women .panelset[ .panel[.panel-name[plot] .flex[ .w-50[ A choropleth map is used to show a measured variable associated with a political or geographic region. Polygons for the region are filled with colour. The purpose is to examine the spatial distribution of a variable. <br><br> The choropleth map at right shows thyroid cancer incidence for females across Australia, measured at an SA2 level. Red indicates higher incidence. ] .w-50[ <img src="images/lecture-10B/choro-1.png" width="100%" style="display: block; margin: auto;" /> ] ] ] .panel[.panel-name[learn] <br> Highest incidence appears in the area around Brisbane, and northern NSW. There may be some high spots near Sydney and Perth. <br><br><br> .info-box[The problem is that .monash-orange2[high density population areas may be very small geographically]. They can disappear in a choropleth map, which means that we get a biased sense of the spatial distribution of a variable.] ] .panel[.panel-name[R] .s500[ ```r library(sf) library(sugarbag) invthm <- theme_map() + theme( panel.background = element_rect(fill = "black", colour = NA), plot.background = element_rect(fill = "black", colour = NA), legend.background = element_rect(fill = "transparent", colour = NA), legend.key = element_rect(fill = "transparent", colour = NA), text = element_text(colour = "white"), axis.text = element_blank() ) # function to allocate colours to regions aus_colours <- function(sir_p50){ value <- case_when( sir_p50 < 0.74 ~ "#33809d", sir_p50 >= 0.74 & sir_p50 < 0.98 ~ "#aec6c7", sir_p50 >= 0.98 & sir_p50 < 1.05 ~ "#fff4bc", sir_p50 >= 1.05 & sir_p50 < 1.45 ~ "#ff9a64", sir_p50 >= 1.45 ~ "#ff3500", TRUE ~ "#FFFFFF") return(value) } ``` ```r sa2 <- strayr::read_absmap("sa22011") %>% filter(!st_is_empty(geometry)) %>% filter(!state_name_2011 == "Other Territories") %>% filter(!sa2_name_2011 == "Lord Howe Island") # sa2 <- sa2 %>% rmapshaper::ms_simplify(keep = 0.05, keep_shapes = TRUE) # Not necessary here SIR <- read_csv(here::here("data/SIR Downloadable Data.csv")) %>% filter(SA2_name %in% sa2$sa2_name_2011) %>% dplyr::select(Cancer_name, SA2_name, Sex_name, p50) %>% filter(Cancer_name == "Thyroid", Sex_name == "Females") ERP <- read_csv(here::here("data/ERP.csv")) %>% filter(REGIONTYPE == "SA2", Time == 2011, Region %in% SIR$SA2_name) %>% dplyr::select(Region, Value) # Alternative maps # Join with sa2 sf object sa2thyroid_ERP <- SIR %>% left_join(sa2, ., by = c("sa2_name_2011" = "SA2_name")) %>% left_join(., ERP %>% dplyr::select(Region, Population = Value), by = c("sa2_name_2011"= "Region")) %>% filter(!st_is_empty(geometry)) sa2thyroid_ERP <- sa2thyroid_ERP %>% #filter(!is.na(Population)) %>% filter(!sa2_name_2011 == "Lord Howe Island") %>% mutate(SIR = map_chr(p50, aus_colours)) %>% st_as_sf() ``` ```r # Plot the choropleth aus_ggchoro <- ggplot(sa2thyroid_ERP) + geom_sf(aes(fill = SIR), size = 0.1) + scale_fill_identity() + invthm aus_ggchoro ``` ] ] ] --- # .orange[Case study] .bg-orange.circle[3] Thyroid cancer in women .panelset[ .panel[.panel-name[plot] .flex[ .w-50[ A hexagon tile map represents every spatial polygon with an equal sized hexagon. In dense areas these will be tesselated, but separated hexagons are placed at centroids of the remote spatial regions. ] .w-50[ <img src="images/lecture-10B/hexagontilemap-1.png" width="100%" style="display: block; margin: auto;" /> ] ] ] .panel[.panel-name[learn] <br> High incidence can now be seen to accur in areas around Brisbane, northern NSW, Sydney and Perth, and even some locations in Melbourne. ] .panel[.panel-name[R] .s500[ ```r if (!file.exists(here::here("data/aus_hexmap.rda"))) { ## Create centroids set centroids <- sa2 %>% create_centroids(., "sa2_name_2011") ## Create hexagon grid grid <- create_grid(centroids = centroids, hex_size = 0.2, buffer_dist = 5) ## Allocate polygon centroids to hexagon grid points aus_hexmap <- allocate( centroids = centroids, hex_grid = grid, sf_id = "sa2_name_2011", ## same column used in create_centroids hex_size = 0.2, ## same size used in create_grid hex_filter = 10, focal_points = capital_cities, width = 35, verbose = FALSE ) save(aus_hexmap, file = here::here("data/aus_hexmap.rda")) } load(here::here("data/aus_hexmap.rda")) ## Prepare to plot fort_hex <- fortify_hexagon(data = aus_hexmap, sf_id = "sa2_name_2011", hex_size = 0.2) %>% left_join(sa2thyroid_ERP %>% select(sa2_name_2011, SIR, p50)) fort_aus <- sa2thyroid_ERP %>% fortify_sfc() ## Make a plot aus_hexmap_plot <- ggplot() + geom_polygon(aes(x = long, y = lat, group = interaction(sa2_name_2011, polygon)), fill = "black", colour = "darkgrey", size = 0.1, data = fort_aus) + geom_polygon(aes(x = long, y = lat, group = hex_id, fill = SIR), data = fort_hex) + scale_fill_identity() + invthm + coord_equal() aus_hexmap_plot ``` ]] ] --- background-image: url(images/week10A/sugarbagflow.png) background-size: 60% --- background-image: url(images/week10A/sugarbagprocess.png) background-size: 80% --- # Cartograms A [cartogram](https://www.r-graph-gallery.com/cartogram.html) transforms the .monash-orange2[geographic shape to match the value of a statistic]. Its a useful exploratory technique for examining the spatial distribution of a measured variable. .panelset[ .panel[.panel-name[plot] <img src="images/lecture-10B/cartogram-1.png" width="100%" style="display: block; margin: auto;" /> ] .panel[.panel-name[R] .s400[ ```r data(wrld_simpl) afr_raw <- wrld_simpl[wrld_simpl$REGION==2,] afr_sf = st_as_sf(afr_raw) afr <- st_transform(afr_sf, crs = 23038) # epsg:3395 means mercator projection p1 <- ggplot() + geom_sf(data = afr, aes(fill = POP2005), size=0, alpha=0.9) + scale_fill_viridis_c() + theme_map() + ggtitle("Choropleth map") afr_cartogram <- cartogram_cont(afr, "POP2005", itermax=10) p2 <- ggplot() + geom_sf(data = afr_cartogram, aes(fill = POP2005), size=0, alpha=0.9) + scale_fill_viridis_c() + theme_map() + ggtitle("Cartogram") afr_dorling <- cartogram_dorling(afr, "POP2005") p3 <- ggplot() + geom_sf(data = afr_dorling, aes(fill = POP2005), size=0, alpha=0.9) + scale_fill_viridis_c() + theme_map() + ggtitle("Dorling cartogram") lemon::grid_arrange_shared_legend(p1, p2, p3, ncol=3) ``` ] ] ] ---
--- <center> <img src="images/week10A/sugarbag_tas.png" width="80%"> </center> Three different cartogram displays for Tasmania: (a) contiguous cartogram, (b) non-contiguous cartogram and (c) Dorling cartogram. .info-box[The .monash-orange2[cartogram algorithm] can dramatically .monash-orange2[alter the geography], so that it is no longer recognisable. In the case of the whole of .monash-orange2[Australia, it simply does not converge].] --- class: center middle
<i class="fas fa-wrench fa-3x faa-wrench animated-hover faa-slow " style=" color:#D93F00;"></i>
.monash-orange2[Your turn! Point your browser to Michael Gastner's cartogram web site:] <br> <br> [https://go-cart.io/cartogram](https://go-cart.io/cartogram) --- # Resources and Acknowledgement - [Wickham et al (2012) Glyph-maps for Visually Exploring Temporal Patterns in Climate Data and Models](https://vita.had.co.nz/papers/glyph-maps.html) - [sf](https://r-spatial.github.io/sf/): Simple Features for R - Hexmaps with [sugarbag](https://srkobakian.github.io/sugarbag/) and [documentation](https://journal.r-project.org/articles/RJ-2023-021/) - Making [cartograms](https://www.r-graph-gallery.com/cartogram.html) in R - [Gastner et al (2018) Fast flow-based algorithm for creating density-equalizing map projections](https://www.pnas.org/content/115/10/E2156) - Median polish on two way tables from Tukey, J. W. (1977). Exploratory Data Analysis, Reading Massachusetts: Addison-Wesley, see [Manny Gimond's explanation](https://mgimond.github.io/tukeyedar/articles/polish.html). --- background-size: cover class: title-slide background-image: url("images/bg-12.png") <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. .bottom_abs.width100[ Lecturer: *Di Cook* <i class="fas fa-envelope"></i> ETC5521.Clayton-x@monash.edu <i class="fas fa-calendar-alt"></i> Week 10 - Session 2 <br> ]