class: middle center hide-slide-number monash-bg-gray80 .info-box.w-50.bg-white[ These slides are viewed best by Chrome or Firefox and occasionally need to be refreshed if elements did not load properly. See <a href=lecture-07A.pdf>here for the PDF <i class="fas fa-file-pdf"></i></a>. ] <br> .white[Press the **right arrow** to progress to the next slide!] --- class: title-slide count: false background-image: url("images/bg-12.png") # .monash-blue[ETC5521: Exploratory Data Analysis] <h1 class="monash-blue" style="font-size: 30pt!important;"></h1> <br> <h2 style="font-weight:900!important;">Making comparisons between groups and strata</h2> .bottom_abs.width100[ Lecturer: *Di Cook* <i class="fas fa-envelope"></i> ETC5521.Clayton-x@monash.edu <i class="fas fa-calendar-alt"></i> Week 7 - Session 1 <br> ] <style type="text/css"> .gray80 { color: #505050!important; font-weight: 300; } </style> --- # .orange[Case study] .circle.bg-orange.white[1] Pest resistance maize .font_small[Part 1/2] * <i class="fas fa-bug"></i> Pests, like thrips and spiders, damage maize crops. .font_small[Note: Maize = Corn] * <i class="fas fa-dna"></i> One strategy to protect crops against pests is to cultivate *genetically modified* (GM) maize that expresses a toxic protein. <div class="grid" style="grid-template-columns: 300px auto;"> <div class="item" width="100px"> <table class=" lightable-classic" style='font-family: "Arial Narrow", "Source Sans Pro", sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:right;"> Thrips </th> <th style="text-align:right;"> Spiders </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 16.6 </td> <td style="text-align:right;"> 0.8 </td> </tr> <tr> <td style="text-align:right;"> 16.4 </td> <td style="text-align:right;"> 0.8 </td> </tr> <tr> <td style="text-align:right;"> 11.0 </td> <td style="text-align:right;"> 0.6 </td> </tr> <tr> <td style="text-align:right;"> 16.8 </td> <td style="text-align:right;"> 0.4 </td> </tr> <tr> <td style="text-align:right;"> 10.6 </td> <td style="text-align:right;"> 0.6 </td> </tr> <tr> <td style="text-align:right;"> 18.4 </td> <td style="text-align:right;"> 0.8 </td> </tr> <tr> <td style="text-align:right;"> 14.2 </td> <td style="text-align:right;"> 0.0 </td> </tr> <tr> <td style="text-align:right;"> 10.2 </td> <td style="text-align:right;"> 0.6 </td> </tr> </tbody> </table> </div> <div class="item" width="1000px"> <ul> <li>The species abundance on 8 <i>Bt</i> GM maize is shown.</li> {{content}} </ul> </div> </div> .footnote.f6[ Kevin Wright (2018). agridat: Agricultural Datasets. R package version 1.16. L. A. Hothorn, 2005. Evaluation of Bt-Maize Field Trials by a Proof of Safety. http://www.seedtest.org/upload/cms/user/presentation7Hothorn.pdf ] -- <li>Is the strategy working?</li> {{content}} -- <li>Well it didn't completely eliminate pests but <em class="monash-blue">did it lower the abundance?</em></li> {{content}} -- <li>We can't tell without knowing what the typical abundance is.</li> {{content}} -- <div class="blockquote" style="margin-left:-30px;margin-right:-1px;"> At the heart of quantitative reasoning is a single question: <b>Compared to what?</b><br><br> <footer class="pull-right">—Edward Tufte</footer> </div> --- # .orange[Case study] .circle.bg-orange.white[1] Pest resistance maize .font_small[Part 2/2] * The actual experiment compared *Bt* variety to the **isogenic control variety**. -- * How would you compare graphically? .panelset[ .panel[.panel-name[📊] <img src="images/week7A/crop-plot1-1.png" width="1008" style="display: block; margin: auto;" /> ] .panel[.panel-name[data] .h100.f5.scroll-sign[ ```r data(gathmann.bt, package = "agridat") df1 <- gathmann.bt %>% pivot_longer(-gen, values_to = "abundance", names_to = "species" ) %>% mutate(species = factor(case_when( species == "thysan" ~ "Thrip", TRUE ~ "Spider" ))) skimr::skim(df1) ``` ``` ## ── Data Summary ──────────────────────── ## Values ## Name df1 ## Number of rows 32 ## Number of columns 3 ## _______________________ ## Column type frequency: ## factor 2 ## numeric 1 ## ________________________ ## Group variables None ## ## ── Variable type: factor ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate ordered n_unique top_counts ## 1 gen 0 1 FALSE 2 Bt: 16, ISO: 16 ## 2 species 0 1 FALSE 2 Spi: 16, Thr: 16 ## ## ── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 abundance 0 1 6.01 6.36 0 0.6 2.5 10.7 18.4 ▇▂▃▁▂ ``` ]] .panel[.panel-name[R] .scroll-sign[ .s300.f5[ ```r g1 <- ggplot(df1, aes(x = gen, y = abundance, color = species)) + geom_point(size = 3) + facet_wrap(~species, scales = "free") + scale_color_discrete_qualitative() + guides(color = FALSE) + labs(x = "", y = "Abundance", tag = "(A)") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) g2 <- ggplot(df1, aes(gen, abundance, color = species)) + geom_point(size = 3) + scale_color_discrete_qualitative() + guides(color = FALSE) + labs(x = "", y = "Abundance", tag = "(B)", color = "Species") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) g3 <- ggplot(df1, aes(gen, abundance, color = species)) + geom_point(size = 3) + facet_wrap(~species) + scale_color_discrete_qualitative() + labs(x = "", y = "Abundance", tag = "(C)", color = "Species") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) g4 <- ggplot(df1, aes(species, abundance, color = gen)) + geom_point(size = 3) + facet_wrap(~gen, scales = "free") + scale_color_discrete_qualitative(palette = "Harmonic") + guides(color = FALSE) + labs(x = "", y = "Abundance", tag = "(D)") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) g5 <- ggplot(df1, aes(species, abundance, color = gen)) + geom_point(size = 3) + scale_color_discrete_qualitative(palette = "Harmonic") + guides(color = FALSE) + labs(x = "", y = "Abundance", tag = "(E)") + theme(axis.text.x = element_text(angle = 90)) g6 <- ggplot(df1, aes(species, abundance, color = gen)) + geom_point(size = 3) + facet_wrap(~gen) + scale_color_discrete_qualitative(palette = "Harmonic") + labs(x = "", y = "Abundance", tag = "(F)", color = "Genotype") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) (g1 + g2 + g3) / (g4 + g5 + g6) ``` ]]] ] --- # Comparing like-with-like .font_small[Part 1] .info-box.w-70[ Comparison should be fair - any differences should be due to the factor you wish to investigate. ] -- .grid[.item[ <br> <center> <img src="images/spiders.jpg" style="border: 5px solid black;" height = "120px"> <img src="images/thrips.jpeg" style="border: 5px solid black;" height = "120px"> </center> <img src="images/week7A/crop-plot2-1.png" width="576" style="display: block; margin: auto;" /> ] .item[ **Comparable populations and measurements** * Abundance is measured for two species: spiders and thrips. * The abundance metric differ between species. * Would you compare the .gray80[abundance of spiders on a *Bt* variety] to the .gray80[abundance of thrips on a isogenic variety]? ] ] .footnote.f6[ Image source: https://www.infonet-biovision.org/PlantHealth/Pests/Thrips and https://cropwatch.unl.edu/2016/managing-spider-mites-corn-and-soybean ] --- # Checking your perception with visual inference The primary comparison is randomised: `gen`. Separate lineups for each species. .panelset[ .panel[.panel-name[📊] <img src="images/week7A/pest-thrip-1.png" width="720" style="display: block; margin: auto;" /> ] .panel[.panel-name[R] ```r library(nullabor) set.seed(1034) df1_l <- df1 %>% filter(species == "Thrip") %>% mutate(gen01 = as.numeric(gen)-1) %>% lineup(method=null_permute("gen01"), true=., n=12) # Check order gen_check <- df1_l %>% group_by(.sample) %>% slice_head(n=8) %>% summarise(num1 = sum(gen01)) df1_l <- df1_l %>% left_join(gen_check, by=".sample") %>% mutate(gen01 = ifelse(num1 > 4, 1-gen01, gen01)) # Check none are the same or simply flipped labels df_check <- df1_l %>% group_by(.sample) %>% summarise(e1 = sum(abs(gen01 - c(rep(0, 8), rep(1, 8))))) # All have at least two observations switched ggplot(df1_l, aes(x = gen01, y = abundance)) + geom_point(size = 2, alpha=0.7) + stat_summary(mapping=aes(group=1), fun = mean, geom = "line") + stat_summary(fun = mean, geom = "point", shape = 8, size = 4) + facet_wrap(~.sample, ncol=4) + scale_x_continuous(expand=c(0.1, 0.1)) + scale_y_continuous(expand=c(0.1, 0.1)) + theme(axis.text = element_blank(), axis.title = element_blank(), axis.line.x = element_blank(), axis.line.y = element_blank()) ``` ] .panel[.panel-name[📊] <img src="images/week7A/pest-spider-1.png" width="720" style="display: block; margin: auto;" /> ] .panel[.panel-name[R] ```r library(nullabor) set.seed(1133) df1_l2 <- df1 %>% filter(species == "Spider") %>% mutate(gen01 = as.numeric(gen)-1) %>% lineup(method=null_permute("gen01"), true=., n=12) # Check order gen_check <- df1_l2 %>% group_by(.sample) %>% slice_head(n=8) %>% summarise(num1 = sum(gen01)) df1_l2 <- df1_l2 %>% left_join(gen_check, by=".sample") %>% mutate(gen01 = ifelse(num1 > 4, 1-gen01, gen01)) # Check none are the same or simply flipped labels df_check <- df1_l2 %>% group_by(.sample) %>% summarise(e1 = sum(abs(gen01 - c(rep(0, 8), rep(1, 8))))) # All have at least four observations switched ggplot(df1_l2, aes(x = gen01, y = abundance)) + geom_point(size = 2, alpha=0.7) + stat_summary(mapping=aes(group=1), fun = mean, geom = "line") + stat_summary(fun = mean, geom = "point", shape = 8, size = 4) + facet_wrap(~.sample, ncol=4) + scale_x_continuous(expand=c(0.1, 0.1)) + scale_y_continuous(expand=c(0.1, 0.1)) + theme(axis.text = element_blank(), axis.title = element_blank(), axis.line.x = element_blank(), axis.line.y = element_blank()) ``` ] .panel[.panel-name[Answers] <br><br> <center> Lineup 1: Position 12 <br> Lineup 2: Position 10 </center> <br> Conclusion: New genotype ISO makes no difference for spiders, but it does for thrip. .f5[Note: Conventional statistical testing would be an option for this comparison.] ] ] --- # .orange[Case study] .circle.bg-orange.white[2] Maize kernels .font_small[Part 1/2] <div class="grid" style="grid-template-columns: 350px 800px;"> .item.font_small[ <img src="images/corn.jpg" style="border: 5px solid black;" height = "130px"> 1. Plant pathologist 2. Associate plant pathologist 3. Professor of agronomy 4. Assistant professor of agronomy 5. Professor of philosophy 6. Biologist 7. Biologist (also author) 8. Assistant in biology 9. Data entry clerk (a.k.a. "Computer") 10. Farmer 11. Professor of plant physiology 12. Instructor in plant physiology 13. Assistant in plant physiology 14. Assistant in plant physiology 15. Professor of biology ] .item[ * 4 maize ears selected. * 15 observers asked to classify kernels to (i) starchy yellow, (ii) starchy white, (iii) sweet yellow or (iv) sweet white. * Ear 11 was slightly abnormal due to a fungus attack giving some pinkish tinge to some kernels. * Is Ear 11 different? <img src="images/week7A/kernel-plot1-1.png" width="864" style="display: block; margin: auto;" /> ] </div> .footnote.f6[ Raymond Pearl, 1911. The Personal Equation In Breeding Experiments Involving Certain Characters of Maize, Biol. Bull., 21, 339-366. ] --- # .orange[Case study] .circle.bg-orange.white[2] Maize kernels .font_small[Part 2/2] <div class="grid" style="grid-template-columns: 350px 800px;"> .item.font_small[ <img src="images/corn.jpg" style="border: 5px solid black;" height = "130px"> 1. Plant pathologist 2. Associate plant pathologist 3. Professor of agronomy 4. Assistant professor of agronomy 5. Professor of philosophy 6. Biologist 7. Biologist (also author) 8. Assistant in biology 9. Data entry clerk (a.k.a. "Computer") 10. Farmer 11. Professor of plant physiology 12. Instructor in plant physiology 13. Assistant in plant physiology 14. Assistant in plant physiology 15. Professor of biology ] .item[ * All observer are counting the kernels of the same ears, however there are variations across observers. * Notice Observer 1 classifies more kernels as yellow for Ear 11. .panelset[ .panel[.panel-name[plot 1] <img src="images/week7A/kernel-plot2-1.png" width="828" style="display: block; margin: auto;" /> ] .panel[.panel-name[plot 2] <img src="images/week7A/kernel-plot4-1.png" width="828" style="display: block; margin: auto;" /> ] ] ] </div> --- # Comparing like-with-like .font_small[Part 2] .flex[ .w-40[ **Comparable conditions** * The variability due to other sources need to be accounted, removed or "averaged" out for a fair comparison. ] .w-60[ <img src="images/week7A/kernel-plot3-1.png" width="396" style="display: block; margin: auto;" /> ]] --- # Comparing like-with-like .font_small[Part 3] <center> <img src="images/virus.png" height = "100px" style="padding-top:10px;"> </center> **Comparable variables and sources** * Data collected by different sources may have different rules. E.g. in Australia, "a COVID-19 death is defined for surveillance purposes as a death in a probable or confirmed COVID-19 case, unless there is a clear alternative cause of death that cannot be related to COVID19 (e.g. trauma)"<sup>[1]</sup> * .gray80[Do other countries use the same definition?] * The COVID-19 death data often have delays in reporting and data would be updated or corrected later. .footnote.f6[ <sup>[1]</sup> https://www.health.gov.au/resources/total-covid-19-cases-and-deaths-by-states-and-territories ] --- # .orange[Case study] .circle.bg-orange.white[3] Multi-environment soybean trial * 58 soy varieties are grown in four locations in Queensland in 1970 then 1971. * Soy breeders are interested in finding the "best" variety. * So which variety is the best for yield? .panelset[ .panel[.panel-name[📊] <img src="images/week7A/soy-plot1-1.png" width="1065.6" style="display: block; margin: auto;" /> ] .panel[.panel-name[data] .h300.scroll-sign.f5[ ```r data(australia.soybean, package = "agridat") skimr::skim(australia.soybean) ``` ``` ## ── Data Summary ──────────────────────── ## Values ## Name australia.soybean ## Number of rows 464 ## Number of columns 10 ## _______________________ ## Column type frequency: ## factor 3 ## numeric 7 ## ________________________ ## Group variables None ## ## ── Variable type: factor ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate ordered n_unique top_counts ## 1 env 0 1 FALSE 8 B70: 58, B71: 58, L70: 58, L71: 58 ## 2 loc 0 1 FALSE 4 Bro: 116, Law: 116, Nam: 116, Red: 116 ## 3 gen 0 1 FALSE 58 G01: 8, G02: 8, G03: 8, G04: 8 ## ## ── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 year 0 1 1970. 0.501 1970 1970 1970. 1971 1971 ▇▁▁▁▇ ## 2 yield 0 1 2.05 0.752 0.282 1.52 2.07 2.56 4.38 ▂▆▇▃▁ ## 3 height 0 1 0.883 0.272 0.25 0.708 0.888 1.04 1.73 ▂▆▇▂▁ ## 4 lodging 0 1 2.31 0.976 1 1.5 2.25 3 4.75 ▇▅▅▂▁ ## 5 size 0 1 11.1 4.45 4 7.84 9.5 14.0 23.6 ▅▇▂▃▁ ## 6 protein 0 1 40.3 2.93 33.2 38.1 40.2 42.2 48.5 ▁▇▇▅▁ ## 7 oil 0 1 19.9 2.67 13.0 18.0 19.8 22.1 26.8 ▁▆▇▆▂ ``` ]] .panel[.panel-name[R] .f5[ ```r australia.soybean %>% mutate(gen = reorder(gen, yield)) %>% ggplot(aes(gen, yield, color = loc, shape = as.factor(year))) + geom_point(size = 3) + labs(x = "Genotype", y = "Yield (tons/hectare)", shape = "Year", color = "Location") + scale_color_discrete_qualitative() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) ``` ]] ] --- # Types of comparison * Is the interest to find the best variety *for a location*? * In that case, should the comparison be within location? * Or is the interest to find the overall best variety *at any location*? * Comparisons may be specific or general. * A different type of comparison may require a different calculation or graphic for investigation. .panelset[ .panel[.panel-name[📊] <img src="images/week7A/soy-plot2-1.png" width="720" style="display: block; margin: auto;" /> ] .panel[.panel-name[R] .f5[ ```r ggplot(australia.soybean, aes(env, yield, group = gen)) + geom_point(size = 6, color = "gray") + geom_line(size = 1.3, color = "gray") + geom_point(data = filter(australia.soybean, gen %in% c("G49", "G48", "G50")), aes(color = gen), size = 6) + geom_line(data = filter(australia.soybean, gen %in% c("G49", "G48", "G50")), aes(color = gen), size = 1.3) + scale_color_discrete_qualitative() + labs( x = "Environment", y = "Yield", color = "Genotype" ) ``` ] ]] --- # .orange[Case study] .circle.bg-orange.white[4] Weight of calves with different diets .font_small[Part 1/3] * 67 calves born in 1975 across 11 herds are fed of one of three diets with low, medium or high energy with their initial and final weights recorded. * Different graphics and metrics will help to make comparison easier and fair. .panelset[ .panel[.panel-name[📊] <img src="images/week7A/diet-plot1-1.png" width="864" style="display: block; margin: auto;" /> ] .panel[.panel-name[data] .h200.scroll-sign.f5[ ```r data(urquhart.feedlot, package = "agridat") df4 <- urquhart.feedlot %>% pivot_longer(c(weight1, weight2), names_to = "when", values_to = "weight") %>% mutate( when = factor(as.character(when), labels = c("initial", "final"), levels = c("weight1", "weight2") ), diet = factor(diet, levels = c("High", "Medium", "Low")) ) skimr::skim(df4) ``` ``` ## ── Data Summary ──────────────────────── ## Values ## Name df4 ## Number of rows 134 ## Number of columns 5 ## _______________________ ## Column type frequency: ## factor 2 ## numeric 3 ## ________________________ ## Group variables None ## ## ── Variable type: factor ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate ordered n_unique top_counts ## 1 diet 0 1 FALSE 3 Low: 64, Med: 46, Hig: 24 ## 2 when 0 1 FALSE 2 ini: 67, fin: 67 ## ## ── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 animal 0 1 34 19.4 1 17.2 34 50.8 67 ▇▇▇▇▇ ## 2 herd 0 1 25.6 10.7 3 19 31 34 36 ▂▁▂▂▇ ## 3 weight 0 1 863. 182. 530 705 842. 1027 1248 ▃▇▃▆▃ ``` ]] .panel[.panel-name[R] .f5[ ```r df4 %>% ggplot(aes(diet, weight, color = diet)) + geom_point(size = 3) + facet_grid(when ~ herd, scale = "free_y") + scale_color_discrete_qualitative() + labs(x = "Diet", y = "Weight", title = "Weight by herd, timing and diet") + guides(color = FALSE) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) ``` ]] ] --- # .orange[Case study] .circle.bg-orange.white[4] Weight of calves with different diets .font_small[Part 2/3] * Weight data are *paired* so comparison of initial and final weights should be matched with the animal. .panelset[ .panel[.panel-name[📊] <img src="images/week7A/diet-plot2-1.png" width="1008" style="display: block; margin: auto;" /> ] .panel[.panel-name[R] ```r ggplot(df4, aes(when, weight, color = diet, group = animal)) + geom_point(size = 3) + facet_wrap(~herd, nrow = 2) + geom_line() + labs(x = "", y = "Weight", color = "Diet") ``` ]] --- # .orange[Case study] .circle.bg-orange.white[4] Weight of calves with different diets .font_small[Part 3/3] * People are generally better at comparing lengths on a common scale instead of angles .font_small[(see Cleveland & McGill, 1985)] * We could compare the *difference in inital and final weight*. * Weight gain doesn't take into account the initial weight though. * We could consider computing the relative weight gain with respect to its initial weight. .panelset[ .panel[.panel-name[📊] <img src="images/week7A/diet-plot3-1.png" width="1008" style="display: block; margin: auto;" /> ] .panel[.panel-name[R] .scroll-sign[.s250.f5[ ```r g1 <- urquhart.feedlot %>% mutate(diet = factor(diet, levels = c("High", "Medium", "Low") )) %>% ggplot(aes(diet, weight2 - weight1, color = diet)) + geom_boxplot() + labs(x = "", y = "Weight gain", color = "Diet") + guides(color = FALSE) g2 <- urquhart.feedlot %>% mutate(diet = factor(diet, levels = c("High", "Medium", "Low") )) %>% ggplot(aes(diet, (weight2 - weight1) / weight1, color = diet)) + geom_boxplot() + labs(x = "", y = "Relative weight\ngain", color = "Diet") + guides(color = FALSE) g1 + g2 ``` ]]]] .footnote.f6[ Cleveland, William S., and Robert Mc Gill. (1985) "Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods." ] --- # .orange[Case study] .circle.bg-orange.white[5] Swiss bank notes * Comparisons are not always based on point estimates. * Below is the comparison of distribution for the diagonal length of genuine and forged Swiss bank notes. .panelset[ .panel[.panel-name[📊] <img src="images/week7A/bank-plot1-1.png" width="720" style="display: block; margin: auto;" /> ] .panel[.panel-name[data] .h300.scroll-sign.f5[ ```r data(bank, package = "gclus") df5 <- bank %>% mutate(status = ifelse(Status == 0, "genuine", "forgery")) skimr::skim(bank) ``` ``` ## ── Data Summary ──────────────────────── ## Values ## Name bank ## Number of rows 200 ## Number of columns 7 ## _______________________ ## Column type frequency: ## numeric 7 ## ________________________ ## Group variables None ## ## ── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 Status 0 1 0.5 0.501 0 0 0.5 1 1 ▇▁▁▁▇ ## 2 Length 0 1 215. 0.377 214. 215. 215. 215. 216. ▁▇▇▂▁ ## 3 Left 0 1 130. 0.361 129 130. 130. 130. 131 ▁▅▇▇▁ ## 4 Right 0 1 130. 0.404 129 130. 130 130. 131. ▃▆▇▅▁ ## 5 Bottom 0 1 9.42 1.44 7.2 8.2 9.1 10.6 12.7 ▇▆▅▅▂ ## 6 Top 0 1 10.7 0.803 7.7 10.1 10.6 11.2 12.3 ▁▂▆▇▃ ## 7 Diagonal 0 1 140. 1.15 138. 140. 140. 142. 142. ▂▇▅▅▇ ``` ]] .panel[.panel-name[R] .f5[ ```r g1 <- ggplot(df5, aes(Diagonal, fill = status)) + geom_histogram(binwidth = 0.2, color = "white") + facet_grid(status ~ .) + labs( x = "Diagonal length (mm)", y = "Count" ) + guides(fill = FALSE) + scale_fill_manual(values = c("#C7A76C", "#7DB0DD")) g1 ``` ]] ] --- # Comparing graphically .font_small[Part 1] .flex[ .w-40[ * From (A) we can see that the diagonal length distribution is quite different between forged and genuine notes. * Comparing (B) and (C) is however difficult due to different **aspect ratio** and graphs are not in **common scale** nor **alignment**. ] .w-60[ <img src="images/week7A/bank-plot2-1.png" width="612" style="display: block; margin: auto;" /> ]] --- # .orange[Case study] .bg-orange.circle.white[6] Barley Yield .font_small[Part 1/2] .grid[.item50[ .panelset[ .panel[.panel-name[📊] <img src="images/week7A/barley-plot1-1.png" width="720" style="display: block; margin: auto;" /> ] .panel[.panel-name[data] .h300.scroll-sign[ ```r data("barley", package = "lattice") skimr::skim(barley) ``` ``` ## ── Data Summary ──────────────────────── ## Values ## Name barley ## Number of rows 120 ## Number of columns 4 ## _______________________ ## Column type frequency: ## factor 3 ## numeric 1 ## ________________________ ## Group variables None ## ## ── Variable type: factor ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate ordered n_unique top_counts ## 1 variety 0 1 FALSE 10 Sva: 12, No.: 12, Man: 12, No.: 12 ## 2 year 0 1 FALSE 2 193: 60, 193: 60 ## 3 site 0 1 FALSE 6 Gra: 20, Dul: 20, Uni: 20, Mor: 20 ## ## ── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 yield 0 1 34.4 10.3 14.4 26.9 32.9 41.4 65.8 ▃▇▅▂▁ ``` ]] .panel[.panel-name[R] ```r ggplot(barley, aes(yield, variety, shape = year)) + geom_point(size = 3) + facet_wrap(~site) + theme( plot.title.position = "plot", plot.title = element_text(face = "bold") ) + labs(x = "Yield", shape = "Year", y = "Variety") ``` ] ] ] .item[ * 10 barley varieties were tested at 6 locations in 1931 and in 1932 * Do you notice anything about the yield with respect to the years? ] ] .footnote.f6[ Immer, R. F., H. K. Hayes, and LeRoy Powers. (1934). Statistical Determination of Barley Varietal Adaptation. *Journal of the American Society of Agronomy* **26** 403–419 ] --- count: false # .orange[Case study] .bg-orange.circle.white[6] Barley Yield .font_small[Part 1/2] .grid[.item50[ .panelset[ .panel[.panel-name[📊] <img src="images/week7A/barley-plot2-1.png" width="720" style="display: block; margin: auto;" /> ] .panel[.panel-name[data] .h300.scroll-sign[ ```r data("barley", package = "lattice") skimr::skim(barley) ``` ``` ## ── Data Summary ──────────────────────── ## Values ## Name barley ## Number of rows 120 ## Number of columns 4 ## _______________________ ## Column type frequency: ## factor 3 ## numeric 1 ## ________________________ ## Group variables None ## ## ── Variable type: factor ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate ordered n_unique top_counts ## 1 variety 0 1 FALSE 10 Sva: 12, No.: 12, Man: 12, No.: 12 ## 2 year 0 1 FALSE 2 193: 60, 193: 60 ## 3 site 0 1 FALSE 6 Gra: 20, Dul: 20, Uni: 20, Mor: 20 ## ## ── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 yield 0 1 34.4 10.3 14.4 26.9 32.9 41.4 65.8 ▃▇▅▂▁ ``` ]] .panel[.panel-name[R] ```r ggplot(barley, aes(yield, variety, color = year)) + geom_point(size = 3) + facet_wrap(~site) + theme( plot.title.position = "plot", plot.title = element_text(face = "bold") ) + labs(x = "Yield", y = "Variety", color = "Year") + scale_color_discrete_qualitative() ``` ] ] ] .item[ * 10 barley varieties were tested at 6 locations in 1931 and in 1932 * Do you notice anything about the yield with respect to the years? <br> *How about now?* ] ] .footnote.f6[ Immer, R. F., H. K. Hayes, and LeRoy Powers. (1934). Statistical Determination of Barley Varietal Adaptation. *Journal of the American Society of Agronomy* **26** 403–419 ] --- count: false # .orange[Case study] .bg-orange.circle.white[6] Barley Yield .font_small[Part 1/2] .grid[.item50[ .panelset[ .panel[.panel-name[📊] <img src="images/week7A/barley-plot3-1.png" width="720" style="display: block; margin: auto;" /> ] .panel[.panel-name[data] .h300.scroll-sign[ ```r data("barley", package = "lattice") skimr::skim(barley) ``` ``` ## ── Data Summary ──────────────────────── ## Values ## Name barley ## Number of rows 120 ## Number of columns 4 ## _______________________ ## Column type frequency: ## factor 3 ## numeric 1 ## ________________________ ## Group variables None ## ## ── Variable type: factor ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate ordered n_unique top_counts ## 1 variety 0 1 FALSE 10 Sva: 12, No.: 12, Man: 12, No.: 12 ## 2 year 0 1 FALSE 2 193: 60, 193: 60 ## 3 site 0 1 FALSE 6 Gra: 20, Dul: 20, Uni: 20, Mor: 20 ## ## ── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 yield 0 1 34.4 10.3 14.4 26.9 32.9 41.4 65.8 ▃▇▅▂▁ ``` ]] .panel[.panel-name[R] ```r ggplot(barley, aes(yield, variety, color = year)) + geom_point(size = 3, alpha = 0.4) + geom_point(data = subset(barley, (site == "University Farm" & variety == "No. 475") | (site == "Grand Rapids" & variety == "Velvet")), size = 3) + facet_wrap(~site) + theme( plot.title.position = "plot", plot.title = element_text(face = "bold") ) + labs(x = "Yield", y = "Variety", color = "Year") + scale_color_discrete_qualitative() ``` ] ] ] .item[ * 10 barley varieties were tested at 6 locations in 1931 and in 1932 * Do you notice anything about the yield with respect to the years? <br> *How about now?* ] ] .footnote.f6[ Immer, R. F., H. K. Hayes, and LeRoy Powers. (1934). Statistical Determination of Barley Varietal Adaptation. *Journal of the American Society of Agronomy* **26** 403–419 ] --- # .orange[Case study] .bg-orange.circle.white[6] Barley Yield .font_small[Part 2/2] .grid[.item[ <img src="images/week7A/barley-plot3-1.png" width="720" style="display: block; margin: auto;" /> ] .item[ * Cleveland (1993) speculated that the year labels may have been reversed for some data. * Wright (2013) investigated this by examining extended data from 1927 to 1936, in addition to weather covariates, and found that the observations are not particularly unusual. ] ] .footnote.f6[ Cleveland, W. S. (1993) Visualising Data, Summit, NJ: Hobart Press.<br> Wright, Kevin (2013). Revisiting Immer's Barley Data. The American Statistician. **67** (3) 129–133. ] --- # .orange[Case study] .circle.bg-orange.white[7] Olive oils .panelset[ .panel[.panel-name[📊] <img src="images/week7A/olives-plot1-1.png" width="1008" style="display: block; margin: auto;" /> ] .panel[.panel-name[data] .h300.scroll-sign[ ```r data(olives, package = "classifly") df2 <- olives %>% mutate(Region = factor(Region, labels = c("South", "Sardinia", "North"))) skimr::skim(df2) ``` ``` ## ── Data Summary ──────────────────────── ## Values ## Name df2 ## Number of rows 572 ## Number of columns 10 ## _______________________ ## Column type frequency: ## factor 2 ## numeric 8 ## ________________________ ## Group variables None ## ## ── Variable type: factor ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate ordered n_unique top_counts ## 1 Region 0 1 FALSE 3 Sou: 323, Nor: 151, Sar: 98 ## 2 Area 0 1 FALSE 9 Sou: 206, Inl: 65, Cal: 56, Umb: 51 ## ## ── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 palmitic 0 1 1232. 169. 610 1095 1201 1360 1753 ▁▂▇▆▁ ## 2 palmitoleic 0 1 126. 52.5 15 87.8 110 169. 280 ▂▇▅▃▁ ## 3 stearic 0 1 229. 36.7 152 205 223 249 375 ▂▇▃▁▁ ## 4 oleic 0 1 7312. 406. 6300 7000 7302. 7680 8410 ▁▇▇▇▁ ## 5 linoleic 0 1 981. 243. 448 771. 1030 1181. 1470 ▃▅▃▇▃ ## 6 linolenic 0 1 31.9 13.0 0 26 33 40.2 74 ▂▅▇▂▁ ## 7 arachidic 0 1 58.1 22.0 0 50 61 70 105 ▂▁▇▇▂ ## 8 eicosenoic 0 1 16.3 14.1 1 2 17 28 58 ▇▃▅▂▁ ``` ]] .panel[.panel-name[R] .s400.scroll-sign[ ```r g1 <- df2 %>% mutate(Area = fct_reorder(Area, palmitic)) %>% ggplot(aes(Area, palmitic, color = Region)) + geom_boxplot() + scale_color_discrete_qualitative() + guides(color = FALSE, x = guide_axis(n.dodge = 2)) g2 <- ggplot(df2, aes(Region, palmitic, color = Region)) + geom_boxplot() + scale_color_discrete_qualitative() + guides(color = FALSE) g3 <- ggplot(df2, aes(palmitic, color = Region)) + geom_density() + scale_color_discrete_qualitative() + guides(color = FALSE) g4 <- ggplot(df2, aes(palmitic, color = Region)) + stat_ecdf() + scale_color_discrete_qualitative() g1 / (g2 | (g3 / g4)) + plot_layout(guides = "collect", byrow = FALSE) ``` ] ]] ??? * The olive oil data consists of the percentage composition of 8 fatty acids (palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic) found in the lipid fraction of 572 Italian olive oils. * There are 9 collection areas, 4 from southern Italy (North and South Apulia, Calabria, Sicily), two from Sardinia (Inland and Coastal) and 3 from northern Italy (Umbria, East and West Liguria). --- # Comparing graphically .font_small[Part 2] .grid[.item[ .panelset[ .panel[.panel-name[📊] <img src="images/week7A/color-olives-1.png" width="720" style="display: block; margin: auto;" /> ] .panel[.panel-name[R] ```r ggplot(olives, aes(palmitoleic, palmitic, color = Area)) + geom_point() + scale_color_discrete_qualitative() ``` ] ] ] .item[ * Color is a great way to differentiate categories but if there are too many categories then it becomes hard to compare. * In this scatter plot, there are too many overlapping points so splitting the data to **multiple windows** via facetting may make it easier to compare. ] ] --- # Comparing graphically .font_small[Part 3] .grid[.item50[ .panelset[ .panel[.panel-name[📊] <img src="images/week7A/no-shadow-1.png" width="504" style="display: block; margin: auto;" /> ] .panel[.panel-name[R] ```r ggplot(olives, aes(palmitoleic, palmitic, color = Area)) + geom_point() + facet_wrap(~Area) + scale_color_discrete_qualitative() + guides(color = FALSE) ``` ] ] ] .item[ {{content}} ] ] -- <div class="panelset"> <div class="panel"> <span class="panel-name">📊</span> <img src="images/week7A/shadow-1.png" width="504" style="display: block; margin: auto;" /> </div> <div class="panel"> <span class="panel-name">R</span> ```r ggplot(olives, aes(palmitoleic, palmitic)) + geom_point(data = dplyr::select(olives, -Area), color = "gray") + geom_point(aes(color = Area)) + facet_wrap(~Area) + scale_color_discrete_qualitative() + guides(color = FALSE) ``` </div> </div> --- # .orange[Case study] .circle.bg-orange.white[8] England and East Indies trade data .grid[ .item50[ .panelset[ .panel[.panel-name[📊] <img src="images/week7A/trade-plot1-1.png" width="504" style="display: block; margin: auto;" /> ] .panel[.panel-name[data] .h300.scroll-sign[ ```r data(EastIndiesTrade, package = "GDAdata") skimr::skim(EastIndiesTrade) ``` ``` ## ── Data Summary ──────────────────────── ## Values ## Name EastIndiesTrade ## Number of rows 81 ## Number of columns 3 ## _______________________ ## Column type frequency: ## numeric 3 ## ________________________ ## Group variables None ## ## ── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 Year 0 1 1740 23.5 1700 1720 1740 1760 1780 ▇▇▇▇▇ ## 2 Exports 0 1 518. 421. 100 145 370 840 1395 ▇▂▃▂▂ ## 3 Imports 0 1 1005. 320. 460 835 975 1000 1550 ▃▃▇▁▅ ``` ]] .panel[.panel-name[R] .s400.scroll-sign[ ```r g1 <- ggplot(EastIndiesTrade, aes(Year, Exports)) + annotate("rect", xmin = 1701, xmax = 1714, ymin = -Inf, ymax = Inf, fill = "red", alpha = 0.3 ) + annotate("rect", xmin = 1756, xmax = 1763, ymin = -Inf, ymax = Inf, fill = "red", alpha = 0.3 ) + annotate("rect", xmin = 1775, xmax = 1780, ymin = -Inf, ymax = Inf, fill = "red", alpha = 0.3 ) + geom_line(color = "#339933", size = 2) + geom_line(aes(Year, Imports), color = "red", size = 2) + geom_ribbon(aes(ymin = Exports, ymax = Imports), fill = "gray") + labs(y = "<span style='color:#339933'>Export</span>/<span style='color:red'>Import</span>", tag = "(A)") + theme(axis.title.y = ggtext::element_markdown()) g2 <- ggplot(EastIndiesTrade, aes(Year, Exports - Imports)) + annotate("rect", xmin = 1701, xmax = 1714, ymin = -Inf, ymax = Inf, fill = "red", alpha = 0.3 ) + annotate("rect", xmin = 1756, xmax = 1763, ymin = -Inf, ymax = Inf, fill = "red", alpha = 0.3 ) + annotate("rect", xmin = 1775, xmax = 1780, ymin = -Inf, ymax = Inf, fill = "red", alpha = 0.3 ) + geom_line(size = 2) + labs(tag = "(B)") g3 <- ggplot(EastIndiesTrade, aes(Year, (Exports - Imports) / (Exports + Imports) * 2)) + annotate("rect", xmin = 1701, xmax = 1714, ymin = -Inf, ymax = Inf, fill = "red", alpha = 0.3 ) + annotate("rect", xmin = 1756, xmax = 1763, ymin = -Inf, ymax = Inf, fill = "red", alpha = 0.3 ) + annotate("rect", xmin = 1775, xmax = 1780, ymin = -Inf, ymax = Inf, fill = "red", alpha = 0.3 ) + geom_line(color = "#001a66", size = 2) + labs(y = "Relative difference", tag = "(C)") g1 / g2 / g3 ``` ]] ] ] .item[ * (A) shows the export from England to the East Indies and the import to England from the East Indies in millions of pounds. * Import and export figures are easier to compare by plotting the difference like in (B). * Relative difference may be more of an interest - (C) plots the relative difference with respect to the average of export and import values. * The red area correspond to War of the Spanish Succession (1701-14), Seven Years' War (1756-63) and the American Revolutionary War (1775-83). ] ] --- # .orange[Case study] .circle.bg-orange.white[9] Melbourne's daily maximum temperature .grid[ .item50[ .panelset[ .panel[.panel-name[📊] <img src="images/week7A/temp-plot1-1.png" width="720" style="display: block; margin: auto;" /> ] .panel[.panel-name[data] .h300.scroll-sign[ ```r df9 <- read_csv(here::here("data", "melb_temp.csv")) %>% janitor::clean_names() %>% rename(temp = maximum_temperature_degree_c) %>% filter(!is.na(temp)) %>% dplyr::select(year, month, day, temp) skimr::skim(df9) ``` ``` ## ── Data Summary ──────────────────────── ## Values ## Name df9 ## Number of rows 18310 ## Number of columns 4 ## _______________________ ## Column type frequency: ## character 2 ## numeric 2 ## ________________________ ## Group variables None ## ## ── Variable type: character ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate min max empty n_unique whitespace ## 1 month 0 1 2 2 0 12 0 ## 2 day 0 1 2 2 0 31 0 ## ## ── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 year 0 1 1995. 14.5 1970 1983 1995 2008 2020 ▇▇▇▇▇ ## 2 temp 0 1 19.9 6.48 5.7 14.8 18.6 23.6 46.8 ▃▇▃▁▁ ``` ]] .panel[.panel-name[R] ```r ggplot(df9, aes(month, temp)) + geom_boxplot() + labs(x = "Month", y = "Maximum temperature (°C)") ``` ] ]] .item[ * Melbourne's daily maximum temperature from 1970 to 2020. * How are the temperatures different across months? * What about the temperature within a month? ] ] --- # Resources and Acknowledgement - Slides originally written by Emi Tanaka and constructed with [`xaringan`](https://github.com/yihui/xaringan), [remark.js](https://remarkjs.com), [`knitr`](http://yihui.name/knitr), and [R Markdown](https://rmarkdown.rstudio.com). --- background-size: cover class: title-slide background-image: url("images/bg-12.png") <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. .bottom_abs.width100[ Lecturer: *Di Cook* <i class="fas fa-envelope"></i> ETC5521.Clayton-x@monash.edu <i class="fas fa-calendar-alt"></i> Week 7 - Session 1 <br> ]