class: middle center hide-slide-number monash-bg-gray80

.info-box.w-50.bg-white[
These slides are best viewed in Chrome or Firefox and occasionally need to be refreshed if elements did not load properly. See <a href=lecture-05A.pdf>here for the PDF <i class="fas fa-file-pdf"></i></a>.
]

<br>

.white[Press the **right arrow** to progress to the next slide!]

---
class: title-slide
count: false
background-image: url("images/bg-01.png")

# .monash-blue[ETC5521: Exploratory Data Analysis]

<h1 class="monash-blue" style="font-size: 30pt!important;"></h1>

<br>

<h2 style="font-weight:900!important;">Working with a single variable, making transformations, detecting outliers, using robust statistics</h2>

.bottom_abs.width100[

Lecturer: *Di Cook*

<i class="fas fa-envelope"></i> ETC5521.Clayton-x@monash.edu

<i class="fas fa-calendar-alt"></i> Week 5 - Session 1

<br>
]

---
class: transition middle

# Continuous variables

This lecture is partly based on Chapter 3 of Unwin (2015) Graphical Data Analysis with R

---

# Possible features of a single continuous variable

<table class=" lightable-classic" style='font-family: "Arial Narrow", "Source Sans Pro", sans-serif; margin-left: auto; margin-right: auto;'>
<thead> <tr> <th style="text-align:left;"> Feature </th> <th style="text-align:left;"> Example </th> <th style="text-align:left;"> Description </th> </tr> </thead>
<tbody>
<tr> <td style="text-align:left;"> Asymmetry </td> <td style="text-align:left;"> <img src="images/week4A/plots-1.png" height="54px"> </td> <td style="text-align:left;"> The distribution is not symmetrical. </td> </tr>
<tr> <td style="text-align:left;"> Outliers </td> <td style="text-align:left;"> <img src="images/week4A/plots-2.png" height="54px"> </td> <td style="text-align:left;"> Some observations are far from the rest.
</td> </tr>
<tr> <td style="text-align:left;"> Multimodality </td> <td style="text-align:left;"> <img src="images/week4A/plots-3.png" height="54px"> </td> <td style="text-align:left;"> There is more than one "peak" in the observations. </td> </tr>
<tr> <td style="text-align:left;"> Gaps </td> <td style="text-align:left;"> <img src="images/week4A/plots-4.png" height="54px"> </td> <td style="text-align:left;"> Some intervals within the range of the data contain no observations. </td> </tr>
<tr> <td style="text-align:left;"> Heaping </td> <td style="text-align:left;"> <img src="images/week4A/plots-5.png" height="54px"> </td> <td style="text-align:left;"> Some values occur unexpectedly often. </td> </tr>
<tr> <td style="text-align:left;"> Discretized </td> <td style="text-align:left;"> <img src="images/week4A/plots-6.png" height="54px"> </td> <td style="text-align:left;"> Only certain values are found, e.g. due to rounding. </td> </tr>
<tr> <td style="text-align:left;"> Implausible </td> <td style="text-align:left;"> <img src="images/week4A/plots-7.png" height="54px"> </td> <td style="text-align:left;"> Values outside of the plausible or likely range. </td> </tr>
</tbody> </table>

---

# Numerical features of a single continuous variable

<img src="images/week5A/example-plot-1.png" width="432" style="display: block; margin: auto;" />

* A measure of .monash-blue[**_central tendency_**], e.g. mean, median and mode

--

* A measure of .monash-blue[**_dispersion_**] (also called variability or spread), e.g. variance, standard deviation and interquartile range

--

* There are other measures, e.g. .monash-blue[**_skewness_**] (asymmetry) and .monash-blue[**_kurtosis_**] ("tailedness"), but these are not as commonly used as the first two

--

* The mean is the _first moment_; the variance is the _second central moment_, and skewness and kurtosis are the standardized _third and fourth central moments_

--

**Significance tests** or **hypothesis tests**

* Testing for `\(H_0: \mu = \mu_0\)` vs.
`\(H_1: \mu \neq \mu_0\)` (often `\(\mu_0 = 0\)`)
* The `\(t\)`-test is commonly used if the underlying data are believed to be normally distributed

---

# .orange[Case study] .circle.bg-orange.white[1] 2019 Australian Federal Election .f4[Part 1/8]

.flex[
.w-70[
**Context**

* There were 151 seats in the House of Representatives at the 2019 Australian federal election
* The major parties in Australia are:
  * the .monash-blue[**Coalition**], comprising the:
    * **Liberal**,
    * **Liberal National** <span class="f6">(Qld)</span>,
    * **National**, and
    * **Country Liberal** <span class="f6">(NT)</span> parties, and
  * the Australian .monash-blue[**Labor**] party
* The .green[**Greens**] party is a small but notable party

]
.w-30.center[

<img src="https://upload.wikimedia.org/wikipedia/commons/3/39/Scott_Morrison_2014_%28cropped_2%29.jpg" class="w-50 ba" alt="Scott Morrison">

<img src="https://upload.wikimedia.org/wikipedia/commons/7/7d/Bill_Shorten-crop.jpg" class="w-50 ba" alt="Bill Shorten">

]
]

---

# .orange[Case study] .circle.bg-orange.white[1] 2019 Australian Federal Election .f4[Part 2/8]

.f5[<i class="fas fa-database"></i> https://results.aec.gov.au/24310/Website/Downloads/HouseFirstPrefsByCandidateByVoteTypeDownload-24310.csv]
.footnote.f5[ Data source: Australian Electoral Commission. (2019). Federal Elections (website), accessed August 2021. URL: https://results.aec.gov.au/ ] --- # .orange[Case study] .circle.bg-orange.white[1] 2019 Australian Federal Election .f4[Part 3/8] .question-box[ What is the number of the seats won in the House of Representatives by parties? ] -- .panelset[ .panel[.panel-name[📊] .flex[ .w-50[ <table class=" lightable-classic" style='font-size: 20px; font-family: "Arial Narrow", "Source Sans Pro", sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> Party </th> <th style="text-align:right;"> # of seats </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Coalition </td> <td style="text-align:right;"> 77 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;color: #C8C8C8 !important;" indentlevel="1"> Liberal </td> <td style="text-align:right;color: #C8C8C8 !important;"> 44 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;color: #C8C8C8 !important;" indentlevel="1"> Liberal National Party Of Queensland </td> <td style="text-align:right;color: #C8C8C8 !important;"> 23 </td> </tr> <tr> <td style="text-align:left;padding-left: 2em;color: #C8C8C8 !important;" indentlevel="1"> The Nationals </td> <td style="text-align:right;color: #C8C8C8 !important;"> 10 </td> </tr> <tr> <td style="text-align:left;"> Australian Labor Party </td> <td style="text-align:right;"> 68 </td> </tr> <tr> <td style="text-align:left;"> The Greens </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Centre Alliance </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Katter's Australian Party (Kap) </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Independent </td> <td style="text-align:right;"> 3 </td> </tr> </tbody> </table> ] .w-50[ **What does this table tell you?** {{content}} ] ]] 
.panel[.panel-name[data] .scroll-sign[ .f5.s400[ ```r df1 <- read_csv(here::here("data/HouseFirstPrefsByCandidateByVoteTypeDownload-24310.csv"), skip = 1, col_types = cols( .default = col_character(), OrdinaryVotes = col_double(), AbsentVotes = col_double(), ProvisionalVotes = col_double(), PrePollVotes = col_double(), PostalVotes = col_double(), TotalVotes = col_double(), Swing = col_double() ) ) ``` ```r skimr::skim(df1) ``` ``` ## ── Data Summary ──────────────────────── ## Values ## Name df1 ## Number of rows 1207 ## Number of columns 18 ## _______________________ ## Column type frequency: ## character 11 ## numeric 7 ## ________________________ ## Group variables None ## ## ── Variable type: character ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate min max empty n_unique whitespace ## 1 StateAb 0 1 2 3 0 8 0 ## 2 DivisionID 0 1 3 3 0 151 0 ## 3 DivisionNm 0 1 4 15 0 151 0 ## 4 CandidateID 0 1 3 5 0 1057 0 ## 5 Surname 0 1 2 18 0 890 0 ## 6 GivenNm 0 1 1 25 0 613 0 ## 7 BallotPosition 0 1 1 3 0 14 0 ## 8 Elected 0 1 1 1 0 2 0 ## 9 HistoricElected 0 1 1 1 0 2 0 ## 10 PartyAb 151 0.875 2 4 0 40 0 ## 11 PartyNm 2 0.998 5 61 0 45 0 ## ## ── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 OrdinaryVotes 0 1 10401. 12446. 167 1867 4317 14768. 54535 ▇▁▁▁▁ ## 2 AbsentVotes 0 1 511. 569. 13 117 246 711 3287 ▇▂▁▁▁ ## 3 ProvisionalVotes 0 1 41.4 51.7 0 8 20 56 444 ▇▁▁▁▁ ## 4 PrePollVotes 0 1 514. 607. 11 108. 211 761 5248 ▇▂▁▁▁ ## 5 PostalVotes 0 1 1033. 1476. 14 181 317 1216. 9837 ▇▁▁▁▁ ## 6 TotalVotes 0 1 12501. 14860. 
250 2348 5196 18142 61202 ▇▁▁▁▁
## 7 Swing 0 1 1.07 4.26 -28.1 -0.73 1.21 2.75 43.5 ▁▆▇▁▁
```

```r
# Map state-specific or variant party names to a common label
recode_party_names <- c(
  "Australian Labor Party (Northern Territory) Branch" = "Australian Labor Party",
  "Labor" = "Australian Labor Party",
  "The Greens (Vic)" = "The Greens",
  "The Greens (Wa)" = "The Greens",
  "Katter's Australian Party (KAP)" = "Katter's Australian Party",
  "Country Liberals (Nt)" = "Country Liberals (NT)"
)
```

```r
# Keep elected candidates, harmonise party names, then count seats per party
tdf1 <- df1 %>%
  filter(Elected == "Y") %>%
  mutate(
    PartyNm = str_to_title(PartyNm),
    PartyNm = recode(PartyNm, !!!recode_party_names)
  ) %>%
  count(PartyNm, sort = TRUE) %>%
  # Manually reorder rows for display: Coalition parties first
  slice(2:4, 1, 8, 6, 7, 5)
```
]
]]
.panel[.panel-name[R]
.f5[
<i class="fas fa-pencil-alt"></i> Note: `tidyverse` is expected to be loaded already.

```r
# Prepend a Coalition total (sum of its three member parties), then format the table
data.frame(PartyNm = "Coalition", n = sum(tdf1$n[1:3])) %>%
  rbind(tdf1) %>%
  knitr::kable(col.names = c("Party", "# of seats")) %>%
  kableExtra::add_indent(2:4) %>%
  kableExtra::row_spec(2:4, color = "#C8C8C8") %>%
  kableExtra::kable_classic(
    full_width = FALSE,
    font_size = 20
  )
```
]]]

--

* The Coalition won government
* Labor and the Coalition hold the majority of the seats in the House of Representatives (lower house)
* Parties such as The Greens, Centre Alliance and Katter's Australian Party (KAP) won _only_ a single seat

{{content}}

--

Only?

{{content}}

--

Wait... **Did the parties compete in all electoral districts?**

---

# .orange[Case study] .circle.bg-orange.white[1] 2019 Australian Federal Election .f4[Part 4/8]

.panelset[
.panel[.panel-name[📊]
.flex[
.w-50[
]
.w-50[
**What do you notice from this table?**

{{content}}

]
]]
.panel[.panel-name[data]
.f5[
```r
# Harmonise party names, then count the candidate rows for each party
tdf2 <- df1 %>%
  mutate(
    PartyNm = str_to_title(PartyNm),
    PartyNm = recode(PartyNm, !!!recode_party_names)
  ) %>%
  count(PartyNm, sort = TRUE)
```
]]
.panel[.panel-name[R]
.f5[
`table_options` and `toggle_select` are helpers defined in the source Rmd; you can omit them, or look at the source to see what they do.

```r
tdf2 %>%
  DT::datatable(
    rownames = FALSE,
    escape = FALSE,
    width = "500px",
    options = table_options(
      scrollY = "400px",
      title = "Australian Federal Election 2019 - Party Distribution",
      csv = "aus-election-2019-party-dist"
    ),
    elementId = "tab1B",
    colnames = c("Party", "# of electorates"),
    callback = toggle_select
  )
```
]]]

--

* The Greens are represented in every electoral district
* United Australia Party is the only other non-major party to be represented in every electoral district
* KAP is represented in 7 electoral districts
* Centre Alliance is represented in only 3 electoral districts!

{{content}}

--

Let's have a closer look at the Greens party...
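The party counts above tally candidate rows with `count(PartyNm)`, which matches the number of districts contested only when a party fields at most one candidate per district (this need not hold for a label such as "Independent"). A minimal sketch, using a made-up data frame `toy_df1` with hypothetical values rather than the AEC data, that counts distinct districts instead:

```r
library(dplyr)

# Hypothetical candidate-level data: one row per candidate per district
toy_df1 <- data.frame(
  DivisionID = c("101", "101", "101", "102", "102", "103"),
  PartyNm    = c("The Greens", "Independent", "Independent",
                 "The Greens", "Independent", "The Greens")
)

# Candidate rows versus distinct districts contested, per party
contested <- toy_df1 %>%
  group_by(PartyNm) %>%
  summarise(
    n_candidates = n(),
    n_districts  = n_distinct(DivisionID)
  )

contested
```

With the real data the same idea would apply to `df1` after the party names have been recoded; whether the two counts differ for any party is something to check rather than assume.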
--- # .orange[Case study] .circle.bg-orange.white[1] 2019 Australian Federal Election .f4[Part 5/8] .panelset[ .panel[.panel-name[📊] .flex[ .w-70[ <img src="images/week5A/aus-election-plot1-1.png" width="720" style="display: block; margin: auto;" /> ] .w-30[ **What does this graph tell you?** {{content}} ] ]] .panel[.panel-name[data] .scroll-sign[ .f5.s500[ ```r tdf3 <- df1 %>% group_by(DivisionID) %>% summarise( DivisionNm = unique(DivisionNm), State = unique(StateAb), votes_GRN = TotalVotes[which(PartyAb == "GRN")], votes_total = sum(TotalVotes) ) %>% mutate(perc_GRN = votes_GRN / votes_total * 100) ``` ```r skimr::skim(tdf3) ``` ``` ## ── Data Summary ──────────────────────── ## Values ## Name tdf3 ## Number of rows 151 ## Number of columns 6 ## _______________________ ## Column type frequency: ## character 3 ## numeric 3 ## ________________________ ## Group variables None ## ## ── Variable type: character ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate min max empty n_unique whitespace ## 1 DivisionID 0 1 3 3 0 151 0 ## 2 DivisionNm 0 1 4 15 0 151 0 ## 3 State 0 1 2 3 0 8 0 ## ## ── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 votes_GRN 0 1 9821. 5581. 2744 6555 8676 11532. 45876 ▇▂▁▁▁ ## 2 votes_total 0 1 99925. 9801. 51009 96372. 
100936 105588 116216 ▁▁▁▇▅
## 3 perc_GRN 0 1 9.87 5.63 2.89 6.43 8.55 11.4 47.8 ▇▂▁▁▁
```
]]]
.panel[.panel-name[R]
.f5[
```r
# Histogram of the Greens' first preference share across electoral divisions
tdf3 %>%
  ggplot(aes(perc_GRN)) +
  geom_histogram(color = "white", fill = "#00843D") +
  labs(
    x = "Percentage of first preference votes per division",
    y = "Count",
    title = "First preference votes for the Greens party"
  )
```
]
]]

???

* Australia uses full-preference instant-runoff voting in single-member seats
* Following the full allocation of preferences, it is possible to derive a two-party-preferred figure, where the votes have been allocated between the two main candidates in the election.
* In Australia, this is usually between the candidates from the Coalition parties and the Australian Labor Party.

--

<ul>
<li>The majority of the country does not give its first preference to the Greens</li>
<li>Some constituents are slightly more supportive than others</li>
</ul>

{{content}}

--

**What further questions does it raise?**

---

# Formulating questions for EDA vs making observations from a plot

.flex[
.w-50[
* BEFORE plotting or making summaries, think of .monash-blue[**broad (open-ended) questions**] that promote discussion and divergent thinking
* Questions with simple answers (i.e. yes or no) are less helpful in encouraging exploration
* For example,

.center[
<div class="question-box w-80 tl">
What is the distribution of the first preference vote percentages for the Labor party across Australia? Is it evenly spread across electorates or are there clusters of popularity?
</div>
]
]
.w-50[
{{content}}
]
]

--

* AFTER plotting or making summaries, think <text style="color: #006DAE;"> **was this what you expected, are there any surprises?**</text> Detail what you learn, and how you should follow up on these observations.

<img src="images/week5A/aus-election-plot1-1.png" width="432" style="display: block; margin: auto;" />

<div class="question-box w-80 tl">
Is the outlying observation the electoral district where the Greens won their seat?
</div> {{content}} --- # Visual inference .flex[ .item.w-50[ Typical plot description: ```r ggplot(data, aes(x=var1)) + geom_histogram() ``` <br><br> *Is the distribution consistent with a sample from a particular statistical distribution?* ] .item.w-50[ Potential simulation methods from specific distributions ```r # Symmetric, unimodal, bell-shaped null_dist("var1", "norm") null_dist("var1", "cauchy") null_dist("var1", "t") # Skewed right null_dist("var1", "exp") null_dist("var1", "chisq") null_dist("var1", "gamma") # Constant null_dist("var1", "uniform") ``` ] ] --- # Lineup of Greens first preference percentages .panelset[ .panel[.panel-name[📊] <img src="images/week5A/votes-lineup-1.png" width="100%" style="display: block; margin: auto;" /> ] .panel[.panel-name[R] .f5[ ```r library(nullabor) set.seed(241) ggplot(lineup(null_dist("perc_GRN", "exp"), tdf3, n=10), aes(x=perc_GRN)) + geom_histogram(color = "white", fill = "#00843D", bins = 30) + facet_wrap(~.sample, ncol=5, scales="free") + theme(axis.text = element_blank(), axis.title = element_blank(), panel.grid.major = element_blank()) ``` ] ] ] --- # .orange[Case study] .circle.bg-orange.white[1] 2019 Australian Federal Election .f4[Part 6/8] .panelset[ .panel[.panel-name[📊] .flex[ .w-50[ <table class=" lightable-classic" style='font-family: "Arial Narrow", "Source Sans Pro", sans-serif; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="empty-cells: hide;" colspan="1"></th> <th style="padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="4"><div style="border-bottom: 1px solid #111111; margin-bottom: -1px; ">% of first preference for the Greens</div></th> <th style="empty-cells: hide;" colspan="2"></th> </tr> <tr> <th style="text-align:left;"> State </th> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> Median </th> <th style="text-align:right;"> SD </th> <th style="text-align:right;"> IQR </th> <th style="text-align:right;"> Skewness </th> 
<th style="text-align:right;"> Kurtosis </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> ACT </td> <td style="text-align:right;"> 16.406 </td> <td style="text-align:right;"> 13.988 </td> <td style="text-align:right;"> 5.602 </td> <td style="text-align:right;"> 5.196 </td> <td style="text-align:right;"> 0.645 </td> <td style="text-align:right;"> 1.500 </td> </tr> <tr> <td style="text-align:left;"> VIC </td> <td style="text-align:right;"> 11.400 </td> <td style="text-align:right;"> 8.570 </td> <td style="text-align:right;"> 8.210 </td> <td style="text-align:right;"> 6.717 </td> <td style="text-align:right;"> 2.603 </td> <td style="text-align:right;"> 11.360 </td> </tr> <tr> <td style="text-align:left;"> WA </td> <td style="text-align:right;"> 10.993 </td> <td style="text-align:right;"> 10.756 </td> <td style="text-align:right;"> 3.018 </td> <td style="text-align:right;"> 3.116 </td> <td style="text-align:right;"> 0.802 </td> <td style="text-align:right;"> 3.026 </td> </tr> <tr> <td style="text-align:left;"> QLD </td> <td style="text-align:right;"> 9.764 </td> <td style="text-align:right;"> 8.808 </td> <td style="text-align:right;"> 5.096 </td> <td style="text-align:right;"> 4.753 </td> <td style="text-align:right;"> 1.092 </td> <td style="text-align:right;"> 3.886 </td> </tr> <tr> <td style="text-align:left;"> TAS </td> <td style="text-align:right;"> 9.721 </td> <td style="text-align:right;"> 9.339 </td> <td style="text-align:right;"> 4.009 </td> <td style="text-align:right;"> 0.985 </td> <td style="text-align:right;"> 0.326 </td> <td style="text-align:right;"> 2.493 </td> </tr> <tr> <td style="text-align:left;"> NT </td> <td style="text-align:right;"> 9.572 </td> <td style="text-align:right;"> 9.572 </td> <td style="text-align:right;"> 2.473 </td> <td style="text-align:right;"> 1.748 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> 1.000 </td> </tr> <tr> <td style="text-align:left;"> SA </td> <td 
style="text-align:right;"> 9.120 </td> <td style="text-align:right;"> 8.903 </td> <td style="text-align:right;"> 3.024 </td> <td style="text-align:right;"> 3.412 </td> <td style="text-align:right;"> 0.384 </td> <td style="text-align:right;"> 2.920 </td> </tr> <tr> <td style="text-align:left;"> NSW </td> <td style="text-align:right;"> 8.101 </td> <td style="text-align:right;"> 6.635 </td> <td style="text-align:right;"> 4.087 </td> <td style="text-align:right;"> 3.948 </td> <td style="text-align:right;"> 1.502 </td> <td style="text-align:right;"> 4.859 </td> </tr> <tr> <td style="text-align:left;border-top: 2px solid black;"> National </td> <td style="text-align:right;border-top: 2px solid black;"> 9.874 </td> <td style="text-align:right;border-top: 2px solid black;"> 8.547 </td> <td style="text-align:right;border-top: 2px solid black;"> 5.632 </td> <td style="text-align:right;border-top: 2px solid black;"> 5.001 </td> <td style="text-align:right;border-top: 2px solid black;"> 2.671 </td> <td style="text-align:right;border-top: 2px solid black;"> 15.798 </td> </tr> </tbody> </table> ] .w-50.pl3[ {{content}} ]]] .panel[.panel-name[data] .f5[ ```r tdf3 <- df1 %>% group_by(DivisionID) %>% summarise( DivisionNm = unique(DivisionNm), State = unique(StateAb), votes_GRN = TotalVotes[which(PartyAb == "GRN")], votes_total = sum(TotalVotes) ) %>% mutate(perc_GRN = votes_GRN / votes_total * 100) ``` ]] .panel[.panel-name[R] .f5[ ```r tdf3 %>% group_by(State) %>% summarise( mean = mean(perc_GRN), median = median(perc_GRN), sd = sd(perc_GRN), iqr = IQR(perc_GRN), skewness = moments::skewness(perc_GRN), kurtosis = moments::kurtosis(perc_GRN) ) %>% arrange(desc(mean)) %>% rbind(data.frame( State = "National", mean = mean(tdf3$perc_GRN), median = median(tdf3$perc_GRN), sd = sd(tdf3$perc_GRN), iqr = IQR(tdf3$perc_GRN), skewness = moments::skewness(tdf3$perc_GRN), kurtosis = moments::kurtosis(tdf3$perc_GRN) )) %>% knitr::kable(col.names = c("State", "Mean", "Median", "SD", "IQR", 
"Skewness", "Kurtosis"), digits = 3) %>% kableExtra::kable_classic() %>% kableExtra::add_header_above(c(" ", "% of first preference for the Greens" = 4, " " = 2)) %>% kableExtra::row_spec(9, extra_css = "border-top: 2px solid black;") ``` ]]] -- * Why are the means and the medians different? * How are the standard deviations and the interquartile ranges similar or different? * Are there some other numerical statistics we should show? --- # Robust measure of central tendency .flex[ .w-40[ * <span style="color:#D81B60">**Mean**</span> is a non-robust measure of location. * <span style="color:#1E88E5">**Median**</span> is the 50% quantile of the observations * <span style="color:#FFC107">**Trimmed mean**</span> is the sample mean after discarding observations at the tails. * <span style="color:#004D40">**Winsorized mean**</span> is the sample mean after replacing observations at the tails with the minimum or maximum of the observations that remain. ] .w-60[ <img src='images/week5A/robust-mean-1.png' class='ba pl2' height ='150px'/> <img src='images/week5A/robust-mean-2.png' class='ba pl2' height ='150px'/> <img src='images/week5A/robust-mean-3.png' class='ba pl2' height ='150px'/> <img src='images/week5A/robust-mean-4.png' class='ba pl2' height ='150px'/> <img src='images/week5A/robust-mean-5.png' class='ba pl2' height ='150px'/> <img src='images/week5A/robust-mean-6.png' class='ba pl2' height ='150px'/> <table class=" lightable-classic" style='font-size: 12px; font-family: "Arial Narrow", "Source Sans Pro", sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:right;"> Plot </th> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> Median </th> <th style="text-align:right;"> Trimmed Mean<sup>*</sup> </th> <th style="text-align:right;"> Winsorized Mean<sup>*</sup> </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;color: #D81B60 !important;"> 
0.109 </td> <td style="text-align:right;color: #1E88E5 !important;"> 0.114 </td> <td style="text-align:right;color: #FFC107 !important;"> 0.120 </td> <td style="text-align:right;color: #004D40 !important;"> 0.103 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;color: #D81B60 !important;"> 0.054 </td> <td style="text-align:right;color: #1E88E5 !important;"> -0.045 </td> <td style="text-align:right;color: #FFC107 !important;"> -0.016 </td> <td style="text-align:right;color: #004D40 !important;"> -0.029 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;color: #D81B60 !important;"> 1.177 </td> <td style="text-align:right;color: #1E88E5 !important;"> 0.729 </td> <td style="text-align:right;color: #FFC107 !important;"> 0.820 </td> <td style="text-align:right;color: #004D40 !important;"> 0.888 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;color: #D81B60 !important;"> 0.533 </td> <td style="text-align:right;color: #1E88E5 !important;"> 0.541 </td> <td style="text-align:right;color: #FFC107 !important;"> 0.543 </td> <td style="text-align:right;color: #004D40 !important;"> 0.542 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;color: #D81B60 !important;"> 0.468 </td> <td style="text-align:right;color: #1E88E5 !important;"> 0.329 </td> <td style="text-align:right;color: #FFC107 !important;"> 0.355 </td> <td style="text-align:right;color: #004D40 !important;"> 0.390 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;color: #D81B60 !important;"> 5.626 </td> <td style="text-align:right;color: #1E88E5 !important;"> 6.656 </td> <td style="text-align:right;color: #FFC107 !important;"> 5.918 </td> <td style="text-align:right;color: #004D40 !important;"> 5.688 </td> </tr> </tbody> </table> .f5[ <sup>*</sup> Both trimmed and Winsorized mean trimmed 20% of the tails. 
]
]
]

---

# Robust measure of dispersion

.flex[
.w-50[
* <span style="color:#648FFF">**Standard deviation**</span> or its square, **variance**, is a popular choice of measure of dispersion but is not robust to outliers
* Standard deviation for sample `\(x_1, ..., x_n\)` is
`$$\sqrt{\sum_{i=1}^n \frac{(x_i - \bar{x})^2}{n - 1}}$$`
* <span style="color:#785EF0">**Interquartile range**</span> is the difference between the 1st and 3rd quartiles, and is a more robust measure of spread
* <span style="color:#FE6100">**Median absolute deviation**</span> (MAD) is even more robust
`$$\text{median}(|x_i - \text{median}(x_i)|)$$`

]
.w-50.pl3[
<img src='images/week5A/robust-mean-1.png' class='ba pl2' height ='150px'/> <img src='images/week5A/robust-mean-2.png' class='ba pl2' height ='150px'/> <img src='images/week5A/robust-mean-3.png' class='ba pl2' height ='150px'/> <img src='images/week5A/robust-mean-4.png' class='ba pl2' height ='150px'/> <img src='images/week5A/robust-mean-5.png' class='ba pl2' height ='150px'/> <img src='images/week5A/robust-mean-6.png' class='ba pl2' height ='150px'/>
<table class=" lightable-classic" style='font-size: 12px; font-family: "Arial Narrow", "Source Sans Pro", sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'>
<thead> <tr> <th style="empty-cells: hide;" colspan="1"></th> <th style="padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="3"><div style="border-bottom: 1px solid #111111; margin-bottom: -1px; ">Measure of dispersion</div></th> <th style="empty-cells: hide;" colspan="2"></th> </tr>
<tr> <th style="text-align:right;"> Plot </th> <th style="text-align:right;"> SD </th> <th style="text-align:right;"> IQR </th> <th style="text-align:right;"> MAD </th> <th style="text-align:right;"> Skewness </th> <th style="text-align:right;"> Kurtosis </th> </tr> </thead>
<tbody>
<tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;color: #648FFF !important;"> 0.898 </td> <td
style="text-align:right;color: #785EF0 !important;"> 1.186 </td> <td style="text-align:right;color: #FE6100 !important;"> 0.870 </td> <td style="text-align:right;"> -0.072 </td> <td style="text-align:right;"> 3.008 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;color: #648FFF !important;"> 0.986 </td> <td style="text-align:right;color: #785EF0 !important;"> 1.411 </td> <td style="text-align:right;color: #FE6100 !important;"> 1.077 </td> <td style="text-align:right;"> 0.358 </td> <td style="text-align:right;"> 2.212 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;color: #648FFF !important;"> 1.326 </td> <td style="text-align:right;color: #785EF0 !important;"> 1.176 </td> <td style="text-align:right;color: #FE6100 !important;"> 0.793 </td> <td style="text-align:right;"> 1.944 </td> <td style="text-align:right;"> 7.184 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;color: #648FFF !important;"> 0.288 </td> <td style="text-align:right;color: #785EF0 !important;"> 0.450 </td> <td style="text-align:right;color: #FE6100 !important;"> 0.335 </td> <td style="text-align:right;"> -0.126 </td> <td style="text-align:right;"> 1.837 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;color: #648FFF !important;"> 0.468 </td> <td style="text-align:right;color: #785EF0 !important;"> 0.499 </td> <td style="text-align:right;color: #FE6100 !important;"> 0.343 </td> <td style="text-align:right;"> 1.691 </td> <td style="text-align:right;"> 6.372 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;color: #648FFF !important;"> 2.784 </td> <td style="text-align:right;color: #785EF0 !important;"> 5.362 </td> <td style="text-align:right;color: #FE6100 !important;"> 2.984 </td> <td style="text-align:right;"> -0.351 </td> <td style="text-align:right;"> 1.678 </td> </tr> </tbody> </table> ]] --- # .orange[Case study] 
.circle.bg-orange.white[1] 2019 Australian Federal Election .f4[Part 7/8]

.panelset[
.panel[.panel-name[📊]
.flex[
.w-50[
<img src="images/week5A/aus-election-plot2-1.png" width="432" style="display: block; margin: auto;" />
]
.w-50[
**We should plot the data!**

* The width of each boxplot reflects the number of electoral districts in the corresponding state (which is roughly proportional to the population)

]
]]
.panel[.panel-name[data]
.f5[
```r
# One row per electoral division with the Greens' share of first preference votes
tdf3 <- df1 %>%
  group_by(DivisionID) %>%
  summarise(
    DivisionNm = unique(DivisionNm),
    State = unique(StateAb),
    votes_GRN = TotalVotes[which(PartyAb == "GRN")],
    votes_total = sum(TotalVotes)
  ) %>%
  mutate(perc_GRN = votes_GRN / votes_total * 100)
```
]]
.panel[.panel-name[R]
.f5[
```r
# Boxplots by state, ordered by median share; varwidth scales box width by group size
tdf3 %>%
  mutate(State = fct_reorder(State, perc_GRN)) %>%
  ggplot(aes(perc_GRN, State)) +
  geom_boxplot(varwidth = TRUE) +
  labs(
    x = "Percentage of first preference votes per division",
    y = "State",
    title = "First preference votes for the Greens party"
  )
```
]]]

---

# Outliers

.info-box.w-60[
**Outliers** are *observations* that are significantly different from the majority.
]

<br>

.flex[
.w-50[
* Outliers can _**occur by chance in almost all distributions**_, but could be indicative of:
  * a measurement error,
  * a different population, or
  * an issue with the sampling process.

]
.w-50[
<img src="images/week5A/aus-election-plot2-1.png" width="432" style="display: block; margin: auto;" />
]
]

---

# Closer look at the _boxplot_

<img src="images/week5A/annotated-boxplot-1.png" width="432" style="display: block; margin: auto;" />

* Observations that fall outside the lower and upper fences (1.5 times the box length, i.e. the IQR, beyond the quartiles) are sometimes referred to as .monash-blue[outliers]
* Boxplots of data from a skewed distribution will almost always show these "outliers", but these are not necessarily outliers
* Some definitions of outliers assume a symmetrical population distribution (e.g.
in boxplots, or observations more than a certain number of standard deviations away from the mean) and these definitions are ill-suited for asymmetrical distributions

--

.center[
**But are there some things we .red[*cannot*] see from boxplots?**
]

---

# .orange[Case study] .circle.bg-orange.white[1] 2019 Australian Federal Election .f4[Part 8/8]

.panelset[
.panel[.panel-name[📊]
.flex[
.w-50[
<img src="images/week5A/aus-election-2019-plot3-1.png" width="432" style="display: block; margin: auto;" />
]
.w-50[
{{content}}
]
]]
.panel[.panel-name[data]
.f5[
```r
# One row per electoral division with the Greens' share of first preference votes
tdf3 <- df1 %>%
  group_by(DivisionID) %>%
  summarise(
    DivisionNm = unique(DivisionNm),
    State = unique(StateAb),
    votes_GRN = TotalVotes[which(PartyAb == "GRN")],
    votes_total = sum(TotalVotes)
  ) %>%
  mutate(perc_GRN = votes_GRN / votes_total * 100)
```
]]
.panel[.panel-name[R]
.f5[
```r
# Show every division as a point so that group sizes are visible
tdf3 %>%
  mutate(State = fct_reorder(State, perc_GRN)) %>%
  ggplot(aes(perc_GRN, State)) +
  ggbeeswarm::geom_quasirandom(groupOnX = FALSE, varwidth = TRUE) +
  labs(
    x = "Percentage of first preference votes per division",
    y = "State",
    title = "First preference votes for the Greens party"
  )
```
]]]

--

**Now what do you notice from this graph that you didn't notice before?**

{{content}}

--

* There are only two electoral districts in NT!
* And only 3 and 5 electoral districts in ACT and TAS, respectively!

{{content}}

--

* We have _not_ computed the number of electoral districts for each state so far!
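A minimal sketch of that missing count, assuming a data frame with one row per electoral district and a `State` column, as `tdf3` has. The data frame `toy_tdf3` and its values are hypothetical stand-ins, not the election data:

```r
library(dplyr)

# Hypothetical stand-in for tdf3: one row per electoral district
toy_tdf3 <- data.frame(
  DivisionNm = c("A", "B", "C", "D", "E", "F"),
  State      = c("NT", "NT", "ACT", "ACT", "ACT", "TAS")
)

# Number of electoral districts in each state, largest first
n_by_state <- toy_tdf3 %>%
  count(State, name = "n_districts", sort = TRUE)

n_by_state
```

Reporting such a count next to the per-state mean, median, SD and IQR would flag states where those summaries rest on only a handful of districts.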
{{content}} -- <div class="info-box"> <i class="fas fa-book-reader"></i> Both numerical and graphical summaries can <i>reveal</i> and/or <i>hide</i> aspects of the data </div> --- class: transition # Transformations --- # .orange[Case study] .bg-orange.circle.white[2] Melbourne Housing Prices .f4[Part 1/5] .flex[ .w-50[ <table class=" lightable-classic" style='font-size: 12px; font-family: "Arial Narrow", "Source Sans Pro", sans-serif; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> Suburb </th> <th style="text-align:right;"> Rooms </th> <th style="text-align:left;"> Type </th> <th style="text-align:right;"> Price ($) </th> <th style="text-align:left;"> Date </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Abbotsford </td> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> Home </td> <td style="text-align:right;"> 1,490,000 </td> <td style="text-align:left;"> 2017-04-01 </td> </tr> <tr> <td style="text-align:left;"> Abbotsford </td> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> Home </td> <td style="text-align:right;"> 1,220,000 </td> <td style="text-align:left;"> 2017-04-01 </td> </tr> <tr> <td style="text-align:left;"> Abbotsford </td> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> Home </td> <td style="text-align:right;"> 1,420,000 </td> <td style="text-align:left;"> 2017-04-01 </td> </tr> <tr> <td style="text-align:left;"> Aberfeldie </td> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> Home </td> <td style="text-align:right;"> 1,515,000 </td> <td style="text-align:left;"> 2017-04-01 </td> </tr> <tr> <td style="text-align:left;"> Airport West </td> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> Home </td> <td style="text-align:right;"> 670,000 </td> <td style="text-align:left;"> 2017-04-01 </td> </tr> <tr> <td style="text-align:left;"> Airport West </td> <td style="text-align:right;"> 2 </td> <td 
style="text-align:left;"> Townhouse </td> <td style="text-align:right;"> 530,000 </td> <td style="text-align:left;"> 2017-04-01 </td> </tr> <tr> <td style="text-align:left;"> Airport West </td> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> Unit </td> <td style="text-align:right;"> 540,000 </td> <td style="text-align:left;"> 2017-04-01 </td> </tr> <tr> <td style="text-align:left;"> Airport West </td> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> Home </td> <td style="text-align:right;"> 715,000 </td> <td style="text-align:left;"> 2017-04-01 </td> </tr> <tr> <td style="text-align:left;"> Albanvale </td> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> Home </td> <td style="text-align:right;"> NA </td> <td style="text-align:left;"> 2017-04-01 </td> </tr> <tr> <td style="text-align:left;"> Albert Park </td> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> Home </td> <td style="text-align:right;"> 1,925,000 </td> <td style="text-align:left;"> 2017-04-01 </td> </tr> <tr> <td style="text-align:left;"> Albion </td> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> Unit </td> <td style="text-align:right;"> 515,000 </td> <td style="text-align:left;"> 2017-04-01 </td> </tr> <tr> <td style="text-align:left;"> Albion </td> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> Home </td> <td style="text-align:right;"> 717,000 </td> <td style="text-align:left;"> 2017-04-01 </td> </tr> <tr> <td style="text-align:left;"> Alphington </td> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> Home </td> <td style="text-align:right;"> 1,675,000 </td> <td style="text-align:left;"> 2017-04-01 </td> </tr> <tr> <td style="text-align:left;"> Alphington </td> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> Home </td> <td style="text-align:right;"> 2,008,000 </td> <td style="text-align:left;"> 2017-04-01 </td> </tr> <tr> <td 
style="text-align:left;"> Altona </td> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> Home </td> <td style="text-align:right;"> 860,000 </td> <td style="text-align:left;"> 2017-04-01 </td> </tr> <tr> <td style="text-align:left;"> Altona Meadows </td> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> Home </td> <td style="text-align:right;"> NA </td> <td style="text-align:left;"> 2017-04-01 </td> </tr> <tr> <td style="text-align:left;"> Altona North </td> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> Home </td> <td style="text-align:right;"> 720,000 </td> <td style="text-align:left;"> 2017-04-01 </td> </tr> <tr> <td style="text-align:left;"> Armadale </td> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> Unit </td> <td style="text-align:right;"> 836,000 </td> <td style="text-align:left;"> 2017-04-01 </td> </tr> <tr> <td style="text-align:left;"> Armadale </td> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> Home </td> <td style="text-align:right;"> 2,110,000 </td> <td style="text-align:left;"> 2017-04-01 </td> </tr> <tr> <td style="text-align:left;"> Armadale </td> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> Home </td> <td style="text-align:right;"> 1,386,000 </td> <td style="text-align:left;"> 2017-04-01 </td> </tr> </tbody> </table> ] .w-50.pl3[ * This data was scraped each week from domain.com.au from 2016-01-28 to 2018-10-13 * In total there are **63,023** observations * All variables shown .f5[(there are more variables not shown here)], except price, have complete records * There are **48,433** recorded property prices across Melbourne (roughly 23% of prices are missing) {{content}} ]] .footnote.f5[ Data source: Tony Pino (2018) Melbourne Housing Market, Version 27. Retrieved August 2021 from https://www.kaggle.com/anthonypino/melbourne-housing-market. ] -- **How would you explore this data first?**
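
One possible first step is to quantify how much of the price variable is missing. The arithmetic behind the "roughly 23%" figure, using only the counts quoted on this slide, can be sketched as follows. With the data read into a data frame (for instance one named `df2`, which is an assumption at this point), `mean(is.na(df2$Price))` would give the same proportion directly.

```r
n_total    <- 63023  # total observations (quoted on the slide)
n_recorded <- 48433  # observations with a recorded price (quoted on the slide)

n_missing   <- n_total - n_recorded       # number of missing prices
pct_missing <- 100 * n_missing / n_total  # percentage of prices missing

n_missing
pct_missing
```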
--- # .orange[Case study] .bg-orange.circle.white[2] Melbourne Housing Prices .f4[Part 2/5] .panelset[ .panel[.panel-name[📊] .flex[ .w-50[ Observations arranged by Suburb and Date: <img src="images/week5A/melb-house-plot-miss-1.png" width="432" style="display: block; margin: auto;" /> ] .w-50[ Comparing distribution of room number for observations with missing and non-missing price records: <img src="images/week5A/melb-house-plot-room-miss-1.png" width="576" style="display: block; margin: auto;" /> {{content}} ]]] .panel[.panel-name[data] .scroll-sign[ .f5.s400[ ```r df2 <- read_csv(here::here("data/MELBOURNE_HOUSE_PRICES_LESS.csv"), col_types = cols( .default = col_character(), Rooms = col_double(), Price = col_double(), Date = col_date(format = "%d/%m/%Y"), Propertycount = col_double(), Distance = col_double() ) ) ``` ```r skimr::skim(df2) ``` ``` ## ── Data Summary ──────────────────────── ## Values ## Name df2 ## Number of rows 63023 ## Number of columns 13 ## _______________________ ## Column type frequency: ## character 8 ## Date 1 ## numeric 4 ## ________________________ ## Group variables None ## ## ── Variable type: character ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate min max empty n_unique whitespace ## 1 Suburb 0 1 3 18 0 380 0 ## 2 Address 0 1 7 27 0 57754 0 ## 3 Type 0 1 1 1 0 3 0 ## 4 Method 0 1 1 2 0 9 0 ## 5 SellerG 0 1 1 27 0 476 0 ## 6 Postcode 0 1 4 4 0 225 0 ## 7 Regionname 0 1 16 26 0 8 0 ## 8 CouncilArea 0 1 17 30 0 34 0 ## ## ── Variable type: Date ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate min max median n_unique ## 1 Date 0 1 2016-01-28 2018-10-13 2017-09-03 112 ## ## ── Variable type: numeric 
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 Rooms 0 1 3.11 0.958 1 3 3 4 31 ▇▁▁▁▁ ## 2 Price 14590 0.768 997898. 593499. 85000 620000 830000 1220000 11200000 ▇▁▁▁▁ ## 3 Propertycount 0 1 7618. 4424. 39 4380 6795 10412 21650 ▅▇▅▂▁ ## 4 Distance 0 1 12.7 7.59 0 7 11.4 16.7 64.1 ▇▆▁▁▁ ``` ]]] .panel[.panel-name[R] .f5[ ```r df2 %>% select(Suburb, Rooms, Type, Price, Date) %>% arrange(Suburb, Date) %>% visdat::vis_miss() df2 %>% mutate(miss = ifelse(is.na(Price), "Missing", "Recorded")) %>% count(Rooms, miss) %>% group_by(miss) %>% mutate(perc = n / sum(n) * 100) %>% ggplot(aes(as.factor(Rooms), perc, fill = miss)) + geom_col(position = "dodge") + scale_fill_viridis_d(begin=0.3, end=0.7) + labs(x = "Rooms", y = "Percentage", fill = "Price") ``` ]] .panel[.panel-name[lineup] <img src="images/week5A/melb-house-lineup-1.png" width="80%" style="display: block; margin: auto;" /> ] .panel[.panel-name[R] .f5[ ```r library(nullabor) df2_d <- df2 %>% mutate(miss = ifelse(is.na(Price), "Missing", "Recorded")) %>% select(Rooms, miss) df2_l <- lineup(null_permute("miss"), df2_d, n=10, pos=7) df2_l_agg <- df2_l %>% group_by(.sample) %>% count(Rooms, miss) %>% group_by(.sample, miss) %>% mutate(perc = n / sum(n) * 100) %>% mutate(Rooms = as.factor(Rooms)) ggplot(df2_l_agg, aes(x=Rooms, y=perc, fill = miss)) + geom_col(position = "dodge") + scale_fill_viridis_d(begin=0.3, end=0.7) + facet_wrap(~.sample, ncol=5) + theme(legend.position = "none", axis.text = element_blank(), axis.title = element_blank(), panel.grid.major.x = element_blank()) ``` ]]] -- * Nothing very notable stands out, so the missingness seems okay, but check with a <text style="color: #D93F00;"> lineup </text> * What next? 
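
The idea behind the lineup can be sketched without `nullabor`: permuting the missingness labels breaks any association with the number of rooms, so summaries of permuted data show what "no relationship" looks like. The data below are simulated for illustration (hypothetical, not the Melbourne data), and `room_perc()` is an illustrative helper.

```r
set.seed(2021)

# Simulated stand-in: room counts and a missingness label that is
# independent of rooms by construction (hypothetical data).
toy <- data.frame(
  Rooms = sample(1:5, 500, replace = TRUE),
  miss  = sample(c("Missing", "Recorded"), 500,
                 replace = TRUE, prob = c(0.25, 0.75))
)

# Percentage of each room count within each missingness group
room_perc <- function(d) {
  100 * prop.table(table(d$miss, d$Rooms), margin = 1)
}

observed <- room_perc(toy)

# One null data set: shuffle the labels, keep Rooms fixed
null_toy  <- transform(toy, miss = sample(miss))
null_perc <- room_perc(null_toy)

round(observed, 1)
round(null_perc, 1)
```

If the observed percentages look no different from those of the permuted data, there is little evidence that missingness depends on the number of rooms.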
--- # .orange[Case study] .bg-orange.circle.white[2] Melbourne Housing Prices .f4[Part 3/5] .panelset[ .panel[.panel-name[📊] .flex[ .w-50[ <img src="images/week5A/melb-house-price-plot1-1.png" width="432" style="display: block; margin: auto;" /> ] .w-50[ **What can we say from this plot?** {{content}} ]]] .panel[.panel-name[data] .f5[ ```r df2 <- read_csv(here::here("data/MELBOURNE_HOUSE_PRICES_LESS.csv"), col_types = cols( .default = col_character(), Rooms = col_double(), Price = col_double(), Date = col_date(format = "%d/%m/%Y"), Propertycount = col_double(), Distance = col_double() ) ) ``` ]] .panel[.panel-name[R] .f5[ ```r df2 %>% ggplot(aes(Price / 1e6)) + geom_histogram(color = "white") + labs( x = "Price ($1,000,000)", y = "Count" ) ``` ]]] -- * The housing prices are right-skewed {{content}} -- * There appear to be many outlying housing prices (how can we tell?) --- # .orange[Case study] .bg-orange.circle.white[2] Melbourne Housing Prices .f4[Part 4/5] .panelset[ .panel[.panel-name[📊] .flex[ .w-50[ <img src="images/week5A/melb-house-price-plot2-1.png" width="432" style="display: block; margin: auto;" /> ] .w-50[ {{content}} ]]] .panel[.panel-name[data] .f5[ ```r df2 <- read_csv(here::here("data/MELBOURNE_HOUSE_PRICES_LESS.csv"), col_types = cols( .default = col_character(), Rooms = col_double(), Price = col_double(), Date = col_date(format = "%d/%m/%Y"), Propertycount = col_double(), Distance = col_double() ) ) ``` ]] .panel[.panel-name[R] .f5[ ```r df2 %>% ggplot(aes(Price / 1e6)) + geom_histogram(color = "white") + labs( x = "Price ($1,000,000)", y = "Count" ) + scale_x_log10() ``` ]]] -- * The x-axis has been `\(\log_{10}\)`-transformed in this plot {{content}} -- * The plot appears more symmetrical now {{content}} -- * What is a measure of central tendency here? 
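
The robust summaries compared in this case study can be sketched in base R. The trimming and winsorising fraction behind the slide's figures is not stated, so the 10% in each tail used below is an assumption. `winsor_mean()` is a hypothetical helper (one common quantile-based definition), and the prices are simulated rather than taken from the Melbourne data.

```r
set.seed(1)
# Simulated right-skewed "prices" (hypothetical data)
price <- rlnorm(1000, meanlog = log(8e5), sdlog = 0.5)

# Winsorised mean: values beyond the chosen quantiles are pulled in
# to those quantiles before averaging (assumed 10% in each tail).
winsor_mean <- function(x, trim = 0.1) {
  q <- quantile(x, c(trim, 1 - trim), na.rm = TRUE)
  mean(pmin(pmax(x, q[1]), q[2]), na.rm = TRUE)
}

summaries <- c(
  mean    = mean(price),
  median  = median(price),
  trimmed = mean(price, trim = 0.1),  # drops 10% from each tail
  winsor  = winsor_mean(price)
)

# Same statistics on the log scale, back-transformed with exp()
log_summaries <- c(
  mean    = exp(mean(log(price))),    # this is the geometric mean
  median  = exp(median(log(price))),
  trimmed = exp(mean(log(price), trim = 0.1)),
  winsor  = exp(winsor_mean(log(price)))
)

round(rbind(summaries, log_summaries))
```

For right-skewed data, the plain mean is pulled towards the long tail, while the median, the trimmed and winsorised means, and the back-transformed log-scale mean are less affected.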
<span class='f4'>With no transformation:</span> <table class=" lightable-classic" style='font-family: "Arial Narrow", "Source Sans Pro", sans-serif; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> Median </th> <th style="text-align:right;"> Trimmed Mean </th> <th style="text-align:right;"> Winsorised Mean </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> $997,898 </td> <td style="text-align:right;"> $830,000 </td> <td style="text-align:right;"> $871,375 </td> <td style="text-align:right;"> $903,823 </td> </tr> </tbody> </table> <span class='f4'>With log transformation (and back transformed to original scale):</span> <table class=" lightable-classic" style='font-family: "Arial Narrow", "Source Sans Pro", sans-serif; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> Median </th> <th style="text-align:right;"> Trimmed Mean </th> <th style="text-align:right;"> Winsorised Mean </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> $874,166 </td> <td style="text-align:right;"> $830,000 </td> <td style="text-align:right;"> $847,973 </td> <td style="text-align:right;"> $859,325 </td> </tr> </tbody> </table> --- class: transition # Multi-modality --- # .orange[Case study] .bg-orange.circle.white[2] Melbourne Housing Prices .f4[Part 5/5] .panelset[ .panel[.panel-name[📊] .flex[ .w-50[ <img src="images/week5A/melb-house-by-room-1.png" width="432" style="display: block; margin: auto;" /> ] .w-50[ {{content}} ]]] .panel[.panel-name[data] .f5.s500[ ```r df2 <- read_csv(here::here("data/MELBOURNE_HOUSE_PRICES_LESS.csv"), col_types = cols( .default = col_character(), Rooms = col_double(), Price = col_double(), Date = col_date(format = "%d/%m/%Y"), Propertycount = col_double(), Distance = col_double() ) ) ``` ]] .panel[.panel-name[R] .f5[ ```r df2 %>% ggplot(aes(x=as.factor(Rooms), y=Price / 1e6, )) + 
ggbeeswarm::geom_quasirandom(varwidth=TRUE, alpha=0.3) + scale_y_log10() + labs(y = "Price ($1,000,000)", x = "# of Rooms") ``` ]]] -- * Drawing separate univariate plots for each number of rooms shows that properties with more rooms are generally pricier * You could not see this, however, when the data were combined <img src="images/week5A/melb-house-price-plot2-1.png" width="432" style="display: block; margin: auto;" /> --- class: transition # Bins and Bandwidths: More details --- # .orange[Case study] .circle.bg-orange.white[3] Boston housing data .f4[Part 1/4] .panelset[ .panel[.panel-name[📊] .grid[ .item[ <img src="images/week5A/boston-plot1-1.png" width="460.8" style="display: block; margin: auto;" /> ] .item[ {{content}} ] ] ] .panel[.panel-name[data] .h300.f4.scroll-sign[ ```r data(bostonc, package = "DAAG") df3 <- read_tsv(I(bostonc[10:length(bostonc)])) skimr::skim(df3) ``` ``` ## ── Data Summary ──────────────────────── ## Values ## Name df3 ## Number of rows 506 ## Number of columns 21 ## _______________________ ## Column type frequency: ## character 2 ## numeric 19 ## ________________________ ## Group variables None ## ## ── Variable type: character ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate min max empty n_unique whitespace ## 1 TOWN 0 1 4 23 0 92 0 ## 2 TRACT 0 1 4 4 0 506 0 ## ## ── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 OBS. 0 1 254. 146. 1 127. 254. 380. 
506 ▇▇▇▇▇ ## 2 TOWN# 0 1 47.5 27.6 0 26.2 42 78 91 ▅▆▅▃▇ ## 3 LON 0 1 -71.1 0.0754 -71.3 -71.1 -71.1 -71.0 -70.8 ▁▂▇▂▁ ## 4 LAT 0 1 42.2 0.0618 42.0 42.2 42.2 42.3 42.4 ▁▃▇▃▁ ## 5 MEDV 0 1 22.5 9.20 5 17.0 21.2 25 50 ▂▇▅▁▁ ## 6 CMEDV 0 1 22.5 9.18 5 17.0 21.2 25 50 ▂▇▅▁▁ ## 7 CRIM 0 1 3.61 8.60 0.00632 0.0820 0.257 3.68 89.0 ▇▁▁▁▁ ## 8 ZN 0 1 11.4 23.3 0 0 0 12.5 100 ▇▁▁▁▁ ## 9 INDUS 0 1 11.1 6.86 0.46 5.19 9.69 18.1 27.7 ▇▆▁▇▁ ## 10 CHAS 0 1 0.0692 0.254 0 0 0 0 1 ▇▁▁▁▁ ## 11 NOX 0 1 0.555 0.116 0.385 0.449 0.538 0.624 0.871 ▇▇▆▅▁ ## 12 RM 0 1 6.28 0.703 3.56 5.89 6.21 6.62 8.78 ▁▂▇▂▁ ## 13 AGE 0 1 68.6 28.1 2.9 45.0 77.5 94.1 100 ▂▂▂▃▇ ## 14 DIS 0 1 3.80 2.11 1.13 2.10 3.21 5.19 12.1 ▇▅▂▁▁ ## 15 RAD 0 1 9.55 8.71 1 4 5 24 24 ▇▂▁▁▃ ## 16 TAX 0 1 408. 169. 187 279 330 666 711 ▇▇▃▁▇ ## 17 PTRATIO 0 1 18.5 2.16 12.6 17.4 19.0 20.2 22 ▁▃▅▅▇ ## 18 B 0 1 357. 91.3 0.32 375. 391. 396. 397. ▁▁▁▁▇ ## 19 LSTAT 0 1 12.7 7.14 1.73 6.95 11.4 17.0 38.0 ▇▇▅▂▁ ``` ]] .panel[.panel-name[R] .f5[ ```r ggplot(df3, aes(MEDV)) + geom_histogram(binwidth = 1, color = "black", fill = "#008A25") + labs(x = "Median housing value (US$1000)", y = "Frequency") ``` ]] ] .footnote.f6[ Harrison, David, and Daniel L. Rubinfeld (1978) Hedonic Housing Prices and the Demand for Clean Air, *Journal of Environmental Economics and Management* **5** 81-102. Original data.<br> Gilley, O.W. and R. Kelley Pace (1996) On the Harrison and Rubinfeld Data. *Journal of Environmental Economics and Management* **31** 403-405. Provided corrections and examined censoring.<br> Maindonald, John H. and Braun, W. John (2020). DAAG: Data Analysis and Graphics Data and Functions. R package version 1.24 ] -- * There is a large frequency in the final bin. * There is a decline in observations in the $40-49K range as well as dip in observations around $26K and $34K. -- * The histogram is using a bin width of 1 unit and is **left-open** (or **right-closed**): (4.5, 5.5], (5.5, 6.5] ... 
(49.5, 50.5], so that 5.5 falls in the lower bin, whereas right-open bins would place it in the upper bin. * Occasionally, whether the bins are **left-** or **right-open** can make a difference. Alternatively, you can **set the breaks** to control the value at which binning starts. --- # .orange[Case study] .circle.bg-orange.white[3] Boston housing data .f4[Part 2/4] .panelset[ .panel[.panel-name[📊] .grid[ .item[ <img src="images/week5A/boston-plot2-1.png" width="432" style="display: block; margin: auto;" /> ] .item[ * Density plots depend on the **bandwidth** chosen (the analogue of a histogram's bin width) and, more often than not, estimate poorly near the boundaries * There are various ways to present features of the data using a plot, and what works for one person may not be as straightforward for another * Be prepared to do multiple plots! ] ] ] .panel[.panel-name[data] .h300.f4.scroll-sign[ ```r data(bostonc, package = "DAAG") df3 <- read_tsv(I(bostonc[10:length(bostonc)])) skimr::skim(df3) ``` ``` ## ── Data Summary ──────────────────────── ## Values ## Name df3 ## Number of rows 506 ## Number of columns 21 ## _______________________ ## Column type frequency: ## character 2 ## numeric 19 ## ________________________ ## Group variables None ## ## ── Variable type: character ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate min max empty n_unique whitespace ## 1 TOWN 0 1 4 23 0 92 0 ## 2 TRACT 0 1 4 4 0 506 0 ## ## ── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 OBS. 0 1 254. 146. 1 127. 254. 380. 
506 ▇▇▇▇▇ ## 2 TOWN# 0 1 47.5 27.6 0 26.2 42 78 91 ▅▆▅▃▇ ## 3 LON 0 1 -71.1 0.0754 -71.3 -71.1 -71.1 -71.0 -70.8 ▁▂▇▂▁ ## 4 LAT 0 1 42.2 0.0618 42.0 42.2 42.2 42.3 42.4 ▁▃▇▃▁ ## 5 MEDV 0 1 22.5 9.20 5 17.0 21.2 25 50 ▂▇▅▁▁ ## 6 CMEDV 0 1 22.5 9.18 5 17.0 21.2 25 50 ▂▇▅▁▁ ## 7 CRIM 0 1 3.61 8.60 0.00632 0.0820 0.257 3.68 89.0 ▇▁▁▁▁ ## 8 ZN 0 1 11.4 23.3 0 0 0 12.5 100 ▇▁▁▁▁ ## 9 INDUS 0 1 11.1 6.86 0.46 5.19 9.69 18.1 27.7 ▇▆▁▇▁ ## 10 CHAS 0 1 0.0692 0.254 0 0 0 0 1 ▇▁▁▁▁ ## 11 NOX 0 1 0.555 0.116 0.385 0.449 0.538 0.624 0.871 ▇▇▆▅▁ ## 12 RM 0 1 6.28 0.703 3.56 5.89 6.21 6.62 8.78 ▁▂▇▂▁ ## 13 AGE 0 1 68.6 28.1 2.9 45.0 77.5 94.1 100 ▂▂▂▃▇ ## 14 DIS 0 1 3.80 2.11 1.13 2.10 3.21 5.19 12.1 ▇▅▂▁▁ ## 15 RAD 0 1 9.55 8.71 1 4 5 24 24 ▇▂▁▁▃ ## 16 TAX 0 1 408. 169. 187 279 330 666 711 ▇▇▃▁▇ ## 17 PTRATIO 0 1 18.5 2.16 12.6 17.4 19.0 20.2 22 ▁▃▅▅▇ ## 18 B 0 1 357. 91.3 0.32 375. 391. 396. 397. ▁▁▁▁▇ ## 19 LSTAT 0 1 12.7 7.14 1.73 6.95 11.4 17.0 38.0 ▇▇▅▂▁ ``` ]] .panel[.panel-name[R] .f4[ ```r library(patchwork) library(lvplot) library(ggbeeswarm) h1 <- ggplot(df3, aes(MEDV, y = "")) + geom_boxplot(fill = "#008A25") + labs(x = "Median housing value (US$1000)", y = "") + theme(axis.line.y = element_blank()) h2 <- ggplot(df3, aes(y=MEDV, x = 1)) + geom_lv(aes(fill = after_stat(LV))) + scale_fill_brewer() + xlim(c(0.4, 1.6)) + labs(y = "Median housing value (US$1000)", x = "") + theme(axis.line.x = element_blank(), axis.text.y = element_blank()) + coord_flip() h3 <- ggplot(df3, aes(y=MEDV, x = "")) + geom_quasirandom() + labs(y = "Median housing value (US$1000)", x = "") + theme(axis.line.x = element_blank()) + coord_flip() h4 <- ggplot(df3, aes(MEDV)) + geom_density() + geom_rug() + labs(x = "Median housing value (US$1000)", y = "") + theme(axis.line.y = element_blank()) (h1 + h3)/(h2 + h4) ``` ] ] ] --- # .orange[Case study] .circle.bg-orange.white[3] Boston housing data .f4[Part 3/4] .panelset[ .panel[.panel-name[📊] .grid[ .item[ <img src="images/week5A/boston-plot5-1.png" 
width="432" style="display: block; margin: auto;" /> <img src="images/week5A/boston-plot6-1.png" width="432" style="display: block; margin: auto;" /> <img src="images/week5A/boston-plot7-1.png" width="432" style="display: block; margin: auto;" /> ] .item[ <img src="images/week5A/boston-plot8-1.png" width="432" style="display: block; margin: auto;" /> <img src="images/week5A/boston-plot9-1.png" width="432" style="display: block; margin: auto;" /> <img src="images/week5A/boston-plot10-1.png" width="432" style="display: block; margin: auto;" /> ] ] ] .panel[.panel-name[data] .h300.f4.scroll-sign[ ```r data(bostonc, package = "DAAG") df3 <- read_tsv(I(bostonc[10:length(bostonc)])) skimr::skim(df3) ``` ``` ## ── Data Summary ──────────────────────── ## Values ## Name df3 ## Number of rows 506 ## Number of columns 21 ## _______________________ ## Column type frequency: ## character 2 ## numeric 19 ## ________________________ ## Group variables None ## ## ── Variable type: character ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate min max empty n_unique whitespace ## 1 TOWN 0 1 4 23 0 92 0 ## 2 TRACT 0 1 4 4 0 506 0 ## ## ── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 OBS. 0 1 254. 146. 1 127. 254. 380. 
506 ▇▇▇▇▇ ## 2 TOWN# 0 1 47.5 27.6 0 26.2 42 78 91 ▅▆▅▃▇ ## 3 LON 0 1 -71.1 0.0754 -71.3 -71.1 -71.1 -71.0 -70.8 ▁▂▇▂▁ ## 4 LAT 0 1 42.2 0.0618 42.0 42.2 42.2 42.3 42.4 ▁▃▇▃▁ ## 5 MEDV 0 1 22.5 9.20 5 17.0 21.2 25 50 ▂▇▅▁▁ ## 6 CMEDV 0 1 22.5 9.18 5 17.0 21.2 25 50 ▂▇▅▁▁ ## 7 CRIM 0 1 3.61 8.60 0.00632 0.0820 0.257 3.68 89.0 ▇▁▁▁▁ ## 8 ZN 0 1 11.4 23.3 0 0 0 12.5 100 ▇▁▁▁▁ ## 9 INDUS 0 1 11.1 6.86 0.46 5.19 9.69 18.1 27.7 ▇▆▁▇▁ ## 10 CHAS 0 1 0.0692 0.254 0 0 0 0 1 ▇▁▁▁▁ ## 11 NOX 0 1 0.555 0.116 0.385 0.449 0.538 0.624 0.871 ▇▇▆▅▁ ## 12 RM 0 1 6.28 0.703 3.56 5.89 6.21 6.62 8.78 ▁▂▇▂▁ ## 13 AGE 0 1 68.6 28.1 2.9 45.0 77.5 94.1 100 ▂▂▂▃▇ ## 14 DIS 0 1 3.80 2.11 1.13 2.10 3.21 5.19 12.1 ▇▅▂▁▁ ## 15 RAD 0 1 9.55 8.71 1 4 5 24 24 ▇▂▁▁▃ ## 16 TAX 0 1 408. 169. 187 279 330 666 711 ▇▇▃▁▇ ## 17 PTRATIO 0 1 18.5 2.16 12.6 17.4 19.0 20.2 22 ▁▃▅▅▇ ## 18 B 0 1 357. 91.3 0.32 375. 391. 396. 397. ▁▁▁▁▇ ## 19 LSTAT 0 1 12.7 7.14 1.73 6.95 11.4 17.0 38.0 ▇▇▅▂▁ ``` ]] .panel[.panel-name[R] .f4.scroll-sign[.s500[ ```r ggplot(df3, aes(PTRATIO)) + geom_histogram(fill = "#9651A0", color = "black", binwidth = 0.2) + labs( x = "Pupil-teacher ratio by town", y = "", title = "Bin width = 0.2, Left-open" ) ggplot(df3, aes(PTRATIO)) + geom_histogram(fill = "#9651A0", color = "black", binwidth = 0.5) + labs( x = "Pupil-teacher ratio by town", y = "", title = "Bin width = 0.5, Left-open" ) ggplot(df3, aes(PTRATIO)) + geom_histogram(fill = "#9651A0", color = "black", bins = 30) + labs( x = "Pupil-teacher ratio by town", y = "", title = "Bin number = 30, Left-open" ) ggplot(df3, aes(PTRATIO)) + geom_histogram(fill = "#9651A0", color = "black", binwidth = 0.2, closed = "left") + labs( x = "Pupil-teacher ratio by town", y = "", title = "Bin width = 0.2, Right-open" ) ggplot(df3, aes(PTRATIO)) + geom_histogram(fill = "#9651A0", color = "black", binwidth = 0.5, closed = "left") + labs( x = "Pupil-teacher ratio by town", y = "", title = "Bin width = 0.5, Right-open" ) ggplot(df3, aes(PTRATIO)) + 
geom_histogram( fill = "#9651A0", color = "black", bins = 30, closed = "left" ) + labs( x = "Pupil-teacher ratio by town", y = "", title = "Bin number = 30, Right-open" ) ``` ]]] ] --- # .orange[Case study] .circle.bg-orange.white[3] Boston housing data .f4[Part 4/4] .panelset[ .panel[.panel-name[📊] .grid[ .item[ <img src="images/week5A/boston-plotx-1.png" width="576" style="display: block; margin: auto;" /> ] .item.f4[ * CRIM: per capita crime rate by town * INDUS: proportion of non-retail business acres per town * NOX: nitrogen oxides concentration (parts per 10 million) * RM: average number of rooms per dwelling * AGE: proportion of owner-occupied units built prior to 1940 * DIS: weighted mean of distances to 5 Boston employment centres * RAD: index of accessibility to radial highways * TAX: full-value property tax rate per $10K * PTRATIO: pupil-teacher ratio by town * LSTAT: lower status of the population (%) * MEDV: median value of owner-occupied homes in $1000s ] ] ] .panel[.panel-name[data] .h300.f4.scroll-sign[ ```r df3long <- df3 %>% pivot_longer(MEDV:LSTAT, names_to = "var", values_to = "value" ) %>% filter(!var %in% c("CHAS", "B", "ZN")) skimr::skim(df3long) ``` ``` ## ── Data Summary ──────────────────────── ## Values ## Name df3long ## Number of rows 6072 ## Number of columns 8 ## _______________________ ## Column type frequency: ## character 3 ## numeric 5 ## ________________________ ## Group variables None ## ## ── Variable type: character ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate min max empty n_unique whitespace ## 1 TOWN 0 1 4 23 0 92 0 ## 2 TRACT 0 1 4 4 0 506 0 ## 3 var 0 1 2 7 0 12 0 ## ## ── Variable type: numeric 
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 OBS. 0 1 254. 146. 1 127 254. 380 506 ▇▇▇▇▇ ## 2 TOWN# 0 1 47.5 27.5 0 26 42 78 91 ▅▆▅▃▇ ## 3 LON 0 1 -71.1 0.0753 -71.3 -71.1 -71.1 -71.0 -70.8 ▁▂▇▂▁ ## 4 LAT 0 1 42.2 0.0617 42.0 42.2 42.2 42.3 42.4 ▁▃▇▃▁ ## 5 value 0 1 49.0 120. 0.00632 4 12.3 23.4 711 ▇▁▁▁▁ ``` ]] .panel[.panel-name[R] .f4[ ```r ggplot(df3long, aes(value)) + geom_histogram(color = "white") + facet_wrap(~var, scales = "free") + labs(x = "", y = "") + theme(axis.text = element_text(size = 12)) ``` ] ] ] --- # .orange[Case study] .circle.bg-orange.white[4] Hidalgo stamps thickness .panelset[ .panel[.panel-name[📊] <img src="images/week5A/hidalgo-plot-1.png" width="576" style="display: block; margin: auto;" /> * A stamp collector, Walton von Winkle, bought several collections of Mexican stamps from 1872-1874 and measured the thickness of all of them. * The different **bandwidths** used for the density plots suggest that there are either two or seven modes. 
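
The effect of bandwidth on the apparent number of modes can be sketched with base R's `density()`. The data below are simulated (a hypothetical two-component mixture, not the stamp measurements), and `count_modes()` is an illustrative helper rather than a function from any package.

```r
set.seed(42)
# Hypothetical bimodal sample
x <- c(rnorm(300, mean = 0.08, sd = 0.006),
       rnorm(150, mean = 0.10, sd = 0.006))

# Count local maxima of a kernel density estimate
count_modes <- function(x, bw) {
  d <- density(x, bw = bw)
  s <- sign(diff(d$y))
  sum(diff(s) < 0)  # slope changes from rising to falling
}

count_modes(x, bw = "nrd0")  # default rule-of-thumb bandwidth
count_modes(x, bw = "SJ")    # Sheather-Jones bandwidth selector
count_modes(x, bw = 0.0005)  # deliberately small: many spurious modes
count_modes(x, bw = 0.05)    # deliberately large: modes smoothed away
```

Because the answer changes with the bandwidth, it is worth trying several values before concluding how many modes the data support.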
] .panel[.panel-name[data] .h300.f4.scroll-sign[ ```r load(here::here("data/Hidalgo1872.rda")) skimr::skim(Hidalgo1872) ``` ``` ## ── Data Summary ──────────────────────── ## Values ## Name Hidalgo1872 ## Number of rows 485 ## Number of columns 3 ## _______________________ ## Column type frequency: ## numeric 3 ## ________________________ ## Group variables None ## ## ── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 thickness 0 1 0.0860 0.0150 0.06 0.075 0.08 0.098 0.131 ▅▇▃▂▁ ## 2 thicknessA 195 0.598 0.0922 0.0162 0.068 0.0772 0.092 0.105 0.131 ▇▃▆▃▂ ## 3 thicknessB 289 0.404 0.0768 0.00508 0.06 0.072 0.078 0.08 0.097 ▁▃▇▁▁ ``` ]] .panel[.panel-name[R] .f4[ ```r ggplot(Hidalgo1872, aes(thickness)) + geom_histogram(binwidth = 0.001, aes(y = stat(density)), fill="grey80") + labs(x = "Thickness (0.001 mm)", y = "Density") + geom_density(color = "#E16A86", size = 2) + geom_density(color = "#00AD9A", size = 2, bw = "SJ") ``` ] ] ] --- class: transition # Focus --- # .orange[Case study] .circle.bg-orange.white[5] Movie length .panelset[ .panel[.panel-name[📊] .grid[ .item[ <img src="images/week5A/movies-plot1-1.png" width="432" style="display: block; margin: auto;" /> <img src="images/week5A/movies-plot2-1.png" width="381.6" style="display: block; margin: auto;" /> ] .item[ {{content}} ] ] ] .panel[.panel-name[data] .h300.f4.scroll-sign[ ```r data(movies, package = "ggplot2movies") skimr::skim(movies) ``` ``` ## ── Data Summary ──────────────────────── ## Values ## Name movies ## Number of rows 58788 ## Number of columns 24 ## _______________________ ## Column type frequency: ## character 2 ## numeric 22 ## ________________________ ## Group variables None ## ## ── Variable type: character 
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate min max empty n_unique whitespace ## 1 title 0 1 1 121 0 56007 0 ## 2 mpaa 0 1 0 5 53864 5 0 ## ## ── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 year 0 1 1976. 23.7 1893 1958 1983 1997 2005 ▁▁▃▃▇ ## 2 length 0 1 82.3 44.3 1 74 90 100 5220 ▇▁▁▁▁ ## 3 budget 53573 0.0887 13412513. 23350085. 0 250000 3000000 15000000 200000000 ▇▁▁▁▁ ## 4 rating 0 1 5.93 1.55 1 5 6.1 7 10 ▁▃▇▆▁ ## 5 votes 0 1 632. 3830. 5 11 30 112 157608 ▇▁▁▁▁ ## 6 r1 0 1 7.01 10.9 0 0 4.5 4.5 100 ▇▁▁▁▁ ## 7 r2 0 1 4.02 5.96 0 0 4.5 4.5 84.5 ▇▁▁▁▁ ## 8 r3 0 1 4.72 6.45 0 0 4.5 4.5 84.5 ▇▁▁▁▁ ## 9 r4 0 1 6.37 7.59 0 0 4.5 4.5 100 ▇▁▁▁▁ ## 10 r5 0 1 9.80 9.73 0 4.5 4.5 14.5 100 ▇▁▁▁▁ ## 11 r6 0 1 13.0 11.0 0 4.5 14.5 14.5 84.5 ▇▂▁▁▁ ## 12 r7 0 1 15.5 11.6 0 4.5 14.5 24.5 100 ▇▃▁▁▁ ## 13 r8 0 1 13.9 11.3 0 4.5 14.5 24.5 100 ▇▃▁▁▁ ## 14 r9 0 1 8.95 9.44 0 4.5 4.5 14.5 100 ▇▁▁▁▁ ## 15 r10 0 1 16.9 15.7 0 4.5 14.5 24.5 100 ▇▃▁▁▁ ## 16 Action 0 1 0.0797 0.271 0 0 0 0 1 ▇▁▁▁▁ ## 17 Animation 0 1 0.0628 0.243 0 0 0 0 1 ▇▁▁▁▁ ## 18 Comedy 0 1 0.294 0.455 0 0 0 1 1 ▇▁▁▁▃ ## 19 Drama 0 1 0.371 0.483 0 0 0 1 1 ▇▁▁▁▅ ## 20 Documentary 0 1 0.0591 0.236 0 0 0 0 1 ▇▁▁▁▁ ## 21 Romance 0 1 0.0807 0.272 0 0 0 0 1 ▇▁▁▁▁ ## 22 Short 0 1 0.161 0.367 0 0 0 0 1 ▇▁▁▁▂ ``` ]] .panel[.panel-name[R] .f4[ ```r ggplot(movies, aes(length)) + geom_histogram(color = "white") + labs(x = "Length of movie (minutes)", y = "Frequency") ggplot(movies, aes(length)) + geom_histogram(color = "white") + labs(x = "Length of movie (minutes)", y = "Frequency") + scale_x_log10() movies %>% filter(length < 
180) %>% ggplot(aes(length)) + geom_histogram(binwidth = 1, fill = "#795549", color = "black") + labs(x = "Length of movie (minutes)", y = "Frequency") ``` ] ] ] -- * Upon further exploration, you can find that the three movies that are well over 16 hours long are "<i>Cure for Insomnia</i>", "<i>Four Stars</i>", and "<i>Longest Most Meaningless Movie in the World</i>" {{content}} -- * We can restrict our attention to films under 3 hours: <img src="images/week5A/movies-plot3-1.png" width="648" style="display: block; margin: auto;" /> {{content}} -- * Notice that there are peaks at particular lengths. Why do you think so? --- # Take away messages .flex[ .w-70.f2[ <ul class="fa-ul"> {{content}} </ul> ] ] -- <li><span class="fa-li"><i class="fas fa-paper-plane"></i></span>Numerical and graphical summaries can reveal, but also hide, aspects of data</li> {{content}} -- <li><span class="fa-li"><i class="fas fa-paper-plane"></i></span><b>Do many numerical and graphical summaries of the data!</b></li> --- # Resources and Acknowledgement - Slides originally written by Emi Tanaka and constructed with [`xaringan`](https://github.com/yihui/xaringan), [remark.js](https://remarkjs.com), [`knitr`](http://yihui.name/knitr), and [R Markdown](https://rmarkdown.rstudio.com). --- background-size: cover class: title-slide background-image: url("images/bg-01.png") <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. .bottom_abs.width100[ Lecturer: *Di Cook* <i class="fas fa-envelope"></i> ETC5521.Clayton-x@monash.edu <i class="fas fa-calendar-alt"></i> Week 5 - Session 1 <br> ]