class: middle center hide-slide-number monash-bg-gray80

.info-box.w-50.bg-white[
These slides are best viewed in Chrome or Firefox and occasionally need to be refreshed if elements did not load properly. See <a href=lecture-03A.pdf>here for the PDF <i class="fas fa-file-pdf"></i></a>.
]

<br>

.white[Press the **right arrow** to progress to the next slide!]

---

class: title-slide
count: false
background-image: url("images/bg-12.png")

# .monash-blue[ETC5521: Exploratory Data Analysis]

<h1 class="monash-blue" style="font-size: 30pt!important;"></h1>

<br>

<h2 style="font-weight:900!important;">Initial data analysis and model diagnostics</h2>

.bottom_abs.width100[
Lecturer: *Di Cook*

<i class="fas fa-envelope"></i> ETC5521.Clayton-x@monash.edu

<i class="fas fa-calendar-alt"></i> Week 3 - Session 1

<br>
]

---

# Initial Data Analysis and Confirmatory Analysis

.flex[
.w-50[
.info-box.w-100[
Prior to conducting a confirmatory data analysis, it is important to conduct an _initial data analysis_.
]
<br>

* .monash-blue2[Confirmatory data analysis] is focused on statistical inference and includes procedures for:
  * hypothesis testing,
  * predictive modelling,
  * parameter estimation including uncertainty,
  * model selection.
]
--
.w-50[
.w-80[
* .monash-orange2[Initial data analysis] includes:
  * describing the data and collection procedures
  * scrutinising the data for errors, outliers and missing observations
  * checking that the assumptions needed for the confirmatory data analysis hold
]

.info-box.w-80[Initial data analysis is related to exploratory data analysis in the sense that it is primarily conducted graphically, and tends to rely on subjective assessment.]
]
]

---

# Taxonomies are useful but rarely perfect

<br><br>

.w-80[
* Some people would be practising IDA without realising that it is IDA.
* Sometimes a different name is used to describe the same process, such as Chatfield (1985) referring to IDA as the **_"initial examination of data"_** and Cox & Snell (1981) as **_"preliminary data analysis"_**.
* Some people inadvertently confuse EDA with IDA. IDA should be practised without compromising the confirmatory data analysis.
]

.footnote.f5[
Chatfield (1985) The Initial Examination of Data. *Journal of the Royal Statistical Society. Series A (General)* **148** <br>
Cox & Snell (1981) Applied Statistics. *London: Chapman and Hall.*
]

---

# What is IDA?

.info-box[
The .monash-blue[**main objective for IDA**] is to intercept any problems in the data that might adversely affect the confirmatory data analysis.
]

--

.w-60[
* **_IDA differs from the main (confirmatory) analysis_** (i.e. usually fitting the model, conducting significance tests, making inferences or predictions).

{{content}}
]

--

* **_IDA is often unreported_** in data analysis reports or scientific papers, for various reasons. It might not have been done, or it may have been conducted but there was no space in the paper to report on it.

{{content}}

--

* The role of **_the main (confirmatory) analysis is to answer the intended question(s) that the data were collected for_**.

---

# Where IDA fits

.blockquote[
... a **statistical value chain** is constructed by defining a number of meaningful intermediate data products, for which a chosen set of quality attributes are well described ...
.pull-right[— van der Loo & de Jonge (2018)]
]

.center[
<img src="images/stats-value-chain.png">
]

.footnote.f4[
Data scientists in government perspective
]

---

# Where IDA fits

.center[
<img src="images/huebner.png" width="70%">
]

Huebner et al (2018) list six steps of IDA: (1) Metadata setup, (2) .monash-orange2[Data cleaning], (3) .monash-orange2[Data screening], (4) Initial reporting, (5) Refining and updating the analysis plan, (6) Reporting IDA in documentation.

.footnote.f4[
Health and medical research perspective
]

---

class: middle

.w-70[
# Next we'll see some _illustrative .blue[examples]_ and _.orange[cases]_.

<br>

* Note that there are a variety of ways to do IDA & EDA, and different procedures might produce the same decision or conclusion. You don't need to adhere strictly to what we show you, but following the principles described is important.
]

---

# .circle.bg-black.white[1] Data Screening .f4[Part 1/3]

* Aside from checking the _data structure_ or _data quality_, it's important to check how the data are understood by the computer, i.e. to check the _data type_ of each variable. E.g.,
  * Was the date read in as character?
  * Was a factor read in as numeric?

* Also important for making inference is knowing whether the data support making broader conclusions. How was the data collected? Is it clear what the population of interest is, and that the data is a representative sample?

---

class: font_smaller

# .blue[Example] .circle.bg-blue.white[1] Checking the data type .f4[Part 1/2]

.grid[
.item[

`lecture3-example.xlsx`

<center>
<img src="images/lecture3-example.png" width = "400px">
</center>

]
.item.pl2[

```r
library(readxl)
library(here)
df <- read_excel(here("data/lecture3-example.xlsx"))
df
```

```
## # A tibble: 5 × 4
##      id date                loc       temp
##   <dbl> <dttm>              <chr>    <dbl>
## 1     1 2010-01-03 00:00:00 New York  42  
## 2     2 2010-02-03 00:00:00 New York  41.4
## 3     3 2010-03-03 00:00:00 New York  38.5
## 4     4 2010-04-03 00:00:00 New York  41.1
## 5     5 2010-05-03 00:00:00 New York  39.8
```

Any issues here?

]
]

---

# .blue[Example] .circle.bg-blue.white[1] Checking the data type .f4[Part 2/2]

.grid[
.item[

```r
library(lubridate)
df %>%
  mutate(id = as.factor(id),
         day = day(date),
         month = month(date),
         year = year(date)) %>%
  select(-date)
```

```
## # A tibble: 5 × 6
##   id    loc       temp   day month  year
##   <fct> <chr>    <dbl> <int> <dbl> <dbl>
## 1 1     New York  42       3     1  2010
## 2 2     New York  41.4     3     2  2010
## 3 3     New York  38.5     3     3  2010
## 4 4     New York  41.1     3     4  2010
## 5 5     New York  39.8     3     5  2010
```

]
.item[

* `id` is now a `factor` instead of `integer`
* `day`, `month` and `year` are now extracted from the `date`
* Is it okay now?

{{content}}

]
]

--

* In the United States, it's common to use the date format MM/DD/YYYY <a class="font_small black" href="https://twitter.com/statsgen/status/1257959369448161281">(gasps)</a> while the rest of the world commonly uses DD/MM/YYYY or YYYY/MM/DD.

{{content}}

--

* It's highly probable that the dates are the 1st-5th of March and not the 3rd of Jan-May.

{{content}}

--

* You can validate this with other variables, say the temperature [here](https://www.wunderground.com/history/monthly/us/ny/new-york-city/KLGA/date/2010-3).

---

# .blue[Example] .circle.bg-blue.white[1] Checking the data type with R .f4[Part 1/3]

* You can robustify your workflow by ensuring you have a check for the expected data type in your code.
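One simple approach is an explicit assertion straight after import. A minimal sketch (not in the original code, using base `stopifnot()` on the `df` read in earlier) that fails fast when a column arrives with an unexpected type:

.f4[

```r
# Assumes `df` from the earlier read_excel() call; errors if any
# column was read in with an unexpected type
stopifnot(
  is.character(df$loc),                     # location should be text
  is.numeric(df$temp),                      # temperature should be numeric
  inherits(df$date, c("Date", "POSIXct")))  # date should be a date/date-time
```

]

Declaring the expected types at import time, as below, achieves the same goal more directly: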
.f4[

```r
xlsx_df <- read_excel(here("data/lecture3-example.xlsx"),
                      col_types = c("text", "date", "text", "numeric")) %>%
  mutate(id = as.factor(id),
         date = as.character(date),
         date = as.Date(date, format = "%Y-%d-%m"))
```

]

* `read_csv` has broader support for `col_types`:

.f4[

```r
csv_df <- read_csv(here("data/lecture3-example.csv"),
                   col_types = cols(id = col_factor(),
                                    date = col_date(format = "%m/%d/%y"),
                                    loc = col_character(),
                                    temp = col_double()))
```

]

* The checks (or coercions) ensure that even if the data are updated, you can have some confidence that any data type error will be picked up before further analysis.

---

# .blue[Example] .circle.bg-blue.white[1] Checking the data type with R .f4[Part 2/3]

You can have a quick glimpse of the data types with:

.f4[

```r
dplyr::glimpse(xlsx_df)
```

```
## Rows: 5
## Columns: 4
## $ id   <fct> 1, 2, 3, 4, 5
## $ date <date> 2010-03-01, 2010-03-02, 2010-03-03, 2010-03-04, 2010-03-05
## $ loc  <chr> "New York", "New York", "New York", "New York", "New York"
## $ temp <dbl> 42.0, 41.4, 38.5, 41.1, 39.8
```

```r
dplyr::glimpse(csv_df)
```

```
## Rows: 5
## Columns: 4
## $ id   <fct> 1, 2, 3, 4, 5
## $ date <date> 2010-03-01, 2010-03-02, 2010-03-03, 2010-03-04, 2010-03-05
## $ loc  <chr> "New York", "New York", "New York", "New York", "New York"
## $ temp <dbl> 42.0, 41.4, 38.5, 41.1, 39.8
```

]

---

# .blue[Example] .circle.bg-blue.white[1] Checking the data type with R .f4[Part 3/3]

You can also visualise the data types with:

.grid[.item.br[

```r
library(visdat)
vis_dat(xlsx_df)
```

<img src="images/lecture-03A/unnamed-chunk-8-1.png" width="432" style="display: block; margin: auto;" />

]
.item[

```r
library(inspectdf)
inspect_types(xlsx_df) %>%
  show_plot()
```

<img src="images/lecture-03A/unnamed-chunk-9-1.png" width="432" style="display: block; margin: auto;" />

]
]

---

# .circle.bg-black.white[2] Data Cleaning .f4[Part 2/3]

.w-70[
* Data quality checks should be one of the first steps in the data analysis to **_assess any problems with the data_**.

{{content}}
]

--

* This is sometimes referred to as **_data sniffing_** or **_data scrutinizing_**.

{{content}}

--

* These checks include using common or domain knowledge to assess whether the recorded data have sensible values. E.g.,

{{content}}

--

  * Are values that should be positive, e.g. height and weight, recorded as positive values within a plausible range?

{{content}}

--

  * If the data are counts, do the recorded values contain non-integer values?

{{content}}

--

  * For compositional data, do the values add up to 100% (or 1)? If not, is that a measurement error, is it due to rounding, or is another variable missing?

{{content}}

--

  * Does the data contain only positive cases, e.g. disease occurrences or warranty claims? If so, what would the no-report group look like?

---

# .circle.bg-black.white[2] Data Cleaning .f4[Part 2/3]

.w-70[
* In addition, numerical or graphical summaries may reveal that there is unwanted structure in the data. E.g.,
  * Does the treatment group have different demographic characteristics to the control group?
  * Does the distribution of the data imply violations of assumptions for the main analysis?

{{content}}
]

--

* *Data scrutinizing* is a process that you get better at with practice and familiarity with the domain area.
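A few of these checks are easy to code up directly. A minimal sketch, assuming a hypothetical `survey` data frame with a `height` (cm), a `count`, and composition columns `p1`-`p3`:

.f4[

```r
# Count how many records break each common-sense rule; all should be 0.
# `survey`, `height`, `count` and `p1`-`p3` are hypothetical names.
survey %>%
  summarise(
    negative_height   = sum(height <= 0, na.rm = TRUE),
    non_integer_count = sum(count != round(count), na.rm = TRUE),
    bad_composition   = sum(abs(p1 + p2 + p3 - 1) > 1e-8, na.rm = TRUE))
```

]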
---

# .blue[Example] .circle.bg-blue.white[2] Checking the data quality

.grid[
.item.f4[

```r
df2 <- read_csv(here("data/lecture3-example2.csv"),
                col_types = cols(id = col_factor(),
                                 date = col_date(format = "%m/%d/%y"),
                                 loc = col_character(),
                                 temp = col_double()))
df2
```

```
## # A tibble: 9 × 4
##   id    date       loc        temp
##   <fct> <date>     <chr>     <dbl>
## 1 1     2010-03-01 New York   42  
## 2 2     2010-03-02 New York   41.4
## 3 3     2010-03-03 New York   38.5
## 4 4     2010-03-04 New York   41.1
## 5 5     2010-03-05 New York   39.8
## 6 6     2020-03-01 Melbourne  30.6
## 7 7     2020-03-02 Melbourne  17.9
## 8 8     2020-03-03 Melbourne  18.6
## 9 9     2020-03-04 <NA>       21.3
```

]
.item[

* Numerical or graphical summaries, or even just eye-balling the data, help to uncover some data quality issues.
* Any issues here?

{{content}}

]
]

--

<br><br>

* There's a missing value in `loc`.
* Temperature is in Fahrenheit for New York but Celsius in Melbourne (you can validate this again using external sources).

---

# .orange[Case study] .circle.bg-orange.white[1] World development indicators .f4[Part 1/3]

.flex[
.w-70[
<img src="images/lecture-03A/unnamed-chunk-11-1.png" width="80%" style="display: block; margin: auto;" />
]
.w-30[
<br><br>
- What are the data types?
- How are missings distributed?
- Which variables have insufficient values to analyse further?
]
]

.footnote[World Development Indicators (WDI), sourced from the [World Bank Group (2019)](https://databank.worldbank.org/source/world-development-indicators/)]

---

# .orange[Case study] .circle.bg-orange.white[1] World development indicators .f4[Part 2/3]

.flex[
.w-50[
<img src="images/lecture-03A/unnamed-chunk-12-1.png" width="80%" style="display: block; margin: auto;" />
]
.w-50[
<br><br>
`en_pop_dnst` = Population density (people per sq. km of land area)

`sp_urb_grow` = Urban population growth (annual %)

<br><br>
- How are missings distributed?
- Is there a relationship between population density and urban growth? Is there a better way to plot this to see the relationship?
]
]

---

# .orange[Case study] .circle.bg-orange.white[1] World development indicators .f4[Part 3/3]

.flex[
.w-50[
<img src="images/lecture-03A/unnamed-chunk-13-1.png" width="80%" style="display: block; margin: auto;" />
]
.w-50[
<br><br>
`en_pop_dnst` = Population density (people per sq. km of land area)

`sp_urb_grow` = Urban population growth (annual %)

<br><br>
- Is there a relationship between population density and urban growth?
]
]

---

class: transition

# Sanity check your data

---

# .orange[Case study] .circle.bg-orange.white[2] Employment Data in Australia .f4[Part 1/3]

Below is the data from the ABS that shows the total number of people employed in a given month from February 1978 to December 2019, using the original time series.
<br>

```r
glimpse(employed)
```

```
## Rows: 533
## Columns: 4
## $ date  <date> 1978-02-01, 1978-03-01, 1978-04-01, 1978-05-01, 1978-06-01, 1978-07-01, 1978-08-01, 1978-09-01, 1978-10-01, 1978-11-01, 1978-12-01, 1979-01-01, 1979-02-01, 1979-03-01, 1979-04-01, 197…
## $ month <dbl> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, …
## $ year  <fct> 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1980, 1980, 1980, 1980, 1980, 1980, 1980, 1980…
## $ value <dbl> 5985.660, 6040.561, 6054.214, 6038.265, 6031.342, 6036.084, 6005.361, 6024.313, 6045.855, 6033.797, 6125.360, 5971.329, 6050.693, 6096.175, 6087.654, 6075.611, 6095.734, 6103.922, 6078…
```

.footnote.f4[
Australian Bureau of Statistics, 2020, Labour force, Australia, Table 01. Labour force status by Sex, Australia - Trend, Seasonally adjusted and Original, viewed 2023-08-07, [<i class="fas fa-link"></i>](https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/6202.0Jul%202020?OpenDocument)
]

---

# .orange[Case study] .circle.bg-orange.white[2] Employment Data in Australia .f4[Part 2/3]

Do you notice anything?

<img src="images/lecture-03A/unnamed-chunk-16-1.png" width="864" style="display: block; margin: auto;" />

--

Why do you think the number of people employed is going up each year?

???

* Australian population is **25.39 million** in 2019
* 1.5% annual increase in population
* Vic population is 6.681 million (Sep 2020) - 26%
* NSW population is 8.166 million (Sep 2020) - 32%

---

# .orange[Case study] .circle.bg-orange.white[2] Employment Data in Australia .f4[Part 3/3]

.grid[.item[
<img src="images/lecture-03A/unnamed-chunk-17-1.png" width="432" style="display: block; margin: auto;" />
]
.item[

{{content}}

]
]

--

* There's a suspicious change in the August numbers from 2014.

<img src="images/lecture-03A/unnamed-chunk-18-1.png" width="432" style="display: block; margin: auto;" />

* A potential explanation for this is that there was a _change in the survey from 2014_.
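One way to make the August anomaly concrete is a small numerical check. A sketch (not in the original slides, assuming the `employed` tibble above): compare each August to the average of the neighbouring July and September.

.f4[

```r
# Ratio of August employment to the July/September average, by year;
# a step up in this ratio from 2014 onward flags the survey change
employed %>%
  filter(month %in% c(7, 8, 9)) %>%
  group_by(year) %>%
  filter(n() == 3) %>%  # keep only years with all three months present
  summarise(aug_ratio = value[month == 8] / mean(value[month != 8]))
```

]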
<div class="footnote"> Also see https://robjhyndman.com/hyndsight/abs-seasonal-adjustment-2/ </div> --- class: transition # Check if the _data collection_ method has been consistent --- # .blue[Example] .circle.bg-blue.white[3] Experimental layout and data .f4[Part 1/2] `lecture3-example3.csv` .f4[ ```r df3 <- read_csv(here::here("data/lecture3-example3.csv"), col_types = cols( row = col_factor(), col = col_factor(), yield = col_double(), trt = col_factor(), block = col_factor())) ``` .overflow-scroll.h5[ ```r skimr::skim(df3) ``` ``` ## ── Data Summary ──────────────────────── ## Values ## Name df3 ## Number of rows 48 ## Number of columns 5 ## _______________________ ## Column type frequency: ## factor 4 ## numeric 1 ## ________________________ ## Group variables None ## ## ── Variable type: factor ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate ordered n_unique top_counts ## 1 row 0 1 FALSE 6 1: 8, 2: 8, 3: 8, 4: 8 ## 2 col 0 1 FALSE 8 1: 6, 2: 6, 3: 6, 4: 6 ## 3 trt 0 1 FALSE 9 non: 16, hi : 4, hi : 4, hi : 4 ## 4 block 0 1 FALSE 4 B3: 12, B1: 12, B2: 12, B4: 12 ## ## ── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 yield 0 1 246. 16.0 204 237 248 257. 273 ▂▂▇▇▅ ``` ] ] --- # .blue[Example] .circle.bg-blue.white[3] Experimental layout and data .f4[Part 2/2] .grid[ .item[ <img src="images/lecture-03A/unnamed-chunk-22-1.png" width="432" style="display: block; margin: auto;" /><img src="images/lecture-03A/unnamed-chunk-22-2.png" width="432" style="display: block; margin: auto;" /> ] .item[ * The experiment tests the effects of 9 fertilizer treatments on the yield of brussel sprouts on a field laid out in a rectangular array of 6 rows and 8 columns. <img src="images/lecture-03A/unnamed-chunk-23-1.png" width="576" style="display: block; margin: auto;" /> * High sulphur and high manure seems to be the best for the yield of brussel sprouts. * Any issues here? ] ] --- # Take away messages .flex[ .w-70.f2[ <ul class="fa-ul"> {{content}} </ul> ] ] -- <li><span class="fa-li"><i class="fas fa-paper-plane"></i></span> <ul> <li> Check if experimental layout given in the data and the description match </li> <li> In particular, have a check with a plot to see if treatments are <em>randomised</em>. 
</li>
</ul>

---

class: transition

# Validators

---

# .orange[Case study] .circle.bg-orange.white[3] Dutch supermarket revenue and cost .f4[Part 1/3]

* Data contains the revenue and cost (in Euros) for 60 supermarkets
* Data has been anonymised and distorted

```
## Rows: 60
## Columns: 11
## $ id          <fct> RET01, RET02, RET03, RET04, RET05, RET06, RET07, RET08, RET09, RET10, RET11, RET12, RET13, RET14, RET15, RET16, RET17, RET18, RET19, RET20, RET21, RET22, RET23, RET24, RET25, RET…
## $ size        <fct> sc0, sc3, sc3, sc3, sc3, sc0, sc3, sc1, sc3, sc2, sc2, sc2, sc3, sc1, sc1, sc0, sc3, sc1, sc2, sc3, sc0, sc0, sc1, sc1, sc2, sc3, sc2, sc3, sc0, sc2, sc3, sc2, sc3, sc3, sc3, sc3…
## $ incl.prob   <dbl> 0.02, 0.14, 0.14, 0.14, 0.14, 0.02, 0.14, 0.02, 0.14, 0.05, 0.05, 0.05, 0.14, 0.02, 0.02, 0.02, 0.14, 0.02, 0.05, 0.14, 0.02, 0.02, 0.02, 0.02, 0.05, 0.14, 0.05, 0.14, 0.02, 0.05…
## $ staff       <int> 75, 9, NA, NA, NA, 1, 5, 3, 6, 5, 5, 5, 13, NA, 3, 52, 10, 4, 3, 8, 2, 3, 2, 4, 3, 6, 2, 16, 1, 6, 29, 8, 13, 9, 15, 14, 6, 53, 7, NA, 20, 2, NA, 1, 3, 1, 60, 8, 10, 12, 7, 24, 2…
## $ turnover    <int> NA, 1607, 6886, 3861, NA, 25, NA, 404, 2596, NA, 645, 2872, 5678, 931397, 80000, 9067, 1500, 440, 690, 1852, 359, 839, 471, 933, 1665, 2318, 1175, 2946, 492, 1831, 7271, 971, 411…
## $ other.rev   <int> NA, NA, -33, 13, 37, NA, NA, 13, NA, NA, NA, NA, 12, NA, NA, 622, 20, NA, NA, NA, 9, NA, NA, 2, NA, NA, 12, 7, NA, 1831, 30, NA, 11, NA, 33, 98350, 4, NA, 38, 98, 11, NA, NA, NA,…
## $ total.rev   <int> 1130, 1607, 6919, 3874, 5602, 25, 1335, 417, 2596, NA, 645, 2872, 5690, 931397, NA, 9689, 1520, 440, 690, 1852, 368, 839, 471, 935, 1665, 2318, 1187, 2953, 492, 1831, 7301, 107, …
## $ staff.costs <int> NA, 131, 324, 290, 314, NA, 135, NA, 147, NA, 130, 182, 326, 36872, 40000, 1125, 195, 16, 19000, 120, NA, 2, 34, 31, 70, 184, 114, 245, NA, 53, 451, 28, 57, 106, 539, 221302, 64,…
## $ total.costs <int> 18915, 1544, 6493, 3600, 5530, 22, 136, 342, 2486, NA, 636, 2652, 5656, 841489, NA, 9911, 1384, 379, 464507, 1812, 339, 717, 411, 814, 186, 390, NA, 2870, 470, 1443, 7242, 95, 36…
## $ profit      <int> 20045, 63, 426, 274, 72, 3, 1, 75, 110, NA, 9, 220, 34, 89908, NA, -222, 136, 60, 225493, 40, 29, 122, 60, 121, 1478, 86, 17, 83, 22, 388, 59, 100, 528, 160, 282, 22457, 37, -160…
## $ vat         <int> NA, NA, NA, NA, NA, NA, 1346, NA, NA, NA, NA, NA, NA, 863, 813, 964, 733, 296, 486, 1312, 257, 654, 377, 811, 1472, 2082, 1058, 2670, 449, 1695, 6754, 905, 3841, 2668, 2758, 2548…
```

---

# .orange[Case study] .circle.bg-orange.white[3] Dutch supermarket revenue and cost .f4[Part 2/3]

* Checking for completeness of records

```r
library(validate)
rules <- validator(
  is_complete(id),
  is_complete(id, turnover),
  is_complete(id, turnover, profit))
out <- confront(SBS2000, rules)
summary(out)
```

```
##   name items passes fails nNA error warning                        expression
## 1   V1    60     60     0   0 FALSE   FALSE                   is_complete(id)
## 2   V2    60     56     4   0 FALSE   FALSE         is_complete(id, turnover)
## 3   V3    60     52     8   0 FALSE   FALSE is_complete(id, turnover, profit)
```

---

# .orange[Case study] .circle.bg-orange.white[3] Dutch supermarket revenue and cost .f4[Part 3/3]

* Sanity check derived variables

```r
library(validate)
rules <- validator(
  total.rev - profit == total.costs,
  turnover + other.rev == total.rev,
  profit <= 0.6 * total.rev)
out <- confront(SBS2000, rules)
summary(out)
```

```
##   name items passes fails nNA error warning                                     expression
## 1   V1    60     39    14   7 FALSE   FALSE abs(total.rev - profit - total.costs) <= 1e-08
## 2   V2    60     19     4  37 FALSE   FALSE abs(turnover + other.rev - total.rev) <= 1e-08
## 3   V3    60     49     6   5 FALSE   FALSE              profit - 0.6 * total.rev <= 1e-08
```
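The `validate` package can also pull out the offending records for inspection. A minimal sketch (an addition, assuming the `SBS2000` data and the confrontation object `out` from above):

.f4[

```r
# Extract the records that break the derived-variable rules, so they
# can be inspected or queried back to the data supplier
violating(SBS2000, out) %>%
  select(id, turnover, other.rev, total.rev, total.costs, profit)
```

]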
---

# Take away messages

.flex[
.w-70.f2[
<ul class="fa-ul">
{{content}}
</ul>
]
]

--

<li><span class="fa-li"><i class="fas fa-paper-plane"></i></span>Check your data:
<ul>
<li>by validating the variable types</li>
<li>with independent or external sources</li>
<li>by checking the data quality</li>
</ul>
</li>

{{content}}

--

<li><span class="fa-li"><i class="fas fa-paper-plane"></i></span>Check if the data collection method has been consistent</li>

{{content}}

--

<li><span class="fa-li"><i class="fas fa-paper-plane"></i></span>Check if the experimental layout given in the data and the description match</li>

{{content}}

--

<li><span class="fa-li"><i class="fas fa-paper-plane"></i></span>Consider if or how the data were derived</li>

---

class: middle center

# Why?

<br><br>

.blockquote.w-80["The first thing to do with data is to look at them.... usually means tabulating and plotting the data in many different ways to ‘see what’s going on’. With the wide availability of computer packages and graphics nowadays there is no excuse for ducking the labour of this preliminary phase, and it may save some .monash-red2[red faces] later."]

.footnote[Crowder, M. J. & Hand, D. J. (1990) "Analysis of Repeated Measures" https://doi.org/10.1201/9781315137421]

---

# Further reading

- Huebner et al (2018) [A Contemporary Conceptual Framework for Initial Data Analysis](https://muse.jhu.edu/article/793379/pdf)
- Huebner et al (2020) [Hidden analyses](https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-020-00942-y)
- Chatfield (1985) The Initial Examination of Data. *Journal of the Royal Statistical Society. Series A (General)* **148**
- Cox & Snell (1981) Applied Statistics. *London: Chapman and Hall.*
- van der Loo and de Jonge (2018) Statistical Data Cleaning with Applications in R. *John Wiley and Sons Ltd.*
- Hyndman (2014) [Explaining the ABS unemployment fluctuations](https://robjhyndman.com/hyndsight/abs-seasonal-adjustment-2/)

---

background-size: cover
class: title-slide
background-image: url("images/bg-12.png")

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.

.bottom_abs.width100[
Lecturer: *Di Cook*

<i class="fas fa-envelope"></i> ETC5521.Clayton-x@monash.edu

<i class="fas fa-calendar-alt"></i> Week 3 - Session 1

<br>
]

<br><br>
Lecture materials originally developed by Dr Emi Tanaka