ETC5521 Tutorial 4

Initial data analysis


Prof. Di Cook


August 17, 2023

🎯 Objectives

Practice conducting initial data analyses, and make a start on learning how to assess significance of patterns.

🔧 Preparation

The reading for this week is The initial examination of data. It is authored by Chris Chatfield, and is a classic paper explaining the role of initial data analysis.

  • Complete the weekly quiz, before the deadline!
  • Make sure you have this list of R packages installed:

install.packages(c("tidyverse", "palmerpenguins", "visdat", "ggbeeswarm", "broom", "nullabor"))
  • Open your RStudio Project for this unit (the one you created in week 1, eda or ETC5521). Create a .Rmd document for this week’s activities.

Exercise 1: IDA on penguins data

  1. Take a glimpse of the penguins data. What types of variables are present in the data? How was this data collected? You will need to read the documentation for the palmerpenguins package.

  2. Using the visdat package make an overview plot to examine types of variables and for missing values.

  3. Check the distributions of each species on each of the size variables with a jittered dotplot, made using the geom_quasirandom() function in the ggbeeswarm package. There seems to be some bimodality in some species on some variables, e.g. bill_length_mm. Why do you think this might be? Check your thinking by making a suitable plot.

  4. Is there any indication of outliers from the jittered dotplots of different variables?

  5. Make a scatterplot of body_mass_g vs flipper_length_mm for all the penguins. What do the vertical stripes indicate? Are there any other unusual patterns to note, such as outliers or clustering or nonlinearity?

  6. How well can penguin body mass be predicted based on the flipper length? Fit a linear model to check. Report the equation, the \(R^2\), \(\sigma\), and make a residual plot of residuals vs flipper_length_mm. From the residual plot, are there any concerns about the model fit?
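If you are unsure where to start, the steps in this exercise can be sketched roughly as below. This is one possible starting point rather than a model answer: the object names penguins_long, penguins_fit and penguins_aug are illustrative choices, and the plot for checking bimodality and all of the interpretation are left to you.

```r
library(tidyverse)
library(palmerpenguins)
library(visdat)
library(ggbeeswarm)
library(broom)

# Variable types and a preview of the values
glimpse(penguins)

# Overview plot of variable types and missing values
vis_dat(penguins)

# Stack the four size variables so they can be facetted
# (penguins_long is an illustrative name)
penguins_long <- penguins %>%
  pivot_longer(c(bill_length_mm, bill_depth_mm,
                 flipper_length_mm, body_mass_g),
               names_to = "measurement", values_to = "value") %>%
  filter(!is.na(value))

# Jittered dotplots of each size variable by species
ggplot(penguins_long, aes(x = species, y = value)) +
  geom_quasirandom() +
  facet_wrap(~measurement, scales = "free_y")

# Scatterplot of body mass against flipper length
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

# Linear model; lm() drops rows with missing values by default
penguins_fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
tidy(penguins_fit)    # coefficients for the equation
glance(penguins_fit)  # includes r.squared and sigma

# Residuals against the predictor
penguins_aug <- augment(penguins_fit)
ggplot(penguins_aug, aes(x = flipper_length_mm, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0)
```

Using broom's tidy(), glance() and augment() keeps the model summaries in data frames, which makes them easy to report in the .Rmd and to plot with ggplot2.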

Exercise 2: Can we believe what we see?

This question uses material from this week’s lecture, from a few hours ago.

  1. In the previous exercise we made subjective statements about the residual plot to judge whether the model was a good fit. We’ll use randomisation to check any observations we made from the residual plot. The code below makes a lineup of the true plot against plots made with rotation residuals (nulls/good). When you run the code you will get a line decrypt("...."), which you can copy and paste back into the console window to get the location of the true plot (in case you forgot which it is). Does the true plot look like the null plots? If not, describe how it differs.
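# Assumes tidyverse, palmerpenguins, broom and nullabor are loaded.
# penguins_aug is an illustrative name: the Exercise 1 model augmented with .resid
penguins_aug <- augment(lm(body_mass_g ~ flipper_length_mm, data = penguins))
ggplot(lineup(null_lm(body_mass_g ~ flipper_length_mm, method = "rotate"),
              true = penguins_aug),  # lineup() needs the true data
       aes(x = flipper_length_mm, y = .resid)) +
  geom_point() +
  facet_wrap(~.sample, ncol = 5) +
  theme_void() +
  theme(axis.text = element_blank(),
        panel.border = element_rect(fill = NA, colour = "black"))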
  2. Pick one group, males or females, and one of Adelie, Chinstrap or Gentoo, and choose two of the four measurements. Fit a linear model, and do a lineup of the residuals. Can you tell which is the true plot? Show your lineup to your tutorial partner or someone else nearby and ask them:
  • to pick the plot that is most different.
  • to explain why they picked that plot.

Using your decrypt() code locate the true plot. Is the true plot different from the nulls?

Did you or your friend choose the data plot? Was it identifiable from the lineup or indistinguishable from the null plots?
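As a rough sketch of the workflow for the second part, the code below uses female Gentoo penguins with bill depth modelled on bill length. That subset and pair of variables are only an example, and the names gentoo_f, gentoo_fit, gentoo_aug and gentoo_lineup are illustrative; swap in your own choices.

```r
library(tidyverse)
library(palmerpenguins)
library(broom)
library(nullabor)

# Example subset only: female Gentoo penguins, two bill measurements
gentoo_f <- penguins %>%
  filter(species == "Gentoo", sex == "female") %>%
  select(bill_length_mm, bill_depth_mm) %>%
  drop_na()

# Fit the model and keep the residuals alongside the data
gentoo_fit <- lm(bill_depth_mm ~ bill_length_mm, data = gentoo_f)
gentoo_aug <- augment(gentoo_fit)

# Embed the true residual plot among 19 null plots made with
# rotation residuals; lineup() prints the decrypt() line to run later
gentoo_lineup <- lineup(
  null_lm(bill_depth_mm ~ bill_length_mm, method = "rotate"),
  true = gentoo_aug
)

ggplot(gentoo_lineup, aes(x = bill_length_mm, y = .resid)) +
  geom_point() +
  facet_wrap(~.sample, ncol = 5) +
  theme_void() +
  theme(panel.border = element_rect(fill = NA, colour = "black"))
```

Storing the lineup in an object before plotting means you can redraw the same lineup for your partner without reshuffling the position of the true plot.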

👋 Finishing up

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.