ETC5521 Tutorial 4
Initial data analysis
🎯 Objectives
Practice conducting initial data analyses, and make a start on learning how to assess the significance of patterns.
🔧 Preparation
- The reading for this week is The initial examination of data. It is authored by Chris Chatfield, and is a classic paper explaining the role of initial data analysis.
- Complete the weekly quiz, before the deadline!
- Make sure you have this list of R packages installed:
install.packages(c("tidyverse", "palmerpenguins", "ggbeeswarm", "broom", "nullabor", "visdat"))
- Open your RStudio Project for this unit (the one you created in week 1, eda or ETC5521). Create a .Rmd document for this week’s activities.
Exercise 1: IDA on penguins data
- Take a glimpse of the penguins data. What types of variables are present in the data?
library(palmerpenguins)
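A first look at the variable types could be sketched as follows. This is only one possible starting point: it assumes the dplyr package (installed with the tidyverse) supplies `glimpse()`, and the object name `penguin_types` is made up for illustration.

```r
library(dplyr)           # provides glimpse()
library(palmerpenguins)  # provides the penguins data

# One row per variable in the console: name, type, and the first few values
glimpse(penguins)

# The same type information as a named character vector
# (penguin_types is an illustrative name, not part of the tutorial)
penguin_types <- sapply(penguins, function(x) class(x)[1])
penguin_types
```

Comparing the printed types against the package documentation is a quick way to spot variables that were read in with an unexpected type.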
- How was this data collected? You will need to read the documentation for the palmerpenguins package.
- Using the visdat package, make an overview plot to examine the types of variables and to check for missing values.
- Check the distributions of each species on each of the size variables using a jittered dotplot, made with the geom_quasirandom() function in the ggbeeswarm package. There seems to be some bimodality in some species on some variables, e.g. bill_length_mm. Why do you think this might be? Check your thinking by making a suitable plot.
- Is there any indication of outliers from the jittered dotplots of different variables?
- Make a scatterplot of body_mass_g vs flipper_length_mm for all the penguins. What do the vertical stripes indicate? Are there any other unusual patterns to note, such as outliers, clustering or nonlinearity?
- How well can penguin body mass be predicted based on the flipper length? Fit a linear model to check. Report the equation, the \(R^2\), \(\sigma\), and make a residual plot of residuals vs flipper_length_mm. From the residual plot, are there any concerns about the model fit?
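The plotting and modelling steps in this exercise could be started along the following lines. This is a hedged sketch rather than a model answer: the object names (`penguins_long`, `penguins_fit`, `penguins_aug`) are invented here, and reshaping to long form with `pivot_longer()` is just one convenient way to show all four size variables at once.

```r
library(tidyverse)       # dplyr, tidyr, ggplot2
library(palmerpenguins)
library(ggbeeswarm)
library(broom)

# Long form: one row per penguin per size variable, missing values dropped
penguins_long <- penguins %>%
  pivot_longer(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g),
               names_to = "measure", values_to = "value") %>%
  filter(!is.na(value))

# Jittered dotplots of each size variable by species
ggplot(penguins_long, aes(x = species, y = value)) +
  geom_quasirandom() +
  facet_wrap(~measure, scales = "free_y")

# Linear model of body mass on flipper length
# (lm() drops rows with missing values by default)
penguins_fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
tidy(penguins_fit)                                  # intercept and slope
glance(penguins_fit) %>% select(r.squared, sigma)   # R^2 and residual sd

# Residuals vs the predictor, with a reference line at zero
penguins_aug <- augment(penguins_fit)
ggplot(penguins_aug, aes(x = flipper_length_mm, y = .resid)) +
  geom_hline(yintercept = 0, colour = "grey50") +
  geom_point()
```

Mapping another variable to colour in the dotplots is one way to check a guess about the bimodality; which variable to try is left to you.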
Exercise 2: Can we believe what we see?
This question uses material from this week’s lecture, from a few hours ago.
- In the previous question we made subjective statements about the residual plot to determine if the model was a good fit or not. We’ll use randomisation to check any observations we made from the residual plot. The code below makes a lineup of the true plot against plots made with rotation residuals (the nulls, which show what residuals look like when the model fits well). When you run the code you will get a line decrypt("...."), which you can copy and paste back into the console window to get the location of the true plot (in case you forgot which it is). Does the true plot look like the null plots? If not, describe how it differs.
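library(tidyverse)
library(palmerpenguins)
library(broom)
library(nullabor)
# Assumption: penguins_m is not defined in this handout. Here it is taken to be
# the Exercise 1 model data with a .resid column added by broom::augment().
penguins_m <- augment(lm(body_mass_g ~ flipper_length_mm, data = penguins))
ggplot(lineup(null_lm(body_mass_g ~ flipper_length_mm, method = "rotate"),
              penguins_m),
       aes(x = flipper_length_mm, y = .resid)) +
  geom_point() +
  facet_wrap(~.sample, ncol = 5) +
  theme_void() +
  theme(axis.text = element_blank(),
        panel.border = element_rect(fill = NA, colour = "black"))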
- Pick one group, males or females, and one of Adelie, Chinstrap or Gentoo, and choose two of the four measurements. Fit a linear model, and do a lineup of the residuals. Can you tell which is the true plot? Show your lineup to your tutorial partner or someone else nearby and ask them
- to pick the plot that is most different.
- explain why they picked that plot.
- Using your decrypt() code, locate the true plot. Is the true plot different from the nulls?
- Did you or your friend choose the data plot? Was it identifiable from the lineup or indistinguishable from the null plots?
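One way to set up the lineup for a single group is sketched below. The choice of female Adelie penguins and of bill depth against bill length is purely illustrative, and the object names (`adelie_f`, `adelie_m`, `adelie_lineup`) are made up for this example; substitute your own group and measurements.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)
library(broom)
library(nullabor)

# Illustrative subset: female Adelie penguins, two of the four measurements
adelie_f <- penguins %>%
  filter(species == "Adelie", sex == "female") %>%
  select(bill_length_mm, bill_depth_mm) %>%
  na.omit()

# Fit the model and attach residuals, so the true data has a .resid column
adelie_fit <- lm(bill_depth_mm ~ bill_length_mm, data = adelie_f)
adelie_m <- augment(adelie_fit)

# 19 null data sets from rotation residuals plus the true data,
# placed at a random position; lineup() prints a decrypt() call
adelie_lineup <- lineup(
  null_lm(bill_depth_mm ~ bill_length_mm, method = "rotate"),
  adelie_m
)

ggplot(adelie_lineup, aes(x = bill_length_mm, y = .resid)) +
  geom_point() +
  facet_wrap(~.sample, ncol = 5) +
  theme_void() +
  theme(panel.border = element_rect(fill = NA, colour = "black"))
```

Keep the printed decrypt() call out of view when you show the lineup to someone else, so their pick is not influenced by knowing where the true plot is.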
👋 Finishing up
Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.