ETC5521 Tutorial 4

Initial data analysis


Prof. Di Cook


August 17, 2023

🎯 Objectives

Practice conducting initial data analyses, and make a start on learning how to assess significance of patterns.

🔧 Preparation

The reading for this week is The initial examination of data. It is authored by Chris Chatfield, and is a classic paper explaining the role of initial data analysis. - Complete the weekly quiz, before the deadline! - Make sure you have this list of R packages installed:

install.packages(c("tidyverse", "palmerpenguins", "ggbeeswarm", "broom", "nullabor"))
  • Open your RStudio Project for this unit, (the one you created in week 1, eda or ETC5521). Create a .Rmd document for this week’s activities.

Exercise 1: IDA on penguins data

  1. Take a glimpse of the penguins data. What types are variables are present in the data?
  1. How was this data collected? You will need to read the documentation for the palmerpenguins package.

  2. Using the visdat package make an overview plot to examine types of variables and for missing values.

  3. Check the distributions of each species on each of the size variables, using a jittered dotplot, using the geom_quasirandom() function in the ggbeeswarm package. There seems to be some bimodality in some species on some variables eg bill_length_mm. Why do you think this might be? Check your thinking by making a suitable plot.

  4. Is there any indication of outliers from the jittered dotplots of different variables?

  5. Make a scatterplot of body_mass_g vs flipper_length_mm for all the penguins. What do the vertical stripes indicate? Are there any other unusual patterns to note, such as outliers or clustering or nonlinearity?

  6. How well can penguin body mass be predicted based on the flipper length? Fit a linear model to check. Report the equation, the \(R^2\), \(\sigma\), and make a residual plot of residuals vs flipper_length_mm. From the residual plot, are there any concerns about the model fit?

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
  1. Details are at, and you learn “These data were collected from 2007 - 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network. The data were imported directly from the Environmental Data Initiative (EDI) Data Portal.” It is necessary to also read the original data collection article to obtain details of the sampling. Breeding pairs of penguins were included based on sampling of nests where pairs of adults were present, were chosen and marked, before onset of egg-laying.

  2. There are three factor variables - species, island, sex - and three integer variables - flipper_length_mm, body_mass_g and year - and two numeric variables - bill_length_mm and bill_depth_mm.

It is interesting that flipper_length_mm and body_mass_g are reported without decimal places, and thus are integers, whereas bill_length_mm and bill_depth_mm are reported with one decimal place, and thus are doubles. Both are numeric variables, though.

Four variables have missing values: sex, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g. There are more missings on sex. The other missings occur together, the penguins are missing on all five variables.