# ETC5521 Tutorial 4

Initial data analysis

## 🎯 Objectives

Practice conducting initial data analyses, and make a start on learning how to assess significance of patterns.

## 🔧 Preparation

- The reading for this week is *The initial examination of data*. It is authored by Chris Chatfield, and is a classic paper explaining the role of initial data analysis.
- Complete the weekly quiz, before the deadline!
- Make sure you have this list of R packages installed:

`install.packages(c("tidyverse", "palmerpenguins", "ggbeeswarm", "broom", "nullabor"))`

- Open your RStudio Project for this unit (the one you created in week 1, `eda` or `ETC5521`). Create a `.Rmd` document for this week’s activities.

## Exercise 1: IDA on `penguins` data

- Take a `glimpse` of the `penguins` data. What types of variables are present in the data?

`library(palmerpenguins)`
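
As a possible starting point for this step, the sketch below prints the structure of the data and then tabulates the class of each column. The object name `var_types` is only an illustrative choice, and `glimpse()` is taken from `dplyr`, which is installed as part of `tidyverse`.

```r
library(dplyr)           # provides glimpse()
library(palmerpenguins)  # provides the penguins data

# One line per variable: its name, its type, and the first few values
glimpse(penguins)

# Tabulate the first class of each column to summarise the types present
# (var_types is a hypothetical helper name, not part of any package)
var_types <- vapply(penguins, function(x) class(x)[1], character(1))
table(var_types)
```

The table of classes is just a compact companion to the `glimpse()` output; deciding what each type means for the analysis is still part of the exercise.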

- How was this data collected? You will need to read the documentation for the `palmerpenguins` package.
- Using the `visdat` package, make an overview plot to examine the types of variables and to check for missing values.
- Check the distributions of each species on each of the size variables with a jittered dotplot, made using the `geom_quasirandom()` function in the `ggbeeswarm` package. There seems to be some bimodality in some species on some variables, e.g. `bill_length_mm`. Why do you think this might be? Check your thinking by making a suitable plot.
- Is there any indication of outliers from the jittered dotplots of different variables?
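
The overview plot and the jittered dotplots could be approached along the following lines. This is a minimal sketch, not a model answer: the names `penguins_long`, `p_dots` and `p_sex` are hypothetical, `visdat` is not in the install list above so it may need `install.packages("visdat")`, and colouring by `sex` is only one guess at the source of the bimodality that you should check for yourself.

```r
library(palmerpenguins)
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggbeeswarm)
library(visdat)  # assumption: installed separately, it is not in the list above

# Overview of variable types and missing values, one column per variable
vis_dat(penguins)

# Stack the four size variables so each can be drawn in its own facet
penguins_long <- penguins %>%
  pivot_longer(
    cols = c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g),
    names_to = "variable",
    values_to = "value"
  ) %>%
  filter(!is.na(value))  # drop missing measurements before plotting

# Jittered dotplot of each species on each size variable
p_dots <- ggplot(penguins_long, aes(x = species, y = value)) +
  geom_quasirandom(alpha = 0.6) +
  facet_wrap(~variable, scales = "free_y")
p_dots

# One candidate explanation to check: colour the bill length points by sex
p_sex <- penguins_long %>%
  filter(variable == "bill_length_mm") %>%
  ggplot(aes(x = species, y = value, colour = sex)) +
  geom_quasirandom(alpha = 0.6)
p_sex
```

Free `y` scales are used in the facets because the size variables are measured in different units (millimetres and grams), so a shared axis would flatten the smaller ones.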

- Make a scatterplot of `body_mass_g` vs `flipper_length_mm` for all the penguins. What do the vertical stripes indicate? Are there any other unusual patterns to note, such as outliers or clustering or nonlinearity?
- How well can penguin body mass be predicted based on the flipper length? Fit a linear model to check. Report the equation, the \(R^2\), \(\sigma\), and make a residual plot of residuals vs `flipper_length_mm`. From the residual plot, are there any concerns about the model fit?
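
The scatterplot, model fit and residual plot might be sketched as below. This assumes a simple linear regression of `body_mass_g` on `flipper_length_mm` with the default handling of missing values in `lm()`; the names `penguins_fit` and `penguins_resid` are hypothetical, and the `broom` functions are used only as one convenient way to extract the quantities asked for.

```r
library(palmerpenguins)
library(ggplot2)
library(broom)

# Scatterplot of body mass (y) against flipper length (x)
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(alpha = 0.6)

# Simple linear regression; rows with missing values are dropped by default
penguins_fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins)

tidy(penguins_fit)    # intercept and slope for the fitted equation
glance(penguins_fit)  # r.squared and sigma, among other summaries

# augment() returns the rows used in the fit with .fitted and .resid added
penguins_resid <- augment(penguins_fit)

# Residuals against the predictor, with a reference line at zero
ggplot(penguins_resid, aes(x = flipper_length_mm, y = .resid)) +
  geom_hline(yintercept = 0, colour = "grey50") +
  geom_point(alpha = 0.6)
```

When reading the residual plot, look for curvature, changing spread, or clusters around the zero line; any of these would suggest the straight-line model is missing some structure.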