ETC5521 Tutorial 7

Bivariate dependencies and relationships, transformations to linearise

Author

Prof. Di Cook

Published

September 7, 2023

🎯 Objectives

These are exercises in making scatterplots and variations to examine the association between two variables, and to practise using transformations.

🔧 Preparation

  • The reading for this week is Wilke (2019) Ch 12 Visualizing associations.
  • Complete the weekly quiz, before the deadline!
  • Install the following R packages if you do not have them already:

install.packages(c("VGAMdata", "Sleuth2", "colorspace", "nullabor", "broom", "patchwork"))
  • Open your RStudio Project for this unit (the one you created in week 1, eda or ETC5521). Create a .Rmd document for this week’s activities.
  • Note for Tutors: Randomly allocate one of the problems to each student. Give them 30-40 minutes to work on the problem. Then in the last 30 minutes discuss each one, and the findings, with the entire tutorial.

Exercise 1: Olympics

We have seen from the lecture that the Athletics category has too many different types of athletics in it for it to be a useful group for studying height and weight. There is another variable called Event which contains more specific information.

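# Load the tidyverse (dplyr, tidyr, stringr, ggplot2), which the code below relies on
library(tidyverse)

# Read in data
data(oly12, package = "VGAMdata")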
  1. Tabulate Event for just the Sport category Athletics, and decide which new categories to create.

World Athletics, the sport’s governing body, defines athletics in six disciplines: track and field, road running, race walking, cross country running, mountain running, and trail running (Wikipedia).

That’s not so helpful! Track and field should be two different groups. I suggest running (short, middle and long distance), throwing, jumping, walking, and Decathlon (men) or Heptathlon (women).
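The tabulation asked for in this step could be sketched as follows. This is only one possible approach: it assumes, as the later code does, that athletes with several events have them stored in a single string separated by ", ", and the name ath_events is introduced here purely for illustration.

```r
library(dplyr)
library(tidyr)

data(oly12, package = "VGAMdata")

# Count how often each event occurs among Athletics competitors.
# Assumption: multiple events per athlete are separated by ", ".
ath_events <- oly12 %>%
  filter(Sport == "Athletics") %>%
  separate_rows(Event, sep = ", ") %>%
  count(Event, sort = TRUE)

ath_events
```

Scanning the resulting event names is what suggests which text patterns can separate running, throwing, jumping and walking events.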

  2. Create the new categories, in steps, creating a new binary variable for each. The function str_detect is useful for searching for text patterns in a string. It also helps to know about regular expressions to work with strings like this. Two sites are great for learning them: Regex puzzles, and Information and testing board.
# Give each athlete an id as unique identifier to facilitate relational joins
oly12 <- oly12 %>%
  mutate(id = row_number(), .before = Name)

# For athletes with > 1 event, separate each event into a row
oly12_ath <- oly12 %>%
  filter(Sport == "Athletics") %>%
  separate_rows(Event, sep = ", ")

# Classify each event into one of 7 athlete types (the first matching pattern wins)
oly12_ath <- oly12_ath %>%
  mutate(
    Ath_type = case_when(
      # 100m, 110m Hurdles, 200m, 400m, 400m hurdles, 800m, 4 x 100m relay, 4 x 400m relay,
      str_detect(Event, "[1248]00m|Hurdles") ~ "Short distance",
      # 1500m, 3000m Steeplechase, 5000m
      str_detect(Event, "1500m|5000m|Steeplechase") ~ "Middle distance",
      # 10,000m, Marathon
      str_detect(Event, ",000m|Marathon") ~ "Long distance",
      # 20km Race walk, Men's 50km Race walk
      str_detect(Event, "Walk") ~ "Walking",
      # discus throw, hammer throw, javelin throw, shot put,
      str_detect(Event, "Throw|Put") ~ "Throwing",
      # high jump, long jump, triple jump, pole vault
      str_detect(Event, "Jump|Pole Vault") ~ "Jumping",
      # decathlon (men) or heptathlon (women)
      str_detect(Event, "Decathlon|Heptathlon") ~ "Decathlon/Heptathlon"
    )
  )

# Drop Event so that athletes with several events of the same type collapse to one row
oly12_ath <- oly12_ath %>%
  select(-Event) %>%
  distinct()

# Add events back to each athlete
oly12_ath <- oly12_ath %>%
  left_join(
    select(oly12, id, Event),
    by = "id"
  )
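To see how the patterns passed to str_detect above behave, it can help to try them on a few strings first. The sketch below uses made-up event names (an assumption, not values taken from oly12) and the illustrative names events, is_short and is_long.

```r
library(stringr)

# Hypothetical event names, made up for illustration only
events <- c("Men's 100m", "Women's 1500m", "Men's 10,000m", "Women's High Jump")

# [1248]00m matches one of the digits 1, 2, 4 or 8 followed by "00m",
# so it picks up 100m but not 1500m or 10,000m
is_short <- str_detect(events, "[1248]00m")

# ",000m|Marathon" matches either the literal text ",000m" or "Marathon"
is_long <- str_detect(events, ",000m|Marathon")

data.frame(events, is_short, is_long)
```

Because case_when uses the first condition that is TRUE, the order of the patterns matters whenever an event name could match more than one of them.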
  3. Make several plots to explore the association between height and weight for the different athletic categories, e.g. scatterplots faceted by sex and event type, with and without free scales; linear models for the different subsets, overlaid on the same plot; and 2D density plots faceted by sex and event type, with free scales.
library(colorspace)
ggplot(data = oly12_ath, aes(x = Height, y = Weight)) +
  geom_point(alpha = 0.4) +
  facet_grid(Sex ~ Ath_type)
Warning: Removed 222 rows containing missing values (`geom_point()`).
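Two of the other variations suggested in this step, free scales with a linear fit and 2D density contours, might look like the sketch below. To keep it self-contained it rebuilds a simple Athletics subset from oly12 and facets by Sex only. The names ath, p_lm and p_dens are illustrative; with the Ath_type variable created earlier, the facets could instead use facet_grid(Sex ~ Ath_type, scales = "free").

```r
library(dplyr)
library(ggplot2)

data(oly12, package = "VGAMdata")

# Keep Athletics competitors with both measurements recorded,
# which avoids the missing-value warning
ath <- oly12 %>%
  filter(Sport == "Athletics", !is.na(Height), !is.na(Weight))

# Scatterplot with free scales and a separate linear fit in each panel
p_lm <- ggplot(ath, aes(x = Height, y = Weight)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~Sex, scales = "free")

# 2D density contours of the same data, also with free scales
p_dens <- ggplot(ath, aes(x = Height, y = Weight)) +
  geom_density_2d() +
  facet_wrap(~Sex, scales = "free")

p_lm
p_dens
```

Free scales make the shape of the relationship within each panel easier to read, at the cost of making comparisons of level across panels harder.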