ETC5521 Tutorial 7

Bivariate dependencies and relationships, transformations to linearise

Author

Prof. Di Cook

Published

September 7, 2023

🎯 Objectives

These are exercises in making scatterplots, and variations on them, to examine the association between two variables, and to practise using transformations.

🔧 Preparation

- The reading for this week is Wilke (2019) Ch 12 Visualizing associations.
- Complete the weekly quiz, before the deadline!
- Install the following R-packages if you do not have them already:

install.packages(c("VGAMdata", "Sleuth2", "colorspace", "nullabor", "broom", "patchwork"))

Open your RStudio Project for this unit (the one you created in week 1, eda or ETC5521). Create a .Rmd document for this week’s activities.

Note for Tutors: Randomly allocate one of the problems to each student. Give them 30-40 minutes to work on the problem. Then, in the last 30 minutes, discuss each problem and its findings with the entire tutorial.

Exercise 1: Olympics

We have seen from the lecture that the Athletics category has too many different types of athletics in it for it to be a useful group for studying height and weight. There is another variable called Event which contains more specific information.

# Read-in data
data(oly12, package = "VGAMdata")

Tabulate Event for just the Sport category Athletics, and decide which new categories to create.

Create the new categories in steps, creating a new binary variable for each. The function str_detect is useful for searching for text patterns in a string. It also helps to know about regular expressions when working with strings like this. Two sites are great for learning them: Regex puzzles, and Information and testing board.
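As a minimal sketch of the binary-variable step, the code below applies str_detect to a handful of made-up event labels. The labels, the patterns, and the names is_throw, is_sprint and is_long are all illustrative assumptions; the real labels come from the Event variable, and your categories may well differ.

```r
library(stringr)

# Toy event labels (hypothetical; the real values come from oly12$Event)
events <- c("Men's 100m", "Women's Marathon", "Men's Shot Put",
            "Women's 4 x 400m Relay", "Men's Discus Throw")

# One binary indicator per candidate category, each built from a regular expression
is_throw  <- str_detect(events, "Shot Put|Discus|Javelin|Hammer")
is_sprint <- str_detect(events, "\\b(100|200|400)m\\b")
is_long   <- str_detect(events, "Marathon|10,000m|5000m")

# Check the result side by side with the labels
data.frame(events, is_throw, is_sprint, is_long)
```

Note that the relay label matches the sprint pattern here; whether relays belong with sprints is exactly the kind of decision the tabulation should inform.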

Make several plots to explore the association between height and weight for the different athletic categories, e.g. scatterplots faceted by sex and event type, with and without free scales; linear models for the different subsets, overlaid on the same plot; and 2D density plots faceted by sex and event type, with free scales.
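One of these plot types might be sketched as follows, using simulated data in place of the athletics subset. The data frame toy, the column type and all of the numbers are assumptions made for illustration only; Height, Weight and Sex mirror the variable names in oly12.

```r
library(ggplot2)

# Toy data (hypothetical values, standing in for the oly12 athletics subset)
set.seed(1)
toy <- data.frame(
  Sex    = rep(c("F", "M"), each = 40),
  type   = rep(c("sprint", "throw"), times = 40),
  Height = rnorm(80, mean = 1.75, sd = 0.08)
)
toy$Weight <- 40 * toy$Height +
  ifelse(toy$type == "throw", 25, 0) +
  rnorm(80, sd = 4)

# Scatterplot with a straight-line fit in each panel,
# faceted by sex (rows) and event type (columns), with free scales
p <- ggplot(toy, aes(x = Height, y = Weight)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_grid(Sex ~ type, scales = "free")
p
```

Removing scales = "free" puts every panel on common axes, which makes differences in location between groups easier to compare, while free scales make the shape within each panel easier to see.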

List what you learned about body types across the different athletics types and sexes.

If one were to use visual inference to check for a different relationship between height and weight across sports, how would you generate null data? Do it, and test your lineup with others in the class.
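One possible approach, sketched here under the assumption that the null hypothesis is "the height and weight relationship does not depend on sport", is to permute the sport labels while leaving each athlete's height and weight pair intact. The data frame athletes and the helper make_null are hypothetical names used only for this illustration.

```r
# Toy data (hypothetical): two sports with heights and weights
set.seed(2023)
athletes <- data.frame(
  Sport  = rep(c("A", "B"), each = 25),
  Height = rnorm(50, mean = 1.75, sd = 0.1)
)
athletes$Weight <- 70 + 50 * (athletes$Height - 1.75) + rnorm(50, sd = 5)

# One null data set: shuffle the Sport labels, keep (Height, Weight) pairs together
make_null <- function(d) {
  d$Sport <- sample(d$Sport)
  d
}
null1 <- make_null(athletes)
```

The nullabor package offers the same idea through null_permute() combined with lineup(), which also places the real data at a random position among the null plots.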

Exercise 2: Fisherman’s Reach crabs

Mud crabs are delicious to eat! Prof Cook’s father started a crab farm at Fisherman’s Reach, NSW, when he retired. He caught small crabs (with a special license) and nurtured and fed the crabs until they were marketable size. They were then sent to market, like Queen Victoria Market in Melbourne, for people to buy to eat. Mud crabs have a strong and nutty flavour, and are good to eat simply after steaming or boiling.

Early in the farming setup, he collected the measurements of 62 crabs of different sizes, because he wanted to learn when was the best time to send the crabs to market. Crabs re-shell from time to time. They grow too big for their shell, and need to discard it. Ideally, the crabs should be sent to market just before they re-shell, because the crab will be fuller in the shell, with less air, less juice and more crab meat.

Note: In NSW it is legal to sell female mud crabs, as long as they are not carrying eggs. In Queensland, it is illegal to keep and sell female mud crabs. Focusing only on males could be worthwhile.

# read_csv() and mutate() come from the tidyverse
library(tidyverse)
fr_crabs <- read_csv("https://eda.numbat.space/data/fr-crab.csv") %>%
  mutate(Sex = factor(Sex, levels=c(1,2),
                      labels=c("m", "f")))

Where is Fisherman’s Reach? What would you expect the relationship between Length and Weight of a crab to be?

Make a scatterplot of Weight by NSW Length. Describe the relationship. It might be even better if you can add marginal density plots to the sides of the scatterplot. (Aside: Should one variable be considered a dependent variable? If so, make sure this is on the \(y\) axis.)
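A rough sketch of a scatterplot with marginal densities, assembled with patchwork (one of the packages installed above), is shown below. It uses simulated data rather than the crab measurements; the data frame toy and its columns Length and Weight are placeholders, so substitute the actual column names from fr_crabs.

```r
library(ggplot2)
library(patchwork)

# Toy data (hypothetical): a curved length-weight relationship with noise
set.seed(7)
toy <- data.frame(Length = runif(60, min = 80, max = 180))
toy$Weight <- 2e-4 * toy$Length^3 * exp(rnorm(60, sd = 0.1))

# Main scatterplot, with the dependent variable on the y axis
p_main  <- ggplot(toy, aes(x = Length, y = Weight)) + geom_point()

# Marginal densities: Length along the top, Weight down the right side
p_top   <- ggplot(toy, aes(x = Length)) + geom_density() + theme_void()
p_right <- ggplot(toy, aes(y = Weight)) + geom_density() + theme_void()

# Arrange on a 2 x 2 grid, leaving the top-right cell empty
combined <- p_top + plot_spacer() + p_main + p_right +
  plot_layout(ncol = 2, widths = c(4, 1), heights = c(1, 4))
combined
```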

Examine transformations to linearise the relationship. (Think about why the relationship between Length and Weight is nonlinear.)
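To illustrate why a log transformation is a natural candidate, suppose (purely as an assumption for this sketch) that weight is proportional to volume, and so to the cube of a length measurement. Taking logs of both variables then turns the curve into a straight line with slope 3. The values below are made up and noise-free.

```r
# Toy data (hypothetical): weight grows exactly as the cube of length
length_mm <- seq(80, 180, by = 10)
weight_g  <- 2e-4 * length_mm^3

# On the log-log scale a power law becomes a straight line:
# log(weight) = log(2e-4) + 3 * log(length)
fit   <- lm(log(weight_g) ~ log(length_mm))
slope <- unname(coef(fit)[2])
slope
```

With real measurements the estimated slope will not be exactly 3, and how far it departs from 3 is itself worth thinking about.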

Is there possibly a lurking variable? Examine the variables in the data, and use colour in the plot to check for another variable explaining some of the relationship.

If you have determined that there is a lurking variable, make changes in the plots to find the best model of the relationship between Weight and Length.

How would you select the crabs that were close to re-shelling based on this data?

Exercise 3: Bank discrimination

data(case1202, package = "Sleuth2")

Look at the help page for case1202 from the Sleuth2 package. What does the variable “Senior” measure? What about “Exper” and “Age”?

Make all the pairwise scatterplots of Senior, Exper and Age. What do you learn about the relationship between these three pairs of variables? How can the age be 600? Are there some wizards or witches or vampires in the data?

Colour the observations by Sex. What do you learn?

Instead of scatterplots, make faceted histograms of the three variables by Sex. What do you learn about the difference in distribution of these three variables between the sexes?

The data also has 1975 salary and annual salary. Plot these two variables, in two ways: (1) coloured by Sex, and (2) faceted by Sex. Explain the relationships.

Examine the annual salary against Age, Senior and Exper, separately by Sex, by adding a fitted linear model to the scatterplot where Sex is mapped to colour. What is the relationship and what do you learn?

When you use geom_smooth(method="lm") to add a fitted model to the scatterplot, is it adding a model with interactions?
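One way to reason about this question: when colour is mapped to a categorical variable, geom_smooth fits a separate model within each group. The sketch below, on simulated data with hypothetical names d, g, x and y, compares separate per-group straight-line fits with a single model that includes a group-by-x interaction.

```r
# Toy data (hypothetical): two groups with different intercepts and slopes
set.seed(11)
d <- data.frame(
  g = rep(c("a", "b"), each = 30),
  x = runif(60, min = 0, max = 10)
)
d$y <- ifelse(d$g == "a", 1 + 2 * d$x, 5 + 0.5 * d$x) + rnorm(60)

# Separate straight-line fits within each group
fit_a <- lm(y ~ x, data = subset(d, g == "a"))
fit_b <- lm(y ~ x, data = subset(d, g == "b"))

# A single model with a group-by-x interaction
fit_int <- lm(y ~ x * g, data = d)
b <- coef(fit_int)

# Intercept and slope for group "b", recovered from the interaction model
b_line <- c(b["(Intercept)"] + b["gb"], b["x"] + b["x:gb"])

coef(fit_a)   # matches b["(Intercept)"] and b["x"]
coef(fit_b)   # matches b_line
```

The fitted lines agree, although the two approaches differ in how the residual variance is estimated, so any confidence bands need not be identical.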

There is a danger of misinterpreting differences when only examining marginal plots. What we need to know is: for a person with the same age, same experience and same seniority, is the salary different for men and women? How would you make plots to try to examine this?
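One possible starting point, offered as a sketch rather than the intended answer, is to condition on a variable by cutting it into bins and comparing the sexes within each bin, for example by faceting on the binned variable. The data below are simulated with no sex effect built in, and the names staff, Salary and Exper_bin are hypothetical.

```r
# Toy data (hypothetical): salary depends on experience only
set.seed(5)
staff <- data.frame(
  Sex   = rep(c("Female", "Male"), times = 50),
  Exper = runif(100, min = 0, max = 300)
)
staff$Salary <- 5000 + 3 * staff$Exper + rnorm(100, sd = 300)

# Cut Exper into three bins with roughly equal counts
breaks <- quantile(staff$Exper, probs = seq(0, 1, length.out = 4))
staff$Exper_bin <- cut(staff$Exper, breaks = breaks, include.lowest = TRUE)

# Compare median salary by sex within each experience bin
aggregate(Salary ~ Sex + Exper_bin, data = staff, FUN = median)
```

The same binned variable can be passed to facet_wrap or facet_grid, so that each panel compares men and women with similar experience. Conditioning on several variables at once quickly leaves few observations per panel, which is part of what makes this question hard.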

Would you say that this data provides evidence of sex discrimination?

👋 Finishing up

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.