install.packages(c("ggplot2movies", "bayesm", "flexmix", "ggbeeswarm", "mixtools", "lvplot", "patchwork", "nullabor"))
ETC5521 Tutorial 6
Working with a single variable, making transformations, detecting outliers, using robust statistics
🎯 Objectives
These are exercises in making plots of one variable and what can be learned about the distributions and the data patterns and problems.
🔧 Preparation
The reading for this week is Wilke (2019) Ch 6 Visualizing Amounts; Ch 7 Visualizing distributions. - Complete the weekly quiz, before the deadline! - Make sure you have this list of R packages installed:
- Open your RStudio Project for this unit, (the one you created in week 1,
eda
orETC5521
). Create a.Rmd
document for this week’s activities. - Note for Tutors: Randomly allocate one of the problems to each student. Give them 30-40 minutes to work on the problem. Then in the last 30 minutes discuss each one, and the findings with entire tutorial.
Exercise 1: Galaxies
Load the galaxies
data in the MASS
package and answer the following questions based on this dataset.
data(galaxies, package = "MASS")
You can access documentation of the data (if available) using the help
function specifying the package name in the argument.
help(galaxies, package = "MASS")
- What does the data contain? And what is the data source?
- Based on the description in the R Help for the data, what would be an appropriate null distribution of this data?
- How many observations are there?
- If the data is multimodal, which of the following displays do you think would be the best? Which would not be at all useful?
- histogram
- boxplot
- density plot
- violin plot
- jittered dot plot
- letter value plot
- Make these plots for the data. Experiment with different binwidths for the histogram and different bandwiths for the density plot. Were you right in your thinking about which would be the best?
- Fit your best mixture model to the data, and simulate 19 nulls to make a lineup. Show it to five of your neighbours and have them report which is the most different plot. Compute the \(p\)-value. Did you do a good job in matching the distribution? (Extra bonus: What is the model that you have created? Can you make a plot to show how it looks relative to the observed data?)
This code might be helpful to get you started.
library(mixtools)
<- normalmixEM(galaxies, k=3)
galaxies_fit
set.seed(1138)
<- rnormmix(n=length(galaxies),
galaxies_sim1 lambda=galaxies_fit$lambda,
mu=galaxies_fit$mu,
sigma=galaxies_fit$sigma)
ggplot(tibble(galaxies_sim1), aes(x=galaxies_sim1)) +
geom_quasirandom(aes(x=1, y=galaxies_sim1)) +
coord_flip() +
theme(
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank()
)<- tibble(.sample=1, galaxies=galaxies_sim1)
galaxies_null for (i in 2:19) {
<- rnormmix(n=length(galaxies),
gsim lambda=galaxies_fit$lambda,
mu=galaxies_fit$mu,
sigma=galaxies_fit$sigma)
<- bind_rows(galaxies_null,
galaxies_null tibble(.sample=i, galaxies=gsim))
}<- bind_rows(galaxies_null,
galaxies_null tibble(.sample=20,
galaxies=galaxies))
# Randomise .sample to hide data plot
$.sample <- rep(sample(1:20, 20), rep(82, 20))
galaxies_nullggplot(tibble(galaxies_null), aes(x=galaxies)) +
geom_quasirandom(aes(x=1, y=galaxies)) +
facet_wrap(~.sample, ncol=4) +
coord_flip() +
theme(
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank()
)
Exercise 2: Movie lengths
Load the movies
dataset in the ggplot2movies
package and answer the following questions based on it.
- How many observations are in the data?
- Draw a histogram with an appropriate binwidth that shows the peaks at 7 minutes and 90 minutes. Draw another set of histograms to show whether these peaks existed both before and after 1980.
- The variable
Short
indicates whether the film was classified as a short film (1
) or not (0
). Draw plots to investigate what rules was used to define a film as “short” and whether the films have been consistently classified.
- How would you use the lineup protocol to determine if the periodic peaks could happen by chance? What would be the null hypothesis? Make your lineup, and show it to five of your neighbours and have them report which is the most different plot. Compute the \(p\)-value. Comment on the results. (Note: It might be helpful to focus on movies between 50-150 minutes only.)
Exercise 3: Whisky
The Scotch
data set in bayesm
package was collated from a survey on scotch drinkers, recording the brand they consumed. Do the necessary pre-processing to get the data looking like:
# A tibble: 21 × 2
brand count
<chr> <int>
1 Ballantine 99
2 Black & White 81
3 Chivas Regal 806
4 Clan MacGregor 103
5 Cutty Sark 339
6 Dewar's White Label 517
7 Glenfiddich 334
8 Glenlivet 354
9 Grant's 74
10 J&B 458
11 Johnnie Walker Black Label 502
12 Johnnie Walker Red Label 424
13 Knockando 47
14 Macallan 95
15 Other brands 414
16 Passport 82
17 Pinch (Haig) 117
18 Scoresby Rare 79
19 Ushers 67
20 White Horse 62
21 Singleton 31
- Produce a barplot of the number of respondents per brand. What ordering of the brands do you think is the best?
- There are 20 named brands and one category that is labelled as
Other.brands
. Produce a barplot that you think best reduces the number of categories by selecting a criteria to lump certain brands to theOther
category.
- If you were to test whether this sample were consistent with a sample from a uniform distribution, how would to generate null samples? If it was a sample from a uniform distribution what would it say about typical whiskey consumption? Write down what the null hypothesis is, too.
The following code might help:
# Subset the data
<- scotch_consumption %>%
scotch_consumption_sub mutate(
brand = ifelse(count > 200, brand, "Other brands")
%>%
) filter(brand != "Other brands") %>%
mutate(brand = as.character(factor(brand, labels=c(1:8))))
set.seed(220)
<- rmultinom(n=9,
sim size=sum(scotch_consumption_sub$count),
prob=rep(1/8, 8))
<- t(sim)
sim colnames(sim) <- as.character(c(1:8))
<- sim %>%
sim as_tibble() %>%
mutate(.sample=1:9)
<- sim %>%
sim pivot_longer(cols=`1`:`8`, names_to = "brand", values_to = "count")
<- bind_rows(sim,
scotch_lineup bind_cols(.sample=10, scotch_consumption_sub))
# Randomise .sample to hide data plot
$.sample <- rep(sample(1:10, 10), rep(8, 10)) scotch_lineup
- Show your lineup to five of your neighbours and have them report which is the most different plot. Compute the \(p\)-value. What do your results tell you about the typical consumption of the different whiskey brands?
- The data
whiskey_brands
in theflexmix
package has the information on whether the whiskey is a single malt or blend. It has additional variablesType
(Single Malt
orBlend
) andBottled
(Foreign
orDomestic
). What do you think would be interesting to learn from this new information? Write down some open-ended questions to explore from the data. The data is loaded when you load thewhiskey
data in the same package:
data(whiskey, package = "flexmix")
👋 Finishing up
Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.