ETC5521 Tutorial 6

Working with a single variable, making transformations, detecting outliers, using robust statistics

Author

Prof. Di Cook

Published

August 31, 2023

🎯 Objectives

These are exercises in making plots of a single variable, and in learning what those plots reveal about distributions, data patterns, and data problems.

🔧 Preparation

  • The reading for this week is Wilke (2019) Ch 6 Visualizing Amounts and Ch 7 Visualizing Distributions.
  • Complete the weekly quiz before the deadline!
  • Make sure you have this list of R packages installed:

install.packages(c("ggplot2movies", "bayesm", "flexmix",  "ggbeeswarm", "mixtools", "lvplot", "patchwork", "nullabor"))
  • Open your RStudio Project for this unit (the one you created in week 1, named eda or ETC5521). Create a .Rmd document for this week’s activities.
  • Note for tutors: Randomly allocate one of the problems to each student. Give them 30-40 minutes to work on the problem, then in the last 30 minutes discuss each problem and its findings with the entire tutorial.
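The solution code throughout this tutorial also assumes the tidyverse and the plotting packages above are loaded. A minimal setup chunk might be:

library(tidyverse)   # ggplot2, dplyr, tidyr, forcats, stringr
library(ggbeeswarm)  # geom_quasirandom()
library(lvplot)      # geom_lv()
library(patchwork)   # composing plots with + and plot_layout()
library(nullabor)    # lineup(), null_dist(), pvisual()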

Exercise 1: Galaxies

Load the galaxies data in the MASS package and answer the following questions based on this dataset.

data(galaxies, package = "MASS")

You can access documentation of the data (if available) using the help function specifying the package name in the argument.

help(galaxies, package = "MASS")
  1. What does the data contain? And what is the data source?
data(galaxies, package = "MASS")
glimpse(galaxies)
 num [1:82] 9172 9350 9483 9558 9775 ...

The data contains velocities in km/sec of 82 galaxies, from 6 well-separated conic sections of an unfilled survey of the Corona Borealis region. The original data is from Postman et al. (1986); this version is from Roeder (1990), with the 83rd observation removed from the original data, and it also carries a typo in the 78th observation.

  • Postman, M., Huchra, J. P. and Geller, M. J. (1986) Probes of large-scale structures in the Corona Borealis region. Astronomical Journal 92, 1238–1247
  • Roeder, K. (1990) Density estimation with confidence sets exemplified by superclusters and voids in galaxies. Journal of the American Statistical Association 85, 617–624.
  2. Based on the description in the R Help for the data, what would be an appropriate null distribution of this data?

The description in the R help for the data says “Multimodality in such surveys is evidence for voids and superclusters in the far universe.”

Deciding on an appropriate null hypothesis is always tricky. If we wanted to test the statement that the data is multimodal, we could compare against a unimodal distribution, either a normal or an exponential depending on what shape we might expect.

However, the published work has already made a claim that the data is multimodal, so it would be interesting to determine if we can generate samples from a multimodal distribution that are indistinguishable from the data.

\(H_0:\) The distribution is multimodal. \(H_a:\) The distribution is something other than multimodal.

  3. How many observations are there?

There are 82 observations.
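This can be confirmed directly:

length(galaxies)
# [1] 82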

  4. If the data is multimodal, which of the following displays do you think would be the best? Which would not be at all useful?
  • histogram
  • boxplot
  • density plot
  • violin plot
  • jittered dot plot
  • letter value plot

If you said a density plot, jittered dot plot, or a histogram, you’re on the right track, because each can give a fine resolution for showing modes. (The violin plot is not any different from a density plot, when only looking at one variable.)

  5. Make these plots for the data. Experiment with different binwidths for the histogram and different bandwidths for the density plot. Were you right in your thinking about which would be the best?
g <- ggplot(tibble(galaxies), aes(galaxies)) +
  theme(
    axis.title = element_blank(),
    axis.text = element_blank(),
    axis.ticks = element_blank()
  )
g1 <- g + geom_histogram(binwidth = 1000, color = "white") # histogram
g2 <- g + geom_boxplot()                                   # boxplot
g3 <- g + geom_density()                                   # density plot
g4 <- g + geom_violin(aes(x=galaxies, y=1), draw_quantiles = c(0.25, 0.5, 0.75)) # violin plot
g5 <- g + geom_quasirandom(aes(x=1, y=galaxies)) + coord_flip() # jittered dot plot
g6 <- g + geom_lv(aes(x=1, y=galaxies)) + coord_flip()          # letter value plot

g1 + g2 + g3 + g4 + g5 + g6 + plot_layout(ncol = 2)

  6. Fit your best mixture model to the data, and simulate 19 nulls to make a lineup. Show it to five of your neighbours and have them report which is the most different plot. Compute the \(p\)-value. Did you do a good job in matching the distribution? (Extra bonus: What is the model that you have created? Can you make a plot to show how it looks relative to the observed data?)

This code might be helpful to get you started.

library(mixtools)
galaxies_fit <- normalmixEM(galaxies, k=3)

set.seed(1138)
galaxies_sim1 <- rnormmix(n=length(galaxies), 
              lambda=galaxies_fit$lambda, 
              mu=galaxies_fit$mu,
              sigma=galaxies_fit$sigma)
ggplot(tibble(galaxies_sim1), aes(x=galaxies_sim1)) +
  geom_quasirandom(aes(x=1, y=galaxies_sim1)) + 
  coord_flip() +
  theme(
    axis.title = element_blank(),
    axis.text = element_blank(),
    axis.ticks = element_blank()
  )
galaxies_null <- tibble(.sample=1, galaxies=galaxies_sim1)
for (i in 2:19) {
  gsim <- rnormmix(n=length(galaxies), 
              lambda=galaxies_fit$lambda, 
              mu=galaxies_fit$mu,
              sigma=galaxies_fit$sigma)
  galaxies_null <- bind_rows(galaxies_null,
                             tibble(.sample=i, galaxies=gsim))
}
galaxies_null <- bind_rows(galaxies_null,
                             tibble(.sample=20,
                                    galaxies=galaxies))
# Randomise .sample  to hide data plot
galaxies_null$.sample <- rep(sample(1:20, 20), rep(82, 20))
ggplot(galaxies_null, aes(x=galaxies)) +
  geom_quasirandom(aes(x=1, y=galaxies)) + 
  facet_wrap(~.sample, ncol=4) +
  coord_flip() +
  theme(
    axis.title = element_blank(),
    axis.text = element_blank(),
    axis.ticks = element_blank()
  )
# To make a rough plot of your model
plot(galaxies_fit, whichplots=2)

The lambda values give the mixing proportions of the three normal components, and mu and sigma give the mean and standard deviation of each component.
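To compute the \(p\)-value from your neighbours' reports, you can use pvisual() from the nullabor package. A sketch, where the observer counts are hypothetical (here, say, two of five observers picked the data plot in a 20-panel lineup):

# x = number of observers choosing the data plot (hypothetical)
# K = number of observers, m = number of panels in the lineup
pvisual(x = 2, K = 5, m = 20)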

Exercise 2: Movie lengths

Load the movies dataset in the ggplot2movies package and answer the following questions based on it.

  1. How many observations are in the data?

There are 58788 observations.
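This can be checked with:

data(movies, package = "ggplot2movies")
nrow(movies)
# [1] 58788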

  2. Draw a histogram with an appropriate binwidth that shows the peaks at 7 minutes and 90 minutes. Draw another set of histograms to show whether these peaks existed both before and after 1980.
data(movies)
movies %>%
  mutate(
    after1980 = ifelse(year > 1980,
      "After 1980",
      "1980 or before"
    ),
    copy = FALSE
  ) %>%
  bind_rows(mutate(movies, copy = TRUE, after1980 = "All")) %>%
  ggplot(aes(length)) +
  geom_histogram(binwidth = 1, fill = "yellow", color = "black") +
  scale_x_continuous(
    breaks = c(0, 7, 30, 60, 90, 120, 150, 180),
    limits = c(0, 180)
  ) +
  facet_grid(after1980 ~ .)
Warning: Removed 784 rows containing non-finite values (`stat_bin()`).
Warning: Removed 6 rows containing missing values (`geom_bar()`).

  3. The variable Short indicates whether the film was classified as a short film (1) or not (0). Draw plots to investigate what rule was used to define a film as “short” and whether the films have been consistently classified.
movies %>%
  group_by(Short) %>%
  summarise(x = list(summary(length))) %>%
  unnest_wider(x)
# A tibble: 2 × 7
  Short Min.        `1st Qu.`   Median      Mean        `3rd Qu.`   Max.       
  <int> <table[1d]> <table[1d]> <table[1d]> <table[1d]> <table[1d]> <table[1d]>
1     0 1           85          93          95.44581    103         5220       
2     1 1            7          10          13.97092     19          240       

The maximum length for a film classified as short is 240 minutes.

movies %>%
  mutate(short = factor(Short, labels = c("Long", "Short"))) %>%
  ggplot(aes(y=length, x=short)) +
  geom_quasirandom() +
  scale_y_log10(
    limits = c(1, 240),
    breaks = c(1, 7, 10, 15, 20, 30, 45, 50, 70, 90, 110, 240)
  ) +
  labs(y = "") +
  coord_flip()
Warning: Removed 114 rows containing missing values (`position_quasirandom()`).

From the graph, the majority of films classified as short are under 50 minutes, while those classified as long tend to be longer than 50 minutes. There are clear cases of mislabelling, e.g. a one-minute film classified as “not short”.

On further detective work, the original source of the data says: “Any theatrical film or made-for-video title with a running time of less than 45 minutes, i.e., 44 minutes or less, or any TV series or TV movie with a running time of less than 22 minutes, i.e. 21 minutes or less. (A “half-hour” television program should not be listed as a Short.)” Given this is an objective measure based on the length of the film, any film that is 45 minutes or longer should be classified as long, and anything shorter should be classified as short.
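As a rough consistency check against this 45-minute rule (a sketch that ignores the separate 22-minute rule for TV titles), cross-tabulate the Short flag with the rule:

movies %>%
  mutate(rule_short = length < 45) %>% # TRUE if the rule says "short"
  count(Short, rule_short)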

  4. How would you use the lineup protocol to determine if the periodic peaks could happen by chance? What would be the null hypothesis? Make your lineup, and show it to five of your neighbours and have them report which is the most different plot. Compute the \(p\)-value. Comment on the results. (Note: It might be helpful to focus on movies between 50-150 minutes only.)
movies %>% 
  filter(between(length, 50, 150)) %>%
  select(length) %>%
  lineup(null_dist("length", "norm"), true=., n=9) %>%
  ggplot(aes(x=length)) +
    geom_histogram(binwidth = 1) +
    scale_x_continuous(
      breaks = c(0, 7, 30, 60, 90, 120, 150, 180),
      limits = c(0, 180)
    ) +
  facet_wrap(~.sample, ncol=3, scales="free") +
  theme(axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid.major = element_blank())
decrypt("clZx bKhK oL 3OHohoOL B6")
Warning: Removed 18 rows containing missing values (`geom_bar()`).

Simulate samples from a normal distribution using the sample mean and standard deviation as the parameters.

\(H_0\): The movie lengths are a sample from a normal distribution, i.e. the periodic peaks occur by chance.

I would expect the \(p\)-value to be 0, as the data plot is very clearly different from the nulls. This suggests that the multimodality would not be observed in samples from a normal distribution, and that the pattern is present because movie lengths are commonly cut to specific numbers of minutes.

Exercise 3: Whisky

The Scotch data set in the bayesm package was collated from a survey of Scotch drinkers, recording the brands they consumed. Do the necessary pre-processing to get the data looking like:

# A tibble: 21 × 2
   brand                      count
   <chr>                      <int>
 1 Ballantine                    99
 2 Black & White                 81
 3 Chivas Regal                 806
 4 Clan MacGregor               103
 5 Cutty Sark                   339
 6 Dewar's White Label          517
 7 Glenfiddich                  334
 8 Glenlivet                    354
 9 Grant's                       74
10 J&B                          458
11 Johnnie Walker Black Label   502
12 Johnnie Walker Red Label     424
13 Knockando                     47
14 Macallan                      95
15 Other brands                 414
16 Passport                      82
17 Pinch (Haig)                 117
18 Scoresby Rare                 79
19 Ushers                        67
20 White Horse                   62
21 Singleton                     31
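The later chunks assume this tibble is named scotch_consumption. The pre-processing is not shown, but a sketch might look like the following, assuming the 21 columns of bayesm::Scotch are 0/1 brand indicators (the column names contain dots, e.g. Other.brands, so some manual clean-up, such as restoring apostrophes, is still needed to match the display above):

data(Scotch, package = "bayesm")
scotch_consumption <- Scotch %>%
  pivot_longer(everything(),
               names_to = "brand", values_to = "consumed") %>%
  group_by(brand) %>%
  summarise(count = sum(consumed)) %>%
  # crude name clean-up: replace the dots with spaces
  mutate(brand = str_replace_all(brand, "\\.", " "))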
  1. Produce a barplot of the number of respondents per brand. What ordering of the brands do you think is the best?
scotch_consumption %>%
  mutate(
    brand = fct_reorder(brand, count),
    brand = fct_relevel(brand, "Other brands")
  ) %>%
  ggplot(aes(x=count, y=brand)) +
  geom_col() +
  ylab("")

  2. There are 20 named brands and one category that is labelled as Other.brands. Produce a barplot that you think best reduces the number of categories by selecting a criterion to lump certain brands into the Other category.
scotch_consumption %>%
  mutate(
    brand = ifelse(count > 200, brand, "Other brands"),
    brand = fct_reorder(brand, count),
    brand = fct_relevel(brand, "Other brands")
  ) %>%
  ggplot(aes(count, brand)) +
  geom_col()

I’ve chosen the cut-off to be 200, as there was a gap in frequency between the brands with counts above 200 and those below, and this reduces the comparison to 8 named brands, which is more manageable.
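An alternative to a manual cut-off is fct_lump_min() from forcats, which lumps all levels whose total weight falls below a threshold. A sketch with the same 200 cut-off:

scotch_consumption %>%
  mutate(brand = fct_lump_min(brand, min = 200, w = count,
                              other_level = "Other brands")) %>%
  count(brand, wt = count, name = "count")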

  3. If you were to test whether this sample were consistent with a sample from a uniform distribution, how would you generate null samples? If it were a sample from a uniform distribution, what would that say about typical whiskey consumption? Write down what the null hypothesis is, too.

The following code might help:

# Subset the data
scotch_consumption_sub <- scotch_consumption %>%
    mutate(
    brand = ifelse(count > 200, brand, "Other brands")
  ) %>%
  filter(brand != "Other brands") %>%
  mutate(brand = as.character(factor(brand, labels=c(1:8)))) 

set.seed(220)
sim <- rmultinom(n=9,
   size=sum(scotch_consumption_sub$count),
   prob=rep(1/8, 8))
sim <- t(sim)
colnames(sim) <- as.character(c(1:8))
sim <- sim %>%
  as_tibble() %>%
  mutate(.sample=1:9)
sim <- sim %>%
  pivot_longer(cols=`1`:`8`, names_to = "brand", values_to = "count")
scotch_lineup <- bind_rows(sim,
  bind_cols(.sample=10, scotch_consumption_sub))

# Randomise .sample  to hide data plot
scotch_lineup$.sample <- rep(sample(1:10, 10), rep(8, 10))

We need to simulate samples from a multinomial distribution for the null sets.

If the data was a sample from a uniform distribution, it would mean that all of these brands are typically consumed in equal quantity.

\(H_0:\) The counts for the brands are consistent with a sample from a uniform distribution.

# null_dist() in nullabor does not include the multinomial
# distribution, so the null samples were simulated directly with
# rmultinom() above, producing scotch_lineup.

# Make the lineup
ggplot(scotch_lineup, aes(x=brand, y=count)) +
  geom_col() +
  facet_wrap(~.sample, scales="free", ncol=5) +
  theme(axis.text = element_blank(),
        axis.title = element_blank())

# Order categories in all samples from highest to lowest: TRICKY
p <- vector("list", 10)
for (i in unique(scotch_lineup$.sample)) {
  d <- scotch_lineup %>% 
    filter(.sample == i)
  d <- d %>%
    mutate(brand = fct_reorder(brand, count))
  p[[i]] <- ggplot(d, aes(y=brand, x=count)) +
    geom_col() +
    ggtitle(i) +
    theme(axis.text = element_blank(),
        axis.title = element_blank())
}

p[[1]] + p[[2]] + p[[3]] + p[[4]] + p[[5]] +
  p[[6]] + p[[7]] + p[[8]] + p[[9]] + p[[10]] + 
  plot_layout(ncol=5)

  4. Show your lineup to five of your neighbours and have them report which is the most different plot. Compute the \(p\)-value. What do your results tell you about the typical consumption of the different whiskey brands?

I would expect that you get a \(p\)-value of \(0\) because everyone will point to the data plot. They will probably also report that one bar is much bigger than the other bars.

It tells us that the data is not a sample from a uniform distribution, at least in the sense that one brand is consumed more frequently than the others.

  5. The data whiskey_brands in the flexmix package has information on whether each whiskey is a single malt or a blend, in the additional variables Type (Single Malt or Blend) and Bottled (Foreign or Domestic). What do you think would be interesting to learn from this new information? Write down some open-ended questions to explore from the data. The data is loaded when you load the whiskey data in the same package:
data(whiskey, package = "flexmix")

Possible questions:

  • Is single malt more popularly consumed than blended whiskey?
  • Is domestic whiskey more popularly consumed than foreign whiskey?
  • Are foreign whiskies more likely to be single malt?
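A starting point for exploring these questions is to join the brand information to the consumption counts. A sketch, assuming whiskey_brands has the columns Brand, Type and Bottled as described, and that the brand names line up with those in scotch_consumption after cleaning:

data(whiskey, package = "flexmix")
scotch_consumption %>%
  # names must match between the two sources for the join to work
  inner_join(whiskey_brands, by = c("brand" = "Brand")) %>%
  count(Type, wt = count, name = "total")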

👋 Finishing up

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.