ETC5521 Tutorial 5

Statistical inference for exploratory methods

Author

Prof. Di Cook

Published

August 24, 2023

🎯 Objectives

Learn to apply visual inference to different types of plots.

🔧 Preparation

The reading for this week is Wickham et al. (2010) Graphical inference for Infovis. It is a basic introduction to inference for exploratory data analysis, especially for data visualisation. - Complete the weekly quiz, before the deadline! - Make sure you have this list of R packages installed:

install.packages(c("tidyverse", "datarium", "broom", "nullabor"))

Open your RStudio Project for this unit, (the one you created in week 1, eda or ETC5521). Create a .Rmd document for this week’s activities.

Exercise 1: Skittles experiment

Skittles come in five colors (orange, yellow, red, purple, green) each with their own flavours (orange, lemon, strawberry, grape, green apple). Data was collected by Dr Nick Tierney to explore whether a sample of 3 people could identify the flavour of skittles while blindfolded. You can find the cleaned tidy data here.

How many skittles did each person taste?

Solution

skittle <- read_csv("https://raw.githubusercontent.com/njtierney/skittles/master/data/skittles.csv")

Each person tasted 10 skittles.

skittle %>% 
  with(table(person))

person
 a  b  c 
10 10 10

A person with loss of taste is called ageusia and a person who has a loss of smell is called anosmia. The loss of taste and loss of smell will not allow you to distinguish flavours in food. What is the probability that a person with ageusia and anosmia will guess the skittle flavour correctly (out of the five flavours) for one skittle?

Solution

If a person cannot distinguish flavours then they will randomly choose one of the five flavours. So the probability that they select the correct flavour is 1/5.

What is the probability that a person with ageusia and anosmia will guess the skittle flavour correctly for 2 out of 10 skittles, assuming the order of taste does not matter?

Solution

Suppose $X$ is the number of skittles that they correctly identified the flavour. Then assuming that the person cannot distinguish flavours and order of tasting the skittles does not matter, $X \sim B(10, 0.2)$. Then $P(X = 2) = {10 \choose 2} 0.2^2 0.8^8\approx 0.3$. So there’s only about 30% chance such an event happens!

dbinom(2, 10, 0.2)

[1] 0.3019899

Test the null hypothesis that people cannot distinguish the flavours correctly, against the alternative that they can. Assume that the order of tasting does not matter and each person has the same ability to correctly identify the flavours. In conducting your test, define your null and alternate hypothesis, in statistical notation, your assumptions, the test statistics and calculate the $p$-value.

Solution

Suppose $X$ is the number of skittles that a person identified the flavour correctly out of 30 skittles. Suppose each tasting is independent and has a equal probability of identifying the flavour correctly; we denote this probability as $p$. We test the hypotheses: $H_0: p = 0.2$ vs. $H_1: p > 0.2$. Under $H_0$, $X\sim B(30, 0.2)$ and therefore the $p$-value is $P(X \geq 15) \approx 0.0002$. The $p$-value is small so the data supports that people can correctly identify the flavour of a skittle!

sum(skittle$correct)

[1] 15

1 - pbinom(sum(skittle$correct), 30, 0.2)

[1] 5.238729e-05

In part (d) we disregarded the order of the tasting and the possible variability in people’s ability to correctly identify the flavour. If in fact these do matter, then how would you construct the test statistic? Is it easy?

Solution

To construct a test statistic, we need to construct a summary statistic with some known distribution under the null hypothesis (if using a parametric approach) with large (or extreme) values indicating rejection of the null hypothesis. Suppose that $X_1$, $X_2$ and $X_3$ are the number of skittles out of 10 that person a, b and c, respectively, correctly identified. If each tasting is independent, then $X_1 \sim B(10, p_1)$, $X_2 \sim B(10, p_2)$ and $X_3 \sim B(10, p_3)$ where $p_i$ is the probability that the $i$-th person correctly identifies the flavour of a skittle. Now under $H_0$ you may assume that $p_1 = p_2 = p_3 = 0.2$ and assuming each person is independent, $X_1 + X_2 + X_3 \sim B(30, 0.2)$. Same as (d)! However, if we know remove the assumption that each tasting is independent (so the order of tasting does matter), then the distribution of the test statistic does not hold true any longer.

Consider the plot below that shows in each tile whether a person guessed correctly by order of their tasting. Suppose that under the null hypothesis, the order of tasting does not matter and people have no ability to distinguish the flavours. Generate a null plot under this null hypothesis.

Solution

The null plot is constructed as follows.

gtile <- skittle %>% 
  ggplot(aes(factor(order), person, fill = factor(correct))) + 
  geom_tile(color = "black", size = 2) +
  coord_equal() + 
  scale_fill_viridis_d() +
  labs(x = "Order", y = "Person", fill = "Correct")

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

gtile

set.seed(1)
method <- null_dist("correct", "binom", list(size = 1, prob = 0.2))
gtile %+% method(skittle)

Based on (f), construct a lineup (using nullabor or otherwise) of 20 plots. Ask your classmate, which plot looks different.

Solution

lineup_df <- lineup(method, true = skittle)

decrypt("clZx bKhK oL 3OHohoOL 0C")

gtile %+% lineup_df +
  facet_wrap(~.sample) +
  guides(fill = FALSE) +
  theme(axis.text = element_blank(),
        axis.title = element_blank())

Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.

decrypt("h8RX 5IvI ne TAynvnAe YL")

Warning in decrypt("h8RX 5IvI ne TAynvnAe YL"): NAs introduced by coercion

[1] "tlqA XUMU Fv xBKFMFBv  NA"

Suppose that you have a response from 100 people based on your line-up from (g) and 76 correctly identified the data plot. What is the $p$-value from this visual inference?

Solution

We suppose that each person has the same ability to identify the data plot. If we let $X$ be the number of people who correctly identified the data plot in the lineup, then $X \sim B(100, p)$. The visual inference $p$-value is calculated from testing the hypotheses $H_0: p = 0.05$ vs $H_1: p \neq 0.05$, and so is $P(X\geq 76) \approx 0$. The visual inference $p$-value is very small so there is strong evidence to believe that the structure in the data deviates away from the null distribution!

1 - pbinom(75, 100, 0.05)

[1] 0

Now consider the plot below. Use the same null data in (g) to construct a lineup based on below visual statistic. Suppose we had 28 people out of 100 who correctly identified the data plot in this lineup. What is the difference in power of visual statistic in (f) and this one?

Solution

gbar <- skittle %>% 
  mutate(person = fct_reorder(person, correct, sum)) %>% 
  group_by(person) %>% 
  summarise(correct = sum(correct)) %>% 
  ggplot(aes(person, correct)) + 
  geom_col() +
  labs(x = "Person", y = "Correct") +
  geom_hline(yintercept = 2, linetype = "dashed")
gbar

gbar %+% lineup_df +
  facet_wrap(~.sample) +
  guides(fill = FALSE) +
  theme(axis.text = element_blank(),
        axis.title = element_blank())

decrypt("h8RX 5IvI ne TAynvnAe YL")

Warning in decrypt("h8RX 5IvI ne TAynvnAe YL"): NAs introduced by coercion

[1] "tlqA XUMU Fv xBKFMFBv  NA"

The estimated power of visual statistic in (f) is 76% and for the barplot is 26%. So the difference in power is 50%.

Exercise 2: Social media marketing

The data marketing in the datarium R-package contains information on sales with advertising budget for three advertising media (youtube, facebook and newspaper). This advertising experiment was repeated 200 times to study the impact of the advertisting media on sales.

data(marketing, package = "datarium")

Study the pairs plot. Which of the advertising medium do you think affects the sales?

Solution

GGally::ggpairs(marketing)

The pairs plot suggest that advertising on youtube is highly correlated with the sales and advertising on facebook is moderately correlated with the sales. Newspaper advertisement does not appear to be correlated highly with the sales.

Construct a coplot for sales vs advertising budget for facebook conditioned on advertising budget for youtube and newspaper. (You may like to make the intervals non-overlapping to make it easier to plot in ggplot). What do you see in the plot?

Solution

marketing %>% 
  ggplot(aes(facebook, sales)) +
  geom_point() + 
  facet_grid(cut_number(youtube, 4) ~ cut_number(newspaper, 4)) + 
  geom_smooth(method = "lm")

The newspaper does not seem to have much affect on the sales however it is noticeable that sales is linearly related to advertisement budget for facebook conditioned on youtube.

Now construct a coplot for sales vs advertising budget for facebook conditioned on advertising budget for youtube alone. Superimpose a linear model on each facet. Is there an interval where the linear model is not a good fit?

Solution

marketing %>% 
  ggplot(aes(facebook, sales)) +
  geom_point() + 
  facet_wrap(~cut_number(youtube, 4)) + 
  geom_smooth(method = "lm")

There is a noticeably higher variability along the line in the above plot where advertisement budget for youtube is less than $90,000. There appears to be a linear relationship between facebook and sales (conditioned on advertisement budget on youtube), however the fitted lines all appear to have different slopes.

Consider the following interaction model (which has the same symbolic model formulae as sales ~ facebook*youtube) for data where the advertising budget for youtube is at least $90,000. Construct a QQ-plot of the residuals. Do you think the errors are normally distributed? Construct a lineup for the QQ-plot assuming that the null distribution is Normally distributed with mean zero and variance as estimated from the model fit.

Solution

set.seed(1149)
fit <- lm(sales ~ facebook + youtube + facebook:youtube, data = filter(marketing, youtube > 90))

gqq <- augment(fit) %>% 
  ggplot(aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line()
gqq

gqq %+% lineup(null_dist(".resid", "norm", list(mean = 0, sd = sigma(fit))),
               true = augment(fit)) +
  facet_wrap(~.sample) +
  theme(axis.text = element_blank(),
        axis.title = element_blank(), 
        aspect.ratio = 1)

# > decrypt("clZx bKhK oL 3OHohoOL 0d")
# [1] "True data in position  13"

👋 Finishing up

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.