Learn to apply visual inference to different types of plots.
🔧 Preparation
The reading for this week is Wickham et al. (2010), Graphical inference for Infovis. It is a basic introduction to inference for exploratory data analysis, especially for data visualisation.
- Complete the weekly quiz before the deadline!
- Make sure you have this list of R packages installed:
Open your RStudio Project for this unit (the one you created in week 1, eda or ETC5521). Create a .Rmd document for this week’s activities.
Exercise 1: Skittles experiment
Skittles come in five colours (orange, yellow, red, purple, green), each with its own flavour (orange, lemon, strawberry, grape, green apple). Data was collected by Dr Nick Tierney to explore whether a sample of 3 people could identify the flavour of skittles while blindfolded. You can find the cleaned tidy data here.
Loss of taste is called ageusia and loss of smell is called anosmia. A person with both conditions cannot distinguish flavours in food. What is the probability that a person with ageusia and anosmia will guess the skittle flavour correctly (out of the five flavours) for one skittle?
Solution
If a person cannot distinguish flavours then they will randomly choose one of the five flavours. So the probability that they select the correct flavour is 1/5.
What is the probability that a person with ageusia and anosmia will guess the skittle flavour correctly for 2 out of 10 skittles, assuming the order of taste does not matter?
Solution
Suppose \(X\) is the number of skittles whose flavour they correctly identified. Then assuming that the person cannot distinguish flavours and the order of tasting the skittles does not matter, \(X \sim B(10, 0.2)\). Then \(P(X = 2) = {10 \choose 2} 0.2^2 0.8^8\approx 0.3\). So there is only about a 30% chance of such an event!
dbinom(2, 10, 0.2)
[1] 0.3019899
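The exact binomial probability can be sanity-checked by simulation. The sketch below is illustrative only: the seed and the number of simulated tasters (`n_sim`) are arbitrary choices, not part of the exercise.

```r
set.seed(1)        # arbitrary seed, for reproducibility only
n_sim <- 100000    # number of simulated tasters (arbitrary)

# Each simulated taster guesses 10 skittles, each correct with probability 1/5
x <- rbinom(n_sim, size = 10, prob = 0.2)

sim_prob   <- mean(x == 2)        # proportion with exactly 2 correct
exact_prob <- dbinom(2, 10, 0.2)  # exact binomial probability

c(simulated = sim_prob, exact = exact_prob)
```

The simulated proportion should sit close to the exact value, with the gap shrinking as `n_sim` grows.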
Test the null hypothesis that people cannot distinguish the flavours correctly, against the alternative that they can. Assume that the order of tasting does not matter and each person has the same ability to correctly identify the flavours. In conducting your test, state your null and alternative hypotheses in statistical notation, your assumptions and the test statistic, and calculate the \(p\)-value.
Solution
Suppose \(X\) is the number of skittles, out of the 30 tasted, whose flavour was identified correctly. Suppose each tasting is independent and has an equal probability of identifying the flavour correctly; we denote this probability as \(p\). We test the hypotheses: \(H_0: p = 0.2\) vs. \(H_1: p > 0.2\). Under \(H_0\), \(X\sim B(30, 0.2)\) and therefore the \(p\)-value is \(P(X \geq 15) \approx 0.0002\). The \(p\)-value is small, so the data support the claim that people can correctly identify the flavour of a skittle!
sum(skittle$correct)
[1] 15
1 - pbinom(sum(skittle$correct) - 1, 30, 0.2)
[1] 0.0002312256
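Note that `pbinom(q, ...)` returns \(P(X \leq q)\), so the upper tail \(P(X \geq 15)\) is `1 - pbinom(14, ...)`. As a cross-check, base R's `binom.test()` runs the same one-sided exact test. A minimal sketch, using the count of 15 correct out of 30 reported above (`n_correct` is just an illustrative variable name):

```r
n_correct <- 15  # sum(skittle$correct), as reported above

# Exact one-sided binomial test of H0: p = 0.2 against H1: p > 0.2
bt <- binom.test(n_correct, n = 30, p = 0.2, alternative = "greater")
bt$p.value

# The same upper-tail probability P(X >= 15) computed directly
manual <- pbinom(n_correct - 1, size = 30, prob = 0.2, lower.tail = FALSE)
manual
```

Using `lower.tail = FALSE` avoids the subtraction from 1 and makes the off-by-one in the tail boundary easier to spot.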
In part (d) we disregarded the order of the tasting and the possible variability in people’s ability to correctly identify the flavour. If in fact these do matter, then how would you construct the test statistic? Is it easy?
Solution
To construct a test statistic, we need a summary statistic with a known distribution under the null hypothesis (if using a parametric approach), with large (or extreme) values indicating rejection of the null hypothesis. Suppose that \(X_1\), \(X_2\) and \(X_3\) are the number of skittles out of 10 that person a, b and c, respectively, correctly identified. If each tasting is independent, then \(X_1 \sim B(10, p_1)\), \(X_2 \sim B(10, p_2)\) and \(X_3 \sim B(10, p_3)\) where \(p_i\) is the probability that the \(i\)-th person correctly identifies the flavour of a skittle. Now under \(H_0\) you may assume that \(p_1 = p_2 = p_3 = 0.2\) and, assuming the people are independent of each other, \(X_1 + X_2 + X_3 \sim B(30, 0.2)\). Same as (d)! However, if we now remove the assumption that each tasting is independent (so the order of tasting does matter), then the test statistic no longer follows this distribution, and deriving its distribution is not easy.
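When a statistic has no convenient closed-form null distribution, one option is to approximate that distribution by simulation. The sketch below is illustrative and not the unit's official solution. It assumes a null of independent guesses with \(p = 0.2\), uses a hypothetical statistic (the largest number correct by any one person) and a made-up `observed` matrix standing in for the real `skittle` data.

```r
set.seed(2010)  # arbitrary seed, for reproducibility only

n_people   <- 3
n_tastings <- 10

# HYPOTHETICAL observed outcomes (1 = correct); rows are people, columns are
# tasting order. Replace with the real data reshaped to this layout.
observed <- matrix(rbinom(n_people * n_tastings, 1, 0.5), nrow = n_people)

# Illustrative statistic: the best individual score
best_score <- function(x) max(rowSums(x))

# Approximate the null distribution by simulating guesses with p = 1/5
n_sim <- 5000
null_stats <- replicate(
  n_sim,
  best_score(matrix(rbinom(n_people * n_tastings, 1, 0.2), nrow = n_people))
)

# Monte Carlo p-value (the +1 counts the observed value itself)
p_value <- (1 + sum(null_stats >= best_score(observed))) / (1 + n_sim)
p_value
```

The same recipe works for any statistic that can be computed from the person-by-order grid, which is also the idea behind comparing a data plot with null plots.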
Consider the plot below that shows in each tile whether a person guessed correctly by order of their tasting. Suppose that under the null hypothesis, the order of tasting does not matter and people have no ability to distinguish the flavours. Generate a null plot under this null hypothesis.
Solution
The null plot is constructed as follows.
# Base tile plot: one tile per tasting, filled by whether the guess was correct.
# `linewidth` replaces the `size` argument deprecated in ggplot2 3.4.0.
gtile <- skittle %>%
  ggplot(aes(factor(order), person, fill = factor(correct))) +
  geom_tile(color = "black", linewidth = 2) +
  coord_equal() +
  scale_fill_viridis_d() +
  labs(x = "Order", y = "Person", fill = "Correct")
decrypt("h8RX 5IvI ne TAynvnAe YL")
Warning in decrypt("h8RX 5IvI ne TAynvnAe YL"): NAs introduced by coercion
[1] "tlqA XUMU Fv xBKFMFBv NA"
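A null plot needs data generated under the null hypothesis rather than the observed data. The following is a minimal sketch under stated assumptions: `null_skittle` is a hypothetical stand-in with the same shape as `skittle` (3 people labelled a, b and c, 10 tastings each), and each guess is simulated as correct with probability 1/5, independently of person and order. It requires ggplot2 3.4.0 or later for `linewidth`.

```r
library(ggplot2)

set.seed(5521)  # arbitrary seed, for reproducibility only

# Hypothetical grid with the same layout as `skittle`
null_skittle <- expand.grid(order = 1:10, person = c("a", "b", "c"))

# Under the null, every guess is correct with probability 1/5
null_skittle$correct <- rbinom(nrow(null_skittle), size = 1, prob = 0.2)

gnull <- ggplot(null_skittle,
                aes(factor(order), person, fill = factor(correct))) +
  geom_tile(color = "black", linewidth = 2) +
  coord_equal() +
  scale_fill_viridis_d() +
  labs(x = "Order", y = "Person", fill = "Correct")

gnull
```

Repeating the simulation with different seeds gives the additional null plots needed to fill out a lineup.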
Suppose that you have a response from 100 people based on your line-up from (g) and 76 correctly identified the data plot. What is the \(p\)-value from this visual inference?
Solution
We suppose that each person has the same ability to identify the data plot. If we let \(X\) be the number of people who correctly identified the data plot in the lineup, then \(X \sim B(100, p)\). Under the null hypothesis the data plot is indistinguishable from the null plots, so \(p = 0.05\) (the chance of picking the data plot at random, which corresponds to a lineup of 20 plots). The visual inference \(p\)-value is calculated from testing the hypotheses \(H_0: p = 0.05\) vs \(H_1: p > 0.05\), and so is \(P(X\geq 76) \approx 0\). The visual inference \(p\)-value is very small, so there is strong evidence that the structure in the data deviates from the null distribution!
1-pbinom(75, 100, 0.05)
[1] 0
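The same calculation can be wrapped in a small helper for reuse with other lineups. This is a sketch only: `visual_pvalue()` is a hypothetical name, and it assumes independent observers who each pick the data plot with probability `1 / m` under the null, where `m` is the number of plots in the lineup.

```r
# Hypothetical helper: P(X >= x) when X ~ B(K, 1/m)
#   x = observers who picked the data plot
#   K = total observers
#   m = number of plots in the lineup (assumed 20 here)
visual_pvalue <- function(x, K, m = 20) {
  pbinom(x - 1, size = K, prob = 1 / m, lower.tail = FALSE)
}

visual_pvalue(76, K = 100)
```

Asking for the upper tail directly with `lower.tail = FALSE` returns a tiny positive number instead of rounding to exactly 0, which `1 - pbinom(...)` does when the tail probability is below machine precision.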
Now consider the plot below. Use the same null data as in (g) to construct a lineup based on the visual statistic below. Suppose 28 people out of 100 correctly identified the data plot in this lineup. What is the difference in power between the visual statistic in (f) and this one?
The estimated power of the visual statistic in (f) is 76% and for the barplot it is 28%, so the difference in power is 48 percentage points.
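The power of a visual statistic is estimated by the proportion of observers who pick the data plot. A minimal sketch using the counts stated in the questions (76 and 28 out of 100); the labels `tile` and `bar` are illustrative names, and the exact binomial intervals from `binom.test()` are an addition to give a sense of the uncertainty:

```r
k <- 100                             # observers per lineup
detected <- c(tile = 76, bar = 28)   # observers who picked the data plot

power_hat  <- detected / k           # estimated power of each visual statistic
power_diff <- unname(power_hat["tile"] - power_hat["bar"])

# Exact 95% binomial confidence interval for each estimated power
ci <- sapply(detected, function(x) binom.test(x, k)$conf.int)
rownames(ci) <- c("lower", "upper")

power_hat
power_diff
ci
```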
Exercise 2: Social media marketing
The data marketing in the datarium R-package contains information on sales with advertising budget for three advertising media (youtube, facebook and newspaper). This advertising experiment was repeated 200 times to study the impact of the advertising media on sales.
data(marketing, package = "datarium")
Study the pairs plot. Which of the advertising media do you think affect the sales?
Solution
GGally::ggpairs(marketing)
The pairs plot suggests that advertising on youtube is highly correlated with sales and advertising on facebook is moderately correlated with sales. Newspaper advertising does not appear to be highly correlated with sales.
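The visual impression from the pairs plot can be checked against the correlation matrix. A minimal sketch, assuming the datarium package is installed and that `marketing` has the columns `youtube`, `facebook`, `newspaper` and `sales`:

```r
data(marketing, package = "datarium")

# Pairwise Pearson correlations between all four variables
cor_mat <- cor(marketing)
round(cor_mat, 2)

# Correlation of each advertising medium with sales, largest first
sort(cor_mat["sales", c("youtube", "facebook", "newspaper")], decreasing = TRUE)
```

Correlation only measures linear association, so it complements the plot rather than replacing it.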
Construct a coplot for sales vs advertising budget for facebook conditioned on advertising budget for youtube and newspaper. (You may like to make the intervals non-overlapping to make it easier to plot in ggplot). What do you see in the plot?
Newspaper advertising does not seem to have much effect on sales; however, it is noticeable that sales are linearly related to the advertising budget for facebook, conditioned on youtube.
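One way to draw such a coplot in ggplot2 is to cut the conditioning variables into non-overlapping intervals and facet on them. This is a sketch rather than the unit's solution code: the choice of four equal-count intervals and the names `coplot_df`, `youtube_grp` and `newspaper_grp` are assumptions.

```r
library(ggplot2)
data(marketing, package = "datarium")

# Cut each conditioning variable into 4 intervals with roughly equal counts
coplot_df <- transform(
  marketing,
  youtube_grp   = cut_number(youtube, 4),
  newspaper_grp = cut_number(newspaper, 4)
)

# Sales against facebook budget, one panel per (newspaper, youtube) interval
ggplot(coplot_df, aes(facebook, sales)) +
  geom_point() +
  facet_grid(newspaper_grp ~ youtube_grp, labeller = label_both) +
  labs(x = "Facebook budget", y = "Sales")
```

Reading across a row shows the effect of youtube at a fixed newspaper interval, and reading down a column shows the effect of newspaper at a fixed youtube interval.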
Now construct a coplot for sales vs advertising budget for facebook conditioned on advertising budget for youtube alone. Superimpose a linear model on each facet. Is there an interval where the linear model is not a good fit?
There is noticeably higher variability around the fitted line where the advertising budget for youtube is less than $90,000. There appears to be a linear relationship between facebook and sales (conditioned on the advertising budget for youtube); however, the fitted lines all appear to have different slopes.
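The observation about differing slopes can be checked numerically by fitting a separate simple linear model within each youtube interval. A minimal sketch, assuming quartile-based intervals (the intervals used in the plot may differ) and the illustrative names `youtube_grp` and `slopes`:

```r
data(marketing, package = "datarium")

# Non-overlapping youtube intervals based on quartiles
breaks <- quantile(marketing$youtube, probs = seq(0, 1, 0.25))
marketing$youtube_grp <- cut(marketing$youtube, breaks, include.lowest = TRUE)

# Slope of sales on facebook within each youtube interval
slopes <- sapply(
  split(marketing, marketing$youtube_grp),
  function(d) coef(lm(sales ~ facebook, data = d))[["facebook"]]
)
slopes
```

Slopes that change systematically across intervals are what an interaction term such as `facebook:youtube` is designed to capture.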
Consider the following interaction model (which has the same symbolic model formula as sales ~ facebook*youtube) for data where the advertising budget for youtube is at least $90,000. Construct a QQ-plot of the residuals. Do you think the errors are normally distributed? Construct a lineup for the QQ-plot assuming that, under the null, the errors are normally distributed with mean zero and variance as estimated from the model fit.
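A lineup for the QQ-plot can be assembled by hand: place the observed residuals in one randomly chosen panel and fill the rest with draws from the fitted null distribution. This is a sketch under stated assumptions, not the unit's solution: it assumes the budgets in `marketing` are recorded in thousands of dollars (so $90,000 corresponds to `youtube >= 90`), a lineup of `m = 20` panels, and illustrative names such as `high`, `pos` and `lineup_df`.

```r
library(ggplot2)
data(marketing, package = "datarium")

# ASSUMPTION: budgets are in thousands of dollars, so $90,000 is youtube >= 90
high <- subset(marketing, youtube >= 90)
fit  <- lm(sales ~ facebook * youtube, data = high)
sigma_hat <- summary(fit)$sigma   # estimated residual standard deviation

set.seed(2024)        # arbitrary seed, for reproducibility only
m   <- 20             # number of panels in the lineup
pos <- sample(m, 1)   # panel that holds the observed residuals

# Panel `pos` gets the real residuals; the others get N(0, sigma_hat^2) draws
lineup_df <- do.call(rbind, lapply(seq_len(m), function(i) {
  data.frame(
    panel = i,
    resid = if (i == pos) unname(residuals(fit)) else rnorm(nrow(high), 0, sigma_hat)
  )
}))

ggplot(lineup_df, aes(sample = resid)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~ panel)
```

Keep `pos` hidden from whoever views the lineup. If observers pick panel `pos` more often than chance, that is evidence against normally distributed errors.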