---
# Why is a data plot a statistic?
- The concept of tidy data matches elementary statistics
- Tabular form puts variables in columns and observations in rows
- Not all tabular data is in this form
- This is the point of tidy data
$$X = \left[ \begin{array}{rrrr}
X_1 & X_2 & ... & X_p
\end{array} \right] \\
= \left[ \begin{array}{rrrr}
X_{11} & X_{12} & ... & X_{1p} \\
X_{21} & X_{22} & ... & X_{2p} \\
\vdots & \vdots & \ddots& \vdots \\
X_{n1} & X_{n2} & ... & X_{np}
\end{array} \right]$$
- We might even make assumptions about the distribution of each variable, e.g. $X_1 \sim N(0,1), ~~X_2 \sim \text{Exp}(1) ...$
---
# Why is a data plot a statistic?
- A statistic is a function on the values of items in a sample, e.g. for $n$ iid random variates $\bar{X}_1=\sum_{i=1}^n X_{i1}$, $s_1^2=\frac{1}{n-1}\sum_{i=1}^n(X_{i1}-\bar{X}_1)^2$
- We study the behaviour of the statistic over all possible samples of size $n$.
- The grammar of graphics is the mapping of (random) variables to graphical elements, making plots of data into statistics
.flex[
.item.w-45[
```
ggplot(threept_sub,
aes(x=angle, y=r)) +
geom_point(alpha=0.3)
```
```
ggplot(penguins,
aes(x=bill_length_mm,
y=flipper_length_mm,
colour=species)) +
geom_point()
```
]
.item.w-45[
`angle` is mapped to the x axis
`r` is mapped to the y axis
`bill_length_mm` is mapped to the x axis
`flipper_length_mm` is mapped to the y axis
`species` is mapped to colour
]
]
---
# What is inference?
.f5[
Inferring that what we see in the data at hand holds more broadly in life, society and the world.
]
.flex[
.item.w-50[
Here's an example tweeted by David Robinson,
commenting on an analysis in [Tick Tock blog, by Graham Tierney](https://ticktocksaythehandsoftheclock.wordpress.com/2018/01/11/capitals-and-good-governance/)
]
.item.w-50[
.blockquote[Below is a simple scatterplot of the two variables of interest. A slight negative slope is observed, but it does not look very large. There are a lot of states whose capitals are less than 5% of the total population. The two outliers are Hawaii (government rank 33 and capital population 25%) and Arizona (government rank 26 and capital population 23%). Without those two the downward trend (an improvement in ranking) would be much stronger.
... I'm not convinced ...]
]
]
---
# To do *statistical* inference
You need a:
- null hypothesis, and alternative
- statistic computed on the data
- reference distribution on which to measure the statistic: if its extreme on this scale you would reject the null
--
# Inference with *data plots*
You need a:
- plot description, as provided by the grammar (which is a statistic)
- plot description prescribing the null hypothesis .monash-blue2[(not the data plot itself)]
- null generating mechanism, e.g. permutation, simulation from a distribution or model
- human visual system, to examine array of plots and decide if any are different from the others
---
# Some examples
Here are several plot descriptions. What would be the null hypothesis in each?
.flex[
.item.w-45[
A
```
ggplot(data) +
geom_point(aes(x=x1, y=x2))
```
B
```
ggplot(data) +
geom_point(aes(x=x1,
y=x2, colour=cl))
```
]
.item.w-10[
.white[space]
]
.item.w-45[
C
```
ggplot(data) +
geom_histogram(aes(x=x1))
```
D
```
ggplot(data) +
geom_boxplot(aes(x=cl, y=x1))
```
]
]
.monash-orange2[Which plot definition would most match to a null hypothesis stating **there is no difference in the distribution between the groups?**]
---
# Some examples
Here are several plot descriptions. What would be the null hypothesis in each?
.flex[
.item.w-45[
A
$H_o:$ no association between `x1` and `x2`
B
$H_o:$ no difference between levels of `cl`
]
.item.w-10[
.white[space]
]
.item.w-45[
C
$H_o:$ the distribution of `x1` is XXX
D
$H_o:$ no difference in the distribution of `x1` between levels of `cl`
]
]
---
class: transition
# Visual inference with the nullabor 📦
---
.flex[
.item.w-45[
Example from the nullabor package. The data plot is embedded randomly in a field of null plots, this is a **lineup**. Can you see which one is different?
When you run the example yourself, you get a `decrypt` code line, that you run after deciding on a plot to print the location of the data plot amongst the nulls.
- plot is a scatterplot, null hypothesis is .monash-orange2[*there is no association between the two variables mapped to the x, y axes*]
- null generating mechanism: .monash-orange2[permutation]
]
.item.w-10[
.white[space]
]
.item.w-50.f5[
```{r lineup 1, fig.height=8, fig.width=8, out.width="70%"}
# Make a lineup of the mtcars data, 20 plots, one is the data,
# and the others are null plots. Which one is different?
set.seed(20190709)
ggplot(lineup(null_permute('mpg'), mtcars), aes(mpg, wt)) +
geom_point() +
facet_wrap(~ .sample) +
theme(axis.text=element_blank(),
axis.title=element_blank())
```
]]
---
.flex[
.item.w-45[
# Lineup
Embed the data plot in a field of null plots
```{r embed the data plot in a field of null plots, eval=FALSE}
library(nullabor)
pos <- sample(1:20, 1)
df_null <- lineup(
null_permute('v1'),
df, pos=pos)
ggplot(df_null,
aes(x=v2,
y=v1,
fill=v2)) +
geom_boxplot() +
facet_wrap(~.sample,
ncol=5) +
coord_flip()
```
.monash-orange2[Ask]: Which plot is the most different?
]
.item.w-10[
.white[space]
]
.item.w-45[
# Null-generating mechanisms
- Permutation: randomizing the order of one of the variables breaks association, but keeps marginal distributions the same
- Simulation: from a given distribution, or model. Assumption is that the data comes from that model
# Evaluation
- Compute $p$-value
- Power $=$ signal strength
]
]
---
# .orange[Case study] .circle.white.bg-orange[1] Temperatures of stars .font_small[Part 1/2]
* The data consists of the surface temperature in Kelvin degrees of 96 stars.
--
* We want to check if the surface temperature has an exponential distribution.
--
* We use histogram with 30 bins as our visual test statistic.
--
* For the null data, we will generate from an exponential distribution.
```{r star-null, message = TRUE}
line_df <- lineup(null_dist("temp", "exp",
list(rate = 1 / mean(dslabs::stars$temp))),
true = dslabs::stars,
n = 10
)
```
* Note: the rate in an exponential distribution can be estimated from the inverse of the sample mean.
---
# .orange[Case study] .circle.white.bg-orange[4] Temperatures of stars .font_small[Part 2/2]
.grid[
.item[
.panelset[
.panel[.panel-name[đź“Š]
```{r stars-lineup, echo = FALSE, fig.width = 14}
ggplot(line_df, aes(temp)) +
geom_histogram(color = "white") +
facet_wrap(~.sample, nrow = 2) +
theme(
axis.text = element_blank(),
axis.title = element_blank()
)
```
]
.panel[.panel-name[R]
```{r stars-lineup, eval = FALSE}
```
]]
]]
---
# .orange[Case study] .circle.white.bg-orange[2] Foreign exchange rate .font_small[Part 1/2]
* The data contains the daily exchange rate of 1 AUD to 1 USD between 9th Jan 2018 to 21st Feb 2018.
* Does the rate follow an ARIMA model?
.f5[
```{r ts-data, message = TRUE}
data(aud, package = "nullabor")
line_df <- lineup(null_ts("rate", forecast::auto.arima), true = aud, n = 10)
```
```{r ts-plot, eval = FALSE}
ggplot(line_df, aes(date, rate)) +
geom_line() +
facet_wrap(~.sample, scales = "free_y", nrow = 2) +
theme(
axis.title = element_blank(),
axis.text = element_blank()
)
```
]
---
# .orange[Case study] .circle.white.bg-orange[5] Foreign exchange rate .font_small[Part 2/2]
.grid[
.item[
```{r ts-plot, echo = FALSE, fig.width = 14}
```
]]
---
# Power of a lineup
.w-70[
* The power of a lineup is calculated as $x/n$ where $x$ is the number of people who detected the data plot out of $n$ people.
* This is useful if you want to decide which plot design is better.
* Show the same lineup made using different plots to observers .f4[(different sets of observers, the same person cannot see the same data more than once, else they may be biased)].
]
.footnote.f4[
Hofmann, H., L. Follett, M. Majumder, and D. Cook. 2012. “Graphical Tests for Power Comparison of Competing Designs.” IEEE Transactions on Visualization and Computer Graphics 18 (12): 2441–48.
]
---
```{r echo=FALSE, fig.width=8, fig.height=4, fig.align="center", out.width="100%"}
library(palmerpenguins)
set.seed(400)
ggplot(lineup(null_permute("species"), penguins, n=10),
aes(x=species, y=bill_length_mm, colour=species)) +
geom_point(alpha=0.4) +
facet_wrap(~.sample, ncol=5) +
colorspace::scale_color_discrete_divergingx(palette="Temps") +
theme_bw() +
theme(
legend.position="none",
axis.title = element_blank(),
axis.text = element_blank()
)
```
---
```{r echo=FALSE, fig.width=8, fig.height=4, fig.align="center", out.width="100%"}
ggplot(lineup(null_permute("species"), penguins, n=10),
aes(x=species, y=bill_length_mm, fill=species), colour="white") +
geom_boxplot() +
facet_wrap(~.sample, ncol=5) +
colorspace::scale_fill_discrete_divergingx(palette="Temps") +
theme_bw() +
theme(
legend.position="none",
axis.title = element_blank(),
axis.text = element_blank()
)
```
---
```{r echo=FALSE, fig.width=8, fig.height=4, fig.align="center", out.width="100%"}
ggplot(lineup(null_permute("species"), penguins, n=10),
aes(x=species, y=bill_length_mm, fill=species), colour="white") +
geom_violin() +
facet_wrap(~.sample, ncol=5) +
colorspace::scale_fill_discrete_divergingx(palette="Temps") +
theme_bw() +
theme(
legend.position="none",
axis.title = element_blank(),
axis.text = element_blank()
)
```
---
```{r echo=FALSE, fig.width=8, fig.height=4, fig.align="center", out.width="100%"}
ggplot(lineup(null_permute("species"), penguins, n=10),
aes(x=species, y=bill_length_mm, colour=species)) +
ggbeeswarm::geom_quasirandom(alpha=0.8) +
facet_wrap(~.sample, ncol=5) +
colorspace::scale_color_discrete_divergingx(palette="Temps") +
theme_bw() +
theme(
legend.position="none",
axis.title = element_blank(),
axis.text = element_blank()
)
```
---
Plot type | $x$ | $n$ | Power
--- | --- | --- | ---
`geom_point` | $x_1=4$ | $n_1=23$ | $x_1 / n_1=0.174$
`geom_boxplot` | $x_2=5$ | $n_2=25$ | $x_2 / n_2=0.185$
`geom_violin` | $x_3=6$ | $n_3=29$ | $x_3 / n_3=0.206$
`ggbeeswarm::geom_quasirandom` | $x_4=8$ | $n_4=24$ | $x_4 / n_4=0.333$
--
* The plot type with a higher power is preferable
--
* You can use this framework to find the optimal plot design
---
# Some considerations in visual inference
* In practice you don't want to bias the judgement of the human viewers so for a proper visual inference:
* you should _not_ show the data plot before the lineup
* you should _not_ give the context of the data
* you should remove labels in plots
* You can crowd source these by paying for services like:
* [Amazon Mechanical Turk](https://www.mturk.com/),
* [Appen (formerly Figure Eight)](https://appen.com/figure-eight-is-now-appen/) and
* [LABVANCED](https://www.labvanced.com/).
* [prolific](https://www.prolific.co/).
* If the data is for research purposes, then you may need ethics approval for publication.
---
# Resources and Acknowledgement
.font18[
- Buja, Andreas, Dianne Cook, Heike Hofmann, Michael Lawrence, Eun-Kyung Lee, Deborah F. Swayne, and Hadley Wickham. 2009. “Statistical Inference for Exploratory Data Analysis and Model Diagnostics.” Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences 367 (1906): 4361–83.
- Wickham, Hadley, Dianne Cook, Heike Hofmann, and Andreas Buja. 2010. “Graphical Inference for Infovis.” IEEE Transactions on Visualization and Computer Graphics 16 (6): 973–79.
- Hofmann, H., L. Follett, M. Majumder, and D. Cook. 2012. “Graphical Tests for Power Comparison of Competing Designs.” IEEE Transactions on Visualization and Computer Graphics 18 (12): 2441–48.
- Majumder, M., Heiki Hofmann, and Dianne Cook. 2013. “Validation of Visual Statistical Inference, Applied to Linear Models.” Journal of the American Statistical Association 108 (503): 942–56.
- Data coding using [`tidyverse` suite of R packages](https://www.tidyverse.org)
- Slides originally written by Emi Tanaka and constructed with [`xaringan`](https://github.com/yihui/xaringan), [remark.js](https://remarkjs.com), [`knitr`](http://yihui.name/knitr), and [R Markdown](https://rmarkdown.rstudio.com)
]
---
```{r endslide, child="assets/endslide.Rmd"}
```