ETC5521 Tutorial 6

Working with a single variable, making transformations, detecting outliers, using robust statistics

Author

Prof. Di Cook

Published

August 31, 2023

🎯 Objectives

These exercises practise making plots of one variable and examining what can be learned about its distribution, and about patterns and problems in the data.

🔧 Preparation

The reading for this week is Wilke (2019) Ch 6 Visualizing Amounts; Ch 7 Visualizing distributions.

  • Complete the weekly quiz, before the deadline!
  • Make sure you have this list of R packages installed:

install.packages(c("ggplot2movies", "bayesm", "flexmix",  "ggbeeswarm", "mixtools", "lvplot", "patchwork", "nullabor"))
  • Open your RStudio Project for this unit (the one you created in week 1, eda or ETC5521). Create a .Rmd document for this week’s activities.
  • Note for Tutors: Randomly allocate one of the problems to each student. Give them 30-40 minutes to work on the problem. Then in the last 30 minutes discuss each problem, and the findings, with the entire tutorial.

Exercise 1: Galaxies

Load the galaxies data in the MASS package and answer the following questions based on this dataset.

data(galaxies, package = "MASS")

You can access documentation of the data (if available) using the help function, specifying the package name as an argument.

help(galaxies, package = "MASS")
  1. What does the data contain? And what is the data source?
library(dplyr) # provides glimpse()
data(galaxies, package = "MASS")
glimpse(galaxies)
 num [1:82] 9172 9350 9483 9558 9775 ...

The data contains the velocities, in km/sec, of 82 galaxies from 6 well-separated conic sections of an unfilled survey of the Corona Borealis region. The original data is from Postman et al. (1986). The version in MASS follows Roeder (1990), which omits the 83rd observation of the original data. It also contains a typo in the 78th observation.

  • Postman, M., Huchra, J. P. and Geller, M. J. (1986) Probes of large-scale structures in the Corona Borealis region. Astronomical Journal 92, 1238–1247
  • Roeder, K. (1990) Density estimation with confidence sets exemplified by superclusters and voids in galaxies. Journal of the American Statistical Association 85, 617–624.
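
The note in the help page for galaxies says that the 78th observation, recorded as 26690, should read 26960. A minimal sketch of inspecting that value and making a corrected copy is below. The name galaxies_fixed is introduced here only for illustration, and the tutorial itself continues with the data as shipped.

```r
data(galaxies, package = "MASS")

# Value of the 78th observation as shipped in MASS
galaxies[78]

# Corrected copy, using the value given in the note in ?MASS::galaxies
galaxies_fixed <- galaxies
galaxies_fixed[78] <- 26960

# The correction changes a single value, so the length is unchanged
length(galaxies_fixed)
```
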
  1. Based on the description in the R Help for the data, what would be an appropriate null distribution of this data?

The description in the R help for the data says: “Multimodality in such surveys is evidence for voids and superclusters in the far universe.”

Deciding on an appropriate null hypothesis is always tricky. If we wanted to test the statement that the data is multimodal, we could compare against a unimodal distribution, either a normal or an exponential depending on what shape we might expect.
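
As a rough sketch of what a unimodal comparison could look like, one sample can be drawn from a normal distribution that matches the data in size, mean and standard deviation. The plug-in parameters, the seed and the name null_normal are choices made here for illustration, not something the tutorial prescribes.

```r
data(galaxies, package = "MASS")

set.seed(5521) # arbitrary seed, used only so the sketch is reproducible

# One sample from a normal null with the same size, mean and sd as the data
null_normal <- rnorm(
  n = length(galaxies),
  mean = mean(galaxies),
  sd = sd(galaxies)
)

# Compare numerical summaries of the observed and simulated values
summary(galaxies)
summary(null_normal)
```

Plotting several such samples alongside the observed data, in the same display type, shows whether the observed shape stands out from what a unimodal distribution would produce.
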

However, the published work has already made a claim that the data is multimodal, so it would be interesting to determine if we can generate samples from a multimodal distribution that are indistinguishable from the data.

\(H_0:\) The distribution is multimodal.

\(H_a:\) The distribution is something other than multimodal.
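
One way to generate samples under a multimodal null is to simulate from a mixture of normal distributions. The sketch below is only an illustration: it splits the data at two hand-picked cut points (15000 and 30000 km/sec, an assumption made here by eye) and uses each group's proportion, mean and standard deviation as the mixture parameters. A fitted mixture model would be a more careful alternative.

```r
data(galaxies, package = "MASS")

# Assumed cut points separating three groups of velocities (chosen by eye)
group <- cut(galaxies, breaks = c(-Inf, 15000, 30000, Inf), labels = FALSE)

# Plug-in mixture parameters computed from each group
weights <- as.numeric(table(group)) / length(galaxies)
means <- as.numeric(tapply(galaxies, group, mean))
sds <- as.numeric(tapply(galaxies, group, sd))

set.seed(5521) # arbitrary seed, used only so the sketch is reproducible

# Pick a component for each simulated value, then draw from that component
component <- sample(seq_along(weights), size = length(galaxies),
                    replace = TRUE, prob = weights)
null_mixture <- rnorm(length(galaxies),
                      mean = means[component],
                      sd = sds[component])

summary(null_mixture)
```

If samples generated this way cannot be told apart from the observed data in a plot, the data is consistent with this multimodal null.
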

  1. How many observations are there?

There are 82 observations.
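
The count can be checked directly. Because galaxies is a plain numeric vector rather than a data frame, length() gives the number of observations. The name n_obs is used here only for illustration.

```r
data(galaxies, package = "MASS")

# galaxies is a numeric vector, so its length is the number of observations
n_obs <- length(galaxies)
n_obs
```
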

  1. If the data is multimodal, which of the following displays do you think would be the best? Which would not be at all useful?
  • histogram
  • boxplot
  • density plot
  • violin plot
  • jittered dot plot
  • letter value plot

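If you said a density plot, jittered dot plot, or a histogram, you’re on the right track, because each can give a fine resolution for showing modes. (The violin plot is not any different from a density plot, when only looking at one variable.) The boxplot would not be useful, because it reduces the data to a five-number summary and so cannot show more than one mode.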

  1. Make these plots for the data. Experiment with different binwidths for the histogram and different bandwidths for the density plot. Were you right in your thinking about which would be the best?
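library(ggplot2)
library(tibble)
library(ggbeeswarm) # geom_quasirandom()
library(lvplot)     # geom_lv()
library(patchwork)  # combining plots with + and plot_layout()

# Base plot, with axis labels removed so that only the shapes are compared
g <- ggplot(tibble(galaxies), aes(galaxies)) +
  theme(
    axis.title = element_blank(),
    axis.text = element_blank(),
    axis.ticks = element_blank()
  )
g1 <- g + geom_histogram(binwidth = 1000, color = "white")
g2 <- g + geom_boxplot()
g3 <- g + geom_density()
g4 <- g + geom_violin(aes(x = galaxies, y = 1), draw_quantiles = c(0.25, 0.5, 0.75))
g5 <- g + geom_quasirandom(aes(x = 1, y = galaxies)) + coord_flip()
g6 <- g + geom_lv(aes(x = 1, y = galaxies)) + coord_flip()

# Arrange the six plots in a grid with two columns
g1 + g2 + g3 + g4 + g5 + g6 + plot_layout(ncol = 2)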
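
The question also asks you to experiment with bandwidths. As a rough sketch of why the choice matters, the code below counts the local maxima of a kernel density estimate for a few bandwidths, using density() from base R. The helper count_modes and the bandwidth values are illustrative choices made here, not part of the tutorial.

```r
data(galaxies, package = "MASS")

# Count local maxima of a Gaussian kernel density estimate for a bandwidth
count_modes <- function(x, bw) {
  d <- density(x, bw = bw)
  # A local maximum is where the slope changes from positive to negative
  sum(diff(sign(diff(d$y))) == -2)
}

bandwidths <- c(500, 1000, 2000, 10000) # illustrative values, in km/sec
modes <- sapply(bandwidths, function(bw) count_modes(galaxies, bw))

data.frame(bandwidth = bandwidths, modes = modes)
```

Small bandwidths give a bumpy estimate with many modes, and a very large bandwidth smooths everything into a single mode. The same bandwidth can be passed to the plot with geom_density(bw = ...), and the binwidth argument of geom_histogram() plays the corresponding role for the histogram.
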