ETC5521 Tutorial 3

Deconstructing an exploratory data analysis

Author

Prof. Di Cook

Published

August 10, 2023

🎯 Objectives

Constructing, planning and evaluating an exploratory data analysis are important skills. This tutorial is an exercise in reading and digesting a really good analysis. Your goal is to understand the analysis, reproduce it, and work out which choices the analysts made and why those choices would be considered high quality.

🔧 Preparation

The reading for this week is EDA Case Study: Bay area blues. It is authored by Hadley Wickham, Deborah F. Swayne, and David Poole. It appeared in the book “Beautiful Data” edited by Jeff Hammerbacher and Toby Segaran. Not all the chapters in the book are good examples of data analysis, though.

  • Complete the weekly quiz, before the deadline!
  • Make sure you have this list of R packages installed:
install.packages(c("tidyverse", "forcats", "patchwork"))
  • Note that the code and data for reproducing their analysis can be found at https://github.com/hadley/sfhousing.

  • Open your RStudio Project for this unit (the one you created in week 1, eda or ETC5521). Create a .Rmd document for this week's activities.
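
If you are unsure whether the packages listed above are already installed, a quick check like the following may help. This is only a sketch: `missing_packages()` is a hypothetical helper name introduced here, not a function from any package.

```r
# Hypothetical helper: return the packages in `pkgs` that are not yet installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

needed <- c("tidyverse", "forcats", "patchwork")
to_install <- missing_packages(needed)

# Report what is still missing; nothing is printed if all are installed.
if (length(to_install) > 0) {
  message("Still to install: ", paste(to_install, collapse = ", "))
}
```

Anything reported can then be passed to `install.packages()`.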

💻 Reproducing the analysis

Point your web browser to the GitHub site for the analysis, https://github.com/hadley/sfhousing. The main data file is house-sales.csv. Read this data into your R session. (🛑 ARE YOU USING A PROJECT FOR THIS UNIT? IF NOT, STOP and OPEN IT NOW.)

You can read the data in directly from the web site using this code:

library(tidyverse)
library(patchwork)
library(forcats)
sales <- read_csv("https://raw.githubusercontent.com/hadley/sfhousing/master/house-sales.csv")

Exercise 1: What’s in the data?

  • Is the data in tidy form?
  • Of the variables in the data, which are
    • numeric?
    • categorical?
    • temporal?
  • What would be an appropriate plot to make to examine the
    • numeric variables?
    • categorical variables?
    • a categorical and numeric variable?
    • a temporal variable and a numeric variable?
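
One possible way to start on these questions is to tabulate the class R has assigned to each column, then try a basic plot for each kind of variable. The sketch below uses a small made-up tibble called `toy` so that it runs on its own. The column names (`city`, `price`, `br`, `date`) are assumptions about what is in house-sales.csv, and the values are invented. The plot types shown are common starting points, not the only reasonable answers.

```r
library(tidyverse)

# Toy stand-in for `sales`: values are made up, and the column names
# are assumptions about the real file.
toy <- tibble(
  city  = c("Oakland", "Oakland", "San Jose", "Berkeley"),
  price = c(450000, 520000, 610000, 700000),
  br    = c(2, 3, 3, 4),
  date  = as.Date(c("2005-01-02", "2005-01-09", "2005-01-09", "2005-01-16"))
)

# One row per column, with the first class R reports for it.
column_types <- tibble(
  variable = names(toy),
  type     = unname(map_chr(toy, ~ class(.x)[1]))
)
column_types

# Common starting plots for each kind of variable.
p_numeric     <- ggplot(toy, aes(x = price)) + geom_histogram(bins = 10)
p_categorical <- ggplot(toy, aes(y = fct_infreq(city))) + geom_bar()
p_cat_num     <- ggplot(toy, aes(x = city, y = price)) + geom_boxplot()
p_time_num    <- ggplot(toy, aes(x = date, y = price)) + geom_point()
```

To apply the same idea to the real data, replace `toy` with `sales` and check the actual column names with `glimpse(sales)` first.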

Exercise 2: Time series plots

Reproduce the time series plots of weekly average price and volume of sales.
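
A minimal sketch of one way to approach this, not necessarily how the chapter's authors did it: collapse the individual sales to one row per week, then draw the two series and stack them with patchwork. `weekly_summary()` is a hypothetical helper name, and the code assumes the data has a `date` column of class Date and a numeric `price` column. The rows in `toy` are made up purely to show the shape of the result.

```r
library(tidyverse)
library(patchwork)

# Hypothetical helper: collapse individual sales to one row per week.
# Assumes `date` is a Date column and `price` is numeric.
weekly_summary <- function(df) {
  df %>%
    mutate(week = lubridate::floor_date(date, unit = "week")) %>%
    group_by(week) %>%
    summarise(
      av_price = mean(price, na.rm = TRUE),  # weekly average price
      volume   = n(),                        # number of sales that week
      .groups  = "drop"
    )
}

# Made-up rows, only to show what the summary looks like.
toy <- tibble(
  date  = as.Date(c("2005-01-03", "2005-01-05", "2005-01-11")),
  price = c(400000, 600000, 500000)
)
weekly <- weekly_summary(toy)

# One line plot per series, stacked vertically with patchwork.
p_price  <- ggplot(weekly, aes(x = week, y = av_price)) + geom_line()
p_volume <- ggplot(weekly, aes(x = week, y = volume)) + geom_line()
p_price / p_volume
```

With the real data, `weekly_summary(sales)` would take the place of `weekly_summary(toy)`. If the dates in the file turn out to be recorded once per week already, flooring to the week should not change the grouping as long as they fall on lubridate's week start day (Sunday by default).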