ETC5521 Tutorial 3

Deconstructing an exploratory data analysis


Prof. Di Cook


August 10, 2023

🎯 Objectives

Constructing, planning and evaluating an exploratory data analysis are important skills. This tutorial is an exercise in reading and digesting a really good analysis. Your goal is to understand the analysis, reproduce it, identify the choices the analysts made, and explain why these choices would be considered high quality.

🔧 Preparation

The reading for this week is EDA Case Study: Bay area blues. It is authored by Hadley Wickham, Deborah F. Swayne, and David Poole. It appeared in the book “Beautiful Data” edited by Jeff Hammerbacher and Toby Segaran. Not all the chapters in the book are good examples of data analysis, though.

  • Complete the weekly quiz, before the deadline!
  • Make sure you have this list of R packages installed:
install.packages(c("tidyverse", "forcats", "patchwork"))
  • Note that the code and data for reproducing their analysis can be found here.

  • Open your RStudio Project for this unit (the one you created in week 1, eda or ETC5521). Create a .Rmd document for this week's activities.

💻 Reproducing the analysis

Point your web browser to the GitHub site for the analysis. The main data file is house-sales.csv. Read this data into your R session. (🛑 ARE YOU USING A PROJECT FOR THIS UNIT? IF NOT, STOP and OPEN IT NOW.)

You can read the data in directly from the web site using this code:

library(tidyverse)  # read_csv() comes from readr, which the tidyverse loads
sales <- read_csv("")

Exercise 1: What’s in the data?

  • Is the data in tidy form?
  • Of the variables in the data, which are
    • numeric?
    • categorical?
    • temporal?
  • What would be an appropriate plot to make to examine the
    • numeric variables?
    • categorical variables?
    • a categorical and numeric variable?
    • a temporal variable and a numeric variable?
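  • Yes, the data is in tidy form.
  • Variable types:
    • numeric: price, br, lsqft, bsqft
    • categorical: county, city, zip, street
    • temporal: year, date, datesold
  • Appropriate plots:
    • numeric variables: scatterplots
    • categorical variables: bar charts, pie charts, mosaic plots
    • a categorical and a numeric variable: facet by the categorical variable; this could be boxplots, density plots, or faceted scatterplots to look at multiple numeric variables
    • a temporal and a numeric variable: a time series plot, with lines connecting the points to indicate time; you may need to aggregate over time to get one value per time point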
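
One way to check the variable types and try the suggested plot types is sketched below. It is only an illustration: `toy_sales` is a hypothetical tibble with made-up values that borrows a few of the column names listed above, so that the code runs without the real house-sales.csv. Swap in `sales` once you have read the data.

```r
library(tidyverse)

# Hypothetical toy rows (made-up values) that only mimic some of the
# column names in the sales data; this is NOT the real data set.
toy_sales <- tibble(
  county = c("Alameda County", "Alameda County", "Marin County", "Marin County"),
  city   = c("Oakland", "Berkeley", "Novato", "Novato"),
  price  = c(450000, 720000, 610000, 580000),
  br     = c(2, 3, 3, 4),
  bsqft  = c(950, 1400, 1600, 1850),
  date   = as.Date(c("2005-03-06", "2005-03-13", "2005-03-06", "2005-03-13"))
)

# The column types give a first hint of numeric / categorical / temporal
glimpse(toy_sales)

# First class of each column, as a named character vector
var_types <- map_chr(toy_sales, ~ class(.x)[1])
var_types

# Categorical variable: bar chart of counts
p_cat <- ggplot(toy_sales, aes(x = county)) +
  geom_bar()

# Two numeric variables: scatterplot
p_num <- ggplot(toy_sales, aes(x = bsqft, y = price)) +
  geom_point()

# Categorical and numeric variable: side-by-side boxplots
p_both <- ggplot(toy_sales, aes(x = county, y = price)) +
  geom_boxplot()
```

Print `p_cat`, `p_num` or `p_both` to draw each plot. Note that the type R assigns on reading is not always the type you want for analysis; a variable like zip, for example, may be read as a number even though it is categorical.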

Exercise 2: Time series plots

Reproduce the time series plots of weekly average price and volume of sales.
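
A minimal sketch of the aggregation step is given below, as a starting point rather than the authors' own code. It assumes one row per sale, with a `date` column that identifies the week and a `price` column. The `toy_sales` tibble is a hypothetical stand-in with randomly generated values, and `avg_price` and `volume` are names chosen here for illustration.

```r
library(tidyverse)

# Hypothetical stand-in for the sales data: 8 weeks with 5 sales each.
# Prices are random numbers, made up purely for illustration.
set.seed(5521)
toy_sales <- tibble(
  date  = rep(seq(as.Date("2005-01-02"), by = "week", length.out = 8), times = 5),
  price = round(runif(40, min = 300000, max = 900000))
)

# Aggregate to one row per week: average price and number of sales (volume)
weekly <- toy_sales %>%
  group_by(date) %>%
  summarise(
    avg_price = mean(price, na.rm = TRUE),
    volume    = n(),
    .groups   = "drop"
  )

# Time series plots: lines connect consecutive weeks
p_price <- ggplot(weekly, aes(x = date, y = avg_price)) +
  geom_line()

p_volume <- ggplot(weekly, aes(x = date, y = volume)) +
  geom_line()
```

With the patchwork package loaded, `p_price / p_volume` stacks the two panels so the price and volume series share a time axis for comparison.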