ETC5521 Tutorial 3

Deconstructing an exploratory data analysis

Author

Prof. Di Cook

Published

August 10, 2023

🎯 Objectives

Constructing, planning and evaluating an exploratory data analysis are important skills. This tutorial is an exercise in reading and digesting a really good analysis. Your goal is to understand the analysis, reproduce it, and explain the choices the analysts made and why these would be considered high quality.

🔧 Preparation

The reading for this week is EDA Case Study: Bay area blues. It is authored by Hadley Wickham, Deborah F. Swayne, and David Poole. It appeared in the book “Beautiful Data” edited by Jeff Hammerbacher and Toby Segaran. Not all the chapters in the book are good examples of data analysis, though.

  • Complete the weekly quiz, before the deadline!
  • Make sure you have this list of R packages installed:
install.packages(c("tidyverse", "forcats", "patchwork"))
  • Note that the code and data for reproducing their analysis can be found at https://github.com/hadley/sfhousing.

  • Open your RStudio Project for this unit (the one you created in week 1, eda or ETC5521). Create a .Rmd document for this week's activities.

💻 Reproducing the analysis

Point your web browser to the GitHub site for the analysis, https://github.com/hadley/sfhousing. The main data file is house-sales.csv. Read this data into your R session. (🛑 ARE YOU USING A PROJECT FOR THIS UNIT? IF NOT, STOP and OPEN IT NOW.)

You can read the data in directly from the web site using this code:

library(tidyverse)
library(patchwork)
library(forcats)
sales <- read_csv("https://raw.githubusercontent.com/hadley/sfhousing/master/house-sales.csv")

Exercise 1: What’s in the data?

  • Is the data in tidy form?
  • Of the variables in the data, which are
    • numeric?
    • categorical?
    • temporal?
  • What would be an appropriate plot to make to examine the
    • numeric variables?
    • categorical variables?
    • a categorical and numeric variable?
    • a temporal variable and a numeric variable?
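
As a starting point for sorting variables by type, the sketch below tabulates the storage class of each column. It runs on a small stand-in data frame so that it works on its own; the column names (county, price, date) and all values are assumptions made for illustration, so check names(sales) and swap in the real data.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for `sales`; column names and values are assumed,
# not taken from house-sales.csv. Replace `toy_sales` with `sales`.
toy_sales <- tibble(
  county = c("Alameda County", "Marin County", "Alameda County"),
  price  = c(500000, 900000, 650000),
  date   = as.Date(c("2005-01-02", "2005-01-09", "2005-01-09"))
)

# One row per column: its name and its (first) storage class
col_types <- tibble(
  variable = names(toy_sales),
  class    = unname(vapply(toy_sales, function(x) class(x)[1], character(1)))
)
col_types

# glimpse() gives a similar overview, along with the first few values
glimpse(toy_sales)
```

Storage class is only a first guess at variable type: a zip code, for example, may be read as a number even though it is better treated as categorical.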

Exercise 2: Time series plots

Reproduce the time series plots of weekly average price and volume of sales.
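
One possible way to build the weekly summaries is sketched below. It is not the authors' code. It assumes the data has a date column and a price column, and it uses a toy data frame with made-up values so that it runs on its own; replace toy_sales with sales once you have confirmed the column names.

```r
library(dplyr)
library(tibble)
library(lubridate)
library(ggplot2)
library(patchwork)

# Hypothetical stand-in for `sales`; `date` and `price` are assumed names
toy_sales <- tibble(
  date  = as.Date(c("2005-01-03", "2005-01-04",
                    "2005-01-11", "2005-01-12", "2005-01-13")),
  price = c(400000, 600000, 300000, 500000, 700000)
)

# Floor each sale date to the start of its week (Sunday),
# then compute the average price and the number of sales per week
weekly <- toy_sales |>
  mutate(week = floor_date(date, unit = "week", week_start = 7)) |>
  group_by(week) |>
  summarise(
    av_price = mean(price, na.rm = TRUE),
    volume   = n(),
    .groups  = "drop"
  )

# Stack the two series so that they share a time axis
p_price  <- ggplot(weekly, aes(x = week, y = av_price)) +
  geom_line() +
  labs(x = NULL, y = "Average price")
p_volume <- ggplot(weekly, aes(x = week, y = volume)) +
  geom_line() +
  labs(x = NULL, y = "Number of sales")
p_price / p_volume
```

Stacking the panels vertically with patchwork keeps the time axes aligned, which makes it easier to compare the timing of peaks and dips in the two series.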

Exercise 3: Correlation between series

It looks like volume goes down as price goes up. There is a better plot to make to examine this. What is it? Make the plot. After making the plot, report what you learn about the apparent correlation.

Exercise 4: Geographic differences

Think about potential plots you might make for examining differences by geographic region (as measured by zip, county or city). Make a plot, and report what you learn.
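
As one option among many, the sketch below compares the distribution of prices across regions with side-by-side boxplots on a log scale, ordered by median price. The column names county and price are assumptions, and the toy values are made up so that the code runs on its own; the same pattern would apply to zip or city.

```r
library(dplyr)
library(tibble)
library(ggplot2)
library(forcats)

# Hypothetical stand-in for `sales`; names and values are assumed
toy_sales <- tibble(
  county = c("A", "A", "A", "B", "B", "B"),
  price  = c(300000, 400000, 500000, 700000, 900000, 800000)
)

# Numerical summary: median price and number of sales per county
by_county <- toy_sales |>
  group_by(county) |>
  summarise(median_price = median(price), n = n(), .groups = "drop")
by_county

# Boxplots ordered by median price; the log scale helps with
# the long right tail typical of house prices
ggplot(toy_sales,
       aes(x = price, y = fct_reorder(county, price, .fun = median))) +
  geom_boxplot() +
  scale_x_log10() +
  labs(x = "Price (log scale)", y = "County")
```

Ordering the regions by median rather than alphabetically makes it easier to see which regions are cheaper or more expensive overall.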

Exercise 5: The Rich Get Richer and the Poor Get Poorer

In the section “The Rich Get Richer and the Poor Get Poorer” there are some interesting transformations of the data, and unusual types of plots. Explain why looking at proportional change in value refines the view of price movement for higher-priced versus lower-priced properties.
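
To make the idea of proportional change concrete, here is a minimal sketch that indexes each price series to its first value, so that every series starts at 1. The data frame, the group labels and the prices are all hypothetical, and this is an illustration of the general technique, not the authors' code.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Hypothetical weekly prices for a lower-priced and a higher-priced group
toy_prices <- tibble(
  week  = rep(as.Date(c("2005-01-02", "2005-01-09", "2005-01-16")), times = 2),
  group = rep(c("low", "high"), each = 3),
  price = c(200000, 220000, 180000, 1000000, 1100000, 1050000)
)

# Divide each price by the first price in its group
indexed <- toy_prices |>
  group_by(group) |>
  arrange(week, .by_group = TRUE) |>
  mutate(rel_price = price / first(price)) |>
  ungroup()

# On the raw scale the "low" series looks flat next to the "high" series;
# on the relative scale both start at 1 and are directly comparable
ggplot(indexed, aes(x = week, y = rel_price, colour = group)) +
  geom_line() +
  labs(x = NULL, y = "Price relative to first week")
```

In this toy example, a fall of 20,000 in the lower-priced group is a 10% drop, while a rise of 50,000 in the higher-priced group is only a 5% gain. The raw dollar amounts would suggest the opposite ranking of the two changes.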

Exercise 6: Anything surprising?

Were there any findings that surprised the authors? Or would surprise you?

Exercise 7: Additional resources

Some of the findings were compared against information gathered from external sources. Can you point to an example of this, and explain how the other information was used to support or question the finding?

👋 Finishing up

Make sure you say thanks and good-bye to your tutor. This is also a time to report what you enjoyed and what you found difficult.