ETC5521 Tutorial 3

Deconstructing an exploratory data analysis

Author

Prof. Di Cook

Published

August 10, 2023

🎯 Objectives

Constructing, planning and evaluating an exploratory data analysis are important skills. This tutorial is an exercise in reading and digesting a really good analysis. Your goal is to understand the analysis, reproduce it, and identify the choices the analysts made and why these would be considered high quality.

🔧 Preparation

The reading for this week is EDA Case Study: Bay area blues. It is authored by Hadley Wickham, Deborah F. Swayne, and David Poole. It appeared in the book “Beautiful Data” edited by Jeff Hammerbacher and Toby Segaran. Not all the chapters in the book are good examples of data analysis, though.

  • Complete the weekly quiz, before the deadline!
  • Make sure you have this list of R packages installed:
install.packages(c("tidyverse", "forcats", "patchwork"))
  • Note that the code and data for reproducing their analysis can be found here.

  • Open your RStudio Project for this unit (the one you created in week 1, eda or ETC5521). Create a .Rmd document for this week's activities.

💻 Reproducing the analysis

Point your web browser to the GitHub site for the analysis, https://github.com/hadley/sfhousing. The main data file is house-sales.csv. Read this data into your R session. (🛑 ARE YOU USING A PROJECT FOR THIS UNIT? IF NOT, STOP and OPEN IT NOW.)

You can read the data in directly from the web site using this code:

library(tidyverse)
library(patchwork)
library(forcats)
sales <- read_csv("https://raw.githubusercontent.com/hadley/sfhousing/master/house-sales.csv")

Exercise 1: What’s in the data?

  • Is the data in tidy form?
  • Of the variables in the data, which are
    • numeric?
    • categorical?
    • temporal?
  • What would be an appropriate plot to make to examine the
    • numeric variables?
    • categorical variables?
    • a categorical and numeric variable?
    • a temporal variable and a numeric variable?
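  • Is the data in tidy form? Yes
  • Of the variables in the data,
    • numeric: price, br, lsqft, bsqft
    • categorical: county, city, zip, street
    • temporal: year, date, datesold
  • Appropriate plots to examine the
    • numeric variables: scatterplots
    • categorical variables: bar charts, pie charts, mosaic
    • a categorical and numeric variable: facet by the categorical variable. Could be boxplots, or density plots, or facetted scatterplots to look at multiple numeric variables
    • a temporal variable and a numeric variable: time series plot, connect lines to indicate time, maybe need to aggregate over time to get one value per time point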
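A quick way to check the variable types is to look at how R has stored each column. The sketch below is only an illustration: `toy` is a small made-up tibble that borrows a few column names from the data, and its values are invented rather than taken from house-sales.csv. With the real data you would apply the same calls to `sales`.

```r
library(dplyr)

# Made-up rows that borrow a few column names from the data;
# the values are invented purely for illustration.
toy <- tibble(
  county = c("County A", "County B", "County A"),
  zip    = c("94901", "95401", "94903"),
  price  = c(850000, 420000, 910000),
  br     = c(3, 2, 4),
  date   = as.Date(c("2005-01-02", "2005-01-09", "2005-01-16"))
)

# First class of each column: character suggests categorical,
# numeric suggests quantitative, Date suggests temporal.
col_types <- sapply(toy, function(x) class(x)[1])
print(col_types)

# Names of the columns that fall into each storage type
numeric_vars  <- names(toy)[sapply(toy, is.numeric)]
temporal_vars <- names(toy)[sapply(toy, inherits, what = "Date")]
```

The storage type is only a hint. A code such as zip may be read in as a number, yet it still labels a region, so it is better treated as categorical.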

Exercise 2: Time series plots

Reproduce the time series plots of weekly average price and volume of sales.

sales_weekly <- sales %>%
  group_by(date) %>%
  summarise(av_price = mean(price, na.rm=TRUE),
            volume = n())
p1 <- ggplot(sales_weekly, aes(x=date, y=av_price)) +
  geom_line() +
  scale_y_continuous("Average price (millions)", 
              breaks = seq(500000, 800000, 50000), 
              labels = c("0.50", "0.55", "0.60", "0.65",
                         "0.70", "0.75", "0.80")) +
  scale_x_date("", date_breaks = "1 years", 
               minor_breaks = NULL, 
               date_labels = "%Y")
p2 <- ggplot(sales_weekly, aes(x=date, y=volume)) + geom_line() +
  scale_y_continuous("Number of sales", 
              breaks = seq(500,3000,500), 
              labels = c("500", "1,000", "1,500", "2,000",
                         "2,500", "3,000")) +
  scale_x_date("", date_breaks = "1 years", 
               minor_breaks = NULL, 
               date_labels = "%Y")
p1/p2

Exercise 3: Correlation between series

It looks like volume goes down as price goes up. There is a better plot to make to examine this. What is it? Make the plot. After making the plot, report what you learn about the apparent correlation.

ggplot(sales_weekly, aes(x=av_price, y=volume)) +
  geom_point() +
  theme(aspect.ratio = 1)

Any correlation is very weak, and negative.
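The visual impression from a scatterplot can be backed up with a correlation coefficient. This is a minimal sketch using `toy_weekly`, a made-up data frame with the same column names as `sales_weekly`; the numbers are invented for illustration and are not results from the housing data.

```r
# Made-up weekly summaries, invented for illustration only.
toy_weekly <- data.frame(
  av_price = c(500000, 550000, 600000, 650000, 700000),
  volume   = c(2100, 2500, 1900, 2300, 2000)
)

# Pearson correlation measures the strength of linear association
r_pearson <- cor(toy_weekly$av_price, toy_weekly$volume)

# Spearman correlation is rank-based, so it is less affected by outliers
r_spearman <- cor(toy_weekly$av_price, toy_weekly$volume,
                  method = "spearman")

print(c(pearson = r_pearson, spearman = r_spearman))
```

With the real data, replacing `toy_weekly` by `sales_weekly` should give the corresponding values. A coefficient close to zero supports reading the scatterplot as showing little association.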

Exercise 4: Geographic differences

Think about potential plots you might make for examining differences by geographic region (as measured by zip, county or city). Make a plot, and report what you learn.

ggplot(sales, 
       aes(x = fct_reorder(county, price, na.rm=TRUE), 
           y = price)) +
  geom_boxplot() + 
  scale_y_log10() +
  xlab("") +
  coord_flip()

Marin County has the highest prices on average, and San Joaquin the lowest. The lowest priced house was sold in Sonoma County. The highest priced properties and lowest priced are pretty similar from one county to another - that is, the variability within county is large.
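Statements about typical price and within-county variability can also be checked numerically. The sketch below assumes a made-up tibble `toy_sales` with placeholder county names and invented prices; it is not a summary of the real data. The spread is computed on the log10 scale so that it matches the axis used in the boxplot.

```r
library(dplyr)

# Placeholder counties and invented prices, for illustration only.
toy_sales <- tibble(
  county = rep(c("County A", "County B"), each = 4),
  price  = c(400000, 600000, 900000, 2000000,
             150000, 250000, 400000, 1200000)
)

county_summary <- toy_sales %>%
  group_by(county) %>%
  summarise(
    # Typical price in each county
    median_price = median(price, na.rm = TRUE),
    # Spread in log10 units, matching the log scale of the boxplot
    iqr_log10 = IQR(log10(price), na.rm = TRUE)
  ) %>%
  arrange(desc(median_price))

print(county_summary)
```

Running the same pipeline on `sales` would rank the real counties by median price and show how wide the middle half of prices is within each one.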

Exercise 5: The Rich Get Richer and the Poor Get Poorer

In the section “The Rich Get Richer and the Poor Get Poorer” there are some interesting transformations of the data, and unusual types of plots. Explain why looking at proportional change in value refines the view of price movement for higher versus lower priced properties.

The transformation makes changes relative to the initial average price at the start of the time period. All curves produced will start from the same point. This means that we only need to compare the end points of each line, saving us from calculating differences between lines relative to the difference at the beginning.
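The idea of making changes relative to the starting value can be sketched by dividing each series by its first observation. This is an illustration under stated assumptions, not necessarily the authors' exact computation: `toy_prices`, its group labels and its values are all made up.

```r
library(dplyr)

# Two made-up price series, invented for illustration only.
toy_prices <- tibble(
  group    = rep(c("lower priced", "higher priced"), each = 3),
  date     = rep(as.Date(c("2003-01-01", "2004-01-01", "2005-01-01")),
                 times = 2),
  av_price = c(200000, 260000, 300000,
               800000, 880000, 960000)
)

indexed <- toy_prices %>%
  group_by(group) %>%
  # Sort within each group so that first() is the earliest date
  arrange(date, .by_group = TRUE) %>%
  # Proportional change: every series starts at 1
  mutate(index = av_price / first(av_price)) %>%
  ungroup()

print(indexed)
```

In this toy example the higher priced series gains more in dollars (160,000 against 100,000), but the lower priced series gains more proportionally (an index of 1.5 against 1.2). Because both series start at 1, the end points can be compared directly.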

Exercise 6: Anything surprising?

Were there any findings that surprised the authors? Or would surprise you?

I found it interesting that Mountain View had no decline in housing prices. Many of the world’s largest technology companies have their headquarters in the city, including Google, Mozilla Foundation, Symantec, and Intuit.

Exercise 7: Additional resources

Some of the findings were compared against information gathered from external sources. Can you point to an example of this, and how the other information was used to support or question the finding?

The authors write: “All of this is consistent with what we have learned about subprime mortgages since the housing bust hit the headlines.”

Subprime mortgages were offered on little collateral, which meant they were quite risky, and they tended to be on the lower end of the housing market. This information was in the news headlines at the time, and the authors checked their analysis against that reporting. The data was consistent with these reports.

👋 Finishing up

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.