ETC5521 Tutorial 2

Introduction to exploratory data analysis

Author

Prof. Di Cook

Published

August 3, 2023

🎯 Objectives

The purpose of this tutorial is to scope out the software available for doing EDA in R. We want to understand its capabilities and its limitations.

🔧 Preparation

The reading for this week is The Landscape of R Packages for Automated Exploratory Data Analysis. This is a lovely summary of software available that is considered to do exploratory data analysis (EDA). (Note: Dr Cook considers these to be mostly descriptive statistics packages, not exploratory data analysis in the true spirit of the term.) This reading will be the basis of the tutorial exercises today.

  • Complete the weekly quiz before the deadline!
  • Install this list of R packages, in addition to what you installed in the previous weeks:
install.packages(c("arsenal", "autoEDA", "DataExplorer", "dataMaid", "dlookr", "ExPanDaR", "explore", "exploreR", "funModeling", "inspectdf", "RtutoR", "SmartEDA", "summarytools", "visdat", "xray", "cranlogs", "tidyverse", "nycflights13"))
  • Open your RStudio Project for this unit (the one you created in week 1, eda or ETC5521). Create a .Rmd document for this week's activities.
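One of this week's exercises asks about download rates for these packages. The cranlogs package (in the install list above) can pull the counts directly from the RStudio CRAN mirror logs. A minimal sketch, assuming an internet connection and that the packages chosen here are just examples:

```r
library(cranlogs)

# Daily download counts for a few of the EDA packages,
# from the start of 2023 to today
dl <- cran_downloads(
  packages = c("DataExplorer", "visdat", "summarytools"),
  from = "2023-01-01",
  to   = Sys.Date()
)

# Total downloads per package over the period
aggregate(count ~ package, data = dl, sum)
```

The totals give a quick cross-check against what the shiny app in question 2 shows.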

Exercise: Trying out EDA software

The article lists a number of R packages that might be used for EDA: arsenal, autoEDA, DataExplorer, dataMaid, dlookr, ExPanDaR, explore, exploreR, funModeling, inspectdf, RtutoR, SmartEDA, summarytools, visdat, xray.

  1. What package had the highest number of CRAN downloads as of 12.07.2019? (Based on the paper.)
  2. Open up the shiny server for checking download rates at https://hadley.shinyapps.io/cran-downloads/. What package has the highest download rate over the period Jan 1, 2023-today?
  3. What is an interesting pattern to observe from the time series plot of all the downloads?
  4. How many functions does Staniak and Biecek (2019) say visdat has for doing EDA? Explore what each of them does, by running the example code for each function. What do you think are the features that make visdat a really popular package?
  5. The package DataExplorer has a high download rate and number of GitHub stars, and also a nice web site at https://boxuancui.github.io/DataExplorer/. The vignette “Introduction to DataExplorer” is a good place to start to learn what the package does. I want you to generate an automatic report to see what it creates, and what the package suggests is important. Use the code below to create the data to use - it does the same thing as the code in the vignette but uses dplyr and piping more nicely. Then run the report. It’s not very pretty to read, but there’s a vast amount of very useful information about the data that can help in preparing for its analysis. It’s completely overwhelming, though.
     a. What does specifying y = "arr_delay" do?
     b. Which variables have a lot of missing values?
     c. How many engines do planes typically have?
     d. What does the scale of the arr_time, dep_time, sched_arr_time and sched_dep_time indicate?
     e. Which carrier had the most flights?
     f. Which type of plane was most common? Why might it be dangerous to draw conclusions about the type of plane?
     g. What do the QQ plots tell you?
     h. arr_delay is divided into 6 categories before making some plots. What do you think the purpose of this is?
     i. With your neighbour in the tutorial, come up with one thing that is a bit surprising to you about this data. Make sure you state what you expected to see, and why what you saw was then a surprise.
# DataExplorer
library(DataExplorer)
library(nycflights13)
library(tidyverse)

# Create a big data set
airlines_all <- flights %>% 
  full_join(airlines, by = "carrier") %>%
  full_join(planes, by = "tailnum", 
            suffix = c("_flights", "_planes")) %>%
  full_join(airports, by = c("origin"="faa"), 
            suffix = c("_carrier", "_origin")) %>%
  full_join(airports, by = c("dest"="faa"), 
            suffix = c("_origin", "_dest"))

# Run a report
create_report(airlines_all, y = "arr_delay")
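If the full report is too overwhelming, the same information can be pulled out one piece at a time. A sketch using a few of DataExplorer's individual functions, run after building airlines_all with the code above (which plots are most useful for the questions is for you to judge):

```r
# Individual DataExplorer components, instead of the full report
introduce(airlines_all)      # rows, columns, types, missingness overview
plot_missing(airlines_all)   # proportion missing, per variable
plot_bar(airlines_all)       # bar charts of the discrete variables
plot_qq(airlines_all)        # QQ plots of the continuous variables
```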
  6. Table 2 summarises the activities of two early phases of the CRISP-DM standard. What does CRISP-DM mean? The implication is that EDA is related to “data understanding” and “data preparation”. Would you agree with this or disagree? Why?
  7. Table 1 of the paper, summarising CRAN downloads and GitHub activity, is hard to read. How are the rows sorted? What is the most important information communicated by the table? In what way(s) might revising this table make it easier to read and digest the most important information?
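For the visdat question, a quick way to explore its functions is to run them on a familiar data set. A minimal sketch using two of them, vis_dat() and vis_miss() (the rest are listed in the package's help index); the 1000-row sample is just to keep the plots fast:

```r
library(visdat)
library(nycflights13)

# visdat warns on very large data, so look at a random sample of rows
flights_sample <- flights[sample(nrow(flights), 1000), ]

vis_dat(flights_sample)   # variable types and missingness at a glance
vis_miss(flights_sample)  # missing values only, with % missing per column
```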

👋 Finishing up

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.