ETC5521 Tutorial 2

Introduction to exploratory data analysis

Author

Prof. Di Cook

Published

August 3, 2023

🎯 Objectives

The purpose of this tutorial is to scope out the software reported to do EDA in R. We want to understand its capabilities and limitations.

🔧 Preparation

The reading for this week is The Landscape of R Packages for Automated Exploratory Data Analysis. This is a lovely summary of software available that is considered to do exploratory data analysis (EDA). (Note: Dr Cook considers these to be mostly descriptive statistics packages, not exploratory data analysis in the true spirit of the term.) This reading will be the basis of the tutorial exercises today.

Complete the weekly quiz, before the deadline!
Install this list of R packages, in addition to what you installed in the previous weeks:

install.packages(c("arsenal", "autoEDA", "DataExplorer", "dataMaid", "dlookr", "ExPanDaR", "explore", "exploreR", "funModeling", "inspectdf", "RtutoR", "SmartEDA", "summarytools", "visdat", "xray", "cranlogs", "tidyverse", "nycflights13"))

Open your RStudio Project for this unit (the one you created in week 1, eda or ETC5521). Create a .Rmd document for this week's activities.

Exercise: Trying out EDA software

The article lists a number of R packages that might be used for EDA: arsenal, autoEDA, DataExplorer, dataMaid, dlookr, ExPanDaR, explore, exploreR, funModeling, inspectdf, RtutoR, SmartEDA, summarytools, visdat, xray.

What package had the highest number of CRAN downloads as of 12.07.2019? (Based on the paper.)

Solution

summarytools with 84737

Open up the shiny server for checking download rates at https://hadley.shinyapps.io/cran-downloads/. What package has the highest download rate over the period Jan 1, 2023-today?

Solution

library(cranlogs)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

eda_pkgs <- cran_downloads(
  packages = c("arsenal", "autoEDA", "DataExplorer", "dataMaid", "dlookr",
               "ExPanDaR", "explore", "exploreR", "funModeling", "inspectdf",
               "RtutoR", "SmartEDA", "summarytools", "visdat", "xray"),
  from = "2023-01-01",
  to = lubridate::today()
)
eda_pkgs %>% 
  group_by(package) %>%
  summarise(m=mean(count)) %>%
  arrange(desc(m))

# A tibble: 15 × 2
   package           m
   <chr>         <dbl>
 1 visdat       913.  
 2 summarytools 888.  
 3 SmartEDA     695.  
 4 DataExplorer 329.  
 5 arsenal      199.  
 6 dlookr       115   
 7 funModeling   87.0 
 8 explore       55.9 
 9 inspectdf     46.1 
10 dataMaid      34.7 
11 ExPanDaR      21.4 
12 xray          17.2 
13 exploreR       7.71
14 RtutoR         2.51
15 autoEDA        0

visdat. Interestingly, this package was developed by Nick Tierney in the years he was at Monash.

What is an interesting pattern to observe from the time series plot of all the downloads?

Solution

The weekly seasonality! There is a regular up/down pattern which, if you zoom in closely (try plotting just a couple of weeks of data), you can see corresponds to weekdays versus weekends.
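One way to check the weekday versus weekend explanation numerically is to average the daily counts by day of the week. The sketch below is an illustration rather than part of the original solution: weekday_profile() is a hypothetical helper, and the toy counts are made up purely to show the shape of the output. It assumes a data frame with a date column of class Date and a numeric count column, which is the layout cran_downloads() is expected to return.

```r
# Mean daily downloads by day of the week.
# `downloads` is assumed to have a Date column `date` and a numeric column `count`.
weekday_profile <- function(downloads) {
  # ISO weekday number as text: "1" = Monday, ..., "7" = Sunday
  day <- format(downloads$date, "%u")
  tapply(downloads$count, day, mean)
}

# Made-up counts for two weeks starting on Monday 2023-05-01 (not real download data)
toy <- data.frame(
  date  = seq(as.Date("2023-05-01"), by = "day", length.out = 14),
  count = rep(c(100, 110, 105, 108, 95, 40, 35), times = 2)
)

profile <- weekday_profile(toy)
profile
```

With the data downloaded above, weekday_profile(eda_pkgs) would pool all 15 packages. Lower averages for days 6 and 7 (Saturday and Sunday) would be consistent with the weekly pattern seen in the plot.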

How many functions do Staniak and Biecek (2019) say visdat has for doing EDA? Explore what each of them does by running the example code for each function. What do you think are the features that make visdat a really popular package?

Solution

7; Simple focus, useful functions that apply to a lot of problems.

library(visdat)
# function 1
vis_dat(airquality)

# function 2
messy_vector <- c(TRUE,
                 "TRUE",
                 "T",
                 "01/01/01",
                 "01/01/2001",
                 NA,
                 NaN,
                 "NA",
                 "Na",
                 "na",
                 "10",
                 10,
                 "10.1",
                 10.1,
                 "abc",
                 "$%TG")
set.seed(1114)
messy_df <- data.frame(var1 = messy_vector,
                       var2 = sample(messy_vector),
                       var3 = sample(messy_vector))
vis_guess(messy_df)

# function 3
vis_miss(airquality)

# function 4
aq_diff <- airquality
aq_diff[1:10, 1:2] <- NA
vis_compare(airquality, aq_diff)

# function 5
dat_test <- tibble::tribble(
            ~x, ~y,
            -1,  "A",
            0,  "B",
            1,  "C",
            NA, NA
            )

vis_expect(dat_test, ~.x == -1)

# function 6
vis_cor(airquality)
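
As a numerical companion to vis_miss(), the percentage of missing values in each column can be computed with base R. This is a minimal sketch added for illustration, not a visdat function; pct_missing is just a name chosen here.

```r
# Percentage of missing values in each column of airquality,
# sorted from most to least missing
pct_missing <- sort(colMeans(is.na(airquality)) * 100, decreasing = TRUE)
round(pct_missing, 1)
```

Ozone and Solar.R are the only columns with missing values, which should agree with what the vis_miss() plot shows.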

The package DataExplorer has a high download rate and number of GitHub stars, and also a nice web site at https://boxuancui.github.io/DataExplorer/. The vignette “Introduction to DataExplorer” is a good place to start to learn what the package does. I want you to generate an automatic report to see what it creates, and what the package suggests is important. Use the code below to create the data to use - it does the same thing as the code in the vignette but uses dplyr and piping more nicely. Then run the report. It’s not very pretty to read, but there’s a vast amount of very useful information about the data that can help in preparing for its analysis. It’s completely overwhelming, though.

What does specifying y = "arr_delay" do?
Which variables have a lot of missing values?
How many engines do planes typically have?
What does the scale of the arr_time, dep_time, sched_arr_time and sched_dep_time indicate?
Which carrier had the most flights?
Which type of plane was most common? Why might conclusions about the type of plane be dangerous to make?
What do the QQ plots tell you?
arr_delay is divided into 6 categories before making some plots. What do you think the purpose of this is?
With your neighbour in the tutorial come up with one thing that is a bit surprising to you about this data. Make sure you state what you expected to see, and why what you saw was then a surprise.

# DataExplorer
library(DataExplorer)
library(nycflights13)
library(tidyverse)

# Create a big data set
airlines_all <- flights %>% 
  full_join(airlines, by = "carrier") %>%
  full_join(planes, by = "tailnum", 
            suffix = c("_flights", "_planes")) %>%
  full_join(airports, by = c("origin"="faa"), 
            suffix = c("_carrier", "_origin")) %>%
  full_join(airports, by = c("dest"="faa"), 
            suffix = c("_origin", "_dest"))

# Run a report
create_report(airlines_all, y = "arr_delay")

Solution

This is specifying that arr_delay is to be treated as a response variable.
The variable speed is mostly missing. The variables engine, seats, model, manufacturer, type and year_planes each have about 15% missing.
Almost all planes have 2 engines. Some have 1, and it seems that some have 3 or 4. The histogram includes these categories which means there must be some planes with 4 engines, but the count is so small that the bar is invisible.
These are hours and minutes written as four digits, e.g. 1045 means 10:45am.
UA, United Airlines
BOEING. It could be dangerous because 15% of observations on this variable are missing. If these all belong to one of the other popular types of planes (e.g. EMBRAER) we may be over-stating the popularity of BOEING.
QQ plots are for checking for normal distributions. We can see that none of the variables have normal distributions because none of them have points that fall along the straight guide line.
Sometimes breaking a numeric variable into pieces and plotting against another variable allows a rough assessment of associations in a simplified manner. It is often done to assess associations with categorical variables.
Here is one example: We might expect that if the departure was delayed then the arrival would be delayed. In the scatterplot of dep_delay vs arr_delay we don’t see this pattern! There is a subset of flights where there is a strong relationship, and another subset of observations where dep_delay is slightly increased at the highest values of arr_delay, but for the most part it is flat. That is, even when there is no departure delay, there can be a long arrival delay. This is curious! But there is something strange about this plot, too. It looks like arr_delay was treated as a categorical variable, given the regular markings on the horizontal axis.
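
The four-digit time coding described in the solution can be decoded with integer arithmetic. The sketch below is an illustration, assuming the values are stored as whole numbers in HHMM form. The vector hhmm holds example values chosen here (1045 is the one from the solution), not rows taken from the data.

```r
# Example HHMM-coded times (illustrative values, not taken from airlines_all)
hhmm <- c(517L, 1045L, 2359L)

hour   <- hhmm %/% 100L   # integer division drops the last two digits
minute <- hhmm %% 100L    # remainder keeps the last two digits

# Clock-style labels and minutes since midnight
clock <- sprintf("%02d:%02d", hour, minute)
mins  <- hour * 60L + minute

clock
mins
```

Because the coding jumps from 1059 straight to 1100, differences on this scale are not elapsed minutes. This is worth keeping in mind when reading the histograms of arr_time, dep_time, sched_arr_time and sched_dep_time in the report.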

Table 2 summarises the activities of two early phases of the CRISP-DM standard. What does CRISP-DM mean? The implication is that EDA is related to “data understanding” and “data preparation”. Would you agree with this or disagree? Why?

Solution

Cross-Industry Standard Process for Data Mining; EDA techniques can be useful for some parts of these stages, for example finding outliers, or examining missing value patterns. Some of these steps are important for effective EDA, too, for example, you need to know what types of variables you have in order to decide what types of plots to make.

Table 1 of the paper, which summarises CRAN downloads and GitHub activity, is hard to read. How are the rows sorted? What is the most important information communicated by the table? In what way(s) might revising this table make it easier to read and digest the most important information?

Solution

The most important information is the download rate, because the purpose is to know which packages are commonly used. Sorting the rows by downloads would make this easier to see.

👋 Finishing up

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.