ETC5521 Tutorial 2

Introduction to exploratory data analysis

Author

Prof. Di Cook

Published

August 3, 2023

🎯 Objectives

The purpose of this tutorial is to scope out the software that purports to do EDA in R. We want to understand both its capabilities and its limitations.

🔧 Preparation

The reading for this week is The Landscape of R Packages for Automated Exploratory Data Analysis. This is a lovely summary of the available software that is considered to do exploratory data analysis (EDA). (Note: Dr Cook considers these to be mostly descriptive statistics packages, not exploratory data analysis in the true spirit of the term.) This reading will be the basis of the tutorial exercises today.

  • Complete the weekly quiz, before the deadline!
  • Install this list of R packages, in addition to what you installed in the previous weeks:
install.packages(c("arsenal", "autoEDA", "DataExplorer", "dataMaid", "dlookr", "ExPanDaR", "explore", "exploreR", "funModeling", "inspectdf", "RtutoR", "SmartEDA", "summarytools", "visdat", "xray", "cranlogs", "tidyverse", "nycflights13"))
  • Open your RStudio Project for this unit (the one you created in week 1, eda or ETC5521). Create a .Rmd document for this week's activities.

Exercise: Trying out EDA software

The article lists a number of R packages that might be used for EDA: arsenal, autoEDA, DataExplorer, dataMaid, dlookr, ExPanDaR, explore, exploreR, funModeling, inspectdf, RtutoR, SmartEDA, summarytools, visdat, xray.

  1. What package had the highest number of CRAN downloads as of 12.07.2019? (Based on the paper.)

summarytools with 84737

  2. Open up the shiny server for checking download rates at https://hadley.shinyapps.io/cran-downloads/. What package has the highest download rate over the period from Jan 1, 2023 to today?
library(cranlogs)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
eda_pkgs <- cran_downloads(
  packages = c("arsenal", "autoEDA", "DataExplorer", "dataMaid", "dlookr",
               "ExPanDaR", "explore", "exploreR", "funModeling", "inspectdf",
               "RtutoR", "SmartEDA", "summarytools", "visdat", "xray"),
  from = "2023-01-01",
  to = lubridate::today()
)
eda_pkgs %>% 
  group_by(package) %>%
  summarise(m=mean(count)) %>%
  arrange(desc(m))
# A tibble: 15 × 2
   package           m
   <chr>         <dbl>
 1 visdat       913.  
 2 summarytools 888.  
 3 SmartEDA     695.  
 4 DataExplorer 329.  
 5 arsenal      199.  
 6 dlookr       115   
 7 funModeling   87.0 
 8 explore       55.9 
 9 inspectdf     46.1 
10 dataMaid      34.7 
11 ExPanDaR      21.4 
12 xray          17.2 
13 exploreR       7.71
14 RtutoR         2.51
15 autoEDA        0   

visdat. Interestingly, this package was developed by Nick Tierney in the years he was at Monash.

  3. What is an interesting pattern to observe from the time series plot of all the downloads?

The weekly seasonality! There is a regular up/down pattern. If you zoom in closely (try plotting just a couple of weeks of data) you can see that it corresponds to weekdays vs weekends.
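One way to check the weekday vs weekend explanation is to label each date and compare the average counts. The following is a minimal sketch in base R, not part of the original solution. The helper name add_day_type is hypothetical, and the toy data frame uses made-up counts purely to show the shape of the result. It assumes the input has date and count columns, as the output of cran_downloads() does.

```r
# Hypothetical helper: label each date as "weekday" or "weekend".
# Assumes `downloads` has a Date column called `date`.
add_day_type <- function(downloads) {
  wday <- as.POSIXlt(downloads$date)$wday  # 0 = Sunday, ..., 6 = Saturday
  downloads$day_type <- ifelse(wday %in% c(0, 6), "weekend", "weekday")
  downloads
}

# Toy input with made-up counts (NOT real download numbers).
toy <- data.frame(
  date = as.Date("2023-01-02") + 0:6,  # Monday 2 Jan to Sunday 8 Jan 2023
  count = c(10, 10, 10, 10, 10, 4, 4),
  package = "toy"
)
toy <- add_day_type(toy)

# Average count for each day type.
day_means <- tapply(toy$count, toy$day_type, mean)
day_means
```

Applying the same helper to eda_pkgs and then averaging count by day_type should show whether the dips in the time series line up with Saturdays and Sundays.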

  4. How many functions do Staniak and Biecek (2019) say visdat has for doing EDA? Explore what each of them does by running the example code for each function. What do you think are the features that make visdat a really popular package?

Seven. It has a simple focus, with useful functions that apply to a lot of problems.

library(visdat)
# function 1: overview of the class of each column, with missing values marked
vis_dat(airquality)

# function 2: guess the class of each individual cell
messy_vector <- c(TRUE,
                 "TRUE",
                 "T",
                 "01/01/01",
                 "01/01/2001",
                 NA,
                 NaN,
                 "NA",
                 "Na",
                 "na",
                 "10",
                 10,
                 "10.1",
                 10.1,
                 "abc",
                 "$%TG")
set.seed(1114)
messy_df <- data.frame(var1 = messy_vector,
                       var2 = sample(messy_vector),
                       var3 = sample(messy_vector))
vis_guess(messy_df)

# function 3: show where values are missing, with the percentage missing
vis_miss(airquality)
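To see the numbers behind the vis_miss() plot, the missingness can also be tabulated directly. This is a small base R sketch, not the way visdat computes its display internally. It uses the built-in airquality data, and the variable names na_counts, na_pct and overall_pct are introduced here only for illustration.

```r
# Number of missing values in each column of airquality.
na_counts <- colSums(is.na(airquality))

# Percentage of missing values per column, rounded to one decimal place.
na_pct <- round(100 * na_counts / nrow(airquality), 1)

# Percentage of missing cells across the whole data frame.
overall_pct <- round(100 * mean(is.na(airquality)), 1)

na_counts
na_pct
overall_pct
```

Only Ozone and Solar.R contain missing values, which should match the two columns that show missingness in the plot.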