ETC5521: Exploratory Data Analysis

class: middle center hide-slide-number monash-bg-gray80

.info-box.w-50.bg-white[
These slides are viewed best by Chrome or Firefox and occasionally need to be refreshed if elements did not load properly. See <a href=lecture-09B.pdf>here for the PDF </a>. 
]

.white[Press the **right arrow** to progress to the next slide!]

---

class: title-slide
count: false
background-image: url("images/bg-12.png")

# .monash-blue[ETC5521: Exploratory Data Analysis]

<h2 style="font-weight:900!important;">Exploring data having a space and time context</h2>

.bottom_abs.width100[

Lecturer: *Di Cook*

ETC5521.Clayton-x@monash.edu

Week 9 - Session 2

]

---
# Outline

- temporal missing values: time series models require that there is a value for each time step
- longitudinal data: it is different from time series. Typically measurements are taken irregularly in time.

---
class: transition

## Working with missings

---
# Checking counting and filling missings in time

.flex[
.w-45[
.s500.f5[

```r
set.seed(328)
harvest <- tsibble(
 year = c(2010, 2011, 2013, 2011, 
 2012, 2013),
 fruit = rep(c("kiwi", "cherry"), 
 each = 3),
 kilo = sample(1:10, size = 6),
 key = fruit, index = year
)
harvest
```

```
## # A tsibble: 6 x 3 [1Y]
## # Key: fruit [2]
## year fruit kilo
## <dbl> <chr> <int>
## 1 2011 cherry 2
## 2 2012 cherry 7
## 3 2013 cherry 1
## 4 2010 kiwi 6
## 5 2011 kiwi 5
## 6 2013 kiwi 8
```
]]
.w-10[
.white[...]
]
.w-45[

```r
*has_gaps(harvest, .full = TRUE)
```

```
## # A tibble: 2 × 2
## fruit .gaps
## <chr> <lgl>
## 1 cherry TRUE 
## 2 kiwi TRUE
```

Both levels of the key have missings.

.monash-orange2[Can you see the gaps in time?]

]]
---
# Checking counting and filling missings in time

.flex[
.w-45[
.s500.f5[

```r
*count_gaps(harvest,  .full=TRUE)
```

```
## # A tibble: 2 × 4
## fruit .from .to .n
## <chr> <dbl> <dbl> <int>
## 1 cherry 2010 2010 1
## 2 kiwi 2012 2012 1
```

One missing in each level, although it is a different year.

Notice how `tsibble` handles this summary so neatly.

]]

---
# Checking counting and filling missings in time

.flex[
.w-45[
.s500.f5[

Make the implicit missing values .monash-orange2[explicit].

.s400.f5[

```r
*harvest <- fill_gaps(harvest,
* .full=TRUE)
harvest 
```

```
## # A tsibble: 8 x 3 [1Y]
## # Key: fruit [2]
## year fruit kilo
## <dbl> <chr> <int>
*## 1 2010 cherry NA
## 2 2011 cherry 2
## 3 2012 cherry 7
## 4 2013 cherry 1
## 5 2010 kiwi 6
## 6 2011 kiwi 5
*## 7 2012 kiwi NA
## 8 2013 kiwi 8
```
]

]]

---
# Checking counting and filling missings in time

.flex[
.w-45[
.s500[

.s500.f5[

```r
harvest_nomiss <- harvest %>% 
* group_by(fruit) %>%
* mutate(kilo =
* na_interpolation(kilo)) %>%
 ungroup()
harvest_nomiss
```

```
## # A tsibble: 8 x 3 [1Y]
## # Key: fruit [2]
## year fruit kilo
## <dbl> <chr> <dbl>
*## 1 2010 cherry 2 
## 2 2011 cherry 2 
## 3 2012 cherry 7 
## 4 2013 cherry 1 
## 5 2010 kiwi 6 
## 6 2011 kiwi 5 
*## 7 2012 kiwi 6.5
## 8 2013 kiwi 8
```
]

]]

---
# .orange[Case study] .bg-orange.circle[3] Melbourne pedestrian traffic .font_small[Part 1/5]

.flex[
.w-45[
.s500.f5[

```r
data(pedestrian) # in tsibble
has_gaps(pedestrian, 
         .full = TRUE)
```

```
## # A tibble: 4 × 2
## Sensor .gaps
## <chr> <lgl>
## 1 Birrarung Marr TRUE 
## 2 Bourke Street Mall (North) TRUE 
## 3 QV Market-Elizabeth St (West) TRUE 
## 4 Southern Cross Station TRUE
```

```r
ped_gaps <- pedestrian %>% 
 count_gaps(.full = TRUE)
```

```r
ggplot(ped_gaps,
       aes(x = Sensor, 
           colour = Sensor)) +
  geom_linerange(
    aes(ymin = .from, 
        ymax = .to), size=2) +
  geom_point(aes(y = .from), size=4) +
  geom_point(aes(y = .to), size=4) +
  xlab("") +
  coord_flip() +
  scale_y_datetime("", 
    date_breaks = "4 months", 
    date_labels = "%y-%m",
    date_minor_breaks = "1 month") +
  scale_color_brewer("", 
          palette="Dark2") +
  theme(legend.position = "none")
```

]
]
.w-5[
.white[...]
]
.w-50[

Every sensor has a missing value each April. .monash-orange2[What happens in April each year]?

]]

---
# .orange[Case study] .bg-orange.circle[3] Melbourne pedestrian traffic .font_small[Part 2/5]

.panelset[
.panel[.panel-name[🖼️]

Missings at the end of the year at QV market.

]

.panel[.panel-name[R]

```r
ped_full %>% 
  filter(month(Date_Time) == 12, year(Date_Time) == 2015,
         mday(Date_Time) > 24) %>%
  ggplot(aes(x=Date_Time, y=Count, 
             colour=Sensor, group=Sensor)) + 
    geom_line() +
    geom_point() +
    xlab("") +
    facet_wrap(~Sensor, ncol=2, scales="free_y") +
    scale_color_brewer("", palette="Dark2") +
    theme(legend.position = "none")
```

]
]
---
# .orange[Case study] .bg-orange.circle[3] Melbourne pedestrian traffic .font_small[Part 3/5]

.panelset[
.panel[.panel-name[🖼️]

Imputed with seasonal component. 
]

.panel[.panel-name[R]

.s500.f5[

```r
ped_nomiss <- ped_full %>% 
 group_by(Sensor) %>%
 mutate(Count_int = na_interpolation(Count),
 Count_mean = na_mean(Count),
 Count_rand = na_random(Count),
 Count_ma = na_ma(Count, k=24),
 Count_loc = na_locf(Count),
 Count_seas = na_seadec(Count, find_frequency = TRUE))
ped_nomiss %>% 
 filter(month(Date_Time) == 12, year(Date_Time) == 2015,
 mday(Date_Time) > 24) %>%
 mutate(is_miss = ifelse(is.na(Count), "missing", "not")) %>%
 ggplot(aes(x=Date_Time, y=Count_seas, 
 group=Sensor, colour = is_miss)) + 
 geom_line() +
 geom_point() +
 facet_wrap(~Sensor, ncol=2, scales="free_y") +
 scale_color_brewer("", palette="Paired", direction = -1) +
 ylab("Count") + xlab("") + 
 theme(legend.position = c(0.4, 1.0))
```
]
]

]
---
# .orange[Case study] .bg-orange.circle[3] Melbourne pedestrian traffic .font_small[Part 4/5]

.panelset[
.panel[.panel-name[🖼️]

Missings in November at Birrarung Marr.
]

.panel[.panel-name[R]

```r
ped_full %>% 
 filter(month(Date_Time) == 11, year(Date_Time) == 2015,
 mday(Date_Time) < 8) %>%
 ggplot(aes(x=Date_Time, y=Count, 
 colour=Sensor, group=Sensor)) + 
 geom_line() +
 geom_point() +
 xlab("") +
 facet_wrap(~Sensor, ncol=2, scales="free_y") +
 scale_color_brewer("", palette="Dark2") +
 theme(legend.position = "none")
```

]
]
---
# .orange[Case study] .bg-orange.circle[3] Melbourne pedestrian traffic .font_small[Part 5/5]

.panelset[
.panel[.panel-name[🖼️]

Imputed with seasonal component. Irregular patterns make imputation difficult.

]

.panel[.panel-name[R]

```r
ped_nomiss %>% 
 filter(month(Date_Time) == 11, year(Date_Time) == 2015,
 mday(Date_Time) < 8) %>%
 mutate(is_miss = ifelse(is.na(Count), "missing", "not")) %>%
 ggplot(aes(x=Date_Time, y=Count_seas, 
 group=Sensor, colour = is_miss)) + 
 geom_line() +
 geom_point() +
 facet_wrap(~Sensor, ncol=2, scales="free_y") +
 scale_color_brewer("", palette="Paired", direction = -1) +
 ylab("Count") + xlab("") +
 theme(legend.position = c(0.4, 1.0))
```
]
]

---
class: middle

.info-box.w-60[Imputing temporal data is necessary for modeling and forecasting, which typically require complete data. Incorporate seasonal components, if necessary, and temporal dependence. That means you need to .monash-orange2[understand enough about the data to do imputation well].]

---
class: transition

## Longitudinal data

Information from the same individuals, recorded at multiple points in time.

Usually irregular, and not easy to regularise. Lots more short series.

Longitudinal data has the .monash-yellow2[same properties] as time series, but generally .monash-yellow2[different objectives] for the analysis.

---

In the `brolgar` package methods build on the `tsibble` data object.

```
*## # A tsibble: 6,402 x 9 [!]
*## # Key: id [888]
## id ln_wages xp ged xp_since_ged black hispanic high_grade unemploy_rate
## <int> <dbl> <dbl> <int> <dbl> <int> <int> <int> <dbl>
## 1 31 1.49 0.015 1 0.015 0 1 8 3.21
## 2 31 1.43 0.715 1 0.715 0 1 8 3.21
## 3 31 1.47 1.73 1 1.73 0 1 8 3.21
## 4 31 1.75 2.77 1 2.77 0 1 8 3.3 
## 5 31 1.93 3.93 1 3.93 0 1 8 2.89
## 6 31 1.71 4.95 1 4.95 0 1 8 2.49
## 7 31 2.09 5.96 1 5.96 0 1 8 2.6 
## 8 31 2.13 6.98 1 6.98 0 1 8 4.8 
## 9 36 1.98 0.315 1 0.315 0 0 9 4.89
## 10 36 1.80 0.983 1 0.983 0 0 9 7.4 
## # ℹ 6,392 more rows
```

---
# .orange[Case study] .bg-orange.circle[4] Wages .font_small[Part 1/15]

.pull-left[

```r
wages %>%
  ggplot(aes(x = xp,
             y = ln_wages,
             group = id)) + 
  geom_line(alpha=0.3) 
```

Log(wages) of 888 individuals, measured at various times in their employment (workforce experience).

]

.pull-right[
<img src="images/lecture-09B/unnamed-chunk-11-1.png" width="100%" style="display: block; margin: auto;" />
]

---
class: middle center

# from a spaghetti mess

.footnote[Source: giphy]
---
class: middle center
count: false

# to controlled spaghetti handling

.footnote[Source: giphy]

---
class: middle center
count: false

# to perfection

.footnote[Source: giphy]

---
# .orange[Case study] .bg-orange.circle[4] Wages .font_small[Part 2/15]

.pull-left[

Using features, compute the number of measurements for each subject

```r
wages %>%
* features(ln_wages, n_obs) %>%
  ggplot(aes(x = n_obs)) + 
  geom_bar() +
  xlab("Number of observations") 
```

Different number of observations per person! It ranges from 1-13.

.info-box[Too few observations means there is a lack of support to do temporal analysis.]
]

.pull-right[
<img src="images/lecture-09B/unnamed-chunk-12-1.png" width="100%" style="display: block; margin: auto;" />
]
---
# .orange[Case study] .bg-orange.circle[4] Wages .font_small[Part 3/15]

.pull-left[
You should filter on this, and remove subjects with few observations.

```r
*wages <- wages %>% add_n_obs()
wages %>% 
* filter(n_obs > 3) %>%
 select(id, ln_wages, xp, n_obs)
```
]

.pull-right[

```
## # A tsibble: 6,145 x 4 [!]
*## # Key: id [764]
## id ln_wages xp n_obs
## <int> <dbl> <dbl> <int>
## 1 31 1.49 0.015 8
## 2 31 1.43 0.715 8
## 3 31 1.47 1.73 8
## 4 31 1.75 2.77 8
## 5 31 1.93 3.93 8
## 6 31 1.71 4.95 8
## 7 31 2.09 5.96 8
## 8 31 2.13 6.98 8
## 9 36 1.98 0.315 10
## 10 36 1.80 0.983 10
## # ℹ 6,135 more rows
```
]

---
# .orange[Case study] .bg-orange.circle[4] Wages .font_small[Part 4/15]

.pull-left[
Using features to extract minimum time

```r
wages %>%
* features(xp, list(min = min)) %>%
  ggplot(aes(x = min)) + 
  geom_histogram(binwidth=0.5) +
  xlim(c(0, 13)) +
  xlab("First time in study") 
```

Subjects start in the study at different employment experience times, ranging from 0 to more than 10 years.
]

.pull-right[
<img src="images/lecture-09B/unnamed-chunk-14-1.png" width="100%" style="display: block; margin: auto;" />
]

---
# .orange[Case study] .bg-orange.circle[4] Wages .font_small[Part 5/15]

.pull-left[
Using features to extract range of time index

```r
wages_xp_range <- wages %>% 
* features(xp, feat_ranges)

ggplot(wages_xp_range,
       aes(x = range_diff)) + 
  geom_histogram() + 
  xlab("Range of experience")
```

There's a range of workforce experience.
]

.pull-right[

<img src="images/lecture-09B/unnamed-chunk-15-1.png" width="100%" style="display: block; margin: auto;" />
]

---
# .orange[Case study] .bg-orange.circle[4] Wages .font_small[Part 6/15]

.pull-left[
.monash-orange2[Small spoonfuls of spaghetti]

Sample some individuals

```r
wages %>%
* sample_n_keys(size = 10) %>%
  ggplot(aes(x = xp,
             y = ln_wages,
             group = id)) + 
  geom_line() +
  xlim(c(0,13)) + ylim(c(0, 4.5)) +
  xlab("Years of experience") + 
  ylab("Log wages") 
```

Wages conversion 0.5 = $1.65; 4.5 = $90
]

.pull-right[
<img src="images/lecture-09B/unnamed-chunk-17-1.png" width="100%" style="display: block; margin: auto;" />
]

---
count: false
# .orange[Case study] .bg-orange.circle[4] Wages .font_small[Part 6/15]

.pull-left[
.monash-orange2[Small spoonfuls of spaghetti]

Sample some individuals

Wages conversion 0.5 = $1.65; 4.5 = $90
]

.pull-right[
<img src="images/lecture-09B/unnamed-chunk-19-1.png" width="100%" style="display: block; margin: auto;" />
]

---
count: false
# .orange[Case study] .bg-orange.circle[4] Wages .font_small[Part 6/15]

.pull-left[
.monash-orange2[Small spoonfuls of spaghetti]

Sample some individuals

Wages conversion 0.5 = $1.65; 4.5 = $90
]

.pull-right[
<img src="images/lecture-09B/unnamed-chunk-21-1.png" width="100%" style="display: block; margin: auto;" />
]

---
# .orange[Case study] .bg-orange.circle[4] Wages .font_small[Part 7/15]

.pull-left[
.monash-orange2[Take a spoonful of different lengths]

Sample experienced individuals

```r
wages %>%
  add_n_obs() %>%
* filter(n_obs > 7) %>%
  sample_n_keys(size = 10) %>%
  ggplot(aes(x = xp,
             y = ln_wages,
             group = id)) + 
  geom_line()  +
  xlim(c(0,13)) + ylim(c(0, 4.5)) +
  xlab("Years of experience") + 
  ylab("Log wages")
```

Wages conversion 0.5 = $1.65; 4.5 = $90
]

.pull-right[
<img src="images/lecture-09B/unnamed-chunk-22-1.png" width="100%" style="display: block; margin: auto;" />
]

---
count: false
# .orange[Case study] .bg-orange.circle[4] Wages .font_small[Part 7/15]

.pull-left[
.monash-orange2[Take a spoonful of different lengths]

Sample experienced individuals

Wages conversion 0.5 = $1.65; 4.5 = $90
]

.pull-right[
<img src="images/lecture-09B/unnamed-chunk-24-1.png" width="100%" style="display: block; margin: auto;" />
]

---
count: false
# .orange[Case study] .bg-orange.circle[4] Wages .font_small[Part 7/15]

.pull-left[
.monash-orange2[Take a spoonful of different lengths]

Sample experienced individuals

Wages conversion 0.5 = $1.65; 4.5 = $90
]

.pull-right[
<img src="images/lecture-09B/unnamed-chunk-26-1.png" width="100%" style="display: block; margin: auto;" />
]

---
# .orange[Case study] .bg-orange.circle[4] Wages .font_small[Part 8/15]

.panelset[
.panel[.panel-name[🖼️]

]
.panel[.panel-name[info]

Two sampling tools

- `facet_strata`: show the whole pot, neatly separated into equally portioned
- `facet_sample`: show most of the pot in neatly separated portions

]
.panel[.panel-name[R]

```r
wages %>% 
  filter(n_obs > 1) %>%
  ggplot(aes(x = xp,
           y = ln_wages,
           group = id)) +
  geom_line() +
* facet_sample(n_per_facet = 3,
*              n_facets = 20)  +
  xlab("Years of experience") + 
  ylab("Log wages") 
```

]
]

---
class: transition

## Special features

---
## Special features

Remember scagnostics?

Compute longnostics for each subject

- Slope, intercept from simple linear model
- Variance, standard deviation
- Jumps, differences

For large collections of time series, take a look at the `feasts` package, which has a long list of time series features (tignostics) to calculate.

---
# .orange[Case study] .bg-orange.circle[4] Wages .font_small[Part 9/15]

.pull-left[

.monash-orange2[increasing]

```r
wages_slope <- wages %>% 
 add_n_obs() %>%
 filter(n_obs > 4) %>%
* add_key_slope(ln_wages ~ xp) %>%
 as_tsibble(key = id, index = xp)

wages_slope %>% 
* filter(.slope_xp > 0.4) %>%
  ggplot(aes(x = xp, 
             y = ln_wages, 
             group = id)) + 
  geom_line() +
  ylim(c(0, 4.5)) +
  xlab("Years of experience") + 
  ylab("Log wages") 
```

]

.pull-right[
<img src="images/lecture-09B/unnamed-chunk-28-1.png" width="100%" style="display: block; margin: auto;" />
]

---
# .orange[Case study] .bg-orange.circle[4] Wages .font_small[Part 10/15]

.pull-left[

.monash-orange2[decreasing]

```r
wages_slope %>% 
* filter(.slope_xp < (-0.7)) %>%
 ggplot(aes(x = xp, 
 y = ln_wages, 
 group = id)) + 
 geom_line() +
 ylim(c(0, 4.5)) +
 xlab("Years of experience") + 
 ylab("Log wages")
```

]

.pull-right[
<img src="images/lecture-09B/unnamed-chunk-29-1.png" width="100%" style="display: block; margin: auto;" />
]

---
class: transition

## Longitudinal data needs a special five number summary

---
# Summarising individuals

A different style of five number summary

Who is average?
Who is different?

Find those individuals who are .monash-orange2[representative] of the min, median, maximum, etc of a particular feature, e.g. trend, using `keys_near()`. This reports the individual who is closest to a particular statistic.

`wages_threenum()` returns the three individuals: min, max and closest to the median value.

`wages_fivenum()` returns the five individuals: min, max and closest to the median, Q1 and Q3 values.

---
# .orange[Case study] .bg-orange.circle[4] Wages .font_small[Part 11/15]

.pull-left[

```r
wages_threenum <- wages %>% 
 add_n_obs() %>%
 filter(n_obs > 4) %>%
 key_slope(ln_wages ~ xp) %>%
* keys_near(key = id,
* var = .slope_xp,
* funs = l_three_num) %>%
 left_join(wages, by = "id") %>%
 as_tsibble(key = id, index = xp) 
```

.info-box[Minimum/maximum are short series with substantial decline/incline. Median is very flat, no change in real wages.]
]
]

.pull-right[
<img src="images/lecture-09B/three_number_plot-1.png" width="100%" style="display: block; margin: auto;" />
]

---
# .orange[Case study] .bg-orange.circle[4] Wages .font_small[Part 12/15]

.pull-left[

.f5[

```r
wages_fivenum <- wages %>% 
 add_n_obs() %>%
 filter(n_obs > 6) %>%
 key_slope(ln_wages ~ xp) %>%
 keys_near(key = id,
 var = .slope_xp,
* funs = l_five_num) %>%
 left_join(wages, by = "id") %>%
 as_tsibble(key = id, index = xp) 
```
]
]
.pull-right[
.info-box[Q1 and Q3 are also flat which means that, at least, 50% of the individuals experience no real change in wage.]
]

---
# .orange[Case study] .bg-orange.circle[4] Wages .font_small[Part 13/15]

.pull-left[

.monash-orange2[Sculpting spaghetti]

Mixed effects model, education as fixed effect, subject random effect using slope.

.f5[

```r
wages_fit_int <- 
* lmer(ln_wages ~ xp + high_grade +
* (xp |id), data = wages)
wages_aug <- wages %>%
 add_predictions(wages_fit_int, 
 var = "pred_int") %>%
 add_residuals(wages_fit_int, 
 var = "res_int")
```
]
]

.pull-right[
<img src="images/lecture-09B/model_plot-1.png" width="100%" style="display: block; margin: auto;" />
]

---
count: false

# .orange[Case study] .bg-orange.circle[4] Wages .font_small[Part 13/15]

.pull-left[

.monash-orange2[Sculpting spaghetti]

Mixed effects model, education as fixed effect, subject random effect using slope.

.f5[

.pull-right[
<img src="images/lecture-09B/res_plot-1.png" width="100%" style="display: block; margin: auto;" />
]

---
# .orange[Case study] .bg-orange.circle[4] Wages .font_small[Part 14/15]

.panelset[
.panel[.panel-name[🖼️]

Model diagnostics: Sample individuals and plot model on the data. .monash-orange2[Notice individual 5630.]

]
.panel[.panel-name[R]

```r
set.seed(1)
wages_aug %>% add_n_obs() %>% filter(n_obs > 4) %>%
  sample_n_keys(size = 12) %>%
  ggplot() + 
* geom_line(aes(x = xp, y = pred_int, group = id,
             colour = factor(id))) + 
* geom_point(aes(x = xp, y = ln_wages,
                 colour = factor(id))) + 
  scale_colour_ochre(palette = "emu_woman_paired") +
  facet_wrap(~id, ncol=4)  +
  xlab("Years of experience") + ylab("Log wages") +
  theme(legend.position = "none")
```
]
]

---
# .orange[Case study] .bg-orange.circle[4] Wages .font_small[Part 15/15]

## What we learn about wages that we would not have learned without doing EDA

- The individual wage experience is extremely varied
- Some individuals see a decline in their wages the longer they are in the workforce
- Most individuals generally see some (small) increase, on average

.info-box[Exploratory analysis of individual temporal patterns can be very really interesting!]

---
class: middle

.info-box[The main difference between .monash-green2[time series] data and .monash-purple2[longitudinal] data, is the former is typically regular, complete, may be only one or a few, and the latter is typically of different lengths, different time measurements and a lot.]

---

# Resources and Acknowledgement

- Tidy tools for time series [tidyverts](https://tidyverts.org)
- Imputing missings in time using [imputeTS](https://cran.r-project.org/web/packages/imputeTS/vignettes/imputeTS-Time-Series-Missing-Value-Imputation-in-R.pdf)
- [Temporal missings](https://tsibble.tidyverts.org/articles/implicit-na.html) in `tsibble`
- [Longitudinal data exploration](http://brolgar.njtierney.com/index.html) with `brolgar`
- Data coding using [`tidyverse` suite of R packages](https://www.tidyverse.org) 
- Slides constructed with [`xaringan`](https://github.com/yihui/xaringan), [remark.js](https://remarkjs.com), [`knitr`](http://yihui.name/knitr), and [R Markdown](https://rmarkdown.rstudio.com).

---

background-size: cover
class: title-slide
background-image: url("images/bg-12.png")

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a> This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.

.bottom_abs.width100[

Lecturer: *Di Cook*

ETC5521.Clayton-x@monash.edu

Week 9 - Session 2

]