ETC5521 Tutorial 10

Exploring data having a space and time context

Author

Prof. Di Cook

Published

October 5, 2023

🎯 Objectives

These exercise are to do some exploratory analysis with graphics and statistical models, focusing on temporal data analysis.

🔧 Preparation

The reading for this week is Reintroducing tsibble: data tools that melt the clock and brolgar: An R package to BRowse Over Longitudinal Data Graphically and Analytically in R.
Complete the weekly quiz, before the deadline!
Install the following R-packages if you do not have them already:

install.packages(c("tidyverse", "here", "tsibble", "lubridate", "DAAG", "broom", "patchwork", "colorspace", "GGally", "tsibbledata", "forcats", "chron", "sugrrants"))
remotes::install_github("njtierney/brolgar")
remotes::install_github("hrbrmstr/streamgraph")

Open your RStudio Project for this unit, (the one you created in week 1, eda or ETC5521).

Exercise 1: Australian rain

This exercise is based on one from Unwin (2015), and uses the bomregions data from the DAAG package. The data contains regional rainfall for the years 1900-2008. The regional rainfall numbers are area-weighted averages for the respective regions. Extract just the rainfall columns from the data, along with year.

What do you think area-weighted averages are, and how would these be calculated?

Solution

The total rainfall is divided by geographic area to get the rainfall on a scale that can be compared aross different sized regions.

Make line plots of the rainfall for each of the 7 regions and the Australian averages. What do you learn about rainfall patterns across the years and regions? (Hint: you can make these with the ggduo function in the GGally package.)

Solution

The rainfall patterns are fairly flat across regions. The temporal trend differs a little from one region to another.

data(bomregions)
ggduo(bomregions, "Year", c("seRain", "southRain", "eastRain", "northRain", "swRain", "qldRain", "nswRain", "ntRain", "saRain", "tasRain", "vicRain", "waRain", "mdbRain", "ausRain"))

It can be difficult to assess correlation between multiple series using line plots, and the best way to check correlation between multiple series is to make a scatterplot. Make a splom for this data, ignoring year. What regions have strong positive correlation between their rainfall averages?

Solution

It is mostly positive linear association. Some pairs - eastRain, seRain, mdbRain, qldRain, vicRain - are strongly correlated. There are a few outliers (high values) in several regions, particularly the north.

ggpairs(bomregions[, c(16:29)])

One of the consequences of climate change for Australia is that some regions are likely getting drier. Make a transformation of the data to compute the difference between rainfall average in the year, and the mean over all years. Using a bar for each year, make a barchart that examines the differences in the yearly rainfall over time. (Hint: you will need to pivot the data into tidy long form to make this easier.) Are there some regions who have negative differences in recent years? What else do you notice?

Solution

There doesn’t appear to be more negative differences in recent years. Although there is possibly a hint in several regions: swRain, seRain, tasRain and vicRain. There were several years of heavier than average rain in most regions in the early 1970s. Generally the pattern is a few wet years then a few dry years.

med_ctr <- function(x, na.rm = TRUE) {
  x-mean(x, na.rm = TRUE)
}
bomrain <- bomregions %>% as_tibble() %>%
  select(Year, contains("Rain")) %>%
  mutate_at(vars(-Year), med_ctr) %>%
  pivot_longer(cols = seRain:ausRain, 
               names_to = "area", 
               values_to = "rain")
ggplot(bomrain, aes(x=Year, y=rain)) +
  geom_col() +
  facet_wrap(~area, ncol=1, scales="free_y")

Exercise 2: US unemployment

This exercise is based on and example in Oscar Perpinan Lamigueiro (2018) “Displaying Time Series, Spatial, and Space-Time Data with R”. Read the US employment data from the book web site. This contains monthly unemployment numbers from 2000 through 2012 in different sectors.

Transform the data into tidy long form and convert to a tsibble. Make a line plot coloured by sector. What do you learn about unemployment during this time frame from this chart?

Solution

2008 is when unemployment rose in many sectors. In some sectors there is a strong seasonal pattern.

US_unemp <- read_csv("https://raw.githubusercontent.com/oscarperpinan/bookvis/master/data/unemployUSA.csv") %>%
  select(!contains("Annual")) %>%
  pivot_longer(`Jan 2000`:`Dec 2012`, 
               names_to = "date", 
               values_to = "count") %>%
  mutate(date = as.Date(paste("01", date), "%d %b %Y")) %>%
  as_tsibble(index = date, key = `Series ID`)
ggplot(US_unemp, aes(x=date, y=count, colour = `Series ID`)) +
  geom_line() +
  scale_colour_discrete_divergingx()

We are going to re-arrange the data now to examine the monthly patterns by year and sector. Create new variables for month and year from the date variable. Now make a line plot of count by month coloured by year, using an appropriately sequential colour palette, and facet by sector. Are there some sectors that have a seasonal pattern? Are there some sectors who were not affected by the 2008 economic downfall?

Solution

Some sectors, eg LNU03028615, LNU03032231 and LNU03035181 have a strong seasonal pattern. LNU03035181 and LNU03032237 appear to be less affected by the economic downfall.

US_unemp <- US_unemp %>%
  mutate(month = month(date, label = TRUE),
         year = year(date)) 
ggplot(US_unemp, aes(x=month, y=count, 
                     colour = year, group = year)) +
  geom_line() + 
  facet_wrap(~`Series ID`, ncol=5, scales="free_y") +
  scale_x_discrete("", labels = c("Jan"="J", "Feb"="F", 
                                  "Mar"="M", "Apr"="A", 
                                  "May"="M", "Jun"="J",
                                  "Jul"="J", "Aug"="A",
                                  "Sep"="S", "Oct"="O",
                                  "Nov"="N", "Dec"="D")) +
  scale_colour_viridis_c("", breaks = seq(2000, 2012, 4),
    guide = guide_colourbar(barwidth = 15, 
                            barheight = 1)) +
  theme_bw() +
  theme(legend.position = "bottom")

This next way to look at the data is like a stacked bar chart for time series. Using the same code as in question a., change geom_line with geom_area, to stack the series, with a different fill colour for each sector. What do you learn about the magnitude of the 2008 economic crisis? Can you read much from this chart about the effect on different sectors?

Solution

The big increase in unemployed after 2008 is emphasised by this chart. It’s difficult to examine the individual sectors, though.

ggplot(US_unemp, aes(x=date, y=count, fill = `Series ID`)) + geom_area(position = "stack") + theme_bw() +
  scale_x_date("", date_breaks = "3 years", 
               date_labels = "%Y") +
  scale_fill_discrete_divergingx() +
  theme(legend.position = "bottom")

This is a similar type of plot called a “stream graph”. The streamgraph package generates this as an interactive plot, which is great for exploring multiple nested time series.

US_unemp %>% 
  rename("sector" = `Series ID`) %>%
streamgraph("sector", "count", "date")

Exercise 3: Lynx trappings periodicity

The lynx data in R is a classic example: Annual numbers of lynx trappings for 1821–1934 in Canada, from Brockwell & Davis (1991). It is a classic because it looks periodic, but it really doesn’t have a period. Here we look at two ways to examine the cyclic nature to check for periodicity.

Make a line plot of count by year. Do you agree that it looks periodic?

Solution

lynx_tsb <- as_tsibble(lynx) %>%
  rename(count = value)
ggplot(lynx_tsb, aes(x=index, y=count)) +
  geom_line() + 
  xlab("") + ylab("count") +
  theme_bw()

Create two new variables by rounding the year into a decade, and the remainder into a year in the decade.

Solution

lynx_tsb <- lynx_tsb %>%
  mutate(decade = round(index/10, 0), 
         yr_decade = index %% 10)

Add a vertical line every 10 years, to your plot of counts over time. If you start from 1928, the location of the first peak, you can check the peak locations in subsequent years. Is the peak roughly every 10 years?

Solution

The vertical lines mark each decade, and the first one matches the first peak. For the first few decades the line matches the peak, but as time progresses the peak arrives a little earlier than the decade.

ggplot(lynx_tsb, aes(x=index, y=count)) +
  geom_vline(xintercept = seq(1828, 1928, 10),
             colour="seagreen3") +
  geom_line() + 
  xlab("") + ylab("count") +
  theme_bw()

Cut the series into decades, and make overlaid line plots of count vs year in decade, using decade as the group variable. This is like looking at seasonality, like we might look at seasonal patterns in a year by examining the months. If the series is cyclic, particularly with peaks every 10 years, the peaks should line up. Do they?

Solution

Snipping the series into 10 year blocks does not produce a matching of the peaks. Although it looks like a 10 year cycle, it appears to be a little less than that, and slightly irregular.

ggplot(lynx_tsb, aes(x=yr_decade, y=count, group = decade)) +
  geom_line() +
  scale_x_continuous("", breaks = seq(0, 9, 1))

Exercise 4: Missing values in NYC bikes data

In Earo Wang’s blog post introducing tsibble she used NYC bikes data. This data is now made available in the tsibbledata package.

Focusing on May, aggregate the data to hour, and count the number of trips in each hour.

Solution

# Select May
hourly_trips <- nyc_bikes %>% 
  filter(month(start_time) == 5) %>%
  index_by(start_hour = floor_date(start_time, unit = "1 hour")) %>% 
  summarise(ntrips = n()) %>% 
  as_tsibble()

Has the data got missing values?

Solution

Yes

has_gaps(hourly_trips, .full = TRUE)

# A tibble: 1 × 1
  .gaps
  <lgl>
1 TRUE

Make a line plot of the hourly trips over the month. Why does it look like there are no missings? Why is the line not broken by missings?

Solution

ggplot(hourly_trips, aes(x=start_hour, y=ntrips)) +
  geom_line()

Use fill_gaps to make implicit missings explicit. Re-make the line plot from question c again.

Solution

There are a lot of gaps. It’s hard to see where they are because there are missings everywhere!

hourly_trips %>% fill_gaps(.full = TRUE) %>%
  ggplot(aes(x=start_hour, y=ntrips)) +
  geom_line()

To focus on the missings, we can make a plot of the places where there are gaps, by plotting the data created by the count_gaps function. How extensive are the missing values?

Solution

The missing values are extensive.

nycbikes_gaps <- count_gaps(hourly_trips, .full = TRUE)
ggplot(nycbikes_gaps) +
  geom_linerange(aes(xmin = .from, xmax = .to, y=1)) +
  geom_point(aes(x = .from, y=1)) +
  geom_point(aes(x = .to, y=1))

Focusing on the hour when missings occur, check if there are some times of the day that missings are more frequent.

Solution

There are more missings in peak traffic hours, and lunch times. This data is tracking 10 bikes over this time period. Its not clear what generates the missing values, but maybe times when no bikes are being used?

scan_gaps(hourly_trips)  %>%
  mutate(time = hour(start_hour)) %>% 
  ggplot(aes(x = time)) +
  geom_bar()

Exercise 5: Imputing missings for pedestrian sensor using a model

We saw in the lecture notes that imputing by simple method such as mean or moving average doesn’t work well with multiple seasonality in a time series. Here we will use a linear model to capture the seasonality and produce better imputations for the pedestrian sensor data (from the tsibble package).

What are the multiple seasons of the pedestrian sensor data, for QV Market-Elizabeth St (West)? (Hint: Make a plot to check. You might filter to a single month to make it easier to see seasonality. You might also want to check when Queen Victoria Market is open.)

Solution

There is a daily seasonality, and an open/closed market day seasonality (Tue, Thu, Fri, Sat, Sun), and there is even a summer winter seasonality (Wednesday night market, see the double-peak in Feb/Mar).

pedestrian %>%
  mutate(year = year(Date),
         month = month(Date)) %>%
  dplyr::filter(Sensor == "QV Market-Elizabeth St (West)",
                year == 2016, month %in% c(2:5)) %>%
  ggplot() +   
    geom_line(aes(x=Time, y=Count)) +
    facet_calendar(~Date) +
  theme(axis.title = element_blank(),
        axis.text = element_blank())

Check and fill the gaps for the pedestrian sensor data with NA. Subset to just the QV market sensor.

Solution

has_gaps(pedestrian, .full = TRUE)

# A tibble: 4 × 2
  Sensor                        .gaps
  <chr>                         <lgl>
1 Birrarung Marr                TRUE 
2 Bourke Street Mall (North)    TRUE 
3 QV Market-Elizabeth St (West) TRUE 
4 Southern Cross Station        TRUE

ped_gaps <- pedestrian %>% 
  count_gaps(.full = TRUE)
ped_full <- pedestrian %>% 
  fill_gaps(.full = TRUE)

Create a new variable to indicate if a day is a non-working day, called hol. Make hour a factor - this helps to make a simple model for a non-standard daily pattern.

Solution

hol <- holiday_aus(2015:2016, state = "VIC")
ped_qvm <- ped_full %>% 
  filter(Sensor == "QV Market-Elizabeth St (West)") %>%
  mutate(hol = is.weekend(Date)) %>%
  mutate(hol = ifelse(Date %in% hol, TRUE, hol)) %>%
  mutate(Date = as_date(Date_Time), Time = hour(Date_Time)) %>%
  mutate(Time = factor(Time))

Fit a linear model with Count as the response on predictors Time and hol interacted.

Solution

ped_qvm_lm <- lm(Count~Time*hol, data=ped_qvm)

Predict the count for all the data at the sensor.

Solution

ped_qvm$pCount <- predict(ped_qvm_lm, ped_qvm)

Make a line plot focusing on the last two weeks in 2015, where there was a day of missings, where the missing counts are substituted by the model predictions. Do you think that these imputed values match the rest of the series, nicely?

Solution

This makes a much better imputed value. There’s still room for improvement but its better than a nearest neighbour, or mean or moving average imputation.

ped_qvm_sub <- ped_qvm %>%
  filter(month(Date_Time) == 12, year(Date_Time) == 2015,
         mday(Date_Time) > 21) 
ggplot(ped_qvm_sub) +   
    geom_line(aes(x=Date_Time, y=Count)) +
    geom_line(data=filter(ped_qvm_sub, is.na(Count)), 
                      aes(x=Date_Time, 
                          y=pCount), 
                      colour="seagreen3") +
  scale_x_datetime("", date_breaks = "1 day", 
                   date_labels = "%a %d")

Warning: Removed 24 rows containing missing values (`geom_line()`).

Exercise 6: Men’s heights

The heights data provided in the brolgar package contains average male heights in 144 countries from 1500-1989.

What’s the time index for this data? What is the key?

Solution

The time index is year, and key is country.

Filter the data to keep only measurements since 1700, when there are records for many countries. Make a spaghetti plot for the values from Australia. Does it look like Australian males are getting taller?

Solution

Its looking like Australian males are getting taller BUT …. There are few measurements in the 1900s, and none since 1975. The data for Australia looks unreliable.

heights <- brolgar::heights %>% filter(year > 1700)
heights_oz <- heights %>% 
  filter(country == "Australia") 
ggplot(heights_oz,
       aes(x = year,
           y = height_cm,
           group = country)) + 
  geom_point() + 
  geom_line()

Check the number of observations for each country. How many countries have less than five years of measurements? Filter these countries out of the data, because we can’t study temporal trend without sufficient measurements.

Solution

heights <- heights %>% 
  add_n_obs() %>% 
  filter(n_obs >= 5)

Make a spaghetti plot of all the data, with a smoother overlaid. Does it look like men are generally getting taller?

Solution

Generally, the trend is up, so yes it does look like men are getting taller acorss the globe.

ggplot(heights,
       aes(x = year,
           y = height_cm)) + 
  geom_line(aes(group = country), alpha = 0.3) + 
  geom_smooth(se=FALSE)

Use facet_strata to break the data into subsets using the year, and plot is several facets. What sort of patterns are there in terms of the earliest year that a country appears in the data?

Solution

The countries are pretty evenly distributed across the facets, which means that there are roughly similar numbers of countries regularly joining their data into the collection.

heights <- as_tsibble(heights,
                      index = year,
                      key = country,
                      regular = FALSE)
set.seed(530)
ggplot(heights, aes(x = year,
           y = height_cm,
           group = country)) + 
  geom_line() + 
  facet_strata(along = -year)

Compute the three number summary (min, median, max) for each country. Make density plots of these statistics, overlaid in a single plot, and a parallel coordinate plot of these three statistics. What is the average minimum (median, maximum) height across countries? Are there some countries who have roughly the same minimum, median and maximum height?

Solution

The average minimum height is about 164cm, median is about 168cm and tallest is about 172cm. The maximum height appears to be bimodal, with a small peak around 178cm.

Most countries have the expected pattern of increasing heights from minimum, median to maximum. There are a few which have very similar values of these, though, which is a bit surprising. It means that there has been no change in these metrics over time.

heights_three <- heights %>%
  features(height_cm, c(
    min = min,
    median = median,
    max = max
  ))
heights_three_l <- heights_three %>% 
  pivot_longer(cols = min:max,
               names_to = "feature",
               values_to = "value")

p1 <- heights_three_l %>% 
  ggplot(aes(x = value,
             fill = feature)) + 
  geom_density(alpha = 0.5) +
  labs(x = "Value",
       y = "Density",
       fill = "Feature") + 
  scale_fill_discrete_qualitative(palette = "Dark 3") +
  xlab("Height") +
  ylab("") +
  theme(legend.position = "none",
        aspect.ratio = 1)

p2 <- heights_three_l %>% 
 ggplot(aes(x = feature,
            y = value,
             group = country)) + 
  geom_line(alpha = 0.4) +
  xlab("") +
  ylab("Height") +
  theme(aspect.ratio = 1)

heights_three <- heights_three %>% 
  mutate(country = factor(country)) %>%
  mutate(country = fct_reorder(country, median)) 
p3 <- heights_three %>%
    ggplot() + 
    geom_point(aes(x = country,
           y = median)) +
    geom_errorbar(aes(x = country, 
                      ymin=min, ymax=max), 
                  alpha = 0.6, width=0) +
    xlab("") + ylab("heights") +
    coord_flip() +
  theme(axis.text.y = element_text(size=6))
 
design <- "
113
223
##3
##3"
p1 + p2 + p3 + 
  plot_layout(design = design)

Which country has the tallest men? Which country has highest median male height? Which country has the shortest men? Would you say that the distribution of heights within a country is similar for all countries?

Solution

Denmark has the tallest man (max). Estonia has the tallest median height. Papua New Guinea has the shortest men, on all metrics. The distribution of heights over the years is not the same for each country.

👋 Finishing up

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.