ETC5521 Tutorial 11

Exploring spatiotemporal data

Author

Prof. Di Cook

Published

October 12, 2023

🎯 Objectives

This tutorial practices rearranging spatiotemporal data to focus on spatial or temporal patterns, and constructing choropleth maps and cartograms.

🔧 Preparation

install.packages(c("tidyverse","here","lubridate","GGally","tsibble","forcats","cartogram","sf","patchwork","ggthemes", "sugarbag", "viridis", "rmapshaper"))
remotes::install_github("runapp-aus/strayr")
remotes::install_github("huizezhang-sherry/cubble")
  • Open your RStudio Project for this unit (the one you created in week 1, named eda or ETC5521).
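
The solution code below also assumes these packages are attached. A minimal setup chunk might look like the following (readxl ships with the tidyverse installation but must be attached separately for the population spreadsheet in Exercise 2, and map_data() in Exercise 1 additionally needs the maps package installed):

library(tidyverse)
library(lubridate)
library(GGally)      # nasa data
library(cubble)      # cubble objects and glyph maps
library(tsibble)
library(sf)
library(cartogram)
library(rmapshaper)
library(sugarbag)
library(readxl)
library(ggthemes)    # theme_map()
library(strayr)      # read_absmap()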

Exercise 1: Gridded spatiotemporal data

Conduct a spatiotemporal analysis of ozone measurements over Central America, following the analysis of temperature provided in the class lecture notes.

a. Make a single map

Load the nasa data from the GGally package, and make a map of ozone for January 1995, overlaid on a map of the geographic area. What do you learn about the spatial distribution of ozone?

The high concentrations of ozone are at the highest latitudes. The lowest are close to the equator, and there is a small increase in values in the southern hemisphere. The trend is primarily north-south, and does not differ between land and sea.

data(nasa) # from GGally package

nasa_cb <- as_cubble(as_tibble(nasa), 
                     key=id, 
                     index=time, 
                     coords=c(long, lat))

sth_america <- map_data("world") %>%
  filter(between(long, -115, -53), between(lat, -20.5, 41))
nasa_cb %>% face_temporal() %>%
  filter(month == 1, year == 1995) %>% 
  select(id, time, ozone) %>%
  unfold(long, lat) %>%
  ggplot() + 
  geom_tile(aes(x=long, y=lat, fill=ozone)) +
  geom_path(data=sth_america, 
            aes(x=long, y=lat, group=group), 
            colour="white", linewidth=1) +
  scale_fill_viridis_c("ozone", option = "magma") +
  theme_map() +
  theme(legend.position = "bottom") +
  ggtitle("January 1995")

b. Display the map over time

Generate maps of ozone for the full time period by facetting on month and year. Why do you think the plot was organised so that months are in columns and years in rows? What do you learn about the temporal changes in the spatial distribution of ozone?

The primary comparison is the same month across years, which we might expect to be fairly similar; reading down a column makes this comparison easier. Reading across a row allows comparison of the seasonal pattern within a year.

There is a small seasonal pattern: there is a decrease in values in the northern hemisphere in the late northern-hemisphere summer (Jul, Aug, Sep), and an increase during these months around the equator. Because the latitude range does not extend as far south as north, we cannot see whether the ozone values are similarly high in the south as in the north for corresponding distances from the equator. The pattern remains mostly a north-south trend rather than an east-west or land-sea trend. There is not a lot of difference across years: perhaps slightly increased values extending further towards the equator from the northern latitudes in the summer months.

nasa_cb %>% face_temporal() %>%
  select(id, time, month, year, ozone) %>%
  unfold(long, lat) %>%
  ggplot() + 
  geom_tile(aes(x=long, y=lat, fill=ozone)) +
  facet_grid(year~month) +
  scale_fill_viridis_c("ozone",
                       option = "magma") +
  coord_map() +
  theme_map() +
  theme(legend.position="bottom")

c. Glyphmap

Make two glyphmaps of ozone, one with time series at each spatial grid point, scaled globally, and the other using polar coordinates at each spatial grid point, scaled individually. What do you learn about the temporal trend, and seasonal patterns of ozone over this geographic region?

The dominant pattern in the (time series) glyphmaps is high values in the north, decreasing values moving south, and then a slight increase at the most southern locations. There may be some seasonality, because the series show the up-down pattern repeated six times for the six years, visible at all locations.

In the seasonal (polar) glyphmaps, the seasonality is the strongest pattern at all locations. Some years tend to have more ozone than others, because the six “petals” are different sizes in some locations. There is possibly a slight land-sea difference in seasonality. It’s also possible to see the shift in seasons: the peaks in the south occur at different times of the year than the peaks in the north.

# The glyph layers take the nasa tibble directly, since it already
# has the long, lat, day and ozone columns needed for the glyphs
ggplot() +
  geom_polygon(data=sth_america, 
            aes(x=long, y=lat, group=group), 
            fill="#014221", alpha=0.5, colour="#ffffff") +
  cubble::geom_glyph_box(data=nasa, aes(x_major = long, x_minor = day,
            y_major = lat, y_minor = ozone), fill=NA) +
  cubble::geom_glyph(data=nasa, aes(x_major = long, x_minor = day,
            y_major = lat, y_minor = ozone)) +
  theme_map() 

# Polar coordinates for the glyphs emphasise the seasonal pattern
ggplot() + 
  geom_polygon(data=sth_america, 
            aes(x=long, y=lat, group=group), 
            fill="#014221", alpha=0.5, colour="#ffffff") +
  cubble::geom_glyph_box(data=nasa, 
            aes(x_major = long, x_minor = day,
                y_major = lat, y_minor = ozone), fill=NA) +
  cubble::geom_glyph(data=nasa, 
            aes(x_major = long, x_minor = day,
                y_major = lat, y_minor = ozone), 
            polar = TRUE) +
  theme_map()

Exercise 2: Melbourne Covid-19 outbreak

In Melbourne we were in strict lockdown for much of 2020, and large chunks of 2021. Each week we got our hopes up that restrictions might be eased, and each week those hopes were dashed by announcements keeping the restrictions a little longer. The data we have collected here are the case counts by Victorian local government area (LGA) since the beginning of July 2020. We will examine the spatiotemporal distribution of these counts.

Working with spatial data is always painful! It almost always requires some ugly code.

  • Part of the reason for the difficulty is the use of special data objects that describe maps. There are several different choices, and some packages and tools use one while others use another, so not all tools work together. The sf package helps enormously, but when you run into errors it can be hard to debug.
  • Another reason is that map objects can be very large, which makes sense for accurate mapping, but for data analysis and visualisation we’d rather have smaller, even if slightly less accurate, spatial objects. It can be helpful to thin out map data before doing further analysis - you need special tools for this, e.g. rmapshaper. We don’t really need this for the exercises here, because the strayr version of the LGAs is already thinned.
  • Another problem commonly encountered is that there are numerous coordinate systems, and types of projections of the 3D globe onto a 2D canvas. We have become accustomed to lat/long, but like time it is an awkward scale to compute on, because E/W and N/S need to be translated into positive and negative values. Projected coordinates in metres, such as Universal Transverse Mercator (UTM), are the standard for computation, but they are far less intuitive to use (see the small sketch after this list).
  • And yet another reason is that keys linking data tables and spatial tables may not match perfectly because there are often synonyms or slightly different name preferences between different data collectors.
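
To make the coordinate-system issue concrete, here is a small sketch using sf (the point is just an illustrative location, roughly Melbourne’s CBD, not part of the tutorial data; EPSG:3395 is the same metre-based Mercator projection used in the cartogram step later):

# A point in lat/long (WGS84, EPSG:4326) ...
melb <- st_sfc(st_point(c(144.9631, -37.8136)), crs = 4326)
melb
# ... and the same location re-expressed in metres (EPSG:3395)
st_transform(melb, 3395)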

The code for all the analysis is provided for you in the solution. We recommend running the code in steps to see what it is doing, and why the mutating and text manipulations are necessary. Talk about the code with each other to help you understand it.

a. Read case counts for 2020

The file melb_lga_covid.csv contains the cases by LGA. Read the data in and inspect the result. You should find that some variables are of type chr because “null” has been used to code entries on some days. This needs fixing, and missings should also be converted to 0. Why does it make sense to substitute missings with 0 here?

The NAs really have to be 0s. It’s likely that the cells were left blank when the numbers were recorded because there were no cases that day.

# Read the data
# Replace null with 0, for three LGAs
# Convert to long form to join with polygons
# Make the date variables a proper date
# Set NAs to 0, this is a reasonable assumption
covid <- read_csv("https://raw.githubusercontent.com/numbats/eda/master/data/melb_lga_covid.csv") %>%
  mutate(Buloke = as.numeric(ifelse(Buloke == "null", "0", Buloke))) %>%
  mutate(Hindmarsh = as.numeric(ifelse(Hindmarsh == "null", "0", Hindmarsh))) %>%
  mutate(Towong = as.numeric(ifelse(Towong == "null", "0", Towong))) %>%
  pivot_longer(cols = Alpine:Yarriambiack, names_to="NAME", values_to="cases") %>%
  mutate(Date = ydm(paste0("2020/",Date))) %>%
  mutate(cases=replace_na(cases, 0))
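
To see which columns were read in as chr before the cleaning above, a quick check could look like this (covid_raw is just a hypothetical name for the unprocessed read):

covid_raw <- read_csv("https://raw.githubusercontent.com/numbats/eda/master/data/melb_lga_covid.csv")
# The three LGAs fixed above (Buloke, Hindmarsh, Towong) should show up here
covid_raw %>% select(where(is.character)) %>% names()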

b. Check the data

Check the case counts to learn whether they are daily or cumulative. The best way to do this is to select one LGA where there were substantial cases, and make a time series plot. If the counts are cumulative, calculate the daily counts, and re-check the temporal trend for your chosen LGA. Describe the temporal trend, and any visible artifacts.

This is cumulative data. Once re-calculated as daily counts, an artifact that emerges is that there are a few small negative values. These could occur if the previous day’s numbers were adjusted as new information on cases or duplicates was found.

# Check the case counts
covid %>% filter(NAME == "Brimbank") %>%
  ggplot(aes(x=Date, y=cases)) +
    geom_point()

# Case counts are cumulative, so take lags to get daily case counts
covid <- covid %>%
  group_by(NAME) %>%
  mutate(new_cases = cases - dplyr::lag(cases)) %>%
  na.omit()

# Check the case counts
covid %>% filter(NAME == "Brimbank") %>%
  ggplot(aes(x=Date, y=new_cases)) +
    geom_col() 
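
To quantify the negative-adjustment artifact mentioned above, a quick count of the negative daily values (using the covid object with new_cases computed above):

covid %>%
  ungroup() %>%
  filter(new_cases < 0) %>%
  count(NAME, sort = TRUE)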

c. Spatial polygons size

Now let’s get polygon data of Victorian LGAs using the strayr package. The map is already fairly small, so it doesn’t need any more thinning, but we’ll look at how thinning works.

Get a copy of the lga2018 using strayr::read_absmap(). Save the resulting data as an .rda file, and plot the map.

Now run rmapshaper::ms_simplify(), saving it as a different object. Save the object as an .rda file, and plot the map.

What is the difference in file size before and after thinning? Can you see a difference in the map?

Before thinning the file is 523 KB, and after thinning it is just 33 KB. That’s a much smaller file.

The thinning has removed some small islands, and smoothed the boundaries between LGAs.

# Read the LGA data from the strayr package. 
# This has LGAs for all of Australia, so we filter to 
# Victorian LGAs, avoiding LGAs from other states with 
# the same name, and make the names match the covid 
# data names. The regex (applied in the next part) 
# removes the "(state/LGA type)" text strings.
# Good reference: https://r-spatial.github.io/sf/articles/sf1.html
lga <- strayr::read_absmap("lga2018") |>
  rename(lga = lga_name_2018) |>
  filter(state_name_2016 == "Victoria") 
save(lga, file="data/lga.rda")
ggplot(lga) + geom_sf() + theme_map()

lga_sm <- ms_simplify(lga)
save(lga_sm, file="data/lga_sm.rda")
ggplot(lga_sm) + geom_sf() + theme_map()
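
One way to confirm the size difference is to compare the saved files directly (a quick check, assuming both .rda files were written to data/ as above):

file.size("data/lga.rda")     # full-resolution LGAs
file.size("data/lga_sm.rda")  # thinned LGAs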

d. Spatial polygons matching

Now let’s match the polygon data of Victorian LGAs to the COVID counts. The cubble::check_key() function can be used to check whether the keys match between the spatial and temporal data sets.

You will find that we need to fix some names of LGAs, even though cubble does a pretty good job working out which are supposed to match.

# Turn the covid data into a tsibble, because this is 
# the temporal component of the spatiotemporal data
covid <- covid %>%
  select(-cases) %>%
  rename(lga = NAME, date=Date, cases = new_cases) 
covid_ts <- as_tsibble(covid, key=lga, index=date)

# Check the name matching
covid_matching <- check_key(spatial = lga, temporal = covid_ts)
covid_matching
$paired
# A tibble: 0 × 0

$potential_pairs
# A tibble: 78 × 2
   spatial        temporal  
   <chr>          <chr>     
 1 Alpine (S)     Alpine    
 2 Ararat (RC)    Ararat    
 3 Ballarat (C)   Ballarat  
 4 Banyule (C)    Banyule   
 5 Bass Coast (S) Bass Coast
 6 Baw Baw (S)    Baw Baw   
 7 Bayside (C)    Bayside   
 8 Benalla (RC)   Benalla   
 9 Boroondara (C) Boroondara
10 Brimbank (C)   Brimbank  
# ℹ 68 more rows

$others
$others$spatial
[1] "Colac-Otway (S)"                       
[2] "Unincorporated Vic"                    
[3] "No usual address (Vic.)"               
[4] "Migratory - Offshore - Shipping (Vic.)"

$others$temporal
[1] "Colac Otway"


attr(,"class")
[1] "key_tbl" "list"   
# Fix matching
lga <- lga %>% 
  mutate(lga = ifelse(lga == "Colac-Otway (S)", "Colac Otway (S)", lga)) %>%
  filter(!(lga %in% covid_matching$others$spatial)) %>%
  mutate(lga = str_replace(lga, " \\(.+\\)", "")) # Remove (.)

# Re-do the name matching
covid_matching <- check_key(spatial = lga, temporal = covid_ts)
covid_matching
$paired
# A tibble: 79 × 2
   spatial    temporal  
   <chr>      <chr>     
 1 Alpine     Alpine    
 2 Ararat     Ararat    
 3 Ballarat   Ballarat  
 4 Banyule    Banyule   
 5 Bass Coast Bass Coast
 6 Baw Baw    Baw Baw   
 7 Bayside    Bayside   
 8 Benalla    Benalla   
 9 Boroondara Boroondara
10 Brimbank   Brimbank  
# ℹ 69 more rows

$potential_pairs
# A tibble: 0 × 0

$others
$others$temporal
character(0)

$others$spatial
character(0)


attr(,"class")
[1] "key_tbl" "list"   
covid_cb <- make_cubble(
  spatial = lga, temporal = covid_ts,
  potential_match = covid_matching, index = date)
Warning: st_centroid assumes attributes are constant over geometries
Warning: Unknown or uninitialised column: `temporal`.
Warning: Unknown or uninitialised column: `spatial`.

e. Choropleth map

Sum the counts over the time period for each LGA, merge the COVID data with the map polygons (LGA) and create a choropleth map. The LGA data is an sf object, so geom_sf will automatically use the geometry column from the object to draw the spatial polygons. Where was the highest COVID incidence?

The high count LGAs are all in Melbourne, mostly in the western suburbs.

# Aggregate the cases over the time period
covid_tot <- covid_cb %>%
  face_temporal() %>%
  summarise(totcases = sum(cases, na.rm = TRUE))

covid_tot <- covid_tot %>%
  left_join(lga) %>%
  st_as_sf()

# Make choropleth map, with appropriate colour palette
ggplot(covid_tot) + 
  geom_sf(aes(fill = totcases, label=lga), colour="white") + 
  scale_fill_distiller("Cases", palette = "YlOrRd",
                       direction=1) + 
  theme_map() +
  theme(legend.position="bottom")

# Make it interactive
# plotly::ggplotly() 

f. Cartogram

To make a population-transformed polygon we need to get population data for each LGA. The file VIF2019_Population_Service_Ages_LGA_2036.xlsx has been extracted from the Vic Gov web site. It is a complicated xlsx file, with the data in sheet 3, starting 13 rows down. The readxl package is handy here for extracting the population data needed. You’ll need to join the population counts to the map data to make a cartogram. Once you have the transformed polygon data, the same plotting code that created the choropleth map can be used.

Interestingly, the populations of the LGAs are quite different, with densely populated LGAs in Melbourne. These get greatly enlarged by the algorithm, while LGA polygons from rural areas shrink. This makes it easier to see the LGAs with high case counts, and also all of the LGAs in the city with low counts. (Note: the white inner-city polygons are not actually LGAs, just unfortunate artifacts of the cartogram transformation.)

# Incorporate population data to make cartogram
# Population from https://www.planning.vic.gov.au/land-use-and-population-research/victoria-in-future/tab-pages/victoria-in-future-data-tables
# Data can be downloaded from https://github.com/numbats/eda/blob/master/data/VIF2019_Population_Service_Ages_LGA_2036.xlsx
pop <- read_xlsx("data/VIF2019_Population_Service_Ages_LGA_2036.xlsx", sheet=3, skip=13, col_names = FALSE) %>%
  select(`...4`, `...22`) %>%
  rename(lga = `...4`, pop=`...22`) %>%
  filter(lga != "Unincorporated Vic") %>% 
  mutate(lga = str_replace(lga, " \\(.+\\)", "")) %>%
  mutate(lga = ifelse(lga == "Colac-Otway", "Colac Otway", lga)) 

covid_tot <- covid_tot %>%
  left_join(pop) 

# Compute additional statistics
covid_tot <- covid_tot %>%
  mutate(cases_per10k = totcases/pop*10000,
         lcases = log10(totcases + 1)) 
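
The derived cases_per10k variable isn’t plotted in the provided solution, but swapping it into the choropleth code from part e gives a per-capita view (a sketch, reusing the covid_tot object built above):

ggplot(covid_tot) + 
  geom_sf(aes(fill = cases_per10k), colour="white") + 
  scale_fill_distiller("Cases per 10,000", palette = "YlOrRd",
                       direction=1) + 
  theme_map() +
  theme(legend.position="bottom")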

# Make a contiguous cartogram
# The sf object is in lat/long (WGS84) which is 
# an angle on the globe, but the cartogram 
# needs spatial locations in metres/numeric 
# as given by EPSG:3395.
# So we convert to metres and then back to 
# lat/long with st_transform
covid_tot_carto <- covid_tot %>% 
  st_transform(3395) %>% 
  cartogram_cont("pop") %>%
  st_transform("WGS84") 
# The cartogram object contains a mix of MULTIPOLYGON
# and POLYGON - yes, amazing! - st_cast() forces all 
# to be MULTIPOLYGON and is necessary for plotly 
covid_tot_carto <- st_cast(covid_tot_carto, "MULTIPOLYGON") 
# st_geometry() is a good function for checking
# the projection (lat/long vs metres) and POLYGON

ggplot(covid_tot_carto) + 
  geom_sf(aes(fill = totcases, label=lga), colour="white") + 
  scale_fill_distiller("Cases", palette = "YlOrRd",
                       direction=1) + 
  theme_map() +
  theme(legend.position="bottom") 

# ggplotly()

g. Hexagon tile map

Use the provided code to make a hexagon tile map, with functions from the sugarbag package. Is it easier to see the spatial distribution of incidence from the hexagon tile map, the choropleth map, or the cartogram?

The hexagon tile map leaves the geography more recognisable than the cartogram, and it is easier to see the high-incidence LGAs, and that they are in the city.

# Placement of hexmaps depends on position relative to
# Melbourne central
data(capital_cities)
covid_hexmap <- create_hexmap(
  shp = covid_tot,
  sf_id = "lga",
  focal_points = capital_cities, verbose = TRUE)
# This shows the centroids of the hexagons
ggplot(covid_hexmap, aes(x=hex_long, y=hex_lat)) +
  geom_point()

# Hexagons are made with the `fortify_hexagon` function
covid_hexmap_poly <- covid_hexmap %>%
  fortify_hexagon(sf_id = "lga", hex_size = 0.1869) %>%
  left_join(covid_tot, by="lga") # hexmap code removed cases!
ggplot() +
  geom_sf(data=covid_tot, 
          fill = "grey95", colour = "white", size=0.1) +
  geom_polygon(data=covid_hexmap_poly, 
               aes(x=long, y=lat, group=hex_id, 
                   fill = totcases, 
                   colour = totcases,
                   label=lga), size=0.2) +
  scale_fill_distiller("Cases", palette = "YlOrRd",
                       direction=1) +
  scale_colour_distiller("Cases", palette = "YlOrRd",
                       direction=1) +
  theme_map() +
  theme(legend.position="bottom")

# ggplotly()

👋 Finishing up

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.