EMODNET Benthos Trends

Tom Webb

2023

Temporal trends from presence-absence data in European benthos

Data preparation

This section describes the workflow for assembling the data sets required to produce spatial estimates of temporal turnover (species loss and gain, beta diversity) from the EMODnet macrobenthos presence-absence product.

First, load required packages:

library(tidyverse)
library(ncdf4)
library(here)
library(worrms)

The main presence-absence dataset is provided as a NetCDF file. This will be served via EMODnet but is accessed locally here. First, establish a connection to the file:

nc_fil <- nc_open(here("data", "raw_data/Macrobenthos_Eur_Seas_Pres_Abs_v0-4.nc"))

Assembling a data set of unique sampling events

Assemble a data frame of sampling events, composed of latitude, longitude, full date, and year from the .nc file:

sample_events <- tibble(
  lat = as.vector(ncvar_get(nc_fil, "lat")),
  lon = as.vector(ncvar_get(nc_fil,"lon")),
  date = lubridate::ymd(as.Date(ncvar_get(nc_fil, "Date"), origin = "1970-01-01 00:00:00")),
  year = lubridate::year(as.Date(ncvar_get(nc_fil, "Date"), origin = "1970-01-01 00:00:00"))
)

Note that there are fewer distinct sampling events (289634) than there are total sampling events (292332). Note also that a few sampling events have dates in the future:

## # A tibble: 4 Ă— 4
##     lat   lon date        year
##   <dbl> <dbl> <date>     <dbl>
## 1  53.3 -3.17 2035-05-19  2035
## 2  51.6 -4.89 2032-05-25  2032
## 3  59.1 -5.83 2030-08-10  2030
## 4  59.1 -5.81 2028-09-29  2028

Assembling a dataset of unique species IDs

First get all taxon names and WoRMS Aphia IDs from the nc file:

taxon_name <- ncvar_get(nc_fil, "Taxon_Name")
aphia_id <- ncvar_get(nc_fil,"AphiaID") %>% word(start = -1, sep = ":") %>% as.integer()

For this product, we use only taxa at the species rank only. To do this this we need to get the WoRMS classification for each taxon, and take the final entry in this. For this we use a simple function:

get_worms_rank <- function(aphiaid){
  classif <- wm_classification(aphiaid) %>% slice(n())
  classif
}

To run for one taxon:

get_worms_rank(aphia_id[1000])
## # A tibble: 1 Ă— 3
##   AphiaID rank    scientificname
##     <int> <chr>   <chr>         
## 1  855675 Species Cylista undata

This will run the function over all taxa. WARNING: this takes ~40 minutes on my machine:

taxo <- aphia_id %>% map_df(get_worms_rank, .progress = TRUE)

Because this is slow to run, the output was written to file (write_csv(taxo, file = here("data", "EMODnet-Biology-Benthos-European-Seas/benthos_taxo_rank.csv"))), so it can simply be read in directly here:

taxo <- read_csv(here("data", "derived_data/benthos_taxo_rank.csv"))

What ranks are present?

taxo %>% count(rank) %>% arrange(desc(n)) %>% print(n = 100)
## # A tibble: 23 Ă— 2
##    rank             n
##    <chr>        <int>
##  1 Species       8149
##  2 Genus         1679
##  3 Family         529
##  4 Order           89
##  5 Subspecies      83
##  6 Subfamily       64
##  7 Superfamily     62
##  8 Class           36
##  9 Suborder        32
## 10 Subclass        20
## 11 Phylum          17
## 12 Subgenus        16
## 13 Infraorder      13
## 14 Subphylum        8
## 15 Superorder       8
## 16 Kingdom          7
## 17 Infraclass       6
## 18 Tribe            5
## 19 Superclass       2
## 20 Infrakingdom     1
## 21 Parvorder        1
## 22 Subterclass      1
## 23 Variety          1

To get a list of only species-rank taxa (adding an index for ease of matching later):

spp <- taxo %>%
  filter(rank == "Species") %>% 
  mutate(aphia_index = match(.$AphiaID, taxo$AphiaID))

This results in 8149 unique species-level benthic taxa.

Get occurrences of a species by sampling event

We can get presence-absence for a single species directly from the nc file:

presabs <- ncvar_get(nc_fil,"Pres_abs",
                     start = c(1, spp$aphia_index[1]), count = c(-1,1))

To run this across species, wrap it in a function:

get_presabs_sp <- function(sp_index, sp_df = spp, samps = sample_events){
  
  # get pres-abs data
  presabs <- ncvar_get(nc_fil, "Pres_abs",
                       start = c(1, sp_index), count = c(-1,1))
  # assemble data frame
  ret <- tibble(aphia_id = spp$AphiaID[spp$aphia_index == sp_index],
                presabs = presabs)
  ret <- samps %>% bind_cols(ret)
  
  ret
}

Run this for a single species and get a count of presences and absences:

presabs <- get_presabs_sp(sp_index = spp$aphia_index[1])
presabs %>% count(presabs)
## # A tibble: 3 Ă— 2
##     presabs      n
##   <int[1d]>  <int>
## 1         0 230910
## 2         1  33777
## 3        NA  27645

NB: 1 = Present, 0 = looked for but absent, NA = not looked for. A value of ‘NA’ indicates that at that sampling event, only a subset of the full benthos community was effectively surveyed. To work out how big an issue this is, we can get species-level summaries of numbers of presences, absences, and NAs, in a function:

get_presabs_sp_summ <- function(sp_index, sp_df = spp){
  
  # get pres-abs data
  presabs <- ncvar_get(nc_fil, "Pres_abs",
                       start = c(1, sp_index), count = c(-1,1))
  # assemble data frame
    ret <- tibble(presabs = presabs) %>%
      count(presabs) %>%
      mutate(aphia_id = spp$AphiaID[spp$aphia_index == sp_index]) %>% 
      dplyr::select(aphia_id, everything())
  
  ret
}

What this does for a single species:

get_presabs_sp_summ(sp_index = spp$aphia_index[1])
## # A tibble: 3 Ă— 3
##   aphia_id   presabs      n
##      <dbl> <int[1d]>  <int>
## 1   141579         0 230910
## 2   141579         1  33777
## 3   141579        NA  27645

Run across species and summarise (takes a minute or two to run):

pres_abs_summ <- spp %>% pull(aphia_index) %>%
  purrr::map(get_presabs_sp_summ, .progress = TRUE) %>%
  bind_rows()

pres_abs_summ %>% group_by(presabs) %>% summarise(n_cases = sum(n))
## # A tibble: 3 Ă— 2
##     presabs    n_cases
##   <int[1d]>      <int>
## 1         0 1989927446
## 2         1    3314018
## 3        NA  388972004

So there is a manageable number of presences (~3.3M) but the number of absences (~2bn) and NAs (~389M) will cause issues.

Given that NA indicates an incomplete survey of the benthic community, the decision taken for this product is not to consider sampling events that include NA for any species. First, add a new variable to sample_events, includes_na, setting this to be FALSE for all sampling events initially:

sample_events <- sample_events %>% mutate(includes_na = FALSE)

This function finds for a given species which (if any) sampling events include NA, and changes the includes_na flag to TRUE for any events that do:

get_na_events <- function(sp_index, samps = sample_events){
  
  # get pres-abs data
  presabs <- ncvar_get(nc_fil, "Pres_abs",
                       start = c(1, sp_index), count = c(-1,1))
  # flag NAs in sample_events
  samps <- samps %>% mutate(includes_na = ifelse(is.na(presabs), TRUE, includes_na))
  samps
}

Then run this over all species, saving the results to a new object called na_events (takes a couple of minutes):

na_events <- sample_events
for(i in spp$aphia_index){
  na_events <- get_na_events(sp_index = i, samps = na_events)
}

What are the consequences of this?

na_events %>% count(includes_na)
## # A tibble: 2 Ă— 2
##   includes_na      n
##   <lgl[1d]>    <int>
## 1 FALSE       243848
## 2 TRUE         48484

So getting rid of these would remove 16.6% of all sample events. Are these biased in time?

Not obviously - although we may lose some of the much older events.

Final assembly of sample_events

To finalise sample_events, re-create it from na_events, add an id variable from the rownumber, and deal with the dates in the future (here simply by flagging them in the includes_na variable so they can be filtered out easily in one step):

sample_events <- na_events %>% 
  mutate(sample_id = row_number()) %>% 
  dplyr::select(sample_id, everything()) %>% 
  mutate(includes_na = ifelse(date > lubridate::now(), TRUE, includes_na))

This can then be written to file (write_csv(sample_events, file = here("data", "derived_data/sample_events.csv")) allowing the final assembled version of sample_events to be read in quickly here:

sample_events <- read_csv(here("data",
                               "derived_data/sample_events.csv"))
## Rows: 292332 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (4): sample_id, lat, lon, year
## lgl  (1): includes_na
## date (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Assemble species presence x site data frame

This assembles the long species presence x site df, for non-na sites only. It uses a revised version of the previous get_presabs_sp() function which excludes NA sites:

get_presabs_sp_na_rm <- function(sp_index, sp_df = spp, samps = sample_events){
  
  # get pres-abs data
  presabs <- ncvar_get(nc_fil, "Pres_abs",
                       start = c(1, sp_index), count = c(-1,1))
  # assemble data frame
  ret <- tibble(aphia_id = spp$AphiaID[spp$aphia_index == sp_index],
                presabs = presabs)
  ret <- samps %>% bind_cols(ret)
  
  # filter to presences and non-NA sites
  ret <- ret %>% filter(presabs == 1 & includes_na == FALSE) %>% 
    dplyr::select(sample_id, aphia_id)
  
  ret
}

Run for one species - this needs the spp table of species:

taxo <- read_csv(here("data", "derived_data/benthos_taxo_rank.csv"))
spp <- taxo %>% 
  filter(rank == "Species") %>% 
  mutate(aphia_index = match(.$AphiaID, taxo$AphiaID))

Then run the presence-absence function for a single species:

get_presabs_sp_na_rm(sp_index = spp$aphia_index[1])
## # A tibble: 31,099 Ă— 2
##    sample_id aphia_id
##        <dbl>    <dbl>
##  1       212   141579
##  2       266   141579
##  3       302   141579
##  4      2728   141579
##  5      2729   141579
##  6      2739   141579
##  7      2743   141579
##  8      2831   141579
##  9      3431   141579
## 10      4143   141579
## # … with 31,089 more rows

Now run over all species to generate a full dataframe of species presence at each site (excluding sites with NA values) - this takes around 5 minutes:

pres_df <- spp %>% pull(aphia_index) %>%
  purrr::map(get_presabs_sp_na_rm, .progress = TRUE) %>%
  bind_rows()

From this we can get (and plot) e.g. a summary of species occurrence frequencies:

sp_occ_summary <- pres_df %>% count(aphia_id) %>% arrange(n)
ggplot(sp_occ_summary) + geom_histogram(aes(x = n))

So for instance 1064 species occur in only a single sample; 5706 occur in more than one.

Before closing the nc connection, first extract the Coordinate Reference System:

presabs_crs <- ncatt_get(nc_fil, "crs")

Also worth saving the presence data (pres_df) to file for direct loading (write_csv(pres_df, file = here("data", "derived_data/pres_df.csv"))) allowing the final presence data frame to be read in quickly here:

pres_df <- read_csv(here("data",
                         "derived_data/pres_df.csv"))

Finally close the nc connection:

nc_close(nc_fil)

The two derived datasets, sample_events.csv and pres_df.csv are now available in the data/derived_data folder for use in calculating temporal turnove metrics as described in documents/benthos-trends-turnover.

Temporal turnover of benthic species in European Seas

This section describes the workflow for generating data and maps of spatial estimates of temporal turnover (species loss and gain, beta diversity) from the EMODnet macrobenthos presence-absence product. It uses processed versions of the dataset that are fully described above.

First, load required packages:

# basic data manipulation and visualisation
library(tidyverse)
library(here)
library(janitor)
library(viridis)
library(biscale)
# spatial data processing and mapping
library(sf)
library(terra)
library(tidyterra)
library(rnaturalearth)
library(RNetCDF)
# diversity and turnover analysis
library(vegan)
library(BAT)

Read in the datasets of sampling events and species presence that were generated in the previous document (benthos-trends-dataprep):

sample_events <- read_csv(here("data",
                               "derived_data/sample_events.csv"))
pres_df <- read_csv(here("data",
                         "derived_data/pres_df.csv"))

To retain all sampling events (including those which do not sample the full benthic community, i.e. which return NA occurrences for some species), first assign the full data set to a new object and then filter to remove events which include NA occurrences:

sample_events_includingNA <- sample_events
sample_events <- sample_events %>% filter(includes_na == FALSE)

Note that some sample events do not include any species from our presence data - for instance, they may only have recorded taxa at higher taxonomic levels. To restrict futher analyses to only those sampling events where we know we have species presences:

sample_events <- sample_events %>% filter(sample_id %in% unique(pres_df$sample_id))

To determine a suitable set of time periods for temporal correlations, first look at number of events in five year blocks:

sample_events <- sample_events %>% 
  mutate(year_5 = round(year/5)*5)
ggplot(sample_events) + geom_bar(aes(x = year_5))

To examine the spatial distribution of samples through time, we can make the data spatial (using the WGS84 EPG) and produce a quick plot:

ggplot(st_as_sf(sample_events, coords = c("lon", "lat"), crs = 4326)) +
  geom_sf(size = 0.1) +
  theme(axis.text.x = element_text(angle = 75, hjust = 1, vjust = 0.5, size = 6),
        axis.text.y = element_text(size = 6)) +
  facet_wrap(~ year_5)

This shows not only more individual events through time, but also greater geographical spread, especially since ~1970s. It also suggests that regular time periods (e.g. every 5 or 10 years) will not be appropriate for estimating turnover. To try to get reasonably equal coverage, the following time periods are defined: before 1990, 1990-1999, 2000-2004, 2005-2009, 2010-2014, and 2015 and after:

sample_events <- sample_events %>%
  mutate(time_slice = case_when(
    year < 1990 ~ 1,
    year >= 1990 & year <2000 ~ 2,
    year >= 2000 & year <2005 ~ 3,
    year >= 2005 & year <2010 ~ 4,
    year >= 2010 & year <2015 ~ 5,
    year >= 2015 ~ 6
  ))

sample_events %>% count(time_slice)
## # A tibble: 6 Ă— 2
##   time_slice     n
##        <dbl> <int>
## 1          1 26192
## 2          2 34900
## 3          3 28246
## 4          4 36915
## 5          5 79314
## 6          6 18162

Create gridded sampling events dataset

The next step is to grid the presence-absence data. This uses a 1 degree grid to increase the number of grid cells that have samples at multiple time points. To set the extent of the grid we use the geographic extent of the sampling events data:

sample_events %>% select(lon, lat) %>% summary()
##       lon                lat       
##  Min.   :-33.3196   Min.   :28.19  
##  1st Qu.: -4.7088   1st Qu.:51.23  
##  Median : -0.2602   Median :53.33  
##  Mean   :  2.2960   Mean   :53.76  
##  3rd Qu.:  4.1272   3rd Qu.:56.12  
##  Max.   : 58.9430   Max.   :81.12

Use this to set a sensible extent and create a raster based on this:

extent_tb <- tibble(lon = c(-34, 59), lat = c(28, 82)) %>% 
  st_as_sf(coords = c("lon", "lat"), crs = 4326)
r <- rast(ext(extent_tb), resolution = 1, crs = crs(extent_tb))

Then create a raster of sampling events by time, with a layer for each of the six time slices, which we then fill from sampling events:

# create 6-layer raster
samp_event_by_time_r <- rep(r, 6)

# vector of names for time slices
time_slices <- c("pre1990", "1990s", "2000-2004", "2005-2009", "2010-2014", "post2015")

# fill the raster from sample_events
for(i in 1:nlyr(samp_event_by_time_r)){
  samp_event_by_time_r[[i]] <- sample_events %>%
    filter(time_slice == i) %>% 
    dplyr::select(lon, lat) %>% 
    as.matrix() %>% 
    rasterize(r, fun = "length") %>%
    setNames(time_slices[i])
}

To map these, first use the rnaturalearth package to get a world coastline basemap:

world <- rnaturalearth::ne_countries(scale = "medium", returnclass = "sf") %>% 
  st_transform(crs = crs(r))

Then produce the map:

ggplot() +
  geom_spatraster(data = log10(samp_event_by_time_r)) +
  scale_fill_viridis_c(name = "log N Samples") +
  geom_sf(data = world, colour = "grey95", fill = "grey85", lwd = 0.1, alpha = 0.5) +
  xlim(as.vector(ext(r)[1:2])) +
  ylim(as.vector(ext(r)[3:4])) +
  coord_sf(expand = FALSE) +
  facet_wrap(~lyr)

Add a cell index from this grid back into the sampling events data:

sample_events <- sample_events %>% 
  mutate(sample_cell = cellFromXY(r,
                                  as.matrix(dplyr::select(sample_events, lon, lat))))

We can now check the number of time periods with samples per grid cell, summarised here:

sample_events %>%
  dplyr::select(sample_cell, time_slice) %>%
  distinct() %>%
  count(sample_cell) %>%
  count(n) %>% 
  rename(n_time_periods = n, n_grid_cells = nn) %>% 
  mutate(p_cells = round(n_grid_cells / sum(n_grid_cells), 2))
## # A tibble: 6 Ă— 3
##   n_time_periods n_grid_cells p_cells
##            <int>        <int>   <dbl>
## 1              1          316    0.43
## 2              2           82    0.11
## 3              3           64    0.09
## 4              4           47    0.06
## 5              5           65    0.09
## 6              6          156    0.21

So of the 730 grid cells with at least one sample in, around 43% have samples from only one of our defined time periods, and around 21% have samples from all six time periods. We can show the spatial distribution of these - here, separating cells sampled only once (i.e., those that cannot contribute to measures of turnover or temporal change), labelled single_t, and those sampled at least two times, labelled multi_t. This code creates a raster with two layers, one to display the single_t cells and one for the multi_t cells.

cells_single_time <- 
  sample_events %>% select(sample_cell, time_slice) %>% 
  distinct() %>% 
  count(sample_cell) %>% filter(n == 1) %>%
  pull(sample_cell)

cells_multi_time <- 
  sample_events %>% select(sample_cell, time_slice) %>% 
  distinct() %>%  
  count(sample_cell) %>% filter(n > 1) %>%
  pull(sample_cell)

single_multi_time <- c(r,r) %>% setNames(c("single_t", "multi_t"))

values(single_multi_time[[1]]) <- NA
values(single_multi_time[[1]])[cells_single_time] <- 1

values(single_multi_time[[2]]) <- NA
values(single_multi_time[[2]])[cells_multi_time] <- 1

(single_v_multi_t_cell_map <- ggplot() +
  geom_spatraster(data = single_multi_time) +
  scale_fill_viridis_c(option = "turbo") +
  geom_sf(data = world, colour = "grey95", fill = "grey85", lwd = 0.1, alpha = 0.5) +
  xlim(as.vector(ext(r)[1:2])) +
  ylim(as.vector(ext(r)[3:4])) +
  coord_sf(expand = FALSE) +
  theme(legend.position = "none") +
  facet_wrap(~lyr)
)

Gridded presence-absence

We can now join the sampling events data to the main species presence dataset and summarise by grid cell:

gridded_occs <- pres_df %>%
  left_join(sample_events, join_by(sample_id)) %>%
  group_by(aphia_id, time_slice, sample_cell) %>% 
  summarise(sp_occs = n()) %>% 
  arrange(sample_cell, time_slice) %>% ungroup()

For this to be useful for temporal turnover analyses, we need to filter only those cells with samples in more than one time period, using the cells_multi_time object created above:

gridded_occs <- gridded_occs %>%
  filter(sample_cell %in% cells_multi_time)

This can be used to get occurrences of an individual species per grid cell through time, e.g.:

gridded_occs %>%
  filter(aphia_id == 101160) %>% arrange(sample_cell, time_slice)
## # A tibble: 214 Ă— 4
##    aphia_id time_slice sample_cell sp_occs
##       <dbl>      <dbl>       <dbl>   <int>
##  1   101160          3         982       2
##  2   101160          3         992       1
##  3   101160          5         994       1
##  4   101160          2        1075       1
##  5   101160          2        1438       1
##  6   101160          4        1462       1
##  7   101160          2        1463       3
##  8   101160          1        1464       1
##  9   101160          1        1562       2
## 10   101160          2        1562       2
## # … with 204 more rows

Which can be summarised, for example showing the number of individual grid cells occupied by this species in different numbers of time periods:

gridded_occs %>%
  filter(aphia_id == 101160) %>%
  count(sample_cell) %>% 
  count(n) %>% 
  rename(n_time_periods = n, n_grid_cells = nn)
## # A tibble: 6 Ă— 2
##   n_time_periods n_grid_cells
##            <int>        <int>
## 1              1           47
## 2              2           16
## 3              3            7
## 4              4           12
## 5              5            6
## 6              6            6

However this does not account for the identity of cells, i.e. which cells were surveyed in the different time periods. The following section addresses that issue at the community level, be quantifying losses and gains of species at the gridd cell level over time, directly comparing cells sampled in successive time periods.

Benthic community turnover

Calculating temporal turnover at the grid cell level requires a matrix of species presences and absences over time. To give an example of this, it is useful to identify a grid cell which has samples in each time period:

gridded_occs %>% dplyr::select(time_slice, sample_cell) %>%
  distinct() %>%
  count(sample_cell) %>%
  arrange(desc(n)) %>% 
  head()
## # A tibble: 6 Ă— 2
##   sample_cell     n
##         <dbl> <int>
## 1        1544     6
## 2        1545     6
## 3        1546     6
## 4        1547     6
## 5        1637     6
## 6        1638     6

Any of these cells will work, we’ll just use the first one (cell 1544). To convert this into a matrix of species occurrences:

(comm_mat_eg <- gridded_occs %>% filter(sample_cell == 1544) %>% 
  mutate(pa = ifelse(sp_occs > 0, 1, 0)) %>% 
  dplyr::select(time_slice, aphia_id, pa) %>% 
  pivot_wider(names_from = aphia_id, values_from = pa, values_fill = 0) %>% 
  column_to_rownames(var = "time_slice")
)
##   103077 119034 102101 122739 147123 264152 732756 102299 137419 137556 137569
## 1      1      1      0      0      0      0      0      0      0      0      0
## 2      1      1      1      1      1      1      1      0      0      0      0
## 3      1      1      0      1      1      1      1      1      0      0      0
## 4      1      1      1      1      1      1      1      0      1      1      1
## 5      1      1      1      1      1      1      1      0      0      0      0
## 6      1      1      1      1      1      1      1      0      0      0      0
##   166616 181483 182796 183235 494362 743898 955291 955309 120120 167642 181523
## 1      0      0      0      0      0      0      0      0      0      0      0
## 2      0      0      0      0      0      0      0      0      0      0      0
## 3      0      0      0      0      0      0      0      0      0      0      0
## 4      1      1      1      1      1      1      1      1      0      0      0
## 5      1      0      0      1      1      0      1      0      1      1      1
## 6      0      0      0      1      0      0      0      0      1      0      0
##   884604
## 1      0
## 2      0
## 3      0
## 4      0
## 5      1
## 6      0

This gives us a matrix with 6 rows (one per time slice) and 23 columns (one per species found in this grid cell over the whole survey period). This is now in the correct format to use the diversity functions in the vegan and BAT packages. Specifically here we use BAT::beta() and vegan::betadiver()

b_eg <- comm_mat_eg %>% 
  as.matrix() %>% beta()
abc_eg <- comm_mat_eg %>% betadiver(method = NULL)

These can be converted into summaries of species shared (‘a’), species lost (‘b), and species gained (’c’) between each pair of time periods, as well as total beta diversity (‘Btotal’) and its richness (‘Brich’, i.e. due to changes in richness) and replacement (‘Brepl’, i.e. due to changes in composition) components. All of this is done in a single function, get_species_turnover, available in the ‘scripts’ folder, that we can source here:

source(here("scripts", "get_species_turnover.R"))

To run this for our example grid cell:

gridded_occs %>% filter(sample_cell == 1544) %>% get_species_turnover()
## # A tibble: 15 Ă— 9
##    sample_cell time_start time_comp beta_tot beta_repl beta_…¹     a     b     c
##          <dbl>      <dbl>     <dbl>    <dbl>     <dbl>   <dbl> <dbl> <dbl> <dbl>
##  1        1544          1         2    0.714     0       0.714     2     5     0
##  2        1544          1         3    0.714     0       0.714     2     5     0
##  3        1544          1         4    0.889     0       0.889     2    16     0
##  4        1544          1         5    0.867     0       0.867     2    13     0
##  5        1544          1         6    0.778     0       0.778     2     7     0
##  6        1544          2         3    0.25      0.25    0         6     1     1
##  7        1544          2         4    0.611     0       0.611     7    11     0
##  8        1544          2         5    0.533     0       0.533     7     8     0
##  9        1544          2         6    0.222     0       0.222     7     2     0
## 10        1544          3         4    0.684     0.105   0.579     6    12     1
## 11        1544          3         5    0.625     0.125   0.5       6     9     1
## 12        1544          3         6    0.4       0.2     0.2       6     3     1
## 13        1544          4         5    0.5       0.364   0.136    11     4     7
## 14        1544          4         6    0.579     0.105   0.474     8     1    10
## 15        1544          5         6    0.4       0       0.4       9     0     6
## # … with abbreviated variable name ¹​beta_rich

This can then be run over all cells, with a little extra manipulation at the end to add a variable indicating the relevant time comparison, and to show species lost and species gained as a proportion of all species occurring across both time periods (takes around a minute to run):

species_turnover_allcells <- gridded_occs %>%
  split(.$sample_cell) %>%
  purrr::map(get_species_turnover, .progress = TRUE) %>% 
  bind_rows()

The output is here manipulated a bit to add longer labels to the time periods, and to create composite time comparison variables (useful for plotting):

(species_turnover_allcells  <- species_turnover_allcells  %>% 
  mutate(
    t_start_long = case_match(
      time_start,
      1 ~ "pre1990",
      2 ~ "1990s",
      3 ~ "2000-2004",
      4 ~ "2005-2009",
      5 ~ "2012-2014",
      6 ~ "post2015"
    ),
    t_comp_long = case_match(
      time_comp,
      1 ~ "pre1990",
      2 ~ "1990s",
      3 ~ "2000-2004",
      4 ~ "2005-2009",
      5 ~ "2012-2014",
      6 ~ "post2015"),
    t_comp = paste(time_start, time_comp, sep = "_"),
    t_comp_lab = paste(t_start_long, t_comp_long, sep = "_v_"),
    t_steps = time_comp - time_start,
    b_prop = b / (a + b + c), c_prop = c / (a + b + c)
  ) %>% 
  select(sample_cell, t_comp, t_steps, beta_tot:c, b_prop, c_prop, t_comp_lab)
)
## # A tibble: 3,546 Ă— 12
##    sample_cell t_comp t_steps beta_tot beta_r…¹ beta_…²     a     b     c b_prop
##          <dbl> <chr>    <dbl>    <dbl>    <dbl>   <dbl> <dbl> <dbl> <dbl>  <dbl>
##  1         336 1_2          1    0.870   0.0435   0.826     6    39     1  0.848
##  2         792 1_2          1    0.739   0.319    0.420    18    40    11  0.580
##  3         982 3_4          1    0.848   0.268    0.580    34    30   160  0.134
##  4         984 2_3          1    0.739   0.267    0.472    47    24   109  0.133
##  5         984 2_4          2    0.902   0.304    0.598    18    28   138  0.152
##  6         984 3_4          1    0.907   0.673    0.234    10    36    61  0.336
##  7         985 2_4          2    0.794   0.658    0.135    67   107   151  0.329
##  8         986 2_4          2    0.764   0.631    0.133    89   169   119  0.448
##  9         988 2_3          1    0.686   0.429    0.257    44    30    66  0.214
## 10         988 2_5          3    0.802   0.642    0.160    32    52    78  0.321
## # … with 3,536 more rows, 2 more variables: c_prop <dbl>, t_comp_lab <chr>, and
## #   abbreviated variable names ¹​beta_repl, ²​beta_rich

This results in a data frame giving the sample cell (sample_cell), the relevant time comparison (t_comp - e.g. a value of 1_2 is the comparison between time slice 1 (pre-1990) and time slice 2 (1990s) - this is made explicit in t_comp_lab), the number of time steps in this comparison (t_steps - e.g. pre-1990 vs 1990s would be 1, pre-1990 vs 2000-2004 would be 2), and then the 6 diversity metrics: total, replacement, and richness beta diversity (beta_tot, beta_repl, beta_rich), and the number of species shared, lost, and gained from the first to the second time period (a, b, and c). Species lost and gained are also represented as a proportion of the total number of species lost, gained, and shared (b_prop and c_prop). To simplify visualisation and interpretation, it may be useful to filter the output to single time step comparisons.

To turn this into gridded products for each time comparison and turnover measure, we need to add some more spatial meta data (lon and lot of each cell). This code also adds the total number of unique sampling events in each cell (summed over all time periods):

# Get sample cell metadata
sample_cell_meta <- sample_events %>%
  group_by(sample_cell) %>%
  summarise(n_samps = n_distinct(sample_id)) %>% 
  bind_cols(xyFromCell(r, .$sample_cell)) %>% 
  rename(lon = x, lat = y)

# Then join to species turnover data
species_turnover_allcells <- species_turnover_allcells %>% 
  left_join(sample_cell_meta, join_by(sample_cell)) %>% 
  select(lon, lat, everything())

This then allows the creation of a raster for each diversity measure, with a layer for each time comparison. For example, for total beta diversity:

beta_tot_r <- species_turnover_allcells %>% 
  select(lon, lat, t_comp, t_comp_lab, beta_tot) %>% 
  arrange(t_comp) %>% 
  select(-t_comp) %>% 
  pivot_wider(names_from = t_comp_lab,
              values_from = beta_tot) %>% 
  rast(type = "xyz", crs = crs(r)) %>% 
  project(r)

This can be plotted if required:

(beta_tot_map <- ggplot() +
  geom_spatraster(data = beta_tot_r) +
  scale_fill_viridis_c(name = "beta (total)") +
  geom_sf(data = world, colour = "grey95", fill = "grey85", lwd = 0.1, alpha = 0.5) +
  xlim(as.vector(ext(r)[1:2])) +
  ylim(as.vector(ext(r)[3:4])) +
  coord_sf(expand = FALSE) +
  facet_wrap(~lyr) +
  theme_bw(base_size = 8) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
)

To plot an individual layer (or selection - here, pre-1990s v 1990s and pre-1990s v post-2015):

ggplot() +
  geom_spatraster(data = select(beta_tot_r, 1, 5)) +
  scale_fill_viridis_c(name = "beta (total)") +
  geom_sf(data = world, colour = "grey95", fill = "grey85", lwd = 0.1, alpha = 0.5) +
  xlim(as.vector(ext(r)[1:2])) +
  ylim(as.vector(ext(r)[3:4])) +
  coord_sf(expand = FALSE) +
  facet_wrap(~lyr) +
  theme_bw(base_size = 10) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

A function to do all of this for a specified turnover metric, create_turnover_raster, is available in the ‘scripts’ folder, sourced here:

source(here("scripts", "create_turnover_raster.R"))

Running for c_prop, creating (but not displaying) a plot for 2000-2004 compared to all later time periods:

turnover_r_cprop <- create_turnover_raster(turnover_metric = "c_prop",
                       create_plot = TRUE,
                       display_plot = FALSE,
                       plot_layers = c("2000-2004_v_2005-2009",
                                       "2000-2004_v_2012-2014",
                                       "2000-2004_v_post2015")
                       )

To examine the data returned:

turnover_r_cprop$turnover_rast
## class       : SpatRaster 
## dimensions  : 54, 93, 15  (nrow, ncol, nlyr)
## resolution  : 1, 1  (x, y)
## extent      : -34, 59, 28, 82  (xmin, xmax, ymin, ymax)
## coord. ref. : lon/lat WGS 84 (EPSG:4326) 
## source(s)   : memory
## names       : pre19~1990s, pre19~-2004, pre19~-2009, pre19~-2014, pre19~t2015, 1990s~-2004, ... 
## min values  :   0.0000000,   0.0000000,   0.0000000,   0.0000000,   0.0000000,    0.000000, ... 
## max values  :   0.9968944,   0.9910714,   0.9576923,   0.9946524,   0.9861111,    0.997151, ...

And to display the plot:

turnover_r_cprop$turnover_plot

The final step is to create the rasters for all diversity measures.

beta_tot_r <- create_turnover_raster(turnover_metric = "beta_tot", create_plot = FALSE)
beta_repl_r <- create_turnover_raster(turnover_metric = "beta_repl", create_plot = FALSE)
beta_rich_r <- create_turnover_raster(turnover_metric = "beta_rich", create_plot = FALSE)

sp_shared_r <- create_turnover_raster(turnover_metric = "a", create_plot = FALSE)
sp_lost_r <- create_turnover_raster(turnover_metric = "b", create_plot = FALSE)
sp_gained_r <- create_turnover_raster(turnover_metric = "c", create_plot = FALSE)

p_sp_lost_r <- create_turnover_raster(turnover_metric = "b_prop", create_plot = FALSE)
p_sp_gained_r <- create_turnover_raster(turnover_metric = "c_prop", create_plot = FALSE)

Write products to file

Write all of the rasters to file, as GeoTIFFs with LZW compression (set gdal = "COMPRESS=NONE" in the call to writeRaster if uncompressed files are required). Setting overwrite = TRUE means that any existing file of the same name will be overwritten.

writeRaster(beta_tot_r, filename = here("product", "beta_tot_r.tif"), overwrite = TRUE)
writeRaster(beta_repl_r, filename = here("product", "beta_repl_r.tif"), overwrite = TRUE)
writeRaster(beta_rich_r, filename = here("product", "beta_rich_r.tif"), overwrite = TRUE)
writeRaster(sp_shared_r, filename = here("product", "sp_shared_r.tif"), overwrite = TRUE)
writeRaster(sp_lost_r, filename = here("product", "sp_lost_r.tif"), overwrite = TRUE)
writeRaster(sp_gained_r, filename = here("product", "sp_gained_r.tif"), overwrite = TRUE)
writeRaster(p_sp_lost_r, filename = here("product", "p_sp_lost_r.tif"), overwrite = TRUE)
writeRaster(p_sp_gained_r, filename = here("product", "p_sp_gained_r.tif"), overwrite = TRUE)

To create the NetCDF versions, we adapt the EMODnet guide to creating NetCDF files from https://emodnet.github.io/EMODnet-Biology-products-erddap-demo/. This has been developed into a function in the scripts folder, netcdfify, that will generate a NetCDF from an input SpatRaster file:

source(here("scripts", "netcdfify.R"))

First, read in the GeoTIFFs created above:

beta_total <- rast(here("product", "beta_tot_r.tif"))
beta_repl <- rast(here("product", "beta_repl_r.tif"))
beta_rich <- rast(here("product", "beta_rich_r.tif"))
sp_shared <- rast(here("product", "sp_shared_r.tif"))
sp_lost <- rast(here("product", "sp_lost_r.tif"))
sp_gained <- rast(here("product", "sp_gained_r.tif"))
p_sp_lost <- rast(here("product", "p_sp_lost_r.tif"))
p_sp_gained <- rast(here("product", "p_sp_gained_r.tif"))

Because these will all be combined, we add the diveristy measure as a prefix to the layer names in each raster:

names(beta_total) <- paste0("beta_total_", names(beta_total))
names(beta_repl) <- paste0("beta_repl_", names(beta_repl))
names(beta_rich) <- paste0("beta_rich_", names(beta_rich))
names(sp_shared) <- paste0("sp_shared_", names(sp_shared))
names(sp_lost) <- paste0("sp_lost_", names(sp_lost))
names(sp_gained) <- paste0("sp_gained_", names(sp_gained))
names(p_sp_lost) <- paste0("p_sp_lost_", names(p_sp_lost))
names(p_sp_gained) <- paste0("p_sp_gained_", names(p_sp_gained))

Then combine the rasters into a single raster with 120 layers, one for each combination of diversity measure and time comparison:

all_measures_r <- c(
  beta_total, beta_repl, beta_rich,
  sp_shared, sp_lost, sp_gained,
  p_sp_lost, p_sp_gained
)

Write this to file as an single GeoTiff version of the product:

writeRaster(all_measures_r,
            filename = here("product", "all_diversity_measures.tif"),
            overwrite = TRUE)

It’s also useful to know the min and max values of each layer. This is done here, and written to file as a simple dataframe with a min and max value for each diveristy measure x time comparison combination:

all_measures_minmax <- all_measures_r %>% 
  minmax() %>% 
  as_tibble() %>% 
  mutate(value = c("min", "max")) %>% 
  select(value, everything()) %>% 
  pivot_longer(cols = -value, names_to = "diversity_measure", values_to = "val") %>% 
  pivot_wider(names_from = value, values_from = val)

write_csv(all_measures_minmax,
          here::here("data", "derived_data/all_measures_minmax.csv"))

We can set some global attributes for the output NetCDF file. This version is for all measures together. It could be adapted to create files for individual diversity measures if required.

global_attr <- list(
  title = "Diversity_measures",
  summary = "Eight measures of changes in species compositions of European macrobenthic communities between different time periods based on presence / absence in  on a 1 degree grid",                       
  Conventions = "CF-1.8",
  # id = "",
  naming_authority = "emodnet-biology.eu",
  history = "https://github.com/EMODnet/benthos-trends",
  source = "https://github.com/EMODnet/benthos-trends",
  # processing_level = "",
  # comment = "", 
  # acknowledgment = "",
  license = "CC-BY",
  standard_name_vocabulary = "CF Standard Name Table v1.8",
  date_created = as.character(Sys.Date()),
  creator_name = "Tom Webb",
  creator_email = "t.j.webb@sheffield.ac.uk",
  creator_url = "https://www.sheffield.ac.uk/biosciences/people/academic-staff/tom-webb",
  institution = "The University of Sheffield",
  project = "EMODnet-Biology",
  publisher_name = "EMODnet-Biology",                 
  publisher_email = "bio@emodnet.eu",                
  publisher_url = "www.emodnet-biology.eu",                  
  # geospatial_bounds = "",              
  # geospatial_bounds_crs = "",          
  # geospatial_bounds_vertical_crs = "", 
  geospatial_lat_min = ext(all_measures_r)[3],
  geospatial_lat_max = ext(all_measures_r)[4],
  geospatial_lon_min = ext(all_measures_r)[1],
  geospatial_lon_max = ext(all_measures_r)[2],
  # geospatial_vertical_min = "",        
  # geospatial_vertical_max = "",        
  # geospatial_vertical_positive = "",  
  # time_coverage_start = "1911",            
  # time_coverage_end = "2016",              
  # time_coverage_duration = "",         
  # time_coverage_resolution = "",       
  # uuid = "",                           
  # sea_name = "",                       
  # creator_type = "",                   
  creator_institution = "The University of Sheffield",            
  # publisher_type = "",                 
  publisher_institution = "Flanders Marine Institute (VLIZ)",        
  # program = "",                        
  # contributor_name = "",               
  # contributor_role  = "",              
  geospatial_lat_units = "degrees_north",           
  geospatial_lon_units = "degrees_east",           
  # geospatial_vertical_units   = "",    
  # date_modified = "",               
  # date_issued = "",                    
  # date_metadata_modified   = "",       
  # product_version = "",            
  # keywords_vocabulary = "",          
  # platform  = "",              
  # platform_vocabulary = "",          
  # instrument = "",          
  # instrument_vocabulary  = "",        
  # featureType = "Point",                  
  # metadata_link = "",                  
  # references = "",
  comment = "Uses attributes recommended by http://cfconventions.org",
  license = "CC-BY", 
  publisher_name = "EMODnet Biology Data Management Team",
  citation = "Webb, T.J. (2023). Temporal turnover of macrobenthos in European seas.",
  acknowledgement = "European Marine Observation Data Network (EMODnet) Biology project (EMFF/2019/1.3.1.9/Lot 6/SI2.837974), funded by the European Union under Regulation (EU) No 508/2014 of the European Parliament and of the Council of 15 May 2014 on the European Maritime and Fisheries Fund"
)

To create the NetCDF for total beta diversity, just run the netcdfify function (setting add_measure_name to FALSE because our layers already include the diversity measure prefix):

netcdfify(focal_rast = all_measures_r,
          output_fname = "all_diversity_measures",
          add_measure_name = FALSE,
          global_atts = global_attr)

For completeness, this shows how to creat a NetCDF file for a single diversity measure, here beta_total:

global_attr$title <- "Beta diversity"
global_attr$summary <- "Total beta diversity between different time periods based on presence / absence in European macrobenthic communities on a 1 degree grid"
beta_total <- rast(here("product", "beta_tot_r.tif"))
netcdfify(focal_rast = beta_total,
          output_fname = "beta_total",
          add_measure_name = TRUE,
          global_atts = global_attr)

Reproducibility

Reproducibility receipt

Date of rendering:

## [1] "2023-04-17 15:52:07 BST"

Repository: https://github.com/EMODnet/EMODnet-Biology-benthos-trends

Session info:

## R version 4.2.2 (2022-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.2.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] BAT_2.9.2           vegan_2.6-4         lattice_0.20-45    
##  [4] permute_0.9-7       RNetCDF_2.6-2       rnaturalearth_0.3.2
##  [7] tidyterra_0.3.2     terra_1.7-3         sf_1.0-9           
## [10] biscale_1.0.0       viridis_0.6.2       viridisLite_0.4.1  
## [13] janitor_2.2.0       worrms_0.4.2        here_1.0.1         
## [16] ncdf4_1.21          forcats_1.0.0       stringr_1.5.0      
## [19] dplyr_1.1.0         purrr_1.0.1         readr_2.1.3        
## [22] tidyr_1.3.0         tibble_3.2.1        ggplot2_3.4.0      
## [25] tidyverse_1.3.2    
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.4.1            backports_1.4.1         plyr_1.8.8             
##   [4] sp_1.6-0                splines_4.2.2           listenv_0.9.0          
##   [7] urltools_1.7.3          digest_0.6.31           foreach_1.5.2          
##  [10] htmltools_0.5.4         fansi_1.0.4             magrittr_2.0.3         
##  [13] googlesheets4_1.0.1     cluster_2.1.4           doParallel_1.0.17      
##  [16] ks_1.14.0               tzdb_0.3.0              globals_0.16.2         
##  [19] recipes_1.0.5           fastcluster_1.2.3       gower_1.0.1            
##  [22] modelr_0.1.10           vroom_1.6.1             hardhat_1.3.0          
##  [25] timechange_0.2.0        pdist_1.2.1             prettyunits_1.1.1      
##  [28] colorspace_2.1-0        rvest_1.0.3             haven_2.5.1            
##  [31] xfun_0.36               crayon_1.5.2            jsonlite_1.8.4         
##  [34] survival_3.4-0          iterators_1.0.14        ape_5.7-1              
##  [37] glue_1.6.2              gtable_0.3.1            gargle_1.3.0           
##  [40] ipred_0.9-14            future.apply_1.10.0     maps_3.4.1             
##  [43] abind_1.4-5             scales_1.2.1            mvtnorm_1.1-3          
##  [46] DBI_1.1.3               Rcpp_1.0.10             progress_1.2.2         
##  [49] palmerpenguins_0.1.1    magic_1.6-1             units_0.8-1            
##  [52] bit_4.0.5               proxy_0.4-27            mclust_6.0.0           
##  [55] stats4_4.2.2            lava_1.7.2.1            prodlim_2023.03.31     
##  [58] httr_1.4.4              ellipsis_0.3.2          pkgconfig_2.0.3        
##  [61] farver_2.1.1            nnet_7.3-18             sass_0.4.5             
##  [64] dbplyr_2.3.0            utf8_1.2.2              crul_1.3               
##  [67] caret_6.0-94            reshape2_1.4.4          tidyselect_1.2.0       
##  [70] labeling_0.4.2          rlang_1.1.0             munsell_0.5.0          
##  [73] cellranger_1.1.0        tools_4.2.2             cachem_1.0.6           
##  [76] cli_3.6.0               generics_0.1.3          broom_1.0.3            
##  [79] evaluate_0.20           geometry_0.4.7          fastmap_1.1.0          
##  [82] yaml_2.3.7              ModelMetrics_1.2.2.2    knitr_1.42             
##  [85] bit64_4.0.5             fs_1.6.1                future_1.32.0          
##  [88] nlme_3.1-160            pracma_2.4.2            xml2_1.3.3             
##  [91] compiler_4.2.2          rstudioapi_0.14         curl_5.0.0             
##  [94] e1071_1.7-13            reprex_2.0.2            bslib_0.4.2            
##  [97] stringi_1.7.12          highr_0.10              rgeos_0.6-2            
## [100] Matrix_1.5-1            classInt_0.4-9          vctrs_0.6.1            
## [103] pillar_1.8.1            lifecycle_1.0.3         triebeard_0.3.0        
## [106] jquerylib_0.1.4         data.table_1.14.6       raster_3.6-20          
## [109] R6_2.5.1                KernSmooth_2.23-20      gridExtra_2.3          
## [112] parallelly_1.35.0       codetools_0.2-18        MASS_7.3-58.1          
## [115] assertthat_0.2.1        proto_1.0.0             rprojroot_2.0.3        
## [118] withr_2.5.0             httpcode_0.3.0          mgcv_1.8-41            
## [121] parallel_4.2.2          hms_1.1.2               rpart_4.1.19           
## [124] grid_4.2.2              timeDate_4022.108       rnaturalearthdata_0.1.0
## [127] class_7.3-20            nls2_0.3-3              rmarkdown_2.20         
## [130] snakecase_0.11.0        hypervolume_3.1.0       googledrive_2.0.0      
## [133] pROC_1.18.0             lubridate_1.9.1