Global streamflow data availability

Mar 09, 2021 / Hung Nguyen

Data

How much streamflow data is publicly available globally? Do et al. (2018) compiled a massive global streamflow data set from multiple sources, and spent a lot of time on quality control. This data set is called GSIM—Global Streamflow Indices and Metadata. Thanks to their effort we have the answer for our question. We will explore GSIM to find out how many streamflow stations are available in each continent, and how many years of data are available at those stations.

GSIM can be downloaded here. This is a massive data set. For this exercise we only need the file GSIM_metadata.csv, which I’ve included in the GitHub repo of this post. We will handle this file with data.table and plot it with ggplot2.

library(data.table)
library(ggplot2)

We will also need the package countrycode, install it if you don’t have it already.

install.packages('countrycode')

Now let’s read and inspect the metadata record.

gm <- fread('GSIM_metadata.csv') # gm stands for GSIM metadata
head(gm)

##       gsim.no reference.db reference.no grdb.merge grdb.no paired.db paired.db.no  river          station country latitude longitude altitude  area unit river.dist station.dist latlon.dist bin.latlon.dist mean.dist number.overlap number.available.days number.missing.days frac.missing.days year.start year.end year.no
## 1: AF_0000001         grdb      2217100         No 2217100      <NA>         <NA> KUNDUZ      KULUKH TEPA      AF  36.9833   68.3000      320 37100 m3/s         NA           NA          NA              NA        NA             NA                  4748                   0         0.0000000       1965     1978      14
## 2: AF_0000002         grdb      2217101         No 2217101      <NA>         <NA> KUNDUZ        CHAR DARA      AF  36.7000   68.8333      401 24820 m3/s         NA           NA          NA              NA        NA             NA                  5170                   0         0.0000000       1964     1978      15
## 3: AF_0000003         grdb      2217102         No 2217102      <NA>         <NA> KUNDUZ          BAGHLAN      AF  36.1000   68.6667      562 19740 m3/s         NA           NA          NA              NA        NA             NA                  3829                   0         0.0000000       1968     1978      11
## 4: AF_0000004         grdb      2217103         No 2217103      <NA>         <NA> KUNDUZ PUL-I-KONDA SANG      AF  35.6000   68.6000      899 12610 m3/s         NA           NA          NA              NA        NA             NA                  4018                   0         0.0000000       1967     1978      12
## 5: AF_0000005         grdb      2217104         No 2217104      <NA>         <NA> KUNDUZ    DASHT-I-SAFED      AF  35.3000   67.9167     1588  3795 m3/s         NA           NA          NA              NA        NA             NA                  3177                 830         0.2071375       1967     1978      12
## 6: AF_0000006         grdb      2217110         No 2217110      <NA>         <NA> BAMYAN             DOAB      AF  35.2667   67.9833     1468  5005 m3/s         NA           NA          NA              NA        NA             NA                  4018                   0         0.0000000       1967     1978      12

In this record, each row contains the attributes of a station. The very last column, year.no, is what we need: it’s the number of years in each station.

The record tells us which country the station is in. There is a small problem: Namibia’s country code is "NA", and data.table mistook it as NA value.

head(gm[is.na(country), .(gsim.no, country)])

##       gsim.no country
## 1: NA_0000001    <NA>
## 2: NA_0000002    <NA>
## 3: NA_0000003    <NA>
## 4: NA_0000004    <NA>
## 5: NA_0000005    <NA>
## 6: NA_0000006    <NA>

Fixing this is easy:

gm[is.na(country), country := 'NA']

Histogram by continent

We can use country to determine which continent each station is in with help from the package countrycode.

# Assign continent to country
gm[, continent := countrycode::countrycode(
  warn = FALSE,
  country,
  origin = 'iso2c',
  destination = 'continent'
)]
# Serbia and Montenegro have become two countries since 2006.
# They were listed as one country in some source data of GSIM.
# This doesn't affect us as they are both in Europe.
gm[country == 'CS', continent := 'Europe']
# Split "Americas" into North America and South America
gm[country %in% c('US', 'CA', 'MX'), continent := 'North America']
gm[continent == 'Americas', continent := 'South America']

Let’s now plot the histogram of data length in each continent.

ggplot(gm) +
  geom_histogram(aes(year.no), binwidth = 1) +
  facet_wrap(vars(continent)) +
  scale_y_continuous(expand = c(0, 0)) +
  labs(x = 'Data length [years]', y = 'Count')

North America has many stations. Africa has few. Europe’s histogram has a very long right tail; some stations there have about 200 years of data.

Highlight degree of missingness

Some streamflow stations have none or only a few days of missing data. Others may have data only intermittently. This information is provided by GSIM in the column frac.missing.days (fraction of missing days). Let’s try to incorporate this information into our plot. We have used the x-axis for length and the y-axis for count, so missingness needs a third dimension: colours.

The fraction of missing days vary from station to station, even for those having the same lengths and in the same continent. Therefore, we can’t just highlight a column of the histogram with the same colour. So here’s the idea. Each station occupies one pixel in the plot: its x-ordinate is its length, its y-ordinate is its position among other stations in the same continent and having the same length, and its colour is the degree of missingness.

# Assign each station an index among all those having the same length
# in the same continent. This is then the y-ordinate.
gm[, num := 1:.N, by = .(continent, year.no)]
ggplot(gm) +
  geom_tile(aes(year.no, num, fill = frac.missing.days)) +
  facet_wrap(vars(continent)) +
  scale_fill_viridis_c() +
  scale_y_continuous(expand = c(0, 0)) +
  labs(x = 'Data length [years]', y = 'Count', fill = 'Fraction of\nmissing days')

We are almost there. The problem with this plot is that the colours are quite scattered. We might improve it by arranging the colours from light to dark (higher to lower degrees of missingness) to enhance the visual expression.

Another issue is that the shaded areas of Africa and Asia are too small, because there are two few stations there. So we could give each continent its own axis scales to “zoom in.”

Let’s try these two modifications now.

# First we rearrange the stations by degree of missingness within each group
gm <- gm[, .SD[order(frac.missing.days, decreasing = TRUE)], 
         by = .(continent, year.no)]
# Now we assign the index after sorting
gm[, num := 1:.N, by = .(continent, year.no)]
# And plot
ggplot(gm) +
  # We have to use geom_tile() to plot the pixels instead of geom_histogram as before
  geom_tile(aes(year.no, num, fill = frac.missing.days)) +
  facet_wrap(vars(continent), scales = 'free') +
  # Use the Viridis palette
  scale_fill_viridis_c() +
  scale_y_continuous(expand = c(0, 0)) +
  labs(
    x = 'Data length [years]', 
    y = 'Count', 
    fill = 'Fraction of\nmissing days')

I believe that this looks better. We lose the direct size comparisons across panels, but that can still be inferred from the axes. Trading that off, we now have a much smoother visual display of the missingness.

Now just a minor final touch. With custom y-axis scales, the tallest column in each panel is touching the top of the plot, and it looks a bit crampy. We still want the columns to start from zero, and we only need to expand it a bit on top. Thankfully, ggplot2 recently gained this capability with the function expansion().

ggplot(gm) +
  geom_tile(aes(year.no, num, fill = frac.missing.days)) +
  facet_wrap(vars(continent), scales = 'free') +
  scale_fill_viridis_c() +
  # We expand by zero at the bottom and by 10 at the top
  scale_y_continuous(expand = expansion(add = c(0, 10))) +
  labs(
    x = 'Data length [years]', 
    y = 'Count', 
    fill = 'Fraction of\nmissing days')

Summary

There you have it. We plotted a highlighted histogram by not using geom_histogram(), but by using geom_tile() to plot each pixel separately. We also arranged the pixel to have a better visual feel. I hope this has been useful. Have fun with your next visualization.

References

Do, Hong Xuan, Lukas Gudmundsson, Michael Leonard, and Seth Westra. 2018. “The Global Streamflow Indices and Metadata Archive (GSIM) Part 1: The Production of a Daily Streamflow Archive and Metadata.” Earth System Science Data 10 (2): 765–85. https://doi.org/10.5194/essd-10-765-2018.