Join Copernicus Climate Data Store Data with Socio-Economic and Opinion Poll Data

Part I: Get the data and aggregate it over Europe’s NUTS statistical regions

In this series of blog posts we show how to collect environmental data from the EU’s Copernicus Climate Data Store and bring it into a format that you can join with Eurostat’s socio-economic and environmental data. In a previous blog post we showed how to connect such data to survey (opinion poll) and tax data, using a real policy problem in Belgium. Further tutorials will build on this one.

But first, why are we doing this? Since 2003, the European Union and its member states have been releasing more and more data every year for open re-use, yet these data are often not used in the EU’s own data dissemination projects (the observatories) or in EU-funded research. We believe there are several reasons for this. More and more people can conduct business, scientific or policy analysis programmatically or with statistical software, but knowing how to collect data systematically from an exponentially growing supply is not everybody’s specialty. The lack of documentation and the high re-processing and validation needs of open data are further drawbacks.

rOpenGov has long been producing high-quality, peer-reviewed R packages to work with open data, but they are not for everyone. In an open collaboration, which you can join too, rOpenGov teamed up with open source developers, knowledgeable data curators, and a service developer team led by the Dutch reproducible research start-up Reprex to create a sustainable infrastructure that permanently collects, processes, documents and visualizes open data. We access open data (which is not always available for direct download) and re-process it into tidy, usable data that can be integrated with your existing data or databases. We are competing in the EU Datathon Challenge 1: supporting a European Green Deal agenda with open data as a service and research as a service. You are more than welcome to join our effort as a developer, a data curator, or an occasional contributor to open government packages.

Register with the Copernicus Climate Data Store

Koen Hufkens, Reto Stauffer and Elio Campitelli created the ecmwfr R package for programmatically accessing the Copernicus Data Store service. Follow the CDS Functionality vignette to get started.

You will need to register for the CDS services and accept the Terms and conditions. Then store your user ID and API key with wf_set_key():

# Replace the placeholder user ID and key with your own CDS credentials
ecmwfr::wf_set_key(user = "12345",
                   key = "00000000-aaaa-b1b1-0000-a1a1a1a1a1a1",
                   service = "cds")

You can check if you were successful with:

ecmwfr::wf_get_key(user = "12345", service = "cds")

Get the Data

Let us formulate our first request:

request_lai_hv_2019_06 <- list(
  "dataset_short_name" = "reanalysis-era5-land-monthly-means",
  "product_type"   = "monthly_averaged_reanalysis",
  "variable"       = "leaf_area_index_high_vegetation",
  "year"           = "2019",
  "month"          =  "06",
  "time"           = "00:00",
  "area"           = "70/-20/30/60",
  "format"         = "netcdf",
  "target"         = "demo_file.nc")

# Submit the request and download the file into the data-raw folder
lai_hv_2019_06.nc <- ecmwfr::wf_request(user = "<your_ID>",
                                        request = request_lai_hv_2019_06,
                                        transfer = TRUE,
                                        path = "data-raw",
                                        verbose = FALSE)
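
Because the published dataset covers more than one month, it can help to generate the request list from a year and a month instead of typing it by hand. The following is only a sketch: make_lai_request() is a hypothetical helper (it is not part of ecmwfr), and the target file name pattern is an assumption. It builds the lists and downloads nothing.

```r
# Hypothetical helper: builds a CDS request list for one month of the
# ERA5-Land monthly mean leaf area index (high vegetation).
# Only year, month and target file name vary; all other fields are
# copied from the request shown above.
make_lai_request <- function(year, month) {
  month_chr <- sprintf("%02d", as.integer(month))
  list(
    "dataset_short_name" = "reanalysis-era5-land-monthly-means",
    "product_type"       = "monthly_averaged_reanalysis",
    "variable"           = "leaf_area_index_high_vegetation",
    "year"               = as.character(year),
    "month"              = month_chr,
    "time"               = "00:00",
    "area"               = "70/-20/30/60",
    "format"             = "netcdf",
    # Assumed naming pattern for the downloaded file
    "target"             = paste0("lai_hv_", year, "_", month_chr, ".nc")
  )
}

# One request per summer month of 2019; nothing is downloaded here.
summer_requests <- lapply(6:8, function(m) make_lai_request(2019, m))
```

Each element of summer_requests can then be passed to the request argument of wf_request() in the same way as the single request above.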

Effective Leaf Area Index

You can find this data either as global raster images or as re-processed monthly averages. Working with the raw data is not very practical: in cloudy weather the observations are missing, and the files are too large for a personal computer. For the purposes of our Green Deal Data Observatory the monthly average values, which have the product type monthly_averaged_reanalysis, are far more practical.

For compatibility with other R packages, read the data with the raster package from rSpatial.org.

lai_file <- here::here( "data-raw", "demo_file.nc")
lai_raster <- raster::raster(lai_file)

## Loading required namespace: ncdf4

Let us convert this to a SpatialPointsDataFrame class, which is an augmented data frame class with coordinates.

LAI_df <- raster::rasterToPoints(lai_raster, fun=NULL, spatial=TRUE)
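
To make the idea of this conversion concrete, here is a base R illustration on a toy grid. It is not how the raster package implements rasterToPoints(); the coordinates and values are made up. It only shows the result in principle: one row per grid cell with its coordinates, and cells without data dropped.

```r
# Toy 2 x 3 grid of values; NA marks a cell without data.
grid_values <- matrix(c(1.5, NA, 2.0,
                        0.5, 3.0, NA), nrow = 2, byrow = TRUE)

# Hypothetical cell-centre coordinates (degrees).
lon <- c(4.0, 4.1, 4.2)   # one per column, west to east
lat <- c(52.1, 52.0)      # one per row, north to south

# One row per cell. as.vector() walks the matrix column by column,
# so the latitude (row coordinate) has to vary fastest.
points_df <- data.frame(
  x   = rep(lon, each = nrow(grid_values)),
  y   = rep(lat, times = ncol(grid_values)),
  lai = as.vector(grid_values)
)

# Drop the cells that carry no value.
points_df <- points_df[!is.na(points_df$lai), ]
```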

Get The Map

With the help of rOpenGov, we are creating various R packages to programmatically access open data and put it into the right format. The popular eurostat package is useful not only for downloading data from Eurostat, but also for mapping it.

In this case, we want to create regional maps. Europe has five levels of geographical regions: NUTS0 for countries; NUTS1 for larger areas like states or provinces; NUTS2 for smaller areas like counties; and NUTS3 for even smaller areas. The LAU level contains settlements and their surrounding areas.

Country borders change sometimes (think about the unification of Germany, or the breakup of Czechoslovakia and Yugoslavia), but countries are relatively stable entities. Sub-national regional borders change very frequently: since 2000 there have been many thousands of changes in Europe. This means that you must choose one regional boundary definition. The latest edition is NUTS2021, but most of the available data is still in the NUTS2016 format, and you will often find NUTS2013 or even NUTS2010 data around. Our Green Deal Data Observatory uses the NUTS2016 definition, because it is by far the most used in 2021. An offspring of the eurostat package, regions, helps you take care of NUTS changes when you work, and can convert your data to NUTS2021 if you later need it.
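
The map object used in the rest of this post is called map_nuts_2. A sketch of how it can be obtained with the eurostat package follows. The resolution and NUTS level are taken from the warning message printed below; year = "2016" is an assumption based on the NUTS2016 definition mentioned above, and get_nuts2_map() is a hypothetical wrapper, not a function of the eurostat package.

```r
# Hypothetical wrapper: downloads the NUTS2 boundaries as an sf object.
# year = "2016" is an assumption (NUTS2016 definition); resolution "60"
# and nuts_level "2" appear in the warning message of the original run.
get_nuts2_map <- function(year = "2016") {
  eurostat::get_eurostat_geospatial(
    output_class = "sf",
    resolution   = "60",
    nuts_level   = "2",
    year         = year
  )
}

# This needs the eurostat and sf packages and internet access,
# so the call is left commented out here:
# map_nuts_2 <- get_nuts2_map()
```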

## sf at resolution 1:60 read from local file

## Warning in eurostat::get_eurostat_geospatial(resolution = "60", nuts_level =
## "2", : Default of 'make_valid' for 'output_class="sf"' will be changed in the
## future (see function details).

plot(map_nuts_2)

Our measurement of the average Effective Leaf Area Index is raster data: it is given for many points on Europe’s map. What we need to do is overlay this raster information on the statistical map of Europe. We use the excellent sp: R Classes and Methods for Spatial Data package for this purpose. The sp::over() function decides whether a Leaf Area Index measurement point falls into the polygon (shape) of a particular NUTS2 region, for example Zuid-Holland (South Holland) in the Netherlands or Saarland in Germany. Then it averages the measurements that fall within each region with the mean() function.

LAI_nuts_2 <- sp::over(
  sp::geometry(as(map_nuts_2, 'Spatial')),
  LAI_df,
  fn = mean)
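
The logic of this overlay can be sketched in base R with a textbook ray-casting test. This is an illustration of the idea only, not the algorithm that sp uses internally; the two regions, the points and their values are all made up, and point_in_polygon() is a hypothetical helper.

```r
# Ray-casting test: is the point (px, py) inside the polygon whose
# vertices are given by the vectors vx and vy?
point_in_polygon <- function(px, py, vx, vy) {
  n <- length(vx)
  inside <- FALSE
  j <- n
  for (i in seq_len(n)) {
    # Does the edge from vertex j to vertex i straddle the height py?
    if ((vy[i] > py) != (vy[j] > py)) {
      # x coordinate where the edge crosses the horizontal line at py
      x_cross <- vx[i] + (py - vy[i]) * (vx[j] - vx[i]) / (vy[j] - vy[i])
      if (px < x_cross) inside <- !inside
    }
    j <- i
  }
  inside
}

# Two hypothetical regions: a unit square and a triangle to its right.
regions <- list(
  square   = list(x = c(0, 1, 1, 0), y = c(0, 0, 1, 1)),
  triangle = list(x = c(2, 4, 3),    y = c(0, 0, 2))
)

# Hypothetical measurement points with made-up values.
pts <- data.frame(
  x   = c(0.2, 0.8, 3.0, 3.0, 5.0),
  y   = c(0.5, 0.5, 0.5, 1.0, 5.0),
  lai = c(1.0, 3.0, 2.0, 4.0, 9.0)
)

# For each region, average the values of the points that fall inside it.
region_means <- sapply(regions, function(r) {
  inside <- mapply(point_in_polygon, pts$x, pts$y,
                   MoreArgs = list(vx = r$x, vy = r$y))
  if (any(inside)) mean(pts$lai[inside]) else NA_real_
})
```

The last point lies outside both shapes and is ignored, just as measurements over the sea fall into no NUTS2 region.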

Let’s call the average LAI index lai, and bind it to the Eurostat map:

library(dplyr)  # bind_cols(), select(), sample_n() and the %>% pipe

names(LAI_nuts_2)[1] <- "lai"
LAI_sfdf <- bind_cols(map_nuts_2, LAI_nuts_2)

If you want to work with the data in a numeric context, you do not need the geographical information, and you can “downgrade” the spatial data frame to a simple data frame.

set.seed(2019) #to always see the same sample
LAI_sfdf %>%
  as.data.frame() %>%
  select ( all_of(c("NUTS_NAME", "NUTS_ID", "lai")) ) %>%
  sample_n(10)

##                      NUTS_NAME NUTS_ID lai
## 281                       Vest    RO42  NA
## 125                     Kassel    DE73  NA
## 69              Friesland (NL)    NL12  NA
## 237 Agri, Kars, Igdir, Ardahan    TRA2  NA
## 273                East Anglia    UKH1  NA
## 119                Prov. Liège    BE33  NA
## 61                   Bourgogne    FRC1  NA
## 275                      Essex    UKH3  NA
## 282                   Istanbul    TR10  NA
## 174                    Leipzig    DED5  NA
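
Once the regional averages are in a plain data frame, the NUTS_ID column is the key for joining them with any other regional table. Here is a minimal sketch with base R’s merge(). Both tables are made up for illustration: the lai and gdp_index values are hypothetical, not results, and gdp_index is an invented variable name.

```r
# Hypothetical regional averages, keyed by NUTS_ID.
lai_table <- data.frame(
  NUTS_ID = c("RO42", "DE73", "NL12"),
  lai     = c(2.1, 3.4, 1.2),
  stringsAsFactors = FALSE
)

# Hypothetical socio-economic indicator with the same key.
other_table <- data.frame(
  NUTS_ID   = c("DE73", "NL12", "BE33"),
  gdp_index = c(105, 98, 91),
  stringsAsFactors = FALSE
)

# Left join: keep every region of lai_table; regions without a match
# in other_table get NA in gdp_index.
joined <- merge(lai_table, other_table, by = "NUTS_ID", all.x = TRUE)
```

A left join keeps regions that have no match visible as missing values, which makes mismatched NUTS editions easier to spot.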

We’ll plot the map with ggplot2.

library(ggplot2)
library(sf)
ggplot(data=LAI_sfdf) + 
  geom_sf(aes(fill=lai),
          color="dim grey", size=.1) + 
  scale_fill_gradient( low ="#FAE000", high = "#00843A") +
  guides(fill = guide_legend(reverse=TRUE, title = "LAI")) +
  labs(title="Leaf Area Index",
       subtitle = "High vegetation half, NUTS2 regional average values",
       caption="\ua9 EuroGeographics for the administrative boundaries 
                \ua9 Copernicus Data Service, June 2019 average values
                Tutorial and ready-to-use data on greendeal.dataobservatory.eu") +
  theme_light() + theme(legend.position=c(.88,.78)) +
  coord_sf(xlim=c(-22,48), ylim=c(34,70))

Data Integrity

Our Green Deal Data Observatory has a data API where we place the new data with metadata for programmatic download in CSV or JSON, or even with SQL queries. For data integrity purposes, we place an authoritative copy on Zenodo (Green Deal Data Observatory Community). You can use this for scientific citations. We are also happy if you place your own climate policy related research data there, so that we can include it in our observatory. In our subsequent tutorials, we will show how to do this programmatically in R. This particular dataset (covering not only June, which we selected to streamline the tutorial) is available with the digital object identifier doi.org/10.5281/zenodo.4903940.

Join us

Join our open collaboration Green Deal Data Observatory team as a data curator, developer or business developer. More interested in antitrust, innovation policy or economic impact analysis? Try our Economy Data Observatory team! Or does your interest lie more in data governance, trustworthy AI and other digital market problems? Check out our Digital Music Observatory team!

Daniel Antal
Developer of open-source statistical software

My research interests include reproducible social science, economics and finance.
