<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Daniel Antal | Automated Data Observatories</title>
    <link>/authors/daniel_antal/</link>
      <atom:link href="/authors/daniel_antal/index.xml" rel="self" type="application/rss+xml" />
    <description>Daniel Antal</description>
    <generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>© 2020-2021 Daniel Antal</copyright><lastBuildDate>Mon, 08 Nov 2021 10:00:00 +0100</lastBuildDate>
    <image>
      <url>/authors/daniel_antal/avatar_hud88ed22bc3c29040ee8bdf20c6cd6530_105602_270x270_fill_q75_lanczos_center.jpg</url>
      <title>Daniel Antal</title>
      <link>/authors/daniel_antal/</link>
    </image>
    
    <item>
      <title>How We Add Value to Public Data With Imputation and Forecasting</title>
      <link>/post/2021-11-06-indicator_value_added/</link>
      <pubDate>Mon, 08 Nov 2021 10:00:00 +0100</pubDate>
      <guid>/post/2021-11-06-indicator_value_added/</guid>
      <description>&lt;p&gt;Public data sources are often plagued by missng values. Naively you may think that you can ignore them, but think twice: in most cases, missing data in a table is not missing information, but rather malformatted information. This approach of ignoring or dropping missing values will not be feasible or robust when you want to make a beautiful visualization, or use data in a business forecasting model, a machine learning (AI) applicaton, or a more complex scientific model. All of the above require complete datasets, and naively discarding missing data points amounts to an excessive waste of information. In this example we are continuing the example a not-so-easy to find public dataset.&lt;/p&gt;
&lt;td style=&#34;text-align: center;&#34;&gt;
&lt;figure  id=&#34;figure-in-the-previous-blogpostpost2021-11-08-indicator_findable-we-explained-how-we-added-value-by-documenting-data-following-the-fair-principle-and-with-the-professional-curatorial-work-of-placing-the-data-in-context-and-linking-it-to-other-information-sources-such-as-other-datasets-books-and-publications-regardless-of-their-natural-language-ie-whether-these-sources-are-described-in-english-german-portugese-or-croatian-photo-jack-sloophttpsunsplashcomphotoseywn81spkj8&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/blogposts_2021/jack-sloop-eYwn81sPkJ8-unsplash.jpg&#34; alt=&#34;[In the previous blogpost](/post/2021-11-08-indicator_findable/) we explained how we added value by documenting data following the *FAIR* principle and with the professional curatorial work of placing the data in context, and linking it to other information sources, such as other datasets, books, and publications, regardless of their natural language (i.e., whether these sources are described in English, German, Portuguese or Croatian). Photo: [Jack Sloop](https://unsplash.com/photos/eYwn81sPkJ8).&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption&gt;
      &lt;a href=&#34;/post/2021-11-08-indicator_findable/&#34;&gt;In the previous blogpost&lt;/a&gt; we explained how we added value by documenting data following the &lt;em&gt;FAIR&lt;/em&gt; principle and with the professional curatorial work of placing the data in context, and linking it to other information sources, such as other datasets, books, and publications, regardless of their natural language (i.e., whether these sources are described in English, German, Portuguese or Croatian). Photo: &lt;a href=&#34;https://unsplash.com/photos/eYwn81sPkJ8&#34;&gt;Jack Sloop&lt;/a&gt;.
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/td&gt;
&lt;p&gt;Completing missing datapoints requires statistical production information (why might the data be missing?) and data science know-how (how to impute the missing value). If you do not have a good statistician or data scientist on your team, you will need high-quality, complete datasets. This is what our automated data observatories provide.&lt;/p&gt;
&lt;h2 id=&#34;why-is-data-missing&#34;&gt;Why is data missing?&lt;/h2&gt;
&lt;p&gt;International organizations offer many statistical products, but usually on an ‘as-is’ basis. For example, Eurostat is the world’s premier statistical agency, but it has no right to overrule whatever data the member states of the European Union, and some other cooperating European countries, provide to it. And it cannot force these countries to hand over data if they fail to do so. As a result, many data points will be missing, and some data points will have wrong (obsolete) descriptions or geographical dimensions. We will show the geographical aspect of the problem in a separate blogpost; for now, we focus only on missing data.&lt;/p&gt;
&lt;p&gt;Some countries have only recently started providing data to Eurostat, and it is likely that you will find few datapoints for North Macedonia or Bosnia-Herzegovina. Other countries provide data with some delay, so the last one or two years are missing. And there are gaps in some countries’ data, too.&lt;/p&gt;
&lt;td style=&#34;text-align: center;&#34;&gt;
&lt;figure  id=&#34;figure-see-the-authoritative-copy-of-the-datasethttpszenodoorgrecord5652118yykhvmdmkuk&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/blogposts_2021/trb_plot.png&#34; alt=&#34;See the authoritative copy of the [dataset](https://zenodo.org/record/5652118#.YYkhVmDMKUk).&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption&gt;
      See the authoritative copy of the &lt;a href=&#34;https://zenodo.org/record/5652118#.YYkhVmDMKUk&#34;&gt;dataset&lt;/a&gt;.
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/td&gt;
&lt;p&gt;This is a headache if you want to use the data in a machine learning application or in a multiple or panel regression model. You can, of course, discard countries or years where you do not have full data coverage, but this approach usually wastes too much information&amp;ndash;if you work with 12 years and only one data point is missing, you would be discarding an entire country’s 11 years’ worth of data. Another option is to estimate the values, or otherwise impute the missing data, when this is possible with reasonable precision. This is where things get tricky, and you will likely need a statistician or a data scientist onboard.&lt;/p&gt;
&lt;h2 id=&#34;what-can-we-improve&#34;&gt;What can we improve?&lt;/h2&gt;
&lt;p&gt;Consider the case when data is missing for only one year, say 2015, for a particular country. The naive solution would be to omit 2015, or the country at hand, from the dataset. This is quite destructive, because we know a lot about the radio market turnover in this country and in this year! But leaving 2015 blank will not look good on a chart, and will make your machine learning application or your regression model fail.&lt;/p&gt;
&lt;p&gt;A statistician or a radio market expert will tell you that you more or less know the missing information: the total turnover was certainly not zero in that year. With some statistical or radio domain-specific knowledge you can use the 2014 or 2016 value, or a combination of the two, and keep the country and year in the dataset.&lt;/p&gt;
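&lt;p&gt;&lt;em&gt;A minimal sketch of this neighbour-based gap filling, assuming a hypothetical turnover series with 2015 missing (the figures are illustrative, not our indicator values):&lt;/em&gt;&lt;/p&gt;

```python
import pandas as pd

# Hypothetical annual radio market turnover, with 2015 missing
turnover = pd.Series(
    [210.0, 224.0, None, 239.0, 251.0],
    index=[2013, 2014, 2015, 2016, 2017],
)

# Combine the 2014 and 2016 neighbours: linear interpolation
interpolated = turnover.interpolate(method="linear")

# Or simply carry the last known (2014) value forward
last_known = turnover.ffill()
```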
&lt;p&gt;Our improved dataset added backcasted data (using the best time-series model fitted to the country&amp;rsquo;s actually observed data), forecasted data (again, using the best time-series model), and approximated data (using linear approximation). In a few cases, we add the last or next known value. To give a few quantitative indicators about our work:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Increased number of observations: +65%&lt;/li&gt;
&lt;li&gt;Reduced missing values: -48.1%&lt;/li&gt;
&lt;li&gt;Increased non-missing subset for regression or AI: +66.67%&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If your organization is working with panel (longitudinal multiple) regressions or various machine learning applications, then your team knows that not having the +66.67% gain would be a deal-breaker in the choice of models and in the precision of estimates, KPIs, or other quantitative products. They also know that they would spend about 90% of their data resources on achieving this +66.67% gain in usability.&lt;/p&gt;
&lt;p&gt;If you happen to work in an NGO, a business unit, or a research institute that does not employ data scientists, then you will likely never achieve this improvement, and you have to give up on a number of quantitative tools or visualizations. If you do have a data scientist onboard, that professional can use our work as a starting point.&lt;/p&gt;
&lt;h2 id=&#34;can-you-trust-our-data&#34;&gt;Can you trust our data?&lt;/h2&gt;
&lt;p&gt;We believe that you can trust our data more than the original public source. We use statistical expertise to find out why data may be missing. Often, it is present but in a wrong location (for example, because the name of a region changed).&lt;/p&gt;
&lt;p&gt;If you are reluctant to use estimates, consider the alternative: discarding known, actual data from your forecast or visualization because one data point is missing. Which provides more accurate information: hiding known data because one point is missing, or using all known data together with an estimate?&lt;/p&gt;
&lt;p&gt;Our codebooks and our API use the &lt;a href=&#34;https://sdmx.org/?page_id=3215/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Statistical Data and Metadata eXchange&lt;/a&gt; documentation standards to clearly indicate which data is observed, which is missing, which is estimated, and of course, also how it is estimated.
This transparency highlights another important aspect of data trustworthiness: if you have a better idea, you can replace our estimates with your own.&lt;/p&gt;
&lt;p&gt;Our indicators come with standardized codebooks that contain not only descriptive metadata but also administrative metadata about the history of the indicator values. You will find very important information about the statistical method we used to fill in the data gaps, and we even link to the reliable, peer-reviewed statistical software that made the calculations. For data scientists, we record plenty of information about the computing environment, too&amp;ndash;this can come in handy if your estimates need external authentication, or if you suspect a bug.&lt;/p&gt;
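&lt;p&gt;&lt;em&gt;As a small, hedged illustration of how such a codebook flag can work in practice (column names and values are hypothetical; the status codes follow the SDMX observation-status convention, e.g. A for an observed value, E for estimated, F for forecast):&lt;/em&gt;&lt;/p&gt;

```python
import pandas as pd

# Each observation carries an SDMX-style status flag alongside its value:
# A = normal (observed), E = estimated, F = forecast (codes assumed from
# the SDMX observation status code list; data is illustrative)
indicator = pd.DataFrame({
    "geo": ["SK", "SK", "SK"],
    "year": [2015, 2016, 2017],
    "value": [231.5, 239.0, 263.0],
    "obs_status": ["E", "A", "F"],
})

# A cautious user can keep only the directly observed values...
observed_only = indicator[indicator["obs_status"] == "A"]

# ...or swap our estimate for their own while keeping the observed rows
indicator.loc[indicator["obs_status"] == "E", "value"] = 230.0
```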
&lt;h2 id=&#34;avoid-the-data-sisyphus&#34;&gt;Avoid the data Sisyphus&lt;/h2&gt;
&lt;p&gt;If you work in an academic institution, an NGO, or a consultancy, you can never be sure who downloaded the &lt;a href=&#34;https://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=sbs_na_1a_se_r2&amp;amp;lang=en&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Annual detailed enterprise statistics for services (NACE Rev. 2 H-N and S95)&lt;/a&gt; dataset from Eurostat. Did they modify the dataset? Did they already make corrections to the missing data? What method did they use? To prevent many potential problems, you will likely download it again, and again, and again&amp;hellip;&lt;/p&gt;
&lt;td style=&#34;text-align: center;&#34;&gt;
&lt;figure  id=&#34;figure-see-our-the-data-sisyphushttpsreprexnlpost2021-07-08-data-sisyphus-blogpost&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/blogposts_2021/Sisyphus_Bodleian_Library.png&#34; alt=&#34;See our [The Data Sisyphus](https://reprex.nl/post/2021-07-08-data-sisyphus/) blogpost.&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption&gt;
      See our &lt;a href=&#34;https://reprex.nl/post/2021-07-08-data-sisyphus/&#34;&gt;The Data Sisyphus&lt;/a&gt; blogpost.
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/td&gt;
&lt;p&gt;We have a better solution. You can always rely on our API to import the latest, best data directly, but if you want to be sure, you can use our &lt;a href=&#34;https://zenodo.org/record/5652118#.YYhGOGDMLIU&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;regular backups&lt;/a&gt; on Zenodo. Zenodo is an open science repository managed by CERN and supported by the European Union. On Zenodo, you can find an authoritative copy of our indicator (and its previous versions) with a digital object identifier, in this case, &lt;a href=&#34;https://doi.org/10.5281/zenodo.5652118&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;10.5281/zenodo.5652118&lt;/a&gt;. These datasets will be preserved for decades, and nobody can manipulate them. You cannot accidentally overwrite them, and we have no backdoor access to modify them.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://doi.org/10.5281/zenodo.5652118&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;&lt;img src=&#34;https://zenodo.org/badge/DOI/10.5281/zenodo.5652118.svg&#34; alt=&#34;DOI&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Are you a data user? Give us some feedback! Shall we do some further automatic data enhancements with our datasets? Document with different metadata? Link more information for business, policy, or academic use? Please  give us any &lt;a href=&#34;https://reprex.nl/#contact&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;feedback&lt;/a&gt;!&lt;/em&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>How We Add Value to Public Data With Better Curation And Documentation?</title>
      <link>/post/2021-11-08-indicator_findable/</link>
      <pubDate>Mon, 08 Nov 2021 09:00:00 +0100</pubDate>
      <guid>/post/2021-11-08-indicator_findable/</guid>
      <description>&lt;p&gt;In this example, we show a simple indicator: the &lt;em&gt;Turnover in Radio Broadcasting Enterprises&lt;/em&gt; in many European countries. This is an important demand driver in the &lt;a href=&#34;https://music.dataobservatory.eu/#pillars&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Music economy pillar&lt;/a&gt; of our Digital Music Observatory, and important indicator in our more general &lt;a href=&#34;https://ccsi.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Cultural &amp;amp; Creative Sectors and Industries Observatory&lt;/a&gt;. We show a very similar example in our &lt;em&gt;Green Deal Data Observatory&lt;/em&gt; with &lt;a href=&#34;https://greendeal.dataobservatory.eu/post/2021-11-08-indicator_findable/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;environmental R&amp;amp;D public spending in Europe&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This dataset comes from a public data source, the data warehouse of the
European statistical agency, Eurostat. Yet it is not trivial to use:
unless you are familiar with national accounts, you will not find &lt;a href=&#34;https://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=sbs_na_1a_se_r2&amp;amp;lang=en&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;this dataset&lt;/a&gt; on the Eurostat website.&lt;/p&gt;
&lt;td style=&#34;text-align: center;&#34;&gt;
&lt;figure  id=&#34;figure-the-data-can-be-retrieved-from-the-annual-detailed-enterprise-statistics-for-services-nace-rev2-h-n-and-s95-eurostat-folder&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/blogposts_2021/eurostat_radio_broadcasting_turnover.png&#34; alt=&#34;The data can be retrieved from the Annual detailed enterprise statistics for services NACE Rev.2 H-N and S95 Eurostat folder.&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption&gt;
      The data can be retrieved from the Annual detailed enterprise statistics for services NACE Rev.2 H-N and S95 Eurostat folder.
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/td&gt;
&lt;p&gt;Our version of this statistical indicator is documented following the &lt;a href=&#34;https://www.go-fair.org/fair-principles/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;FAIR principles&lt;/a&gt;: our data assets
are findable, accessible, interoperable, and reusable. While the
Eurostat data warehouse partly fulfills these important data quality
expectations, we can improve on them significantly. And we can
improve the dataset itself, as we will show in the &lt;a href=&#34;/post/2021-11-06-indicator_value_added/&#34;&gt;next blogpost&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;findable-data&#34;&gt;Findable Data&lt;/h2&gt;
&lt;p&gt;Our data observatories add value by curating the data&amp;ndash;we bring this
indicator to light with a more descriptive name, and we place it in
context with our &lt;a href=&#34;https://music.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Digital Music Observatory&lt;/a&gt; and &lt;a href=&#34;https://ccsi.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Cultural &amp;amp; Creative Sectors and Industries Observatory&lt;/a&gt;.
While many people in the creative sectors, or
among cultural policy designers, may need this dataset, most of them have no training in working with
national accounts, which involves deciphering national-accounts data codes in records that measure economic activity at the national level. Our curated data observatories bring together the many available datasets around important domains. Our &lt;em&gt;Digital Music Observatory&lt;/em&gt;, for example, aims to form an ecosystem of music data users and producers.&lt;/p&gt;
&lt;td style=&#34;text-align: center;&#34;&gt;
&lt;figure  id=&#34;figure-we-added-descriptive-metadatahttpszenodoorgrecord5652113yykvbwdmkuk-that-help-you-find-our-data-and-match-it-with-other-relevant-data-sources&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/blogposts_2021/zenodo_metadata_eurostat_radio_broadcasting_turnover.png&#34; alt=&#34;We [added descriptive metadata](https://zenodo.org/record/5652113#.YYkVBWDMKUk) that help you find our data and match it with other relevant data sources.&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption&gt;
      We &lt;a href=&#34;https://zenodo.org/record/5652113#.YYkVBWDMKUk&#34;&gt;added descriptive metadata&lt;/a&gt; that help you find our data and match it with other relevant data sources.
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/td&gt;
&lt;p&gt;We added descriptive metadata that helps you find our data and match it
with other relevant data sources. For example, we add keywords and
standardized metadata identifiers from the Library of Congress Linked
Data Service, probably the world’s largest standardized knowledge
description library. This ensures that you can find relevant data
around the same key term (&lt;a href=&#34;https://id.loc.gov/authorities/subjects/sh85110448.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;radio broadcasting&lt;/a&gt;)
in addition to our turnover data. This allows connecting our dataset unambiguously
with other information sources that use the same concept but may be listed under
different keywords, such as &lt;em&gt;Radio–Broadcasting&lt;/em&gt;, or &lt;em&gt;Radio industry and
trade&lt;/em&gt;, or maybe &lt;em&gt;Hörfunkveranstalter&lt;/em&gt; in German, &lt;em&gt;Emitiranje
radijskog programa&lt;/em&gt; in Croatian, or &lt;em&gt;Actividades de radiodifusão&lt;/em&gt; in
Portuguese.&lt;/p&gt;
&lt;h2 id=&#34;accessible-data&#34;&gt;Accessible Data&lt;/h2&gt;
&lt;p&gt;Our data is accessible in two forms: in CSV tabular format (which can be
read with Excel, OpenOffice, Numbers, SPSS, and many similar spreadsheet
or statistical applications) and in JSON for automated importing into
your databases. We can also provide our users with SQLite databases,
which are fully functional, single-user relational databases.&lt;/p&gt;
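&lt;p&gt;&lt;em&gt;Both access routes lead to the same tidy table; a small sketch with made-up file contents (the column names mirror the examples in this post, not the actual API schema):&lt;/em&gt;&lt;/p&gt;

```python
import io
import json

import pandas as pd

# The same indicator served as CSV and as JSON (contents are illustrative)
csv_text = "geo,year,value\nSK,2016,239.0\nSI,2016,118.2\n"
json_text = '[{"geo": "SK", "year": 2016, "value": 239.0}, {"geo": "SI", "year": 2016, "value": 118.2}]'

# Either route produces the same tidy DataFrame
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.DataFrame(json.loads(json_text))
```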
&lt;p&gt;Tidy datasets are easy to manipulate, model, and visualize, and have a
specific structure: each variable is a column, each observation is a
row, and each type of observational unit is a table. This makes the data
easier to clean, and far easier to use in a much wider range of
applications than the original data we used. In theory, this is a simple objective,
yet we find that even governmental statistical agencies&amp;ndash;and even scientific
publications&amp;ndash;often publish untidy data. This poses a significant problem that implies
productivity losses: tidying data requires long hours of investment, and if
a reproducible workflow is not used, data integrity can also be compromised:
chances are that the process of tidying will overwrite, delete, or omit a data point or a label.&lt;/p&gt;
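&lt;p&gt;&lt;em&gt;To make the tidy structure concrete, here is a sketch of tidying a typical untidy release where years are spread across columns (the figures are made up):&lt;/em&gt;&lt;/p&gt;

```python
import pandas as pd

# Untidy: one column per year, as statistical releases often arrive
wide = pd.DataFrame({
    "geo": ["SK", "SI"],
    "2015": [231.5, 112.4],
    "2016": [239.0, 118.2],
})

# Tidy: each variable is a column, each observation is a row
tidy = wide.melt(id_vars="geo", var_name="year", value_name="turnover")
tidy["year"] = tidy["year"].astype(int)
```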
&lt;td style=&#34;text-align: center;&#34;&gt;
&lt;figure  id=&#34;figure-tidy-datasetshttpsr4dshadconztidy-datahtml-are-easy-to-manipulate-model-and-visualize-and-have-a-specific-structure-each-variable-is-a-column-each-observation-is-a-row-and-each-type-of-observational-unit-is-a-table&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/blogposts_2021/tidy-8.png&#34; alt=&#34;[Tidy datasets](https://r4ds.had.co.nz/tidy-data.html) are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption&gt;
      &lt;a href=&#34;https://r4ds.had.co.nz/tidy-data.html&#34;&gt;Tidy datasets&lt;/a&gt; are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/td&gt;
&lt;p&gt;While the original data source, the Eurostat data warehouse, is also
accessible, we added value by bringing the data into a &lt;a href=&#34;https://www.jstatsoft.org/article/view/v059i10&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;tidy
format&lt;/a&gt;. Tidy data can
immediately be imported into a statistical application like SPSS or
STATA, or into your own database. It is immediately available for
plotting in Excel, OpenOffice, or Numbers.&lt;/p&gt;
&lt;h2 id=&#34;interoperability&#34;&gt;Interoperability&lt;/h2&gt;
&lt;p&gt;Our data can be easily imported with, or joined with data from other internal or external sources.&lt;/p&gt;
&lt;td style=&#34;text-align: center;&#34;&gt;
&lt;figure  id=&#34;figure-all-our-indicators-come-with-standardized-descriptive-metadata-and-statistical-processing-metadata-see-our-apihttpsapimusicdataobservatoryeudatabasemetadata&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/observatory_screenshots/DMO_API_metadata_table.png&#34; alt=&#34;All our indicators come with standardized descriptive metadata, and statistical (processing) metadata. See our [API](https://api.music.dataobservatory.eu/database/metadata/) &#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption&gt;
      All our indicators come with standardized descriptive metadata, and statistical (processing) metadata. See our &lt;a href=&#34;https://api.music.dataobservatory.eu/database/metadata/&#34;&gt;API&lt;/a&gt;
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/td&gt;
&lt;p&gt;All our indicators come with standardized descriptive metadata,
following two important standards, the &lt;a href=&#34;https://dublincore.org/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Dublin Core&lt;/a&gt; and
&lt;a href=&#34;https://datacite.org/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;DataCite&lt;/a&gt;–implementing not only the mandatory,
but the recommended descriptions, too. This will make it far easier to
connect the data with other data sources, e.g. turnover with the number of radio broadcasting enterprises or
radio stations within specific territories.&lt;/p&gt;
&lt;p&gt;Our passion for documentation standards and best practices goes much further: our data uses &lt;a href=&#34;https://sdmx.org/?page_id=3215/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Statistical Data and Metadata eXchange&lt;/a&gt; standardized codebooks, unit descriptions and other statistical and administrative metadata.&lt;/p&gt;
&lt;td style=&#34;text-align: center;&#34;&gt;
&lt;figure  id=&#34;figure-we-participate-in-scientific-workhttpsreprexnlpublicationeuropean_visibilitiy_2021-related-to-data-interoperability&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/reports/european_visbility_publication.png&#34; alt=&#34;We participate in [scientific work](https://reprex.nl/publication/european_visibilitiy_2021/) related to data interoperability.&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption&gt;
      We participate in &lt;a href=&#34;https://reprex.nl/publication/european_visibilitiy_2021/&#34;&gt;scientific work&lt;/a&gt; related to data interoperability.
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/td&gt;
&lt;h2 id=&#34;reuse&#34;&gt;Reuse&lt;/h2&gt;
&lt;p&gt;All our datasets come with standardized information about reusability.
We add citation and attribution data, and licensing terms. Most of our
datasets can be used without commercial restriction after acknowledging
the source, but we sometimes work with less permissive data licenses.&lt;/p&gt;
&lt;p&gt;In the case presented here, we added further value to encourage re-use. In addition to tidying, we
significantly increased the usability of public data by handling
missing cases. This is the subject of our next blogpost.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Are you a data user? Give us some feedback! Shall we do some further
automatic data enhancements with our datasets? Document with different
metadata? Link more information for business, policy, or academic use? Please
give us any &lt;a href=&#34;https://reprex.nl/#contact&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;feedback&lt;/a&gt;!&lt;/em&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Digital Music Observatory on MaMA 2021</title>
      <link>/slides/mama_2021/</link>
      <pubDate>Thu, 14 Oct 2021 12:15:00 +0000</pubDate>
      <guid>/slides/mama_2021/</guid>
      <description>
&lt;section data-noprocess data-shortcode-slide
  
      
      data-background-image=&#34;/slides/MaMA_2021/Slide1.jpg&#34;
  &gt;

&lt;hr&gt;

&lt;section data-noprocess data-shortcode-slide
  
      
      data-background-image=&#34;/slides/MaMA_2021/Slide2.jpg&#34;
  &gt;

&lt;hr&gt;

&lt;section data-noprocess data-shortcode-slide
  
      
      data-background-image=&#34;/slides/MaMA_2021/Slide3.jpg&#34;
  &gt;

&lt;hr&gt;

&lt;section data-noprocess data-shortcode-slide
  
      
      data-background-image=&#34;/slides/MaMA_2021/Slide4.jpg&#34;
  &gt;

&lt;hr&gt;
&lt;h1 id=&#34;use-cases&#34;&gt;Use Cases&lt;/h1&gt;
&lt;p&gt;Public advocacy reports, scientific uses&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://music.dataobservatory.eu/publication/mce_empirical_streaming_2021/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;An Empirical Analysis of Music Streaming Revenues and Their Distribution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://music.dataobservatory.eu/publication/listen_local_2020/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Feasibility Study On Promoting Slovak Music In Slovakia &amp;amp; Abroad&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://music.dataobservatory.eu/publication/european_visibilitiy_2021/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Ensuring the Visibility and Accessibility of European Creative Content on the World Market: The Need for Copyright Data Improvement in the Light of New Technologies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://music.dataobservatory.eu/publication/ceereport_2020/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Central and Eastern Music Industry Report&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://music.dataobservatory.eu/publication/hungary_music_industry_2014/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Hungarian Music Industry Report&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://music.dataobservatory.eu/publication/slovak_music_industry_2019/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Slovak Music Industry Report&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://music.dataobservatory.eu/publication/private_copying_croatia_2019/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Private Copying in Croatia&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h1 id=&#34;use-cases-2&#34;&gt;Use Cases 2&lt;/h1&gt;
&lt;p&gt;Business Confidential Reports with Digital Music Observatory&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Damage claims in private copying&lt;/li&gt;
&lt;li&gt;Royalty setting for restaurants, hotels, broadcasting&lt;/li&gt;
&lt;li&gt;Music streaming market indicators&lt;/li&gt;
&lt;li&gt;Evidence for competition law / regulatory affairs&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h1 id=&#34;questions&#34;&gt;Questions?&lt;/h1&gt;
&lt;p&gt;&lt;a href=&#34;https://reprex.nl/#contact&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Email&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;LinkedIn: &lt;a href=&#34;https://www.linkedin.com/in/antaldaniel/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Daniel Antal&lt;/a&gt; - &lt;a href=&#34;https://www.linkedin.com/company/79286750&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Digital Music Observatory&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Open Data - The New Gold Without the Rush</title>
      <link>/slides/crunchconf_2021/</link>
      <pubDate>Thu, 14 Oct 2021 12:15:00 +0000</pubDate>
      <guid>/slides/crunchconf_2021/</guid>
      <description>
&lt;section data-noprocess data-shortcode-slide
  
      
      data-background-image=&#34;/slides/Crunchconf_2021/Slide1.jpg&#34;
  &gt;

&lt;hr&gt;

&lt;section data-noprocess data-shortcode-slide
  
      
      data-background-image=&#34;/slides/Crunchconf_2021/Slide2.jpg&#34;
  &gt;

&lt;hr&gt;

&lt;section data-noprocess data-shortcode-slide
  
      
      data-background-image=&#34;/slides/Crunchconf_2021/Slide3.jpg&#34;
  &gt;

&lt;hr&gt;

&lt;section data-noprocess data-shortcode-slide
  
      
      data-background-image=&#34;/slides/Crunchconf_2021/Slide4.jpg&#34;
  &gt;

&lt;hr&gt;

&lt;section data-noprocess data-shortcode-slide
  
      
      data-background-image=&#34;/slides/Crunchconf_2021/Slide5.jpg&#34;
  &gt;

&lt;hr&gt;

&lt;section data-noprocess data-shortcode-slide
  
      
      data-background-image=&#34;/slides/Crunchconf_2021/Slide6.jpg&#34;
  &gt;

&lt;hr&gt;

&lt;section data-noprocess data-shortcode-slide
  
      
      data-background-image=&#34;/slides/Crunchconf_2021/Slide7.jpg&#34;
  &gt;

&lt;hr&gt;

&lt;section data-noprocess data-shortcode-slide
  
      
      data-background-image=&#34;/slides/Crunchconf_2021/Slide8.jpg&#34;
  &gt;

&lt;hr&gt;

&lt;section data-noprocess data-shortcode-slide
  
      
      data-background-image=&#34;/slides/Crunchconf_2021/Slide9.jpg&#34;
  &gt;

&lt;hr&gt;

&lt;section data-noprocess data-shortcode-slide
  
      
      data-background-image=&#34;/slides/Crunchconf_2021/Slide10.jpg&#34;
  &gt;

&lt;hr&gt;

&lt;section data-noprocess data-shortcode-slide
  
      
      data-background-image=&#34;/slides/Crunchconf_2021/Slide11.jpg&#34;
  &gt;

&lt;hr&gt;

&lt;section data-noprocess data-shortcode-slide
  
      
      data-background-image=&#34;/slides/Crunchconf_2021/Slide12.jpg&#34;
  &gt;

&lt;hr&gt;

&lt;section data-noprocess data-shortcode-slide
  
      
      data-background-image=&#34;/slides/Crunchconf_2021/Slide13.jpg&#34;
  &gt;

&lt;hr&gt;

&lt;section data-noprocess data-shortcode-slide
  
      
      data-background-image=&#34;/slides/Crunchconf_2021/Slide14.jpg&#34;
  &gt;

&lt;hr&gt;

&lt;section data-noprocess data-shortcode-slide
  
      
      data-background-image=&#34;/slides/Crunchconf_2021/Slide15.jpg&#34;
  &gt;

&lt;hr&gt;

&lt;section data-noprocess data-shortcode-slide
  
      
      data-background-image=&#34;/slides/Crunchconf_2021/Slide16.jpg&#34;
  &gt;

&lt;hr&gt;

&lt;section data-noprocess data-shortcode-slide
  
      
      data-background-image=&#34;/slides/Crunchconf_2021/Slide17.jpg&#34;
  &gt;

&lt;hr&gt;
&lt;h1 id=&#34;use-cases&#34;&gt;Use Cases&lt;/h1&gt;
&lt;p&gt;Public advocacy reports, scientific uses&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://music.dataobservatory.eu/publication/mce_empirical_streaming_2021/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;An Empirical Analysis of Music Streaming Revenues and Their Distribution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://music.dataobservatory.eu/publication/listen_local_2020/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Feasibility Study On Promoting Slovak Music In Slovakia &amp;amp; Abroad&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://music.dataobservatory.eu/publication/european_visibilitiy_2021/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Ensuring the Visibility and Accessibility of European Creative Content on the World Market: The Need for Copyright Data Improvement in the Light of New Technologies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://music.dataobservatory.eu/publication/ceereport_2020/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Central and Eastern Music Industry Report&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://music.dataobservatory.eu/publication/hungary_music_industry_2014/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Hungarian Music Industry Report&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://music.dataobservatory.eu/publication/slovak_music_industry_2019/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Slovak Music Industry Report&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://music.dataobservatory.eu/publication/private_copying_croatia_2019/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Private Copying in Croatia&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h1 id=&#34;use-cases-2&#34;&gt;Use Cases 2&lt;/h1&gt;
&lt;p&gt;Business Confidential Reports with Digital Music Observatory&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Damage claims in private copying&lt;/li&gt;
&lt;li&gt;Royalty setting for restaurants, hotels, broadcasting&lt;/li&gt;
&lt;li&gt;Music streaming market indicators&lt;/li&gt;
&lt;li&gt;Evidence for competition law / regulatory affairs&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h1 id=&#34;questions&#34;&gt;Questions?&lt;/h1&gt;
&lt;p&gt;&lt;a href=&#34;https://reprex.nl/#contact&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Email&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;LinkedIn: &lt;a href=&#34;https://www.linkedin.com/in/antaldaniel/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Daniel Antal&lt;/a&gt; - &lt;a href=&#34;https://www.linkedin.com/company/79286750&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Digital Music Observatory&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>CCSI Data Observatory</title>
      <link>/post/2021-10-05-ccsi/</link>
      <pubDate>Wed, 06 Oct 2021 16:00:00 +0200</pubDate>
      <guid>/post/2021-10-05-ccsi/</guid>
      <description>&lt;p&gt;The creative and cultural sectors and industries are mainly made up of networks of freelancers and microenterprises, with very few medium-sized companies. Their economic performance, problems, and innovation capacities are hidden. Our open collaboration to create this data observatory is committed to changing this. Relying on modern data science, the re-use of open governmental data, open science data, and novel harmonized data collection, we aim to fill the gaps left in the official statistics of the European Union.&lt;/p&gt;
&lt;p&gt;We believe that introducing Open Policy Analysis standards with open data, open-source software and research automation can help us better understand how creative people and their enterprises and institutions add value to the European economy, how they create jobs, innovate, and increase the well-being of a diverse European society. Our collaboration is open to individuals and citizen scientists.&lt;/p&gt;
&lt;p&gt;The new observatory can be reached at &lt;a href=&#34;https://ccsi.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;ccsi.dataobservatory.eu&lt;/a&gt; and will be institutionally hosted by &lt;a href=&#34;https://www.ivir.nl/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;IViR&lt;/a&gt;, the &lt;em&gt;Institute for Information Law&lt;/em&gt; of the University of Amsterdam, where Reprex’s co-founder, Daniel Antal, will coordinate the development of this new, open scientific tool. Reprex will continue to develop the working model of the data observatory and to build open-source software tools within the &lt;a href=&#34;http://ropengov.org/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;rOpenGov community&lt;/a&gt; and the &lt;a href=&#34;https://ropengov.r-universe.dev/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;R-Universe&lt;/a&gt; initiative of rOpenSci.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://www.santannapisa.it/it&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Scuola Superiore di Studi Universitari e di Perfezionamento Sant’Anna&lt;/a&gt; and &lt;a href=&#34;https://www.unitn.it/en&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Università degli Studi di Trento&lt;/a&gt; (Italy); &lt;a href=&#34;https://www.create.ac.uk/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;University of Glasgow&lt;/a&gt; (United Kingdom); &lt;a href=&#34;https://www.ivir.nl/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Universiteit van Amsterdam&lt;/a&gt; and &lt;a href=&#34;https://pro.europeana.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Stichting Europeana&lt;/a&gt; from the Netherlands; the &lt;a href=&#34;https://www.maynoothuniversity.ie/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;National University of Ireland Maynooth&lt;/a&gt; (Ireland); &lt;a href=&#34;https://www.ut.ee/en/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Tartu Ulikool&lt;/a&gt; (Estonia); &lt;a href=&#34;https://u-szeged.hu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Szegedi Tudományegyetem&lt;/a&gt; (Hungary); &lt;a href=&#34;https://www.santamarialareal.org/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Fundacion Santa Maria La Real del Patrimonio Historico&lt;/a&gt; from Spain; the &lt;a href=&#34;https://www.kuleuven.be/kuleuven/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Katholieke Universiteit Leuven&lt;/a&gt; (Belgium); &lt;a href=&#34;https://cultureactioneurope.org/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Culture Action Europe AISBL&lt;/a&gt; and &lt;a href=&#34;https://www.ideaconsult.be/en/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;IDEA Strategische Economische Consulting&lt;/a&gt; (Belgium) and Reprex created the &lt;code&gt;RECREO&lt;/code&gt; consortium, which will mainly develop new policy evidence in the field of
innovation and inclusiveness for the creative and cultural sectors and industries. The consortium is applying for a Horizon Europe grant under the &lt;code&gt;HORIZON-CL2-2021-HERITAGE-01-03&lt;/code&gt; &lt;a href=&#34;https://ec.europa.eu/info/funding-tenders/opportunities/portal/screen/opportunities/topic-details/horizon-cl2-2021-heritage-01-03&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Cultural and creative industries as a driver of innovation and competitiveness&lt;/a&gt; call of the European Commission.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>The Data Sisyphus</title>
      <link>/post/2021-07-08-data-sisyphus/</link>
      <pubDate>Thu, 08 Jul 2021 09:00:00 +0200</pubDate>
      <guid>/post/2021-07-08-data-sisyphus/</guid>
      <description>&lt;td style=&#34;text-align: center;&#34;&gt;













&lt;figure  id=&#34;figure-sisyphus-was-punished-by-being-forced-to-roll-an-immense-boulder-up-a-hill-only-for-it-to-roll-down-every-time-it-neared-the-top-repeating-this-action-for-eternity--this-is-the-price-that-project-managers-and-analysts-pay-for-the-inadequate-documentation-of-their-data-assets&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/blogposts_2021/Sisyphus_Bodleian_Library.png&#34; alt=&#34;Sisyphus was punished by being forced to roll an immense boulder up a hill only for it to roll down every time it neared the top, repeating this action for eternity.  This is the price that project managers and analysts pay for the inadequate documentation of their data assets.&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption&gt;
      Sisyphus was punished by being forced to roll an immense boulder up a hill only for it to roll down every time it neared the top, repeating this action for eternity.  This is the price that project managers and analysts pay for the inadequate documentation of their data assets.
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/td&gt;
&lt;p&gt;&lt;em&gt;When was a file downloaded from the internet? What has happened with it since? Are there updates? Was a bibliographical reference prepared for quotations? Were missing values imputed? Was currency translated? Who knows about it – who created a dataset, who contributed to it? Which is an intermediate version of a spreadsheet file, and which is the final one, checked and approved by a senior manager?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Big data creates inequality and injustice. One aspect of this inequality is the cost of data processing and documentation – a greatly underestimated, and usually unreported, cost item. In small organizations, where there are no separate data science and data engineering roles, data is usually processed and documented by (junior) analysts or researchers. This is a very important source of the gap between Big Tech and everyone else: the data usually ends up very expensive, ill-formatted, and not readable by the computers that run machine learning and AI applications. The documentation steps are usually omitted entirely.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“Data is potential information, analogous to potential energy: work is required to release it.” &amp;ndash; Jeffrey Pomerantz&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Metadata – information about the history of the data and about how it can be technically and legally reused – has a hidden cost. Cheap or low-quality external data comes with poor or no metadata, and small organizations lack the resources to add high-quality metadata to their datasets. However, this only perpetuates the problem.&lt;/p&gt;
&lt;h2 id=&#34;metadata-unbillable-hours&#34;&gt;The hidden cost item behind the unbillable hours&lt;/h2&gt;
&lt;p&gt;As we have shown with our research partners, such metadata problems are not unique to data analysis. Independent artists and small labels suffer on music or book sales platforms because their copyrighted content is not well documented. If you automatically document tens of thousands of songs or datasets, the documentation cost per item is very small. If you do it manually, the cost may be higher than the expected revenue from the song, or than the total cost of the dataset itself. (See our research consortium&#39;s preprint paper: &lt;a href=&#34;https://dataandlyrics.com/publication/european_visibilitiy_2021/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Ensuring the Visibility and Accessibility of European Creative Content on the World Market: The Need for Copyright Data Improvement in the Light of New Technologies&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;In the short run, it seems logical for small consultancies, NGOs, and, as a matter of fact, musicians to give up on high-quality documentation and logging. In the long run, this has two devastating consequences: computers, such as machine learning algorithms, cannot read their documents, data, or songs; and as memory fades, the ill-documented resources need to be re-created, re-checked, and reformatted. Often they are even hard to find on your internal server or laptop archive.&lt;/p&gt;
&lt;p&gt;Metadata is a hidden destroyer of the competitiveness of corporate or academic research and of independent content management. It is never quoted on external data vendor invoices and is not planned as a cost item, because metadata – the description of a dataset, a document, a presentation, or a song – is meaningless without the resource it describes. You never buy metadata. But if your dataset comes without proper metadata documentation, you are bound, like Sisyphus, to search for it, to re-arrange it, to check its currency units, its digits, its formatting. Data analysts are reported to spend about 80% of their working hours on data processing rather than data analysis &amp;ndash; partly because data processing is a laborious task that computers can perform at scale far more cheaply, and partly because they do not know whether the person who sat at the same desk before them has already performed these tasks, or whether the person responsible for quality control has checked for errors.&lt;/p&gt;
&lt;td style=&#34;text-align: center;&#34;&gt;













&lt;figure  id=&#34;figure-uncut-diamonds-need-to-be-cut-polished-and-you-have-to-make-sure-that-they-come-from-a-legal-source-data-is-similar-it-needs-to-be-tidied-up-checked-and-documented-before-use-photo-dave-fischer&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/gems/Uncut-diamond_Edit.jpg&#34; alt=&#34;Uncut diamonds need to be cut, polished, and you have to make sure that they come from a legal source. Data is similar: it needs to be tidied up, checked and documented before use. Photo: Dave Fischer.&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption&gt;
      Uncut diamonds need to be cut, polished, and you have to make sure that they come from a legal source. Data is similar: it needs to be tidied up, checked and documented before use. Photo: Dave Fischer.
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/td&gt;
&lt;p&gt;Undocumented data is hardly informative – it may be a page in a book, a file in an obsolete file format on a governmental server, or an Excel sheet that you do not remember having checked for updates. Most data are useless, because we do not know how they can inform us, or whether we can trust them. The processing can be a daunting task, not to mention the boring and often neglected documentation duties after the dataset is final and pronounced error-free by the person in charge of quality control.&lt;/p&gt;
&lt;h2 id=&#34;observatory-metadata-services&#34;&gt;Our observatory automatically processes and documents the data&lt;/h2&gt;
&lt;p&gt;The good news about documentation and data validation costs is that they can be shared. If many users need GDP/capita data from all over the world in euros, then it is enough if a single entity, a data observatory, collects all GDP and population data expressed in dollars, korunas, and euros, makes sure that the latest data is correctly translated to euros, and then correctly divides it by the latest population figures. These tasks are error-prone and should not be repeated by every data journalist, NGO employee, PhD student or junior analyst. This is one of the services of our data observatory.&lt;/p&gt;
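&lt;p&gt;As a rough sketch of that shared service – with toy figures and an assumed exchange rate, not our production pipeline – the translation and division steps look like this in R:&lt;/p&gt;

```r
# Toy sketch: translate GDP reported in national currencies to euro,
# then divide by population to get GDP per capita in euro.
gdp <- data.frame(
  geo        = c("NL", "CZ"),
  currency   = c("EUR", "CZK"),
  gdp_bn     = c(800, 6000),    # billions of currency units, toy values
  population = c(17.5, 10.5),   # millions, toy values
  stringsAsFactors = FALSE
)
eur_rate <- c(EUR = 1, CZK = 1 / 25)  # assumed exchange rates

# Translate to euro, then divide by population
gdp$gdp_bn_eur     <- gdp$gdp_bn * eur_rate[gdp$currency]
gdp$gdp_capita_eur <- gdp$gdp_bn_eur * 1e9 / (gdp$population * 1e6)
```

&lt;p&gt;Doing this once, centrally, with validated exchange rates and the latest population figures removes exactly the duplication described above.&lt;/p&gt;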
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; The tidy data format means that the data has a uniform and clear data structure and semantics, therefore it can be automatically validated for many common errors and can be automatically documented by either our software or any other professional data science application. It is not as strict as the schema for a relational database, but it is strict enough to make, among other things, importing into a database easy.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; The descriptive metadata contains information on how to find the data, access the data, join it with other data (interoperability) and use it, and reuse it, even years from now. Among others, it contains file format information and intellectual property rights information.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; The processing metadata makes the data usable in strictly regulated professional environments, such as in public administration, law firms, investment consultancies, or in scientific research. We give you the entire processing history of the data, which makes peer-review or external audit much easier and cheaper.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; The authoritative copy is held in an independent repository and has a globally unique identifier, which protects you from accidental data loss and from mixing it up with an unfinished, untested version.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;td style=&#34;text-align: center;&#34;&gt;













&lt;figure  id=&#34;figure-cutting-the-dataset-to-a-format-with-clear-semantics-and-documenting-it-with-the-fair-metadata-concep-exponentially-increases-the-value-of-data-it-can-be-publisehd-or-sold-at-a-premium-photo-andere-andrehttpscommonswikimediaorgwindexphpcurid4770037&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/gems/Diamond_Polisher.jpg&#34; alt=&#34;Cutting the dataset to a format with clear semantics and documenting it with the FAIR metadata concept exponentially increases the value of data. It can be published or sold at a premium. Photo: [Andere Andre](https://commons.wikimedia.org/w/index.php?curid=4770037).&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption&gt;
      Cutting the dataset to a format with clear semantics and documenting it with the FAIR metadata concept exponentially increases the value of data. It can be published or sold at a premium. Photo: &lt;a href=&#34;https://commons.wikimedia.org/w/index.php?curid=4770037&#34;&gt;Andere Andre&lt;/a&gt;.
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/td&gt;
&lt;p&gt;While humans are much better at analysing information, and human agency is required for trustworthy AI, computers are much better at processing and documenting data. We apply these concepts to our data service: we always process the data into the tidy format, we create an authoritative copy, and we always automatically add descriptive and processing metadata.&lt;/p&gt;
&lt;h2 id=&#34;value-of-metadata&#34;&gt;The value of metadata&lt;/h2&gt;
&lt;p&gt;Metadata is often more valuable and more costly to make than the data itself, yet it remains an elusive concept for senior or financial management.  Metadata is information about how to correctly use the data and has no value without the data itself.  Data acquisition, such as buying from a data vendor, or paying an opinion polling company, or external data consultants appears among the material costs, but metadata is never sold alone, and you do not see its cost.&lt;/p&gt;
&lt;p&gt;In most cases, the reason why &lt;a href=&#34;https://dataandlyrics.com/post/2021-06-18-gold-without-rush/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;there is no gold rush for open data&lt;/a&gt; is the fact that while the EU member states release billions of euros&#39; worth of data for free, or at very low cost, annually, it comes without proper metadata.&lt;/p&gt;
&lt;td style=&#34;text-align: center;&#34;&gt;













&lt;figure  id=&#34;figure-data-as-serviceservicesdata-as-servicereusable-legal-easy-to-import-interoperable-always-fresh-data-in-tidy-formats-with-a-modern-api-photo-edgar-sotohttpsunsplashcomphotosgb0bzgae1nk&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/gems/edgar-soto-gb0BZGae1Nk-unsplash.jpg&#34; alt=&#34;[Data-as-Service](/services/data-as-service/)Reusable, legal, easy-to-import, interoperable, always fresh data in tidy formats with a modern API. Photo: [Edgar Soto](https://unsplash.com/photos/gb0BZGae1Nk).&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption&gt;
      &lt;a href=&#34;/services/data-as-service/&#34;&gt;Data-as-Service&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Reusable, legal, easy-to-import, interoperable, always fresh data in tidy formats with a modern API. Photo: &lt;a href=&#34;https://unsplash.com/photos/gb0BZGae1Nk&#34;&gt;Edgar Soto&lt;/a&gt;.
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/td&gt;
&lt;p&gt;If the data source is cheap or of low quality, you do not even get metadata. If you do not have it, it will show up as a human resource cost in research (when your analysts or junior researchers spend countless hours finding the missing metadata information on the correct use of the data) or in sales costs (when you try to reuse a research, consulting or legal product and you have to comb through your archive and retest elements again and again).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; The data, together with the descriptive and administrative metadata, and links to the use license and the authoritative copy can be found in our API. Try it out!&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Including Indicators from Arab Barometer in Our Observatory</title>
      <link>/post/2021-06-28-arabbarometer/</link>
      <pubDate>Mon, 28 Jun 2021 09:00:00 +0200</pubDate>
      <guid>/post/2021-06-28-arabbarometer/</guid>
      <description>&lt;p&gt;&lt;em&gt;A new version of the retroharmonize R package – which supports retrospective, ex post harmonization of survey data – was released yesterday on CRAN after peer review. It allows us to compare opinion polling data from the Arab Barometer with the Eurobarometer and the Afrobarometer. This is the first version released in the rOpenGov community, a community of R package developers working on open government data analytics and related topics.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Surveys are the most important data sources in social and economic
statistics – they ask people about their lives, their attitudes and
self-reported actions, or record data from companies and NGOs. Survey
harmonization makes survey data comparable across time and countries. It
is very important, because often we do not know without comparison if an
indicator value is &lt;em&gt;low&lt;/em&gt; or &lt;em&gt;high&lt;/em&gt;. If 40% of the people think that
&lt;em&gt;climate change is a very serious problem&lt;/em&gt;, it does not really tell us
much without knowing what percentage of the people answered this
question similarly a year ago, or in other parts of the world.&lt;/p&gt;
&lt;p&gt;With the help of Ahmed Shabani and Yousef Ibrahim, we created a third
case study after the
&lt;a href=&#34;https://retroharmonize.dataobservatory.eu/articles/eurobarometer.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Eurobarometer&lt;/a&gt;,
and
&lt;a href=&#34;https://retroharmonize.dataobservatory.eu/articles/afrobarometer.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Afrobarometer&lt;/a&gt;,
about working with the &lt;a href=&#34;https://retroharmonize.dataobservatory.eu/articles/arabbarometer.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Arab
Barometer&lt;/a&gt;
harmonized survey data files.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Ex ante&lt;/em&gt; survey harmonization means that researchers design
questionnaires that are asking the same questions with the same survey
methodology in repeated, distinct times (waves), or across different
countries with carefully harmonized question translations. &lt;em&gt;Ex post&lt;/em&gt;
harmonizations means that the resulting data has the same variable
names, same variable coding, and can be joined into a tidy data frame
for joint statistical analysis. While seemingly a simple task, it
involves plenty of metadata adjustments, because established survey
programs like Eurobarometer, Afrobarometer or Arab Barometer have
several decades of history, and several decades of coding practices and
file formatting legacy.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Variable harmonization&lt;/em&gt; means that if the same question is called
&lt;code&gt;Q108&lt;/code&gt; in one microdata source and &lt;code&gt;eval-parl-elections&lt;/code&gt;
in another, we make sure that they get a harmonized, machine-readable
name without spaces or special characters.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Variable label harmonization&lt;/em&gt; means that the same questionnaire
items get the same numeric coding and same categorical labels.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Missing case harmonization&lt;/em&gt; means that various forms of missingness
are treated the same way.&lt;/li&gt;
&lt;/ul&gt;
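&lt;p&gt;The first of these steps can be sketched in a few lines of base R – this is illustrative only, not the &lt;code&gt;retroharmonize&lt;/code&gt; API:&lt;/p&gt;

```r
# Two waves ask the same question under different variable names.
wave_a <- data.frame(Q108 = c(1, 2, 4))
wave_b <- data.frame(`eval-parl-elections` = c(2, 1, 3), check.names = FALSE)

# Variable harmonization: one machine-readable name, without spaces
# or special characters, in both sources.
harmonized_name <- "eval_parl_elections"
names(wave_a)[names(wave_a) == "Q108"] <- harmonized_name
names(wave_b)[names(wave_b) == "eval-parl-elections"] <- harmonized_name

# With identical names (and, after label harmonization, identical
# codings) the waves can be joined into one tidy data frame.
harmonized <- rbind(wave_a, wave_b)
```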














&lt;figure  id=&#34;figure-for-the-climate-awareness-dataset-get-the-country-averages-and-aggregates-from-zenodohttpsdoiorg105281zenodo5035562-and-the-plot-in-jpg-or-png-from-figsharehttpsdoiorg106084m9figshare14854359&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/blogposts_2021/arab_barometer_5_climate_change_by_country.png&#34; alt=&#34;For the climate awareness dataset get the country averages and aggregates from [Zenodo](https://doi.org/10.5281/zenodo.5035562), and the plot in `jpg` or `png` from [figshare](https://doi.org/10.6084/m9.figshare.14854359).&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      For the climate awareness dataset get the country averages and aggregates from &lt;a href=&#34;https://doi.org/10.5281/zenodo.5035562&#34;&gt;Zenodo&lt;/a&gt;, and the plot in &lt;code&gt;jpg&lt;/code&gt; or &lt;code&gt;png&lt;/code&gt; from &lt;a href=&#34;https://doi.org/10.6084/m9.figshare.14854359&#34;&gt;figshare&lt;/a&gt;.
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;In our new &lt;a href=&#34;https://retroharmonize.dataobservatory.eu/articles/arabbarometer.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Arab Barometer case
study&lt;/a&gt;,
the evaluation of parliamentary elections has the following labels. We
code them consistently &lt;code&gt;1 = free_and_fair&lt;/code&gt;, &lt;code&gt;2 = some_minor_problems&lt;/code&gt;,
&lt;code&gt;3 = some_major_problems&lt;/code&gt; and &lt;code&gt;4 = not_free&lt;/code&gt;.&lt;/p&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col style=&#34;width: 50%&#34; /&gt;
&lt;col style=&#34;width: 50%&#34; /&gt;
&lt;/colgroup&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td style=&#34;text-align: left;&#34;&gt;“0. missing”&lt;/td&gt;
&lt;td style=&#34;text-align: left;&#34;&gt;“1. they were completely free and fair”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td style=&#34;text-align: left;&#34;&gt;“2. they were free and fair, with some minor problems”&lt;/td&gt;
&lt;td style=&#34;text-align: left;&#34;&gt;“3. they were free and fair, with some major problems”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td style=&#34;text-align: left;&#34;&gt;“4. they were not free and fair”&lt;/td&gt;
&lt;td style=&#34;text-align: left;&#34;&gt;“8. i don’t know”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td style=&#34;text-align: left;&#34;&gt;“9. declined to answer”&lt;/td&gt;
&lt;td style=&#34;text-align: left;&#34;&gt;“Missing”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td style=&#34;text-align: left;&#34;&gt;“They were completely free and fair”&lt;/td&gt;
&lt;td style=&#34;text-align: left;&#34;&gt;“They were free and fair, with some minor breaches”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td style=&#34;text-align: left;&#34;&gt;“They were free and fair, with some major breaches”&lt;/td&gt;
&lt;td style=&#34;text-align: left;&#34;&gt;“They were not free and fair”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td style=&#34;text-align: left;&#34;&gt;“Don’t know”&lt;/td&gt;
&lt;td style=&#34;text-align: left;&#34;&gt;“Refuse”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td style=&#34;text-align: left;&#34;&gt;“Completely free and fair”&lt;/td&gt;
&lt;td style=&#34;text-align: left;&#34;&gt;“Free and fair, but with minor problems”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td style=&#34;text-align: left;&#34;&gt;“Free and fair, with major problems”&lt;/td&gt;
&lt;td style=&#34;text-align: left;&#34;&gt;“Not free or fair”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td style=&#34;text-align: left;&#34;&gt;“Don’t know (Do not read)”&lt;/td&gt;
&lt;td style=&#34;text-align: left;&#34;&gt;“Decline to answer (Do not read)”&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
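&lt;p&gt;Collapsing a few of the label variants above onto the consistent coding can be sketched in base R (illustrative only, not the &lt;code&gt;retroharmonize&lt;/code&gt; API):&lt;/p&gt;

```r
# Map several raw label variants onto one consistent numeric coding:
# 1 = free_and_fair, 2 = some_minor_problems,
# 3 = some_major_problems, 4 = not_free.
recode_map <- c(
  "they were completely free and fair"                = 1,
  "Completely free and fair"                          = 1,
  "they were free and fair, with some minor problems" = 2,
  "Free and fair, but with minor problems"            = 2,
  "Free and fair, with major problems"                = 3,
  "they were not free and fair"                       = 4,
  "Not free or fair"                                  = 4
)

raw        <- c("Completely free and fair", "they were not free and fair")
harmonized <- unname(recode_map[raw])  # 1, 4
```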
&lt;p&gt;Of course, this harmonization is essential to get clean results like this:&lt;/p&gt;














&lt;figure  id=&#34;figure-for-evaluation-or-reuse-of-parliamentary-elections-dataset-get-the-replication-data-and-the-code-from-the-zenodohhttpsdoiorg105281zenodo5034759-open-repository&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/blogposts_2021/arabb-comparison-country-chart.png&#34; alt=&#34;For evaluation or reuse of the parliamentary elections dataset get the replication data and the code from the [Zenodo](https://doi.org/10.5281/zenodo.5034759) open repository.&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      For evaluation or reuse of the parliamentary elections dataset, get the replication data and the code from the &lt;a href=&#34;https://doi.org/10.5281/zenodo.5034759&#34;&gt;Zenodo&lt;/a&gt; open repository.
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;In our case study, we had three forms of missingness: the respondent
&lt;em&gt;did not know&lt;/em&gt; the answer, the respondent &lt;em&gt;did not want&lt;/em&gt; to answer, and,
in some cases, the &lt;em&gt;respondent was not asked&lt;/em&gt;, because the
country held no parliamentary elections. While all of these answers must
be left out when calculating numerical averages, in a more detailed
categorical analysis they represent very different cases. A high level
of refusal to answer may itself be an indicator of suppressed democratic
opinion formation.&lt;/p&gt;
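&lt;p&gt;The distinction can be sketched in base R with an assumed coding (the package itself works with labelled survey data):&lt;/p&gt;

```r
# Assumed coding: substantive answers 1..4 become NA in the numeric
# vector, while the reason for missingness is kept as a category.
answer       <- c(1, 2, NA, 4, NA, 3)
missing_type <- c(NA, NA, "do_not_know", NA, "declined", NA)

# Numerical processing: every form of missingness is left out.
avg <- mean(answer, na.rm = TRUE)

# Categorical analysis: the forms of missingness stay distinct.
table(missing_type)
```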
&lt;p&gt;Survey harmonization across many countries entails tens of thousands of
small data management tasks, which, unless automatically documented,
logged, and created with reproducible code, add up to a hopelessly
error-prone process. We believe that our open-source software will bring
to light much new statistical information which, while legally
open, was never processed due to the large investment needed.&lt;/p&gt;
&lt;p&gt;We have also started building experimental APIs whose data is refreshed by running
&lt;a href=&#34;https://retroharmonize.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;retroharmonize&lt;/a&gt; regularly.
We will place cultural access and participation data in the &lt;a href=&#34;https://music.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Digital
Music Observatory&lt;/a&gt;, climate
awareness, policy support and self-reported mitigation strategies into
the &lt;a href=&#34;https://greendeal.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Green Deal Data
Observatory&lt;/a&gt;, and economy and
well-being data into our &lt;a href=&#34;https://economy.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Economy Data
Observatory&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;further-plans&#34;&gt;Further plans&lt;/h2&gt;
&lt;p&gt;Retrospective survey harmonization is a far more complex task than this
blogpost suggests, because established survey programs have gathered decades of legacy data in legacy coding schemes and legacy file formats. Putting the data right, and especially putting the invaluable descriptive and administrative (processing) metadata right, is a huge undertaking. We are releasing example code, datasets and charts
for researchers to compare our harmonized results with theirs, and to help
improve our software.&lt;/p&gt;
&lt;h3 id=&#34;use-our-software&#34;&gt;Use our software&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;retroharmonize&lt;/code&gt; R package can be freely used, modified and
distributed under the GPL-3 license. For the main developer and
contributors, see the
&lt;a href=&#34;https://retroharmonize.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;package&lt;/a&gt; homepage. If you
use it for your work, please kindly cite it as:&lt;/p&gt;
&lt;p&gt;Daniel Antal (2021). retroharmonize: Ex Post Survey Data Harmonization.
R package version 0.1.17. &lt;a href=&#34;https://doi.org/10.5281/zenodo.5034752&#34;&gt;https://doi.org/10.5281/zenodo.5034752&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Download the &lt;a href=&#34;/media/bibliography/cite-retroharmonize.bib&#34; target=&#34;_blank&#34;&gt;BibLaTeX entry&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;tutorial-to-work-with-the-arab-barometer-survey-data&#34;&gt;Tutorial to work with the Arab Barometer survey data&lt;/h3&gt;
&lt;p&gt;Daniel Antal, &amp;amp; Ahmed Shaibani. (2021, June 26). Case Study: Working
With Arab Barometer Surveys for the retroharmonize R package (Version
0.1.6). Zenodo. &lt;a href=&#34;https://doi.org/10.5281/zenodo.5034759&#34;&gt;https://doi.org/10.5281/zenodo.5034759&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;For the replication data, and to report potential
&lt;a href=&#34;https://github.com/rOpenGov/retroharmonize/issues&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;issues&lt;/a&gt; and
improvement suggestions with the code, see:&lt;/p&gt;
&lt;p&gt;Daniel Antal, &amp;amp; Ahmed Shaibani. (2021). Replication Data for the
retroharmonize R Package Case Study: Working With Arab Barometer Surveys
(Version 0.1.6) [Data set]. Zenodo.
&lt;a href=&#34;https://doi.org/10.5281/zenodo.5034741&#34;&gt;https://doi.org/10.5281/zenodo.5034741&lt;/a&gt;&lt;/p&gt;
&lt;h3 id=&#34;experimental-api&#34;&gt;Experimental API&lt;/h3&gt;
&lt;p&gt;We are also experimenting with the automated placement of authoritative
and citeable figures and datasets in open repositories. For the climate
awareness dataset get the country averages and aggregates from
&lt;a href=&#34;https://doi.org/10.5281/zenodo.5035562&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Zenodo&lt;/a&gt;, and the plot in &lt;code&gt;jpg&lt;/code&gt;
or &lt;code&gt;png&lt;/code&gt; from &lt;a href=&#34;https://doi.org/10.6084/m9.figshare.14854359&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;figshare&lt;/a&gt;.
Our plan is to release open data in a modern API with rich descriptive
metadata meeting the &lt;em&gt;Dublin Core&lt;/em&gt; and &lt;em&gt;DataCite&lt;/em&gt; standards, and further
administrative metadata for correctly coding, joining and further
manipulating our data, or for easy import into your database.&lt;/p&gt;
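&lt;p&gt;A minimal sketch of what such a record could look like (the field names follow the DataCite core properties and the Dublin Core element set; the values below are only illustrative, not our production API):&lt;/p&gt;

```python
# Illustrative DataCite-style descriptive metadata for a dataset deposition;
# identifier, creators, titles, publisher, publicationYear and resourceType
# are mandatory DataCite properties.
datacite_record = {
    'identifier': {'identifier': '10.5281/zenodo.5035562',
                   'identifierType': 'DOI'},
    'creators': [{'creatorName': 'Daniel Antal'}],
    'titles': [{'title': 'Climate awareness country averages'}],
    'publisher': 'Zenodo',
    'publicationYear': '2021',
    'resourceType': {'resourceTypeGeneral': 'Dataset'},
}

# Dublin Core uses a flatter vocabulary; a simple crosswalk of the same record:
dublin_core = {
    'dc:identifier': datacite_record['identifier']['identifier'],
    'dc:creator': [c['creatorName'] for c in datacite_record['creators']],
    'dc:title': datacite_record['titles'][0]['title'],
    'dc:publisher': datacite_record['publisher'],
    'dc:date': datacite_record['publicationYear'],
    'dc:type': datacite_record['resourceType']['resourceTypeGeneral'],
}
```

Serving both vocabularies from one internal record means a harvester that only speaks Dublin Core still finds the dataset, while DataCite-aware tools get the richer structure.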
&lt;h3 id=&#34;join-our-open-source-effort&#34;&gt;Join our open source effort&lt;/h3&gt;
&lt;p&gt;Want to help us improve our open data service? Include
&lt;a href=&#34;https://www.latinobarometro.org/lat.jsp&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Latinobarómetro&lt;/a&gt; and the
&lt;a href=&#34;https://caucasusbarometer.org/en/datasets/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Caucasus Barometer&lt;/a&gt; in our
offering? Join the rOpenGov community of R package developers, and our
open collaboration to create the automated data observatories. We are
not only looking for
&lt;a href=&#34;/authors/developer/&#34;&gt;developers&lt;/a&gt;,
but &lt;a href=&#34;/authors/curator/&#34;&gt;data
curators&lt;/a&gt; and
&lt;a href=&#34;/authors/team/&#34;&gt;service design
associates&lt;/a&gt;, too.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Open Data - The New Gold Without the Rush</title>
      <link>/post/2021-06-18-gold-without-rush/</link>
      <pubDate>Fri, 18 Jun 2021 17:00:00 +0200</pubDate>
      <guid>/post/2021-06-18-gold-without-rush/</guid>
      <description>&lt;p&gt;&lt;em&gt;If open data is the new gold, why do even those who release it fail to reuse it? We created an open collaboration of data curators and open-source developers to dig into novel open data sources and/or increase the usability of existing ones. We transform reproducible research software into research-as-service.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Every year, the EU announces that billions and billions of data are now “open” again, but this is not gold. At least not in the form of nicely minted gold coins, but in gold dust and nuggets found in the muddy banks of chilly rivers. There is no rush for it, because panning out its value requires a lot of hours of hard work. Our goal is to automate this work to make open data usable at scale, even in trustworthy AI solutions.&lt;/p&gt;














&lt;figure  id=&#34;figure-there-is-no-rush-for-it-because-panning-out-its-value-requires-a-lot-of-hours-of-hard-work-our-goal-is-to-automate-this-work-to-make-open-data-usable-at-scale-even-in-trustworthy-ai-solutions&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/slides/gold_panning_slide_notitle.png&#34; alt=&#34;There is no rush for it, because panning out its value requires a lot of hours of hard work. Our goal is to automate this work to make open data usable at scale, even in trustworthy AI solutions.&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      There is no rush for it, because panning out its value requires a lot of hours of hard work. Our goal is to automate this work to make open data usable at scale, even in trustworthy AI solutions.
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;Most open data is not public: it is not downloadable from the Internet – in EU parlance, “open” only means a legal entitlement to get access to it. And even in the rare cases when data is open and public, it is often mired in data quality issues. We are working on the prototypes of a data-as-service and research-as-service built with open-source statistical software that taps into various and often neglected open data sources.&lt;/p&gt;
&lt;p&gt;We are in the prototype phase in June, and we intend to have a well-functioning service by the time of the conference; because we are working only with open-source software elements, our technological readiness level is already very high. The novelty of our process is that we are trying to further develop and integrate a few open-source technology items into technologically and financially sustainable data-as-service and even research-as-service solutions.&lt;/p&gt;














&lt;figure  id=&#34;figure-our-review-of-about-80-eu-un-and-oecd-data-observatories-reveals-that-most-of-them-do-not-use-these-organizationss-open-data---instead-they-use-various-and-often-not-well-processed-proprietary-sources&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/observatory_screenshots/observatory_collage_16x9_800.png&#34; alt=&#34;Our review of about 80 EU, UN and OECD data observatories reveals that most of them do not use these organizations&amp;#39; open data - instead they use various, and often not well processed, proprietary sources.&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      Our review of about 80 EU, UN and OECD data observatories reveals that most of them do not use these organizations&amp;rsquo; open data - instead they use various, and often not well processed, proprietary sources.
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;We are taking a new and modern approach to the &lt;code&gt;data observatory&lt;/code&gt; concept, and modernizing it with the application of 21st century data and metadata standards, the new results of reproducible research and data science. Various UN and OECD bodies, and particularly the European Union, support or maintain more than 60 data observatories, or permanent data collection and dissemination points, but even these do not use these organizations&amp;rsquo; and their members&amp;rsquo; open data. We are building open-source data observatories, which run open-source statistical software that automatically processes and documents reusable public sector data (from public transport, meteorology, tax offices, taxpayer funded satellite systems, etc.) and reusable scientific data (from EU taxpayer funded research) into new, high quality statistical indicators.&lt;/p&gt;














&lt;figure  id=&#34;figure-we-are-taking-a-new-and-modern-approach-to-the-data-observatory-concept-and-modernizing-it-with-the-application-of-21st-century-data-and-metadata-standards-the-new-results-of-reproducible-research-and-data-science&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/slides/automated_observatory_value_chain.jpg&#34; alt=&#34;We are taking a new and modern approach to the ‘data observatory’ concept, and modernizing it with the application of 21st century data and metadata standards, the new results of reproducible research and data science&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      We are taking a new and modern approach to the ‘data observatory’ concept, and modernizing it with the application of 21st century data and metadata standards, the new results of reproducible research and data science
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;ul&gt;
&lt;li&gt;We are building various open-source data collection tools in R and Python to pull data from big data APIs and from legally open, but not public, and not well served data sources. For example, we are working on capturing representative data from the Spotify API or creating harmonized datasets from the Eurobarometer and Afrobarometer survey programs.&lt;/li&gt;
&lt;li&gt;Open data is usually not public; whatever is legally accessible is usually not ready to use for commercial or scientific purposes. In Europe, almost all taxpayer funded data is legally open for reuse, but it is usually stored in heterogeneous formats, processed for an original government or scientific need, and documented to varying and often low standards. Our expert data curators are looking for new data sources that should be (re-) processed and re-documented to be usable for a wider community. We would like to introduce our service flow, which touches upon many important aspects of data scientist, data engineer and data curatorial work.&lt;/li&gt;
&lt;li&gt;We believe that even such generally trusted data sources as Eurostat often need to be reprocessed, because various legal and political constraints do not allow the common European statistical services to provide optimal quality data – for example, on the regional and city levels.&lt;/li&gt;
&lt;li&gt;With &lt;a href=&#34;/authors/ropengov/&#34;&gt;rOpenGov&lt;/a&gt; and other partners, we are creating open-source statistical software in R to re-process these heterogeneous and low-quality data into tidy statistical indicators, and to automatically validate and document them.&lt;/li&gt;
&lt;li&gt;We are carefully documenting and releasing administrative, processing, and descriptive metadata, following international metadata standards, to make our data easy to find and easy to use for data analysts.&lt;/li&gt;
&lt;li&gt;We are automatically creating depositions and authoritative copies marked with an individual digital object identifier (DOI) to maintain data integrity.&lt;/li&gt;
&lt;li&gt;We are building simple databases and supporting APIs that release the data without restrictions, in a tidy format that is easy to join with other data, or easy to join into databases, together with standardized metadata.&lt;/li&gt;
&lt;li&gt;We maintain observatory websites (see: &lt;a href=&#34;https://music.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Digital Music Observatory&lt;/a&gt;, &lt;a href=&#34;https://greendeal.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Green Deal Data Observatory&lt;/a&gt;, &lt;a href=&#34;https://economy.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Economy Data Observatory&lt;/a&gt;) where not only the data is available, but we provide tutorials and use cases to make it easier to use them. Our mission is to show a modern, 21st century reimagination of the data observatory concept developed and supported by the UN, EU and OECD, and we want to show that modern reproducible research and open data could make the existing 60 data observatories and the planned new ones grow faster into data ecosystems.&lt;/li&gt;
&lt;/ul&gt;
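&lt;p&gt;To make the &lt;em&gt;tidy and easy to join&lt;/em&gt; point concrete, here is a small sketch in Python with pandas (the column names and values are hypothetical, not actual observatory data): tidy indicator tables with standardized identifier columns can be combined with a single join.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical tidy indicator tables: one observation per row, with
# standardized geo and time identifier columns shared across observatories.
climate = pd.DataFrame({
    'geo': ['NL', 'DE', 'HR'],
    'time': [2020, 2020, 2020],
    'climate_awareness': [0.81, 0.77, 0.69],
})
economy = pd.DataFrame({
    'geo': ['NL', 'DE', 'HR'],
    'time': [2020, 2020, 2020],
    'gdp_growth': [-3.9, -4.6, -8.0],
})

# Because both tables are tidy and share identifier columns, one merge suffices:
panel = climate.merge(economy, on=['geo', 'time'])
print(panel)
```

The same shape works for import into a relational database, where the identifier columns become the join keys.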
&lt;p&gt;We are working around the open collaboration concept, which is well-known in open source software development and reproducible science, but we try to make this agile project management methodology more inclusive, bringing data curators and various institutional partners into this approach. Based around our early-stage startup, Reprex, and the open-source developer community rOpenGov, we are working together with other developers, data scientists, and domain specific data experts in climate change and mitigation, antitrust and innovation policies, and various aspects of the music and film industry.&lt;/p&gt;














&lt;figure  id=&#34;figure-our-open-collaboration-is-truly-open-new-data-curatorsauthorscuratordevelopersauthorsdeveloper-and-service-designersauthorsteam-even-volunteers-and-citizen-scientists-are-welcome-to-join&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/observatory_screenshots/dmo_contributors.png&#34; alt=&#34;Our open collaboration is truly open: new [data curators](/authors/curator/),[developers](/authors/developer/) and [service designers](/authors/team/), even volunteers and citizen scientists are welcome to join.&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      Our open collaboration is truly open: new &lt;a href=&#34;/authors/curator/&#34;&gt;data curators&lt;/a&gt;,&lt;a href=&#34;/authors/developer/&#34;&gt;developers&lt;/a&gt; and &lt;a href=&#34;/authors/team/&#34;&gt;service designers&lt;/a&gt;, even volunteers and citizen scientists are welcome to join.
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;Our open collaboration is truly open: new &lt;a href=&#34;/authors/curator/&#34;&gt;data curators&lt;/a&gt;, data scientists and data engineers are welcome to join. We develop open-source software in an agile way, so you can join in with intermediate programming skills to build unit tests or add new functionality, and if you are a beginner, you can start with documentation and testing our tutorials. For business, policy, and scientific data analysts, we provide unexploited, exciting new datasets. Advanced developers can &lt;a href=&#34;/authors/developer/&#34;&gt;join&lt;/a&gt; our development team: the statistical data creation is mainly done in the R language, and the service infrastructure is built from Python and Go components.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Music Creators’ Earnings in the Streaming Era</title>
      <link>/post/2021-06-18-mce/</link>
      <pubDate>Fri, 18 Jun 2021 08:00:00 +0200</pubDate>
      <guid>/post/2021-06-18-mce/</guid>
      <description>













&lt;figure  &gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/blogposts_20121/dcms_economics_music_streaming.png&#34; alt=&#34;&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;/figure&gt;
&lt;p&gt;The idea of our &lt;a href=&#34;https://music.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Digital Music Observatory&lt;/a&gt; was brought to the UK policy debate on music streaming by the &lt;em&gt;Written evidence submitted by The state51 Music Group&lt;/em&gt; to the &lt;em&gt;Economics of music streaming review&lt;/em&gt; of the UK Parliament&#39;s DCMS Committee&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p&gt;The music industry requires a permanent market monitoring facility to win fights in competition tribunals, because it is increasingly disputing revenues with the world’s biggest data owners. This was precisely the role of the former CEEMID&lt;sup id=&#34;fnref:2&#34;&gt;&lt;a href=&#34;#fn:2&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;2&lt;/a&gt;&lt;/sup&gt; program, which was initiated by a group of collective management societies. Starting with three relatively data-poor countries, where data pooling allowed rightsholders to increase revenues, the CEEMID data collection program was extended in 2019 to 12 countries. The &lt;a href=&#34;https://ceereport2020.ceemid.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;final regional report&lt;/a&gt;, after the release of the detailed &lt;a href=&#34;https://music.dataobservatory.eu/publication/hungary_music_industry_2014/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Hungarian&lt;/a&gt;, &lt;a href=&#34;https://music.dataobservatory.eu/publication/slovak_music_industry_2019/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Slovak&lt;/a&gt; and &lt;a href=&#34;https://music.dataobservatory.eu/publication/private_copying_croatia_2019/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Croatian reports&lt;/a&gt; of CEEMID was sponsored by Consolidated Independent (of the &lt;em&gt;state51 music group&lt;/em&gt;).&lt;/p&gt;
&lt;p&gt;CEEMID was eventually transformed into the &lt;em&gt;Demo Music Observatory&lt;/em&gt; in 2020&lt;sup id=&#34;fnref:3&#34;&gt;&lt;a href=&#34;#fn:3&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;3&lt;/a&gt;&lt;/sup&gt;, following the planned structure of the &lt;a href=&#34;https://dataandlyrics.com/post/2020-11-16-european-music-observatory-feasibility/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;European Music Observatory&lt;/a&gt;, and validated in the world&amp;rsquo;s 2nd ranked university-backed incubator, the Yes!Delft AI+Blockchain Validation Lab. In 2021, under the final name &lt;a href=&#34;https://music.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Digital Music Observatory&lt;/a&gt;, it became open for any rightsholder or stakeholder organization or music research institute, and it is being launched with the help of the &lt;a href=&#34;https://dataandlyrics.com/post/2021-03-04-jump-2021/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;JUMP European Music Market Accelerator Programme&lt;/a&gt;, which is co-funded by the Creative Europe Programme of the European Union.&lt;/p&gt;
&lt;p&gt;In December 2020, we started investigating how the music observatory concept could be introduced in the UK, and how our data and analytical skills could be used in the &lt;a href=&#34;https://digit-research.org/research/related-projects/music-creators-earnings-in-the-streaming-era/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Music Creators’ Earnings in the Streaming Era&lt;/a&gt; (in short: MCE) project, which is taking place in parallel with the heated political debates around the DCMS inquiry. After the &lt;em&gt;state51 music group&lt;/em&gt; gave permission for the UK Intellectual Property Office to reuse the data that was originally published as the experimental &lt;a href=&#34;https://ceereport2020.ceemid.eu/market.html#recmarket&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;CEEMID-CI Streaming Volume and Revenue Indexes&lt;/a&gt;, we came to a cooperation agreement between the MCE Project and the &lt;a href=&#34;https://music.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Digital Music Observatory&lt;/a&gt;. We provided a detailed historical analysis and computer simulation for the MCE Project, and we will host all the data of the &lt;em&gt;Music Creators’ Earnings Report&lt;/em&gt; in our observatory, hopefully no later than early July 2021.&lt;/p&gt;














&lt;figure  id=&#34;figure-the-digital-music-observatoryhttpsmusicdataobservatoryeu-contributes-to-the-music-creators-earnings-in-the-streaming-era-project-with-understanding-the-level-of-justified-and-unjustified-differences-in-rightsholder-earnings-and-putting-them-into-a-broader-music-economy-context&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/observatory_screenshots/dmo_opening_screen.png&#34; alt=&#34;The [Digital Music Observatory](https://music.dataobservatory.eu/) contributes to the Music Creators’ Earnings in the Streaming Era project with understanding the level of justified and unjustified differences in rightsholder earnings, and putting them into a broader music economy context.&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      The &lt;a href=&#34;https://music.dataobservatory.eu/&#34;&gt;Digital Music Observatory&lt;/a&gt; contributes to the Music Creators’ Earnings in the Streaming Era project with understanding the level of justified and unjustified differences in rightsholder earnings, and putting them into a broader music economy context.
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;We started our cooperation with the two principal investigators of the project, &lt;a href=&#34;https://music.dataobservatory.eu/author/prof-david-hesmondhalgh/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Prof David Hesmondhalgh&lt;/a&gt; and &lt;a href=&#34;https://music.dataobservatory.eu/author/hyojung-sun/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Dr Hyojung Sun&lt;/a&gt;, back in April, and we will start releasing the findings and the data in July 2021.&lt;/p&gt;
&lt;h2 id=&#34;justified-and-unjustified-differences-in-earnings&#34;&gt;Justified and Unjustified Differences in Earnings&lt;/h2&gt;
&lt;p&gt;Stating that the greatest difference among rightsholders’ earnings is related to the popularity of their works and recorded fixations can appear banal and trivial. Yet, because many payout problems appear in the hard-to-describe long tail, understanding the &lt;em&gt;justified&lt;/em&gt; differences of rightsholder earnings is an important step towards identifying the &lt;em&gt;unjustified&lt;/em&gt; differences. It would be a breach of copyright law if less popular, or never played, artists received significantly more payment from streaming providers at the expense of popular artists. The earnings must reflect the difference in use and the economic value in use among rightsholders.&lt;/p&gt;
&lt;p&gt;In our analysis we quantify differences using the actual data of the &lt;a href=&#34;https://ceereport2020.ceemid.eu/market.html#ceemid-ci-volume-indexes&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;CEEMID-CI Streaming Indexes&lt;/a&gt;, created from hundreds of millions of data points, and computer simulations under realistic scenarios.&lt;/p&gt;
&lt;h3 id=&#34;justified-difference--changes-over-time&#34;&gt;Justified Difference &amp;amp; Changes Over Time&lt;/h3&gt;
&lt;p&gt;Among the &lt;em&gt;justified&lt;/em&gt; differences we quantify four objective justifications:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;The variability of the domestic price of a stream over time&lt;/em&gt; shows a diminishing, but variable value of streams. Depending on the release date of a recording, and how quickly it builds up or loses the interest of the audience, the same number of streams can result in about 28% different earnings. In the period 2015-2019, later releases were facing diminishing revenues on streaming platforms.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;The variability of international market share&lt;/em&gt; and international streaming prices in &lt;em&gt;International Competitiveness&lt;/em&gt;. Compared to UK streaming prices, most international markets, particularly emerging markets, have a much greater variability of streaming prices. The variability of prices in advanced foreign markets such as Germany was similar to that of the British market, but in emerging markets and smaller advanced markets–such as the Netherlands–we measured a variability of around 50-80%. Artists who have a significant foreign presence, depending on their foreign market share, can experience 2-3 times greater differences in earnings than artists whose audience is predominantly British.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
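&lt;p&gt;The first effect can be made concrete with a toy calculation in Python (the per-stream prices and stream counts below are invented for illustration, not CEEMID-CI index values): the same total number of streams earns different amounts depending on when the streams occur relative to the diminishing per-stream price.&lt;/p&gt;

```python
# Hypothetical per-stream prices (GBP), diminishing over 2015-2019:
price = {2015: 0.0050, 2016: 0.0046, 2017: 0.0042, 2018: 0.0039, 2019: 0.0036}

# Two releases with the same total streams (one million each), timed differently:
early = {2015: 600_000, 2016: 250_000, 2017: 100_000, 2018: 40_000, 2019: 10_000}
late = {2018: 400_000, 2019: 600_000}

def earnings(streams):
    # GBP earnings: each year's streams times that year's per-stream price.
    return sum(n * price[year] for year, n in streams.items())

ratio = earnings(early) / earnings(late)
print(round(ratio, 2))  # identical stream counts, earnings differ by ~28%
```

With these invented numbers the early release earns about 28% more than the late one from exactly the same number of streams, matching the magnitude of the effect we measured.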














&lt;figure  id=&#34;figure-in-our-simulated-results-the-depreciation-of-the-gbp-shielded-the-internationally-competitive-rightsholders-from-a-significant-part-of-the-otherwise-negative-price-change-in-streaming-markets&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/reports/mce/currency-environment-1.png&#34; alt=&#34; In our simulated results, the depreciation of the GBP shielded the internationally competitive rightsholders from a significant part of the otherwise negative price change in streaming markets.&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      In our simulated results, the depreciation of the GBP shielded the internationally competitive rightsholders from a significant part of the otherwise negative price change in streaming markets.
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;ol start=&#34;3&#34;&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;The variability of the exchange rate&lt;/em&gt; that is applied when translating foreign currency revenues to the British pound in &lt;em&gt;Exchange Rate Effects&lt;/em&gt;. Our &lt;a href=&#34;https://ceereport2020.ceemid.eu/market.html#ceemid-ci-volume-indexes&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;CEEMID-CI Streaming Indexes&lt;/a&gt; index covers the post-Brexit referendum period, when the British pound was generally depreciating against most currencies. This resulted in a GBP-denominated translation gain for artists with a foreign presence. We show that the variability of the GBP exchange rates can add &lt;em&gt;bigger justified differences&lt;/em&gt; among rightsholders’ earnings than the entire British price variation. The exchange rate movements are typically in the range of 30%, or at the level of the British domestic price variations in streaming prices. In our simulated results, this effect shielded the internationally competitive rightsholders from a significant part of the otherwise negative price change in foreign markets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We also investigated the choice of &lt;em&gt;distribution model&lt;/em&gt;. Both models, the currently used &lt;em&gt;pro-rata&lt;/em&gt; model and the &lt;em&gt;user-centric&lt;/em&gt; distribution, which has many proponents (and was introduced by SoundCloud in 2021), change the earnings of artists. We think that both models represent a bad compromise, but they are legal, and a zero-sum change of the distribution model could potentially increase the income of less popular and older artists at the expense of very popular and younger artists. More about this in our forthcoming report!&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
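&lt;p&gt;The contrast between the two distribution models can be sketched with a toy revenue pool (two subscribers and two artists; all numbers are invented for illustration, not our simulation results): pro-rata splits the whole subscription pool by platform-wide stream shares, while user-centric splits each subscriber&#39;s fee by that subscriber&#39;s own listening.&lt;/p&gt;

```python
# Two subscribers each pay 10 units; a heavy listener mostly streams a
# popular artist, a light listener only streams a niche artist.
fees = {'heavy': 10.0, 'light': 10.0}
plays = {'heavy': {'popular': 99, 'niche': 1},
         'light': {'niche': 5}}

def pro_rata(fees, plays):
    # The whole pool is split by each artist's share of all streams.
    pool = sum(fees.values())
    totals = {}
    for user_plays in plays.values():
        for artist, n in user_plays.items():
            totals[artist] = totals.get(artist, 0) + n
    all_streams = sum(totals.values())
    return {a: pool * n / all_streams for a, n in totals.items()}

def user_centric(fees, plays):
    # Each subscriber's own fee is split by that subscriber's own streams.
    payout = {}
    for user, user_plays in plays.items():
        user_streams = sum(user_plays.values())
        for artist, n in user_plays.items():
            payout[artist] = payout.get(artist, 0) + fees[user] * n / user_streams
    return payout

print(pro_rata(fees, plays))      # niche artist gets about 5.7% of the pool
print(user_centric(fees, plays))  # niche artist's share rises to about 50.5%
```

Under these invented numbers the switch reallocates roughly half of the pool toward the niche artist, which illustrates why any change of distribution model is zero-sum between artist groups.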
&lt;p&gt;In our understanding, there are some known and some hypothetical causes of &lt;em&gt;unjustified&lt;/em&gt; earning differences.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;We did not have systemic data on the &lt;em&gt;uncollected revenues&lt;/em&gt;–these are earnings that are legally made, but due to documentation, matching, processing, accounting, or other problems, the earnings are not paid. We could not even attempt to estimate this problem in the absence of relevant British empirical data. The problem is likely to be greater in the case of composers than in producer and performer revenues. In ideal cases, of course, the unclaimed royalty is near 0% of the earnings; it seems that in advanced markets the magnitude of this problem is in the single digits, and in emerging markets much greater, sometimes up to 50%. Royalty distribution is a costly business, and the smaller the revenue, the smaller the cost base to manage billions of transactions and related micropayments. We are working with a large group of eminent copyright researchers to understand this problem better and provide regulatory solutions. See &lt;a href=&#34;https://dataandlyrics.com/post/2021-05-16-recommendation-outcomes/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Recommendation Systems: What can Go Wrong with the Algorithm? - Effects on equitable remuneration, fair value, cultural and media policy goals&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;More analysis is required to understand how the algorithmic, highly autonomous recommendation systems of digital platforms such as Spotify, YouTube, Apple, and Deezer, among others, impact music rightsholders’ earnings. There are some empirical findings that suggest that such biases are present in various platforms, but due to the high complexity of recommendation systems, it is impossible to intuitively assign blame to pre-existing user biases, wrong training datasets, improper algorithm design, and other factors. We are working with our data curators, competition economist &lt;a href=&#34;https://music.dataobservatory.eu/author/peter-ormosi/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Dr Peter Ormosi&lt;/a&gt;, anthropologist and data scientist &lt;a href=&#34;https://music.dataobservatory.eu/author/botond-vitos/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Dr Botond Vitos&lt;/a&gt; and musicologist &lt;a href=&#34;https://music.dataobservatory.eu/post/2021-06-08-introducing-dominika-semanakova/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Dominika Semaňáková - We Want Machine Learning Algorithms to Learn More About Slovak Music&lt;/a&gt; to understand what can go wrong here (see &lt;a href=&#34;https://dataandlyrics.com/post/2021-06-08-teach-learning-machines/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Trustworthy AI: Check Where the Machine Learning Algorithm is Learning From&lt;/a&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;There is always a hypothetical possibility that organizations with monopolistic power try to corner the market or make the playing field uneven. The music industry requires a permanent market monitoring facility to win fights in competition tribunals, because it is increasingly disputing revenues with the world’s biggest data owners. We are working with our data curators, competition economist &lt;a href=&#34;https://music.dataobservatory.eu/author/peter-ormosi/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Dr Peter Ormosi&lt;/a&gt; and copyright lawyer &lt;a href=&#34;https://music.dataobservatory.eu/post/2021-06-02-data-curator-eszter-kabai/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;dr Eszter Kabai - New Indicators for Royalty Pricing and Music Antitrust&lt;/a&gt; to find potential traces of an uneven playing field (see: &lt;a href=&#34;https://dataandlyrics.com/publication/music_level_playing_field_2021/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Music Streaming: Is It a Level Playing Field?&lt;/a&gt;.)&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;solidarity&#34;&gt;Solidarity &amp;amp; Equitable Remuneration&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Equitable remuneration&lt;/em&gt; is a legal concept with an economic aspect. In international labour law, it simply means that men and women should receive equal pay for equal work. Within the context of international copyright law, the concept was introduced by the Rome Convention, and there it means the same payment for the same use, regardless of genre, gender, or other unrelated characteristics of the rightsholder.&lt;/p&gt;
&lt;p&gt;While the word equitable in everyday usage often implies some level of equality and solidarity, in the context of royalty payments these concepts should not be conflated.&lt;/p&gt;














&lt;figure  id=&#34;figure-digital-music-observatoryhttpsmusicdataobservatoryeu-uses-harmonized-anonimous-surveys-conducted-among-musicians-to-find-out-about-their-living-conditions-compared-to-their-peers-in-other-countries-and-people-in-other-professions&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/comparative/difficulty_bills_levels.jpg&#34; alt=&#34;[Digital Music Observatory](https://music.dataobservatory.eu/) uses harmonized, anonymous surveys conducted among musicians to find out about their living conditions compared to their peers in other countries, and people in other professions.&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      &lt;a href=&#34;https://music.dataobservatory.eu/&#34;&gt;Digital Music Observatory&lt;/a&gt; uses harmonized, anonymous surveys conducted among musicians to find out about their living conditions compared to their peers in other countries, and people in other professions.
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;Solidarity is present in many royalty payout schemes, but it is unrelated to the legal concept of equitable remuneration. Music earnings are very heavily skewed towards a small number of very successful composers, performers, and producers. The music streaming licensing model has few elements of solidarity, unlike some of the licensing models that it is replacing–particularly public performance licensing. However, in those cases, the solidarity element is decided by the rightsholders themselves, and not by external parties like radio broadcasters or streaming service providers.&lt;/p&gt;
&lt;p&gt;The so-called socio-cultural funds that provide assistance for artists in financial need must be managed by rightsholders, not others, and the current streaming model makes the organization of such solidaristic action particularly difficult. Our data curator, &lt;a href=&#34;https://music.dataobservatory.eu/author/katie-long/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Katie Long&lt;/a&gt;, is working with us to find metrics and measurement possibilities for solidarity among rightsholders.&lt;/p&gt;
&lt;h2 id=&#34;the-size-of-the-pie-and-the-distribution-of-the-pie&#34;&gt;The Size of the Pie and the Distribution of the Pie&lt;/h2&gt;
&lt;p&gt;The current debate in the United Kingdom is often organized around the submission of the &lt;em&gt;#BrokenRecord&lt;/em&gt; campaign to the DCMS committee, which calls for a legal re-definition of equitable remuneration rights&lt;sup id=&#34;fnref:4&#34;&gt;&lt;a href=&#34;#fn:4&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;4&lt;/a&gt;&lt;/sup&gt;. This idea is not unique to the United Kingdom: in various European jurisdictions, performers fought similar campaigns for legislation or went to court, sometimes successfully, for example in Hungary.&lt;/p&gt;
&lt;p&gt;Another hot redistribution topic is the choice between the so-called &lt;em&gt;pro-rata&lt;/em&gt; versus &lt;em&gt;user-centric&lt;/em&gt; distribution of streaming royalties. We think that both models represent a bad compromise, but they are legal, and a zero-sum distribution change could potentially increase the income of less popular and older artists at the expense of very popular and younger artists. We instead propose an alternative approach of &lt;em&gt;artist-centric distribution&lt;/em&gt; that could potentially be a win-win for all rightsholder groups; further elaboration of the concept lies outside the scope of this report.&lt;/p&gt;
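&lt;p&gt;A toy calculation illustrates the difference between the two distribution models. This is a minimal sketch with invented numbers, not real market data:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Two users each pay 10; user A streams artist X 90 times and artist Y 10 times,
# user B streams only artist Y, 2 times.
streams &amp;lt;- data.frame(user   = c(&amp;quot;A&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;B&amp;quot;),
                      artist = c(&amp;quot;X&amp;quot;, &amp;quot;Y&amp;quot;, &amp;quot;Y&amp;quot;),
                      plays  = c(90, 10, 2))

# Pro-rata: the common pool (20) is split by each artist&#39;s share of all plays.
pool &amp;lt;- 20
pro_rata &amp;lt;- tapply(streams$plays, streams$artist, sum) / sum(streams$plays) * pool
# X receives 20 * 90/102 (about 17.65); Y receives 20 * 12/102 (about 2.35)

# User-centric: each user&#39;s 10 is split only among the artists that user played.
user_centric &amp;lt;- c(X = 10 * 90/100, Y = 10 * 10/100 + 10 * 2/2)
# X receives 9; Y receives 11
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The zero-sum nature of the change is visible even in this tiny example: the heavily streamed artist loses exactly what the niche artist gains.&lt;/p&gt;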














&lt;figure  id=&#34;figure-digital-music-observatoryhttpsmusicdataobservatoryeu-uses-monitors-volumes-prices-and-revenues-on-the-total-market-and-compares-them-to-calculated-fair-values-by-understand-the-entire-music-economy-we-can-highlight-when-music-is-devaluing-in-various-uses-or-countries&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/reports/mce/listen_hours_treemap_en.jpg&#34; alt=&#34;[Digital Music Observatory](https://music.dataobservatory.eu/) monitors volumes, prices and revenues on the total market, and compares them to calculated fair values. By understanding the entire music economy, we can highlight when music is being devalued in various uses or countries.&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      &lt;a href=&#34;https://music.dataobservatory.eu/&#34;&gt;Digital Music Observatory&lt;/a&gt; monitors volumes, prices and revenues on the total market, and compares them to calculated fair values. By understanding the entire music economy, we can highlight when music is being devalued in various uses or countries.
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;Our mission with the &lt;a href=&#34;https://music.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Digital Music Observatory&lt;/a&gt; is to help focus the policy debate with facts around the economics of music streaming. The legal concept of &lt;em&gt;equitable remuneration&lt;/em&gt; is inseparable from the economic concept of &lt;em&gt;fair valuation&lt;/em&gt;, and music streaming earnings cannot be subject to a valid economic analysis without analyzing &lt;em&gt;the economics of music&lt;/em&gt;. Streaming services are competing with digital downloads, physical sales, and radio broadcasting; and the media streaming of YouTube and similar services is competing with music streaming, radio, and television broadcasting as well as retransmissions. It is critically important to determine whether, in replacing earlier services and sales channels, the new streaming licensing model (a mix of mechanical and public performance licensing) is also capable of replacing the revenues for all rightsholders.&lt;/p&gt;
&lt;p&gt;The current streaming licensing model in Europe is a mix of mechanical and public performance rights. Therefore, when we are talking about music streaming, we must compare the streaming sub-market with the digital downloads, physical sales and private copying markets (mechanical licensing), and with the radio, television, cable and satellite retransmission markets (public performance licensing). Our experience outside the UK suggests that these replacement values are very low.&lt;/p&gt;
&lt;h2 id=&#34;join-us&#34;&gt;Join us&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Join our open collaboration Music Data Observatory team as a &lt;a href=&#34;https://music.dataobservatory.eu/author/new-curators/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;data curator&lt;/a&gt;, &lt;a href=&#34;https://music.dataobservatory.eu/author/new-developers/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;developer&lt;/a&gt; or &lt;a href=&#34;https://music.dataobservatory.eu/author/observatory-business-associate/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;business developer&lt;/a&gt;. More interested in antitrust, innovation policy or economic impact analysis? Try our &lt;a href=&#34;https://economy.dataobservatory.eu/#contributors&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Economy Data Observatory&lt;/a&gt; team! Or your interest lies more in climate change, mitigation or climate action? Check out our &lt;a href=&#34;https://greendeal.dataobservatory.eu/#contributors&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Green Deal Data Observatory&lt;/a&gt; team!&lt;/em&gt;&lt;/p&gt;
&lt;h2 id=&#34;footnote-references&#34;&gt;Footnote References&lt;/h2&gt;
&lt;section class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34; role=&#34;doc-endnote&#34;&gt;
&lt;p&gt;state51 Music Group. 2020. “Written Evidence Submitted by The state51 Music Group. Economics of Music Streaming Review. Response to Call for Evidence.” UK Parliament website. &lt;a href=&#34;https://committees.parliament.uk/writtenevidence/15422/html/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;https://committees.parliament.uk/writtenevidence/15422/html/&lt;/a&gt;.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:2&#34; role=&#34;doc-endnote&#34;&gt;
&lt;p&gt;Artisjus, HDS, SOZA, and Candole Partners. 2014. “Measuring and Reporting Regional Economic Value Added, National Income and Employment by the Music Industry in a Creative Industries Perspective. Memorandum of Understanding to Create a Regional Music Database to Support Professional National Reporting, Economic Valuation and a Regional Music Study.”&amp;#160;&lt;a href=&#34;#fnref:2&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:3&#34; role=&#34;doc-endnote&#34;&gt;
&lt;p&gt;Antal, Daniel. 2021. “Launching Our Demo Music Observatory.” &lt;em&gt;Data &amp;amp; Lyrics&lt;/em&gt;. Reprex. &lt;a href=&#34;https://dataandlyrics.com/post/2020-09-15-music-observatory-launch/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;https://dataandlyrics.com/post/2020-09-15-music-observatory-launch/&lt;/a&gt;.&amp;#160;&lt;a href=&#34;#fnref:3&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:4&#34; role=&#34;doc-endnote&#34;&gt;
&lt;p&gt;Gray, Tom. 2020. “#BrokenRecord Campaign Submission (Supplementary to Oral Evidence).” UK Parliament website. &lt;a href=&#34;https://committees.parliament.uk/writtenevidence/15512/html/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;https://committees.parliament.uk/writtenevidence/15512/html/&lt;/a&gt;.&amp;#160;&lt;a href=&#34;#fnref:4&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/section&gt;
</description>
    </item>
    
    <item>
      <title>Analyze Locally, Act Globally: New regions R Package Release</title>
      <link>/post/2021-06-16-regions-release/</link>
      <pubDate>Wed, 16 Jun 2021 12:00:00 +0200</pubDate>
      <guid>/post/2021-06-16-regions-release/</guid>
      <description>













&lt;figure  &gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/package_screenshots/regions_017_169.png&#34; alt=&#34;&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;/figure&gt;
&lt;p&gt;The new version of our &lt;a href=&#34;https://ropengov.org/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;rOpenGov&lt;/a&gt; R package
&lt;a href=&#34;https://regions.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;regions&lt;/a&gt; was released today on
CRAN. This package is one of the engines of our experimental open
data-as-service &lt;a href=&#34;https://greendeal.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Green Deal Data Observatory&lt;/a&gt;, &lt;a href=&#34;https://economy.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Economy Data Observatory&lt;/a&gt;, &lt;a href=&#34;https://music.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Digital Music Observatory&lt;/a&gt; prototypes, which aim to
place open data packages into open-source applications.&lt;/p&gt;
&lt;p&gt;In international comparison, the use of nationally aggregated indicators
has many disadvantages: the indicators exhibit very different levels of
homogeneity, and the data often offer too few observations for a
cross-sectional analysis. When comparing European countries, a few
missing cases can shrink the cross-section to around 20 countries, which
rules out many analytical methods. Working with sub-national statistics
has many advantages: the similar aggregation levels and the higher
number of observations allow more precise control of model parameters
and errors, and the number of observations grows from about 20 to
200-300.&lt;/p&gt;














&lt;figure  id=&#34;figure-the-change-from-national-to-sub-national-level-comes-with-a-huge-data-processing-price-internal-administrative-boundaries-their-names-codes-codes-change-very-frequently&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/blogposts_2021/indicator_with_map.png&#34; alt=&#34;The change from national to sub-national level comes with a huge data processing price: internal administrative boundaries, their names and codes change very frequently.&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      The change from national to sub-national level comes with a huge data processing price: internal administrative boundaries, their names and codes change very frequently.
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;Yet the change from national to sub-national level comes at a huge
data processing price. National boundaries are relatively stable, with
only a handful of changes in each recent decade, because changing them
requires a more-or-less global consensus. States, however, are free to
change their internal administrative boundaries, and they do so
frequently. As a result, the names, identification codes and boundary
definitions of sub-national regions change very often, and joining data
from different sources and different years can be very difficult.&lt;/p&gt;














&lt;figure  id=&#34;figure-our-regions-r-packagehttpsregionsdataobservatoryeu-helps-the-data-processing-validation-and-imputation-of-sub-national-regional-datasets-and-their-coding&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/blogposts_2021/recoded_indicator_with_map.png&#34; alt=&#34;Our [regions R package](https://regions.dataobservatory.eu/) helps the data processing, validation and imputation of sub-national, regional datasets and their coding.&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      Our &lt;a href=&#34;https://regions.dataobservatory.eu/&#34;&gt;regions R package&lt;/a&gt; helps the data processing, validation and imputation of sub-national, regional datasets and their coding.
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;The switch from a national to a sub-national level of analysis thus
comes with a huge price in data processing, validation and imputation,
and the
&lt;a href=&#34;https://regions.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;regions&lt;/a&gt; package aims to help this
process.&lt;/p&gt;
&lt;p&gt;You can review the problem, and the code that created the two map
comparisons, in the &lt;a href=&#34;https://regions.dataobservatory.eu/articles/maping.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Mapping Regional Data, Mapping Metadata
Problems&lt;/a&gt;
vignette of the package. A more detailed problem description can
be found in &lt;a href=&#34;https://regions.dataobservatory.eu/articles/Regional_stats.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Working With Regional, Sub-National Statistical
Products&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This package is an offspring of the
&lt;a href=&#34;https://ropengov.github.io/eurostat/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;eurostat&lt;/a&gt; package on
&lt;a href=&#34;https://ropengov.github.io/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;rOpenGov&lt;/a&gt;. It started as a tool to
validate and re-code regional Eurostat statistics, but it aims to be a
general solution for all sub-national statistics. It will be developed
in parallel with other rOpenGov packages.&lt;/p&gt;
&lt;h2 id=&#34;get-the-package&#34;&gt;Get the Package&lt;/h2&gt;
&lt;p&gt;You can install the development version from
&lt;a href=&#34;https://github.com/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;GitHub&lt;/a&gt; with:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;devtools::install_github(&amp;quot;rOpenGov/regions&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;or the released version from CRAN:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;install.packages(&amp;quot;regions&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
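&lt;p&gt;Once installed, the package can validate and recode regional identifiers. Below is a minimal sketch; the function and argument names follow the package documentation at the time of writing, but treat the exact signatures as assumptions and check the reference manual:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;library(regions)

# A small data frame mixing codes from different NUTS boundary definitions
dat &amp;lt;- data.frame(geo    = c(&amp;quot;FR&amp;quot;, &amp;quot;DEE32&amp;quot;, &amp;quot;UKI3&amp;quot;),
                  values = c(2, 3, 1))

# Flag which codes are valid in the NUTS2016 typology ...
validate_nuts_regions(dat, nuts_year = 2016)

# ... and recode the identifiers to the NUTS2013 definitions,
# so that data from different years can be joined
recode_nuts(dat, nuts_year = 2013)
&lt;/code&gt;&lt;/pre&gt;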
&lt;p&gt;You can review the complete package documentation on
&lt;a href=&#34;https://regions.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;regions.dataobservatory.eu&lt;/a&gt;. If
you find any problems with the code, please raise an issue on
&lt;a href=&#34;https://github.com/rOpenGov/regions&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;GitHub&lt;/a&gt;. Pull requests are welcome
if you agree with the &lt;a href=&#34;https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Contributor Code of
Conduct&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you use &lt;code&gt;regions&lt;/code&gt; in your work, please cite the
package as:
Daniel Antal. (2021, June 16). regions (Version 0.1.7). CRAN. &lt;a href=&#34;https://doi.org/10.5281/zenodo.4965909&#34;&gt;https://doi.org/10.5281/zenodo.4965909&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Download the &lt;a href=&#34;/media/bibliography/cite-regions.bib&#34; target=&#34;_blank&#34;&gt;BibLaTeX entry&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://cran.r-project.org/package=regions&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;&lt;img src=&#34;https://www.r-pkg.org/badges/version/regions&#34; alt=&#34;CRAN\_Status\_Badge&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;join-us&#34;&gt;Join us&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Join our open collaboration Green Deal Data Observatory team as a &lt;a href=&#34;/authors/curator&#34;&gt;data curator&lt;/a&gt;, &lt;a href=&#34;/authors/developer&#34;&gt;developer&lt;/a&gt; or &lt;a href=&#34;/authors/team&#34;&gt;business developer&lt;/a&gt;. More interested in antitrust, innovation policy or economic impact analysis? Try our &lt;a href=&#34;https://economy.dataobservatory.eu/#contributors&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Economy Data Observatory&lt;/a&gt; team! Or your interest lies more in data governance, trustworthy AI and other digital market problems? Check out our &lt;a href=&#34;https://music.dataobservatory.eu/#contributors&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Digital Music Observatory&lt;/a&gt; team!&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://twitter.com/intent/follow?screen_name=GreenDealObs&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;&lt;img src=&#34;https://img.shields.io/twitter/follow/GreenDealObs.svg?style=social&#34; alt=&#34;Follow GreenDealObs&#34;&gt;&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Open Data is Like Gold in the Mud Below the Chilly Waves of Mountain Rivers</title>
      <link>/post/2021-06-10-founder-daniel-antal/</link>
      <pubDate>Thu, 10 Jun 2021 07:00:00 +0200</pubDate>
      <guid>/post/2021-06-10-founder-daniel-antal/</guid>
      <description>













&lt;figure  id=&#34;figure-open-data-is-like-gold-in-the-mud-below-the-chilly-waves-of-mountain-rivers-panning-it-out-requires-a-lot-of-patience-or-a-good-machine&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/slides/gold_panning_slide_notitle.png&#34; alt=&#34;Open data is like gold in the mud below the chilly waves of mountain rivers. Panning it out requires a lot of patience, or a good machine.&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      Open data is like gold in the mud below the chilly waves of mountain rivers. Panning it out requires a lot of patience, or a good machine.
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;&lt;strong&gt;As the founder of the automated data observatories that are part of Reprex’s core activities, what type of data do you usually use in your day-to-day work?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The automated data observatories are results of syndicated research, data pooling, and other creative solutions to the problem of missing or hard-to-find data. The music industry is a very fragmented industry, where market research budgets and data are scattered in tens of thousands of small organizations in Europe. Working for the music and film industry as a data analyst and economist was always a pain because most of the effort went into trying to find any data that could be analyzed. I spent most of the last 7-8 years trying to find any sort of information—from satellites to government archives—that could be formed into actionable data. I see three big sources of information: textual, numeric, and continuous recordings from on-site, off-site, and satellite sensors. I am much better with numbers than with natural language processing, and I am &lt;a href=&#34;https://greendeal.dataobservatory.eu/post/2021-06-06-tutorial-cds/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;improving with sensory sources&lt;/a&gt;. But technically, I can mint any systematic information—the text of an old book, a satellite image, or an opinion poll—into datasets.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;For you, what would be the ultimate dataset, or datasets that you would like to see in the Green Deal Data Observatory?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Our &lt;a href=&#34;https://retroharmonize.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;retroharmonize&lt;/a&gt; and &lt;a href=&#34;https://regions.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;regions&lt;/a&gt; packages can create regional statistics from &lt;a href=&#34;https://retroharmonize.dataobservatory.eu/articles/eurobarometer.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Eurobarometer&lt;/a&gt; and &lt;a href=&#34;https://retroharmonize.dataobservatory.eu/articles/afrobarometer.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Afrobarometer&lt;/a&gt; surveys on how people think locally about climate change. I would like to combine this with local information on observable climate change, such as drought, urban heat, and extreme weather conditions. Do people have to feel the pain of climate change to believe in the phenomenon? How do self-reported mitigation steps correlate with what people already feel in their local environment? Suzan is &lt;a href=&#34;/post/2021-06-07-introducing-suzan-sidal/&#34;&gt;talking&lt;/a&gt; about measuring mitigation and damage control, because she&amp;rsquo;s aware of the already present health risks in overheating urban environments. I am more interested in what people think.&lt;/p&gt;














&lt;figure  id=&#34;figure-see-our-case-studyhttpsgreendealdataobservatoryeupost2021-04-23-belgium-flood-insurance-on-connecting-local-tax-revenues-climate-awareness-poll-data-and-drought-data-in-belgium---we-want-to-extend-this-to-europe-and-then-to-africa-we-also-published-the-code-how-to-do-it-with-tutorials-1post2021-03-05-retroharmonize-climate-2httpsrpubscomantaldanielregions-ood21-for-our-international-open-data-day-2021-eventhttpsgreendealnetlifyapptalkreprex-open-data-day-2021&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/img/flood-risk/belgium_spei_2018.png&#34; alt=&#34;See our [case study](https://greendeal.dataobservatory.eu/post/2021-04-23-belgium-flood-insurance/) on connecting local tax revenues, climate awareness poll data and drought data in Belgium - we want to extend this to Europe and then to Africa. We also published the code how to do it with tutorials [1](/post/2021-03-05-retroharmonize-climate/), [2](https://rpubs.com/antaldaniel/regions-OOD21) for our [International Open Data Day 2021 Event](https://greendeal.netlify.app/talk/reprex-open-data-day-2021/).&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      See our &lt;a href=&#34;https://greendeal.dataobservatory.eu/post/2021-04-23-belgium-flood-insurance/&#34;&gt;case study&lt;/a&gt; on connecting local tax revenues, climate awareness poll data and drought data in Belgium - we want to extend this to Europe and then to Africa. We also published the code how to do it with tutorials &lt;a href=&#34;/post/2021-03-05-retroharmonize-climate/&#34;&gt;1&lt;/a&gt;, &lt;a href=&#34;https://rpubs.com/antaldaniel/regions-OOD21&#34;&gt;2&lt;/a&gt; for our &lt;a href=&#34;https://greendeal.netlify.app/talk/reprex-open-data-day-2021/&#34;&gt;International Open Data Day 2021 Event&lt;/a&gt;.
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;&lt;strong&gt;Is there a number or piece of information that recently surprised you? If so, what was it?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;There were a few numbers that surprised me, and some of them were brought up by our observatory teams. Karel is &lt;a href=&#34;post/2021-06-08-data-curator-karel-volckaert/&#34;&gt;talking&lt;/a&gt; about the fact that not all green energy is green at all: many hydropower stations contribute to the greenhouse effect rather than reduce it. Annette brought up the growing interest in the &lt;a href=&#34;/post/2021-06-09-team-annette-wong/&#34;&gt;Dalmatian breed&lt;/a&gt; after the Disney &lt;em&gt;101 Dalmatians&lt;/em&gt; movies, and it reminded me of the astonishing growth in interest in chess sets, chess tutorials, and platform subscriptions after the success of Netflix’s &lt;em&gt;The Queen’s Gambit&lt;/em&gt;.&lt;/p&gt;














&lt;figure  id=&#34;figure-the-queens-gambit-chess-boom-moves-online-by-rachael-dottle-on-bloombergcomhttpswwwbloombergcomgraphics2020-chess-boom&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/blogposts_2021/queens_gambit_bloomberg.png&#34; alt=&#34;*The Queen’s Gambit’ Chess Boom Moves Online By Rachael Dottle* on [bloomberg.com](https://www.bloomberg.com/graphics/2020-chess-boom/)&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      &lt;em&gt;‘The Queen’s Gambit’ Chess Boom Moves Online&lt;/em&gt; by Rachael Dottle on &lt;a href=&#34;https://www.bloomberg.com/graphics/2020-chess-boom/&#34;&gt;bloomberg.com&lt;/a&gt;
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;Annette is talking about the importance of cultural influencers, and on that theme, what could be more exciting than the fact that &lt;a href=&#34;https://www.netflix.com/nl-en/title/80234304&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Netflix’s biggest success&lt;/a&gt; so far is not a detective series or a soap opera but a coming-of-age story of a female chess prodigy. Intelligence is sexy, and we are in the intelligence business.&lt;/p&gt;
&lt;p&gt;But to cite a more serious and sobering number: I recently read with surprise that there were &lt;a href=&#34;https://www.theguardian.com/society/2021/may/27/number-of-smokers-has-reached-all-time-high-of-11-billion-study-finds&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;more people smoking cigarettes&lt;/a&gt; on Earth in 2021 than in 1990. Population growth in developing countries has replaced the shrinking number of smokers in developed countries. While I live in Europe, where smoking is declining strongly, it reminds me that Europe’s population is a small part of the world. We cannot take for granted that our home-grown experiences about the world are globally valid.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Do you have a good example of really good, or really bad use of data?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://fivethirtyeight.com/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;FiveThirtyEight.com&lt;/a&gt; had a wonderful podcast series, produced by Jody Avirgan, called &lt;em&gt;What’s the Point&lt;/em&gt;.  It is exactly about good and bad uses of data, and each episode is super interesting. Maybe the most memorable is &lt;em&gt;Why the Bronx Really Burned&lt;/em&gt;. New York City tried to measure fire response times, identify redundancies in service, and close or re-allocate fire stations accordingly. What resulted, though, was a perfect storm of bad data: The methodology was flawed, the analysis was rife with biases, and the results were interpreted in a way that stacked the deck against poorer neighborhoods. It is similar to many stories told in a very compelling argument by Catherine D’Ignazio and Lauren F. Klein in their much celebrated book,  &lt;em&gt;Data Feminism&lt;/em&gt;. Usually, the bad use of data starts with a bad data collection practice. Data analysts in corporations, NGOs, public policy organizations and even in science usually analyze the data that is available.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;You can find these examples, together with many more that our contributors recommend, in the motivating examples of &lt;a href=&#34;https://contributors.dataobservatory.eu/data-curators.html#create-new-datasets&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Create New Datasets&lt;/a&gt; and the &lt;a href=&#34;https://contributors.dataobservatory.eu/data-curators.html#critical-attitude&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Remain Critical&lt;/a&gt; parts of our onboarding material. We hope that more and more professionals and citizen scientists will help us to create high-quality and open data.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The real power lies in designing a data collection program. A consistent data collection program usually requires an investment that only powerful organizations, such as government agencies, very large corporations, or the richest universities can afford. You cannot really analyze data that is not collected and recorded; and usually what is not recorded is more interesting than what is. Our observatories want to democratize the data collection process and make it more widely available and shared through research automation and pooling.&lt;/p&gt;














&lt;figure  id=&#34;figure-you-cannot-really-analyze-the-data-that-is-not-collected-and-recorded-and-usually-what-is-not-recorded-is-more-interesting-than-what-is-our-observatories-want-to-democratize-the-data-collection-process-and-make-it-more-available-more-shared-with-research-automation-and-pooling&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/slides/value_added_from_automation.png&#34; alt=&#34;You cannot really analyze the data that is not collected and recorded; and usually what is not recorded is more interesting than what is. Our observatories want to democratize the data collection process and make it more available, more shared with research automation and pooling.&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      You cannot really analyze the data that is not collected and recorded; and usually what is not recorded is more interesting than what is. Our observatories want to democratize the data collection process and make it more available, more shared with research automation and pooling.
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;&lt;strong&gt;From your perspective, what do you see being the greatest problem with open data in 2021?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I have been involved with open data policies since 2004. The problem has not changed much: more and more data are available from governmental and scientific sources, but in a form that makes them useless. Data without a clear description and clear processing information is useless for analytical purposes: it cannot be integrated with other data, and it cannot be trusted and verified. If researchers or government entities that fall under the &lt;a href=&#34;https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=uriserv:OJ.L_.2019.172.01.0056.01.ENG&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Open Data Directive&lt;/a&gt; release data for reuse without descriptive or processing metadata, it is almost as if they did not release anything. You need this additional information to make valid analyses of the data, and reverse-engineering it may cost more than recollecting the data in a properly documented process. Our developers, particularly &lt;a href=&#34;/post/2021-06-04-developer-leo-lahti/&#34;&gt;Leo&lt;/a&gt; and &lt;a href=&#34;/post/2021-06-07-data-curator-pyry-kantanen/&#34;&gt;Pyry&lt;/a&gt;, talk eloquently about why you have to be careful even with governmental statistical products and constantly watch out for data quality.&lt;/p&gt;














&lt;figure  id=&#34;figure-our-apidata-is-not-only-publishing-descriptive-and-processing-metadata-alongside-with-our-data-but-we-also-make-all-critical-elements-of-our-processing-code-available-for-peer-review-on-ropengovauthorsropengov&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/observatory_screenshots/GDO_API_metadata_table.png&#34; alt=&#34;Our [API](/#data) is not only publishing descriptive and processing metadata alongside with our data, but we also make all critical elements of our processing code available for peer-review on [rOpenGov](/authors/ropengov/)&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      Our &lt;a href=&#34;/#data&#34;&gt;API&lt;/a&gt; not only publishes descriptive and processing metadata alongside our data; we also make all critical elements of our processing code available for peer review on &lt;a href=&#34;/authors/ropengov/&#34;&gt;rOpenGov&lt;/a&gt;
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;&lt;strong&gt;What do you think the Green Deal Data Observatory, and our other automated observatories do, to make open data more credible in the European economic policy community and be accepted as verified information?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Most of our work is in research automation, and a very large part of our efforts aims to reverse-engineer missing descriptive and processing metadata. In a way, I like to compare us to the working method of the open-source intelligence platform &lt;a href=&#34;https://www.bellingcat.com&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Bellingcat&lt;/a&gt;. They were able to use publicly available, &lt;a href=&#34;https://www.bellingcat.com/category/resources/case-studies/?fwp_tags=mh17&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;scattered information from satellites and social media&lt;/a&gt; to identify each member of the Russian military unit that illegally entered the territory of Ukraine and shot down Malaysia Airlines flight MH17 with 298, mainly Dutch, civilians on board.&lt;/p&gt;














&lt;figure  id=&#34;figure-how-we-create-value-for-research-oriented-consultancies-public-policy-institutes-university-research-teams-journalists-or-ngos&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/slides/automated_observatory_value_chain.jpg&#34; alt=&#34;How we create value for research-oriented consultancies, public policy institutes, university research teams, journalists or NGOs.&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      How we create value for research-oriented consultancies, public policy institutes, university research teams, journalists or NGOs.
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;We do not conduct such investigations, but we work very similarly in how we filter through many data sources and attempt to verify them when their descriptions and processing history are unknown. In recent years, we were able to restore the metadata of many European and African open data surveys, economic impact and environmental impact data, and other open datasets that had been lying around for years without users.&lt;/p&gt;
&lt;p&gt;Open data is like gold in the mud below the chilly waves of mountain rivers. Panning it out requires a lot of patience, or a good machine. I think we will arrive at findings as surprising and strong as Bellingcat’s, but we focus not on individual events and stories, but on social and environmental processes and changes.&lt;/p&gt;














&lt;figure  id=&#34;figure-join-our-open-collaboration-green-deal-data-observatory-team-as-a-data-curatorauthorscurator-developerauthorsdeveloper-or-business-developerauthorsteam-or-share-your-data-in-our-public-repository-green-deal-data-observatory-on-zenodohttpszenodoorgcommunitiesgreendeal_observatory&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/observatory_screenshots/greendeal_and_zenodo.png&#34; alt=&#34;Join our open collaboration Green Deal Data Observatory team as a [data curator](/authors/curator), [developer](/authors/developer) or [business developer](/authors/team), or share your data in our public repository [Green Deal Data Observatory on Zenodo](https://zenodo.org/communities/greendeal_observatory/).&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      Join our open collaboration Green Deal Data Observatory team as a &lt;a href=&#34;/authors/curator&#34;&gt;data curator&lt;/a&gt;, &lt;a href=&#34;/authors/developer&#34;&gt;developer&lt;/a&gt; or &lt;a href=&#34;/authors/team&#34;&gt;business developer&lt;/a&gt;, or share your data in our public repository &lt;a href=&#34;https://zenodo.org/communities/greendeal_observatory/&#34;&gt;Green Deal Data Observatory on Zenodo&lt;/a&gt;.
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;h2 id=&#34;join-us&#34;&gt;Join us&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Join our open collaboration Green Deal Data Observatory team as a &lt;a href=&#34;/authors/curator&#34;&gt;data curator&lt;/a&gt;, &lt;a href=&#34;/authors/developer&#34;&gt;developer&lt;/a&gt; or &lt;a href=&#34;/authors/team&#34;&gt;business developer&lt;/a&gt;. More interested in antitrust, innovation policy or economic impact analysis? Try our &lt;a href=&#34;https://economy.dataobservatory.eu/#contributors&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Economy Data Observatory&lt;/a&gt; team! Or your interest lies more in data governance, trustworthy AI and other digital market problems? Check out our &lt;a href=&#34;https://music.dataobservatory.eu/#contributors&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Digital Music Observatory&lt;/a&gt; team!&lt;/em&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Join Copernicus Climate Data Store Data with Socio-Economic and Opinion Poll Data</title>
      <link>/post/2021-06-06-tutorial-cds/</link>
      <pubDate>Sun, 06 Jun 2021 10:00:00 +0200</pubDate>
      <guid>/post/2021-06-06-tutorial-cds/</guid>
      <description>&lt;p&gt;In this series of blogposts we will show how to collect environmental
data from the EU’s &lt;a href=&#34;https://cds.climate.copernicus.eu/#!/home&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Copernicus Climate Data
Store&lt;/a&gt;, and bring it to a
data format that you can join with Eurostat’s socio-economic and
environmental data. We have shown in &lt;a href=&#34;https://greendeal.dataobservatory.eu/post/2021-04-23-belgium-flood-insurance/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;a previous
blogpost&lt;/a&gt;
how to connect this to survey (opinion poll) and tax data, and a real
policy problem in Belgium. We will now create subsequent tutorials to do
more!&lt;/p&gt;
&lt;p&gt;But first, why are we doing this? The European Union and its member
states have been releasing more and more data for open re-use every year
since 2003, yet these data are often not used in the EU’s data
dissemination projects (the observatories) or in EU-funded research. We
believe that there are &lt;a href=&#34;https://greendeal.dataobservatory.eu/project/eu-datathon_2021/#problem-statement&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;many
reasons&lt;/a&gt;
behind this. While more and more people can conduct business, scientific
or policy analysis programmatically or with statistical software,
knowing how to systematically collect data from this exponentially
growing supply is not everybody’s specialty. The lack of documentation
and the high re-processing and validation needs of open data are further
drawbacks.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://ropengov.org/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;rOpenGov&lt;/a&gt; has long been producing high-quality,
peer-reviewed R packages to work with open data, but their use is not
for everyone. In an open collaboration, where you can join, too, rOpenGov
&lt;a href=&#34;https://greendeal.dataobservatory.eu/#contributors&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;teamed up&lt;/a&gt; with
open source developers, knowledgeable data curators, and a service
developer team led by the Dutch reproducible research start-up
&lt;a href=&#34;https://reprex.nl/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Reprex&lt;/a&gt; to create a sustainable infrastructure that
permanently collects, processes, documents and visualizes open
data. We access open data (that is not always
available for direct download) and re-process it into usable data that is
&lt;a href=&#34;https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;tidy&lt;/a&gt;
and ready to be integrated with your existing data or databases. We are competing
for the &lt;a href=&#34;https://greendeal.dataobservatory.eu/project/eu-datathon_2021/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;EU
Datathon&lt;/a&gt;
Challenge 1: supporting the European Green Deal agenda with open data as a
service and research as a service, and you are more than welcome to
join our effort as a developer, a data curator, or as an occasional
contributor to open government packages.&lt;/p&gt;














&lt;figure  &gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/partners/rOpenGov-intro.png&#34; alt=&#34;&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;/figure&gt;
&lt;h2 id=&#34;register-to-the-copernicus-climate-data-store&#34;&gt;Register to the Copernicus Climate Data Store&lt;/h2&gt;
&lt;p&gt;Koen Hufkens, Reto Stauffer and Elio Campitelli created the
&lt;a href=&#34;https://bluegreen-labs.github.io/ecmwfr/index.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;ecmwfr&lt;/a&gt; R package
for programmatically accessing the Copernicus Data Store service. Follow
the &lt;a href=&#34;https://bluegreen-labs.github.io/ecmwfr/articles/cds_vignette.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;CDS Functionality
vignette&lt;/a&gt;
to get started.&lt;/p&gt;
&lt;p&gt;You will need to &lt;a href=&#34;https://cds.climate.copernicus.eu/user/91923/edit&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;register for CDS
services&lt;/a&gt; after
accepting the &lt;a href=&#34;https://cds.climate.copernicus.eu/disclaimer-privacy&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Terms and
conditions&lt;/a&gt;.&lt;/p&gt;














&lt;figure  &gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/tutorials/register_to_cds.png&#34; alt=&#34;&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;/figure&gt;
&lt;p&gt;Once registered, store your user ID and API key locally with the &lt;code&gt;wf_set_key()&lt;/code&gt; function:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;wf_set_key(user = &amp;quot;12345&amp;quot;, 
           key = &amp;quot;00000000-aaaa-b1b1-0000-a1a1a1a1a1a1&amp;quot;, 
           service = &amp;quot;cds&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can check if you were successful with:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ecmwfr::wf_get_key(user = &amp;quot;12345&amp;quot;, service = &amp;quot;cds&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;get-the-data&#34;&gt;Get the Data&lt;/h2&gt;
&lt;p&gt;Let us formulate our first request:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;request_lai_hv_2019_06 &amp;lt;- list(
  &amp;quot;dataset_short_name&amp;quot; = &amp;quot;reanalysis-era5-land-monthly-means&amp;quot;,
  &amp;quot;product_type&amp;quot;   = &amp;quot;monthly_averaged_reanalysis&amp;quot;,
  &amp;quot;variable&amp;quot;       = &amp;quot;leaf_area_index_high_vegetation&amp;quot;,
  &amp;quot;year&amp;quot;           = &amp;quot;2019&amp;quot;,
  &amp;quot;month&amp;quot;          =  &amp;quot;06&amp;quot;,
  &amp;quot;time&amp;quot;           = &amp;quot;00:00&amp;quot;,
  &amp;quot;area&amp;quot;           = &amp;quot;70/-20/30/60&amp;quot;,
  &amp;quot;format&amp;quot;         = &amp;quot;netcdf&amp;quot;,
  &amp;quot;target&amp;quot;         = &amp;quot;demo_file.nc&amp;quot;)

lai_hv_2019_06.nc  &amp;lt;- wf_request(user = &amp;quot;&amp;lt;your_ID&amp;gt;&amp;quot;,
                     request = request_lai_hv_2019_06 ,
                     transfer = TRUE,
                     path = &amp;quot;data-raw&amp;quot;,
                     verbose = FALSE)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;effective-leaf-area-index&#34;&gt;Effective Leaf Area Index&lt;/h2&gt;
&lt;p&gt;You can find this data either in global computer raster images, or in
re-processed monthly averages. Working with the raw data is not very
practical – in case of cloudy weather you have missing data, and the
files are extremely huge for a personal computer. For the purposes of
our &lt;a href=&#34;https://greendeal.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Green Deal Data Observatory&lt;/a&gt;
the monthly average values are far more practical, which are called
&lt;code&gt;monthly_averaged_reanalysis&lt;/code&gt; product types.&lt;/p&gt;
&lt;p&gt;For compatibility with other R packages, convert the data with the
&lt;a href=&#34;https://rspatial.org/raster/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;raster&lt;/a&gt; package from
&lt;a href=&#34;https://rspatial.org&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;rSpatial.org&lt;/a&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;lai_file &amp;lt;- here::here( &amp;quot;data-raw&amp;quot;, &amp;quot;demo_file.nc&amp;quot;)
lai_raster &amp;lt;- raster::raster(lai_file)

## Loading required namespace: ncdf4
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let us convert this to a &lt;code&gt;SpatialPointsDataFrame&lt;/code&gt; class, which is an
augmented data frame class with coordinates.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;LAI_df &amp;lt;- raster::rasterToPoints(lai_raster, fun=NULL, spatial=TRUE)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;get-the-map&#34;&gt;Get The Map&lt;/h2&gt;
&lt;p&gt;With the help of &lt;a href=&#34;http://ropengov.org/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;rOpenGov&lt;/a&gt;, we are creating
various R packages to programmatically access open data and put them
into the right format. The popular
&lt;a href=&#34;http://ropengov.github.io/eurostat/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;eurostat&lt;/a&gt; package is not only
useful to download data from Eurostat, but also to map it.&lt;/p&gt;
&lt;p&gt;In this case, we want to create regional maps. Europe has five levels of
geographical regions: &lt;code&gt;NUTS0&lt;/code&gt; for countries; &lt;code&gt;NUTS1&lt;/code&gt; for larger areas
like states or provinces; &lt;code&gt;NUTS2&lt;/code&gt; for smaller areas like counties; and
&lt;code&gt;NUTS3&lt;/code&gt; for even smaller areas. The &lt;code&gt;LAU&lt;/code&gt; level contains settlements and
their surrounding areas.&lt;/p&gt;
&lt;p&gt;Country borders change sometimes (think about the unification of
Germany, or the breakup of Czechoslovakia and Yugoslavia), but they are
relatively stable entities. Sub-national regional borders change
very frequently – there have been many thousands of changes in
Europe since 2000. This means that you must choose one regional boundary
definition. The latest edition is &lt;code&gt;NUTS2021&lt;/code&gt;, but most of the data
available is still in the &lt;code&gt;NUTS2016&lt;/code&gt; format, and often you will find
&lt;code&gt;NUTS2013&lt;/code&gt; or even &lt;code&gt;NUTS2010&lt;/code&gt; data around. Our &lt;a href=&#34;https://greendeal.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Green Deal Data
Observatory&lt;/a&gt; uses the &lt;code&gt;NUTS2016&lt;/code&gt;
definition, because it is by far the most widely used in 2021. An offspring of the
&lt;a href=&#34;http://ropengov.github.io/eurostat/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;eurostat&lt;/a&gt; package,
&lt;a href=&#34;https://regions.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;regions&lt;/a&gt;, helps you take care of
NUTS changes as you work, and can convert your data to &lt;code&gt;NUTS2021&lt;/code&gt; if
you later need it.&lt;/p&gt;
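&lt;p&gt;As a minimal, hypothetical sketch (assuming the &lt;code&gt;regions&lt;/code&gt; package’s &lt;code&gt;recode_nuts()&lt;/code&gt; function with its &lt;code&gt;geo_var&lt;/code&gt; and &lt;code&gt;nuts_year&lt;/code&gt; arguments, as described in the package documentation), such a conversion could look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;library(regions)

# A small, made-up data frame with NUTS2016 region codes
example_df &amp;lt;- data.frame(
  geo    = c(&amp;quot;NL12&amp;quot;, &amp;quot;FRC1&amp;quot;),
  values = c(1, 2))

# Recode the regional codes to the NUTS2021 definition
recode_nuts(example_df, geo_var = &amp;quot;geo&amp;quot;, nuts_year = 2021)
&lt;/code&gt;&lt;/pre&gt;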
&lt;pre&gt;&lt;code&gt;map_nuts_2 &amp;lt;- eurostat::get_eurostat_geospatial(
  resolution = &amp;quot;60&amp;quot;, nuts_level = &amp;quot;2&amp;quot;)

## sf at resolution 1:60 read from local file

## Warning in eurostat::get_eurostat_geospatial(resolution = &amp;quot;60&amp;quot;, nuts_level =
## &amp;quot;2&amp;quot;, : Default of &#39;make_valid&#39; for &#39;output_class=&amp;quot;sf&amp;quot;&#39; will be changed in the
## future (see function details).

plot(map_nuts_2)
&lt;/code&gt;&lt;/pre&gt;














&lt;figure  &gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/tutorials/cds_tutorial_plot_1.png&#34; alt=&#34;&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;/figure&gt;
&lt;p&gt;Our measurement of the average Effective Leaf Area Index is raster
data: it is given for many points of Europe’s map. What we need to do is
overlay this raster information on the statistical map of Europe. We
use the excellent &lt;a href=&#34;https://github.com/edzer/sp&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;sp: R Classes and Methods for Spatial
Data&lt;/a&gt; package for this purpose. The
&lt;code&gt;sp::over()&lt;/code&gt; function decides whether a point of Leaf Area Index measurement
falls into the polygon (shape) of a particular NUTS2 region, for
example, Zuid-Holland or South Holland in the Netherlands, or Saarland
in Germany. Then it averages the
measurements falling in the area with the &lt;code&gt;mean()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;LAI_nuts_2 = sp::over(sp::geometry(
  as(map_nuts_2, &#39;Spatial&#39;)), 
  LAI_df,
  fn=mean)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s call the average LAI index &lt;code&gt;lai&lt;/code&gt;, and bind it to the Eurostat map:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;names(LAI_nuts_2)[1] &amp;lt;- &amp;quot;lai&amp;quot;
LAI_sfdf &amp;lt;- bind_cols ( map_nuts_2, LAI_nuts_2 )
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you want to work with the data in a numeric context, you do not need
the geographical information, and you can “downgrade” the
&lt;code&gt;SpatialPointsDataFrame&lt;/code&gt; to a simple data frame.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;library(dplyr) # for select(), sample_n() and the pipe
set.seed(2019) # to always see the same sample
LAI_sfdf %&amp;gt;%
  as.data.frame() %&amp;gt;%
  select ( all_of(c(&amp;quot;NUTS_NAME&amp;quot;, &amp;quot;NUTS_ID&amp;quot;, &amp;quot;lai&amp;quot;)) ) %&amp;gt;%
  sample_n(10)

##                      NUTS_NAME NUTS_ID lai
## 281                       Vest    RO42  NA
## 125                     Kassel    DE73  NA
## 69              Friesland (NL)    NL12  NA
## 237 Agri, Kars, Igdir, Ardahan    TRA2  NA
## 273                East Anglia    UKH1  NA
## 119                Prov. Liège    BE33  NA
## 61                   Bourgogne    FRC1  NA
## 275                      Essex    UKH3  NA
## 282                   Istanbul    TR10  NA
## 174                    Leipzig    DED5  NA
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We’ll plot the map with &lt;a href=&#34;https://ggplot2.tidyverse.org/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;ggplot2&lt;/a&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;library(ggplot2)
library(sf)
ggplot(data=LAI_sfdf) + 
  geom_sf(aes(fill=lai),
          color=&amp;quot;dim grey&amp;quot;, size=.1) + 
  scale_fill_gradient( low =&amp;quot;#FAE000&amp;quot;, high = &amp;quot;#00843A&amp;quot;) +
  guides(fill = guide_legend(reverse=T, title = &amp;quot;LAI&amp;quot;)) +
  labs(title=&amp;quot;Leaf Area Index&amp;quot;,
       subtitle = &amp;quot;High vegetation half, NUTS2 regional average values&amp;quot;,
       caption=&amp;quot;\ua9 EuroGeographics for the administrative boundaries 
                \ua9 Copernicus Data Service, June 2019 average values
                Tutorial and ready-to-use data on greendeal.dataobservatory.eu&amp;quot;) +
  theme_light() + theme(legend.position=c(.88,.78)) +
  coord_sf(xlim=c(-22,48), ylim=c(34,70))
&lt;/code&gt;&lt;/pre&gt;














&lt;figure  &gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/tutorials/LAI_plot_demo.png&#34; alt=&#34;&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;/figure&gt;
&lt;h2 id=&#34;data-integrity&#34;&gt;Data Integrity&lt;/h2&gt;
&lt;p&gt;Our &lt;a href=&#34;https://greendeal.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Green Deal Data Observatory&lt;/a&gt;
has a data API where we place the new data with metadata for
programmatic download in CSV, JSON or even with SQL queries. For data
integrity purposes, we are placing an authoritative copy on &lt;a href=&#34;https://zenodo.org/communities/greendeal_observatory/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Zenodo
(Green Deal Data Observatory
Community)&lt;/a&gt;. You
can use this for scientific citations. We are also happy if you place
your own climate policy related research data here, so that we can
include it in our observatory. In our subsequent tutorials, we will show
how to do this programmatically in R. This particular dataset (not only
with the month June, which we selected to streamline the tutorial) is
available &lt;a href=&#34;https://zenodo.org/record/4903940#.YLyYrqgzbIU&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;here&lt;/a&gt; with
the digital object identifier
&lt;a href=&#34;http://doi.org/10.5281/zenodo.4903940&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;doi.org/10.5281/zenodo.4903940&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;join-us&#34;&gt;Join us&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Join our open collaboration Green Deal Data Observatory team as a &lt;a href=&#34;/authors/curator&#34;&gt;data curator&lt;/a&gt;, &lt;a href=&#34;/authors/developer&#34;&gt;developer&lt;/a&gt; or &lt;a href=&#34;/authors/team&#34;&gt;business developer&lt;/a&gt;. More interested in antitrust, innovation policy or economic impact analysis? Try our &lt;a href=&#34;https://economy.dataobservatory.eu/#contributors&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Economy Data Observatory&lt;/a&gt; team! Or your interest lies more in data governance, trustworthy AI and other digital market problems? Check out our &lt;a href=&#34;https://music.dataobservatory.eu/#contributors&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Digital Music Observatory&lt;/a&gt; team!&lt;/em&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Data API</title>
      <link>/data/api/</link>
      <pubDate>Tue, 01 Jun 2021 11:00:00 +0000</pubDate>
      <guid>/data/api/</guid>
      <description>&lt;p&gt;Our observatory has a new data API which allows access to our daily refreshing open data. You can access the API via &lt;a href=&#34;http://api.greendeal.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;api.greendeal.dataobservatory.eu&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;All the data and the metadata are available as open data, without database use restrictions, under the &lt;a href=&#34;https://opendatacommons.org/licenses/odbl/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;ODbL&lt;/a&gt; license. However, the metadata contents are not fully finalized yet. We are currently working on a solution that applies the &lt;a href=&#34;http://www.nature.com/articles/sdata201618&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;FAIR Guiding Principles for scientific data management and stewardship&lt;/a&gt;, and fulfills the mandatory requirements of the Dublin Core metadata standard as well as the &lt;a href=&#34;https://support.datacite.org/docs/datacite-metadata-schema-v44-mandatory-properties&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;mandatory requirements&lt;/a&gt; and most of the &lt;a href=&#34;https://support.datacite.org/docs/datacite-metadata-schema-v44-recommended-and-optional-properties&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;recommended requirements&lt;/a&gt; of DataCite.&lt;/p&gt;
&lt;h2 id=&#34;data-table&#34;&gt;Data table&lt;/h2&gt;
&lt;p&gt;The indicator table contains the actual values, and the various estimated/imputed values of the indicator, clearly marking missing values, too.&lt;/p&gt;














&lt;figure  id=&#34;figure-apigreendealdataobservatoryeuhttpsapigreendealdataobservatoryeudatabase-data-retrieval&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/observatory_screenshots/GDO_API_data_table.png&#34; alt=&#34;[api.greendeal.dataobservatory.eu](https://api.greendeal.dataobservatory.eu/database/) data retrieval&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      &lt;a href=&#34;https://api.greendeal.dataobservatory.eu/database/&#34;&gt;api.greendeal.dataobservatory.eu&lt;/a&gt; data retrieval
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;You can get the data in &lt;a href=&#34;https://api.greendeal.dataobservatory.eu/database/data.csv?_size=max&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;CSV&lt;/a&gt; or &lt;a href=&#34;https://api.greendeal.dataobservatory.eu/database/data.json&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;json&lt;/a&gt; format, or write SQL queries. (Tutorials in SQL, R, Python will be posted shortly.)&lt;/p&gt;
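&lt;p&gt;As a minimal sketch (assuming only that the CSV endpoint above serves a standard CSV file), you can read the full indicator table directly into R:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Read the full indicator table straight from the data API
gdo_data &amp;lt;- read.csv(
  &amp;quot;https://api.greendeal.dataobservatory.eu/database/data.csv?_size=max&amp;quot;)
head(gdo_data)
&lt;/code&gt;&lt;/pre&gt;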
&lt;h2 id=&#34;metadata-table&#34;&gt;Descriptive metadata table&lt;/h2&gt;














&lt;figure  id=&#34;figure-apigreendealdataobservatoryeuhttpsapigreendealdataobservatoryeudatabasemetadata-descriptive-metadata&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/observatory_screenshots/GDO_API_metadata_table.png&#34; alt=&#34;[api.greendeal.dataobservatory.eu](https://api.greendeal.dataobservatory.eu/database/metadata) descriptive metadata&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      &lt;a href=&#34;https://api.greendeal.dataobservatory.eu/database/metadata&#34;&gt;api.greendeal.dataobservatory.eu&lt;/a&gt; descriptive metadata
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;For further reference, see &lt;a href=&#34;/data/metadata/#descriptive-metadata&#34;&gt;Descriptive Metadata&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;processing-table&#34;&gt;Statistical Processing metadata table&lt;/h2&gt;














&lt;figure  id=&#34;figure-apigreendealdataobservatoryeuhttpsapigreendealdataobservatoryeudatabasecodebook-processing-metadata&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/observatory_screenshots/GDO_API_codebook_table.png&#34; alt=&#34;[api.greendeal.dataobservatory.eu](https://api.greendeal.dataobservatory.eu/database/codebook) processing metadata&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      &lt;a href=&#34;https://api.greendeal.dataobservatory.eu/database/codebook&#34;&gt;api.greendeal.dataobservatory.eu&lt;/a&gt; processing metadata
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;For further reference, see &lt;a href=&#34;/data/metadata/#processing-metadata&#34;&gt;Administrative (Processing) Metadata &lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;authoritative-copies&#34;&gt;Authoritative Copies&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://zenodo.org/communities/greendeal_observatory/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Greendeal Data Observatory on Zenodo&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Metadata</title>
      <link>/data/metadata/</link>
      <pubDate>Tue, 01 Jun 2021 11:00:00 +0000</pubDate>
      <guid>/data/metadata/</guid>
      <description>&lt;p&gt;Our observatory has a new data API which allows access to our daily refreshing open data. You can access the API via &lt;a href=&#34;http://api.greendeal.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;api.greendeal.dataobservatory.eu&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;All the data and the metadata are available as open data, without database use restrictions, under the &lt;a href=&#34;https://opendatacommons.org/licenses/odbl/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;ODbL&lt;/a&gt; license. However, the metadata contents are not finalized yet. We are currently working on a solution that applies the &lt;a href=&#34;http://www.nature.com/articles/sdata201618&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;FAIR Guiding Principles for scientific data management and stewardship&lt;/a&gt;, fulfills the mandatory requirements of the Dublin Core metadata standard, and at the same time meets the &lt;a href=&#34;https://support.datacite.org/docs/datacite-metadata-schema-v44-mandatory-properties&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;mandatory requirements&lt;/a&gt; and most of the &lt;a href=&#34;https://support.datacite.org/docs/datacite-metadata-schema-v44-recommended-and-optional-properties&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;recommended requirements&lt;/a&gt; of DataCite. These changes will be effective before 1 July 2021.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;Competition Data Observatory&lt;/strong&gt; temporarily shares an API with the &lt;a href=&#34;https://economy.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Economy Data Observatory&lt;/a&gt;, which serves as an incubator for similar economy-oriented reproducible research resources.&lt;/p&gt;














&lt;figure  id=&#34;figure-apigreendealdataobservatoryeuhttpsapigreendealdataobservatoryeudatabasemetadata-descriptive-metadata&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/observatory_screenshots/GDO_API_metadata_table.png&#34; alt=&#34;[api.greendeal.dataobservatory.eu](https://api.greendeal.dataobservatory.eu/database/metadata) descriptive metadata&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      &lt;a href=&#34;https://api.greendeal.dataobservatory.eu/database/metadata&#34;&gt;api.greendeal.dataobservatory.eu&lt;/a&gt; descriptive metadata
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;h2 id=&#34;descriptive-metadata&#34;&gt;Descriptive Metadata&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left&#34;&gt;&lt;/th&gt;
&lt;th style=&#34;text-align:center&#34;&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Identifier&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;An unambiguous reference to the resource within a given context. (Dublin Core item; several identifiers are allowed, and we will use several of them.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Creator&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;The main researchers involved in producing the data, or the authors of the publication, in priority order. To supply multiple creators, repeat this property. (Extends the Dublin Core with multiple authors, and legal persons, and adds affiliation data.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Title&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;A name given to the resource. Extends Dublin Core with alternative title, subtitle, translated Title, and other title(s).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Publisher&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;The name of the entity that holds, archives, publishes, prints, distributes, releases, issues, or produces the resource. This property will be used to formulate the citation, so consider the prominence of the role. For software, use Publisher for the code repository. (Dublin Core item.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Publication Year&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;The year when the data was or will be made publicly available.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Resource Type&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;We publish Datasets, Images, Report, and Data Papers. (Dublin Core item with controlled vocabulary.)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&#34;recommended-for-discovery&#34;&gt;Recommended for discovery&lt;/h3&gt;
&lt;p&gt;The &lt;strong&gt;Recommended&lt;/strong&gt; (R) properties are optional, but strongly recommended for interoperability.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left&#34;&gt;&lt;/th&gt;
&lt;th style=&#34;text-align:center&#34;&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Subject&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;The topic of the resource. (Dublin Core item.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Contributor&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;The institution or person responsible for collecting, managing, distributing, or otherwise contributing to the development of the resource. (Extends the Dublin Core with multiple authors, and legal persons, and adds affiliation data.) When applicable, we add Distributor (of the datasets and images), Contact Person, Data Collector, Data Curator, Data Manager, Hosting Institution, Producer (for images), Project Manager, Researcher, Research Group, Rightsholder, Sponsor, and Supervisor.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Date&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;A point or period of time associated with an event in the lifecycle of the resource, besides the Dublin Core minimum we add Collected, Created, Issued, Updated, and if necessary, Withdrawn dates to our datasets.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Related Identifier&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;An identifier or identifiers other than the primary Identifier applied to the resource being registered.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Rights&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;We give &lt;a href=&#34;https://spdx.org/licenses/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;SPDX License List&lt;/a&gt; standards rights description with URLs to the actual license. (Dublin Core item: Rights Management)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Description&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;Recommended for discovery. (Dublin Core item.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;GeoLocation&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;Similar to Dublin Core item Coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;Subject&lt;/code&gt; property: we need to set standard coding schemas for each observatory.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Contributor&lt;/code&gt; property:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;DataCurator&lt;/code&gt; the curator of the dataset, who sets the mandatory properties.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DataManager&lt;/code&gt; the person who keeps the dataset up-to-date.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ContactPerson&lt;/code&gt; the person who can be contacted for reuse requests or bug reports.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;Date&lt;/code&gt; property contains the following dates, which are set automatically by the &lt;a href=&#34;https://r.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;dataobservatory R package&lt;/a&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Updated&lt;/code&gt; when the dataset was updated;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;EarliestObservation&lt;/code&gt;, which is the earliest observation that is not backcast, estimated, or imputed;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;LatestObservation&lt;/code&gt;, which is the latest observation that is not backcast, estimated, or imputed;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;UpdatedatSource&lt;/code&gt;, when the raw data source was last updated.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;GeoLocation&lt;/code&gt; is automatically created by the &lt;a href=&#34;https://r.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;dataobservatory R package&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;Description&lt;/code&gt; property has optional elements, which we adopted as follows for the observatories:
&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;Abstract&lt;/code&gt; is a short, textual description; we try to automate its creation as much as possible, but some curatorial input is necessary.&lt;/li&gt;
&lt;li&gt;In the &lt;code&gt;TechnicalInfo&lt;/code&gt; sub-field, we automatically record the &lt;code&gt;utils::sessionInfo()&lt;/code&gt; for computational reproducibility. This is created by the &lt;a href=&#34;https://r.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;dataobservatory R package&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;In the &lt;code&gt;Other&lt;/code&gt; sub-field, we record the keywords for structuring the observatory.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
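&lt;p&gt;To make the automatically set &lt;code&gt;Date&lt;/code&gt; sub-fields concrete, here is a minimal sketch of the logic (in Python, although the observatories do the real work in the dataobservatory R package; the tuple layout and function name are hypothetical simplifications):&lt;/p&gt;

```python
from datetime import date

def date_properties(observations, source_updated):
    """Derive the Date sub-fields from a dataset.

    observations: (year, value, status) tuples, where status is an SDMX
    observation status code ("A" marks an actual, observed value).
    """
    # Only actual observations count for Earliest/LatestObservation;
    # backcast, estimated, and imputed values are excluded.
    actual_years = [year for year, value, status in observations if status == "A"]
    return {
        "Updated": date.today().isoformat(),
        "EarliestObservation": min(actual_years),
        "LatestObservation": max(actual_years),
        "UpdatedatSource": source_updated,
    }

obs = [(2018, 1.2, "A"), (2019, 1.3, "A"), (2020, None, "I"), (2021, 1.5, "A")]
props = date_properties(obs, source_updated="2021-11-01")
# The imputed 2020 value does not shift the observation dates:
# EarliestObservation is 2018 and LatestObservation is 2021.
```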
&lt;h3 id=&#34;optional&#34;&gt;Optional&lt;/h3&gt;
&lt;p&gt;The &lt;strong&gt;Optional&lt;/strong&gt; (O) properties are optional and provide richer description. For findability they are not so important, but to create a web service, they are essential. In the mandatory and recommended fields, we are following other metadata standards and codelists, but in the optional fields we have to build up our own system for the observatories.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left&#34;&gt;&lt;/th&gt;
&lt;th style=&#34;text-align:center&#34;&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Language&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;A language of the resource. (Dublin Core item.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Alternative Identifier&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;An identifier or identifiers other than the primary Identifier applied to the resource being registered.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Size&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;We give the size of the downloadable CSV dataset in bytes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Format&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;We give file format information. We mainly use CSV and JSON, and occasionally rds and SPSS types. (Dublin Core item.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Version&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;The version number of the resource.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Rights&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;We give &lt;a href=&#34;https://spdx.org/licenses/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;SPDX License List&lt;/a&gt; standards rights description with URLs to the actual license. (Dublin Core item: Rights Management)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Funding Reference&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;We provide the funding reference information when applicable. This is usually mandatory with public funds.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Related Item&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;We give information about our observatory partners&#39; related research products, awards, and grants (also a Dublin Core item, as Relation). We particularly include source information when the dataset is derived from another resource (which is a Dublin Core item).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;ul&gt;
&lt;li&gt;In the &lt;code&gt;Language&lt;/code&gt; we only use English (eng) at the moment.&lt;/li&gt;
&lt;li&gt;By default we do not use the &lt;code&gt;Alternative Identifier&lt;/code&gt; property; we will use it when the same dataset is used in several observatories.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;Size&lt;/code&gt; property is measured in bytes for the CSV representation of the dataset. During creation, the software writes a temporary CSV file to check that the dataset has no writing problems, and measures the dataset size.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;Version&lt;/code&gt; property needs further work. For a daily refreshing API we need to find an applicable versioning system.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;Funding reference&lt;/code&gt; will contain information for donors, sponsors, and co-financing partners.&lt;/li&gt;
&lt;li&gt;Our default setting for &lt;code&gt;Rights&lt;/code&gt; is the &lt;a href=&#34;https://spdx.org/licenses/CC-BY-NC-SA-4.0.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;CC-BY-NC-SA-4.0&lt;/a&gt; license, and we provide a URI for the license document.&lt;/li&gt;
&lt;li&gt;In the &lt;code&gt;RelatedItem&lt;/code&gt; we give information about:
&lt;ul&gt;
&lt;li&gt;The original (raw) data source.&lt;/li&gt;
&lt;li&gt;Methodological bibliography references, when needed.&lt;/li&gt;
&lt;li&gt;The open-source statistical software code that processed the data.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
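&lt;p&gt;The &lt;code&gt;Size&lt;/code&gt; measurement described above can be sketched as follows (a Python illustration of the idea only; the observatories implement it in R, and the function name is hypothetical): the dataset is written to a temporary CSV file, which doubles as a check that it serializes without problems, and the file size in bytes is recorded.&lt;/p&gt;

```python
import csv
import os
import tempfile

def measure_csv_size(header, rows):
    """Write the dataset to a temporary CSV file and return its size in bytes.

    The write doubles as a smoke test: a row that cannot be serialized
    raises an error here, before the dataset is published.
    """
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(rows)
        return os.path.getsize(path)
    finally:
        os.remove(path)

size_bytes = measure_csv_size(("geo", "year", "value"), [("BE", 2018, 1.2)])
```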
&lt;h2 id=&#34;processing-metadata&#34;&gt;Administrative (Processing) Metadata&lt;/h2&gt;
&lt;p&gt;As with diamonds, it is better to know the history of a dataset, too. Our administrative metadata contains codelists that follow the SDMX statistical metadata standards, and similarly structured information about the processing history of the dataset.&lt;/p&gt;














&lt;figure  id=&#34;figure-apigreendealdataobservatoryeuhttpsapigreendealdataobservatoryeudatabasecodebook-processing-metadata&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/media/img/observatory_screenshots/GDO_API_codebook_table.png&#34; alt=&#34;[api.greendeal.dataobservatory.eu](https://api.greendeal.dataobservatory.eu/database/codebook) processing metadata&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      &lt;a href=&#34;https://api.greendeal.dataobservatory.eu/database/codebook&#34;&gt;api.greendeal.dataobservatory.eu&lt;/a&gt; processing metadata
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;For further reference, see &lt;a href=&#34;https://r.dataobservatory.eu/articles/codebook.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;The codebook Class&lt;/a&gt;.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left&#34;&gt;&lt;/th&gt;
&lt;th style=&#34;text-align:center&#34;&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Observation Status&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;SDMX Code list for &lt;a href=&#34;https://sdmx.org/?sdmx_news=new-version-of-code-list-for-observation-status-version-2-2&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Observation Status 2.2&lt;/a&gt; (CL_OBS_STATUS), such as actual, missing, imputed, etc. values.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Method&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;If the value is estimated, we provide modelling information.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Unit&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;We provide the measurement unit of the data (when applicable.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Frequency&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;&lt;a href=&#34;https://sdmx.org/?page_id=3215/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;SDMX Code list for Frequency 2.1 (CL_FREQ)&lt;/a&gt; frequency values&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Codelist&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;Euro SDMX Codelist entries for the observational units, such as sex, etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Imputation&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;SDMX Code list for Imputation Methods (CL_IMPUT_METH) imputation values&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Estimation&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;The estimation methodology of data that we calculated, together with citation information and a URI to the actual processing code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Related Item&lt;/td&gt;
&lt;td style=&#34;text-align:center&#34;&gt;We give information about the software code that processed the data (both Dublin Core and DataCite compliant.)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;See an example in &lt;a href=&#34;https://r.dataobservatory.eu/articles/codebook.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;The codebook Class&lt;/a&gt;, an article of the &lt;a href=&#34;https://r.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;dataobservatory R package&lt;/a&gt;.&lt;/p&gt;
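&lt;p&gt;To make the role of the &lt;code&gt;Observation Status&lt;/code&gt; codes concrete, here is a sketch of one codebook entry (in Python; the codes are a small subset of the SDMX Code list for Observation Status, while the field names and function are hypothetical illustrations, not the actual API schema):&lt;/p&gt;

```python
# A small subset of the SDMX Code list for Observation Status.
OBS_STATUS = {
    "A": "Normal value",
    "M": "Missing value",
    "I": "Imputed value",
    "F": "Forecast value",
}

def codebook_row(year, value, status, method=None):
    """Build one processing-metadata (codebook) entry for an observation."""
    if status not in OBS_STATUS:
        raise ValueError("unknown observation status code: " + status)
    return {
        "year": year,
        "value": value,
        "obs_status": status,
        "obs_status_label": OBS_STATUS[status],
        # Modelling information is only meaningful for estimated values.
        "method": method if status in ("I", "F") else None,
    }

row = codebook_row(2020, 1.4, "I", method="linear interpolation")
```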
</description>
    </item>
    
    <item>
      <title>Is Drought Risk Uninsurable?</title>
      <link>/post/2021-04-23-belgium-flood-insurance/</link>
      <pubDate>Fri, 23 Apr 2021 00:00:00 +0000</pubDate>
      <guid>/post/2021-04-23-belgium-flood-insurance/</guid>
      <description>&lt;p&gt;Climate change is real and it is everywhere. Whereas island nations in
the Pacific are threatened with rising sea levels, Europe suffers from
ever more frequent scorching summers and resulting drought. Take the
case of Belgium, where heat waves in 2018 or 2020 have exacerbated an
already fragile drought risk profile. An all too tangible effect is that
houses built in areas where groundwater reservoirs are dwindling start
to rupture. What adds insult to injury is that insurers appear unwilling
to pay for damages: these climate-related risks simply did not feature
in insurance policies made up decades ago. The public and the media have
called upon the secretary of state responsible for consumer protection
to come up with a solution. (Download this
document in &lt;a href=&#34;/documents/Belgium-flood-risk-open-data.pdf&#34; target=&#34;_blank&#34;&gt;pdf&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;The Belgian insurance sector and government are currently investigating
how to address the ecological and financial issue. Should the risk
premium be raised on all insurance policies in an effort to spread risk,
or should only policy holders in designated risk areas be subject to a
raise in premia? Should urban planning initiatives and real estate
projects be required to assess these new types of risk beforehand?&lt;/p&gt;
&lt;p&gt;Driven by the Open Data Directive, we went in search for data at
government websites such as &lt;a href=&#34;http://waterinfo.be/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;waterinfo.be&lt;/a&gt;. That
proved harder than you would want, with quite a number of technological
barriers to cross. We independently explored the matter ourselves and
came up with this: a dynamic map that pictures the spatial distribution
of drought risk - as measured by a climate indicator known as the
&lt;a href=&#34;http://sac.csic.es/spei/map/maps.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;standardised precipitation-evapotranspiration
index&lt;/a&gt;.&lt;/p&gt;














&lt;figure  id=&#34;figure-actual-drying-soil&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/img/flood-risk/belgium_spei_2018.png&#34; alt=&#34;Actual drying soil.&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      Actual drying soil.
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;This SPEI index, measured as a standardized variate, shows the
deviations of the current climatic balance (precipitation minus
evapotranspiration potential) from the long-run average, and is presented
on a monthly basis. As the SPEI in this form is more predictive of flood
risk, we simply inverted the index to suggest a measure of drought
risk&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
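&lt;p&gt;A stylized sketch of how a standardized water-balance index of this kind is constructed, including the inversion into a drought measure (the real SPEI fits a probability distribution to the climatic balance; this plain z-score version in Python only illustrates the idea):&lt;/p&gt;

```python
import statistics

def simple_drought_index(precipitation, evapotranspiration):
    """Standardize the monthly climatic balance, then invert it.

    The balance is precipitation minus evapotranspiration potential; the
    real SPEI fits a probability distribution to it, while a plain
    z-score is used here only for illustration.
    """
    balance = [p - pet for p, pet in zip(precipitation, evapotranspiration)]
    mean = sum(balance) / len(balance)
    sd = statistics.stdev(balance)
    standardized = [(b - mean) / sd for b in balance]
    # Inverting the index, as in the text, makes dry months score positive.
    return [-1 * z for z in standardized]

dry = simple_drought_index([10, 50, 30], [40, 20, 30])
# The first month has the most negative balance, so it is the driest:
# dry[0] is 1.0.
```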
&lt;p&gt;Readers familiar with the “Kingdom by the sea” will remark that Belgium
cannot possibly lack precipitation. It rains more in the country than
the average Belgian cares for. As a result, the water
management system has historically been based on getting the water out
as quickly as possible to the sea, in particular through the Ijzer,
Schelde and Maas rivers. Add the abundance of concrete in the densely
populated country - and its grossly mismanaged urban planning - and the
capacity to hold water in surface and ground reservoirs is severely
impaired. With climate change in full swing, these historical practices
come back to haunt Belgium.&lt;/p&gt;
&lt;h2 id=&#34;are-belgians-aware-of-climate-risk&#34;&gt;Are Belgians aware of climate risk?&lt;/h2&gt;
&lt;p&gt;We projected the public opinion data from Eurobarometer 90.2 (fieldwork:
October-November 2018) on the municipal map of Belgium. We used the
answers to the multiple choice question
&lt;code&gt;QB1 Do you think that the following extreme weather events are due to climate change?&lt;/code&gt;
We highlighted areas where people find it more likely to be exposed to
&lt;code&gt;Droughts and wildfires.&lt;/code&gt; We used the GESIS datafile (European
Commission 2019) and the retroharmonize and regions packages (Antal
2021b, 2021a) to project the values to municipalities.&lt;/p&gt;
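&lt;p&gt;Projecting a regional survey estimate onto municipalities is, at its core, a join from each municipality to the value of its region; a toy sketch (in Python, with hypothetical shares; the real work is done by the R packages cited in the references):&lt;/p&gt;

```python
# Hypothetical survey shares per NUTS2 region (share of respondents who
# attribute droughts and wildfires to climate change).
nuts2_share = {"BE21": 0.61, "BE32": 0.55}

# Each municipality (LAU) belongs to exactly one NUTS2 region.
lau_to_nuts2 = {"Antwerpen": "BE21", "Charleroi": "BE32"}

# The projection is a lookup join: every municipality inherits the
# estimate of the region that contains it.
lau_share = {lau: nuts2_share[region] for lau, region in lau_to_nuts2.items()}
```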
&lt;p&gt;We see a weak spatial correlation between awareness of drought risk and
actual drought risk. The least affected parts of Belgium appear least
concerned. Despite its weakness, authorities and insurers can at least
build their mitigation policies on a hypothesis of positive correlation.
Of note is that concern for climate change effects follows regional,
linguistic and other patterns. The map in particular suggests the
Belgian provinces as markers for awareness.&lt;/p&gt;














&lt;figure  id=&#34;figure-perception-of-likely-drought&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/img/flood-risk/belgium_response_2018.png&#34; alt=&#34;Perception of likely drought.&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      Perception of likely drought.
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;h2 id=&#34;financial-capacity-to-pay-for-insurance&#34;&gt;Financial Capacity to Pay for Insurance&lt;/h2&gt;
&lt;p&gt;The next question we asked ourselves was whether the drought risk correlates
with the ability to pay as distributed among local communities. Whether
an insurance policy – or the regulation of insurance – attempts to
provide cover on an individual level (through increased premia), or
looks for local, regional or national mitigation strategies, the
income/tax base might be an appropriate benchmark to test for financial
capacity.&lt;/p&gt;














&lt;figure  id=&#34;figure-financial-capacity-to-mitigate-drought-risk&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/img/flood-risk/belgium_income_2018.png&#34; alt=&#34;Financial capacity to mitigate drought risk.&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      Financial capacity to mitigate drought risk.
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;The match between the (inverted) SPEI and total net income is less than
perfect. Some of the areas most at risk coincide with the highest-income
communities, but other threatened communities are low-income by Belgian
standards. The actual risk awareness and the financial capacity to solve
the problem are again only weakly correlated&lt;sup id=&#34;fnref:2&#34;&gt;&lt;a href=&#34;#fn:2&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;h2 id=&#34;correlation&#34;&gt;Correlation&lt;/h2&gt;
&lt;p&gt;Let’s have a look at the variables on &lt;code&gt;NUTS3&lt;/code&gt; level:&lt;/p&gt;














&lt;figure  id=&#34;figure-correlation-of-the-variables&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/img/flood-risk/var-cor-1.png&#34; alt=&#34;Correlation of the variables.&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      Correlation of the variables.
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Average SPEI&lt;/code&gt;, which is a measure of increasing humidity, is
negatively correlated with &lt;code&gt;dry&lt;/code&gt;, which we defined as &lt;code&gt;-1 x avg_spei&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Dry&lt;/code&gt; areas, which are losing water, are less populous and richer
regions.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Dry_18&lt;/code&gt; is a version of &lt;code&gt;dry&lt;/code&gt; that covers only the 12 months
before the Eurobarometer survey about opinions on climate change
effects, to see whether the recent memory of actual weather conditions
has had an effect on Belgians&#39; perception of these risks. It is
seemingly not correlated with worries about floods or droughts.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;dry_18&lt;/code&gt; and the &lt;code&gt;dry&lt;/code&gt; variables are strongly correlated. One
possible explanation is that the year before the survey was not an
unusual period; it fits the 2016-2020 trend very well.&lt;/li&gt;
&lt;li&gt;Worries about extreme weather conditions are correlated with each
other – i.e., some part of the population (concentrated
geographically) is far more concerned with climate change than
others.&lt;/li&gt;
&lt;/ul&gt;
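&lt;p&gt;The pairwise relationships listed above come from an ordinary correlation matrix; a minimal Pearson-correlation sketch in pure Python (with toy values, not the Belgian data):&lt;/p&gt;

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sx * sy)

avg_spei = [0.3, -0.1, 0.5, -0.4]   # toy values
dry = [-1 * v for v in avg_spei]    # dry is defined as -1 x avg_spei
# By construction the two variables are perfectly negatively correlated:
# pearson(avg_spei, dry) is -1 (up to floating-point rounding).
```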
&lt;p&gt;The same on municipality (local administrative unit) level:&lt;/p&gt;














&lt;figure  id=&#34;figure-correlation-on-the-level-of-municipalities&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;/img/flood-risk/cor-lau-1.png&#34; alt=&#34;Correlation on the level of municipalities.&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      Correlation on the level of municipalities.
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;The correlations with opinion polling data are a little bit distorted,
because the data is on &lt;code&gt;NUTS2&lt;/code&gt;, and to bring it down to &lt;code&gt;NUTS3&lt;/code&gt; or &lt;code&gt;LAU&lt;/code&gt;
level would be a complicated small area statistical estimation task. We
have also computed geospatial cross-correlation. Awareness of the
climate problem and the dryness in 2018 were positively correlated in
time – the drier the year was in an area, the more likely it was that
people were aware of the problem; and the poorer areas were more likely
to be afraid of this problem. The global spatial cross-correlation of
the drying and local income was very low. This is a neutral situation:
local income is not concentrated in the drying areas (which would be a
lucky coincidence) nor concentrated in the relatively stable areas.&lt;/p&gt;
&lt;p&gt;Generally, the problem map appears to be neutral to mildly favorable.
The financial capacity to solve the problem works neither for nor
against it, and awareness seems to be somewhat higher in
the more affected areas.&lt;/p&gt;
&lt;p&gt;The code is in &lt;code&gt;R/join_belgium_water_lau_dataset.R&lt;/code&gt; and
&lt;code&gt;R/join_belgium_water_nuts3_dataset.R&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;adverse-selection-and-climate-solidarity&#34;&gt;Adverse Selection and Climate Solidarity&lt;/h2&gt;
&lt;p&gt;In addition to these historical analyses that put the drought risk in
context, we are investigating whether climate data from integrated
climate models might be harnessed to predict medium- to longer-term risk
profiles on a spatially distributed basis. Urban planners, real estate
promoters, individual households and governments will need to rely on
such predictions to better adapt to climate change and reverse some of
the earlier policy choices we mentioned.&lt;/p&gt;
&lt;p&gt;To quote the Nobel Prize winning thoughts of Finn E. Kydland and Edward
C. Prescott (Kydland and Prescott 1977):&lt;/p&gt;
&lt;p&gt;&lt;em&gt;The issues are obvious in many well-known problems of public policy.
For example, suppose the socially desirable outcome is not to have
houses built in a particular flood plain but, given that they are there,
to take certain costly flood-control measures. If the government’s
policy were not to build the dams and levees needed for flood protection
and agents knew this was the case, even if houses were built there,
rational agents would not live in the flood plains. But the rational
agent knows that, if he and others build houses there, the government
will take the necessary flood-control measures. Consequently, in the
absence of a law prohibiting the construction of houses in the flood
plain, houses are built there, and the army corps of engineers
subsequently builds the dams and levees.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Our initial explorations at least suggest that leaving the resolution
entirely to market forces, for example through increased property
insurance premia, may well lead to underinsurance in poorer areas that is
&lt;em&gt;dynamically inconsistent&lt;/em&gt; with government policy. If, in particular,
severe drought bankrupts farmers in such areas, regional or national
government will eventually be forced to bail them out.&lt;/p&gt;
&lt;p&gt;The other extreme approach, i.e., leaving the climate-change-related
damages entirely to the taxpayer, therefore does not seem feasible
either, as climate awareness and the local income tax base correlate
only weakly with the drought patterns. In addition, drought of course
does not confine itself to municipal borders; the hydrological topology
of the issue inherently implies a coordination problem between local,
regional and federal entities passing the buck from one to another. One
can imagine some form of solidarity and redistribution will be required
to align interests and avoid adverse selection. To address these typical
market failures, government will need to step in to allow these risks,
which may be privately uninsurable, to be covered on a society-wide
basis.&lt;/p&gt;
&lt;p&gt;These problems are not unique to property damage. Similar problems
arise in many student loan systems around the world (where it is
desirable that loans can be taken out by arts students or future
teachers, who may not have as high an earning potential as
easy-to-credit future lawyers, engineers and managers) and in many
social security issues: a minimum level of health insurance for the
unemployed and poor is desirable not only on humanitarian grounds, but
also to avoid epidemic risks. Such special loan and insurance systems
balance social welfare with individual welfare and individual risk
considerations, while trying to avoid adverse selection and free-riding.
We believe that our example can spark some ideas about how a desirable
social outcome can be aligned with the principles of insurance and
personal responsibility.&lt;/p&gt;
&lt;p&gt;In the longer term, incentives that transfer water-intensive
industrial and agricultural activities away from the areas most at risk
could be called for, as well as better hydrological management to
safeguard water reserves. We invite the authorities and relevant
stakeholders to release the data needed to assess climate and drought
evolution and to calculate risk premia scenarios and solidarity
mechanisms as open data, verified for quality through unit tests and
peer review.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References&lt;/h2&gt;
&lt;p&gt;Antal, Daniel. 2021a. &lt;em&gt;Regions: Processing Regional Statistics&lt;/em&gt;.
&lt;a href=&#34;https://regions.danielantal.eu/&#34;&gt;https://regions.danielantal.eu/&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;———. 2021b. &lt;em&gt;Retroharmonize: Ex Post Survey Data Harmonization&lt;/em&gt;.
&lt;a href=&#34;https://retroharmonize.dataobservatory.eu/&#34;&gt;https://retroharmonize.dataobservatory.eu/&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Beguería, Santiago, Sergio M Vicente-Serrano, Fergus Reig, and Borja
Latorre. 2014. “Standardized Precipitation Evapotranspiration Index
(SPEI) Revisited: Parameter Fitting, Evapotranspiration Models, Tools,
Datasets and Drought Monitoring.” &lt;em&gt;International Journal of Climatology&lt;/em&gt;
34 (10): 3001–23.&lt;/p&gt;
&lt;p&gt;European Commission. 2019. “Eurobarometer 90.2 (2018).” GESIS Data
Archive, Cologne. ZA7488 Data file Version 1.0.0.
&lt;a href=&#34;https://doi.org/10.4232/1.13289&#34;&gt;https://doi.org/10.4232/1.13289&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Kydland, Finn E., and Edward C. Prescott. 1977. “Rules Rather Than
Discretion: The Inconsistency of Optimal Plans.” &lt;em&gt;Journal of Political
Economy&lt;/em&gt; 85 (3): 473–91. &lt;a href=&#34;http://www.jstor.org/stable/1830193&#34;&gt;http://www.jstor.org/stable/1830193&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Statbel. 2020. “&lt;span class=&#34;nocase&#34;&gt;Fiscal statistics on
income&lt;/span&gt;.” Eurostat.
&lt;a href=&#34;https://statbel.fgov.be/en/open-data/fiscal-statistics-income&#34;&gt;https://statbel.fgov.be/en/open-data/fiscal-statistics-income&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Vicente-Serrano, Sergio M, Santiago Beguería, and Juan I López-Moreno.
2010. “A Multiscalar Drought Index Sensitive to Global Warming: The
Standardized Precipitation Evapotranspiration Index.” &lt;em&gt;Journal of
Climate&lt;/em&gt; 23 (7): 1696–1718.&lt;/p&gt;
&lt;section class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34; role=&#34;doc-endnote&#34;&gt;
&lt;p&gt;As a standardized variate, SPEI can be compared across space and
time. The original calculation of SPEI is based on the FAO-56
Penman-Monteith method. Other relevant indicators might consider the
soil composition for example: clay and lime soils tend to be more
vulnerable to drought. We combined this ecological dimension with the
socio-economic dimension to suggest that insurance premia design might
be targeted to, say, income levels as well, or alternatively to real
estate prices. See (Beguería et al. 2014; Vicente-Serrano, Beguería, and
López-Moreno 2010).&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/section&gt;
</description>
    </item>
    
    <item>
      <title>Identifying Roadblocks to Net Zero Legislation</title>
      <link>/publication/political-roadblocks/</link>
      <pubDate>Tue, 16 Mar 2021 00:00:00 +0000</pubDate>
      <guid>/publication/political-roadblocks/</guid>
      <description>&lt;p&gt;In our use case we are merging data about Europe&amp;rsquo;s coal regions,
harmonized surveys about the acceptance of climate policies, and
socio-economic data. While the work starts out from existing European
research, our
&lt;a href=&#34;https://retroharmonize.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;retroharmonize&lt;/a&gt; survey
harmonization solution, our
&lt;a href=&#34;https://regions.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;regions&lt;/a&gt; sub-national boundary
harmonization solution and
&lt;a href=&#34;https://iotables.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;iotables&lt;/a&gt; allow us to connect
open data and open knowledge from other coal regions of the world, for
example, from the Appalachian economy.&lt;/p&gt;
&lt;h2 id=&#34;policy-context&#34;&gt;Policy Context&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&#34;https://ec.europa.eu/info/strategy/priorities-2019-2024/european-green-deal/actions-being-taken-eu/just-transition-mechanism/just-transition-platform_en#info-centre-and-contacts&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Just Transition
Platform&lt;/a&gt;
aims to assist EU countries and regions to unlock the support available
through the &lt;em&gt;Just Transition Mechanism.&lt;/em&gt; It builds on and expands the work
of the existing &lt;a href=&#34;https://ec.europa.eu/energy/topics/oil-gas-and-coal/EU-coal-regions/secretariat-and-technical-assistance_en&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Initiative for Coal Regions in
Transition&lt;/a&gt;,
which already supports fossil fuel producing regions across the EU in
achieving a just transition through tailored, needs-oriented assistance
and capacity-building.&lt;/p&gt;
&lt;p&gt;The Initiative has a secretariat that is co-run by &lt;a href=&#34;https://www.ecorys.com/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Ecorys&lt;/a&gt;, &lt;a href=&#34;https://climatestrategies.org/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Climate Strategies&lt;/a&gt;, &lt;a href=&#34;https://iclei.org/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;ICLEI Europe&lt;/a&gt;, and the &lt;a href=&#34;https://wupperinst.org/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Wuppertal Institute for Climate&lt;/a&gt;. While the initiative is an EU project, it
cooperates with other similar initiatives, for example, with the
&lt;a href=&#34;https://ec.europa.eu/energy/topics/oil-gas-and-coal/EU-coal-regions/resources/rebuilding-appalachian-economy-coalfield-development-usa_en&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Coalfield Development&lt;/a&gt;
social enterprise in the Appalachian economy.&lt;/p&gt;
&lt;h2 id=&#34;data-sources&#34;&gt;Data Sources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;Coal regions&lt;/code&gt;: Our starting point is the &lt;a href=&#34;https://ec.europa.eu/jrc/en/publication/eur-scientific-and-technical-research-reports/eu-coal-regions-opportunities-and-challenges-ahead&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;EU coal regions: opportunities and challenges ahead&lt;/a&gt;
publication by the Joint Research Centre (JRC), the European Commission’s
science and knowledge service. This publication maps Europe’s coal
dependent energy and transport infrastructure, and regions that
depend on coal-related jobs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;Harmonized Survey Data&lt;/code&gt;: The
&lt;a href=&#34;https://www.gesis.org/en/eurobarometer-data-service/survey-series/standard-special-eb/study-overview/eurobarometer-913-za7572-april-2019&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;dataset&lt;/a&gt;
of the Eurobarometer 91.3 (April 2019) harmonized survey. Our
transition policy variable is the four-level agreement with the
statement
&lt;code&gt;More public financial support should be given to the transition to clean energies even if it means subsidies to fossil fuels should be reduced&lt;/code&gt;
(EN) and
&lt;code&gt;Davantage de soutien financier public devrait être donné à la transition vers les énergies propres même si cela signifie que les subventions aux énergies fossiles devraient être réduites&lt;/code&gt;
(FR), which was then translated into the languages of all
participating countries.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;Environmental Variables&lt;/code&gt;: We used &lt;a href=&#34;https://netzero.dataobservatory.eu/post/2021-03-11-environmental_data/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;data&lt;/a&gt; on particulate matter (PM) and SO2 pollution
measured by participating stations in the European Environment
Agency’s monitoring program. The station locations were mapped by
&lt;a href=&#34;https://netzero.dataobservatory.eu/authors/milos_popovic/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Milos&lt;/a&gt; to the NUTS sub-national regions.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;exploratory-data-analysis&#34;&gt;Exploratory Data Analysis&lt;/h2&gt;
&lt;p&gt;Our coal-dependency dummy variable is based on the policy document &lt;a href=&#34;https://ec.europa.eu/energy/topics/oil-gas-and-coal/EU-coal-regions/coal-regions-transition_en&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Coal regions in
transition&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;coal_eu.png&#34; alt=&#34;Coal regions in the model.&#34;&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;readRDS(file.path(&amp;quot;data&amp;quot;, &amp;quot;coal_regions.rds&amp;quot;))

## # A tibble: 253 x 5
##    country_code_is~ region_nuts_nam~ region_nuts_cod~ coal_region is_coal_region
##    &amp;lt;chr&amp;gt;            &amp;lt;fct&amp;gt;            &amp;lt;chr&amp;gt;            &amp;lt;chr&amp;gt;                &amp;lt;dbl&amp;gt;
##  1 BE               Brussels hoofds~ BE10             &amp;lt;NA&amp;gt;                     0
##  2 BE               Liege            BE33             &amp;lt;NA&amp;gt;                     0
##  3 BE               Brabant Wallon   BE31             &amp;lt;NA&amp;gt;                     0
##  4 BE               Antwerpen        BE21             &amp;lt;NA&amp;gt;                     0
##  5 BE               Limburg [BE]     BE22             &amp;lt;NA&amp;gt;                     0
##  6 BE               Oost-Vlaanderen  BE23             &amp;lt;NA&amp;gt;                     0
##  7 BE               Vlaams Brabant   BE24             &amp;lt;NA&amp;gt;                     0
##  8 BE               West-Vlaanderen  BE25             &amp;lt;NA&amp;gt;                     0
##  9 BE               Hainaut          BE32             &amp;lt;NA&amp;gt;                     0
## 10 BE               Namur            BE35             &amp;lt;NA&amp;gt;                     0
## # ... with 243 more rows
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Our exploratory data analysis shows that in 2019, respondents’
agreement with the policy measure differed significantly among EU member
states and regions.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;transition_policy &amp;lt;- eb19_raw %&amp;gt;%
  rowid_to_column() %&amp;gt;%
  mutate ( transition_policy = normalize_text(transition_policy)) %&amp;gt;%
  fastDummies::dummy_cols(select_columns = &#39;transition_policy&#39;) %&amp;gt;%
  mutate ( transition_policy_agree = case_when(
    transition_policy_totally_agree + transition_policy_tend_to_agree &amp;gt; 0 ~ 1, 
    TRUE ~ 0
  )) %&amp;gt;%
  mutate ( transition_policy_disagree = case_when(
    transition_policy_totally_disagree + transition_policy_tend_to_disagree &amp;gt; 0 ~ 1, 
    TRUE ~ 0
  )) 

eb19_df  &amp;lt;- transition_policy %&amp;gt;% 
  left_join ( air_pollutants, by = &#39;region_nuts_codes&#39; ) %&amp;gt;%
  mutate ( is_poland = ifelse ( country_code == &amp;quot;PL&amp;quot;, 1, 0))
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;preliminary-results&#34;&gt;Preliminary Results&lt;/h2&gt;
&lt;p&gt;Significantly more people agree&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;where there are more pollutants&lt;/li&gt;
&lt;li&gt;among younger respondents&lt;/li&gt;
&lt;li&gt;where people are more educated&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Significantly fewer people agree&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;in rural areas&lt;/li&gt;
&lt;li&gt;where more people are older&lt;/li&gt;
&lt;li&gt;where more people are less educated&lt;/li&gt;
&lt;li&gt;in less polluted areas&lt;/li&gt;
&lt;li&gt;in coal regions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A simple model run:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;c(&amp;quot;transition_policy_totally_agree&amp;quot; , &amp;quot;pm10&amp;quot;, &amp;quot;so2&amp;quot;, &amp;quot;age_exact&amp;quot;, &amp;quot;is_highly_educated&amp;quot; , &amp;quot;is_rural&amp;quot;)

## [1] &amp;quot;transition_policy_totally_agree&amp;quot; &amp;quot;pm10&amp;quot;                           
## [3] &amp;quot;so2&amp;quot;                             &amp;quot;age_exact&amp;quot;                      
## [5] &amp;quot;is_highly_educated&amp;quot;              &amp;quot;is_rural&amp;quot;

summary( glm ( transition_policy_totally_agree ~ pm10 + so2 + 
                 age_exact +
                 is_highly_educated + is_rural + is_coal_region +
                 country_code, 
               data = eb19_df, 
               family = binomial ))

## 
## Call:
## glm(formula = transition_policy_totally_agree ~ pm10 + so2 + 
##     age_exact + is_highly_educated + is_rural + is_coal_region + 
##     country_code, family = binomial, data = eb19_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7690  -1.0253  -0.8165   1.2264   1.9085  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(&amp;gt;|z|)    
## (Intercept)        -0.1975096  0.0921551  -2.143 0.032095 *  
## pm10                0.0068505  0.0017445   3.927 8.60e-05 ***
## so2                 0.1381994  0.0405867   3.405 0.000662 ***
## age_exact          -0.0075018  0.0007873  -9.529  &amp;lt; 2e-16 ***
## is_highly_educated  0.2953905  0.0311127   9.494  &amp;lt; 2e-16 ***
## is_rural           -0.1277983  0.0313321  -4.079 4.53e-05 ***
## is_coal_region     -0.2624005  0.0640233  -4.099 4.16e-05 ***
## country_codeBE     -0.3290891  0.0916117  -3.592 0.000328 ***
## country_codeBG     -0.6470116  0.1125114  -5.751 8.89e-09 ***
## country_codeCY      0.8471483  0.1273306   6.653 2.87e-11 ***
## country_codeCZ     -0.5754008  0.0965974  -5.957 2.57e-09 ***
## country_codeDE      0.0106430  0.0856322   0.124 0.901088    
## country_codeDK      0.0577724  0.0925391   0.624 0.532429    
## country_codeEE     -0.8041188  0.0989047  -8.130 4.28e-16 ***
## country_codeES      1.1266903  0.0941495  11.967  &amp;lt; 2e-16 ***
## country_codeFI     -0.2617501  0.0946837  -2.764 0.005702 ** 
## country_codeFR      0.0130239  0.1639339   0.079 0.936678    
## country_codeGB      0.2454631  0.0891845   2.752 0.005918 ** 
## country_codeGR      0.2169278  0.1209199   1.794 0.072816 .  
## country_codeHR     -0.1632727  0.1001563  -1.630 0.103064    
## country_codeHU      0.5779928  0.1020987   5.661 1.50e-08 ***
## country_codeIT     -0.1427249  0.0940144  -1.518 0.128985    
## country_codeLU     -0.3111627  0.1140426  -2.728 0.006363 ** 
## country_codeLV     -0.6246590  0.0963526  -6.483 8.99e-11 ***
## country_codeMT      0.3303363  0.1228611   2.689 0.007173 ** 
## country_codeNL      0.1707080  0.0902189   1.892 0.058470 .  
## country_codePL     -0.2843198  0.1228657  -2.314 0.020664 *  
## country_codePT      0.1447295  0.0899079   1.610 0.107452    
## country_codeRO     -0.0479674  0.0930433  -0.516 0.606177    
## country_codeSE      0.4865939  0.0922486   5.275 1.33e-07 ***
## country_codeSK     -0.2427307  0.0964652  -2.516 0.011861 *  
## ---
## Signif. codes:  0 &#39;***&#39; 0.001 &#39;**&#39; 0.01 &#39;*&#39; 0.05 &#39;.&#39; 0.1 &#39; &#39; 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 30568  on 22401  degrees of freedom
## Residual deviance: 29313  on 22371  degrees of freedom
##   (5253 observations deleted due to missingness)
## AIC: 29375
## 
## Number of Fisher Scoring iterations: 4

summary( glm ( transition_policy_agree ~ pm10 + so2 + age_exact +
                 is_highly_educated + is_rural, 
               data = eb19_df, 
               family = binomial ))

## 
## Call:
## glm(formula = transition_policy_agree ~ pm10 + so2 + age_exact + 
##     is_highly_educated + is_rural, family = binomial, data = eb19_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1970   0.5035   0.5803   0.6495   0.8465  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(&amp;gt;|z|)    
## (Intercept)         1.807823   0.079297  22.798  &amp;lt; 2e-16 ***
## pm10                0.005092   0.001239   4.108 3.99e-05 ***
## so2                 0.003274   0.051410   0.064  0.94922    
## age_exact          -0.009781   0.000988  -9.900  &amp;lt; 2e-16 ***
## is_highly_educated  0.396743   0.039735   9.985  &amp;lt; 2e-16 ***
## is_rural           -0.107448   0.037953  -2.831  0.00464 ** 
## ---
## Signif. codes:  0 &#39;***&#39; 0.001 &#39;**&#39; 0.01 &#39;*&#39; 0.05 &#39;.&#39; 0.1 &#39; &#39; 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20488  on 22401  degrees of freedom
## Residual deviance: 20250  on 22396  degrees of freedom
##   (5253 observations deleted due to missingness)
## AIC: 20262
## 
## Number of Fisher Scoring iterations: 4
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;next-steps&#34;&gt;Next Steps&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;After careful documentation, we will soon publish all the
processed, clean datasets on the EU Zenodo repository with digital
object identifiers and clear versioning.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We will seek contact with the Secretariat of the &lt;a href=&#34;https://ec.europa.eu/energy/topics/oil-gas-and-coal/EU-coal-regions/secretariat-and-technical-assistance_en&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Initiative for
Coal Regions in
Transition&lt;/a&gt;
to process all the data annexes in the &lt;a href=&#34;https://ec.europa.eu/jrc/en/publication/eur-scientific-and-technical-research-reports/eu-coal-regions-opportunities-and-challenges-ahead&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;EU coal regions:
opportunities and challenges
ahead&lt;/a&gt;
report.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;With our
&lt;a href=&#34;https://netzero.dataobservatory.eu/#contributors&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;volunteers&lt;/a&gt; we
want to include coal regions from the United States, Latin America,
Australia and Africa first, because we already have harmonized survey
results for them, and gradually add the rest of the world.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We will ask political scientists and policy researchers to interpret
our findings.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Regional Geocoding Harmonization Case Study - Regional Climate Change Awareness Datasets</title>
      <link>/post/2021-03-06-regions-climate/</link>
      <pubDate>Sat, 06 Mar 2021 00:00:00 +0000</pubDate>
      <guid>/post/2021-03-06-regions-climate/</guid>
      <description>&lt;pre&gt;&lt;code&gt;library(regions)
library(lubridate)
library(dplyr)

if ( dir.exists(&#39;data-raw&#39;) ) {
  data_raw_dir &amp;lt;- &amp;quot;data-raw&amp;quot;
} else {
  data_raw_dir &amp;lt;- file.path(&amp;quot;..&amp;quot;, &amp;quot;..&amp;quot;, &amp;quot;data-raw&amp;quot;)
  }
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;going-beyond-the-national-level&#34;&gt;Going beyond the national level&lt;/h2&gt;
&lt;p&gt;Let’s start with a quick-and-dirty averaging by sub-national unit. The
&lt;code&gt;w1&lt;/code&gt; weighting variable contains the post-stratification weight for the
national samples. The Eurobarometer samples represent nations (with the
exception of East and West Germany, Northern Ireland and Great Britain).
The average of the &lt;code&gt;w1&lt;/code&gt; variable is 1.00 for each sample, but it is not
necessarily 1 for smaller territorial units. If &lt;code&gt;mean(w1)&amp;gt;1&lt;/code&gt; for, say,
&lt;code&gt;AT23&lt;/code&gt;, it only means that the &lt;code&gt;AT23&lt;/code&gt; region was undersampled relative
to the rest of Austria, and its responses must be over-weighted in
post-stratification.&lt;/p&gt;
&lt;p&gt;There is no way to make the samples regionally representative, and a
correct post-stratification would require further data about the sample
design. But we can adjust for over- and undersampling by making sure
that oversampled territorial averages are proportionally increased and
undersampled ones are decreased. [Another ‘dirty’ option would be an
unweighted average, but our method is better, because it more or less
adjusts for gender and education-level biases, while leaving
intra-country regional biases in the sample.]&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;panel &amp;lt;- readRDS((file.path(data_raw_dir, &amp;quot;climate-panel.rds&amp;quot;)))

climate_data &amp;lt;-  panel %&amp;gt;%
  mutate ( year = lubridate::year(date_of_interview)) %&amp;gt;%
  select ( all_of(c(&amp;quot;isocntry&amp;quot;, &amp;quot;geo&amp;quot;, &amp;quot;w1&amp;quot;)), 
           contains(&amp;quot;problem&amp;quot;)
  )  %&amp;gt;%
  mutate ( 
    # use the post-stratification weights for national samples
    serious_world_problems_first = w1*serious_world_problems_first , 
    serious_world_problems_climate_change = w1*serious_world_problems_climate_change) %&amp;gt;%
  group_by (  .data$geo ) %&amp;gt;%
  summarise( serious_world_problems_first = mean(serious_world_problems_first, na.rm=TRUE),
             serious_world_problems_climate_change = mean (serious_world_problems_climate_change, na.rm=TRUE),
             mean_w1 = mean(w1)
             ) %&amp;gt;%
  mutate ( 
    # adjust for post-stratification weight bias due to regional over/undersampling
    climate_first = serious_world_problems_first / mean_w1, 
    climate_mentioned = serious_world_problems_climate_change / mean_w1
    ) 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, we averaged, weighted and adjusted the share of respondents
mentioning climate change as the world’s most serious problem, or as one
of its most serious problems, by NUTS regions.&lt;/p&gt;
&lt;h2 id=&#34;aggregation-level&#34;&gt;Aggregation level&lt;/h2&gt;
&lt;p&gt;The problem is that most statistical data is available for the NUTS
regional boundaries according to the &lt;code&gt;NUTS2016&lt;/code&gt; definition. However,
GESIS uses &lt;code&gt;NUTS2013&lt;/code&gt; regions, so 252 regional codes in the four survey
waves are invalid. Some data is available only at the national level, but
for small countries like Luxembourg, which have no regional divisions, it
can simply be projected to the regional level. Larger countries like
Germany are divided only at the state (&lt;code&gt;NUTS1&lt;/code&gt;) level, while small
countries are divided at the &lt;code&gt;NUTS3&lt;/code&gt; level.&lt;/p&gt;
&lt;p&gt;This leads to various problems. Much of the data is available only at
the &lt;code&gt;NUTS2&lt;/code&gt; level, in which case &lt;code&gt;NUTS1&lt;/code&gt; data should be projected down
to its constituent, smaller &lt;code&gt;NUTS2&lt;/code&gt; regions, and &lt;code&gt;NUTS3&lt;/code&gt;-level data
must be aggregated up to the larger, containing &lt;code&gt;NUTS2&lt;/code&gt; regions.&lt;/p&gt;
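&lt;p&gt;The two directions can be illustrated with a minimal base R sketch. All values and child-region codes below are hypothetical, and the population-weighted mean is an illustrative choice, not the method used in this analysis:&lt;/p&gt;

```r
# Hypothetical values; the parent NUTS2 code is the first four
# characters of a NUTS3 code (BE211 and BE212 both sit in BE21).
nuts3 = data.frame(
  geo        = c("BE211", "BE212", "BE213"),
  population = c(500, 300, 200),
  value      = c(10, 20, 30)
)

# NUTS3 -> NUTS2: aggregate up, here with a population-weighted mean
nuts3$geo_nuts2 = substr(nuts3$geo, 1, 4)
nuts2_value = sapply(
  split(nuts3, nuts3$geo_nuts2),
  function(d) weighted.mean(d$value, d$population)
)
nuts2_value  # BE21: 17

# NUTS1 -> NUTS2: project down by copying the value to each child region
de1_value   = 3.2
de_children = c("DE11", "DE12", "DE13", "DE14")
projected   = setNames(rep(de1_value, length(de_children)), de_children)
```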
&lt;p&gt;Of course, we must also choose whether to use &lt;code&gt;NUTS2013&lt;/code&gt; or
&lt;code&gt;NUTS2016&lt;/code&gt; boundaries. Sub-national boundaries have changed many
thousands of times in the EU27 countries alone since 1999.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 2
##   validate         n
##   &amp;lt;chr&amp;gt;        &amp;lt;int&amp;gt;
## 1 country         15
## 2 invalid        252
## 3 nuts_level_1   132
## 4 nuts_level_2   452
## 5 nuts_level_3   141
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;recoding-the-regions&#34;&gt;Recoding the Regions&lt;/h2&gt;
&lt;p&gt;Our regions package was designed to keep track of sub-national regional
boundary changes. It can validate regional data codes, and to some
extent carry out recoding, imputation or simple aggregation.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Recoding means that the boundaries are unchanged, but the country
changed the names/codes of regions, because there were other
boundary changes which did not affect our observation unit.&lt;/li&gt;
&lt;li&gt;Imputation must not be done with the usual, general imputation tools,
because our data is regionally structured. However, some imputations
are very simple, because we can use identities like &lt;code&gt;MT&lt;/code&gt; =
&lt;code&gt;MT0&lt;/code&gt; = &lt;code&gt;MT00&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Often the boundary change is additive, and merged territorial units
can simply be aggregated for comparison with earlier data.&lt;/li&gt;
&lt;/ul&gt;
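&lt;p&gt;The single-region imputation from the list above can be sketched with a hypothetical helper; the function name and shape are illustrative only, not the regions package API:&lt;/p&gt;

```r
# Hypothetical sketch: for a country with no internal divisions, the
# national value is identical on every NUTS level (MT = MT0 = MT00).
national = data.frame(geo = "MT", value = 42)

impute_down_sketch = function(df) {
  nuts1 = transform(df, geo = paste0(geo, "0"))   # MT -> MT0
  nuts2 = transform(df, geo = paste0(geo, "00"))  # MT -> MT00
  rbind(df, nuts1, nuts2)
}

# returns rows for MT, MT0 and MT00, all carrying the national value 42
impute_down_sketch(national)
```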
&lt;!-- --&gt;
&lt;pre&gt;&lt;code&gt;regional_coding_2016 &amp;lt;- panel %&amp;gt;%
  mutate ( year = lubridate::year(date_of_interview)) %&amp;gt;%
  select (  all_of(c(&amp;quot;isocntry&amp;quot;, &amp;quot;geo&amp;quot;, &amp;quot;region&amp;quot;, &amp;quot;year&amp;quot;) ) ) %&amp;gt;%
  distinct_all() %&amp;gt;%
  recode_nuts()

regional_coding_2013 &amp;lt;- panel %&amp;gt;%
  mutate ( year = lubridate::year(date_of_interview)) %&amp;gt;%
  select (  all_of(c(&amp;quot;isocntry&amp;quot;, &amp;quot;geo&amp;quot;, &amp;quot;region&amp;quot;, &amp;quot;year&amp;quot;) ) ) %&amp;gt;%
  distinct_all() %&amp;gt;%
  recode_nuts( nuts_year = 2013)

climate_data_recoded &amp;lt;- climate_data %&amp;gt;% 
  left_join ( regional_coding_2016, by = &#39;geo&#39; ) %&amp;gt;%
  left_join ( regional_coding_2013 %&amp;gt;% 
                select ( all_of(c(&amp;quot;geo&amp;quot;, &amp;quot;code_2013&amp;quot;))), 
              by = &amp;quot;geo&amp;quot;) %&amp;gt;%
  distinct_all()

saveRDS ( climate_data_recoded , file.path(tempdir(), &amp;quot;climate_panel_recoded_agr.rds&amp;quot;), version = 2)

# not evaluated
saveRDS( climate_data_recoded , file = file.path(&amp;quot;data-raw&amp;quot;, &amp;quot;climate_panel_recoded_agr.rds&amp;quot;))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://netzero.dataobservatory.eu/media/gif/eu_climate_change.gif&#34; alt=&#34;&#34;&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Where Are People More Likely To Treat Climate Change as the Most Serious Global Problem?</title>
      <link>/post/2021-03-06-individual-join/</link>
      <pubDate>Sat, 06 Mar 2021 00:00:00 +0000</pubDate>
      <guid>/post/2021-03-06-individual-join/</guid>
      <description>&lt;pre&gt;&lt;code&gt;library(regions)
library(lubridate)
library(dplyr)

if ( dir.exists(&#39;data-raw&#39;) ) {
  data_raw_dir &amp;lt;- &amp;quot;data-raw&amp;quot;
} else {
  data_raw_dir &amp;lt;- file.path(&amp;quot;..&amp;quot;, &amp;quot;..&amp;quot;, &amp;quot;data-raw&amp;quot;)
  }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first results of our longitudinal table &lt;a href=&#34;post/2021-03-05-retroharmonize-climate/&#34;&gt;were difficult to
map&lt;/a&gt;, because the surveys used
an obsolete regional coding. We will adjust the wrong coding, when
possible, and join the data with the European Environment Agency’s (EEA)
Air Quality e-Reporting (AQ e-Reporting) data on environmental
pollution. We recoded the data to annual levels for every available
reporting station [&lt;em&gt;not shown here&lt;/em&gt;]; all values are in μg/m3. The period
under observation is 2014-2016. Data file:
&lt;a href=&#34;https://www.eea.europa.eu/data-and-maps/data/aqereporting-8&#34;&gt;https://www.eea.europa.eu/data-and-maps/data/aqereporting-8&lt;/a&gt; (European
Environment Agency 2021).&lt;/p&gt;
&lt;h2 id=&#34;recoding-the-regions&#34;&gt;Recoding the Regions&lt;/h2&gt;
&lt;p&gt;Recoding means that the boundaries are unchanged, but the country
changed the names and codes of regions because there were other boundary
changes which did not affect our observation unit. We explain the
problem and the solution in greater detail in &lt;a href=&#34;http://netzero.dataobservatory.eu/post/2021-03-06-regions-climate/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;our
tutorial&lt;/a&gt;
that aggregates the data on regional levels.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;panel &amp;lt;- readRDS((file.path(data_raw_dir, &amp;quot;climate-panel.rds&amp;quot;)))

climate_data_geocode &amp;lt;-  panel %&amp;gt;%
  mutate ( year = lubridate::year(date_of_interview)) %&amp;gt;%
  recode_nuts()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s join the air pollution data by the corrected geocodes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;load(file.path(&amp;quot;data&amp;quot;, &amp;quot;air_pollutants.rda&amp;quot;)) ## good practice to use system-independent file.path

climate_awareness_air &amp;lt;- climate_data_geocode %&amp;gt;%
  rename ( region_nuts_codes  = .data$code_2016) %&amp;gt;%
  left_join ( air_pollutants, by = &amp;quot;region_nuts_codes&amp;quot; ) %&amp;gt;%
  select ( -all_of(c(&amp;quot;w1&amp;quot;, &amp;quot;wex&amp;quot;, &amp;quot;date_of_interview&amp;quot;, 
                     &amp;quot;typology&amp;quot;, &amp;quot;typology_change&amp;quot;, &amp;quot;geo&amp;quot;, &amp;quot;region&amp;quot;))) %&amp;gt;%
  mutate (
    # remove special labels and create NA_numeric_ 
    age_education = retroharmonize::as_numeric(age_education)) %&amp;gt;%
  mutate_if ( is.character, as.factor) %&amp;gt;%
  mutate ( 
    # we only have responses from 4 years, and this should be treated as a categorical variable
    year = as.factor(year) 
    ) %&amp;gt;%
  filter ( complete.cases(.) ) 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;climate_awareness_air&lt;/code&gt; data frame contains the answers of 75086
individual respondents. 17.07% thought that climate change was the most
serious world problem and 33.6% mentioned climate change as one of the
three most important global problems.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;summary ( climate_awareness_air  )

##                  rowid       serious_world_problems_first
##  ZA5877_v2-0-0_1    :    1   Min.   :0.0000              
##  ZA5877_v2-0-0_10   :    1   1st Qu.:0.0000              
##  ZA5877_v2-0-0_100  :    1   Median :0.0000              
##  ZA5877_v2-0-0_1000 :    1   Mean   :0.1707              
##  ZA5877_v2-0-0_10000:    1   3rd Qu.:0.0000              
##  ZA5877_v2-0-0_10001:    1   Max.   :1.0000              
##  (Other)            :75080                               
##  serious_world_problems_climate_change    isocntry    
##  Min.   :0.000                         BE     : 3028  
##  1st Qu.:0.000                         CZ     : 3023  
##  Median :0.000                         NL     : 3019  
##  Mean   :0.336                         SK     : 3000  
##  3rd Qu.:1.000                         SE     : 2980  
##  Max.   :1.000                         DE-W   : 2978  
##                                        (Other):57058  
##                                    marital_status         age_education  
##  (Re-)Married: without children           :13242   18            :15485  
##  (Re-)Married: children this marriage     :12696   19            : 7728  
##  Single: without children                 : 7650   16            : 5840  
##  (Re-)Married: w children of this marriage: 6520   still studying: 5098  
##  (Re-)Married: living without children    : 6225   17            : 5092  
##  Single: living without children          : 4102   15            : 4528  
##  (Other)                                  :24651   (Other)       :31315  
##    age_exact                      occupation_of_respondent
##  Min.   :15.0   Retired, unable to work       :22911      
##  1st Qu.:36.0   Skilled manual worker         : 6774      
##  Median :51.0   Employed position, at desk    : 6716      
##  Mean   :50.1   Employed position, service job: 5624      
##  3rd Qu.:65.0   Middle management, etc.       : 5252      
##  Max.   :99.0   Student                       : 5098      
##                 (Other)                       :22711      
##             occupation_of_respondent_recoded
##  Employed (10-18 in d15a)   :32763          
##  Not working (1-4 in d15a)  :37125          
##  Self-employed (5-9 in d15a): 5198          
##                                             
##                                             
##                                             
##                                             
##                        respondent_occupation_scale_c_14
##  Retired (4 in d15a)                   :22911          
##  Manual workers (15 to 18 in d15a)     :15269          
##  Other white collars (13 or 14 in d15a): 9203          
##  Managers (10 to 12 in d15a)           : 8291          
##  Self-employed (5 to 9 in d15a)        : 5198          
##  Students (2 in d15a)                  : 5098          
##  (Other)                               : 9116          
##                   type_of_community   is_student      no_education     
##  DK                        :   34   Min.   :0.0000   Min.   :0.000000  
##  Large town                :20939   1st Qu.:0.0000   1st Qu.:0.000000  
##  Rural area or village     :24686   Median :0.0000   Median :0.000000  
##  Small or middle sized town: 9850   Mean   :0.0679   Mean   :0.008151  
##  Small/middle town         :19577   3rd Qu.:0.0000   3rd Qu.:0.000000  
##                                     Max.   :1.0000   Max.   :1.000000  
##                                                                        
##    education       year       region_nuts_codes  country_code  
##  Min.   :14.00   2013:25103   LU     : 1432     DE     : 4531  
##  1st Qu.:17.00   2015:    0   MT     : 1398     GB     : 3538  
##  Median :18.00   2017:25053   CY     : 1192     BE     : 3028  
##  Mean   :19.61   2019:24930   SK02   : 1053     CZ     : 3023  
##  3rd Qu.:22.00                EL30   :  974     NL     : 3019  
##  Max.   :30.00                EE     :  973     SK     : 3000  
##                               (Other):68064     (Other):54947  
##      pm2_5             pm10               o3              BaP        
##  Min.   : 2.109   Min.   :  5.883   Min.   : 66.37   Min.   :0.0102  
##  1st Qu.: 9.374   1st Qu.: 28.326   1st Qu.: 90.89   1st Qu.:0.1779  
##  Median :11.866   Median : 33.673   Median :102.81   Median :0.4105  
##  Mean   :12.954   Mean   : 38.637   Mean   :101.49   Mean   :0.8759  
##  3rd Qu.:15.890   3rd Qu.: 49.488   3rd Qu.:110.73   3rd Qu.:1.0692  
##  Max.   :41.293   Max.   :123.239   Max.   :141.04   Max.   :7.8050  
##                                                                      
##       so2              ap_pc1            ap_pc2             ap_pc3       
##  Min.   : 0.0000   Min.   :-4.6669   Min.   :-2.21851   Min.   :-2.1007  
##  1st Qu.: 0.0000   1st Qu.:-0.4624   1st Qu.:-0.49130   1st Qu.:-0.5695  
##  Median : 0.0000   Median : 0.4263   Median : 0.02902   Median :-0.1113  
##  Mean   : 0.1032   Mean   : 0.1031   Mean   : 0.04166   Mean   :-0.1746  
##  3rd Qu.: 0.0000   3rd Qu.: 0.9748   3rd Qu.: 0.57416   3rd Qu.: 0.3309  
##  Max.   :42.5325   Max.   : 2.0344   Max.   : 3.25841   Max.   : 4.1615  
##                                                                          
##      ap_pc4            ap_pc5        
##  Min.   :-1.7387   Min.   :-2.75079  
##  1st Qu.:-0.1669   1st Qu.:-0.18748  
##  Median : 0.0371   Median : 0.01811  
##  Mean   : 0.1154   Mean   : 0.06797  
##  3rd Qu.: 0.3050   3rd Qu.: 0.34937  
##  Max.   : 3.2476   Max.   : 1.42816  
## 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s grow a simple CART tree. We remove the regional codes, because
regional differences in climate awareness are very pronounced. These
regional differences, together with the education level and the survey
year, are the most important predictors of naming climate change as the
most important global problem in Europe.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Classification Tree with rpart
library(rpart)

# grow tree
fit &amp;lt;- rpart(as.factor(serious_world_problems_first) ~ .,
   method=&amp;quot;class&amp;quot;, data=climate_awareness_air %&amp;gt;%
     select ( - all_of(c(&amp;quot;rowid&amp;quot;, &amp;quot;region_nuts_codes&amp;quot;))), 
   control = rpart.control(cp = 0.005))

printcp(fit) # display the results

## 
## Classification tree:
## rpart(formula = as.factor(serious_world_problems_first) ~ ., 
##     data = climate_awareness_air %&amp;gt;% select(-all_of(c(&amp;quot;rowid&amp;quot;, 
##         &amp;quot;region_nuts_codes&amp;quot;))), method = &amp;quot;class&amp;quot;, control = rpart.control(cp = 0.005))
## 
## Variables actually used in tree construction:
## [1] age_education                         isocntry                             
## [3] serious_world_problems_climate_change year                                 
## 
## Root node error: 12817/75086 = 0.1707
## 
## n= 75086 
## 
##          CP nsplit rel error  xerror      xstd
## 1 0.0240566      0   1.00000 1.00000 0.0080438
## 2 0.0082703      3   0.92783 0.92783 0.0078055
## 3 0.0050000      5   0.91129 0.91425 0.0077588

plotcp(fit) # visualize cross-validation results
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;rpart-1.png&#34; alt=&#34;&amp;ldquo;Visualize cross-validation results&amp;rdquo;&#34;&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;summary(fit) # detailed summary of splits

## Call:
## rpart(formula = as.factor(serious_world_problems_first) ~ ., 
##     data = climate_awareness_air %&amp;gt;% select(-all_of(c(&amp;quot;rowid&amp;quot;, 
##         &amp;quot;region_nuts_codes&amp;quot;))), method = &amp;quot;class&amp;quot;, control = rpart.control(cp = 0.005))
##   n= 75086 
## 
##            CP nsplit rel error    xerror        xstd
## 1 0.024056592      0 1.0000000 1.0000000 0.008043837
## 2 0.008270266      3 0.9278302 0.9278302 0.007805478
## 3 0.005000000      5 0.9112897 0.9142545 0.007758824
## 
## Variable importance
## serious_world_problems_climate_change                              isocntry 
##                                    31                                    26 
##                          country_code                                   BaP 
##                                    20                                     8 
##                                 pm2_5                                ap_pc1 
##                                     4                                     3 
##                         age_education                                  pm10 
##                                     2                                     2 
##                             education                                ap_pc2 
##                                     2                                     1 
##                                  year 
##                                     1 
## 
## Node number 1: 75086 observations,    complexity param=0.02405659
##   predicted class=0  expected loss=0.1706976  P(node) =1
##     class counts: 62269 12817
##    probabilities: 0.829 0.171 
##   left son=2 (25229 obs) right son=3 (49857 obs)
##   Primary splits:
##       serious_world_problems_climate_change &amp;lt; 0.5          to the right, improve=2214.2040, (0 missing)
##       isocntry                              splits as  RRLLLRRRLLRLRLLLLLLLLLLRRLLLRLL, improve= 728.0160, (0 missing)
##       country_code                          splits as  RRLLLRRLLRLLLLLLLLLLRRLLLRLL, improve= 673.3656, (0 missing)
##       BaP                                   &amp;lt; 0.4300347    to the right, improve= 310.6229, (0 missing)
##       pm2_5                                 &amp;lt; 13.38264     to the right, improve= 296.4013, (0 missing)
##   Surrogate splits:
##       age_education splits as  ----RRRRRR-RRRRRRRRRR-RRRRRRRRRR-RRRRRRRRRR-RRRRRRRRRR-RRRRRL-RRR-RRRRRRRRR--RRRLLR--R-R, agree=0.664, adj=0, (0 split)
##       pm10          &amp;lt; 7.491315     to the left,  agree=0.664, adj=0, (0 split)
## 
## Node number 2: 25229 observations
##   predicted class=0  expected loss=0  P(node) =0.3360014
##     class counts: 25229     0
##    probabilities: 1.000 0.000 
## 
## Node number 3: 49857 observations,    complexity param=0.02405659
##   predicted class=0  expected loss=0.2570752  P(node) =0.6639986
##     class counts: 37040 12817
##    probabilities: 0.743 0.257 
##   left son=6 (34631 obs) right son=7 (15226 obs)
##   Primary splits:
##       isocntry     splits as  RRLLLRRRLLRLRLLLLLLLLLLRRLLLRLL, improve=1454.9460, (0 missing)
##       country_code splits as  RRLLLRRLLRLLLLLLLLLLRRLLLRLL, improve=1359.7210, (0 missing)
##       BaP          &amp;lt; 0.4300347    to the right, improve= 629.8844, (0 missing)
##       pm2_5        &amp;lt; 13.38264     to the right, improve= 555.7484, (0 missing)
##       ap_pc1       &amp;lt; -0.005459537 to the left,  improve= 533.3579, (0 missing)
##   Surrogate splits:
##       country_code splits as  RRLLLRRLLRLLLLLLLLLLRRLLLRLL, agree=0.987, adj=0.957, (0 split)
##       BaP          &amp;lt; 0.1749425    to the right, agree=0.775, adj=0.264, (0 split)
##       pm2_5        &amp;lt; 5.206993     to the right, agree=0.737, adj=0.140, (0 split)
##       ap_pc1       &amp;lt; 1.405527     to the left,  agree=0.733, adj=0.126, (0 split)
##       pm10         &amp;lt; 25.31211     to the right, agree=0.718, adj=0.076, (0 split)
## 
## Node number 6: 34631 observations
##   predicted class=0  expected loss=0.1769802  P(node) =0.4612178
##     class counts: 28502  6129
##    probabilities: 0.823 0.177 
## 
## Node number 7: 15226 observations,    complexity param=0.02405659
##   predicted class=0  expected loss=0.4392487  P(node) =0.2027808
##     class counts:  8538  6688
##    probabilities: 0.561 0.439 
##   left son=14 (11607 obs) right son=15 (3619 obs)
##   Primary splits:
##       isocntry      splits as  LL---LLR--L-L----------LL---R--, improve=337.5462, (0 missing)
##       country_code  splits as  LL---LR--L-L--------LL---R--, improve=337.5462, (0 missing)
##       age_education splits as  ----LLLLLL-LLLRRRRRRR-RRRRRRRRRL-RRRRRRLLRR-RRRRLLRLRL-RRLRRR-RRR-LLLLRRR-----LR-----L-R, improve=294.0807, (0 missing)
##       education     &amp;lt; 22.5         to the left,  improve=262.3747, (0 missing)
##       BaP           &amp;lt; 0.053328     to the right, improve=232.7043, (0 missing)
##   Surrogate splits:
##       BaP           &amp;lt; 0.053328     to the right, agree=0.878, adj=0.485, (0 split)
##       pm2_5         &amp;lt; 4.810361     to the right, agree=0.827, adj=0.271, (0 split)
##       ap_pc2        &amp;lt; 0.8746175    to the left,  agree=0.792, adj=0.124, (0 split)
##       so2           &amp;lt; 0.3302972    to the left,  agree=0.781, adj=0.078, (0 split)
##       age_education splits as  ----LLLLLL-LLLLLLLRLR-LRRLRRRRRR-RRRRLLLLLR-LRLRLLRRLL-LLRLLR-LLR-RRLLLLL-----RR-----R-L, agree=0.779, adj=0.071, (0 split)
## 
## Node number 14: 11607 observations,    complexity param=0.008270266
##   predicted class=0  expected loss=0.3804601  P(node) =0.1545827
##     class counts:  7191  4416
##    probabilities: 0.620 0.380 
##   left son=28 (7462 obs) right son=29 (4145 obs)
##   Primary splits:
##       age_education                    splits as  ----LLLLLL-LRRRRRRRRR-RRLRRLRRLL-RRRRLRLLRR-RLRLLLRLRL-RR-RR--RRL-L-LLRRR------------L-R, improve=123.71070, (0 missing)
##       year                             splits as  R-LR, improve=107.79460, (0 missing)
##       education                        &amp;lt; 20.5         to the left,  improve= 90.28724, (0 missing)
##       occupation_of_respondent         splits as  LRRLRRRRRLRLLLRLLL, improve= 84.62865, (0 missing)
##       respondent_occupation_scale_c_14 splits as  LRLLLRRL, improve= 68.88653, (0 missing)
##   Surrogate splits:
##       education                        &amp;lt; 20.5         to the left,  agree=0.950, adj=0.861, (0 split)
##       occupation_of_respondent         splits as  LLLLRLLRRLRLLLRLLL, agree=0.738, adj=0.267, (0 split)
##       respondent_occupation_scale_c_14 splits as  LRLLLLRL, agree=0.733, adj=0.251, (0 split)
##       is_student                       &amp;lt; 0.5          to the left,  agree=0.709, adj=0.186, (0 split)
##       age_exact                        &amp;lt; 23.5         to the right, agree=0.676, adj=0.094, (0 split)
## 
## Node number 15: 3619 observations
##   predicted class=1  expected loss=0.3722023  P(node) =0.04819807
##     class counts:  1347  2272
##    probabilities: 0.372 0.628 
## 
## Node number 28: 7462 observations
##   predicted class=0  expected loss=0.326052  P(node) =0.09937938
##     class counts:  5029  2433
##    probabilities: 0.674 0.326 
## 
## Node number 29: 4145 observations,    complexity param=0.008270266
##   predicted class=0  expected loss=0.4784077  P(node) =0.05520337
##     class counts:  2162  1983
##    probabilities: 0.522 0.478 
##   left son=58 (2573 obs) right son=59 (1572 obs)
##   Primary splits:
##       year                     splits as  L-LR, improve=40.13885, (0 missing)
##       occupation_of_respondent splits as  LRLLRRRRRLRLLLRLLL, improve=18.33254, (0 missing)
##       marital_status           splits as  LRRRLRRRLRRLRLLRRRRRRLRLRLLRR, improve=17.86888, (0 missing)
##       type_of_community        splits as  LRLRL, improve=17.55254, (0 missing)
##       age_education            splits as  ------------LLRRRRRRR-RR-RL-RR---LRRR-R--LR-R-R---R-R--RR-RR--RR------RRR--------------R, improve=14.66121, (0 missing)
##   Surrogate splits:
##       type_of_community splits as  LLLRL, agree=0.777, adj=0.412, (0 split)
##       marital_status    splits as  RRLLLLLRLLLLLLLRRRLLLLLLRLRLL, agree=0.680, adj=0.155, (0 split)
##       isocntry          splits as  LL---LL---L-R----------LL------, agree=0.669, adj=0.127, (0 split)
##       country_code      splits as  LL---L---L-R--------LL------, agree=0.669, adj=0.127, (0 split)
##       o3                &amp;lt; 83.06345     to the right, agree=0.650, adj=0.076, (0 split)
## 
## Node number 58: 2573 observations
##   predicted class=0  expected loss=0.4240187  P(node) =0.03426737
##     class counts:  1482  1091
##    probabilities: 0.576 0.424 
## 
## Node number 59: 1572 observations
##   predicted class=1  expected loss=0.43257  P(node) =0.02093599
##     class counts:   680   892
##    probabilities: 0.433 0.567

# plot tree
plot(fit, uniform=TRUE,
   main=&amp;quot;Classification Tree: Climate Change Is The Most Serious Threat&amp;quot;)
text(fit, use.n=TRUE, all=TRUE, cex=.8)

## Warning in labels.rpart(x, minlength = minlength): more than 52 levels in a
## predicting factor, truncated for printout
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;rpart-2.png&#34; alt=&#34;&amp;ldquo;predicting factor, truncated for printout&amp;rdquo;&#34;&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;saveRDS ( climate_awareness_air , file.path(tempdir(), &amp;quot;climate_panel_recoded.rds&amp;quot;), version = 2)

# not evaluated
saveRDS( climate_awareness_air, file = file.path(&amp;quot;data-raw&amp;quot;, &amp;quot;climate-panel_recoded.rds&amp;quot;))
&lt;/code&gt;&lt;/pre&gt;
</description>
    </item>
    
    <item>
      <title>Retrospective Survey Harmonization Case Study - Climate Awareness Change in Europe 2013-2019.</title>
      <link>/post/2021-03-05-retroharmonize-climate/</link>
      <pubDate>Fri, 05 Mar 2021 00:00:00 +0000</pubDate>
      <guid>/post/2021-03-05-retroharmonize-climate/</guid>
      <description>&lt;p&gt;Retrospective survey harmonization comes with many challenges, as we
have shown in the
&lt;a href=&#34;http://netzero.dataobservatory.eu/post/2021-03-04_retroharmonize_intro/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;introduction&lt;/a&gt;
to this tutorial case study. In this example, we will work with
Eurobarometer’s data.&lt;/p&gt;
&lt;p&gt;Please use the development version of
&lt;a href=&#34;https://retroharmonize.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;retroharmonize&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;devtools::install_github(&amp;quot;antaldaniel/retroharmonize&amp;quot;)

library(retroharmonize)
library(dplyr)       # this is necessary for the example 
library(lubridate)   # easier date conversion

## Warning: package &#39;lubridate&#39; was built under R version 4.0.4

library(stringr)     # You can also use base R string processing functions 
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;get-the-data&#34;&gt;Get the Data&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;retroharmonize&lt;/code&gt; is not associated with Eurobarometer, or its creators,
Kantar, or its archivists, GESIS. We assume that you have acquired the
necessary files from GESIS after carefully reading their terms and you
placed them in a directory whose path we will call &lt;code&gt;gesis_dir&lt;/code&gt;. The precise documentation
of the data we use can be found in this supporting
&lt;a href=&#34;http://netzero.dataobservatory.eu/post/2021-03-04-eurobarometer_data/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;blogpost&lt;/a&gt;.
To reproduce this blogpost, you will need &lt;code&gt;ZA5877_v2-0-0.sav&lt;/code&gt;,
&lt;code&gt;ZA6595_v3-0-0.sav&lt;/code&gt;, &lt;code&gt;ZA6861_v1-2-0.sav&lt;/code&gt;, &lt;code&gt;ZA7488_v1-0-0.sav&lt;/code&gt;,
&lt;code&gt;ZA7572_v1-0-0.sav&lt;/code&gt; in a directory that you will name &lt;code&gt;gesis_dir&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#Not run in the blogpost. In the repo we have a saved version.
climate_change_files &amp;lt;- c(&amp;quot;ZA5877_v2-0-0.sav&amp;quot;, &amp;quot;ZA6595_v3-0-0.sav&amp;quot;,  &amp;quot;ZA6861_v1-2-0.sav&amp;quot;, 
                          &amp;quot;ZA7488_v1-0-0.sav&amp;quot;, &amp;quot;ZA7572_v1-0-0.sav&amp;quot;)

eb_waves &amp;lt;- read_surveys(file.path(gesis_dir, climate_change_files), .f=&#39;read_spss&#39;)

if (dir.exists(&amp;quot;data-raw&amp;quot;)) {
  save ( eb_waves,  file = file.path(&amp;quot;data-raw&amp;quot;, &amp;quot;eb_climate_change_waves.rda&amp;quot;) )
}

if ( file.exists( file.path(&amp;quot;data-raw&amp;quot;, &amp;quot;eb_climate_change_waves.rda&amp;quot;) )) {
  load (file.path( &amp;quot;data-raw&amp;quot;, &amp;quot;eb_climate_change_waves.rda&amp;quot; ) )
} else {
  load (file.path(&amp;quot;..&amp;quot;, &amp;quot;..&amp;quot;,  &amp;quot;data-raw&amp;quot;, &amp;quot;eb_climate_change_waves.rda&amp;quot;) )
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;eb_waves&lt;/code&gt; nested list contains five surveys imported from SPSS to
the survey class of
&lt;a href=&#34;https://retroharmonize.dataobservatory.eu/articles/labelled_spss_survey.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;retroharmonize&lt;/a&gt;.
The survey class is a data.frame that retains important metadata for
further harmonization.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;document_waves (eb_waves)

## # A tibble: 5 x 5
##   id            filename           ncol  nrow object_size
##   &amp;lt;chr&amp;gt;         &amp;lt;chr&amp;gt;             &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;       &amp;lt;dbl&amp;gt;
## 1 ZA5877_v2-0-0 ZA5877_v2-0-0.sav   604 27919   139352456
## 2 ZA6595_v3-0-0 ZA6595_v3-0-0.sav   519 27718   119370440
## 3 ZA6861_v1-2-0 ZA6861_v1-2-0.sav   657 27901   151397528
## 4 ZA7488_v1-0-0 ZA7488_v1-0-0.sav   752 27339   169465928
## 5 ZA7572_v1-0-0 ZA7572_v1-0-0.sav   348 27655    80562432
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Beware the object sizes. If you work with many surveys, memory-efficient
programming becomes imperative. We will be subsetting whenever possible.&lt;/p&gt;
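&lt;p&gt;A quick base R way to check the memory footprint of any object before deciding what to subset (a sketch with a toy data frame, not the survey files):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# inspect the memory footprint of an object before subsetting
toy &amp;lt;- data.frame(a = rnorm(1e5), b = sample(letters, 1e5, replace = TRUE))
format(object.size(toy), units = &amp;quot;MB&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;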
&lt;h2 id=&#34;metadata-analysis&#34;&gt;Metadata analysis&lt;/h2&gt;
&lt;p&gt;As noted before, prepare to work with nested lists. Each imported survey
is nested as a data frame in the &lt;code&gt;eb_waves&lt;/code&gt; list.&lt;/p&gt;
&lt;h2 id=&#34;metadata-protocol-variables&#34;&gt;Metadata: Protocol Variables&lt;/h2&gt;
&lt;p&gt;Eurobarometer calls certain metadata elements, such as the
interviewee’s cooperation level or the date of the survey interview,
protocol variables. Let’s start here. This will be our template for
harmonizing more and more aspects of the five surveys (which are, in
fact, already harmonizations of about 30 surveys conducted in a single
‘wave’ across multiple countries).&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# select variables of interest from the metadata
eb_protocol_metadata &amp;lt;- eb_climate_metadata %&amp;gt;%
  filter ( .data$label_orig %in% c(&amp;quot;date of interview&amp;quot;) |
             .data$var_name_orig == &amp;quot;rowid&amp;quot;)  %&amp;gt;%
  suggest_var_names( survey_program = &amp;quot;eurobarometer&amp;quot; )

# subset and harmonize these variables in all nested list items of &#39;waves&#39; of surveys
interview_dates &amp;lt;- harmonize_var_names(eb_waves, 
                                       eb_protocol_metadata )

# apply similar data processing rules to same variables
interview_dates &amp;lt;- lapply (interview_dates, 
                      function (x) x %&amp;gt;% mutate ( date_of_interview = as_character(.data$date_of_interview) )
                      )

# join the individual survey tables into a single table 
interview_dates &amp;lt;- as_tibble ( Reduce (rbind, interview_dates) )

# Check the variable classes.

vapply(interview_dates, function(x) class(x)[1], character(1))

##             rowid date_of_interview 
##       &amp;quot;character&amp;quot;       &amp;quot;character&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is our sample workflow for each block of variables.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Get a unique identifier.&lt;/li&gt;
&lt;li&gt;Add the other variables of the block.&lt;/li&gt;
&lt;li&gt;Harmonize the variable names.&lt;/li&gt;
&lt;li&gt;Subset the data, leaving out anything that you do not harmonize in
this block.&lt;/li&gt;
&lt;li&gt;Apply some normalization in a nested list.&lt;/li&gt;
&lt;li&gt;When the variables are harmonized to the same name and class, merge
them into a data.frame-like &lt;code&gt;tibble&lt;/code&gt; object.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Now finish the harmonization. &lt;code&gt;Wednesday, 31st October 2018&lt;/code&gt; should
become a Date type &lt;code&gt;2018-10-31&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;require(lubridate)
harmonize_date &amp;lt;- function(x) {
  x &amp;lt;- tolower(as.character(x))
  x &amp;lt;- gsub(&amp;quot;monday|tuesday|wednesday|thursday|friday|saturday|sunday|\\,|th|nd|rd|st&amp;quot;, &amp;quot;&amp;quot;, x)
  x &amp;lt;- gsub(&amp;quot;decemberber&amp;quot;, &amp;quot;december&amp;quot;, x) # all those annoying real-life data problems!
  x &amp;lt;- stringr::str_trim (x, &amp;quot;both&amp;quot;)
  x &amp;lt;- gsub(&amp;quot;^0&amp;quot;, &amp;quot;&amp;quot;, x )
  x &amp;lt;- gsub(&amp;quot;\\s+&amp;quot;, &amp;quot; &amp;quot;, x) # collapse repeated whitespace to a single space
  lubridate::dmy(x) 
}

interview_dates &amp;lt;- interview_dates %&amp;gt;%
  mutate ( date_of_interview = harmonize_date(.data$date_of_interview) )

vapply(interview_dates, function(x) class(x)[1], character(1))

##             rowid date_of_interview 
##       &amp;quot;character&amp;quot;            &amp;quot;Date&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Row IDs are unique within a single survey, but not necessarily across
&lt;em&gt;different&lt;/em&gt; surveys. To avoid duplication, we created a simple,
sequential ID for each respondent that includes the ID of the original
survey file.&lt;/p&gt;
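&lt;p&gt;Such an identifier can be built by pasting the file ID in front of a within-file sequence number, along these lines (a base R sketch, not the package’s internal code):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# prefix a within-file sequence number with the survey file ID
make_rowid &amp;lt;- function(file_id, n) paste0(file_id, &amp;quot;_&amp;quot;, seq_len(n))
make_rowid(&amp;quot;ZA5877_v2-0-0&amp;quot;, 3)
## [1] &amp;quot;ZA5877_v2-0-0_1&amp;quot; &amp;quot;ZA5877_v2-0-0_2&amp;quot; &amp;quot;ZA5877_v2-0-0_3&amp;quot;
&lt;/code&gt;&lt;/pre&gt;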
&lt;pre&gt;&lt;code&gt;set.seed(2021)
sample_n(interview_dates, 6)

## # A tibble: 6 x 2
##   rowid               date_of_interview
##   &amp;lt;chr&amp;gt;               &amp;lt;date&amp;gt;           
## 1 ZA7488_v1-0-0_7016  2018-10-28       
## 2 ZA7488_v1-0-0_19187 2018-11-02       
## 3 ZA6861_v1-2-0_1218  2017-03-18       
## 4 ZA6861_v1-2-0_4142  2017-03-21       
## 5 ZA7572_v1-0-0_12363 2019-04-17       
## 6 ZA7572_v1-0-0_8071  2019-04-18
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After this type-conversion problem, let’s look at an issue where an
original SPSS variable can have two meaningful R representations.&lt;/p&gt;
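&lt;p&gt;Labelled SPSS variables carry both a value and a label, and either can be a sensible R representation. A minimal base R analogy, using a named vector in place of a labelled survey column:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# values (codes) with labels (names): both representations are meaningful
nuts &amp;lt;- c(Podravska = &amp;quot;SI012&amp;quot;, Pomorskie = &amp;quot;PL63&amp;quot;)
unname(nuts)  # the codes:  &amp;quot;SI012&amp;quot; &amp;quot;PL63&amp;quot;
names(nuts)   # the labels: &amp;quot;Podravska&amp;quot; &amp;quot;Pomorskie&amp;quot;
&lt;/code&gt;&lt;/pre&gt;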
&lt;h2 id=&#34;metadata-geographical-information&#34;&gt;Metadata: Geographical information&lt;/h2&gt;
&lt;p&gt;Let’s continue with harmonizing geographical information in the files.
In this example, &lt;code&gt;var_name_suggested&lt;/code&gt; will contain the harmonized
variable name. It is likely that you will have to make this call yourself, after
carefully reading the original questionnaires and codebooks.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;eb_regional_metadata &amp;lt;- eb_climate_metadata %&amp;gt;%
  filter ( grepl( &amp;quot;rowid|isocntry|^nuts$&amp;quot;, .data$var_name_orig)) %&amp;gt;%
  suggest_var_names( survey_program = &amp;quot;eurobarometer&amp;quot; ) %&amp;gt;%
  mutate ( var_name_suggested = case_when ( 
    var_name_suggested == &amp;quot;region_nuts_codes&amp;quot;     ~ &amp;quot;geo&amp;quot;,
    TRUE ~ var_name_suggested ))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;harmonize_var_names()&lt;/code&gt; function takes all variables in the subsetted,
geographical metadata table, and brings them to the harmonized
&lt;code&gt;var_name_suggested&lt;/code&gt; name. The function subsets the surveys to avoid the
presence of non-harmonized variables. All regional NUTS codes become
&lt;code&gt;geo&lt;/code&gt; in our case:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;geography &amp;lt;- harmonize_var_names(eb_waves, 
                                 eb_regional_metadata)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you are used to working with single survey files, you probably work
in a tabular format, which easily converts into a data.frame-like
object, in our example into tidyverse’s &lt;code&gt;tibble&lt;/code&gt;. However, when working
with longitudinal data, it is far simpler to work with nested lists,
because the tables usually have different dimensions (neither the rows,
corresponding to observations, nor the columns are the same across all
survey files).&lt;/p&gt;
&lt;p&gt;In the nested list, each list element is a single, tabular-format
survey. (In fact, the surveys are in retroharmonize’s
&lt;a href=&#34;https://retroharmonize.dataobservatory.eu/reference/survey.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;survey&lt;/a&gt;
class, which is a rich tibble that contains the metadata and the
processing history of the survey.)&lt;/p&gt;
&lt;p&gt;The regional information in the Eurobarometer files is contained in the
&lt;code&gt;nuts&lt;/code&gt; variable. We want to keep both the original labels and values.
The original values are the region’s codes, and the labels are the
names. The easiest and fastest solution is the base R &lt;code&gt;lapply&lt;/code&gt; loop.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;geography &amp;lt;- lapply ( geography, 
                      function (x) x %&amp;gt;% mutate ( region = as_character(geo), 
                                                  geo    = as.character(geo) )  
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because each table has exactly the same columns, we can simply use
&lt;code&gt;rbind()&lt;/code&gt; and reduce the list to a modern &lt;code&gt;data.frame&lt;/code&gt;, i.e. a &lt;code&gt;tibble&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;geography &amp;lt;- as_tibble ( Reduce (rbind, geography) )
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s see a dozen cases:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set.seed(2021)
sample_n(geography, 12)

## # A tibble: 12 x 4
##    rowid               isocntry geo   region              
##    &amp;lt;chr&amp;gt;               &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt;               
##  1 ZA7488_v1-0-0_7016  SI       SI012 Podravska           
##  2 ZA7488_v1-0-0_19187 PL       PL63  Pomorskie           
##  3 ZA6861_v1-2-0_1218  DK       DK02  Sjaelland           
##  4 ZA6861_v1-2-0_4142  FI       FI1B  Helsinki-Uusimaa    
##  5 ZA7572_v1-0-0_12363 SE       SE12  Oestra Mellansverige
##  6 ZA7572_v1-0-0_8071  IT       ITH   Nord-Est [IT]       
##  7 ZA6861_v1-2-0_6145  IE       IE021 Dublin              
##  8 ZA6861_v1-2-0_24638 RO       RO31  South [RO]          
##  9 ZA7488_v1-0-0_11315 CY       CY    REPUBLIC OF CYPRUS  
## 10 ZA6595_v3-0-0_27568 HR       HR041 Grad Zagreb         
## 11 ZA7572_v1-0-0_17397 CZ       CZ06  Jihovychod          
## 12 ZA6861_v1-2-0_10993 PT       PT17  Lisboa
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The idea is that we harmonize the variables block by block, and
eventually join the blocks together. The next step is socio-demography
and weights.&lt;/p&gt;
&lt;h2 id=&#34;socio-demography-and-weights&#34;&gt;Socio-demography and Weights&lt;/h2&gt;
&lt;p&gt;There are a few peculiar issues to look out for. This example shows that
survey harmonization requires plenty of expert judgment, and you cannot
fully automate the process.&lt;/p&gt;
&lt;p&gt;The Eurobarometer archives do not use all weight and demographic
variable names consistently. For example, the &lt;code&gt;wex&lt;/code&gt; variable, a
weight projected to the country’s population aged 15 or older, is
sometimes called &lt;code&gt;wex&lt;/code&gt;, sometimes &lt;code&gt;wextra&lt;/code&gt;. The individual survey’s
post-stratification weight is the &lt;code&gt;w1&lt;/code&gt; variable, but it is not
necessarily the one you need to use.&lt;/p&gt;
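&lt;p&gt;Whichever weight you pick, a weighted share is computed the same way; a minimal base R sketch with made-up numbers, not survey data:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# weighted share of a 0/1 indicator: sum(w * x) / sum(w)
x &amp;lt;- c(1, 0, 1, 0)          # indicator variable
w &amp;lt;- c(0.8, 1.2, 1.0, 1.0)  # survey weights
weighted.mean(x, w)          # equals sum(w * x) / sum(w) = 1.8 / 4 = 0.45
&lt;/code&gt;&lt;/pre&gt;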
&lt;p&gt;The &lt;code&gt;suggest_var_names()&lt;/code&gt; function has a
&lt;code&gt;survey_program = &amp;quot;eurobarometer&amp;quot;&lt;/code&gt; parameter, which normalizes the most
frequently used variable names. For example, all variations of &lt;code&gt;wex&lt;/code&gt; and
&lt;code&gt;wextra&lt;/code&gt; will be normalized to &lt;code&gt;wex&lt;/code&gt;. You can ignore this parameter and
use your own names, too.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;eb_demography_metadata  &amp;lt;- eb_climate_metadata %&amp;gt;%
  filter ( grepl( &amp;quot;rowid|isocntry|^d8$|^d7$|^wex|^w1$|d25|^d15a|^d11$&amp;quot;, .data$var_name_orig) ) %&amp;gt;%
  suggest_var_names( survey_program = &amp;quot;eurobarometer&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can see, using the original labels would not help either,
because they also come in several variations.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;eb_demography_metadata %&amp;gt;%
  select ( filename, var_name_orig, label_orig, var_name_suggested ) %&amp;gt;%
  filter (var_name_orig %in% c(&amp;quot;wex&amp;quot;, &amp;quot;wextra&amp;quot;) )

##            filename var_name_orig                                  label_orig
## 1 ZA5877_v2-0-0.sav        wextra      weight extrapolated population 15 plus
## 2 ZA6595_v3-0-0.sav        wextra      weight extrapolated population 15 plus
## 3 ZA6861_v1-2-0.sav           wex weight extrapolated population aged 15 plus
## 4 ZA7488_v1-0-0.sav           wex weight extrapolated population aged 15 plus
## 5 ZA7572_v1-0-0.sav           wex weight extrapolated population aged 15 plus
##   var_name_suggested
## 1                wex
## 2                wex
## 3                wex
## 4                wex
## 5                wex

demography &amp;lt;- harmonize_var_names ( waves = eb_waves, 
                                    metadata = eb_demography_metadata ) 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Socio-demographic variables, like the highest level of education or
occupation, are rather country-specific. Eurobarometer uses standardized
occupation and marital status scales, and, as a proxy for education
level, the age of leaving full-time education.&lt;/p&gt;
&lt;p&gt;This is a particularly tricky variable, because its coding in fact
mixes three different pieces of information: the school-leaving age, a
separate category for respondents who are still studying, and another
for people who never attended full-time education. And while
school-leaving age has been a good proxy since the 1970s, it becomes
less and less useful in an age when the EU promotes life-long learning,
as people stop and restart their education throughout their lives.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;example &amp;lt;- demography[[1]] %&amp;gt;%
  mutate ( across ( -any_of(c(&amp;quot;rowid&amp;quot;, &amp;quot;w1&amp;quot;, &amp;quot;wex&amp;quot;)), as_character) ) %&amp;gt;%
  mutate ( across (any_of(c(&amp;quot;w1&amp;quot;, &amp;quot;wex&amp;quot;)), as_numeric) )
unique ( example$age_education )

##  [1] &amp;quot;22&amp;quot;                     &amp;quot;25&amp;quot;                     &amp;quot;17&amp;quot;                    
##  [4] &amp;quot;19&amp;quot;                     &amp;quot;12&amp;quot;                     &amp;quot;23&amp;quot;                    
##  [7] &amp;quot;18&amp;quot;                     &amp;quot;20&amp;quot;                     &amp;quot;21&amp;quot;                    
## [10] &amp;quot;14&amp;quot;                     &amp;quot;24&amp;quot;                     &amp;quot;16&amp;quot;                    
## [13] &amp;quot;26&amp;quot;                     &amp;quot;15&amp;quot;                     &amp;quot;Still studying&amp;quot;        
## [16] &amp;quot;DK&amp;quot;                     &amp;quot;31&amp;quot;                     &amp;quot;29&amp;quot;                    
## [19] &amp;quot;27&amp;quot;                     &amp;quot;13&amp;quot;                     &amp;quot;32&amp;quot;                    
## [22] &amp;quot;28&amp;quot;                     &amp;quot;30&amp;quot;                     &amp;quot;53&amp;quot;                    
## [25] &amp;quot;42&amp;quot;                     &amp;quot;62&amp;quot;                     &amp;quot;40&amp;quot;                    
## [28] &amp;quot;No full-time education&amp;quot; &amp;quot;Refusal&amp;quot;                &amp;quot;37&amp;quot;                    
## [31] &amp;quot;39&amp;quot;                     &amp;quot;34&amp;quot;                     &amp;quot;35&amp;quot;                    
## [34] &amp;quot;47&amp;quot;                     &amp;quot;36&amp;quot;                     &amp;quot;45&amp;quot;                    
## [37] &amp;quot;51&amp;quot;                     &amp;quot;33&amp;quot;                     &amp;quot;43&amp;quot;                    
## [40] &amp;quot;38&amp;quot;                     &amp;quot;49&amp;quot;                     &amp;quot;46&amp;quot;                    
## [43] &amp;quot;41&amp;quot;                     &amp;quot;57&amp;quot;                     &amp;quot;7&amp;quot;                     
## [46] &amp;quot;48&amp;quot;                     &amp;quot;44&amp;quot;                     &amp;quot;50&amp;quot;                    
## [49] &amp;quot;56&amp;quot;                     &amp;quot;8&amp;quot;                      &amp;quot;11&amp;quot;                    
## [52] &amp;quot;10&amp;quot;                     &amp;quot;9&amp;quot;                      &amp;quot;75 years&amp;quot;              
## [55] &amp;quot;6&amp;quot;                      &amp;quot;3&amp;quot;                      &amp;quot;54&amp;quot;                    
## [58] &amp;quot;55&amp;quot;                     &amp;quot;60&amp;quot;                     &amp;quot;64&amp;quot;                    
## [61] &amp;quot;2 years&amp;quot;                &amp;quot;58&amp;quot;                     &amp;quot;52&amp;quot;                    
## [64] &amp;quot;72&amp;quot;                     &amp;quot;61&amp;quot;                     &amp;quot;4&amp;quot;                     
## [67] &amp;quot;63&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The seemingly trivial &lt;code&gt;age_exact&lt;/code&gt; variable has its own issues, too:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;unique ( example$age_exact)

##  [1] &amp;quot;54&amp;quot;       &amp;quot;66&amp;quot;       &amp;quot;56&amp;quot;       &amp;quot;53&amp;quot;       &amp;quot;33&amp;quot;       &amp;quot;72&amp;quot;      
##  [7] &amp;quot;83&amp;quot;       &amp;quot;62&amp;quot;       &amp;quot;86&amp;quot;       &amp;quot;77&amp;quot;       &amp;quot;64&amp;quot;       &amp;quot;46&amp;quot;      
## [13] &amp;quot;44&amp;quot;       &amp;quot;59&amp;quot;       &amp;quot;60&amp;quot;       &amp;quot;67&amp;quot;       &amp;quot;63&amp;quot;       &amp;quot;20&amp;quot;      
## [19] &amp;quot;43&amp;quot;       &amp;quot;37&amp;quot;       &amp;quot;78&amp;quot;       &amp;quot;49&amp;quot;       &amp;quot;90&amp;quot;       &amp;quot;45&amp;quot;      
## [25] &amp;quot;28&amp;quot;       &amp;quot;29&amp;quot;       &amp;quot;30&amp;quot;       &amp;quot;39&amp;quot;       &amp;quot;51&amp;quot;       &amp;quot;38&amp;quot;      
## [31] &amp;quot;41&amp;quot;       &amp;quot;71&amp;quot;       &amp;quot;25&amp;quot;       &amp;quot;48&amp;quot;       &amp;quot;79&amp;quot;       &amp;quot;88&amp;quot;      
## [37] &amp;quot;61&amp;quot;       &amp;quot;85&amp;quot;       &amp;quot;70&amp;quot;       &amp;quot;35&amp;quot;       &amp;quot;81&amp;quot;       &amp;quot;52&amp;quot;      
## [43] &amp;quot;57&amp;quot;       &amp;quot;27&amp;quot;       &amp;quot;47&amp;quot;       &amp;quot;15 years&amp;quot; &amp;quot;21&amp;quot;       &amp;quot;42&amp;quot;      
## [49] &amp;quot;32&amp;quot;       &amp;quot;68&amp;quot;       &amp;quot;36&amp;quot;       &amp;quot;34&amp;quot;       &amp;quot;19&amp;quot;       &amp;quot;31&amp;quot;      
## [55] &amp;quot;26&amp;quot;       &amp;quot;23&amp;quot;       &amp;quot;24&amp;quot;       &amp;quot;22&amp;quot;       &amp;quot;16&amp;quot;       &amp;quot;84&amp;quot;      
## [61] &amp;quot;65&amp;quot;       &amp;quot;18&amp;quot;       &amp;quot;55&amp;quot;       &amp;quot;40&amp;quot;       &amp;quot;50&amp;quot;       &amp;quot;73&amp;quot;      
## [67] &amp;quot;69&amp;quot;       &amp;quot;87&amp;quot;       &amp;quot;89&amp;quot;       &amp;quot;74&amp;quot;       &amp;quot;75&amp;quot;       &amp;quot;98 years&amp;quot;
## [73] &amp;quot;76&amp;quot;       &amp;quot;80&amp;quot;       &amp;quot;58&amp;quot;       &amp;quot;82&amp;quot;       &amp;quot;17&amp;quot;       &amp;quot;93&amp;quot;      
## [79] &amp;quot;91&amp;quot;       &amp;quot;92&amp;quot;       &amp;quot;95&amp;quot;       &amp;quot;94&amp;quot;       &amp;quot;97&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s see all the strange labels attached to &lt;code&gt;age&lt;/code&gt;-type variables:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;collect_val_labels(metadata = eb_demography_metadata %&amp;gt;%
                     filter ( var_name_suggested %in% c(&amp;quot;age_exact&amp;quot;, &amp;quot;age_education&amp;quot;)) )

##  [1] &amp;quot;2 years&amp;quot;                  &amp;quot;75 years&amp;quot;                
##  [3] &amp;quot;No full-time education&amp;quot;   &amp;quot;Still studying&amp;quot;          
##  [5] &amp;quot;15 years&amp;quot;                 &amp;quot;98 years&amp;quot;                
##  [7] &amp;quot;96 years&amp;quot;                 &amp;quot;[NOT CLEARLY DOCUMENTED]&amp;quot;
##  [9] &amp;quot;74 years&amp;quot;                 &amp;quot;99 and older&amp;quot;            
## [11] &amp;quot;Refusal&amp;quot;                  &amp;quot;87 years&amp;quot;                
## [13] &amp;quot;DK&amp;quot;                       &amp;quot;88 years&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We must handle many exceptions, so we created a function for this
purpose:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;remove_years  &amp;lt;- function(x) { 
  x &amp;lt;- gsub(&amp;quot;years|and\\solder&amp;quot;, &amp;quot;&amp;quot;, tolower(x))
  stringr::str_trim (x, &amp;quot;both&amp;quot;)}

process_demography &amp;lt;- function (x) { 
  
  x %&amp;gt;% mutate ( across ( -any_of(c(&amp;quot;rowid&amp;quot;, &amp;quot;w1&amp;quot;, &amp;quot;wex&amp;quot;)), as_character) ) %&amp;gt;%
    mutate ( across (any_of(c(&amp;quot;w1&amp;quot;, &amp;quot;wex&amp;quot;)), as_numeric) ) %&amp;gt;%
    mutate ( across (contains(&amp;quot;age&amp;quot;), remove_years)) %&amp;gt;%
    mutate ( age_exact = as.numeric (age_exact)) %&amp;gt;%
    mutate ( is_student = ifelse ( tolower(age_education) == &amp;quot;still studying&amp;quot;, 
                                   1, 0), 
             no_education = ifelse ( tolower(age_education) == &amp;quot;no full-time education&amp;quot;, 1, 0)) %&amp;gt;%
    mutate ( education = case_when (
      grepl(&amp;quot;studying&amp;quot;, age_education) ~ age_exact, 
      grepl (&amp;quot;education&amp;quot;, age_education)  ~ 14, 
      grepl (&amp;quot;refus|document|dk&amp;quot;, tolower(age_education)) ~ NA_real_,
      TRUE ~ as.numeric(age_education)
    ))  %&amp;gt;%
    mutate ( education = case_when ( 
      education &amp;lt; 14 ~ NA_real_, 
      education &amp;gt; 30 ~ 30, 
      TRUE ~ education )) 
}

demography &amp;lt;- lapply ( demography, process_demography )

## Warning in eval_tidy(pair$rhs, env = default_env): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in eval_tidy(pair$rhs, env = default_env): NAs introduced by coercion

## Warning in eval_tidy(pair$rhs, env = default_env): NAs introduced by coercion

## Warning in eval_tidy(pair$rhs, env = default_env): NAs introduced by coercion

## Warning in eval_tidy(pair$rhs, env = default_env): NAs introduced by coercion

## We&#39;ll use full_join and not rbind, because we have different variables in different waves.
demography &amp;lt;- Reduce ( full_join, demography )

## Joining, by = c(&amp;quot;rowid&amp;quot;, &amp;quot;isocntry&amp;quot;, &amp;quot;w1&amp;quot;, &amp;quot;wex&amp;quot;, &amp;quot;marital_status&amp;quot;, &amp;quot;age_education&amp;quot;, &amp;quot;age_exact&amp;quot;, &amp;quot;occupation_of_respondent&amp;quot;, &amp;quot;occupation_of_respondent_recoded&amp;quot;, &amp;quot;respondent_occupation_scale_c_14&amp;quot;, &amp;quot;type_of_community&amp;quot;, &amp;quot;is_student&amp;quot;, &amp;quot;no_education&amp;quot;, &amp;quot;education&amp;quot;)
## Joining, by = c(&amp;quot;rowid&amp;quot;, &amp;quot;isocntry&amp;quot;, &amp;quot;w1&amp;quot;, &amp;quot;wex&amp;quot;, &amp;quot;marital_status&amp;quot;, &amp;quot;age_education&amp;quot;, &amp;quot;age_exact&amp;quot;, &amp;quot;occupation_of_respondent&amp;quot;, &amp;quot;occupation_of_respondent_recoded&amp;quot;, &amp;quot;respondent_occupation_scale_c_14&amp;quot;, &amp;quot;type_of_community&amp;quot;, &amp;quot;is_student&amp;quot;, &amp;quot;no_education&amp;quot;, &amp;quot;education&amp;quot;)
## Joining, by = c(&amp;quot;rowid&amp;quot;, &amp;quot;isocntry&amp;quot;, &amp;quot;w1&amp;quot;, &amp;quot;wex&amp;quot;, &amp;quot;marital_status&amp;quot;, &amp;quot;age_education&amp;quot;, &amp;quot;age_exact&amp;quot;, &amp;quot;occupation_of_respondent&amp;quot;, &amp;quot;occupation_of_respondent_recoded&amp;quot;, &amp;quot;respondent_occupation_scale_c_14&amp;quot;, &amp;quot;type_of_community&amp;quot;, &amp;quot;is_student&amp;quot;, &amp;quot;no_education&amp;quot;, &amp;quot;education&amp;quot;)
## Joining, by = c(&amp;quot;rowid&amp;quot;, &amp;quot;isocntry&amp;quot;, &amp;quot;w1&amp;quot;, &amp;quot;wex&amp;quot;, &amp;quot;marital_status&amp;quot;, &amp;quot;age_education&amp;quot;, &amp;quot;age_exact&amp;quot;, &amp;quot;occupation_of_respondent&amp;quot;, &amp;quot;occupation_of_respondent_recoded&amp;quot;, &amp;quot;respondent_occupation_scale_c_14&amp;quot;, &amp;quot;type_of_community&amp;quot;, &amp;quot;is_student&amp;quot;, &amp;quot;no_education&amp;quot;, &amp;quot;education&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s see what we have here:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set.seed(2021)
sample_n(demography, 12)

## # A tibble: 12 x 14
##    rowid    isocntry    w1    wex marital_status        age_education  age_exact
##    &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;    &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;                 &amp;lt;chr&amp;gt;              &amp;lt;dbl&amp;gt;
##  1 ZA7488_~ SI       0.828  1428. (Re-)Married: withou~ 19                    43
##  2 ZA7488_~ PL       1.01  32830. (Re-)Married: withou~ 19                    64
##  3 ZA6861_~ DK       0.641  3100. (Re-)Married: withou~ 22                    78
##  4 ZA6861_~ FI       1.83   8601. (Re-)Married: childr~ 30                    38
##  5 ZA7572_~ SE       0.342  2645. (Re-)Married: withou~ 17                    68
##  6 ZA7572_~ IT       0.630 32287. (Re-)Married: childr~ 20                    40
##  7 ZA6861_~ IE       0.868  3054. (Re-)Married: childr~ 32                    42
##  8 ZA6861_~ RO       0.724 11805. (Re-)Married: withou~ 14                    59
##  9 ZA7488_~ CY       0.691  1013. (Re-)Married: childr~ 18                    67
## 10 ZA6595_~ HR       0.580  2098. Single living w part~ 27                    30
## 11 ZA7572_~ CZ       1.86  16908. Single: without chil~ still studying        20
## 12 ZA6861_~ PT       0.932  7448. Widow: with children  no full-time ~        84
## # ... with 7 more variables: occupation_of_respondent &amp;lt;chr&amp;gt;,
## #   occupation_of_respondent_recoded &amp;lt;chr&amp;gt;,
## #   respondent_occupation_scale_c_14 &amp;lt;chr&amp;gt;, type_of_community &amp;lt;chr&amp;gt;,
## #   is_student &amp;lt;dbl&amp;gt;, no_education &amp;lt;dbl&amp;gt;, education &amp;lt;dbl&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;harmonizing-variable-labels&#34;&gt;Harmonizing Variable Labels&lt;/h2&gt;
&lt;p&gt;So far we have been working with metadata, weights and socio-demography.
In other words, we have not even started the desired harmonization of
climate change awareness. The methodology is the same, but here we
really must look out for the answer options in the questionnaire. (Refer
to our data summary again
&lt;a href=&#34;http://netzero.dataobservatory.eu/post/2021-03-04-eurobarometer_data/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;here&lt;/a&gt;.)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;climate_awareness_metadata &amp;lt;- eb_climate_metadata %&amp;gt;%
  suggest_var_names( survey_program = &amp;quot;eurobarometer&amp;quot; ) %&amp;gt;%
  filter ( .data$var_name_suggested  %in% c(&amp;quot;rowid&amp;quot;,
                                            &amp;quot;serious_world_problems_first&amp;quot;, 
                                             &amp;quot;serious_world_problems_climate_change&amp;quot;)
  ) 

hw &amp;lt;- harmonize_var_names ( waves = eb_waves, 
                            metadata = climate_awareness_metadata )
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;retroharmonize&lt;/code&gt; package comes with a generic
&lt;a href=&#34;https://retroharmonize.dataobservatory.eu/reference/harmonize_waves.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;harmonize_values()&lt;/a&gt;
function that will change the value labels of categorical variables
(including binary ones) to a uniform format. It will also take care of
various types of missing values.&lt;/p&gt;
&lt;p&gt;First, let’s go back to our metadata and collect all value labels that
will show up with
&lt;a href=&#34;https://retroharmonize.dataobservatory.eu/reference/collect_val_labels.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;collect_val_labels()&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;collect_val_labels(climate_awareness_metadata)

##  [1] &amp;quot;Climate change&amp;quot;                            
##  [2] &amp;quot;International terrorism&amp;quot;                   
##  [3] &amp;quot;Poverty, hunger and lack of drinking water&amp;quot;
##  [4] &amp;quot;Spread of infectious diseases&amp;quot;             
##  [5] &amp;quot;The economic situation&amp;quot;                    
##  [6] &amp;quot;Proliferation of nuclear weapons&amp;quot;          
##  [7] &amp;quot;Armed conflicts&amp;quot;                           
##  [8] &amp;quot;The increasing global population&amp;quot;          
##  [9] &amp;quot;Other (SPONTANEOUS)&amp;quot;                       
## [10] &amp;quot;None (SPONTANEOUS)&amp;quot;                        
## [11] &amp;quot;Not mentioned&amp;quot;                             
## [12] &amp;quot;Mentioned&amp;quot;                                 
## [13] &amp;quot;DK&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this case, we want to select &lt;code&gt;Climate change&lt;/code&gt; when it was mentioned
as the &lt;em&gt;most serious problem&lt;/em&gt;, and &lt;code&gt;Climate change&lt;/code&gt; when it was picked
from a list of three serious problems. The first question type is a
single-choice one, where &lt;code&gt;Climate change&lt;/code&gt; is either mentioned, or the
alternative answer is labeled as &lt;code&gt;Not mentioned&lt;/code&gt;. In the multiple choice
case, the alternative may be something else, for example, &lt;code&gt;Spread of infectious diseases&lt;/code&gt;, as
we all know too well by 2021.&lt;/p&gt;
&lt;p&gt;We want to see who thought &lt;code&gt;Climate change&lt;/code&gt; was the most serious
problem, or one of the most serious problems, so we label each mention
of &lt;code&gt;Climate change&lt;/code&gt; as &lt;code&gt;mentioned&lt;/code&gt; and pair it with a numeric value
of &lt;code&gt;1&lt;/code&gt;. All other cases are labeled as &lt;code&gt;not_mentioned&lt;/code&gt;, with the
exception of the various missing observations, which in these cases are
&lt;code&gt;Do not know&lt;/code&gt; answers, &lt;code&gt;Declined to answer&lt;/code&gt; cases, and &lt;code&gt;Inappropriate&lt;/code&gt;
cases. [The latter is Eurobarometer’s label for questions that were,
for one reason or another, not asked of a particular interviewee – for
example, because the Turkish Cypriot community received a different
questionnaire.]&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# positive cases
label_1 = c(&amp;quot;^Climate\\schange&amp;quot;, &amp;quot;^Mentioned&amp;quot;)
# missing cases 
na_labels &amp;lt;- collect_na_labels( climate_awareness_metadata)
na_labels

## [1] &amp;quot;DK&amp;quot;                             &amp;quot;Inap. (10 or 11 in qa1a)&amp;quot;      
## [3] &amp;quot;Inap. (coded 10 or 11 in qc1a)&amp;quot; &amp;quot;Inap. (coded 10 or 11 in qb1a)&amp;quot;

# negative cases
label_0 &amp;lt;- collect_val_labels( climate_awareness_metadata)
label_0 &amp;lt;- label_0[! label_0 %in% label_1 ]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;harmonize_serious_problems()&lt;/code&gt; function harmonizes the labels within
the special labeled class of &lt;code&gt;retroharmonize&lt;/code&gt;. This class retains all
the information needed to give categorical variables a character or numeric
representation, plus various processing metadata for documentation
purposes. While this class is very rich (it contains whatever was
imported from SPSS’s proprietary data format, together with the
processing history), it is not suitable for statistical analysis. We
could, of course, directly call
&lt;a href=&#34;https://retroharmonize.dataobservatory.eu/reference/harmonize_values.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;harmonize_values()&lt;/a&gt;
from the retroharmonize package, but the parameterization would be very
complicated even in a simple function call, not to mention a looped
call. Because this function is the heart of the
&lt;code&gt;retroharmonize&lt;/code&gt; package, it has &lt;a href=&#34;https://retroharmonize.dataobservatory.eu/articles/harmonize_labels.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;a tutorial
article&lt;/a&gt;
of its own.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;harmonize_serious_problems &amp;lt;- function(x) {
  label_list &amp;lt;- list(
    from = c(label_0, label_1, na_labels), 
    to = c( rep ( &amp;quot;not_mentioned&amp;quot;, length(label_0) ),   # use the same order as in from!
            rep ( &amp;quot;mentioned&amp;quot;, length(label_1) ),
            &amp;quot;do_not_know&amp;quot;, &amp;quot;inap&amp;quot;, &amp;quot;inap&amp;quot;, &amp;quot;inap&amp;quot;), 
    numeric_values = c(rep ( 0, length(label_0) ), # use the same order as in from!
                       rep ( 1, length(label_1) ),
                       99997,99999,99999,99999)
  )
  
  harmonize_values(x, 
                   harmonize_labels = label_list, 
                   na_values = c(&amp;quot;do_not_know&amp;quot;=99997,
                                 &amp;quot;declined&amp;quot;=99998,
                                 &amp;quot;inap&amp;quot;=99999), 
                   remove = &amp;quot;\\(|\\)|\\[|\\]|\\%&amp;quot;
  )
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Our objects are rather big in memory, so first, let’s remove the surveys
that do not contain these world problem variables. In these cases, the
subsetted and harmonized surveys in the nested list have only one
column, i.e. the &lt;code&gt;rowid&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;hw &amp;lt;- hw[unlist ( lapply ( hw, ncol)) &amp;gt; 1 ]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we have a smaller problem to deal with. With many surveys, it is
easy to fill up your computer’s memory, so let’s start building up our
joined panel data from a smaller set of nested, subsetted surveys.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;hw &amp;lt;- lapply ( hw, function (x) x %&amp;gt;% mutate ( across ( contains(&amp;quot;problem&amp;quot;), harmonize_serious_problems) ) )
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Our &lt;code&gt;lapply&lt;/code&gt; loop calls an anonymous function which in turn calls
&lt;code&gt;harmonize_serious_problems&lt;/code&gt;, our parameterized version of
&lt;a href=&#34;https://retroharmonize.dataobservatory.eu/reference/harmonize_values.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;harmonize_values()&lt;/a&gt;,
on all variables that have &lt;code&gt;problem&lt;/code&gt; in their names.&lt;/p&gt;
&lt;p&gt;Once we are done, our variables have harmonized names, harmonized
values, and harmonized labels, but they are stored in the complex
&lt;a href=&#34;https://retroharmonize.dataobservatory.eu/articles/harmonize_labels.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;retroharmonize_labelled_spss_survey&lt;/a&gt;
class, inherited from &lt;code&gt;haven_labelled_spss&lt;/code&gt; in
&lt;a href=&#34;https://haven.tidyverse.org/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;haven&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We reduced our single and multiple choice questions to binary choice
variables. We can now give them a numeric representation. Be mindful
that &lt;code&gt;retroharmonize&lt;/code&gt; has special methods for its special labeled class
that retains metadata from SPSS. This means that &lt;code&gt;as_character&lt;/code&gt; and
&lt;code&gt;as_numeric&lt;/code&gt; know how to handle various types of missing values,
whereas the base R &lt;code&gt;as.character&lt;/code&gt; and &lt;code&gt;as.numeric&lt;/code&gt; may coerce special
values to unwanted results. This is particularly dangerous with numeric
variables – and this is the reason why we introduced a new set of S3
classes and methods in the package.&lt;/p&gt;
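&lt;p&gt;As a minimal sketch of this danger – using a made-up labelled vector
rather than the actual survey files – compare how base R coercion treats
a declared missing code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# toy example: 99 is declared as a missing value code
x &amp;lt;- haven::labelled_spss(
  c(1, 0, 99),
  labels = c(mentioned = 1, not_mentioned = 0, inap = 99),
  na_values = 99
)

as.numeric(x)  # base R strips the metadata and keeps 99 as a regular number
# retroharmonize::as_numeric() is designed to turn declared missing
# codes into NA instead of letting them leak into the analysis
&lt;/code&gt;&lt;/pre&gt;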
&lt;p&gt;We will ignore the differences between various forms of missingness,
i.e. whether the person said that she did not know, did not want to
answer, or for some reason was not asked in the survey. In a more
descriptive, non-harmonized analysis you would probably want to explore
them as various ‘categories’ and use a character representation.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;hw &amp;lt;- lapply ( hw, function(x) x %&amp;gt;% mutate ( across ( contains(&amp;quot;problem&amp;quot;), as_numeric) ))

hw &amp;lt;- Reduce ( full_join, hw) # we must use joins instead of binds because the number of columns varies.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s see what we have:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set.seed(2021)
sample_n (hw, 12)

## # A tibble: 12 x 3
##    rowid             serious_world_problems_fi~ serious_world_problems_climate_~
##    &amp;lt;chr&amp;gt;                                  &amp;lt;dbl&amp;gt;                            &amp;lt;dbl&amp;gt;
##  1 ZA6595_v3-0-0_23~                          0                               NA
##  2 ZA7572_v1-0-0_70~                          0                                0
##  3 ZA6595_v3-0-0_18~                          0                               NA
##  4 ZA6861_v1-2-0_27~                          0                                0
##  5 ZA6595_v3-0-0_26~                          0                               NA
##  6 ZA7572_v1-0-0_19~                          0                                1
##  7 ZA5877_v2-0-0_16~                          0                                0
##  8 ZA6861_v1-2-0_12~                          0                                0
##  9 ZA7572_v1-0-0_17~                          0                                0
## 10 ZA5877_v2-0-0_17~                          0                                1
## 11 ZA6861_v1-2-0_41~                          0                                0
## 12 ZA6861_v1-2-0_61~                          0                                1
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;creating-the-longitudional-table&#34;&gt;Creating the Longitudinal Table&lt;/h2&gt;
&lt;p&gt;Now we just need to join the partial tables together by the &lt;code&gt;rowid&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# start from the smallest table (we removed the surveys that had no relevant questionnaire item)
panel &amp;lt;- hw %&amp;gt;%
  left_join ( geography, by = &#39;rowid&#39; ) 

panel &amp;lt;- panel %&amp;gt;%
  left_join ( demography, by = c(&amp;quot;rowid&amp;quot;, &amp;quot;isocntry&amp;quot;) ) 

panel &amp;lt;- panel %&amp;gt;%
  left_join ( interview_dates, by = &#39;rowid&#39; )
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And let’s see a small sample:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sample_n(panel, 12)

## # A tibble: 12 x 19
##    rowid  serious_world_pr~ serious_world_pr~ isocntry geo   region    w1    wex
##    &amp;lt;chr&amp;gt;              &amp;lt;dbl&amp;gt;             &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt;  &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;
##  1 ZA686~                 0                 0 ES       ES41  Casti~ 1.21  46787.
##  2 ZA686~                 0                 0 RO       RO31  South~ 0.724 11805.
##  3 ZA686~                 0                 0 SK       SK02  Zapad~ 0.774  3499.
##  4 ZA757~                 0                 1 PT       PT16  Centr~ 1.11   9336.
##  5 ZA659~                 1                NA HR       HR041 Grad ~ 0.580  2098.
##  6 ZA659~                 1                NA RO       RO21  North~ 1.21  20160.
##  7 ZA686~                 0                 0 PT       PT17  Lisboa 0.932  7448.
##  8 ZA659~                 0                NA GB-GBN   UKI   London 0.994 50133.
##  9 ZA757~                 0                 0 CY       CY    REPUB~ 0.594   874.
## 10 ZA686~                 0                 0 LT       LT003 Klaip~ 0.623  1564.
## 11 ZA757~                 0                 0 IE       IE013 West ~ 0.490  1651.
## 12 ZA659~                 0                NA LT       LT003 Klaip~ 1.16   2917.
## # ... with 11 more variables: marital_status &amp;lt;chr&amp;gt;, age_education &amp;lt;chr&amp;gt;,
## #   age_exact &amp;lt;dbl&amp;gt;, occupation_of_respondent &amp;lt;chr&amp;gt;,
## #   occupation_of_respondent_recoded &amp;lt;chr&amp;gt;,
## #   respondent_occupation_scale_c_14 &amp;lt;chr&amp;gt;, type_of_community &amp;lt;chr&amp;gt;,
## #   is_student &amp;lt;dbl&amp;gt;, no_education &amp;lt;dbl&amp;gt;, education &amp;lt;dbl&amp;gt;,
## #   date_of_interview &amp;lt;date&amp;gt;

saveRDS ( panel, file.path(tempdir(), &amp;quot;climate_panel.rds&amp;quot;), version = 2)

# not evaluated
saveRDS( panel, file = file.path(&amp;quot;data-raw&amp;quot;, &amp;quot;climate-panel.rds&amp;quot;), version=2)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;putting-it-on-a-map&#34;&gt;Putting It on a Map&lt;/h2&gt;
&lt;p&gt;This is not the end of the story. If you put all this on a map, the
results are a bit disappointing.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;featured.png&#34; width=&#34;660&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Why? Because sub-national (provincial, state, county, district, parish)
borders are changing all the time – within the EU and everywhere else. The
next step is to harmonize the geographical information. We have another
CRAN-released package to help you with this. See the next post: &lt;a href=&#34;https://rpubs.com/antaldaniel/regions-OOD21&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Regional
Climate Change Awareness
Dataset&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>What is Retrospective Survey Harmonization?</title>
      <link>/post/2021-03-04_retroharmonize_intro/</link>
      <pubDate>Thu, 04 Mar 2021 00:00:00 +0000</pubDate>
      <guid>/post/2021-03-04_retroharmonize_intro/</guid>
      <description>&lt;h2 id=&#34;reproducible-ex-post-harmonization-of-survey-microdata&#34;&gt;Reproducible ex post harmonization of survey microdata&lt;/h2&gt;
&lt;p&gt;Retrospective survey harmonization allows the comparison of opinion poll
data collected in different countries or at different times. In this
example we are working with data from surveys that were ex ante
harmonized to a certain degree – in our tutorials we choose questions
that were asked in the same way in many natural languages. For example,
you can compare what percentage of people in various European countries,
provinces and regions thought climate change was a serious world problem
back in 2013, 2015, 2017 and 2019.&lt;/p&gt;
&lt;p&gt;We developed the
&lt;a href=&#34;https://retroharmonize.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;retroharmonize&lt;/a&gt; R package
to help with this process. We have tested the package extensively with
about 80 Eurobarometer survey files, with 5 Afrobarometer files, and
briefly with Arab Barometer files. This allows the comparison of various
survey answers in about 70 countries. These policy-oriented survey
programs were designed to be harmonized to a certain degree, but their
ex post harmonization is still necessary, challenging and error-prone.
Retrospective harmonization includes harmonizing the different codings
used for questions and answer options, the post-stratification weights,
and the handling of different file formats.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://ec.europa.eu/commfrontoffice/publicopinion/index.cfm&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Eurobarometer&lt;/a&gt;,
&lt;a href=&#34;https://www.afrobarometer.org/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Afrobarometer&lt;/a&gt;, &lt;a href=&#34;https://www.arabbarometer.org/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Arab
Barometer&lt;/a&gt; and
&lt;a href=&#34;https://www.latinobarometro.org/lat.jsp&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Latinobarómetro&lt;/a&gt; make survey
files that are harmonized across countries available for research under
various terms. Our
&lt;a href=&#34;https://retroharmonize.dataobservatory.eu/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;retroharmonize&lt;/a&gt; package is not
affiliated with them, and to run our examples, you must visit their
websites, carefully read their terms, agree to them, and download their
data yourself. The value we add is that we help to connect their
files across time (from different years) or across these programs.&lt;/p&gt;
&lt;p&gt;The survey programs mentioned above publish their data in the
proprietary SPSS format. This file format can be imported and translated
to R objects with the haven package; however, we needed to re-design
&lt;a href=&#34;https://haven.tidyverse.org/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;haven’s&lt;/a&gt;
&lt;a href=&#34;https://haven.tidyverse.org/reference/labelled_spss.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;labelled_spss&lt;/a&gt;
class to maintain far more metadata; that class is, in turn, a
modification of the &lt;a href=&#34;&#34;&gt;labelled&lt;/a&gt; class. The haven package was
designed and tested with data stored in individual SPSS files.&lt;/p&gt;
&lt;p&gt;The author of labelled, Joseph Larmarange, describes two main approaches
to working with labelled data – such as SPSS’s method of storing
categorical data – in the &lt;a href=&#34;http://larmarange.github.io/labelled/articles/intro_labelled.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Introduction to
labelled&lt;/a&gt;.&lt;/p&gt;
&lt;figure  id=&#34;figure-two-main-approaches-of-labelled-data-conversion&#34;&gt;
  &lt;div class=&#34;d-flex justify-content-center&#34;&gt;
    &lt;div class=&#34;w-100&#34; &gt;&lt;img src=&#34;img/larmarange_approaches_to_labelled.png&#34; alt=&#34;Two main approaches of labelled data conversion.&#34; loading=&#34;lazy&#34; data-zoomable /&gt;&lt;/div&gt;
  &lt;/div&gt;&lt;figcaption data-pre=&#34;Figure&amp;nbsp;&#34; data-post=&#34;:&amp;nbsp;&#34; class=&#34;numbered&#34;&gt;
      Two main approaches of labelled data conversion.
    &lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;Our approach is a further extension of &lt;strong&gt;Approach B&lt;/strong&gt;. Survey
harmonization in our case always means joining data from several
SPSS files, which requires consistent coding across the data
sources. Data cleaning and recoding must therefore take place
before conversion to factor, character or numeric vectors. This is
particularly important for factor data (and its simple character
conversions) and for numeric data that occasionally contains labels, for
example, to describe the reason why certain data is missing. Our
tutorial vignette
&lt;a href=&#34;https://retroharmonize.dataobservatory.eu/articles/labelled_spss_survey.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;labelled_spss_survey&lt;/a&gt;
gives you more information about this.&lt;/p&gt;
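&lt;p&gt;To see why the timing of the conversion matters, here is a minimal
sketch of a single labelled vector (the answer codes and labels are
invented for illustration). Converting to a factor keeps the labels,
while a numeric conversion keeps only the codes, so any recoding has to
happen before you commit to either representation:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;library(haven)

# A hypothetical answer vector: 1 = yes, 2 = no, 8 = declined
answers &amp;lt;- labelled(
  c(1, 2, 2, 8, 1),
  labels = c(yes = 1, no = 2, declined = 8)
)

as_factor(answers)  # factor with levels yes, no, declined
as.numeric(answers) # the plain codes: 1, 2, 2, 8, 1
&lt;/code&gt;&lt;/pre&gt;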
&lt;p&gt;In the next series of tutorials, we will deal with an array of problems.
These are not for the faint of heart – you need a solid
intermediate level of R to follow.&lt;/p&gt;
&lt;h2 id=&#34;tidy-joined-survey-data&#34;&gt;Tidy, joined survey data&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;The original files’ identifiers may not be unique, so we have to create
new, truly unique identifiers. Weighting may not be straightforward.&lt;/li&gt;
&lt;li&gt;Neither the number of observations nor the number of variables (which
represent the survey questions and their translation to coded data)
is the same. Certain data may be present in one survey but not
the other. This means that you will likely run loops over lists rather
than data.frames, but eventually you must carefully join them.&lt;/li&gt;
&lt;/ul&gt;
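&lt;p&gt;As a minimal sketch of the identifier problem (the archive ids and
variable names below are invented), one common remedy is to prefix each
row id with a survey-level identifier before joining:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;# Two hypothetical imported surveys with clashing integer ids
survey_a &amp;lt;- data.frame(uniqid = 1:3, answer = c(1, 2, 1))
survey_b &amp;lt;- data.frame(uniqid = 1:2, answer = c(2, 2))

# Prefix the ids with a survey identifier to make them truly unique
survey_a$rowid &amp;lt;- paste0(&amp;quot;ZA0001_&amp;quot;, survey_a$uniqid)
survey_b$rowid &amp;lt;- paste0(&amp;quot;ZA0002_&amp;quot;, survey_b$uniqid)

# Only after names and classes are harmonized can the tables be joined
joined &amp;lt;- rbind(survey_a, survey_b)
&lt;/code&gt;&lt;/pre&gt;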
&lt;h2 id=&#34;class-conversion&#34;&gt;Class conversion&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Similar questions may be imported from a non-native R format, in our
case from SPSS files, in an inconsistent manner. SPSS’s variable
formats cannot be translated unambiguously to R classes.
&lt;code&gt;retroharmonize&lt;/code&gt; introduces a new S3 class system that handles this
problem, but eventually you will have to choose whether you want a
numeric or character coding for each categorical variable.&lt;/li&gt;
&lt;li&gt;The harmonized surveys, with harmonized variable names and
harmonized value labels, must be brought to consistent R
representations (most statistical functions will only work on
numeric, factor or character data) and carefully joined into a
single data table for analysis.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;harmonization-of-variables-and-variable-labels&#34;&gt;Harmonization of variables and variable labels&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;The same variables may come with dissimilar variable names and variable
labels. It may be a challenge to match age with age. We need to
harmonize the names of variables.&lt;/li&gt;
&lt;li&gt;The harmonized variables may have different labeling. One survey may label
refused answers &lt;code&gt;declined&lt;/code&gt; and another &lt;code&gt;refusal&lt;/code&gt;. In a simple
choice question, climate change may appear as ‘Climate change’ or as
&lt;code&gt;Problem: Climate change&lt;/code&gt;. Binary choices may follow survey-specific
coding conventions. Value labels must be harmonized. There are good
tools to do this in a single file – but we have to work with several
files at once.&lt;/li&gt;
&lt;/ul&gt;
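&lt;p&gt;A minimal sketch of value label harmonization (the labels below are
invented): map each survey-specific label to a single harmonized label
before converting to factors:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;# Hypothetical survey-specific labels for the same refused answer
raw &amp;lt;- c(&amp;quot;declined&amp;quot;, &amp;quot;refusal&amp;quot;, &amp;quot;yes&amp;quot;)

# A named lookup table: survey-specific label -&amp;gt; harmonized label
label_map &amp;lt;- c(declined = &amp;quot;declined&amp;quot;, refusal = &amp;quot;declined&amp;quot;, yes = &amp;quot;yes&amp;quot;)

harmonized &amp;lt;- unname(label_map[raw])
# &amp;quot;declined&amp;quot; &amp;quot;declined&amp;quot; &amp;quot;yes&amp;quot;
&lt;/code&gt;&lt;/pre&gt;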
&lt;h2 id=&#34;missing-value-harmonization&#34;&gt;Missing value harmonization&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;There are likely to be various types of &lt;code&gt;missing values&lt;/code&gt;. Working
with missing values is probably where most human judgment is needed.
Why are some answers missing: was the question not asked in some
questionnaires? Is there a coding error? Did the respondent refuse
the question, or say that she did not have an answer?
&lt;code&gt;retroharmonize&lt;/code&gt; has a special labelled vector type that retains this
information from the raw data, if it is present, but you must make
the judgment yourself – in R, eventually you will either create a
missing category, or use &lt;code&gt;NA_character_&lt;/code&gt; or &lt;code&gt;NA_real_&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
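&lt;p&gt;A minimal sketch of user-defined missing values with haven (the codes
below are invented): labelled_spss() keeps the special codes and their
labels in the object, and zap_missing() turns them into NA only at the
point where you decide to discard that information:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;library(haven)

# Hypothetical 1-10 scale where 8 = declined, 9 = do not know
x &amp;lt;- labelled_spss(
  c(3, 8, 5, 9),
  labels = c(declined = 8, do_not_know = 9),
  na_values = c(8, 9)
)

zap_missing(x)  # 3, NA, 5, NA – the labels recorded why each is missing
&lt;/code&gt;&lt;/pre&gt;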
&lt;p&gt;That’s a lot to put on your plate.&lt;/p&gt;
&lt;p&gt;It is unlikely that you will be able to work with completely unfamiliar
survey programs if you do not have a strong intermediate level of R. Our
package comes with tutorials for
&lt;a href=&#34;https://retroharmonize.dataobservatory.eu/articles/eurobarometer.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Eurobarometer&lt;/a&gt; and
&lt;a href=&#34;https://retroharmonize.dataobservatory.eu/articles/afrobarometer.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Afrobarometer&lt;/a&gt;,
and our development version already covers the Arab Barometer, highlighting
some peculiar issues with these survey programs that we hope will give
less experienced R users a head start.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Eurobarometer Surveys Used In Our Project</title>
      <link>/post/2021-03-04-eurobarometer_data/</link>
      <pubDate>Wed, 03 Mar 2021 00:00:00 +0000</pubDate>
      <guid>/post/2021-03-04-eurobarometer_data/</guid>
      <description>&lt;p&gt;In our &lt;a href=&#34;http://netzero.dataobservatory.eu/post/2021-03-04_retroharmonize_intro/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;tutorial
series&lt;/a&gt;,
we are going to harmonize the following questionnaire items from five
Eurobarometer harmonized survey files. The Eurobarometer survey files
are harmonized across countries, but they are only partially harmonized
in time.&lt;/p&gt;
&lt;p&gt;All data must be downloaded from the
&lt;a href=&#34;https://www.gesis.org/en/home&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;GESIS&lt;/a&gt; Data Archive in Cologne. We are
not affiliated with GESIS and you must read and accept their terms to
use the data.&lt;/p&gt;
&lt;h2 id=&#34;eurobarometer-802-2013&#34;&gt;Eurobarometer 80.2 (2013)&lt;/h2&gt;
&lt;p&gt;GESIS Data Archive, Cologne. ZA5877 Data file Version 2.0.0,
&lt;a href=&#34;https://doi.org/10.4232/1.12792&#34;&gt;https://doi.org/10.4232/1.12792&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data file: &lt;a href=&#34;https://search.gesis.org/research_data/ZA5877&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;ZA5877&lt;/a&gt;
data file (European Commission 2017).&lt;/li&gt;
&lt;li&gt;Questionnaire: &lt;a href=&#34;https://dbk.gesis.org/dbksearch/download.asp?id=54036&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Eurobarometer 80.2 Basic Bilingual
Questionnaire&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Citation: &lt;a href=&#34;https://search.gesis.org/ajax/bibtex.php?type=research_data&amp;amp;docid=ZA5877&amp;amp;lang=en&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;ZA5877
Bibtex&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;code&gt;QA1a Which of the following do you consider to be the single most serious problem facing the world as a whole?&lt;/code&gt;
(single choice)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;QA1b Which others do you consider to be serious problems?&lt;/code&gt; (multiple
choice)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;QA2 And how serious a problem do you think climate change is at this moment? Please use a scale from 1 to 10, with &#39;1&#39; meaning it is &amp;quot;not at all a serious problem&lt;/code&gt;
(scale 1-10)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;QA4 To what extent do you agree or disagree with each of the following statements? - Fighting climate change and using energy more efficiently can boost the economy and jobs in the EU&lt;/code&gt;
(agreement-disagreement 4-scale)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;QA4 To what extent do you agree or disagree with each of the following statements? - Reducing fossil fuel imports from outside the EU could benefit the EU economically&lt;/code&gt;
(agreement-disagreement 4-scale)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;QA5 Have you personally taken any action to fight climate change over the past six months?&lt;/code&gt;
(binary)&lt;/p&gt;
&lt;h2 id=&#34;eurobarometer-834-2015&#34;&gt;Eurobarometer 83.4 (2015)&lt;/h2&gt;
&lt;p&gt;European Commission, Brussels; Directorate General Communication
COMM.A.1 ‘Strategy, Corporate Communication Actions and
Eurobarometer’ GESIS Data Archive, Cologne. ZA6595 Data file Version
3.0.0, &lt;a href=&#34;https://doi.org/10.4232/1.13146&#34;&gt;https://doi.org/10.4232/1.13146&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data file: &lt;a href=&#34;https://search.gesis.org/research_data/ZA6595&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;ZA6595&lt;/a&gt;
data file (European Commission 2018).&lt;/li&gt;
&lt;li&gt;Questionnaire: &lt;a href=&#34;https://dbk.gesis.org/dbksearch/download.asp?id=57940&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Eurobarometer 83.4 Basic Bilingual
Questionnaire&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Citation: &lt;a href=&#34;https://search.gesis.org/ajax/bibtex.php?type=research_data&amp;amp;docid=ZA6595&amp;amp;lang=en&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;ZA6595
Bibtex&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;eurobarometer-871-2017&#34;&gt;Eurobarometer 87.1 (2017)&lt;/h2&gt;
&lt;p&gt;European Commission, Brussels; Directorate General Communication,
COMM.A.1 ‘Strategic Communication’; European Parliament,
Directorate-General for Communication, Public Opinion Monitoring Unit
GESIS Data Archive, Cologne. ZA6861 Data file Version 1.2.0,
&lt;a href=&#34;https://doi.org/10.4232/1.12922&#34;&gt;https://doi.org/10.4232/1.12922&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data file: &lt;a href=&#34;https://search.gesis.org/research_data/ZA6861&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;ZA6861&lt;/a&gt;
data file.&lt;/li&gt;
&lt;li&gt;Questionnaire: &lt;a href=&#34;https://dbk.gesis.org/dbksearch/download.asp?id=65967&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Eurobarometer 90.2 Basic Bilingual
Questionnaire&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Citation: &lt;a href=&#34;https://search.gesis.org/ajax/bibtex.php?type=research_data&amp;amp;docid=ZA6861&amp;amp;lang=en&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;ZA6861
Bibtex&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;code&gt;QC1a Which of the following do you consider to be the single most serious problem facing the world as a whole?&lt;/code&gt;
(single choice)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;QC1b Which others do you consider to be serious problems?&lt;/code&gt; (multiple
choice)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;QC2 And how serious a problem do you think climate change is at this moment? Please use a scale from 1 to 10, with &#39;1&#39; meaning it is &amp;quot;not at all a serious problem&lt;/code&gt;
(scale 1-10)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;QC4 To what extent do you agree or disagree with each of the following statements? - Fighting climate change and using energy more efficiently can boost the economy and jobs in the EU&lt;/code&gt;
(agreement-disagreement 4-scale)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;QC4 To what extent do you agree or disagree with each of the following statements? - Promoting EU expertise in new clean technologies to countries outside the EU can benefit the EU economically&lt;/code&gt;
(agreement-disagreement 4-scale)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;QC4 To what extent do you agree or disagree with each of the following statements? - Reducing fossil fuel imports from outside the EU can benefit the EU economically&lt;/code&gt;
(agreement-disagreement 4-scale)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;QC4 To what extent do you agree or disagree with each of the following statements? - Reducing fossil fuel imports from outside the EU can increase the security of EU energy supplies&lt;/code&gt;
(agreement-disagreement 4-scale)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;QC4 To what extent do you agree or disagree with each of the following statements? - More public financial support should be given to the transition to clean energies even if it means subsidies to fossil fuels should be reduced.&lt;/code&gt;
(agreement-disagreement 4-scale)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;QC5 Have you personally taken any action to fight climate change over the past six months?&lt;/code&gt;
(binary)&lt;/p&gt;
&lt;h2 id=&#34;eurobarometer-902-2018&#34;&gt;Eurobarometer 90.2 (2018)&lt;/h2&gt;
&lt;p&gt;European Commission, Brussels; Directorate General Communication,
COMM.A.3 ‘Media Monitoring and Eurobarometer’ GESIS Data Archive,
Cologne. ZA7488 Data file Version 1.0.0,
&lt;a href=&#34;https://doi.org/10.4232/1.13289&#34;&gt;https://doi.org/10.4232/1.13289&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data file:
&lt;a href=&#34;https://dbk.gesis.org/dbksearch/sdesc2.asp?db=e&amp;amp;no=7488&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;ZA7488&lt;/a&gt;
data file (European Commission 2019a)&lt;/li&gt;
&lt;li&gt;Questionnaire: &lt;a href=&#34;https://dbk.gesis.org/dbksearch/download.asp?id=65967&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Eurobarometer 90.2 Basic Bilingual
Questionnaire&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Citation: &lt;a href=&#34;https://search.gesis.org/ajax/bibtex.php?type=research_data&amp;amp;docid=ZA7488&amp;amp;lang=en&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;ZA7488
Bibtex&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;code&gt;QB5 To what extent do you agree or disagree with each of the following statements? - Fighting climate change and using energy more efficiently can boost the economy and jobs in the EU&lt;/code&gt;
(agreement-disagreement 4-scale)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;QB5 To what extent do you agree or disagree with each of the following statements? - Promoting EU expertise in new clean technologies to countries outside the EU can benefit the EU economically&lt;/code&gt;
(agreement-disagreement 4-scale)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;QB5 To what extent do you agree or disagree with each of the following statements? - Reducing fossil fuel imports from outside the EU can benefit the EU economically&lt;/code&gt;
(agreement-disagreement 4-scale)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;QB5 To what extent do you agree or disagree with each of the following statements? - Reducing fossil fuel imports from outside the EU can increase the security of EU energy supplies&lt;/code&gt;
(agreement-disagreement 4-scale)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;QB5 To what extent do you agree or disagree with each of the following statements? - More public financial support should be given to the transition to clean energies even if it means subsidies to fossil fuels should be reduced.&lt;/code&gt;
(agreement-disagreement 4-scale)&lt;/p&gt;
&lt;h2 id=&#34;eurobarometer-913-2019&#34;&gt;Eurobarometer 91.3 (2019)&lt;/h2&gt;
&lt;p&gt;European Commission, Brussels; Directorate General Communication,
COMM.A.3 ‘Media Monitoring and Eurobarometer’ GESIS Data Archive,
Cologne. ZA7572 Data file Version 1.0.0,
&lt;a href=&#34;https://doi.org/10.4232/1.13372&#34;&gt;https://doi.org/10.4232/1.13372&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data file:
&lt;a href=&#34;https://dbk.gesis.org/dbksearch/sdesc2.asp?db=e&amp;amp;no=7572&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;ZA7572&lt;/a&gt;
data file (European Commission 2019b).&lt;/li&gt;
&lt;li&gt;Questionnaire: &lt;a href=&#34;https://dbk.gesis.org/dbksearch/download.asp?id=66774&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Eurobarometer 91.3 Basic Bilingual
Questionnaire&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Citation: &lt;a href=&#34;https://search.gesis.org/ajax/bibtex.php?type=research_data&amp;amp;docid=ZA7572&amp;amp;lang=en&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;ZA7572
Bibtex&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;code&gt;QB4 To what extent do you agree or disagree with each of the following statements? - Taking action on climate change will lead to innovation that will make EU companies more competitive (N)&lt;/code&gt;
(agreement-disagreement 4-scale)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;QB4 To what extent do you agree or disagree with each of the following statements? - Promoting EU expertise in new clean technologies to countries outside the EU can benefit the EU economically&lt;/code&gt;
(agreement-disagreement 4-scale)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;QB4 To what extent do you agree or disagree with each of the following statements? - Reducing fossil fuel imports from outside the EU can benefit the EU economically&lt;/code&gt;
(agreement-disagreement 4-scale)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;QB4 To what extent do you agree or disagree with each of the following statements? - Adapting to the adverse impacts of climate change can have positive outcomes for citizens in the EU&lt;/code&gt;
(agreement-disagreement 4-scale)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;QB5 Have you personally taken any action to fight climate change over the past six months?&lt;/code&gt;
(binary)&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References&lt;/h2&gt;
&lt;p&gt;European Commission, Brussels. 2017. “Eurobarometer 80.2 (2013).” GESIS
Data Archive, Cologne. ZA5877 Data file Version 2.0.0,
&lt;a href=&#34;https://doi.org/10.4232/1.12792&#34;&gt;https://doi.org/10.4232/1.12792&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;———. 2018. “Eurobarometer 83.4 (2015).” GESIS Data Archive, Cologne.
ZA6595 Data file Version 3.0.0, &lt;a href=&#34;https://doi.org/10.4232/1.13146&#34;&gt;https://doi.org/10.4232/1.13146&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;———. 2019a. “Eurobarometer 90.2 (2018).” GESIS Data Archive, Cologne.
ZA7488 Data file Version 1.0.0, &lt;a href=&#34;https://doi.org/10.4232/1.13289&#34;&gt;https://doi.org/10.4232/1.13289&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;———. 2019b. “Eurobarometer 91.3 (2019).” GESIS Data Archive, Cologne.
ZA7572 Data file Version 1.0.0, &lt;a href=&#34;https://doi.org/10.4232/1.13372&#34;&gt;https://doi.org/10.4232/1.13372&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>
