Merging Comparative Manifesto Project and ParlGov cabinet composition data at the party-level
07 Mar 2019
Goal of this Post
The data provided by the Comparative Manifesto Project (CMP) is a rich and valuable source of information for researchers interested in parties’ policy positions and salience strategies. The CMP main dataset records policy issue positions at the level of parties nested within countries and elections. Specifically, issue positions and salience measurements are derived from parties’ election manifestos.1 These manifestos are usually produced as part of parties’ campaigning efforts and are typically issued between six months and one month before an upcoming election.
In this post, we will focus on a particular piece of information that is lacking in the CMP data: an indicator of parties’ government status that would allow us to distinguish government from opposition parties. Specifically, our goal is to find out whether the manifesto used in a given election was published by a party that was in government at that point in time.
In order to accomplish this goal, I will add to the CMP data information contained in ParlGov’s (PGV) cabinet view. In the PGV cabinet view, elections map m:1 to countries, elections map 1:m to cabinets, and parties are nested in election-cabinet configurations. It is a notable feature of this dataset that it keeps track of the different cabinets formed based on the results of a given election as well as of their timing (i.e., cabinet start dates).
Similar but different: Relating CMP and PGV data
Two data logics
Adding PGV indicators to the CMP data sounds simpler than it actually is (see also here and here). In the CMP data, manifesto-related indicators (e.g., issue position) map 1:1 to country-election-party configurations. But the election date recorded is that of the post-campaign election. That is, for a manifesto published on day $X$, the election on day $Y$ is the upcoming election in the sense that $X < Y$.
This stands in contrast to the PGV data, where the recorded election is only the point of reference during cabinet formation, so that if a cabinet forms on day $Z$ based on the results of the election on day $Y$, we always have $Y < Z$. As a consequence, a first complication when determining CMP parties’ government status is that we cannot simply join PGV cabinet data to CMP election data at the party level using election dates. Instead, we first need to identify the cabinet that was running (i.e., in office) when the election to which the CMP manifesto data refer was held.
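The difference between the two data logics can be illustrated with a minimal sketch. All cabinet names and dates below are invented for illustration and do not come from PGV or CMP; the point is only that joining on the election date returns the cabinet formed after the election, not the one in office on election day.

```r
# Toy data (invented for illustration): three cabinets in one country.
cabinets <- data.frame(
  cabinet_name  = c("Cabinet A", "Cabinet B", "Cabinet C"),
  election_date = as.Date(c("2000-03-01", "2000-03-01", "2004-03-07")),
  start_date    = as.Date(c("2000-04-10", "2002-01-15", "2004-04-20")),
  stringsAsFactors = FALSE
)

# The election a (hypothetical) manifesto was written for.
manifesto_election <- as.Date("2004-03-07")

# Naive join on election date: yields the cabinet formed AFTER that election.
naive <- cabinets[cabinets$election_date == manifesto_election, ]

# Cabinet in office on election day: the latest start date before the election.
in_office <- cabinets[cabinets$start_date < manifesto_election, ]
running <- in_office[which.max(as.numeric(in_office$start_date)), ]

naive$cabinet_name
running$cabinet_name
```

In this toy setup the naive join picks "Cabinet C", while the cabinet actually in office on election day is "Cabinet B".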
Different IDs and incomplete lookup/link tables
The second complication is that parties are not only named and abbreviated differently in the PGV and CMP data, but the dataset-specific IDs generally do not match either. What is more, to my knowledge there exists no complete look-up table that would allow us to match parties between the datasets.
Hands-on
But it would probably not have been so much fun to write this post if I hadn’t found ways to solve these problems 👍 So let’s get started!
First things first
First, some basic setup: I load all required packages and define some helper functions (and add roxygen2 documentation, just in case):
# setup ----
library(dplyr)
library(lubridate)
library(manifestoR)
library(readr)
library(tidyr)
#' If NA, then ...
#'
#' @description Given a scalar value `x`, function replaces it with a specified value if the input is NA
#'
#' @param x a scalar value to be evaluated in \code{is.na(x)}
#'
#' @param then the value to replace \code{x} with if \code{is.na(x)} evaluates to \code{TRUE}
#'
#' @return if \code{is.na(x)}, then \code{then}, else \code{x}
if_na <- function(x, then) ifelse( is.na(x), then, x )
#' Is distinct?
#'
#' @description Given a dataframe, matrix, list or vector object, tests whether the number of unique rows or elements
#' equals the total number of rows or elements, respectively.
#'
#' @param x dataframe, matrix, list or vector object
#'
#' @return logical
is_unique <- function(x) {
  if (inherits(x, "data.frame") || is.matrix(x)) {
    # compare the number of rows with the number of distinct rows
    nrow(x) == nrow(unique(x))
  } else if (is.vector(x) || is.list(x)) {
    # compare the number of elements with the number of distinct elements
    length(x) == length(unique(x))
  } else {
    stop("`x` must be a data.frame, matrix, list or vector object")
  }
}
Next we’ll define a dataframe containing information on the countries we are interested in:
countries <- read_csv(
'"country_name","country_iso2c","country_iso3c"
"Austria","AT","AUT"
"Belgium","BE","BEL"
"Denmark","DK","DNK"
"Finland","FI","FIN"
"France","FR","FRA"
"Germany","DE","DEU"
"Greece","GR","GRC"
"Ireland","IE","IRL"
"Italy","IT","ITA"
"Iceland","IS","ISL"
"Luxembourg","LU","LUX"
"Netherlands","NL","NLD"
"Norway","NO","NOR"
"Portugal","PT","PRT"
"Spain","ES","ESP"
"Sweden","SE","SWE"
"Switzerland","CH","CHE"
"United Kingdom","GB","GBR"'
)
Get CMP data
Now we can download the CMP dataset using the Manifesto Project’s API. Note that you first have to register with manifesto-project.org and save the API key,2 and then load the API key using mp_setapikey:
mp_setapikey("your/path/to/secret/manifesto_apikey.txt")
# get available version
cmp_versions <- mp_coreversions()
## Connecting to Manifesto Project DB API...
# get newest version
(this_version <- cmp_versions$datasets.id[nrow(cmp_versions)])
## [1] "MPDS2018b"
# get data
cmp_raw <- mp_maindataset(version = this_version)
## Connecting to Manifesto Project DB API... corpus version: 2018-2
# do all countries match?
cmp_countries <- unique(cmp_raw$countryname)
all(cmp_countries %in% countries$country_name)
## [1] FALSE
all(countries$country_name %in% cmp_countries)
## [1] TRUE
Checking our list of selected countries against the countries contained in the CMP data, we can verify that none of the countries we are interested in is missing. Hence, I create the dataframe cmp that keeps only rows of the selected countries and only elections since the 1980s.
cmp <- cmp_raw %>%
# keep only select countries
filter(countryname %in% countries$country_name) %>%
mutate(edate = ymd(edate)) %>%
# keep only elections since the 80s
filter(edate > ymd("1979-12-31")) %>%
left_join(countries, by = c("countryname" = "country_name")) %>%
mutate_if(is.character, trimws)
head(cmp) %>%
select(1:7)
## # A tibble: 6 x 7
## country countryname oecdmember eumember edate date party
## <dbl> <chr> <dbl> <dbl> <date> <dbl> <dbl>
## 1 11 Sweden 10 0 1982-09-19 198209 11220
## 2 11 Sweden 10 0 1982-09-19 198209 11320
## 3 11 Sweden 10 0 1982-09-19 198209 11420
## 4 11 Sweden 10 0 1982-09-19 198209 11620
## 5 11 Sweden 10 0 1982-09-19 198209 11810
## 6 11 Sweden 10 0 1985-09-15 198509 11220
Here, we can nicely inspect the structure of the CMP dataset: countries map 1:m to elections, and elections map 1:m to parties.
Create unique CMP country-election-party combinations
Now we can get all distinct combinations of countries, elections and parties in the CMP data (dataframe cmp_elc_ptys), and validate that no invalid entries are in the resulting dataset.
cmp_elc_ptys <- cmp %>%
select(country_iso3c, edate, party, partyname, partyabbrev) %>%
rename_at(-1, ~ paste0("cmp_", .)) %>%
unique()
# any party more than one abbreviation?
cmp_elc_ptys %>%
group_by(country_iso3c, cmp_edate, cmp_party) %>%
summarize(n_abrvs = n_distinct(cmp_partyabbrev)) %>%
filter(n_abrvs > 1)
## # A tibble: 0 x 4
## # Groups: country_iso3c, cmp_edate [0]
## # … with 4 variables: country_iso3c <chr>, cmp_edate <date>,
## # cmp_party <dbl>, n_abrvs <int>
# any party more than one ID?
cmp_elc_ptys %>%
group_by(country_iso3c, cmp_edate, cmp_partyname) %>%
summarize(n_ids = n_distinct(cmp_party)) %>%
filter(n_ids > 1)
## # A tibble: 0 x 4
## # Groups: country_iso3c, cmp_edate [0]
## # … with 4 variables: country_iso3c <chr>, cmp_edate <date>,
## # cmp_partyname <chr>, n_ids <int>
Get PGV data
Next we download the most recent cabinets and parties views from the PGV website, and validate that all countries of interest are contained in them.
cabs <- read_csv("http://www.parlgov.org/static/data/development-cp1252/view_cabinet.csv", locale = locale(encoding = 'ISO-8859-1'))
ptys <- read_csv("http://www.parlgov.org/static/data/development-cp1252/view_party.csv", locale = locale(encoding = 'ISO-8859-1'))
pgv_ctrs <- unique(cabs$country_name_short)
# ensure compatibility: all CMP countries in PGV?
all(unique(cmp$country_iso3c) %in% pgv_ctrs)
## [1] TRUE
Identify ‘running’ cabinets in PGV data
The next step is crucial! As explained above, we are interested in running cabinet parties at the day of an election. Hence, we need to leverage both election data and cabinet start date information to identify just these. This is accomplished by the following very long, but extensively commented expression:
# for PGV country-election-cabinets, ...
pgv_elcs <- cabs %>%
# keep select countries
filter(country_name_short %in% unique(cmp$country_iso3c)) %>%
# go back further to also retain previous cabs
filter(ymd(election_date) > ymd("1969-12-31")) %>%
# get distinct country-election configurations
select(country_name_short, election_date) %>%
unique() %>%
# get date of next election
group_by(country_name_short) %>%
mutate(next_election_date = lead(election_date)) %>%
# add cabinet info
left_join(cabs, by = c("country_name_short", "election_date")) %>%
# keep select columns
select(
country_name_short,
election_date, next_election_date,
cabinet_id, cabinet_name, start_date
) %>%
unique() %>%
# keep only cabinet last formed from a given election
# (NOTE: this is the cabinet running when the next election was held)
group_by(country_name_short, election_date) %>%
filter(start_date == max(start_date)) %>%
# (NOTE: the previous step makes rows uniquely identified by election dates
# within countries, since only one cabinet per country-election is retained)
ungroup() %>%
# again apply date filter
filter(next_election_date > ymd("1974-12-31")) %>%
# rename
rename(
country_iso3c = country_name_short
, cabinet_start_date = start_date
) %>%
rename_at(-1, ~ paste0("pgv_", .))
# Nr. rows = Nr. distinct rows?
is_unique(pgv_elcs)
## [1] TRUE
head(pgv_elcs)
## # A tibble: 6 x 6
## country_iso3c pgv_election_da… pgv_next_electi… pgv_cabinet_id
## <chr> <date> <date> <int>
## 1 AUT 1971-10-10 1975-10-05 365
## 2 AUT 1975-10-05 1979-05-06 350
## 3 AUT 1979-05-06 1983-04-24 463
## 4 AUT 1983-04-24 1986-11-23 828
## 5 AUT 1986-11-23 1990-10-07 873
## 6 AUT 1990-10-07 1994-10-09 524
## # … with 2 more variables: pgv_cabinet_name <chr>,
## # pgv_cabinet_start_date <date>
The logic in pgv_elcs is the following:
- Each row is a unique country-election configuration.
- Column election_date records the actual date of the election, while column next_election_date records the date of the upcoming election.
- When we compare start_date (the selected cabinet’s start date) and next_election_date, we can confirm that the upcoming election falls into the period of activity of the cabinet we selected.
- Parties in the ‘Kreisky IV’ cabinet (Austria), which was formed from the parliament elected on 1979-05-06 and took office on 1979-06-05, for instance, are those parties that were in office when the next election was held (on 1983-04-24).
- This ‘next’ election, in turn, is the one used in the CMP data to index policy positions of those parties who managed to enter a parliament in the given election.
- Note, however, that these parties have likely issued their manifestos before they were elected into office.
- Hence, if we want to know whether a manifesto was written by a party that was holding government office, we need to look at the cabinet that was running (i.e., in office) when the election was held (in the above example, parties in the ‘Kreisky IV’ cabinet, not in the ‘Vranitzky I’, which took office on 1986-06-16 and was only formed based on the results of the 1983-04-24 election).
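The logic listed above can be traced on a small toy example. This is a minimal sketch only: the country code, cabinet names and dates are invented, and the column names are simplified stand-ins for the PGV ones. Unlike the pipeline above, it sorts elections explicitly before calling lead(), which is an extra safeguard I add here rather than something the PGV data necessarily requires.

```r
library(dplyr)

# Toy cabinets view (all names and dates invented for illustration).
toy_cabs <- tibble(
  country       = "AAA",
  election_date = as.Date(c("2000-03-01", "2000-03-01", "2004-03-07", "2008-03-02")),
  cabinet_name  = c("Cabinet A", "Cabinet B", "Cabinet C", "Cabinet D"),
  start_date    = as.Date(c("2000-04-10", "2002-01-15", "2004-04-20", "2008-04-01"))
)

toy_running <- toy_cabs %>%
  # one row per country-election, ordered by date so that lead() is meaningful
  distinct(country, election_date) %>%
  arrange(country, election_date) %>%
  group_by(country) %>%
  mutate(next_election_date = lead(election_date)) %>%
  ungroup() %>%
  # add the cabinets formed from each election
  left_join(toy_cabs, by = c("country", "election_date")) %>%
  # keep the cabinet formed last: it is the one in office at the next election
  group_by(country, election_date) %>%
  filter(start_date == max(start_date)) %>%
  ungroup() %>%
  # the most recent election has no 'next' election yet
  filter(!is.na(next_election_date))

toy_running
```

Each remaining row pairs an election with the cabinet that was still in office when the following election took place, so start_date always precedes next_election_date.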
Create a link table matching CMP to PGV election dates
Once I have leveraged the PGV data to identify running cabinets for elections, I want to add this information to the CMP data. For this purpose I create a new link table mapping CMP to PGV elections. This step, too, requires some care, since we cannot simply assume that election dates always match exactly.
# join country-elections on (inexact) election dates
ctr_elcs <- pgv_elcs %>%
# NOTE: take next election due to differing data logics in CMP and PGV (see explanation above)
select(country_iso3c, pgv_next_election_date) %>%
unique() %>%
# get cross-product of elections within countries
full_join(
# join CMP data
cmp_elc_ptys %>%
select(country_iso3c, cmp_edate) %>%
unique()
, by = "country_iso3c"
) %>%
# compute date difference in days for each CMP-PGV election data pairing
# THIS POINT IS KEY!: take next election in PGV data due to differing data logics in CMP and PGV
mutate(abs_elc_date_diff = abs(pgv_next_election_date - cmp_edate)) %>%
# at the country CMP-election level (reference data), keep the PGV (next)
# election with the lowest date difference (0 days in all but a few instances)
group_by(country_iso3c, cmp_edate) %>%
top_n(1, wt = desc(abs_elc_date_diff)) %>%
ungroup()
As commented in the code, it is crucial to match CMP election dates to the ‘next’ (i.e., ‘upcoming’) election in the PGV data, since this is the one mapping to the cabinet that contains running cabinet parties.
# is distinct?
is_unique(ctr_elcs)
## [1] TRUE
# ensure that no CMP election maps to multiple PGV elections
ctr_elcs %>%
group_by(country_iso3c, cmp_edate) %>%
summarize(n_pgv_dates = n_distinct(pgv_next_election_date)) %>%
filter(n_pgv_dates > 1)
## # A tibble: 0 x 3
## # Groups: country_iso3c [0]
## # … with 3 variables: country_iso3c <chr>, cmp_edate <date>,
## # n_pgv_dates <int>
# ensure that no PGV election maps to multiple CMP elections
ctr_elcs %>%
group_by(country_iso3c, pgv_next_election_date) %>%
summarize(n_cmp_dates = n_distinct(cmp_edate)) %>%
filter(n_cmp_dates > 1)
## # A tibble: 0 x 3
## # Groups: country_iso3c [0]
## # … with 3 variables: country_iso3c <chr>,
## # pgv_next_election_date <date>, n_cmp_dates <int>
# inexact (best) matches ?
ctr_elcs %>%
filter(abs_elc_date_diff > 0)
## # A tibble: 13 x 4
## country_iso3c pgv_next_election_date cmp_edate abs_elc_date_diff
## <chr> <date> <date> <time>
## 1 FRA 1981-06-21 1981-06-14 7 days
## 2 FRA 1988-06-12 1988-06-05 7 days
## 3 FRA 1993-03-28 1993-03-21 7 days
## 4 FRA 1997-06-01 1997-05-25 7 days
## 5 FRA 2002-06-16 2002-06-09 7 days
## 6 FRA 2007-06-17 2007-06-10 7 days
## 7 FRA 2012-06-17 2012-06-10 7 days
## 8 FRA 2017-06-18 2017-06-11 7 days
## 9 ITA 1992-04-05 1992-04-06 1 days
## 10 ITA 1994-03-27 1994-03-28 1 days
## 11 ITA 2006-04-09 2006-04-10 1 days
## 12 ITA 2013-02-25 2013-02-24 1 days
## 13 SWE 1998-09-20 1998-09-21 1 days
While we can verify that no CMP election maps to multiple PGV elections, and vice versa, the last query returns some configurations where election dates do not match exactly. But inspecting the date differences allows us to conclude that all of them are small (i.e., ≤ 7 days), so there is little reason to question the validity of these matches.
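The nearest-date matching used above can be sketched on toy data. The dates are invented, slice_min() (available in newer dplyr versions) stands in for the top_n() call used in the post, and the 7-day tolerance max_gap_days is my own assumption based on the largest difference observed above.

```r
library(dplyr)

# Toy election dates (invented): the second CMP date is off by one day.
toy_cmp <- tibble(
  country   = "AAA",
  cmp_edate = as.Date(c("2000-03-01", "2004-03-08"))
)
toy_pgv <- tibble(
  country        = "AAA",
  pgv_next_edate = as.Date(c("2000-03-01", "2004-03-07", "2008-03-02"))
)

# Assumed tolerance: best matches farther apart than this are suspicious.
max_gap_days <- 7L

toy_links <- merge(toy_cmp, toy_pgv, by = "country") %>%  # all pairings within country
  mutate(abs_diff_days = abs(as.integer(pgv_next_edate - cmp_edate))) %>%
  # per CMP election, keep the PGV election closest in time
  group_by(country, cmp_edate) %>%
  slice_min(abs_diff_days, n = 1, with_ties = FALSE) %>%
  ungroup()

# Flag best matches that exceed the tolerance (none in this toy example).
suspicious <- filter(toy_links, abs_diff_days > max_gap_days)

toy_links
```

Checking the best matches against an explicit tolerance makes the manual inspection of date differences reproducible if the underlying data are updated.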
Having matched CMP and PGV elections, I can construct the link table matching CMP country-elections to PGV country-election-(running-)cabinet configurations:
# take country-election link table ...
running_cabinets <- ctr_elcs %>%
# and left-join PGV cabinet info (only info of running cabinets matched)
left_join(pgv_elcs, by = c("country_iso3c", "pgv_next_election_date")) %>%
rename(
pgv_election_date = pgv_next_election_date
, pgv_running_cabinet_id = pgv_cabinet_id
, pgv_running_cabinet_name = pgv_cabinet_name
, pgv_running_cabinet_start_date = pgv_cabinet_start_date
, pgv_running_cabinet_election_date = pgv_election_date
) %>%
select(
country_iso3c
, cmp_edate
, pgv_election_date
, pgv_running_cabinet_name
, pgv_running_cabinet_id
, pgv_running_cabinet_start_date
, pgv_running_cabinet_election_date
)
We can validate that I have matched exactly one running cabinet to each CMP election:
# any duplicates?
running_cabinets %>%
group_by(country_iso3c, cmp_edate) %>%
filter(n_distinct(pgv_running_cabinet_id) != 1)
## # A tibble: 0 x 7
## # Groups: country_iso3c, cmp_edate [0]
## # … with 7 variables: country_iso3c <chr>, cmp_edate <date>,
## # pgv_election_date <date>, pgv_running_cabinet_name <chr>,
## # pgv_running_cabinet_id <int>,
## # pgv_running_cabinet_start_date <date>,
## # pgv_running_cabinet_election_date <date>
Create link table matching CMP and PGV party IDs
Now that I’ve solved the first problem (matching running cabinets to the CMP data), I can deal with the other problem: matching PGV to CMP parties. I’ll use the link-table provided by PartyFacts. We first download the file:
# define your data dir
data_dir <- "path/to/your/data/directory"
# download link table
file_name <- "partyfacts-mapping.csv"
if (!file_name %in% list.files(data_dir)) {
url <- "https://partyfacts.herokuapp.com/download/external-parties-csv/"
download.file(url, file.path(data_dir, file_name))
}
Next, we create a link table matching PGV and CMP party IDs:
# read link table
ptf <- read_csv(file.path(data_dir, file_name))
# NOTE: on 2019-02-23, this raised some warnings, which can be ignored
# check if all countries are covered in party-facts (PTF) data
all(countries$country_iso3c %in% ptf$country)
## [1] TRUE
# create CMP-PGV party link table
pty_links <- ptf %>%
# for select countries
filter(country %in% countries$country_iso3c) %>%
# take parties CMP IDs
filter(dataset_key == "manifesto") %>%
select(partyfacts_id, country, dataset_party_id) %>%
# inner join drops both CMP parties that have no matching PGV ID,
# and PGV parties for which no matching CMP code exists
inner_join(
ptf %>%
# parties PGV IDs (where possible)
filter(dataset_key == "parlgov") %>%
select(partyfacts_id, country, dataset_party_id)
, by = c("partyfacts_id" = "partyfacts_id")
, suffix = c("_cmp", "_pgv")
) %>%
filter(country_cmp == country_pgv) %>%
select(-country_pgv) %>%
rename(
country_iso3c = country_cmp
, cmp_party = dataset_party_id_cmp
, pgv_party_id = dataset_party_id_pgv
, ptf_party_id = partyfacts_id
) %>%
mutate_at(3:4, as.integer)
We get a dataframe with 310 rows, but there are instances of both 1:m and m:1 PGV to CMP party matchings, as the queries below demonstrate:
nrow(pty_links)
## [1] 310
# is distinct?
is_unique(pty_links)
## [1] TRUE
# any PGV ID matches to multiple CMP parties?
pty_links %>%
group_by(pgv_party_id) %>%
filter(n_distinct(cmp_party) > 1)
## # A tibble: 27 x 4
## # Groups: pgv_party_id [11]
## ptf_party_id country_iso3c cmp_party pgv_party_id
## <int> <chr> <int> <int>
## 1 698 BEL 21912 969
## 2 554 BEL 21422 454
## 3 698 BEL 21423 969
## 4 554 BEL 21425 454
## 5 36 BEL 21916 501
## 6 36 BEL 21913 501
## 7 1586 BEL 21221 1113
## 8 1586 BEL 21330 1113
## 9 516 GRC 34020 1441
## 10 516 GRC 34211 1441
## # … with 17 more rows
# any CMP ID matches to multiple PGV parties?
pty_links %>%
group_by(cmp_party) %>%
filter(n_distinct(pgv_party_id) > 1)
## # A tibble: 15 x 4
## # Groups: cmp_party [6]
## ptf_party_id country_iso3c cmp_party pgv_party_id
## <int> <chr> <int> <int>
## 1 878 ITA 32220 809
## 2 878 ITA 32220 2666
## 3 851 ITA 32110 910
## 4 851 ITA 32110 1304
## 5 4019 ESP 33099 2607
## 6 4019 ESP 33099 2606
## 7 4019 ESP 33099 62
## 8 4019 ESP 33020 2607
## 9 4019 ESP 33020 2606
## 10 4019 ESP 33020 62
## 11 4019 ESP 33098 2607
## 12 4019 ESP 33098 2606
## 13 4019 ESP 33098 62
## 14 1567 GBR 51620 773
## 15 1567 GBR 51620 1496
Below we’ll see whether these 1:m and m:1 matchings of PGV to CMP IDs disappear once we add country-election(-cabinet) info.
# how many matched?
cmp_elc_ptys %>%
left_join(pty_links, by = c("country_iso3c", "cmp_party")) %>%
summarise(
n = n()
, n_matched = sum(!is.na(pgv_party_id))
)
## # A tibble: 1 x 2
## n n_matched
## <int> <int>
## 1 1398 1331
# any CMP party in data matches to no PGV parties?
cmp_elc_ptys %>%
left_join(pty_links, by = c("country_iso3c", "cmp_party")) %>%
group_by(country_iso3c, cmp_edate, cmp_party) %>%
filter(n_distinct(pgv_party_id, na.rm = TRUE) < 1)
## # A tibble: 67 x 7
## # Groups: country_iso3c, cmp_edate, cmp_party [67]
## country_iso3c cmp_edate cmp_party cmp_partyname cmp_partyabbrev
## <chr> <date> <dbl> <chr> <chr>
## 1 FIN 1991-03-17 14223 Left Wing Al… VAS
## 2 FIN 1995-03-19 14223 Left Wing Al… VAS
## 3 FIN 1999-03-21 14223 Left Wing Al… VAS
## 4 FIN 2003-03-16 14223 Left Wing Al… VAS
## 5 FIN 2007-03-18 14223 Left Wing Al… VAS
## 6 FIN 2011-04-17 14223 Left Wing Al… VAS
## 7 BEL 2007-06-10 21917 Flemish Inte… VB
## 8 BEL 2010-06-13 21917 Flemish Inte… VB
## 9 NLD 1982-09-08 22710 Centre Party ""
## 10 NLD 1994-05-03 22955 Union 55+ Unie 55+
## # … with 57 more rows, and 2 more variables: ptf_party_id <int>,
## # pgv_party_id <int>
# any CMP party in data matches to multiple PGV parties?
cmp_elc_ptys %>%
left_join(pty_links, by = c("country_iso3c", "cmp_party")) %>%
group_by(country_iso3c, cmp_edate, cmp_party) %>%
filter(n_distinct(pgv_party_id, na.rm = TRUE) > 1)
## # A tibble: 52 x 7
## # Groups: country_iso3c, cmp_edate, cmp_party [24]
## country_iso3c cmp_edate cmp_party cmp_partyname cmp_partyabbrev
## <chr> <date> <dbl> <chr> <chr>
## 1 ITA 1983-06-26 32220 Italian Comm… PCI
## 2 ITA 1983-06-26 32220 Italian Comm… PCI
## 3 ITA 1987-06-14 32110 Green Federa… FdV
## 4 ITA 1987-06-14 32110 Green Federa… FdV
## 5 ITA 1987-06-14 32220 Italian Comm… PCI
## 6 ITA 1987-06-14 32220 Italian Comm… PCI
## 7 ITA 1992-04-06 32110 Green Federa… FdV
## 8 ITA 1992-04-06 32110 Green Federa… FdV
## 9 ITA 1992-04-06 32220 Democratic P… PDS
## 10 ITA 1992-04-06 32220 Democratic P… PDS
## # … with 42 more rows, and 2 more variables: ptf_party_id <int>,
## # pgv_party_id <int>
We see that there exist no matching PGV party IDs for some configurations in the CMP data. Note, however, that this is problematic only insofar as we want to get the government status info from PGV: if we cannot match a PGV party to a CMP party, we cannot say whether the given (unmatched) CMP party was in government or not.
I’ll deal with this problem in the next step. Specifically, I’ll check whether further matching efforts need to be undertaken in case we cannot match all gov’t parties in a configuration.
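The two kinds of matching problems (no match and multiple matches) can be reproduced on a toy link table. This is a minimal sketch with invented party IDs; anti_join() is used here as an alternative way to list unmatched parties and is not the approach taken in the queries above.

```r
library(dplyr)

# Toy data (invented IDs): CMP parties and a CMP-PGV party link table.
toy_cmp_parties <- tibble(cmp_party = c(101, 102, 103))
toy_party_links <- tibble(
  cmp_party    = c(101, 102, 102),
  pgv_party_id = c(1L, 2L, 3L)
)

# CMP parties without any PGV counterpart in the link table
unmatched <- toy_cmp_parties %>%
  anti_join(toy_party_links, by = "cmp_party")

# CMP parties linked to more than one PGV party (1:m)
multi_matched <- toy_party_links %>%
  group_by(cmp_party) %>%
  summarise(n_pgv = n_distinct(pgv_party_id)) %>%
  filter(n_pgv > 1)

unmatched
multi_matched
```

In this toy setup, party 103 has no PGV counterpart and party 102 maps to two PGV parties.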
Check if all gov’t parties can be matched
First, we enrich the CMP data with PGV party IDs (where possible) as provided in the link table.
cmp_w_pgv_ids <- cmp_elc_ptys %>%
left_join(pty_links, by = c("country_iso3c", "cmp_party")) %>%
select(-ptf_party_id) %>%
group_by(country_iso3c, cmp_edate, cmp_party) %>%
mutate(cmp_n_pgv_ids = n_distinct(pgv_party_id, na.rm = TRUE)) %>%
ungroup()
Then we right-join party information from the PGV cabinets view to the dataset containing running cabinet info (running_cabinets) created above.
# create running-cabinet party-level dataset
running_parties <- cabs %>%
# select party-cabinet info from original PGV cabinets view
select(
country_name_short, cabinet_id,
party_id, party_name_english, party_name_short,
caretaker, cabinet_party
) %>%
# add party CMP ID (where exists) from original PGV parties view
left_join(
ptys %>% select(party_id, cmp)
, by = "party_id"
) %>%
# compute cabinet size
group_by(country_name_short, cabinet_id) %>%
mutate(cabinet_size = sum(cabinet_party)) %>%
ungroup() %>%
# add prefixes to all but the first column
rename_at(-1, ~ paste0("pgv_", .)) %>%
# join only running cabinets
right_join(
running_cabinets
, by = c(
"country_name_short" = "country_iso3c"
, "pgv_cabinet_id" = "pgv_running_cabinet_id"
)
) %>%
# rename
rename(
country_iso3c = country_name_short
, pgv_running_cabinet_id = pgv_cabinet_id
) %>%
# select columns
select(
# all columns as ordered in `running_cabinets` dataframe
!!names(running_cabinets)
# other columns ...
, pgv_cabinet_size
, pgv_caretaker
, pgv_party_name_short
, pgv_party_name_english
, pgv_party_id
, pgv_cabinet_party
, pgv_cmp
)
Now we are ready to join the information on running cabinet parties to the CMP dataset. We use PGV party IDs, keeping in mind that there is a subset of parties in the CMP data for which we could not identify a matching PGV party.
Specifically, we perform a full outer join which gives us a stacked version of the CMP and PGV data, containing:
- all matching country-election-party pairings
- all non-matching country-election-parties from the CMP data
- all non-matching country-election-parties from the PGV data
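The three kinds of rows listed above can be seen in a toy full outer join. This is a sketch with invented party abbreviations and IDs; the column row_type is a hypothetical helper added only for illustration.

```r
library(dplyr)

# Toy CMP side (invented): party "CP" has no PGV ID in the link table.
toy_cmp_side <- tibble(
  pgv_party_id = c(1L, 2L, NA),
  cmp_abbrev   = c("AP", "BP", "CP")
)
# Toy PGV side (invented): party 4 has no CMP counterpart.
toy_pgv_side <- tibble(
  pgv_party_id  = c(1L, 2L, 4L),
  cabinet_party = c(1L, 0L, 0L)
)

toy_typed <- full_join(toy_cmp_side, toy_pgv_side, by = "pgv_party_id") %>%
  # label each row by where its information comes from
  mutate(row_type = case_when(
    !is.na(cmp_abbrev) & !is.na(cabinet_party) ~ "matched",
    !is.na(cmp_abbrev)                         ~ "CMP only",
    TRUE                                       ~ "PGV only"
  ))

toy_typed
```

Rows labelled "CMP only" carry NA in the PGV columns and rows labelled "PGV only" carry NA in the CMP columns.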
# join PGV party-level data to PGV-ID-enriched CMP country-election-party data
cmp_full_join_pgv <- cmp_w_pgv_ids %>%
# add running parties (full outer join!)
full_join(running_parties, by = c("country_iso3c", "cmp_edate", "pgv_party_id"))
We can better understand the structure of the resulting dataset by inspecting an example configuration:
# inspect an example configuration
cmp_full_join_pgv %>%
filter(country_iso3c == "BEL", (pgv_election_date == "2007-06-10" | cmp_edate == "2007-06-10")) %>%
select(
country_iso3c, cmp_edate,
pgv_running_cabinet_name, pgv_running_cabinet_start_date,
cmp_partyabbrev, pgv_party_name_short
)
## # A tibble: 15 x 6
## country_iso3c cmp_edate pgv_running_cab… pgv_running_cab…
## <chr> <date> <chr> <date>
## 1 BEL 2007-06-10 Verhofstadt II 2003-07-12
## 2 BEL 2007-06-10 <NA> NA
## 3 BEL 2007-06-10 Verhofstadt II 2003-07-12
## 4 BEL 2007-06-10 <NA> NA
## 5 BEL 2007-06-10 Verhofstadt II 2003-07-12
## 6 BEL 2007-06-10 Verhofstadt II 2003-07-12
## 7 BEL 2007-06-10 Verhofstadt II 2003-07-12
## 8 BEL 2007-06-10 Verhofstadt II 2003-07-12
## 9 BEL 2007-06-10 <NA> NA
## 10 BEL 2007-06-10 Verhofstadt II 2003-07-12
## 11 BEL 2007-06-10 Verhofstadt II 2003-07-12
## 12 BEL 2007-06-10 Verhofstadt II 2003-07-12
## 13 BEL 2007-06-10 <NA> NA
## 14 BEL 2007-06-10 Verhofstadt II 2003-07-12
## 15 BEL 2007-06-10 Verhofstadt II 2003-07-12
## # … with 2 more variables: cmp_partyabbrev <chr>,
## # pgv_party_name_short <chr>
In this example (the Belgian ‘Verhofstadt II’ cabinet, in PGV naming, which took office on 2003-07-12), parties ‘VB’ and ‘FN’ in the PGV data could not be matched to any party within this country-election configuration in the CMP dataset when using PGV party IDs (as obtained from the party-facts link table). Conversely, parties ‘groen!’, ‘sp.a’, ‘LDD’ and ‘VB’ in the CMP dataset could not be matched to any party within this country-election configuration in the PGV data. Notably, although ‘VB’ exists in both datasets, the CMP ‘VB’ could not be matched to the PGV ‘VB’ because the PGV party ID obtained through the party-facts link table does not match the one used in the original PGV data.
What’s more, CMP parties for which no PGV party could be matched using PGV party IDs have NAs in columns originating from the PGV data frame, and vice versa. So we want to fill in the missing information. I will do this by imputing missing values from the configuration context where possible.3
That’s what I do next:
# fill-in missing information by inferring from configuration's contexts
cmp_full_join_pgv_filled <- cmp_full_join_pgv %>%
# a) take PGV country-election groupings ...
group_by(country_iso3c, pgv_election_date) %>%
# ... fill-in missing CMP election date (can be inferred from grouping)
fill(cmp_edate, .direction = "up") %>%
fill(cmp_edate, .direction = "down") %>%
# ... and remove PGV configurations that are completely missing in CMP data (if any)
filter(!is.na(cmp_edate)) %>% # should be all
# b) take CMP country-election groupings ...
group_by(country_iso3c, cmp_edate) %>%
# ... and fill-in missing but inferrable PGV info
fill(
pgv_election_date
, pgv_running_cabinet_start_date
, pgv_running_cabinet_name
, pgv_running_cabinet_id
, pgv_running_cabinet_election_date
, pgv_cabinet_size
, pgv_caretaker
, .direction = "down"
) %>%
fill(
pgv_election_date
, pgv_running_cabinet_start_date
, pgv_running_cabinet_name
, pgv_running_cabinet_id
, pgv_running_cabinet_election_date
, pgv_cabinet_size
, pgv_caretaker
, .direction = "up"
) %>%
ungroup()
Filling in missing but inferable information from the context gives complete configurations, for which we know which CMP parties are non-matching in the PGV data and vice versa:
cmp_full_join_pgv_filled %>%
filter(country_iso3c == "BEL", cmp_edate == "2007-06-10") %>%
select(
country_iso3c, cmp_edate,
pgv_running_cabinet_name, pgv_running_cabinet_start_date,
cmp_partyabbrev, pgv_party_name_short,
pgv_cabinet_size, pgv_cabinet_party
)
## # A tibble: 15 x 8
## country_iso3c cmp_edate pgv_running_cab… pgv_running_cab…
## <chr> <date> <chr> <date>
## 1 BEL 2007-06-10 Verhofstadt II 2003-07-12
## 2 BEL 2007-06-10 Verhofstadt II 2003-07-12
## 3 BEL 2007-06-10 Verhofstadt II 2003-07-12
## 4 BEL 2007-06-10 Verhofstadt II 2003-07-12
## 5 BEL 2007-06-10 Verhofstadt II 2003-07-12
## 6 BEL 2007-06-10 Verhofstadt II 2003-07-12
## 7 BEL 2007-06-10 Verhofstadt II 2003-07-12
## 8 BEL 2007-06-10 Verhofstadt II 2003-07-12
## 9 BEL 2007-06-10 Verhofstadt II 2003-07-12
## 10 BEL 2007-06-10 Verhofstadt II 2003-07-12
## 11 BEL 2007-06-10 Verhofstadt II 2003-07-12
## 12 BEL 2007-06-10 Verhofstadt II 2003-07-12
## 13 BEL 2007-06-10 Verhofstadt II 2003-07-12
## 14 BEL 2007-06-10 Verhofstadt II 2003-07-12
## 15 BEL 2007-06-10 Verhofstadt II 2003-07-12
## # … with 4 more variables: cmp_partyabbrev <chr>,
## # pgv_party_name_short <chr>, pgv_cabinet_size <int>,
## # pgv_cabinet_party <int>
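The fill-in step used above can be illustrated on toy data. This is a minimal sketch with invented country codes and cabinet names; it only shows that filling within groups in both directions propagates the single known value per group without leaking across groups.

```r
library(dplyr)
library(tidyr)

# Toy data (invented): one known cabinet name per country, the rest missing.
toy <- tibble(
  country      = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  cabinet_name = c(NA, "Cabinet B", NA, "Cabinet X", NA)
)

toy_filled <- toy %>%
  group_by(country) %>%
  # fill downwards first, then upwards, always within the country grouping
  fill(cabinet_name, .direction = "down") %>%
  fill(cabinet_name, .direction = "up") %>%
  ungroup()

toy_filled
```

This kind of imputation is only safe when each group contains at most one distinct non-missing value, which holds above because only one running cabinet was selected per election.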
So now we are ready to return to the core problem: the fact that we cannot match PGV party info to some CMP parties. This is problematic for our purpose (identifying parties’ government status) if and only if not all PGV parties that are recorded as cabinet parties for a given configuration can be matched. This is because if we can match all cabinet parties, we can infer the government status of non-matched parties: none of them is a cabinet party.
A straightforward way to validate this condition is to try to replicate the cabinet_size measure. The logic is simple: For configurations in which aggregating party cabinet membership information at the country-election(-cabinet) level does not allow us to replicate this measure, we know that we failed to match at least one government party. In such a case, we could only infer the missing government status information if only one party in this configuration was not matched. Otherwise, we would not have enough information to infer non-matching parties’ government status.
So I check this in the dataset cmp_full_join_pgv_filled:
cmp_full_join_pgv_filled %>%
group_by(country_iso3c, cmp_edate, pgv_running_cabinet_start_date) %>%
mutate(
postmatch_cabinet_size = n_distinct(ifelse(pgv_cabinet_party == 1, pgv_party_name_short, NA), na.rm = TRUE)
, flag = pgv_cabinet_size == postmatch_cabinet_size
) %>%
filter(!flag)
## # A tibble: 0 x 20
## # Groups: country_iso3c, cmp_edate, pgv_running_cabinet_start_date
## # [0]
## # … with 20 variables: country_iso3c <chr>, cmp_edate <date>,
## # cmp_party <dbl>, cmp_partyname <chr>, cmp_partyabbrev <chr>,
## # pgv_party_id <int>, cmp_n_pgv_ids <int>,
## # pgv_election_date <date>, pgv_running_cabinet_name <chr>,
## # pgv_running_cabinet_id <int>,
## # pgv_running_cabinet_start_date <date>,
## # pgv_running_cabinet_election_date <date>, pgv_cabinet_size <int>,
## # pgv_caretaker <int>, pgv_party_name_short <chr>,
## # pgv_party_name_english <chr>, pgv_cabinet_party <int>,
## # pgv_cmp <int>, postmatch_cabinet_size <int>, flag <lgl>
Wow! We are lucky: I was able to replicate the cabinet_size measure for all configurations in the dataset, and hence can infer that parties that could not be matched are invariably opposition parties.4
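Why the check counts distinct PGV parties rather than summing the cabinet-party indicator can be seen in a toy configuration. The data below are invented: one PGV cabinet party ("P1") is linked to two CMP parties, and one CMP party has no PGV match at all.

```r
library(dplyr)

# Toy configuration (invented): a two-party cabinet consisting of P1 and P2.
toy_config <- tibble(
  cmp_party     = c(101, 102, 103, 104),
  pgv_party     = c("P1", "P1", "P2", NA),
  cabinet_party = c(1L, 1L, 1L, NA),
  cabinet_size  = 2L
)

toy_sizes <- toy_config %>%
  summarise(
    # summing double counts P1, which appears once per linked CMP party
    size_by_sum      = sum(cabinet_party, na.rm = TRUE),
    # counting distinct PGV cabinet parties does not
    size_by_distinct = n_distinct(ifelse(cabinet_party == 1, pgv_party, NA), na.rm = TRUE),
    cabinet_size     = first(cabinet_size)
  )

toy_sizes
```

Only the distinct count reproduces the recorded cabinet size of 2; the plain sum yields 3.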
Create the complete dataset
Now I have almost reached my goal. The final step is to add the inferred government status information (i.e., replace NA values with 0), drop PGV parties that are non-matching in the CMP data, and gather all datasets at the level of CMP country-election-party configurations.
# take the filled-in fully-joined dataset
cmp_w_pty_govt_status <- cmp_full_join_pgv_filled %>%
# get rid of PGV parties that are non-matching in CMP data
filter(!is.na(cmp_party)) %>%
# infer missing government status by ...
# ... a) looking within CMP country-election-party configurations, and
group_by(country_iso3c, cmp_edate, cmp_party) %>%
fill(pgv_cabinet_party) %>%
# ... b) replace with 0 where still NA
mutate(pgv_cabinet_party = ifelse(is.na(pgv_cabinet_party), 0, pgv_cabinet_party)) %>%
# aggregate at party-level within CMP data
group_by(
country_iso3c
, cmp_edate
, pgv_running_cabinet_name
, pgv_running_cabinet_id
, pgv_running_cabinet_start_date
, pgv_running_cabinet_election_date
, pgv_cabinet_size
, pgv_caretaker
, cmp_party
, cmp_partyname
, cmp_partyabbrev
, pgv_cabinet_party
) %>%
# add informative comments
summarize(
comment = case_when(
cmp_n_pgv_ids == 0 ~ "no matching party found within matching ParlGov cabinet configuration"
, cmp_n_pgv_ids == 1 ~ sprintf(
"CMP party matches to party %s (%s) within ParlGov cabinet configuration"
, if_na(pgv_party_name_short, "?")
, if_na(pgv_party_id, "?")
)
, cmp_n_pgv_ids > 1 ~ sprintf(
"CMP party matches to multiple parties within ParlGov cabinet configuration: %s"
, paste0(
sprintf(
"%s (%s)"
, if_na(pgv_party_name_short, "?")
, if_na(pgv_party_id, "?")
)
, collapse = ", "
)
)
, TRUE ~ NA_character_
# in order to aggregate, keep distinct comments (always 1 within grouping)
) %>% unique()
) %>%
ungroup()
We can verify that we have added valid government status information to every row in the original CMP dataset:
nrow(cmp_w_pty_govt_status) == nrow(cmp)
## [1] TRUE
The resulting dataset contains everything we need to join it back to the original CMP dataset cmp_raw …
head(cmp_w_pty_govt_status)
## # A tibble: 6 x 13
## country_iso3c cmp_edate pgv_running_cab… pgv_running_cab…
## <chr> <date> <chr> <int>
## 1 AUT 1983-04-24 Kreisky IV 463
## 2 AUT 1983-04-24 Kreisky IV 463
## 3 AUT 1983-04-24 Kreisky IV 463
## 4 AUT 1986-11-23 Vranitzky I 828
## 5 AUT 1986-11-23 Vranitzky I 828
## 6 AUT 1986-11-23 Vranitzky I 828
## # … with 9 more variables: pgv_running_cabinet_start_date <date>,
## # pgv_running_cabinet_election_date <date>, pgv_cabinet_size <int>,
## # pgv_caretaker <int>, cmp_party <dbl>, cmp_partyname <chr>,
## # cmp_partyabbrev <chr>, pgv_cabinet_party <dbl>, comment <chr>
… but I leave this exercise to you 😊
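As a starting point, here is a minimal sketch of what such a join could look like. The two toy dataframes are invented stand-ins for cmp and cmp_w_pty_govt_status, and the join keys are an assumption based on the column names used in this post (edate and party on the CMP side, cmp_edate and cmp_party on the other).

```r
library(dplyr)

# Toy stand-in for the CMP data (values invented for illustration).
toy_cmp <- tibble(
  country_iso3c = c("AAA", "AAA"),
  edate         = as.Date(c("2004-03-07", "2004-03-07")),
  party         = c(101, 102),
  rile          = c(-5.2, 12.4)
)
# Toy stand-in for the party-level government status data.
toy_status <- tibble(
  country_iso3c     = c("AAA", "AAA"),
  cmp_edate         = as.Date(c("2004-03-07", "2004-03-07")),
  cmp_party         = c(101, 102),
  pgv_cabinet_party = c(1, 0)
)

toy_joined <- toy_cmp %>%
  left_join(
    toy_status,
    by = c("country_iso3c", "edate" = "cmp_edate", "party" = "cmp_party")
  )

toy_joined
```

After a join like this, it is worth checking that the number of rows is unchanged and that no government status value is missing.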
1. But see the indicator progtype in the CMP data, which distinguishes different techniques of obtaining policy position measurements.
2. See https://manifesto-project.wzb.eu/information/documents/api for detailed information on the API.
3. In the example of the ‘Verhofstadt II’ cabinet, we know, for instance, that we can write ‘Verhofstadt II’ in column pgv_running_cabinet_name where there are currently NAs, because we matched PGV to CMP data using our country-election date look-up table, and we know that we have selected only one cabinet configuration per election from the PGV data (the ‘running’ cabinet). Hence there are no ambiguities at the election-cabinet level.
4. In case you wonder about the clumsy definition of column postmatch_cabinet_size: Since some PGV parties match to multiple CMP parties, I need to count distinct PGV parties to exactly replicate the cabinet_size measure. Had I instead used postmatch_cabinet_size = sum(pgv_cabinet_party, na.rm = TRUE), the indicator would have double counted m:1 matched CMP parties, and in these cases postmatch_cabinet_size > pgv_cabinet_size.