Merging Comparative Manifesto Project and ParlGov cabinet composition data at the party level
07 Mar 2019
Goal of this Post
The data provided by the Comparative Manifesto Project (CMP) is a rich and valuable source of information for researchers interested in parties’ policy positions and salience strategies. The CMP main dataset records policy issue positions at the level of parties nested within countries and elections. Specifically, issue positions and salience measurements are derived from parties’ election manifestos.[1] These manifestos are usually produced as part of parties’ campaigning efforts and are typically issued one to six months prior to the upcoming election.
In this post, we will focus on a particular piece of information that is lacking in the CMP data: an indicator of parties’ government status that would allow us to distinguish government from opposition parties. Specifically, our goal is to find out whether the manifesto used in a given election was published by a party that was in government at that point in time.
In order to accomplish this goal, I will add to the CMP data information contained in ParlGov’s (PGV) cabinet view. In the PGV cabinet view, elections map m:1 to countries, elections map 1:m to cabinets, and parties are nested in election-cabinet configurations. It is a notable feature of this dataset that it keeps track of the different cabinets formed based on the results of a given election as well as of their timing (i.e., cabinet start dates).
Similar but different: Relating CMP and PGV data
Two data logics
Adding PGV indicators to the CMP data sounds simpler than it actually is (see also here and here). In the CMP data, manifesto-related indicators (e.g., issue positions) map 1:1 to country-election-party configurations. But the election date recorded is that of the post-campaign election. That is, for a manifesto published on day $X$, the election on day $Y$ is the upcoming election in the sense that $X < Y$.
This stands in contrast to the PGV data, where the recorded election is only the point of reference for cabinet formation, so that if a cabinet forms on day $Z$ based on the results of the election on day $Y$, we always have $Y < Z$. As a consequence, a first complication when determining CMP parties’ government status is that we cannot simply join PGV cabinet data to CMP election data at the party level using election dates. Instead, we first need to identify the cabinet that was running (i.e., in office) when the election to which the CMP manifesto data point was held.
Different IDs and incomplete lookup/link tables
The second complication is that parties are not only named and abbreviated differently in PGV than in CMP data, but the dataset-specific IDs generally do not match either. What is more, to my knowledge there exists no complete look-up table that would allow us to match parties between datasets.
Hands-on
But it would probably not have been so much fun to write this post if I hadn’t found ways to solve these problems 👍 So let’s get started!
First things first
First, some basic setup: I load all required packages and define some helper functions (and add `roxygen2` documentation, just in case):
Next we’ll define a dataframe containing information on the countries we are interested in:
Get CMP data
Now we can download the CMP dataset using the Manifesto Project’s API. Note that you first have to register with manifesto-project.org and save the API key,[2] and then load the API key using `mp_setapikey`:
Checking our list of selected countries against the countries contained in the CMP, we can verify that none of the countries we are interested in is missing from the CMP data. Hence, I create the dataframe `cmp` that keeps only rows of the selected countries and only elections since the 1980s.
Here, we can nicely inspect the structure of the CMP dataset: countries map 1:m to elections, and elections map 1:m to parties.
Create unique CMP country-election-party combinations
Now we can get all distinct combinations of countries, elections and parties in the CMP data (dataframe `cmp_elc_ptys`), and validate that no invalid entries are in the resulting dataset.
Get PGV data
Next we download the most recent cabinets and parties views from the PGV website, and validate that all countries of interest are contained in them.
Identify ‘running’ cabinets in PGV data
The next step is crucial! As explained above, we are interested in running cabinet parties at the day of an election. Hence, we need to leverage both election data and cabinet start date information to identify just these. This is accomplished by the following very long, but extensively commented expression:
The logic in `pgv_elcs` is the following:
- Each row is a unique country-election configuration.
- Column `election_date` records the actual date of the election, while column `next_election_date` records the date of the upcoming election.
- When we compare `start_date` (the selected cabinet’s start date) and `next_election_date`, we can confirm that the upcoming election falls into the period of activity of the cabinet we selected.
- Parties in the ‘Kreisky IV’ cabinet (Austria), which was formed from the parliament elected on 1979-05-06 and took office on 1979-06-05, for instance, are those parties that were in office when the next election was held (on 1983-04-24).
- This ‘next’ election, in turn, is the one used in the CMP data to index policy positions of those parties who managed to enter parliament in the given election.
- Note, however, that these parties have likely issued their manifestos before they were elected into office.
- Hence, if we want to know whether a manifesto was written by a party that was holding government office, we need to look at the cabinet that was running (i.e., in office) when the election was held (in the above example, parties in the ‘Kreisky IV’ cabinet, not in the ‘Vranitzky I’, which took office on 1986-06-16 and was only formed based on the results of the 1983-04-24 election).
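To make this selection rule concrete, here is a minimal sketch in plain Python (not the code behind this post). It uses only the Austrian dates quoted in the list above. The field names and the helper `running_cabinet` are assumptions for illustration and may not match ParlGov's actual column names.

```python
from datetime import date

# Austrian elections and cabinets, restricted to the dates quoted above.
# Field names are illustrative assumptions.
elections = [
    {"country": "AUT", "election_date": date(1979, 5, 6)},
    {"country": "AUT", "election_date": date(1983, 4, 24)},
]
cabinets = [
    {"country": "AUT", "cabinet_name": "Kreisky IV",
     "election_date": date(1979, 5, 6), "start_date": date(1979, 6, 5)},
    {"country": "AUT", "cabinet_name": "Vranitzky I",
     "election_date": date(1983, 4, 24), "start_date": date(1986, 6, 16)},
]


def running_cabinet(country, election_day, cabinets):
    """Return the cabinet in office on election_day, or None if unknown.

    The running cabinet is the one with the latest start_date that
    still lies strictly before the election day.
    """
    candidates = [
        c for c in cabinets
        if c["country"] == country and c["start_date"] < election_day
    ]
    return max(candidates, key=lambda c: c["start_date"], default=None)


running = {
    e["election_date"]: running_cabinet(e["country"], e["election_date"], cabinets)
    for e in elections
}
```

With this toy data, the 1983-04-24 election maps to ‘Kreisky IV’. The 1979-05-06 election maps to nothing, because the data contains no cabinet that started before it.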
Create a link table matching CMP to PGV election dates
Once I have leveraged the PGV data to identify running cabinets for elections, I want to add this information to the CMP data. For this purpose I create a new link table mapping CMP to PGV elections. This step, too, requires great care, since we cannot simply assume that election dates always match exactly.
As commented in the code, it is crucial to match CMP election dates to the ‘next’/‘upcoming’ election in the PGV data, since this is the one mapping to the cabinet that contains running cabinet parties.
We can verify that no CMP election maps to multiple PGV elections, and vice versa. The last query does return some configurations where election dates do not match exactly, but inspecting the date differences allows us to conclude that all of them are small (i.e., ≤ 7 days), so there is little reason to question the veracity of this data.
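The tolerant date matching and the uniqueness checks could be sketched as follows in plain Python. The country codes and dates are made up for illustration. Using the 7-day window as a matching tolerance is an assumption of this sketch, taken from the differences reported above.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical (country, election date) pairs; not real CMP or PGV records.
cmp_elections = [
    ("AAA", date(1990, 3, 18)),
    ("AAA", date(1994, 10, 16)),
    ("BBB", date(1991, 6, 2)),
]
pgv_next_elections = [
    ("AAA", date(1990, 3, 18)),
    ("AAA", date(1994, 10, 10)),  # recorded six days earlier than in CMP
    ("BBB", date(1991, 6, 2)),
]


def link_elections(cmp_elections, pgv_elections, max_diff=timedelta(days=7)):
    """Pair CMP and PGV elections of the same country within max_diff."""
    links = []
    for country, cmp_date in cmp_elections:
        for pgv_country, pgv_date in pgv_elections:
            diff = abs(cmp_date - pgv_date)
            if pgv_country == country and diff <= max_diff:
                links.append({"country": country, "cmp_date": cmp_date,
                              "pgv_date": pgv_date, "diff_days": diff.days})
    return links


links = link_elections(cmp_elections, pgv_next_elections)

# Each CMP election and each PGV election should appear exactly once.
cmp_counts = Counter((l["country"], l["cmp_date"]) for l in links)
pgv_counts = Counter((l["country"], l["pgv_date"]) for l in links)
is_one_to_one = (all(n == 1 for n in cmp_counts.values())
                 and all(n == 1 for n in pgv_counts.values()))

# Pairs whose dates differ deserve a manual look.
inexact = [l for l in links if l["diff_days"] > 0]
```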
Having matched CMP and PGV elections, I can construct the link table matching CMP country-elections to PGV country-election-(running-)cabinet configurations:
We can validate that I have matched exactly one running cabinet to each CMP election:
Create link table matching CMP and PGV party IDs
Now that I’ve solved the first problem (matching running cabinets to the CMP data), I can deal with the other problem: matching PGV to CMP parties. I’ll use the link-table provided by PartyFacts. We first download the file:
Next, we create a link table matching PGV and CMP party IDs:
We get a dataframe with 310 rows, but there are both instances of 1:m and m:1 PGV to CMP party matchings, as the below queries demonstrate:
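One generic way to detect such 1:m and m:1 matchings is to group the link table in both directions. This is a sketch in plain Python, and the ID pairs are hypothetical, not taken from the PartyFacts file.

```python
from collections import defaultdict

# Hypothetical (pgv_party_id, cmp_party_id) pairs.
link = [(1, 101), (2, 102), (2, 103), (3, 104), (4, 104)]

cmp_per_pgv = defaultdict(set)
pgv_per_cmp = defaultdict(set)
for pgv_id, cmp_id in link:
    cmp_per_pgv[pgv_id].add(cmp_id)
    pgv_per_cmp[cmp_id].add(pgv_id)

# PGV parties linked to several CMP parties (1:m from the PGV side).
one_to_many = {k: sorted(v) for k, v in cmp_per_pgv.items() if len(v) > 1}
# CMP parties linked to several PGV parties (m:1 from the PGV side).
many_to_one = {k: sorted(v) for k, v in pgv_per_cmp.items() if len(v) > 1}
```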
Below we’ll see whether these 1:m and m:1 matchings between PGV and CMP IDs disappear once we add country-election(-cabinet) info.
We see that there exist no matching PGV party IDs for some configurations in the CMP data. Note, however, that this is problematic only insofar as we want to get the government status info from PGV: if we cannot match a PGV party to a CMP party, we cannot say whether the given (unmatched) CMP party was in government or not.
I’ll deal with this problem in the next step. Specifically, I’ll check whether further matching efforts are needed in case we cannot match all gov’t parties in a configuration.
Check if all gov’t parties can be matched
First, we enrich the CMP data with PGV party IDs (where possible) as provided in the link table.
Then we right-join party information from the PGV cabinets view to the dataset containing running cabinet info (`running_cabinets`) created above.
Now we are ready to join the information on running cabinet parties to the CMP dataset. We use PGV party IDs, keeping in mind that there is a subset of parties in the CMP data for which we could not identify a matching PGV party.
Specifically, we perform a full outer join which gives us a stacked version of the CMP and PGV data, containing:
- all matching country-election-party pairings
- all non-matching country-election-parties from the CMP data
- all non-matching country-election-parties from the PGV data
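The three kinds of rows can be illustrated with a small hand-rolled full outer join in plain Python. The helper `full_outer_join` and all column names and values are hypothetical. A real analysis would rely on a data-frame library's join instead.

```python
def full_outer_join(left, right, keys):
    """Full outer join of two lists of dicts; absent fields become None."""
    def key_of(row):
        return tuple(row[k] for k in keys)

    all_cols = {c for row in left + right for c in row}
    right_by_key = {}
    for row in right:
        right_by_key.setdefault(key_of(row), []).append(row)

    joined, seen = [], set()
    for lrow in left:
        k = key_of(lrow)
        matches = right_by_key.get(k)
        if matches:  # matching pairing
            seen.add(k)
            for rrow in matches:
                joined.append({**dict.fromkeys(all_cols), **lrow, **rrow})
        else:  # left-only row (here: CMP party without PGV match)
            joined.append({**dict.fromkeys(all_cols), **lrow})
    for k, rows in right_by_key.items():
        if k not in seen:  # right-only rows (here: unmatched PGV party)
            for rrow in rows:
                joined.append({**dict.fromkeys(all_cols), **rrow})
    return joined


cmp_rows = [
    {"election": "E1", "pgv_party_id": 10, "cmp_party": "A"},
    {"election": "E1", "pgv_party_id": None, "cmp_party": "B"},  # no PGV ID
]
pgv_rows = [
    {"election": "E1", "pgv_party_id": 10, "pgv_party": "a", "cabinet_party": 1},
    {"election": "E1", "pgv_party_id": 20, "pgv_party": "c", "cabinet_party": 0},
]

stacked = full_outer_join(cmp_rows, pgv_rows, keys=("election", "pgv_party_id"))
```

Here `stacked` holds one matched row (A with a), one CMP-only row (B) and one PGV-only row (c).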
We can better understand the structure of the resulting dataset by inspecting an example configuration:
In this example, the Belgian ‘Verhofstadt II’ cabinet (PGV naming) taking office on 2003-07-12, parties ‘VB’ and ‘FN’ in the PGV data could not be matched to any party within this country-election configuration in the CMP dataset when using PGV party IDs (as obtained from the PartyFacts link table). Conversely, parties ‘groen!’, ‘sp.a’, ‘LDD’ and ‘VB’ in the CMP dataset could not be matched to any party within this country-election configuration in the PGV data. Significantly, although ‘VB’ exists in both datasets, the CMP ‘VB’ could not be matched to the PGV ‘VB’ because the PGV party ID obtained through the PartyFacts link table does not match the one used in the original PGV data.
What’s more, CMP parties for which no PGV party could be matched using PGV party IDs have `NA`s in columns originating from the PGV data frame, and vice versa. So we want to fill in the missing information. I will do this by imputing missing values from the configuration context where possible.[3]
That’s what I do next:
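The idea of filling in values from the configuration context can be sketched like this in plain Python. The helper `fill_from_group`, the election labels and the party labels are hypothetical. Only the cabinet name and the column name `pgv_running_cabinet_name` come from this post.

```python
def fill_from_group(rows, group_keys, columns):
    """Fill None values in `columns` with the value known for the same group."""
    known = {}
    for row in rows:
        group = tuple(row[k] for k in group_keys)
        for col in columns:
            if row.get(col) is not None:
                known.setdefault((group, col), row[col])

    filled = []
    for row in rows:
        group = tuple(row[k] for k in group_keys)
        new_row = dict(row)
        for col in columns:
            if new_row.get(col) is None:
                # Stays None if the group offers no information at all.
                new_row[col] = known.get((group, col))
        filled.append(new_row)
    return filled


rows = [
    {"country": "BEL", "election": "E1", "cmp_party": "A",
     "pgv_running_cabinet_name": "Verhofstadt II"},
    {"country": "BEL", "election": "E1", "cmp_party": "B",
     "pgv_running_cabinet_name": None},
    {"country": "BEL", "election": "E2", "cmp_party": "A",
     "pgv_running_cabinet_name": None},
]

filled = fill_from_group(rows, ("country", "election"),
                         ["pgv_running_cabinet_name"])
```

This is only safe for columns that are constant within a configuration, such as the name of the single running cabinet selected per election.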
Filling in missing but inferable information from the context gives complete configurations for which we know which CMP party is non-matching in the PGV data and vice versa:
So now we are ready to return to the core problem: the fact that we cannot match PGV party info to some CMP parties. This is problematic for our purpose (identifying parties’ government status) if and only if not all PGV parties that are recorded as cabinet parties for a given configuration can be matched. This is because if we can match all cabinet parties, we can infer the government status of non-matched parties: none of them are cabinet parties.
A straightforward way to validate this condition is to try to replicate the `cabinet_size` measure. The logic is simple: For configurations in which aggregating party cabinet membership information at the country-election(-cabinet) level does not allow us to replicate this measure, we know that we failed to match at least one government party. In such a case, we could only infer the missing government status information if only one party in this configuration was not matched. Otherwise, we would not have enough information to infer non-matching parties’ government status.
So I check this in the dataset `cmp_full_join_pgv_filled`:
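A minimal sketch of such a replication check, in plain Python with hypothetical rows. Configuration ‘C2’ is deliberately constructed to fail, which does not happen in the actual data. The column names loosely follow those used in this post but are assumptions.

```python
from collections import defaultdict

rows = [
    # C1: cabinet of size 2; PGV party 11 matches two CMP parties (m:1).
    {"config": "C1", "cmp_party": "A1", "pgv_party_id": 11,
     "pgv_cabinet_party": 1, "pgv_cabinet_size": 2},
    {"config": "C1", "cmp_party": "A2", "pgv_party_id": 11,
     "pgv_cabinet_party": 1, "pgv_cabinet_size": 2},
    {"config": "C1", "cmp_party": "B", "pgv_party_id": 12,
     "pgv_cabinet_party": 1, "pgv_cabinet_size": 2},
    {"config": "C1", "cmp_party": "C", "pgv_party_id": None,
     "pgv_cabinet_party": None, "pgv_cabinet_size": 2},
    # C2: cabinet of size 2, but cabinet party 22 has no CMP match.
    {"config": "C2", "cmp_party": "D", "pgv_party_id": 21,
     "pgv_cabinet_party": 1, "pgv_cabinet_size": 2},
    {"config": "C2", "cmp_party": None, "pgv_party_id": 22,
     "pgv_cabinet_party": 1, "pgv_cabinet_size": 2},
]

matched_cabinet_parties = defaultdict(set)  # distinct PGV cabinet parties
naive_sum = defaultdict(int)                # counts m:1 matches twice
cabinet_size = {}
for row in rows:
    cabinet_size[row["config"]] = row["pgv_cabinet_size"]
    if row["cmp_party"] is not None and row["pgv_cabinet_party"] == 1:
        matched_cabinet_parties[row["config"]].add(row["pgv_party_id"])
        naive_sum[row["config"]] += 1

replicated = {
    config: len(matched_cabinet_parties[config]) == size
    for config, size in cabinet_size.items()
}
```

Counting distinct PGV party IDs replicates the size of ‘C1’, whereas the naive sum would report 3. ‘C2’ is flagged because one cabinet party has no CMP counterpart.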
Wow! We are lucky: I was able to replicate the `cabinet_size` measure for all configurations in the dataset, and hence can infer that parties that could not be matched are invariably opposition parties.[4]
Create the complete dataset
Now I have almost reached my goal. The final step is to add the inferred government status information (i.e., replace `NA` values with `0`), drop PGV parties that are non-matching in the CMP data, and gather all datasets at the level of CMP country-election-party configurations.
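These final steps might look as follows in plain Python, with hypothetical rows and column names. Collapsing multiple rows per CMP party with a maximum is an assumption of this sketch, not necessarily how the original code aggregates.

```python
rows = [
    {"config": "C1", "cmp_party": "A", "pgv_cabinet_party": 1},
    {"config": "C1", "cmp_party": "B", "pgv_cabinet_party": 0},
    {"config": "C1", "cmp_party": "C", "pgv_cabinet_party": None},   # unmatched CMP party
    {"config": "C1", "cmp_party": None, "pgv_cabinet_party": 0},     # PGV-only row
]

in_government = {}
for row in rows:
    if row["cmp_party"] is None:
        continue  # drop PGV parties without a CMP counterpart
    key = (row["config"], row["cmp_party"])
    status = row["pgv_cabinet_party"]
    if status is None:
        status = 0  # inferred: unmatched CMP parties are opposition parties
    # One entry per CMP country-election-party configuration.
    in_government[key] = max(in_government.get(key, 0), status)
```

Replacing missing values with 0 is only justified because the cabinet size check succeeded for every configuration.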
We can verify that we have added valid government status information to every row in the original CMP dataset:
The resulting dataset contains everything we need to join it back to the original CMP dataset `cmp_raw` …
… but I leave this exercise to you 😊
[1] But see the indicator `progtype` in the CMP data, distinguishing different techniques of obtaining policy position measurements.
[2] See https://manifesto-project.wzb.eu/information/documents/api for detailed information on the API.
[3] In the example of the ‘Verhofstadt II’ cabinet, we know, for instance, that we can write ‘Verhofstadt II’ in column `pgv_running_cabinet_name` where there are currently `NA`s because we matched PGV to CMP data using our country-election date look-up table, and we know that we have selected only one cabinet configuration per election from the PGV data (the ‘running’ cabinet). Hence there are no ambiguities at the election-cabinet level.
[4] In case you wonder about the clumsy definition of column `postmatch_cabinet_size`: Since some PGV parties match to multiple CMP parties, I need to count distinct PGV parties to exactly replicate the `cabinet_size` measure. If I had instead used `postmatch_cabinet_size = sum(pgv_cabinet_party, na.rm = TRUE)`, the indicator would have double counted m:1 matched CMP parties, and in these cases `postmatch_cabinet_size > pgv_cabinet_size`.