The main dataset contains three sets of content analytical variables.
- The three-digit main categories (per101 – 706)
- The 4-digits Central and Eastern European (CEE)-subcategories (per1011 – 7062)
- The 3+1 digit subcategories since version 5 of the coding instructions (per103_1 – 703_2)
All variables have in common that they indicate the share of quasi-sentences in the respective category calculated as a fraction of the overall number of allocated codes per document. A value of 5 within a cell in the column per501 indicates that 5% of allocated codes (the number of quasi-sentences) were coded with the code 501 (Positive mentions about the protection of the environment). This tutorial explains the existence, generation and usage of the different type of subcategories and their relation to the main categories in the main dataset, the South America dataset and the Manifesto Corpus.
The main categories: per101 – 706
The three digit variables (per101 –- per706) are the main categories of the coding scheme. For most of analyses these are the most relevant categories and they can be used without any precaution and without knowing anything about sub-categories at all. Data on the main categories is available for all countries and all elections covered by the dataset.
The CEE-subcategories: per1011 – 7062
The four digit variables (per1011 – per7062) are sub-categories mostly addressing issues in transitional democracies in (mostly) Central and Eastern European countries. However, they were introduced in version 1 of the coding instructions and were gradually abandoned in most of the countries as the issues could also be coded into the three digit main categories. Currently, the share of these categories is not included in the main categories. If analysts use observations from CEE countries for which the CEE codes were used and want to compare them to manifestos without CEE codes then they should aggregate such CEE codes into the main categories. For example per7062 should be added to per706. Our R package manifestoR can easily do this by using the function aggregate_pers_cee
on the main dataset.
The following graph shows the elections were CEE categories were used.
The manual_5-subcategories: per103_1 – 703_2
The four digit variables with underscore (per103_1 – 703_2) are new categories introduced with version 5 of the coding instructions. The new subcategories are coded instead of their respective main category. So, from version 5 of the coding instructions, coders cannot code the main categories anymore. If a category does contain new subcategories, coders have to choose one of the new subcategories. Eg a coder cannot assign the code 608 anymore, but has to choose between the subcategories 608_1, 608_2, and 608_3. To ensure over-time comparability of the main categories with data coded with earlier versions of the coding instructions, the new categories can be aggregated into the respective main categories. When compared with old data, sentences coded with 608_1 for example can be recoded to 608 to make the data comparable over time. The only exception to this aggregation are categories 202_2, 605_2 and 703_2, which have to be added to the uncoded sentences (peruncod), as such issues were not covered in the handbook4 category scheme. The following indicates all the aggregation rules:
- peruncod: per202_2, per605_2 and per703_2
- per103: per103_1 and per103_2
- per201: per201_1 and per201_2
- per202: per202_1, per202_3 and per202_4
- per305: per305_1, per305_2, per305_3, per305_4, per305_5 and per305_6
- per416: per416_1 and per416_2
- per601: per601_1 and per601_2
- per602: per602_1 and per602_2
- per605: per605_1
- per606: per606_1 and per606_2
- per607: per607_1, per607_2 and per607_3
- per608: per608_1, per608_2 and per608_3
- per703: per703_1
As most scholars require longer time series, we decided to conduct this aggregation already for the main dataset. So, scholars interested in long-time series can simply continue to use the existing main categories (per101-706, including e.g. the rile) as they did in the past and ignore the new categories. Scholars interested in the new categories should be aware that the respective main categories of the new sub- categories are aggregates of the new categories (i.e. if one uses the main categories together with the new subcategories the per-variables will likely add up to more than 100%). To see whether the these categories were used or not, you can filter on the manual (==5) variable in the Main Dataset.
The Manifesto Project: South America dataset
The Manifesto Project Dataset: South America has a slightly different structure than the Main Dataset - in particular in regard to the dealing of subcategories. While the Main Dataset already provides aggregations of the new subcategories to make past and current data comparable, the South America Dataset does not so. The reason is that all data coded for the South American Dataset was coded with version 5 of the coding instructions. Then, the codes with underscores eg. 601_1 should be treated as every other category. Note that the South America dataset does not contain the three digit categories of codes where there are new subcategories. Eg there is no variable per601 in the South America Dataset. If you would like to merge the main dataset with the South America Dataset it is necessary to aggregate the new subcategories into the three-digit main categories in the South America dataset (eg by using manifestoR’s function aggregate_pers
).
Subcategories in the Manifesto Corpus
The Manifesto Corpus contains the codes as they were assigned by the coders at the time of coding with the coding instructions used at that time. So, in the original coding this data is not always directly comparable as some documents eg contain 608 codes and others contain the codings of the different subcategories of 608. The same is true for coding from CEE countries in the past. When looking at individual elections where there are no differences in the version of the coding instructions that were used, eg. when looking at data from one election in one country, there is usually no need to recode the codes. In other cases, if scholars want to make their data maximally comparable over time and across country it is best to recode both types of subcategory codes. With manifestoR this can easily be done with two functions. recode_cee_codes
recodes the four digit cee codes to their respective main categories. recode_v5_to_v4
does the same for the subcategories introduced with version 5 of the coding instructions and implements the recoding exceptions mentioned above.
To illustrate this, the following table shows the codes before and after applying the recode_v5_to_v4
function to a text snippet. The recode functions can be applied to a ManifestoCorpus object, to a ManifestoDocument object or to a vector of codes (see the manifestoR introduction for more information on these objects.)
text | cmp_code | recode_v5_to_v4 |
---|---|---|
Liebe Bürgerinnen und Bürger,am 24. September ist Bundestagswahl. | 0 | 0 |
Bevor wir Ihnen sagen, was wir vorhaben, haben wir eine Bitte an Sie: Diskutieren Sie mit, mischen Sie sich ein, gehen Sie wählen. | 202.1 | 202 |
Treten Sie mit uns für die Werte ein, die unser Land und Europa stark und lebenswert gemacht haben, die uns weit über Partei- und Ländergrenzen hinweg verbinden: die Würde des Menschen, | 201.1 | 201 |
Gerechtigkeit und Gleichberechtigung, | 503 | 503 |
Freiheit und Demokratie. | 201.1 | 201 |
The eu_code column in the Manifesto Corpus
Usually all quasi-sentences are assigned to one and only on code. In the Manifesto Corpus and the coded documents these codes can be found in the cmp_code column. However, in some former versions of the coding instructions, there was one exception to this rule: If a quasi-sentence was somehow related to the European Union/European community and to another category covered by the coding scheme it could be coded as positive or negative mentions of the european community/union (codes 108 and 110) and be assigned to another code. This practice was abandoned roughly when MARPOR took over the coding in 2009. The Manifesto Corpus provides meta information for each annotated document whether whether it contains this kind of eu_codes or not. The following snippet illustrates how these codes are dealt with in the Manifesto Corpus. NB: For the counting of the code frequencies for the main dataset, these codes were treated equally to the codes 108 and 110 that were coded in the cmp_code column.
text | cmp_code | eu_code |
---|---|---|
Die Vereinbarkeit von Familie und Beruf setzt neben einer Begrenzung der Arbeitszeit auch eine Planbarkeit von Arbeitszeit und Freizeit voraus. | 603 | NA |
Auch für die Europäische Gesetzgebung muss das Ziel der Arbeitszeitverkürzung und nicht der -verlängerung Priorität bekommen. | 701 | 108 |
Aufgabe der Tarifpolitik kann es auch sein, eine stärkere Beteiligung von Arbeitnehmern und Arbeitnehmerinnen am Produktivvermögen durchzusetzen. | 701 | NA |
Unfortunately, for documents that are not digitally available, we do not always have the information whether the “double”-coding with eu categories was allowed and practised by the coder or not. The following graph illustrates the share of coded documents per election for which eu-double coding was allowed. However, this graph is restricted to documents that are digitally annotated only.