First steps with manifestoR

Jirka Lewandowski & Nicolas Merz

17 May 2018 (slightly updated 14 June 2021)

This tutorial is largely based on the manifestoR vignette.

Installing and loading manifestoR

In order to use the manifestoR package, you need a working installation of R on your computer. R is an open-source statistics and programming environment that can be downloaded for free. We recommend using RStudio as an integrated development environment, but the tutorial should also work with just R installed. To install the manifestoR package, type the following command into the console window:

install.packages("manifestoR")

To use manifestoR, you need to load the package with the library function:

library(manifestoR)

You need to load the package again whenever you restart your R session, but you do not need to install it again. Although you can type all commands in the R console and use R interactively, we recommend writing all code in an .R file to make your work more easily reproducible.

Sidenote: As we make use of some dplyr functions during this tutorial we will also have to load the dplyr package:

library(dplyr)

Connecting to the Manifesto Project Database API

To access the Manifesto Project Data or data in the Manifesto Corpus with manifestoR, an account for the Manifesto Project webpage with an API key is required. If you do not yet have an account, you can create one at https://manifesto-project.wzb.eu/signup. If you have an account, you can create and download the API key on your profile page.

For every R session using manifestoR and connecting to the Manifesto API, you need to set the API key in your work environment. This can be done by passing either a key or the name of a file containing the key to manifestoR’s mp_setapikey() function (see documentation ?mp_setapikey for details). Thus, your R script using manifestoR usually will start like this:

library(manifestoR)
mp_setapikey("manifesto_apikey.txt")

This R code presumes that you have downloaded the API key and stored it in a file named manifesto_apikey.txt in your current R working directory.

Note that it is a security risk to store the API key file or a script containing the key in a public repository, for example on GitHub.
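One way to keep the key out of a script is to read it from an environment variable. The following is only a minimal sketch and not part of manifestoR: the helper name read_api_key and the variable name MANIFESTO_API_KEY are assumptions made for this illustration, and the exact argument names of mp_setapikey() should be checked in ?mp_setapikey.

```r
# Hypothetical helper (not part of manifestoR): look up the API key in an
# environment variable so that the key never appears in the script itself.
# The variable name MANIFESTO_API_KEY is an assumption for this sketch.
read_api_key <- function(env_var = "MANIFESTO_API_KEY") {
  key <- Sys.getenv(env_var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", env_var, " is not set.")
  }
  key
}

# Possible usage (requires manifestoR and a valid key; see ?mp_setapikey):
# mp_setapikey(key = read_api_key())
```

The environment variable can be set outside the project directory, for example in a user-level .Renviron file, so that it is not committed together with the code.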

All the following commands will only work if you have set your API key as described above.

Downloading the Manifesto Project Dataset

Let’s first learn how to download the Manifesto Project Dataset with manifestoR. You can download the Manifesto Project Dataset (MPDS) with the function mp_maindataset().

mpds <- mp_maindataset()
## Connecting to Manifesto Project DB API... 
## Connecting to Manifesto Project DB API... corpus version: 2020-2
names(mpds)
##   [1] "country"        "countryname"    "oecdmember"     "eumember"      
##   [5] "edate"          "date"           "party"          "partyname"     
##   [9] "partyabbrev"    "parfam"         "coderid"        "manual"        
##  [13] "coderyear"      "testresult"     "testeditsim"    "pervote"       
##  [17] "voteest"        "presvote"       "absseat"        "totseats"      
##  [21] "progtype"       "datasetorigin"  "corpusversion"  "total"         
##  [25] "peruncod"       "per101"         "per102"         "per103"        
##  [29] "per104"         "per105"         "per106"         "per107"        
##  [33] "per108"         "per109"         "per110"         "per201"        
##  [37] "per202"         "per203"         "per204"         "per301"        
##  [41] "per302"         "per303"         "per304"         "per305"        
##  [45] "per401"         "per402"         "per403"         "per404"        
##  [49] "per405"         "per406"         "per407"         "per408"        
##  [53] "per409"         "per410"         "per411"         "per412"        
##  [57] "per413"         "per414"         "per415"         "per416"        
##  [61] "per501"         "per502"         "per503"         "per504"        
##  [65] "per505"         "per506"         "per507"         "per601"        
##  [69] "per602"         "per603"         "per604"         "per605"        
##  [73] "per606"         "per607"         "per608"         "per701"        
##  [77] "per702"         "per703"         "per704"         "per705"        
##  [81] "per706"         "per1011"        "per1012"        "per1013"       
##  [85] "per1014"        "per1015"        "per1016"        "per1021"       
##  [89] "per1022"        "per1023"        "per1024"        "per1025"       
##  [93] "per1026"        "per1031"        "per1032"        "per1033"       
##  [97] "per2021"        "per2022"        "per2023"        "per2031"       
## [101] "per2032"        "per2033"        "per2041"        "per3011"       
## [105] "per3051"        "per3052"        "per3053"        "per3054"       
## [109] "per3055"        "per4011"        "per4012"        "per4013"       
## [113] "per4014"        "per4121"        "per4122"        "per4123"       
## [117] "per4124"        "per4131"        "per4132"        "per5021"       
## [121] "per5031"        "per5041"        "per5061"        "per6011"       
## [125] "per6012"        "per6013"        "per6014"        "per6061"       
## [129] "per6071"        "per6072"        "per6081"        "per7051"       
## [133] "per7052"        "per7061"        "per7062"        "per103_1"      
## [137] "per103_2"       "per201_1"       "per201_2"       "per202_1"      
## [141] "per202_2"       "per202_3"       "per202_4"       "per305_1"      
## [145] "per305_2"       "per305_3"       "per305_4"       "per305_5"      
## [149] "per305_6"       "per416_1"       "per416_2"       "per601_1"      
## [153] "per601_2"       "per602_1"       "per602_2"       "per605_1"      
## [157] "per605_2"       "per606_1"       "per606_2"       "per607_1"      
## [161] "per607_2"       "per607_3"       "per608_1"       "per608_2"      
## [165] "per608_3"       "per703_1"       "per703_2"       "rile"          
## [169] "planeco"        "markeco"        "welfare"        "intpeace"      
## [173] "datasetversion" "id_perm"

The dataset is returned as a data.frame (to be precise, as a tibble, the tidyverse version of a data.frame), so the names function returns the variable names of the data frame. By default the most recent version of the dataset is returned, but you can also access older versions. To get a list of all versions of the main dataset, type:

mp_coreversions()
##    datasets.id                             datasets.name
## 1    MPDS2012a Manifesto Project Dataset (version 2012a)
## 2    MPDS2012b Manifesto Project Dataset (version 2012b)
## 3    MPDS2013a Manifesto Project Dataset (version 2013a)
## 4    MPDS2013b Manifesto Project Dataset (version 2013b)
## 5    MPDS2014a Manifesto Project Dataset (version 2014a)
## 6    MPDS2014b Manifesto Project Dataset (version 2014b)
## 7    MPDS2015a Manifesto Project Dataset (version 2015a)
## 8    MPDS2016a Manifesto Project Dataset (version 2016a)
## 9    MPDS2016b Manifesto Project Dataset (version 2016b)
## 10   MPDS2017a Manifesto Project Dataset (version 2017a)
## 11   MPDS2017b Manifesto Project Dataset (version 2017b)
## 12   MPDS2018a Manifesto Project Dataset (version 2018a)
## 13   MPDS2018b Manifesto Project Dataset (version 2018b)
## 14   MPDS2019a Manifesto Project Dataset (version 2019a)
## 15   MPDS2019b Manifesto Project Dataset (version 2019b)
## 16   MPDS2020a Manifesto Project Dataset (version 2020a)
## 17   MPDS2020b Manifesto Project Dataset (version 2020b)

To query a specific version of the main dataset, use the datasets.id listed in the output of mp_coreversions(). For example, to get the Manifesto Project Dataset version 2015a, type:

mp_maindataset(version = "MPDS2015a")

Even if you want to use the current version of the dataset, it is good practice to specifically query this version to ensure that you always get the same dataset even if you later run your script again.

You can get the Manifesto Project South America Dataset using the function mp_southamerica_dataset() that works analogously to mp_maindataset().

Downloading documents from the Manifesto Corpus

Check the availability of documents

Before downloading documents, the function mp_availability lets you check which documents are available in the Manifesto Corpus. The following command summarizes the availability of documents. If the result is not assigned to an object, it is printed as a summary report of the Manifesto Corpus. The argument TRUE here indicates that the function checks whether a document is available for each case of the Manifesto Project Dataset.

mp_availability(TRUE)
## Connecting to Manifesto Project DB API... 
## Connecting to Manifesto Project DB API... corpus version: 2020-2 
## Connecting to Manifesto Project DB API... corpus version: 2020-2
## Queried for
##   4760
## Corpus Version
##   2020-2
## Documents found
##   2742 (57.605%)
## Coded Documents found
##   1593 (33.466%)
## Originals found
##   3023 (63.508%)
## Languages
##   39 (swedish norwegian danish finnish icelandic french dutch german english italian spanish catalan galician greek portuguese japanese hebrew turkish armenian bosnian bosnian-cyrillic serbian-cyrillic bulgarian croatian czech estonian georgian hungarian latvian lithuanian macedonian romanian montenegrin polish russian serbian-latin slovak slovenian ukrainian)

Instead of TRUE, you can also pass a logical expression using variables from the Manifesto Project Dataset, which serves as the reference list of cases. The following command checks the availability of documents for all Belgian manifestos covered by the Manifesto Project Dataset.

available_docs <- mp_availability(countryname == "Belgium")
available_docs
##           Queried for        Corpus Version       Documents found 
##                   184                2020-2         156 (84.783%) 
## Coded Documents found       Originals found             Languages 
##          38 (20.652%)         124 (67.391%)      2 (french dutch)
names(available_docs)
## [1] "party"       "date"        "manifestos"  "originals"   "annotations"
## [6] "language"

available_docs is a data.frame that can easily be filtered, e.g. for a specific language. To check the availability of Flemish (here labelled as “dutch”) documents for Belgian elections since 2010 covered in the dataset, you could for example do the following:

belgium_2010 <- mp_availability(countryname == "Belgium" & date > 201000)
filter(belgium_2010, language == "dutch")
##           Queried for        Corpus Version       Documents found 
##                    25                2020-2              21 (84%) 
## Coded Documents found       Originals found             Languages 
##              21 (84%)              21 (84%)             1 (dutch)

To get all English-language documents that come with annotations, you would do the following:

english_annotated <- mp_availability(TRUE) %>% filter(annotations == TRUE & language == "english")

Downloading documents

(Bulk-)Downloading documents from the Manifesto Corpus works via the function mp_corpus(...). This function understands different inputs.

  1. You can download election programmes on an individual basis by listing combinations of party ids and election dates in a data.frame and passing it to mp_corpus(...):
wanted <- data.frame(
  party = c(41220, 41320),
  date = c(200909, 200909)
)
mp_corpus(wanted)
## Connecting to Manifesto Project DB API... corpus version: 2020-2 
## Connecting to Manifesto Project DB API... corpus version: 2020-2
## <<ManifestoCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1

The party ids (41220 and 41320 in the example) are the ids as in the Manifesto Project’s main dataset. They can be found in the current dataset documentation at https://manifesto-project.wzb.eu/datasets or in the main dataset.

Note that we received only 1 document, while querying for two. This is because the party with the id 41220 (KPD) did not run for elections in September 2009.

  2. Instead of typing all these combinations by hand, you can also do this more easily: mp_availability returns a data frame in the same format as the wanted data frame above that is used to query the corpus. So, to get a corpus with all English-language annotated documents, you could just pass the object english_annotated saved above to the mp_corpus function:
mp_corpus(english_annotated)
## Connecting to Manifesto Project DB API... corpus version: 2020-2
## <<ManifestoCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 181
  3. mp_corpus can be called with a logical expression specifying the subset of the Manifesto Corpus that you want to download:
my_corpus <- mp_corpus(countryname == "Austria" & date > 200100 & date < 201312)
## Connecting to Manifesto Project DB API... corpus version: 2020-2
my_corpus
## <<ManifestoCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 21

This queries for all documents in the Manifesto Corpus for Austrian parties from elections in or after 2001 and before December 2013. (The format of the date variable in the Manifesto Project Dataset is YYYYMM.)
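Because dates are plain numbers in the form YYYYMM, date conditions are ordinary numeric comparisons. The following base R sketch uses election dates that appear in this tutorial; the variable names are chosen for illustration only.

```r
# Election dates in the form YYYYMM, taken from manifesto ids in this tutorial.
dates <- c(200211, 200610, 200809, 201309)

# Integer division and modulo split a date into year and month.
years  <- dates %/% 100
months <- dates %% 100

# The bounds used in the Austrian query, applied to the example dates.
in_range <- dates > 200100 & dates < 201312
```

A lower bound such as 200100 therefore includes every month of 2001, while an upper bound of 201312 excludes December 2013 itself.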

The variable names in the logical expression used for querying the corpus database (countryname and date in the example above) can be any column names from the Manifesto Project’s Main Dataset or your current R environment. The Main Dataset itself is available in manifestoR via the function mp_maindataset(). The following command queries all documents with a “rile” score higher than 60:

mp_corpus(rile > 60) ## another example of data set based corpus query
## Connecting to Manifesto Project DB API... corpus version: 2020-2
## <<ManifestoCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 9

A convenient way to download the whole Manifesto Corpus is to type mp_corpus(TRUE). However, please keep in mind that this might take a while to process and download. In most cases, it is sufficient to download a subset of the Manifesto Corpus.

The ManifestoCorpus object

mp_corpus returns a ManifestoCorpus object, a subclass of Corpus as defined in the natural language processing package tm (Feinerer & Hornik 2015). Following tm’s logic, a ManifestoCorpus consists of ManifestoDocuments. Documents in a corpus can be indexed via their manifesto_id (consisting of the CMP party code, an underscore, and either the election year, if unambiguous, or the election year and month) or via their position in the corpus. For both corpus and documents, tm provides accessor functions for content and metadata:

head(content(my_corpus[["42110_200211"]]))
## [1] "“Wir können heute die Existenzgrundlagen"  
## [2] "künftiger Generationen zerstören."         
## [3] "Oder sie sichern.”"                        
## [4] "Dr. Eva Glawischnig"                       
## [5] "Österreich braucht jetzt Weitblick."       
## [6] "Nachhaltigkeit für zukünftige Generationen"
head(content(my_corpus[[1]]))
## [1] "“Wir können heute die Existenzgrundlagen"  
## [2] "künftiger Generationen zerstören."         
## [3] "Oder sie sichern.”"                        
## [4] "Dr. Eva Glawischnig"                       
## [5] "Österreich braucht jetzt Weitblick."       
## [6] "Nachhaltigkeit für zukünftige Generationen"
meta(my_corpus[["42110_200211"]])
##   manifesto_id               : 42110_200211
##   party                      : 42110
##   date                       : 200211
##   language                   : german
##   source                     : MARPOR
##   has_eu_code                : FALSE
##   is_primary_doc             : TRUE
##   may_contradict_core_dataset: FALSE
##   md5sum_text                : 4e1877c110c3d01db9eaf864310cc0b0
##   url_original               : NA
##   md5sum_original            : NA
##   annotations                : TRUE
##   handbook                   : 3
##   is_copy_of                 : NA
##   title                      : Österreich braucht jetzt die Grünen. Das Wahlprogramm
##   id                         : 42110_200211
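The manifesto_id used for indexing above can be assembled from a party code and an election date. This is a small base R sketch for the year-and-month form only; the variable names are illustrative and not part of manifestoR.

```r
# Party code and election date (YYYYMM) of the document indexed above.
party <- 42110
date  <- 200211

# Join both parts with an underscore to obtain the manifesto_id.
manifesto_id <- paste(party, date, sep = "_")

# Split an id back into its two components.
parts <- strsplit(manifesto_id, "_", fixed = TRUE)[[1]]
```

An id built this way could then be used as an index, as in my_corpus[[manifesto_id]].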

Processing and analysing the corpus documents

As in tm, the textual content of a document is returned by the function content. In documents with annotations, the text is stored as a character vector where each element is one quasi-sentence. For example, the extract below shows the first four quasi-sentences from the Austrian Green Party’s manifesto for the 2006 election.

txt <- content(my_corpus[["42110_200610"]])
class(txt)
## [1] "character"
head(txt, n = 4)
## [1] "1 Lebensqualität"                                                 
## [2] "1.1 Grüne Energiewende"                                           
## [3] "Lebensqualität bedeutet in einer unversehrten Umwelt zu leben."   
## [4] "Die Verantwortung dafür liegt bei uns: Wir alle gestalten Umwelt."

In documents without annotations, the text is stored as a single very long character string. The following shows an extract of the German SPD manifesto from the 1994 election.

txt <- content(mp_corpus(party == 41320 & date == 199410)[["41320_199410"]])
## Connecting to Manifesto Project DB API... corpus version: 2020-2
substr(txt, 19900, 20512)
## [1] " an diesen Prozessen stärker beteiligt werden, daß sie mehr mitbestimmen können. Deshalb werden wir das Betriebsverfassungs- und Personalvertretungsrecht dort weiterentwickeln, wo neue Arbeitsorganisationen und neue Produktions- und Informationstechnologien dies erfordern. Wir wollen die Unternehmensmitbestimmung der Arbeitnehmer, ihrer Interessenvertreter und Gewerkschaften sichern und weiterentwickeln. Mitbestimmungsrechte dürfen durch Unternehmensspaltungen und Konzernstrukturen oder durch Verlagerungen von Unternehmenszentralen ins Ausland nicht eingeschränkt werden. Um die Aushöhlung der Mitbestimmung"

Working with the CMP codings

The central way of accessing the CMP codings is the accessor method codes(...). It can be called on ManifestoDocument and ManifestoCorpus objects and returns a vector of the CMP codings attached to the quasi-sentences of the document or corpus, in order of appearance:

doc <- my_corpus[["42110_200610"]]
head(codes(doc), n = 15)
##  [1] NA    NA    "501" "606" "501" "501" "501" "416" "416" "412" "503" "411"
## [13] "501" "416" NA
head(codes(my_corpus), n = 15)
##  [1] "305" "305" "305" NA    NA    NA    "601" "416" "416" "107" "107" "107"
## [13] "416" "416" "416"

Thus you can, for example, use R’s functionality to count the codes or select quasi-sentences (units of text) based on their code:

table(codes(doc))
## 
## 104 105 106 107 108 109 201 202 203 303 305 401 402 403 408 409 411 412 413 416 
##   3   9   2  52  36  11  36  17   1   3   1   2   6  20   1   1  38  17   1  13 
## 501 502 503 504 506 601 604 605 606 607 608 701 703 704 706 
##  62  48  83  24  46  14  20   9  10  15   5  33  13   9  32
doc_subcodes <- subset(doc, codes(doc) %in% c(202, 503, 607))
length(doc_subcodes)
## [1] 115
length(doc_subcodes) / length(doc)
## [1] 0.1489637

The CMP coding scheme can be found in the online documentation of the Manifesto Project dataset at https://manifesto-project.wzb.eu/coding_schemes/1. Obviously, codes() only works on documents that are digitally annotated (annotations==TRUE).
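Counts like those produced by table(codes(doc)) can be turned into shares with base R. The sketch below works on a made-up vector that merely stands in for the result of codes(doc); the values are not taken from any real manifesto.

```r
# Toy vector standing in for codes(doc); the values are made up.
toy_codes <- c(NA, "501", "606", "501", "416", "416", "501", NA, "503", "411")

# table() drops NA by default, so uncoded quasi-sentences are not counted.
counts <- table(toy_codes)

# Share of each code among the coded quasi-sentences.
shares <- prop.table(counts)

# Share of a single code, again ignoring uncoded quasi-sentences.
share_501 <- mean(toy_codes == "501", na.rm = TRUE)
```

Whether uncoded quasi-sentences belong in the denominator depends on the research question; table(toy_codes, useNA = "ifany") would keep them as a separate category.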

Using the document metadata

Each document in the Manifesto Corpus has metadata attached. It can be accessed via the function meta:

meta(doc)
##   manifesto_id               : 42110_200610
##   party                      : 42110
##   date                       : 200610
##   language                   : german
##   source                     : MARPOR
##   has_eu_code                : FALSE
##   is_primary_doc             : TRUE
##   may_contradict_core_dataset: FALSE
##   md5sum_text                : b90378f0c6fca51b464bbe8cd2c96990
##   url_original               : /down/originals/42110_2006.pdf
##   md5sum_original            : 8fd5726c6363864c3ace6e2d497d647e
##   annotations                : TRUE
##   handbook                   : 3
##   is_copy_of                 : NA
##   title                      : Zeit für Grün. Das Grüne Programm
##   id                         : 42110_200610

It is possible to access and also modify specific metadata entries:

meta(doc, "party")
## [1] 42110
meta(doc, "manual_edits") <- TRUE
meta(doc)
##   manifesto_id               : 42110_200610
##   party                      : 42110
##   date                       : 200610
##   language                   : german
##   source                     : MARPOR
##   has_eu_code                : FALSE
##   is_primary_doc             : TRUE
##   may_contradict_core_dataset: FALSE
##   md5sum_text                : b90378f0c6fca51b464bbe8cd2c96990
##   url_original               : /down/originals/42110_2006.pdf
##   md5sum_original            : 8fd5726c6363864c3ace6e2d497d647e
##   annotations                : TRUE
##   handbook                   : 3
##   is_copy_of                 : NA
##   title                      : Zeit für Grün. Das Grüne Programm
##   id                         : 42110_200610
##   manual_edits               : TRUE

Document metadata can also be bulk-downloaded with the function mp_metadata, taking the same set of parameters as mp_corpus:

metas <- mp_metadata(countryname == "Spain")
head(metas)
## # A tibble: 6 x 15
##   party   date language source has_eu_code is_primary_doc may_contradict_core_d…
##   <dbl>  <dbl> <chr>    <chr>  <lgl>       <lgl>          <lgl>                 
## 1 33220 197706 NA       NA     FALSE       NA             NA                    
## 2 33320 197706 spanish  CEMP   FALSE       TRUE           FALSE                 
## 3 33430 197706 spanish  CEMP   FALSE       TRUE           FALSE                 
## 4 33610 197706 NA       NA     FALSE       NA             NA                    
## 5 33901 197706 NA       NA     FALSE       NA             NA                    
## 6 33902 197706 NA       NA     FALSE       NA             NA                    
## # … with 8 more variables: manifesto_id <chr>, md5sum_text <chr>,
## #   url_original <chr>, md5sum_original <chr>, annotations <lgl>,
## #   handbook <chr>, is_copy_of <chr>, title <chr>

The field …

  • party contains the party id from the Manifesto Project Dataset.
  • date contains the month of the election in the same format as in the Manifesto Project Dataset (YYYYMM).
  • language specifies the language of the document as a word.
  • is_primary_doc is FALSE only in cases where for a single party and election date multiple manifestos are available and this is the document not used for coding by the Manifesto Project.
  • may_contradict_core_dataset is TRUE for documents where the CMP codings in the corpus documents might be inconsistent with the coding aggregates in the Manifesto Project’s Main Dataset. This applies to manifestos which have been either recoded after they entered the dataset or cases where the dataset entries are derived from hand-written coding sheets used prior to the digitalization of the Manifesto Project’s data workflow, but the documents were digitalized and added to the Manifesto Corpus afterwards.
  • annotations is TRUE whenever there are CMP codings available for the document.
  • has_eu_code marks documents in which the additional coding layer eu_code is present. These codes have been assigned to quasi-sentences by CMP coders in addition to the main CMP code to indicate policy statements that should or should not be implemented at the level of the European Union.
  • handbook indicates the version of the coding instructions that was used to annotate the document. See the Manifesto Project website for more information on the (different versions of the) coding instructions.
  • is_copy_of is set in the few cases where we copy a manifesto to use it for more than one party-date combination (e.g. in the case of some alliances). In such cases, this field indicates the manifesto_id of the original document. When doing computerized text analysis, you might often want to exclude these cases; otherwise you will deal with duplicate documents.
  • title is the title of the document.

The other metadata entries have primarily technical functions for communication between the manifestoR package and the online database.
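The fields above can be combined to select documents before downloading them. The following base R sketch uses a made-up metadata table with invented ids; with real data, the result of mp_metadata would take the place of toy_metas.

```r
# Toy metadata table with a few of the fields described above.
# All ids and values are invented for illustration.
toy_metas <- data.frame(
  manifesto_id = c("11111_200001", "22222_200001", "33333_200001"),
  annotations  = c(TRUE, TRUE, NA),
  is_copy_of   = c(NA, "11111_200001", NA),
  stringsAsFactors = FALSE
)

# Keep annotated documents that are not copies of another document.
# `%in% TRUE` treats a missing annotations flag as "not annotated".
keep   <- toy_metas$annotations %in% TRUE & is.na(toy_metas$is_copy_of)
usable <- toy_metas[keep, ]
```

Excluding rows with a non-missing is_copy_of is one way to avoid the duplicate documents mentioned in the list above.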

Working with additional layers of codings

Besides the main layer of CMP codings, you can create, store and access additional layers of codings in ManifestoDocuments by passing a name of the coding layer as additional argument to the function codes():

## assigning a dummy code of alternating As and Bs
codes(doc, "my_code") <- rep_len(c("A", "B"), length.out = length(doc))
head(codes(doc, "my_code"))
## [1] "A" "B" "A" "B" "A" "B"

You can view the names of the coding layers stored in a ManifestoDocument with the function code_layers():

code_layers(doc)
## [1] "cmp_code" "eu_code"  "my_code"

Note that certain documents downloaded from the Manifesto Corpus Database already have a second layer of codes named eu_code. These are codes that have been assigned to quasi-sentences by CMP coders in addition to the main CMP code to indicate policy statements that should or should not be implemented at the level of the European Union. The documents that were coded in this way are marked in the corpus’ metadata with the flag has_eu_code (see Using the document metadata above). Note that, since these codes have also been used for computing the per and rile variables in the Manifesto Project Main Dataset, they are also used in manifestoR’s count_codes and rile functions (see Scaling texts below) if the respective metadata flag is present.

Text mining tools

Since the Manifesto Corpus uses the infrastructure of the tm package (Feinerer & Hornik 2015), all of tm’s filtering and transformation functionality can be applied directly to the downloaded ManifestoCorpus.

For example, standard natural language processors are available to clean the corpus:

head(content(my_corpus[["42110_200809"]]))
## [1] "1. SONNE STATT ÖL: WIR  HELFEN  BEIM  SPAREN"  
## [2] "Der Umstieg hat begonnen."                     
## [3] "Die Menschen in Österreich fahren weniger Auto"
## [4] "und mehr mit dem öffentlichen Verkehr"         
## [5] "und dem Rad."                                  
## [6] "Sie sanieren Häuser und Wohnungen"
corpus_cleaned <- tm_map(my_corpus, removePunctuation)
corpus_nostop <- tm_map(corpus_cleaned, removeWords, stopwords("german"))
head(content(corpus_nostop[["42110_200809"]]))
## [1] "1 SONNE STATT ÖL WIR  HELFEN  BEIM  SPAREN"  
## [2] "Der Umstieg  begonnen"                       
## [3] "Die Menschen  Österreich fahren weniger Auto"
## [4] " mehr   öffentlichen Verkehr"                
## [5] "  Rad"                                       
## [6] "Sie sanieren Häuser  Wohnungen"

Analysis in the form of term-document matrices is available as well:

tdm <- TermDocumentMatrix(corpus_nostop)
inspect(tdm[c("menschen", "wahl", "familie"), ])
## <<TermDocumentMatrix (terms: 3, documents: 21)>>
## Non-/sparse entries: 52/11
## Sparsity           : 17%
## Maximal term length: 8
## Weighting          : term frequency (tf)
## Sample             :
##           Docs
## Terms      42110_200211 42110_201309 42320_200211 42320_200809 42320_201309
##   familie             2            8            2            2            7
##   menschen           65          144           78           50           53
##   wahl                2            9            2            0            0
##           Docs
## Terms      42420_200211 42520_200211 42520_200610 42520_201309 42951_201309
##   familie            17           19           20           20           11
##   menschen           38           47           49           72          102
##   wahl                1            2            0            2           12
findAssocs(tdm, "stadt", 0.97) ## find correlated terms, see ?tm::findAssocs
## $stadt
## erfordert 
##      0.97

For more information about the functionality provided by the tm package, please refer to its documentation.

Selecting relevant parts of text

For applications in which not the entire text of a document is of interest, but rather a subset of the quasi-sentences matching certain criteria, manifestoR provides a function subset(...) working just like R’s internal subset function.

It can, for example, be used to filter quasi-sentences based on codes or the text:

# subsetting based on codes (as example above)
doc_subcodes <- subset(doc, codes(doc) %in% c(202, 503, 607))
length(doc_subcodes)
## [1] 115
# subsetting based on text
doc_subtext <- subset(doc, grepl("Demokratie", content(doc)))
head(content(doc_subtext), n = 3)
## [1] "Eine Demokratie benötigt auch die Unterstützung von Forschung jenseits wirtschaftlicher Interessen."      
## [2] "In einer Demokratie sollen all jene wählen dürfen, die von den politischen Entscheidungen betroffen sind."
## [3] "Demokratie braucht die Teilhabe der BürgerInnen."
head(codes(doc_subtext), n = 10)
## [1] "506" "202" "202" "201" "108" NA    "202" "107"
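For comparison, the same two filters can be written with R’s internal subset function on an ordinary data frame. This is only a sketch: the data frame toy and its contents are hypothetical stand-ins for a document’s quasi-sentences and codes, not data from the Manifesto Corpus.

```r
## Hypothetical quasi-sentences with codes stored as character strings
toy <- data.frame(
  text = c("Demokratie braucht Teilhabe.",
           "Wir investieren in Bildung.",
           "Mehr Demokratie wagen.",
           "Die Umwelt bewahren."),
  cmp_code = c("202", "506", "202", "501"),
  stringsAsFactors = FALSE
)

## subsetting based on codes; %in% compares the numbers as strings
toy_subcodes <- subset(toy, cmp_code %in% c(202, 503, 607))
print(nrow(toy_subcodes))

## subsetting based on text
toy_subtext <- subset(toy, grepl("Demokratie", text))
print(toy_subtext$cmp_code)
```

The condition is evaluated once per row, so any logical vector with one entry per quasi-sentence can serve as a filter.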

Via tm_map the filtering operations can also be applied to an entire corpus:

corp_sub <- tm_map(my_corpus, function(doc) {
  subset(doc, codes(doc) %in% c(202, 503, 607))
})
head(content(corp_sub[[3]]))
## [1] "Das hat einen einzigen Grund: die hohen Öl- und Gaspreise."           
## [2] "Immer mehr Menschen können sich Heizung"                              
## [3] "und Mobilität immer weniger leisten."                                 
## [4] "Ob wir das wollen oder nicht – Erdöl und Erdgas werden weiter teurer."
## [5] "Wir verbrennen Milliarden in unseren Tanks und Öfen,"                 
## [6] "und: SPAREN STATT VERSCHWENDEN."
head(codes(corp_sub))
## [1] "503" "202" "202" "503" "503" "503"

For convenience, it is also possible to filter quasi-sentences with specific codes directly when downloading a corpus. For this, the additional argument codefilter with a vector of CMP codes of interest is passed to mp_corpus:

corp_sub <- mp_corpus(countryname == "Australia", codefilter = c(103, 104))
head(content(corp_sub[[1]]))
## [1] "In the important area of defense alone, our defense white paper has made the greatest ever additional provision for the future defense needs of Australia of any government in more than a quarter of a century."                                                                                                                                                          
## [2] "Over the next ten years we will invest an additional $32 billion in the defense of Australia."                                                                                                                                                                                                                                                                             
## [3] "And how proud I am to say to you that when we came into government in March of 1996 and we found not withstanding what Mr."                                                                                                                                                                                                                                                
## [4] "Beazley had told us during the election campaign that our budget was $10."                                                                                                                                                                                                                                                                                                 
## [5] "5 billion in deficit, that we’d accumulated as a nation $96 billion of federal government debt, the one restriction I put on Peter Costello and John Fahey in getting the budget in shape was you will not cut any money out of defense."                                                                                                                                  
## [6] "And not only didn’t we cut any money out of defense we in fact increased defense expenditure, and just as well because in that five and a half year period we’ve had the demands of East Timor, of Bougainville, and now the commitment to the war against terrorism which is as much our war and our fight and our struggle as it is for the people of the United States."
head(codes(corp_sub))
## [1] "104" "104" "104" "104" "104" "104"

Viewing original documents

Apart from the machine-readable, annotated documents, the Manifesto Corpus also contains the original election programmes, with their original layout, in PDF format. If available, they can be viewed via the function mp_view_originals(...), which takes arguments in exactly the same format as mp_corpus(...) (see above), e.g.:

mp_view_originals(party == 41320 & date == 200909)

The original documents are shown in your system’s web browser. All URLs opened by this function refer only to the Manifesto Project’s website. If you want to open more than 5 PDF documents at once, you have to manually specify the maximum number of URLs allowed to be opened via the parameter maxn. Since opening URLs in an external browser consumes computing resources on your local machine, use only values for maxn that do not slow down your computer or make it unresponsive.

mp_view_originals(party > 41000 & party < 41999, maxn = 20)

Efficiency and reproducibility: caching and versioning

To save time and network traffic, manifestoR caches all downloaded data and documents in your computer’s working memory and connects to the online database only when data is required that has not been downloaded before.

corpus <- mp_corpus(wanted)
## Connecting to Manifesto Project DB API... corpus version: 2020-2 
## Connecting to Manifesto Project DB API... corpus version: 2020-2
subcorpus <- mp_corpus(wanted[3:7, ])

Note that in the second query no message informing about the connection to the Manifesto Project’s Database is printed, since no data is actually downloaded.

This mechanism also ensures reproducibility of your scripts, analyses and results: executing your code again will yield the same results, even if the Manifesto Project’s Database is updated in the meantime. The cache is only stored in the working memory, however. To ensure reproducibility across R sessions, it is therefore advisable to save the cache to the hard drive at the end of your analyses and load it at the beginning:

mp_save_cache(file = "manifesto_cache.RData")

## ... start new R session ... then:

library(manifestoR)
mp_setapikey("manifesto_apikey.txt")
mp_load_cache(file = "manifesto_cache.RData")

This way manifestoR always works with the same snapshot of the Manifesto Project Database and Corpus, saves a lot of unnecessary online traffic and also enables you to continue with your analyses offline.
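A script that should run both on a fresh machine and on one that already has a saved cache can guard the loading step. The following is a minimal sketch: the helper name load_manifesto_cache_if_present is hypothetical and not part of manifestoR, and it assumes the cache file name used above. Only mp_load_cache(file = ...) is taken from the package.

```r
## Hypothetical helper: load a saved manifestoR cache only if the file exists.
## Returns TRUE if a cache was loaded and FALSE otherwise.
load_manifesto_cache_if_present <- function(cache_file) {
  if (!file.exists(cache_file)) {
    return(FALSE)  # nothing saved yet; data will be downloaded on demand
  }
  manifestoR::mp_load_cache(file = cache_file)
  TRUE
}

cache_file <- "manifesto_cache.RData"
cache_loaded <- load_manifesto_cache_if_present(cache_file)
print(cache_loaded)
```

At the end of the script, mp_save_cache(file = cache_file) writes the cache back to the same file, so the next session starts from the same snapshot.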

Each snapshot of the Manifesto Corpus is identified via a version number, which is stored in the cache together with the data and can be accessed via

mp_which_corpus_version()
## [1] "2020-2"

When collaborating on a project with other researchers, it is advisable to use the same corpus version to keep the results reproducible. manifestoR can be set to use a specific version with the function

mp_use_corpus_version("2015-3")
## Connecting to Manifesto Project DB API... corpus version: 2015-3 
## Connecting to Manifesto Project DB API... corpus version: 2015-3
## 1 documents updated

In order to guarantee reproducibility of published work, please also mention the corpus version id used for the reported analyses in the publication.

For updating locally cached data to the most recent version of the Manifesto Project Corpus, manifestoR provides two functions:

mp_check_for_corpus_update()
## $update_available
## [1] TRUE
## 
## $versionid
## [1] "2020-2"
mp_update_cache()
## Connecting to Manifesto Project DB API... corpus version: 2020-2 
## Connecting to Manifesto Project DB API... corpus version: 2020-2
## 1 documents updated
## [1] "2020-2"
mp_check_for_corpus_update()
## $update_available
## [1] FALSE
## 
## $versionid
## [1] "2020-2"

For more detailed information on the caching mechanism and on how to use and load specific snapshots of the Manifesto Corpus, refer to the R documentation of the functions mentioned here as well as of mp_use_corpus_version, mp_corpusversions and mp_which_corpus_version.

Exporting documents

If required, ManifestoCorpus as well as ManifestoDocument objects can be converted to R’s internal data.frame format and processed further:

doc_df <- as.data.frame(doc)
head(within(doc_df, {
  ## for pretty printing
  text <- paste0(substr(text, 1, 60), "...")
}))
##                                                              text cmp_code
## 1                                             1 Lebensqualität...     <NA>
## 2                                       1.1 Grüne Energiewende...     <NA>
## 3 Lebensqualität bedeutet in einer unversehrten Umwelt zu lebe...      501
## 4 Die Verantwortung dafür liegt bei uns: Wir alle gestalten Um...      606
## 5 Ein Umdenken in der Energiepolitik ist eine wesentliche Vora...      501
## 6 Wir Grüne stehen für eine Energiewende hin zu einem Aufbruch...      501
##   eu_code my_code pos
## 1    <NA>       A   1
## 2    <NA>       B   2
## 3    <NA>       A   3
## 4    <NA>       B   4
## 5    <NA>       A   5
## 6    <NA>       B   6

The function also provides a parameter to include all available metadata in the export:

doc_df_with_meta <- as.data.frame(doc, with.meta = TRUE)
print(names(doc_df_with_meta))
##  [1] "text"                        "cmp_code"                   
##  [3] "eu_code"                     "my_code"                    
##  [5] "pos"                         "manifesto_id"               
##  [7] "party"                       "date"                       
##  [9] "language"                    "source"                     
## [11] "has_eu_code"                 "is_primary_doc"             
## [13] "may_contradict_core_dataset" "md5sum_text"                
## [15] "url_original"                "md5sum_original"            
## [17] "annotations"                 "handbook"                   
## [19] "is_copy_of"                  "title"                      
## [21] "id"                          "manual_edits"

Note again that all functionality provided by tm, such as writeCorpus, is also available on a ManifestoCorpus.
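Once a document is available as a data frame, further processing needs nothing beyond base R. The sketch below uses a small hand-made data frame, toy_df, with the same text, cmp_code and pos columns as the export shown above. Its contents are made up for illustration. It counts how often each code occurs and writes the data to a CSV file.

```r
## Hypothetical stand-in for as.data.frame(doc); contents are made up
toy_df <- data.frame(
  text = c("Heading", "First sentence.", "Second sentence.", "Third sentence."),
  cmp_code = c(NA, "501", "606", "501"),
  pos = 1:4,
  stringsAsFactors = FALSE
)

## Frequency of each code; table() drops uncoded (NA) rows by default
code_counts <- table(toy_df$cmp_code)
print(code_counts)

## Share of each code among the coded quasi-sentences
code_shares <- prop.table(code_counts)
print(code_shares)

## Write to a CSV file and read it back, keeping codes as character strings
out_file <- tempfile(fileext = ".csv")
write.csv(toy_df, out_file, row.names = FALSE)
reread <- read.csv(out_file, stringsAsFactors = FALSE,
                   colClasses = c(cmp_code = "character"))
```

Reading the codes back as character strings keeps them in the same form in which the codes are printed elsewhere in this tutorial.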

Additional Information

When publishing work using the Manifesto Corpus, please make sure to cite it correctly and to give the identification number of the corpus version used for your analysis. You can print citation and version information with the function mp_cite().

For a more detailed reference and complete list of the functions provided by manifestoR, see the R package reference manual on CRAN: http://cran.r-project.org/web/packages/manifestoR/manifestoR.pdf

References

Feinerer, I., & Hornik, K. (2015). tm: Text Mining Package. http://cran.r-project.org/web/packages/tm/index.html

Session Info

Tested with:

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting value                       
##  version R version 4.0.3 (2020-10-10)
##  date    2021-06-15                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date       lib  source        
##  assertthat    0.2.0   2017-04-11 [NA] CRAN (R 4.0.3)
##  base64enc     0.1-3   2015-07-28 [NA] CRAN (R 4.0.2)
##  bookdown      0.22    2021-04-22 [NA] CRAN (R 4.0.2)
##  cli           1.1.0   2019-03-19 [NA] CRAN (R 4.0.3)
##  crayon        1.3.4   2017-09-16 [NA] CRAN (R 4.0.2)
##  curl          3.2     2018-03-28 [NA] CRAN (R 4.0.3)
##  digest        0.6.21  2019-09-20 [NA] CRAN (R 4.0.3)
##  dplyr       * 1.0.6   2021-05-05 [NA] CRAN (R 4.0.2)
##  DT            0.7     2019-06-11 [NA] CRAN (R 4.0.3)
##  ellipsis      0.3.2   2021-04-29 [NA] CRAN (R 4.0.3)
##  evaluate      0.14    2019-05-28 [NA] CRAN (R 4.0.1)
##  fansi         0.4.0   2018-10-05 [NA] CRAN (R 4.0.3)
##  foreign       0.8-70  2018-04-23 [NA] CRAN (R 4.0.3)
##  functional    0.6     2014-07-16 [NA] CRAN (R 4.0.2)
##  generics      0.0.2   2018-11-29 [NA] CRAN (R 4.0.2)
##  glue          1.4.2   2020-08-27 [NA] CRAN (R 4.0.2)
##  hms           0.4.2   2018-03-10 [NA] CRAN (R 4.0.3)
##  htmltools     0.4.0   2019-10-04 [NA] CRAN (R 4.0.3)
##  htmlwidgets   1.5.3   2020-12-10 [NA] CRAN (R 4.0.2)
##  httr          1.3.1   2017-08-20 [NA] CRAN (R 4.0.3)
##  jsonlite      1.6     2018-12-07 [NA] CRAN (R 4.0.3)
##  knitr         1.33    2021-04-24 [NA] CRAN (R 4.0.2)
##  lattice       0.20-35 2017-03-25 [NA] CRAN (R 4.0.3)
##  lifecycle     1.0.0   2021-02-15 [NA] CRAN (R 4.0.2)
##  magrittr      2.0.1   2020-11-17 [NA] CRAN (R 4.0.2)
##  manifestoR  * 1.5.0   2020-11-29 [NA] CRAN (R 4.0.2)
##  mnormt        1.5-5   2016-10-15 [NA] CRAN (R 4.0.3)
##  nlme          3.1-131 2017-02-06 [NA] CRAN (R 4.0.3)
##  NLP         * 0.1-9   2016-02-18 [NA] CRAN (R 4.0.3)
##  pillar        1.6.1   2021-05-16 [NA] CRAN (R 4.0.2)
##  pkgconfig     2.0.2   2018-08-16 [NA] CRAN (R 4.0.3)
##  psych         1.8.3.3 2018-03-30 [NA] CRAN (R 4.0.3)
##  purrr         0.3.2   2019-03-15 [NA] CRAN (R 4.0.3)
##  R6            2.2.2   2017-06-17 [NA] CRAN (R 4.0.3)
##  Rcpp          1.0.0   2018-11-07 [NA] CRAN (R 4.0.3)
##  readr         1.3.1   2018-12-21 [NA] CRAN (R 4.0.3)
##  rlang         0.4.10  2020-12-30 [NA] CRAN (R 4.0.2)
##  rmarkdown     2.8     2021-05-07 [NA] CRAN (R 4.0.2)
##  rmdformats    1.0.2   2021-04-19 [NA] CRAN (R 4.0.2)
##  sessioninfo   1.1.1   2018-11-05 [NA] CRAN (R 4.0.2)
##  slam          0.1-40  2016-12-01 [NA] CRAN (R 4.0.3)
##  stringi       1.1.7   2018-03-12 [NA] CRAN (R 4.0.3)
##  stringr       1.3.0   2018-02-19 [NA] CRAN (R 4.0.3)
##  tibble        3.1.2   2021-05-16 [NA] CRAN (R 4.0.2)
##  tidyselect    1.1.1   2021-04-30 [NA] CRAN (R 4.0.3)
##  tm          * 0.7-5   2018-07-29 [NA] CRAN (R 4.0.3)
##  utf8          1.1.3   2018-01-03 [NA] CRAN (R 4.0.3)
##  vctrs         0.3.8   2021-04-29 [NA] CRAN (R 4.0.3)
##  withr         2.1.2   2018-03-15 [NA] CRAN (R 4.0.3)
##  xfun          0.23    2021-05-15 [NA] CRAN (R 4.0.2)
##  xml2          1.2.0   2018-01-24 [NA] CRAN (R 4.0.3)
##  yaml          2.2.0   2018-07-25 [NA] CRAN (R 4.0.3)
##  zoo           1.7-13  2016-05-03 [NA] CRAN (R 4.0.3)