Using the Manifesto Corpus with the tidytext package

Nicolas Merz,

31 May 2018

In this tutorial, we will show how to use the tidytext package to convert the Manifesto Corpus into a tidy text format. We assume that you have already read the First steps with manifestoR tutorial.

Tidy data and tidytext

The tidy text format is inspired by the tidy data format (Wickham 2014). Data is tidy if

  • each variable is a column
  • each observation is a row
  • each type of observational unit is a table

In other contexts, tidy data is also known as the “long” format.

The tidy text format picks up the three principles of tidy data. Tidy text is a format where information is stored in “a table with one-token-per-row” (Silge and Robinson 2016). This is in contrast to term-document matrices or document-feature matrices that are commonly used in text analysis.

The advantage of the tidy text format is that it allows the use of functions that many users already know from managing and cleaning “normal” data sets.
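To illustrate, here is a minimal self-contained sketch with two made-up mini documents (the unnest_tokens function is introduced in detail below):

library(dplyr)
library(tidytext)

# two made-up mini "documents"
toy <- tibble(doc  = c("doc1", "doc2"),
              text = c("We protect the environment.",
                       "We will lower taxes now."))

# one-token-per-row: each word becomes its own row
tidy_toy <- toy %>%
  unnest_tokens(word, text)

# ordinary dplyr verbs now work directly on the text
tidy_toy %>%
  count(doc, word, sort = TRUE)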

The tidytext package provides functions to transform several other text data formats into a tidy text format. These functions can also be applied to the Manifesto Corpus format. In the following, we will show how to use the functions of the tidytext package to convert the Manifesto Corpus into a tidy text format.

tidytext package

If you have not installed the manifestoR or tidytext package yet, you need to install them first with install.packages("manifestoR") and/or install.packages("tidytext"). As in every session using the Manifesto Corpus, you need to set your API key. To learn more about the API key and manifestoR, see the tutorial “First steps with manifestoR”. Moreover, we fix the corpus version with the mp_use_corpus_version function. This ensures that the script does not break when a new corpus version is published, as by default the latest corpus version is used.

library(manifestoR)
library(tidytext)
library(dplyr)
library(ggplot2)
mp_setapikey(key.file = "manifesto_apikey.txt")
mp_use_corpus_version("2017-2")

The mp_corpus function returns a ManifestoCorpus object in the Corpus format of the tm package (see the “First steps…” tutorial for more information). We use the manifestos of the Irish 2016 election as an example here.

ireland2016_corpus <- mp_corpus(countryname == "Ireland" & date == 201602)
ireland2016_corpus
## <<ManifestoCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 10

The tidy() function transforms the ManifestoCorpus into a data frame where each row represents one document. The variables contain the meta-information from the corpus as well as an additional variable named “text” that contains the whole text of each document.

tidied_corpus <- ireland2016_corpus %>% tidy()
tidied_corpus
## # A tibble: 10 x 17
##    manifesto_id party   date language source has_eu_code is_primary_doc
##    <chr>        <dbl>  <dbl> <chr>    <chr>  <lgl>       <lgl>         
##  1 53110_201602 53110 201602 english  MARPOR FALSE       TRUE          
##  2 53231_201602 53231 201602 english  MARPOR FALSE       TRUE          
##  3 53240_201602 53240 201602 english  MARPOR FALSE       TRUE          
##  4 53250_201602 53250 201602 english  MARPOR FALSE       TRUE          
##  5 53320_201602 53320 201602 english  MARPOR FALSE       TRUE          
##  6 53321_201602 53321 201602 english  MARPOR FALSE       TRUE          
##  7 53520_201602 53520 201602 english  MARPOR FALSE       TRUE          
##  8 53620_201602 53620 201602 english  MARPOR FALSE       TRUE          
##  9 53951_201602 53951 201602 english  MARPOR FALSE       TRUE          
## 10 53981_201602 53981 201602 english  MARPOR FALSE       TRUE          
## # … with 10 more variables: may_contradict_core_dataset <lgl>,
## #   md5sum_text <chr>, url_original <chr>, md5sum_original <chr>,
## #   annotations <lgl>, handbook <chr>, is_copy_of <chr>, title <chr>, id <chr>,
## #   text <chr>

The most important function of the tidytext package is the unnest_tokens function. It tokenizes the text variable into words (or other tokens) and creates one row per token, making the data frame tidy. By default, unnest_tokens transforms all characters to lower case.

tidy_df <- tidied_corpus %>%
  unnest_tokens(word, text)

tidy_df %>%
  select(manifesto_id, word) %>%
  head(15)
## # A tibble: 15 x 2
##    manifesto_id word       
##    <chr>        <chr>      
##  1 53110_201602 think      
##  2 53110_201602 ahead      
##  3 53110_201602 act        
##  4 53110_201602 now        
##  5 53110_201602 general    
##  6 53110_201602 election   
##  7 53110_201602 manifesto  
##  8 53110_201602 2016       
##  9 53110_201602 progressive
## 10 53110_201602 practical  
## 11 53110_201602 and        
## 12 53110_201602 sustainable
## 13 53110_201602 politics   
## 14 53110_201602 for        
## 15 53110_201602 the
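unnest_tokens has further useful arguments: to_lower controls the lowercasing mentioned above, and token selects the tokenization unit. A minimal sketch of both:

tidied_corpus %>%
  unnest_tokens(word, text, to_lower = FALSE) # keep the original casing

tidied_corpus %>%
  unnest_tokens(sentence, text, token = "sentences") # one row per sentence instead of per word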

Cleaning and preprocessing

The tidy format allows us to use the dplyr grammar to preprocess and clean the data. To remove stopwords, we make use of a stop word collection that comes with the tidytext package: get_stopwords() is a tidytext function that returns a data frame with a list of stopwords (frequent but rarely meaningful words).

get_stopwords()
## # A tibble: 175 x 2
##    word      lexicon 
##    <chr>     <chr>   
##  1 i         snowball
##  2 me        snowball
##  3 my        snowball
##  4 myself    snowball
##  5 we        snowball
##  6 our       snowball
##  7 ours      snowball
##  8 ourselves snowball
##  9 you       snowball
## 10 your      snowball
## # … with 165 more rows
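Since the Manifesto Corpus contains manifestos in many languages, it is worth noting that get_stopwords takes language and source arguments (passed on to the stopwords package), for example:

get_stopwords(language = "de") # German stopwords from the snowball lexicon
get_stopwords(source = "smart") # a larger English stopword list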

anti_join keeps only the words that do not appear in the data frame provided as argument. Another advantage of the tidy text format is that one can easily filter for certain characteristics of tokens. Here, we filter out tokens that consist of numbers only: the expression is.na(as.numeric(word)) keeps words that cannot be converted to numeric values, which drops purely numeric tokens (such as the “2016” in the example above).

tidy_without_stopwords <- tidy_df %>%
  anti_join(get_stopwords()) %>%
  filter(is.na(as.numeric(word)))

tidy_without_stopwords %>%
  select(manifesto_id, word) %>%
  head(10)
## # A tibble: 10 x 2
##    manifesto_id word       
##    <chr>        <chr>      
##  1 53110_201602 think      
##  2 53110_201602 ahead      
##  3 53110_201602 act        
##  4 53110_201602 now        
##  5 53110_201602 general    
##  6 53110_201602 election   
##  7 53110_201602 manifesto  
##  8 53110_201602 progressive
##  9 53110_201602 practical  
## 10 53110_201602 sustainable

Term frequencies and Tf-Idf

Using the count function on the tidied data, it is very easy to obtain term frequencies of the corpus under investigation.

tidy_without_stopwords %>%
  count(word, sort = TRUE) %>%
  head(10)
## # A tibble: 10 x 2
##    word           n
##    <chr>      <int>
##  1 new          847
##  2 people       704
##  3 ireland      686
##  4 public       678
##  5 ensure       674
##  6 government   620
##  7 support      615
##  8 fine         578
##  9 gael         556
## 10 services     536

General term frequencies (even when calculated per document) are often not very meaningful as they do not differ much across documents. Many applications therefore calculate the tf-idf score (term frequency-inverse document frequency), which highlights words that appear often within one document but rarely in other documents. Tf-idf thus identifies words that are both frequent and distinctive. tidytext provides the function bind_tf_idf, which adds the tf-idf score to a data frame containing term frequencies and document meta data.
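As a minimal sketch of what bind_tf_idf computes, consider a made-up count table with two documents:

toy_counts <- tibble(doc  = c("a", "a", "b"),
                     word = c("tax", "green", "green"),
                     n    = c(10, 1, 8))

toy_counts %>%
  bind_tf_idf(word, doc, n)
# "green" occurs in both documents, so its idf (and tf-idf) is zero;
# "tax" occurs only in document "a" and gets a positive tf-idf there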

Before calculating the tf-idf scores, we retrieve nicer document names based on the party names stored in the Manifesto Project Dataset.

irish_partynames <- mp_maindataset() %>%
  filter(countryname == "Ireland" & date == 201602) %>%
  select(party, partyname)

The following shows how to calculate tf-idf scores and plot the 5 highest scoring terms for each manifesto. For more information on tf-idf scores, have a look at the respective chapter of Text Mining with R.

tidy_without_stopwords %>%
  count(party, word, sort = TRUE) %>%
  bind_tf_idf(word, party, n = n) %>%
  arrange(party, desc(tf_idf)) %>%
  group_by(party) %>%
  top_n(5) %>%
  ungroup() %>%
  left_join(irish_partynames, by = "party") %>%
  ggplot(aes(x = reorder(word, tf_idf), y = tf_idf, fill = partyname)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~partyname, ncol = 2, scales = "free") +
  coord_flip()
## Selecting by tf_idf

One can see that the terms with high tf-idf scores differ across parties. Not surprisingly, the parties’ names or parts thereof appear often in these lists (as they are often used by the party, and rarely by other parties).

Make use of the codings (annotations)

The previous analyses made use only of the machine-readable texts, but did not exploit the digital codings/annotations of the Manifesto Corpus. In this section, we show how to use the tidytext package in conjunction with these annotations. In order to keep the codes for further analysis, it is necessary to first convert the ManifestoCorpus object to a data frame and then use the unnest_tokens function (instead of the tidy function, which would drop the codes). The pos variable in the resulting data frame comes from the content of the ManifestoCorpus and indicates the position of the quasi-sentence within a ManifestoDocument. The following extract shows the quasi-sentences 50 and 51 of the Green Party manifesto (party id 53110). For better readability, we did not remove stopwords here. One can see that quasi-sentence 50 was coded as 107 (Internationalism: positive), while the following quasi-sentence was coded as 501 (Environmental protection: positive).

words_and_codes <- mp_corpus(countryname == "Ireland" & date == 201602) %>%
  as.data.frame(with.meta = TRUE) %>% # keeps the codes and meta data (unlike tidy())
  unnest_tokens(word, text)

words_and_codes %>%
  select(party, word, pos, cmp_code) %>%
  filter(party == 53110 & between(pos, 50, 51))
##       party         word pos cmp_code
## 50    53110           we  50      107
## 50.1  53110         need  50      107
## 50.2  53110           to  50      107
## 50.3  53110       regain  50      107
## 50.4  53110         this  50      107
## 50.5  53110       spirit  50      107
## 50.6  53110          and  50      107
## 50.7  53110         this  50      107
## 50.8  53110       stance  50      107
## 50.9  53110          and  50      107
## 50.10 53110          act  50      107
## 50.11 53110           as  50      107
## 50.12 53110           an  50      107
## 50.13 53110       honest  50      107
## 50.14 53110       broker  50      107
## 50.15 53110           in  50      107
## 50.16 53110          all  50      107
## 50.17 53110          our  50      107
## 50.18 53110 multilateral  50      107
## 50.19 53110  engagements  50      107
## 51    53110      looking  51      501
## 51.1  53110     globally  51      501
## 51.2  53110           we  51      501
## 51.3  53110         will  51      501
## 51.4  53110    legislate  51      501
## 51.5  53110          for  51      501
## 51.6  53110      binding  51      501
## 51.7  53110      targets  51      501
## 51.8  53110           on  51      501
## 51.9  53110      climate  51      501
## 51.10 53110       change  51      501
## 51.11 53110           in  51      501
## 51.12 53110         line  51      501
## 51.13 53110         with  51      501
## 51.14 53110          the  51      501
## 51.15 53110        paris  51      501
## 51.16 53110    agreement  51      501

Now, we can simply filter based on the cmp_code variable, e.g. to exclude some of the word occurrences from the analysis. One can also use the coding information to calculate tf-idf scores based on the different coding categories instead of the different documents. This should yield terms that are distinctive and meaningful for the given categories. We first remove stopwords and purely numeric tokens as shown above, and drop quasi-sentences coded as headlines (“H”), non-coded quasi-sentences, and quasi-sentences coded as “0” (no particular meaning, cannot be coded). To reduce complexity, we recode the categories coded according to version 5 of the coding instructions to the less complex coding scheme of version 4 (this aggregates several subcategories to their main categories - see the subcategories tutorial for more information). Then, we count words and calculate tf-idf scores based on the word frequencies per coding category (instead of per document).

tfidf_codes <- words_and_codes %>%
  anti_join(get_stopwords()) %>%                           # remove stopwords
  filter(is.na(as.numeric(word))) %>%                      # remove purely numeric tokens
  filter(!(cmp_code %in% c("H", "", "0", "000", NA))) %>%  # drop headlines and uncoded quasi-sentences
  mutate(cmp_code = recode_v5_to_v4(cmp_code)) %>%         # aggregate v5 subcategories to v4 main categories
  count(cmp_code, word) %>%
  bind_tf_idf(word, cmp_code, n)

For illustrative purposes, we restrict the dataset to four codes: decentralisation (301), technology & infrastructure (411), environmental protection (501), and culture (502). We can see that the terms with high tf-idf scores seem very reasonable and make intuitive sense for the categories here (certainly, otherwise we wouldn’t have chosen this example…).

tfidf_codes %>%
  filter(cmp_code %in% c("501", "502", "301", "411")) %>%
  mutate(cmp_code = factor(cmp_code, labels = c("Decentralisation", "Technology & Infrastructure", "Environmental Protection", "Culture"))) %>%
  group_by(cmp_code) %>%
  top_n(10, tf_idf) %>%
  ggplot(aes(x = reorder(word, tf_idf), y = tf_idf, fill = cmp_code)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~cmp_code, ncol = 2, scales = "free") +
  coord_flip()

This was just a primer on how to use the tidytext package (and its philosophy) with the Manifesto Corpus. If you want to dig deeper into tidy text mining, we recommend the book Text Mining with R: A Tidy Approach by Julia Silge and David Robinson. Finally, note that tidytext also provides functions to convert to and from the formats of other text packages such as quanteda or tm.
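A sketch of such conversions, reusing the word counts from above (cast_dtm produces a tm DocumentTermMatrix; cast_dfm requires the quanteda package to be installed):

tidy_without_stopwords %>%
  count(manifesto_id, word) %>%
  cast_dtm(manifesto_id, word, n) # tm DocumentTermMatrix

tidy_without_stopwords %>%
  count(manifesto_id, word) %>%
  cast_dfm(manifesto_id, word, n) # quanteda document-feature matrix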

Bibliography

Silge, Julia, and David Robinson. 2016. “Tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” The Journal of Open Source Software 1 (3). doi:10.21105/joss.00037.

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). doi:10.18637/jss.v059.i10.

Session Info

Tested with:

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting value                       
##  version R version 4.0.3 (2020-10-10)
##  date    2021-06-15                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date       lib  source        
##  assertthat    0.2.0   2017-04-11 [NA] CRAN (R 4.0.3)
##  base64enc     0.1-3   2015-07-28 [NA] CRAN (R 4.0.2)
##  bookdown      0.22    2021-04-22 [NA] CRAN (R 4.0.2)
##  cli           1.1.0   2019-03-19 [NA] CRAN (R 4.0.3)
##  colorspace    1.3-2   2016-12-14 [NA] CRAN (R 4.0.3)
##  crayon        1.3.4   2017-09-16 [NA] CRAN (R 4.0.2)
##  curl          3.2     2018-03-28 [NA] CRAN (R 4.0.3)
##  digest        0.6.21  2019-09-20 [NA] CRAN (R 4.0.3)
##  dplyr       * 1.0.6   2021-05-05 [NA] CRAN (R 4.0.2)
##  DT            0.7     2019-06-11 [NA] CRAN (R 4.0.3)
##  ellipsis      0.3.2   2021-04-29 [NA] CRAN (R 4.0.3)
##  evaluate      0.14    2019-05-28 [NA] CRAN (R 4.0.1)
##  fansi         0.4.0   2018-10-05 [NA] CRAN (R 4.0.3)
##  farver        2.0.1   2019-11-13 [NA] CRAN (R 4.0.3)
##  foreign       0.8-70  2018-04-23 [NA] CRAN (R 4.0.3)
##  functional    0.6     2014-07-16 [NA] CRAN (R 4.0.2)
##  generics      0.0.2   2018-11-29 [NA] CRAN (R 4.0.2)
##  ggplot2     * 3.3.3   2020-12-30 [NA] CRAN (R 4.0.2)
##  glue          1.4.2   2020-08-27 [NA] CRAN (R 4.0.2)
##  gtable        0.2.0   2016-02-26 [NA] CRAN (R 4.0.3)
##  highr         0.6     2016-05-09 [NA] CRAN (R 4.0.3)
##  hms           0.4.2   2018-03-10 [NA] CRAN (R 4.0.3)
##  htmltools     0.4.0   2019-10-04 [NA] CRAN (R 4.0.3)
##  htmlwidgets   1.5.3   2020-12-10 [NA] CRAN (R 4.0.2)
##  httr          1.3.1   2017-08-20 [NA] CRAN (R 4.0.3)
##  janeaustenr   0.1.1   2016-06-20 [NA] CRAN (R 4.0.3)
##  jsonlite      1.6     2018-12-07 [NA] CRAN (R 4.0.3)
##  knitr         1.33    2021-04-24 [NA] CRAN (R 4.0.2)
##  labeling      0.3     2014-08-23 [NA] CRAN (R 4.0.3)
##  lattice       0.20-35 2017-03-25 [NA] CRAN (R 4.0.3)
##  lifecycle     1.0.0   2021-02-15 [NA] CRAN (R 4.0.2)
##  magrittr      2.0.1   2020-11-17 [NA] CRAN (R 4.0.2)
##  manifestoR  * 1.5.0   2020-11-29 [NA] CRAN (R 4.0.2)
##  Matrix        1.2-14  2018-04-09 [NA] CRAN (R 4.0.3)
##  mnormt        1.5-5   2016-10-15 [NA] CRAN (R 4.0.3)
##  munsell       0.5.0   2018-06-12 [NA] CRAN (R 4.0.2)
##  nlme          3.1-131 2017-02-06 [NA] CRAN (R 4.0.3)
##  NLP         * 0.1-9   2016-02-18 [NA] CRAN (R 4.0.3)
##  pillar        1.6.1   2021-05-16 [NA] CRAN (R 4.0.2)
##  pkgconfig     2.0.2   2018-08-16 [NA] CRAN (R 4.0.3)
##  psych         1.8.3.3 2018-03-30 [NA] CRAN (R 4.0.3)
##  purrr         0.3.2   2019-03-15 [NA] CRAN (R 4.0.3)
##  R6            2.2.2   2017-06-17 [NA] CRAN (R 4.0.3)
##  Rcpp          1.0.0   2018-11-07 [NA] CRAN (R 4.0.3)
##  readr         1.3.1   2018-12-21 [NA] CRAN (R 4.0.3)
##  rlang         0.4.10  2020-12-30 [NA] CRAN (R 4.0.2)
##  rmarkdown     2.8     2021-05-07 [NA] CRAN (R 4.0.2)
##  rmdformats    1.0.2   2021-04-19 [NA] CRAN (R 4.0.2)
##  scales        1.1.0   2019-11-18 [NA] CRAN (R 4.0.3)
##  sessioninfo   1.1.1   2018-11-05 [NA] CRAN (R 4.0.2)
##  slam          0.1-40  2016-12-01 [NA] CRAN (R 4.0.3)
##  SnowballC     0.5.1   2014-08-09 [NA] CRAN (R 4.0.3)
##  stopwords     0.9.0   2017-12-14 [NA] CRAN (R 4.0.3)
##  stringi       1.1.7   2018-03-12 [NA] CRAN (R 4.0.3)
##  stringr       1.3.0   2018-02-19 [NA] CRAN (R 4.0.3)
##  tibble        3.1.2   2021-05-16 [NA] CRAN (R 4.0.2)
##  tidyselect    1.1.1   2021-04-30 [NA] CRAN (R 4.0.3)
##  tidytext    * 0.2.1   2019-06-14 [NA] CRAN (R 4.0.3)
##  tm          * 0.7-5   2018-07-29 [NA] CRAN (R 4.0.3)
##  tokenizers    0.2.1   2018-03-29 [NA] CRAN (R 4.0.2)
##  utf8          1.1.3   2018-01-03 [NA] CRAN (R 4.0.3)
##  vctrs         0.3.8   2021-04-29 [NA] CRAN (R 4.0.3)
##  withr         2.1.2   2018-03-15 [NA] CRAN (R 4.0.3)
##  xfun          0.23    2021-05-15 [NA] CRAN (R 4.0.2)
##  xml2          1.2.0   2018-01-24 [NA] CRAN (R 4.0.3)
##  yaml          2.2.0   2018-07-25 [NA] CRAN (R 4.0.3)
##  zoo           1.7-13  2016-05-03 [NA] CRAN (R 4.0.3)