Using the Manifesto Corpus with the tidytext package

Nicolas Merz,

31 May 2018

In this tutorial, we will show how to use the tidytext package to convert the Manifesto Corpus into a tidy text format. We assume that you have already read the tutorial on the First steps with manifestoR.

Tidy data and tidytext

The tidy text format is inspired by the tidy data format (Wickham 2014). Data is tidy if

  • each variable is a column
  • each observation is a row
  • each type of observational unit is a table

In other contexts, tidy data is also known as the “long” format.
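To make these principles concrete, here is a minimal, hypothetical sketch (not part of the original tutorial) of the same data once in a non-tidy “wide” layout and once in the tidy “long” layout; the parties and vote counts are made up purely for illustration.

library(tibble)

# Not tidy: the variable "year" is hidden in the column names,
# so each row mixes observations from two elections.
wide <- tribble(
  ~party, ~votes_2011, ~votes_2016,
  "A",            100,         120,
  "B",             80,          90
)

# Tidy ("long"): each variable is a column, each observation is a row.
long <- tribble(
  ~party, ~year, ~votes,
  "A",     2011,    100,
  "A",     2016,    120,
  "B",     2011,     80,
  "B",     2016,     90
)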

The tidy text format builds on these principles of tidy data. Tidy text is a format where information is stored in “a table with one-token-per-row” (Silge and Robinson 2016). This contrasts with the term-document matrices or document-feature matrices commonly used in text analysis.

The advantage of the tidy text format is that it allows the use of functions that many users already know from managing and cleaning “normal” data sets.

The tidytext package provides functions to transform several other text data formats into a tidy text format. These functions can also be applied to the Manifesto Corpus format. In the following, we will show how to use the functions of the tidytext package to convert the Manifesto Corpus into a tidy text format.

tidytext package

If you have not installed the manifestoR or tidytext package, you need to install them first with install.packages("manifestoR") and/or install.packages("tidytext"). As in every session using the Manifesto Corpus, you need to set your API key. To learn more about the API key and manifestoR, see the tutorial “First steps with manifestoR”. Moreover, we fix the corpus version using the mp_use_corpus_version function. This ensures that the script does not break when a new corpus version is published, because by default the latest corpus version is used.

library(manifestoR)
library(tidytext)
library(dplyr)
library(ggplot2)
mp_setapikey(key.file = "manifesto_apikey.txt")
mp_use_corpus_version("2017-2")

The mp_corpus function returns a ManifestoCorpus object in the Corpus format of the tm package (see the “First steps…” tutorial for more information). We use the manifestos of the Irish 2016 election as an exemplary case here.

ireland2016_corpus <- mp_corpus(countryname == "Ireland" & date == 201602)
ireland2016_corpus
## <<ManifestoCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 10

The tidy() function transforms the ManifestoCorpus into a data frame where each row represents one document. The variables are the meta-information from the corpus as well as an additional variable named “text” that contains the whole text of each document.

tidied_corpus <- ireland2016_corpus %>% tidy()
tidied_corpus
## # A tibble: 10 x 17
##    manifesto_id party   date language source has_eu_code is_primary_doc
##    <chr>        <dbl>  <dbl> <chr>    <chr>  <lgl>       <lgl>         
##  1 53110_201602 53110 201602 english  MARPOR FALSE       TRUE          
##  2 53231_201602 53231 201602 english  MARPOR FALSE       TRUE          
##  3 53240_201602 53240 201602 english  MARPOR FALSE       TRUE          
##  4 53250_201602 53250 201602 english  MARPOR FALSE       TRUE          
##  5 53320_201602 53320 201602 english  MARPOR FALSE       TRUE          
##  6 53321_201602 53321 201602 english  MARPOR FALSE       TRUE          
##  7 53520_201602 53520 201602 english  MARPOR FALSE       TRUE          
##  8 53620_201602 53620 201602 english  MARPOR FALSE       TRUE          
##  9 53951_201602 53951 201602 english  MARPOR FALSE       TRUE          
## 10 53981_201602 53981 201602 english  MARPOR FALSE       TRUE          
## # … with 10 more variables: may_contradict_core_dataset <lgl>,
## #   md5sum_text <chr>, url_original <chr>, md5sum_original <chr>,
## #   annotations <lgl>, handbook <chr>, is_copy_of <chr>, title <chr>, id <chr>,
## #   text <chr>

The most important function of the tidytext package is the unnest_tokens function. It tokenizes the text variable into words (or other tokens) and creates one row per token, making the data frame tidy. By default, unnest_tokens transforms all characters to lower case.

tidy_df <- tidied_corpus %>%
  unnest_tokens(word, text)

tidy_df %>%
  select(manifesto_id, word) %>%
  head(15)
## # A tibble: 15 x 2
##    manifesto_id word       
##    <chr>        <chr>      
##  1 53110_201602 think      
##  2 53110_201602 ahead      
##  3 53110_201602 act        
##  4 53110_201602 now        
##  5 53110_201602 general    
##  6 53110_201602 election   
##  7 53110_201602 manifesto  
##  8 53110_201602 2016       
##  9 53110_201602 progressive
## 10 53110_201602 practical  
## 11 53110_201602 and        
## 12 53110_201602 sustainable
## 13 53110_201602 politics   
## 14 53110_201602 for        
## 15 53110_201602 the
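
As a side note (not part of the original tutorial), unnest_tokens has further arguments that control the tokenization. The following sketch assumes the tidied_corpus object from above and shows two variants: keeping the original case and tokenizing into bigrams.

tidied_corpus %>%
  unnest_tokens(word, text, to_lower = FALSE)            # keep the original case

tidied_corpus %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)   # two-word sequences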

Cleaning and preprocessing

The tidy format allows us to make use of the dplyr grammar to preprocess and clean the data. To remove stopwords, we make use of a stop word collection that comes with the tidytext package: get_stopwords() is a tidytext function that returns a data frame with a list of stopwords (words that are frequent but carry little meaning).

get_stopwords()
## # A tibble: 175 x 2
##    word      lexicon 
##    <chr>     <chr>   
##  1 i         snowball
##  2 me        snowball
##  3 my        snowball
##  4 myself    snowball
##  5 we        snowball
##  6 our       snowball
##  7 ours      snowball
##  8 ourselves snowball
##  9 you       snowball
## 10 your      snowball
## # … with 165 more rows
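
As a side note (not from the original tutorial), get_stopwords() also accepts other lexicons and languages, which can be useful for non-English manifestos.

get_stopwords(source = "smart")    # larger English stop word lexicon
get_stopwords(language = "de")     # German stop words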

anti_join here keeps only the words that do not appear in the data frame provided as its argument. Another advantage of the tidy text format is that one can easily filter for certain characteristics. Here, we show how to filter out tokens that consist only of numbers. The expression is.na(as.numeric(word)) keeps only words that cannot be converted to numeric values and thereby removes all tokens that contain nothing but numbers (such as the “2016” in the example above).

tidy_without_stopwords <- tidy_df %>%
  anti_join(get_stopwords()) %>%
  filter(is.na(as.numeric(word)))

tidy_without_stopwords %>%
  select(manifesto_id, word) %>%
  head(10)
## # A tibble: 10 x 2
##    manifesto_id word       
##    <chr>        <chr>      
##  1 53110_201602 think      
##  2 53110_201602 ahead      
##  3 53110_201602 act        
##  4 53110_201602 now        
##  5 53110_201602 general    
##  6 53110_201602 election   
##  7 53110_201602 manifesto  
##  8 53110_201602 progressive
##  9 53110_201602 practical  
## 10 53110_201602 sustainable
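
One could also extend the stop word list with corpus-specific terms before the anti_join. The following is a hypothetical sketch (the added words are just examples, not a recommendation):

custom_stopwords <- get_stopwords() %>%
  bind_rows(tibble::tibble(word = c("ireland", "irish"), lexicon = "custom"))

tidy_df %>%
  anti_join(custom_stopwords, by = "word")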

Term frequencies and Tf-Idf

Using the count function on the tidied data, it is very easy to obtain term frequencies of the corpus under investigation.

tidy_without_stopwords %>%
  count(word, sort = TRUE) %>%
  head(10)
## # A tibble: 10 x 2
##    word           n
##    <chr>      <int>
##  1 new          847
##  2 people       704
##  3 ireland      686
##  4 public       678
##  5 ensure       674
##  6 government   620
##  7 support      615
##  8 fine         578
##  9 gael         556
## 10 services     536
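
As noted at the beginning, the tidy format and the document-term matrix format are complementary: tidytext can cast a tidy table of per-document counts back into a tm DocumentTermMatrix. A short sketch (not part of the original tutorial):

ireland_dtm <- tidy_without_stopwords %>%
  count(manifesto_id, word) %>%
  cast_dtm(document = manifesto_id, term = word, value = n)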

General term frequencies (even when calculated per document) are often not very meaningful as they do not differ much across documents. Many applications therefore calculate the tf-idf score (term frequency–inverse document frequency). It detects words that appear often within one document, but rarely in other documents: tf-idf identifies words that are both frequent and distinctive. tidytext provides the function bind_tf_idf, which adds the tf-idf score to a data frame containing term frequencies and document metadata.
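
To make the tf-idf logic transparent, the following sketch (not part of the original tutorial) computes the score manually with dplyr, assuming the default definition used by bind_tf_idf: tf is the share of a term within its document, and idf is the natural logarithm of the total number of documents divided by the number of documents containing the term.

word_counts <- tidy_without_stopwords %>%
  count(party, word)

n_docs <- n_distinct(word_counts$party)          # number of manifestos

manual_tf_idf <- word_counts %>%
  group_by(party) %>%
  mutate(tf = n / sum(n)) %>%                    # term frequency within each manifesto
  ungroup() %>%
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(party))) %>%  # inverse document frequency
  ungroup() %>%
  mutate(tf_idf = tf * idf)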

Before calculating the tf-idf score, we get nicer document names based on the party names stored in the Manifesto Project Dataset.

irish_partynames <- mp_maindataset() %>%
  filter(countryname == "Ireland" & date == 201602) %>%
  select(party, partyname)

The following code shows how to calculate tf-idf scores and plot the five highest-scoring terms for each manifesto. For more information on tf-idf scores, have a look at the respective chapter in Silge and Robinson (2016).

tidy_without_stopwords %>%
  count(party, word, sort = TRUE) %>%
  bind_tf_idf(word, party, n = n) %>%
  arrange(party, desc(tf_idf)) %>%
  # mutate(word = factor(word, levels = rev(unique(word)), ordered=T)) %>%
  group_by(party) %>%
  top_n(5) %>%
  ungroup() %>%
  left_join(irish_partynames, by = "party") %>%
  ggplot(aes(x = reorder(word, tf_idf), y = tf_idf, fill = partyname)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~partyname, ncol = 2, scales = "free") +
  coord_flip()
## Selecting by tf_idf
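
The “Selecting by tf_idf” message appears because top_n() is called without an explicit ranking variable. A small variant (not in the original tutorial) makes the ranking explicit and returns the top terms as a table instead of a plot:

top_terms <- tidy_without_stopwords %>%
  count(party, word, sort = TRUE) %>%
  bind_tf_idf(word, party, n = n) %>%
  group_by(party) %>%
  top_n(5, tf_idf) %>%                           # explicit ranking variable, no message
  ungroup() %>%
  left_join(irish_partynames, by = "party") %>%
  arrange(partyname, desc(tf_idf))

top_terms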