In this tutorial, we show how to use the tidytext package to convert the Manifesto Corpus into a tidy text format. We assume that you have already read the tutorial “First steps with manifestoR”.
Tidy data and tidytext
The tidy text format is inspired by the tidy data format (Wickham 2014). Data is tidy if
- each variable is a column
- each observation is a row
- each type of observational unit is a table
In other contexts, tidy data is also known as “long” format.
The tidy text format picks up these three principles of tidy data. Tidy text is a format where information is stored in “a table with one-token-per-row” (Silge and Robinson 2016). This is in contrast to the term-document matrices or document-feature matrices that are commonly used in text analysis.
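To make the contrast concrete, here is a minimal sketch on two made-up sentences (not Manifesto Corpus data). It builds a one-token-per-row table with unnest_tokens and, for comparison, a document-term style cross-tabulation with base R's table(). The object names and example sentences are assumptions chosen for illustration.

```r
library(dplyr)
library(tidytext)

# Two hypothetical mini-documents (illustrative only)
toy_docs <- tibble(
  doc  = c("a", "b"),
  text = c("We support public services", "Public services need support")
)

# Tidy text: one row per token, with the document id repeated on each row
toy_tidy <- toy_docs %>%
  unnest_tokens(word, text)

# Document-term layout: one row per document, one column per word
toy_dtm <- table(toy_tidy$doc, toy_tidy$word)

toy_tidy
toy_dtm
```

The tidy table grows in rows as documents get longer, while the document-term layout grows in columns as the vocabulary grows.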
The advantage of the tidytext format is that it allows the use of functions many users are familiar with from managing and cleaning “normal” data sets.
The tidytext package provides functions to transform several other text data formats into a tidy text format. These functions can also be applied to the Manifesto Corpus format. In the following, we will show how to use the functions of the tidytext package to convert the Manifesto Corpus into a tidy text format.
tidytext package
If you have not yet installed the manifestoR or tidytext packages, install them first with install.packages("manifestoR") and/or install.packages("tidytext"). As in every session using the Manifesto Corpus, you need to set your API key. To learn more about the API key and manifestoR, see the tutorial “First steps with manifestoR”. Moreover, we fix the corpus version using the mp_use_corpus_version function. This ensures that the script does not break when a new corpus version is published, because by default the latest corpus version is used.
library(manifestoR)
library(tidytext)
library(dplyr)
library(ggplot2)
mp_setapikey(key.file = "manifesto_apikey.txt")
mp_use_corpus_version("2017-2")
The mp_corpus function returns a ManifestoCorpus object in the Corpus format of the tm package (see the “First steps…” tutorial for more information). We use the manifestos of the Irish 2016 election as an example here.
ireland2016_corpus <- mp_corpus(countryname == "Ireland" & date == 201602)
ireland2016_corpus
## <<ManifestoCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 10
The tidy() function transforms the ManifestoCorpus into a data frame where each row represents one document. Variables are the meta-information from the corpus as well as an additional variable named “text” that contains the whole text for each document.
tidied_corpus <- ireland2016_corpus %>% tidy()
tidied_corpus
## # A tibble: 10 x 17
## manifesto_id party date language source has_eu_code is_primary_doc
## <chr> <dbl> <dbl> <chr> <chr> <lgl> <lgl>
## 1 53110_201602 53110 201602 english MARPOR FALSE TRUE
## 2 53231_201602 53231 201602 english MARPOR FALSE TRUE
## 3 53240_201602 53240 201602 english MARPOR FALSE TRUE
## 4 53250_201602 53250 201602 english MARPOR FALSE TRUE
## 5 53320_201602 53320 201602 english MARPOR FALSE TRUE
## 6 53321_201602 53321 201602 english MARPOR FALSE TRUE
## 7 53520_201602 53520 201602 english MARPOR FALSE TRUE
## 8 53620_201602 53620 201602 english MARPOR FALSE TRUE
## 9 53951_201602 53951 201602 english MARPOR FALSE TRUE
## 10 53981_201602 53981 201602 english MARPOR FALSE TRUE
## # … with 10 more variables: may_contradict_core_dataset <lgl>,
## # md5sum_text <chr>, url_original <chr>, md5sum_original <chr>,
## # annotations <lgl>, handbook <chr>, is_copy_of <chr>, title <chr>, id <chr>,
## # text <chr>
The most important function of the tidytext package is the unnest_tokens function. It tokenizes the text variable into words (or other tokens) and creates one row per token, which makes the data frame tidy. By default, the unnest_tokens function transforms all characters to lower case.
tidy_df <- tidied_corpus %>%
  unnest_tokens(word, text)

tidy_df %>%
  select(manifesto_id, word) %>%
  head(15)
## # A tibble: 15 x 2
## manifesto_id word
## <chr> <chr>
## 1 53110_201602 think
## 2 53110_201602 ahead
## 3 53110_201602 act
## 4 53110_201602 now
## 5 53110_201602 general
## 6 53110_201602 election
## 7 53110_201602 manifesto
## 8 53110_201602 2016
## 9 53110_201602 progressive
## 10 53110_201602 practical
## 11 53110_201602 and
## 12 53110_201602 sustainable
## 13 53110_201602 politics
## 14 53110_201602 for
## 15 53110_201602 the
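The "other tokens" and the lower-casing default mentioned above can be sketched on a made-up one-line document. The example assumes the to_lower argument and the token = "ngrams" option of unnest_tokens. The toy text and object names are hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-line document (illustrative only)
toy_docs <- tibble(doc = "a", text = "Think Ahead Act Now")

# Keep the original capitalisation instead of lower-casing
toy_words <- toy_docs %>%
  unnest_tokens(word, text, to_lower = FALSE)

# Tokenize into two-word sequences (bigrams) instead of single words
toy_bigrams <- toy_docs %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

toy_words
toy_bigrams
```

Bigrams are still one token per row, so the same dplyr verbs apply to them as to single words.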
Cleaning and preprocessing
The tidy format allows us to use the dplyr grammar to preprocess and clean the data. To delete stop words, we make use of a stop word collection that comes with the tidytext package. get_stopwords() is a tidytext function that returns a data frame with a list of stop words (frequent words that carry little meaning).
get_stopwords()
## # A tibble: 175 x 2
## word lexicon
## <chr> <chr>
## 1 i snowball
## 2 me snowball
## 3 my snowball
## 4 myself snowball
## 5 we snowball
## 6 our snowball
## 7 ours snowball
## 8 ourselves snowball
## 9 you snowball
## 10 your snowball
## # … with 165 more rows
anti_join here keeps only the words that do not appear in the data frame provided as its argument. Another advantage of the tidy text format is that one can easily filter for certain characteristics. Here, we show how to drop tokens that consist of numbers only. The expression is.na(as.numeric(word)) is TRUE for words that cannot be converted to numeric values, so filtering on it removes all tokens that contain only numbers (such as the “2016” in the example above).
tidy_without_stopwords <- tidy_df %>%
  anti_join(get_stopwords()) %>%
  filter(is.na(as.numeric(word)))

tidy_without_stopwords %>%
  select(manifesto_id, word) %>%
  head(10)
## # A tibble: 10 x 2
## manifesto_id word
## <chr> <chr>
## 1 53110_201602 think
## 2 53110_201602 ahead
## 3 53110_201602 act
## 4 53110_201602 now
## 5 53110_201602 general
## 6 53110_201602 election
## 7 53110_201602 manifesto
## 8 53110_201602 progressive
## 9 53110_201602 practical
## 10 53110_201602 sustainable
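The two cleaning steps can be checked in isolation on a handful of made-up tokens. This is only a sketch: the stop word list below is a hypothetical two-word stand-in for get_stopwords(), and suppressWarnings() is added merely to silence the coercion warning that as.numeric() raises for non-numeric strings.

```r
library(dplyr)

# Hypothetical tokens and a tiny custom stop word list (illustrative only)
toy_tokens <- tibble(word = c("the", "election", "2016", "of", "public", "services"))
toy_stops  <- tibble(word = c("the", "of"))

toy_clean <- toy_tokens %>%
  # keep rows of toy_tokens that have no match in toy_stops
  anti_join(toy_stops, by = "word") %>%
  # as.numeric() yields NA for non-numbers, so this drops "2016"
  filter(is.na(suppressWarnings(as.numeric(word))))

toy_clean
```

Only "election", "public" and "services" remain. Passing by = "word" explicitly also avoids the message that anti_join prints when it has to guess the join column.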
Term frequencies and Tf-Idf
Using the count function on the tidied data, it is very easy to obtain term frequencies of the corpus under investigation.
tidy_without_stopwords %>%
  count(word, sort = TRUE) %>%
  head(10)
## # A tibble: 10 x 2
## word n
## <chr> <int>
## 1 new 847
## 2 people 704
## 3 ireland 686
## 4 public 678
## 5 ensure 674
## 6 government 620
## 7 support 615
## 8 fine 578
## 9 gael 556
## 10 services 536
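The same idea extends to frequencies per document by adding the document identifier to count. The following sketch uses made-up tokens. The share column (a term's count divided by all tokens in its document) and the object names are assumptions for illustration.

```r
library(dplyr)

# Hypothetical tidy tokens from two documents (illustrative only)
toy_tokens <- tibble(
  doc  = c("a", "a", "a", "b", "b"),
  word = c("public", "public", "housing", "public", "farming")
)

toy_freq <- toy_tokens %>%
  count(doc, word) %>%          # term frequency per document
  group_by(doc) %>%
  mutate(share = n / sum(n)) %>% # relative frequency within each document
  ungroup()

toy_freq
```

Relative shares make documents of different length comparable, which raw counts do not.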
Overall term frequencies (even when calculated per document) are often not very informative, as they do not differ much across documents. Many applications therefore calculate the tf-idf score (term frequency-inverse document frequency). It identifies words that appear often within one document but rarely in other documents, that is, words that are both frequent and distinctive. tidytext has a function bind_tf_idf that adds the tf-idf score to a data frame containing term frequencies and document metadata.
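A small worked example shows what the score does. In tidytext, tf is a term's count divided by the total count in its document, and idf is the natural logarithm of the number of documents divided by the number of documents containing the term. The counts below are made up, and the object names are hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical per-document term counts (illustrative only)
toy_counts <- tibble(
  doc  = c("a", "a", "b", "b"),
  word = c("public", "housing", "public", "farming"),
  n    = c(3L, 1L, 2L, 2L)
)

toy_scored <- toy_counts %>%
  bind_tf_idf(word, doc, n)

toy_scored
```

"public" occurs in both documents, so its idf is ln(2/2) = 0 and its tf-idf is 0 even though it is the most frequent word. "housing" occurs only in document a, so its tf-idf is 0.25 * ln(2).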
Before calculating the tf-idf score, we get nicer document names based on the party names stored in the Manifesto Project Dataset.
irish_partynames <- mp_maindataset() %>%
  filter(countryname == "Ireland" & date == 201602) %>%
  select(party, partyname)
The following shows how to calculate tf-idf scores and plot the 5 highest scoring terms for each manifesto. For more information on tf-idf scores, have a look at the respective chapter in the tidy text book.
tidy_without_stopwords %>%
  count(party, word, sort = TRUE) %>%
  bind_tf_idf(word, party, n = n) %>%
  # sort by party, then by descending tf-idf (desc() takes a single vector)
  arrange(party, desc(tf_idf)) %>%
  # mutate(word = factor(word, levels = rev(unique(word)), ordered=T)) %>%
  group_by(party) %>%
  top_n(5) %>%
  ungroup() %>%
  left_join(irish_partynames, by = "party") %>%
  ggplot(aes(x = reorder(word, tf_idf), y = tf_idf, fill = partyname)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~partyname, ncol = 2, scales = "free") +
  coord_flip()
## Selecting by tf_idf
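The message above appears because top_n was not told which column to rank by and therefore picked the last one. top_n also keeps all tied rows, so a party can end up with more than five terms. A possible alternative, assuming dplyr 1.0.0 or later, is slice_max with with_ties = FALSE. The sketch below uses made-up scores and hypothetical object names.

```r
library(dplyr)

# Hypothetical tf-idf scores with a tie in party 1 (illustrative only)
toy_scores <- tibble(
  party  = c(1, 1, 1, 2, 2, 2),
  word   = c("a", "b", "c", "d", "e", "f"),
  tf_idf = c(0.3, 0.2, 0.2, 0.5, 0.4, 0.1)
)

# top_n keeps both tied rows, so party 1 contributes three rows
with_top_n <- toy_scores %>%
  group_by(party) %>%
  top_n(2, tf_idf) %>%
  ungroup()

# slice_max with with_ties = FALSE returns exactly two rows per party
with_slice_max <- toy_scores %>%
  group_by(party) %>%
  slice_max(tf_idf, n = 2, with_ties = FALSE) %>%
  ungroup()

nrow(with_top_n)
nrow(with_slice_max)
```

Naming the ranking column explicitly in either function also suppresses the "Selecting by" message.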