In this R tutorial, we will analyze the most common words in Nights One and Two of the first 2019 Democratic debate. We will clean the debate transcripts with text-analysis tools and visualize the most frequently spoken words using a Wordcloud.
2019 Democratic Debate Night One Candidates:
- Cory Booker – Senator from NJ
- Beto O’Rourke – Former congressman from TX
- Elizabeth Warren – Senator from MA
- Julián Castro – Former HUD Secretary
- Amy Klobuchar – Senator from MN
- Tulsi Gabbard – Congresswoman from HI
- Tim Ryan – Congressman from OH
- John Delaney – Former congressman from MD
- Bill de Blasio – Mayor of NYC
- Jay Inslee – Governor of WA
2019 Democratic Debate Night Two Candidates:
- Joe Biden – Former Vice President
- Kamala Harris – Senator from CA
- Bernie Sanders – Senator from VT
- Pete Buttigieg – Mayor of South Bend, IN
- Michael Bennet – Senator from CO
- Kirsten Gillibrand – Senator from NY
- John Hickenlooper – Former governor of CO
- Marianne Williamson – Author
- Eric Swalwell – Congressman from CA
- Andrew Yang – Businessman
Install and Load Packages
Below are the packages and libraries that we will need to load to complete this tutorial.
Input:
install.packages("tm")
install.packages("wordcloud")
install.packages("NLP")
install.packages("RColorBrewer")
install.packages("e1071")
install.packages("SnowballC")
install.packages("gmodels")

library(tm)
library(wordcloud)
library(NLP)
library(RColorBrewer)
library(e1071)
library(SnowballC)
library(gmodels)
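Calling install.packages() on every run reinstalls packages that are already present. A minimal sketch of installing only what is missing follows; `missing_packages()` is a hypothetical helper written for this illustration, not part of any of the packages above.

```r
# Packages the tutorial loads (same list as above).
pkgs <- c("tm", "wordcloud", "NLP", "RColorBrewer",
          "e1071", "SnowballC", "gmodels")

# Hypothetical helper: return the names in `p` that are not installed yet.
missing_packages <- function(p) {
  setdiff(p, rownames(installed.packages()))
}

to_install <- missing_packages(pkgs)
print(to_install)

# Install only in an interactive session, so that scripted runs
# do not trigger downloads unexpectedly.
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```

The library() calls are still needed afterwards, because installing a package does not attach it.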
Download and Load 2019 First Democratic Debates
Since we will be using the transcripts of the first 2019 Democratic debates, you will need to download both datasets. They are packaged for easy download from the dataset page, or directly from these links: Democratic Debate Night One 2019 and Democratic Debate Night Two 2019.
Night One, 2019 First Democratic Debate
Date and Time: Wednesday, June 26, 9 to 11 p.m. ET
Location: Adrienne Arsht Center for the Performing Arts in Miami, Florida
Input:
democratic_debate_night_one <- read.csv("democratic_debate_night_one_2019.csv", stringsAsFactors = FALSE)
View the Night One Democratic Debate Dataset
head() function
To get an idea of the data being processed, we can use the head() function to print the first 10 rows of the transcript.
Input:
head(democratic_debate_night_one, 10)
Output:
   Transcript
1  Good evening everyone, I am Lester Holt and welcome to the first democratic debate in the 2020 race for president.
2  Hi, Im Savannah Guthrie and tonight its our first chance to see these candidates go head to head onstage together. We will be joined in our questioning tonight by our colleagues, Jose Diaz-Balart, Chuck Todd and Rachel Maddow.
3  Voters are trying to nail down where the candidates and on the issues, what sets them apart, and which of these presidential hopefuls has what it takes
4  Well, now its time to find out.
5  Tonight round one. New Jersey Senator Cory Booker, Former Housing Secretary Julian Castro, New York City Mayor Bill De Blasio, Former Maryland Congressman John Delaney, Hawaii Congresswoman Tulsi Gabbard, Washington Governor Jay Inslee, Minnesota Senator Amy Klobuchar, Former Texas Congressman Beto ORourke, Ohio Congressman Tim Ryan and Massachusetts Senator Elizabeth Warren. From NBC News Decision 2020 the Democratic candidates debate live from the Adrienne Arsht Performing Arts Center in Miami, Florida.
6  Good evening again everyone. Welcome to the candidates and to our audience here in Miami here in the Arsht Center and all across the country. Tonight we are going to take on many of the most pressing issues of the moment including immigration, the situation unfolding at our border, and the treatment of migrant children.
7  And we are going to talk about the tensions with Iran, climate change, and of course we will talk about the economy, those kitchen table issues so many Americans face every day.
8  And some quick rules of the road before we begin. Twenty candidates qualified for this first debate. We will hear from 10 tonight and 10 more tomorrow. The breakdown for each was selected at random. The candidates will have 60 seconds to answer and 30 seconds for any follow-ups.
9  Because of this large field not every person will be able to comment on every of topic but over the course of the next two hours we will hear from everyone. We would also like to ask the audience to keep their reactions to a minimum. We are not going to be shy about making sure the candidates stick to time tonight.
10 All right. So with a business out of the we want to get to it we will start this evening with Senator Elizabeth Warren. Senator, good evening to you.
str() function
An alternative way to print sample data is using the str() function. The str() command displays the internal structure of an R object.
Input:
str(democratic_debate_night_one)
Output:
'data.frame': 541 obs. of 1 variable:
 $ Transcript: chr "Good evening everyone, I am Lester Holt and welcome to the first democratic debate in the 2020 race for president." "Hi, Im Savannah Guthrie and tonight its our first chance to see these candidates go head to head onstage togeth"| __truncated__ "Voters are trying to nail down where the candidates and on the issues, what sets them apart, and which of these"| __truncated__ "Well, now its time to find out." ...
Cleaning the Democratic Debate transcripts
The raw transcript files need some cleaning before we can build a Wordcloud from them. The main cleaning topics are listed below:
What will be removed from the Democratic Debate transcripts?
- remove words/stop words
- remove white spaces
- remove punctuation
- returning words to root form
Build the Democratic Debate Wordcloud
Before the Wordcloud can be built, the following tasks must be completed to clean the Democratic Debate transcripts:
- transform letters to lowercase
- remove numbers
- remove stop words
- remove all punctuation
- reduce words to their stems using Porter’s stemming algorithm (the code below does not apply this step)
- remove white spaces
Input:
# Build a corpus with one document per transcript row
democratic_debate_night_one_transcripts <- VCorpus(VectorSource(democratic_debate_night_one$Transcript))

# Lowercase, then remove numbers, stop words, punctuation, and extra white space
democratic_debate_night_one_transcripts_corpus_clean <- tm_map(democratic_debate_night_one_transcripts, content_transformer(tolower))
democratic_debate_night_one_transcripts_corpus_clean <- tm_map(democratic_debate_night_one_transcripts_corpus_clean, removeNumbers)
democratic_debate_night_one_transcripts_corpus_clean <- tm_map(democratic_debate_night_one_transcripts_corpus_clean, removeWords, stopwords())
democratic_debate_night_one_transcripts_corpus_clean <- tm_map(democratic_debate_night_one_transcripts_corpus_clean, removePunctuation)
democratic_debate_night_one_transcripts_corpus_clean <- tm_map(democratic_debate_night_one_transcripts_corpus_clean, stripWhitespace)
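The task list mentions stemming, but the code above stops at white space removal. A minimal sketch of that extra step on a made-up sentence (the `toy` corpus is an illustration, not debate data), using tm's stemDocument(), which relies on the SnowballC package:

```r
library(tm)
library(SnowballC)

# Made-up one-sentence corpus, for illustration only.
toy <- VCorpus(VectorSource("The candidates debated taxes and taxation."))
toy <- tm_map(toy, content_transformer(tolower))
toy <- tm_map(toy, removePunctuation)

# Text before stemming.
before <- content(toy[[1]])

# stemDocument() reduces each word to its stem.
toy_stemmed <- tm_map(toy, stemDocument)
after <- content(toy_stemmed[[1]])

print(before)
print(after)
```

Stems are not always dictionary words, so a stemmed Wordcloud can be harder to read. To try it on the debate corpus, one more tm_map() call with stemDocument after the punctuation step should be enough.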
Now that we have cleaned the Democratic Debate transcripts for 2019, we can create the Wordcloud for the transcripts.
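A Wordcloud scales words by frequency but hides the actual counts. As a sketch, the counts can be read from a term-document matrix; the two-document corpus here is a made-up stand-in for the cleaned debate corpus.

```r
library(tm)

# Made-up stand-in for a cleaned corpus.
toy <- VCorpus(VectorSource(c("healthcare healthcare jobs",
                              "healthcare economy jobs")))

# Rows are terms, columns are documents.
tdm <- TermDocumentMatrix(toy)

# Sum each term across documents and sort from most to least frequent.
term_counts <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
print(head(term_counts, 10))
```

Replacing `toy` with democratic_debate_night_one_transcripts_corpus_clean should give the counts behind the Night One cloud.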
2019 First Democratic Debate Wordcloud, Night One
Input:
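# Fix the random seed so the word layout is reproducible
set.seed(1234)
color <- brewer.pal(8, "RdBu")
# wordcloud() has no width/height arguments; the plot size comes from the graphics device
wordcloud(democratic_debate_night_one_transcripts_corpus_clean, min.freq = 4, colors = color, random.order = FALSE)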
Output:
Night Two, 2019 First Democratic Debate
Date and Time: Thursday, June 27, 9 to 11 p.m. ET
Location: Adrienne Arsht Center for the Performing Arts in Miami, Florida
Input:
democratic_debate_night_two <- read.csv("democratic_debate_night_two_2019.csv", stringsAsFactors = FALSE)
View the Night Two Democratic Debate Dataset
head() function
Input:
head(democratic_debate_night_two, 10)
Output:
   Transcripts
1  And good evening once again. Welcome to the candidates and our spirited audience here tonight in the Arsht Center and across America. Tonight we continue the spirited debate about the future of the country, how to tackle our most pressing problems and getting to the heart of the biggest issues in this Democratic primary.
2  Tonight we are going to talk about healthcare, immigration. Were also to dive into the economy, jobs, climate change as well.
3  As a quick rules of the road before we begin and they may sound familiar 20 candidates calqualified for this first debate. As we said we heard from 10 last night and we will hear from 10 more tonight. The breakdown for each night was selected at random. The candidates will have 60 seconds to answer, 30 seconds for follow-ups.
4  And because of the large field of candidates not every person is going to be able to weigh in on every topic but over the course of the next two hours we will hear from everyone.
5  This is your last free article.
6  Subscribe to The Times
7  And we love our audience but we would like to ask them to keep their reactions to a minimum and we are not going to hold back making sure that candidates stick to time.
8  So with that business take caretaken care of lets get to it and we are going to start today with Senator Sanders. Good evening to you.
9  You have called for big new government benefits like universal healthcare and free college. In a recent interview you said you suspected that Americans would be quote delighted to pay more taxes for things like that. My question to you is will taxes go up for the middle class in a Sanders administration? And if so how do you sell that to voters?
10 Well, you are quite right. We have a new vision for America and at a time when we have three people in this country owning more wealth than the bottom half of America, while 500,000 people are sleeping out on the streets today we think it is time for change, real change.
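Rows 5 and 6 of this output ("This is your last free article." and "Subscribe to The Times") look like website residue rather than debate speech. A minimal sketch of dropping such rows before building the corpus, using a made-up stand-in data frame; the `residue` vector lists only the two lines visible above, so any other leftovers in the file would have to be added by hand.

```r
# Made-up stand-in with the same column name as the Night Two file.
night_two <- data.frame(
  Transcripts = c("And good evening once again.",
                  "This is your last free article.",
                  "Subscribe to The Times",
                  "Well, you are quite right."),
  stringsAsFactors = FALSE
)

# Exact lines to drop (taken from the head() output above).
residue <- c("This is your last free article.",
             "Subscribe to The Times")

# Trim surrounding white space before comparing, then keep the other rows.
keep <- !(trimws(night_two$Transcripts) %in% residue)
night_two_clean <- night_two[keep, , drop = FALSE]

print(nrow(night_two_clean))
```

Using `drop = FALSE` keeps the result a data frame even though it has a single column.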
str() function
Input:
str(democratic_debate_night_two)
Output:
'data.frame': 705 obs. of 1 variable:
 $ Transcripts: chr "And good evening once again. Welcome to the candidates and our spirited audience here tonight in the Arsht Cent"| __truncated__ "Tonight we are going to talk about healthcare, immigration. Were also to dive into the economy, jobs, climate change as well." "As a quick rules of the road before we begin and they may sound familiar 20 candidates calqualified for this fi"| __truncated__ "And because of the large field of candidates not every person is going to be able to weigh in on every topic bu"| __truncated__ ...
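Note that the text column is named Transcript in the Night One file but Transcripts in the Night Two file. The `$` operator on a data frame matches names partially, so `$Transcript` resolves for both files, but spelling out the exact name avoids relying on that. A small sketch with made-up stand-in data frames:

```r
# Made-up stand-ins that mirror the two column names.
night_one <- data.frame(Transcript = "Good evening everyone.",
                        stringsAsFactors = FALSE)
night_two <- data.frame(Transcripts = "And good evening once again.",
                        stringsAsFactors = FALSE)

# Check the actual column names before referring to them.
print(names(night_one))
print(names(night_two))

# $ matches partially, so the shorter name still finds the column.
partial_lookup <- night_two$Transcript

# [[ ]] matches exactly, so the shorter name returns NULL
# instead of silently matching a longer one.
exact_lookup <- night_two[["Transcript"]]

print(is.null(exact_lookup))
```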
Cleaning the Democratic Debate, Night Two Transcripts
Input:
# The Night Two text column is named Transcripts (see the str() output above)
democratic_debate_night_two_transcripts <- VCorpus(VectorSource(democratic_debate_night_two$Transcripts))

# Lowercase, then remove numbers, stop words, punctuation, and extra white space
democratic_debate_night_two_transcripts_corpus_clean <- tm_map(democratic_debate_night_two_transcripts, content_transformer(tolower))
democratic_debate_night_two_transcripts_corpus_clean <- tm_map(democratic_debate_night_two_transcripts_corpus_clean, removeNumbers)
democratic_debate_night_two_transcripts_corpus_clean <- tm_map(democratic_debate_night_two_transcripts_corpus_clean, removeWords, stopwords())
democratic_debate_night_two_transcripts_corpus_clean <- tm_map(democratic_debate_night_two_transcripts_corpus_clean, removePunctuation)
democratic_debate_night_two_transcripts_corpus_clean <- tm_map(democratic_debate_night_two_transcripts_corpus_clean, stripWhitespace)
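The Night Two steps repeat the Night One steps exactly. As a sketch, they could be wrapped in a helper so that each night needs a single call; `clean_transcripts()` is a hypothetical name chosen for this illustration, not a tm function.

```r
library(tm)

# Hypothetical helper: applies the same five cleaning steps, in the
# same order as the tutorial code, to a character vector of text.
clean_transcripts <- function(text) {
  corpus <- VCorpus(VectorSource(text))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords())
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}

# Made-up example line, for illustration only.
demo <- clean_transcripts("We will hear from 10 candidates tonight!")
out <- content(demo[[1]])
print(out)
```

With the helper defined, passing democratic_debate_night_two$Transcripts to clean_transcripts() should produce the same kind of corpus as the step-by-step code.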
Now that we have cleaned the Night Two Democratic Debate transcripts, we can create the Wordcloud for them.
2019 First Democratic Debate Wordcloud, Night Two
Input:
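# Fix the random seed so the word layout is reproducible
set.seed(1234)
color <- brewer.pal(8, "RdBu")
# wordcloud() has no width/height arguments; the plot size comes from the graphics device
wordcloud(democratic_debate_night_two_transcripts_corpus_clean, min.freq = 4, colors = color, random.order = FALSE)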
Output:
Hope you enjoyed this tutorial and have some fun with the Democratic Debate transcripts!