From last time…

  1. Learned how to acquire text data using APIs.

  2. Learned how to clean and prepare text data for analysis.

  3. Built a document-term matrix.

For today

  1. Further text processing.
  2. Building supervised machine learning classifiers with text data.
  3. Assessing the performance of classifiers.

Building a pipeline

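library(tm)

# Clean a list of tweet objects and return a tm corpus.
# Each element is expected to expose getText() (e.g. a twitteR status object).
text_cleaner <- function(corpus){
  # Pull out the raw text of each tweet and lowercase it
  tempcorpus <- lapply(corpus, function(t) t$getText())
  tempcorpus <- lapply(tempcorpus, tolower)
  tempcorpus <- Corpus(VectorSource(tempcorpus))
  # Strip punctuation, extra whitespace, digits and English stop words
  tempcorpus <- tm_map(tempcorpus, removePunctuation)
  tempcorpus <- tm_map(tempcorpus, stripWhitespace)
  tempcorpus <- tm_map(tempcorpus, removeNumbers)
  tempcorpus <- tm_map(tempcorpus, removeWords, stopwords("english"))
  # Reduce words to their stems (requires the SnowballC package)
  tempcorpus <- tm_map(tempcorpus, stemDocument)
  return(tempcorpus)
}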

Building a document-term matrix

inspect(dtm[1:5, 1:5])
## <<DocumentTermMatrix (documents: 5, terms: 5)>>
## Non-/sparse entries: 0/25
## Sparsity           : 100%
## Maximal term length: 14
## Weighting          : term frequency (tf)
## 
##     Terms
## Docs abandon abc abcdonaldtrump abcpolit abdeslam
##    1       0   0              0        0        0
##    2       0   0              0        0        0
##    3       0   0              0        0        0
##    4       0   0              0        0        0
##    5       0   0              0        0        0
newcorpus[[2]]$content
## [1] " done even better elect possibl winner base popular vote campaign differ"
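
The call that created `dtm` is not shown above. A minimal sketch of how a document-term matrix can be built from a cleaned corpus, assuming the `tm` package and using a hypothetical three-document corpus (`toy_docs`) in place of the tweets:

```r
library(tm)

# Hypothetical, already-cleaned documents standing in for the tweet corpus
toy_docs <- c("great ralli tonight",
              "great crowd great night",
              "thank you ohio")
toy_corpus <- Corpus(VectorSource(toy_docs))

# Rows are documents, columns are terms, cells are raw term counts
toy_dtm <- DocumentTermMatrix(toy_corpus)
inspect(toy_dtm)

# A dense copy makes individual cells easy to look up
toy_mat <- as.matrix(toy_dtm)
toy_mat[2, "great"]
```

With the real data, the same call would presumably be applied to `newcorpus`, the output of `text_cleaner`.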

Sparse Document Term Matrices

Sparsity reduction

dtm<-removeSparseTerms(dtm,0.95)
dtm
## <<DocumentTermMatrix (documents: 3917, terms: 16)>>
## Non-/sparse entries: 5676/56996
## Sparsity           : 91%
## Maximal term length: 21
## Weighting          : term frequency (tf)
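
To see what the 0.95 threshold does, here is a base-R sketch of the idea behind `removeSparseTerms`: a term is dropped when it is absent from too large a share of documents. The matrix `toy` and its terms are hypothetical, and the code mimics the rule rather than calling `tm`.

```r
# Hypothetical count matrix: 40 documents (rows) by 3 terms (columns)
toy <- matrix(0, nrow = 40, ncol = 3,
              dimnames = list(NULL, c("great", "ralli", "abdeslam")))
toy[1:20, "great"]    <- 1  # present in 50% of documents
toy[1:4,  "ralli"]    <- 1  # present in 10% of documents
toy[1,    "abdeslam"] <- 1  # present in 2.5% of documents

# Share of documents in which each term is absent
term_sparsity <- colMeans(toy == 0)
term_sparsity

# Keep only terms whose sparsity is below the 0.95 threshold
kept <- toy[, term_sparsity < 0.95, drop = FALSE]
colnames(kept)
```

Here `abdeslam` is absent from 97.5% of documents, so it is the only term removed. A lower threshold keeps fewer terms.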

TF-IDF: Term-Frequency, Inverse Document Frequency

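Term frequency: the number of times term i appears in document j, divided by the number of words in document j.

\[TF_{i,j} = \frac{\sum_{k=1}^{n_{j}}1(w_{k,j} = t_{i})}{n_{j}}\]

where \(w_{k,j}\) is the k-th word of document j, \(t_{i}\) is term i, and \(n_{j}\) is the number of words in document j.

Inverse document frequency: measures the importance of a term in a corpus. It is the log of the total number of documents N divided by the number of documents containing term i.

\[IDF_{i} = \ln\left(\frac{N}{\sum_{j=1}^{N}1(t_{i} \in d_{j})}\right)\]

TF-IDF Matrix

\[\text{TF-IDF}_{i,j}=TF_{i,j} \times IDF_{i} \]

Collecting these weights with one row per document and one column per term gives the matrix

\[\text{TF-IDF} \in \mathbb{R}^{N \times w}\]

where w is the number of distinct terms.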
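
A small worked example may help here. The sketch below computes term frequency, inverse document frequency and their product in base R on a hypothetical count matrix (`counts`), following the definitions above with a natural log. As far as I know, `weightTfIdf` in `tm` uses a base-2 log, so its values would differ by a constant factor.

```r
# Hypothetical counts: 3 documents (rows) by 3 terms (columns)
counts <- matrix(c(2, 1, 0,
                   1, 0, 1,
                   1, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:3),
                                 c("great", "hillari", "make")))

# TF: each count divided by the length of its document
tf <- counts / rowSums(counts)

# IDF: log of N over the number of documents containing each term
N <- nrow(counts)
idf <- log(N / colSums(counts > 0))

# TF-IDF: scale each term's column of TF by that term's IDF
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 3)
```

The term `great` appears in every document, so its IDF is log(1) = 0 and its whole column is zero: a word that is everywhere carries no weight.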

dtm <- DocumentTermMatrix(newcorpus, control = list(weighting = weightTfIdf))
## Warning in weighting(x): empty document(s): 2513
dtm = removeSparseTerms(dtm, 0.95)
inspect(dtm[1:5,4:8])
## <<DocumentTermMatrix (documents: 5, terms: 5)>>
## Non-/sparse entries: 5/20
## Sparsity           : 80%
## Maximal term length: 7
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
## 
##     Terms
## Docs       get     great   hillari just make
##    1 0.0000000 0.0000000 0.2606815    0    0
##    2 0.0000000 0.0000000 0.0000000    0    0
##    3 0.0000000 0.0000000 0.2406291    0    0
##    4 0.2921460 0.4064887 0.0000000    0    0
##    5 0.3146187 0.0000000 0.0000000    0    0

Some fun with words

summary(trumptweets$Retweets)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3030    5424    7597    9366  349900
viraltweets<-ifelse(trumptweets$Retweets > 9366, 1,0)
nonviraltweets<-ifelse(trumptweets$Retweets < 3030, 1,0)
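
The cutoffs 9366 and 3030 are the third and first quartiles from the summary above. A sketch of computing them with `quantile()` instead of typing them in, using a hypothetical vector `retweets` in place of `trumptweets$Retweets`:

```r
# Hypothetical retweet counts standing in for trumptweets$Retweets
retweets <- c(0, 1200, 3030, 4100, 5424, 6800, 9366, 15000, 349900)

# First and third quartiles, computed from the data
q <- quantile(retweets, probs = c(0.25, 0.75))

# 1 if the tweet is above the third quartile (or below the first), else 0
viral    <- ifelse(retweets > q[["75%"]], 1, 0)
nonviral <- ifelse(retweets < q[["25%"]], 1, 0)

table(viral)
```

Computing the cutoffs this way keeps the labels in step with the data if the set of tweets changes.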

Plot a word cloud

Wordcloud for High Retweet Trump Tweets
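
The plotting code for this figure is not shown. One way such a cloud could be drawn, assuming the `wordcloud` package and a hypothetical named vector of term counts (`freq`) in place of the real totals for the high-retweet tweets:

```r
library(wordcloud)

# Hypothetical term totals; with the real data these might come from
# something like colSums(as.matrix(dtm)[viraltweets == 1, ])
freq <- c(thank = 40, make = 60, great = 120, america = 58, hillari = 95)

# Sort so the most frequent terms are placed first
freq <- sort(freq, decreasing = TRUE)

# Word size is proportional to frequency
wordcloud(words = names(freq), freq = freq, min.freq = 1,
          max.words = 50, random.order = FALSE)
```

The low-retweet cloud would follow the same pattern with the rows where `nonviraltweets == 1`.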

Plot a word cloud

Wordcloud for Low Retweet Trump Tweets

Supervised machine learning with text data

Step back - Assessing the performance of classifiers

A classifier can make two kinds of errors:

  1. It can incorrectly classify a tweet as one that will “go viral” when it does not go viral (false positive).

  2. It can incorrectly classify a tweet as one that will “not go viral” when it does go viral (false negative).

Confusion matrix

\begin{tabular}{l|ll}
 & Actual 1 & Actual 0 \\
\hline
Predicted 1 & True positive & False positive \\
Predicted 0 & False negative & True negative \\
\end{tabular}

Sensitivity, specificity and accuracy
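
With "viral" (1) as the positive class, sensitivity is the share of truly viral tweets that are predicted viral, TP / (TP + FN). Specificity is the share of truly non-viral tweets that are predicted non-viral, TN / (TN + FP). Accuracy is the share of all tweets classified correctly. A minimal sketch in base R, where `classifier_metrics` is a hypothetical helper and the label vectors are made up:

```r
# Accuracy, sensitivity and specificity from predicted and actual 0/1 labels
classifier_metrics <- function(predicted, actual) {
  tp <- sum(predicted == 1 & actual == 1)  # true positives
  tn <- sum(predicted == 0 & actual == 0)  # true negatives
  fp <- sum(predicted == 1 & actual == 0)  # false positives
  fn <- sum(predicted == 0 & actual == 1)  # false negatives
  c(accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

# Hypothetical labels for eight tweets
predicted <- c(1, 1, 0, 0, 0, 0, 1, 0)
actual    <- c(1, 0, 1, 0, 0, 0, 1, 1)
classifier_metrics(predicted, actual)
```

A classifier that almost never predicts "viral" can score well on accuracy and specificity while its sensitivity stays low, which is why all three are reported.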

Classifying Text Data With Logistic Regression

\[ logit(E[C| W,X]) = \theta_{0} + \theta_{1}w_{1} + \cdots + \theta_{n}w_{n} \]

\[ P(C_{i} = 1 | W_{i},X_{i}) = \frac{\exp(\theta W_{i})}{1 + \exp(\theta W_{i})} \]

Assign each document the class label \(\hat{C}_{i}\) with the highest predicted probability:

\[ \hat{C}_{i} = \arg\max_{k} P(C_{i} = k | W_{i},X_{i}) \]
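
As a worked illustration of the two formulas, the sketch below turns a linear score into a probability and then into a class label. The coefficients in `theta` and the document rows in `W` are hypothetical, not fitted values.

```r
# Hypothetical coefficients: an intercept plus weights for two terms
theta <- c(intercept = -1.5, great = 2.0, hillari = 0.5)

# Two toy documents; the leading 1 multiplies the intercept
W <- rbind(doc1 = c(1, 0.4, 0.0),
           doc2 = c(1, 0.9, 0.6))

# Linear score for each document, then the logistic function
eta <- as.vector(W %*% theta)
p_viral <- exp(eta) / (1 + exp(eta))
p_viral

# With two classes, the arg max rule is the same as a 0.5 threshold
predicted_class <- ifelse(p_viral > 0.5, 1, 0)
predicted_class
```

The first document has a score of -0.7 and is labelled 0. The second has a score of 0.6 and is labelled 1.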

Sparse Logistic Regression

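\[ \arg\min_{\theta} \sum_{i=1}^{m} \log\left[1+\exp(-c_{i}h_{\theta}(w_{i}))\right] + \lambda \sum_{j=1}^{n} \theta_{j}^{2} \]

Here \(c_{i} \in \{-1, +1\}\) is the class label of document i, \(h_{\theta}(w_{i})\) is its linear score, and \(\lambda\) controls the strength of the penalty.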
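
To make the objective concrete, here is a base-R sketch that evaluates it on hypothetical data and minimizes it with the general-purpose `optim()`. The helper `penalized_loss`, the matrix `W` and the labels are illustrative assumptions. LiblineaR uses its own solver, so this shows the objective rather than the library's method.

```r
# Logistic loss plus an L2 penalty, for labels coded -1 / +1
penalized_loss <- function(theta, W, labels, lambda) {
  margin <- labels * as.vector(W %*% theta)
  sum(log(1 + exp(-margin))) + lambda * sum(theta^2)
}

# Hypothetical data: 3 documents, 2 term weights
W <- rbind(c(0.4, 0.0),
           c(0.9, 0.6),
           c(0.0, 0.3))
labels <- c(-1, 1, -1)

# At theta = 0 every document contributes log(2) and the penalty is zero
penalized_loss(c(0, 0), W, labels, lambda = 1)

# Minimize under a weak and a strong penalty
fit_weak   <- optim(c(0, 0), penalized_loss,
                    W = W, labels = labels, lambda = 0.1)
fit_strong <- optim(c(0, 0), penalized_loss,
                    W = W, labels = labels, lambda = 10)

# A larger lambda pulls the coefficients toward zero
rbind(weak = fit_weak$par, strong = fit_strong$par)
```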

Sparse Logistic Regression With Trump’s Tweets

Step 1: Divide the data into training and testing.

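# Randomly assign half of the tweets to the training set
# (call set.seed() first to make the split reproducible)
n <- nrow(trumptweets)
train <- sample(1:n, floor(n * 0.5))
# LiblineaR needs a regular matrix rather than a DocumentTermMatrix
dtm_mat <- as.matrix(dtm)
trainX <- dtm_mat[train, ]
testX <- dtm_mat[-train, ]
trainY <- viraltweets[train]
testY <- viraltweets[-train]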

Step 2: Train the model on the training set. This is a sparse logistic regression model with an \(L_{2}\) penalty on the coefficients (type=7 in LiblineaR).

library(LiblineaR)
m=LiblineaR(data=trainX,target=trainY,
            type=7,bias=TRUE,verbose=FALSE)

Step 3: Make a prediction using the test data.

p=predict(m,testX)
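
By default `predict` returns only class labels. For logistic regression models, LiblineaR can also return class probabilities with `proba = TRUE`. A self-contained sketch on hypothetical toy data (`toyX`, `toyY`), not on the tweets:

```r
library(LiblineaR)

set.seed(1)
# Hypothetical data: 40 documents, 2 term weights; the label depends on "great"
toyX <- matrix(runif(80), ncol = 2,
               dimnames = list(NULL, c("great", "make")))
toyY <- ifelse(toyX[, "great"] > 0.5, 1, 0)

# Same model type as in Step 2
toy_m <- LiblineaR(data = toyX, target = toyY,
                   type = 7, bias = TRUE, verbose = FALSE)

# Class labels plus one probability column per class
toy_p <- predict(toy_m, toyX, proba = TRUE)
head(toy_p$predictions)
head(toy_p$probabilities)
```

Probabilities make it possible to use a cutoff other than 0.5, for example to trade some specificity for more sensitivity.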

Step 4: Display the confusion matrix.

confusion=table(p$predictions,
          testY)
confusion
##    testY
##        0    1
##   0 1451  435
##   1   33   40

Step 5: Calculate accuracy, specificity and sensitivity

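# Rows of the table are predictions, columns are the true labels
accuracy<-sum(diag(confusion))/sum(confusion)
accuracy
## [1] 0.7611026
# Sensitivity: share of truly viral tweets that were predicted viral
sensitivity<-confusion["1","1"]/sum(confusion[,"1"])
sensitivity
## [1] 0.08421053
# Specificity: share of truly non-viral tweets that were predicted non-viral
specificity<-confusion["0","0"]/sum(confusion[,"0"])
specificity
## [1] 0.9777628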