3/16/2017

From last time…

  1. Learned how to build a TF-IDF matrix.

  2. Ran one type of supervised learning classifier: regularized logistic regression.

For today

  1. More supervised learning with text: Naive Bayes and Support Vector Machines.

  2. Cross-validation for supervised machine learning.

Discriminative and Generative Classifiers (from Ng and Jordan (2002))

  • Generative classifiers: learn a model of the joint probability \(p(x,y)\), use Bayes' rule to calculate the posterior \(p(y|x)\), and then pick the most likely label \(y\) (e.g., Naive Bayes and Bayesian methods in general).

  • Discriminative classifiers: model the posterior \(p(y|x)\) directly (e.g., logistic regression, SVMs).

  • Ng and Jordan (2002) compare naive Bayes with logistic regression and find that the discriminative model has lower asymptotic error, but the generative model approaches its (higher) asymptotic error faster, so it can perform better when training data are limited.

  • In my personal experience, the results have been mixed.

Naive Bayes

\[ P(C = k|D) = \frac{P(D|C = k)P(C=k)}{P(D)} \]

  • Given a document D, we want to figure out the probability of the document belonging to a class C.

  • We can do this by using Bayes' theorem to directly calculate class probabilities given the words in a document.

Bayesian statistics terminology

  • Before we discuss the naive Bayes algorithm it's useful to know a little bit about the components of Bayes theorem.

  • \(P(C = k|D)\) is known as the posterior.
  • \(P(D|C = k)\) is known as the likelihood.
  • \(P(C = k)\) is known as the prior.
  • \(P(D)\) is known as the marginal likelihood or evidence.

For continuous distributions, probabilities become densities and the evidence is an integral over \(C\):

\[ \pi(C | D) = \frac{f_{D|C}(D|C)\pi(C)}{\int_{\Theta} f_{D|C}(D|C)\pi(C)\,dC} \]

For discrete distributions this just comes down to multiplying probabilities

\[ P(C = k|D) = \frac{P(D|C = k)P(C=k)}{P(D)} \]

  • \(D = \{w_{1},w_{2}, \cdots, w_{W}\}\), the \(W\) words in the document
  • \(C \in \{1,0\}\)

Thus…

\[ P(C = 1|D) = \frac{P(w_{1} \cap w_{2} \cap \cdots \cap w_{W} | C = 1) P(C = 1)}{P(w_{1} \cap w_{2} \cap \cdots \cap w_{W})} \]

Thus…

Likelihood: \[P(D|C = 1) = \prod_{i=1}^W P(w_{i}|C =1)\]

Prior: \[P(C = 1)= \frac{\# D \in C_{1}}{\# D \in C_{1},C_{2}}\]

Marginal likelihood: \[ P(D) = \prod_{i=1}^W P(w_{i}) \]

Assumptions

If we assume that the words are independent conditional on a document class then:

\[ P(C = 1|D) = \frac{[P(w_{1}|C=1)P(w_{2}|C=1)\cdots P(w_{W}| C = 1)] P(C = 1)}{P(w_{1})P(w_{2})\cdots P(w_{W})} \]

Where

\[P(w_{i} | C = 1) = \frac{\# w_{i} \in C_{1}}{\# \mathbf{w} \in C_{1}}\]

\[P(C = 1)= \frac{\# D \in C_{1}}{\# D \in C_{1},C_{2}}\]

\[P(w_{i})= \frac{\# w_{i} \in C_{1},C_{2}}{\# \mathbf{w} \in C_{1},C_{2}}\]
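The three ratios above are just counts divided by totals. As a minimal sketch in base R, assuming a tiny made-up corpus (the documents, labels, and the helper `count_words` are all hypothetical and not part of the tweet data):

```r
# Toy labelled corpus, made up for illustration (not the tweet data)
docs   <- c("tax cut great", "great win", "tax bad", "bad bad loss")
labels <- c(1, 1, 0, 0)   # 1 = class C_1, 0 = the other class

tokens <- strsplit(docs, " ")
vocab  <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word occurs in a set of documents
count_words <- function(tok) {
  vapply(vocab, function(w) sum(unlist(tok) == w), numeric(1))
}
counts_c1  <- count_words(tokens[labels == 1])
counts_all <- count_words(tokens)

# P(w_i | C = 1): occurrences of w_i in class 1 over all words in class 1
p_w_given_c1 <- counts_c1 / sum(counts_c1)
# P(C = 1): documents in class 1 over all documents
prior_c1 <- mean(labels == 1)
# P(w_i): occurrences of w_i in either class over all words
p_w <- counts_all / sum(counts_all)

round(p_w_given_c1, 2)
```

Note that "bad" and "loss" never occur in class 1 here, so their estimated probability is exactly zero.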

Classification

\[ \hat{C} = \arg\max_{k} P(C = k)\prod_{i=1}^W P(w_{i}|C =k) \]

  • For classification purposes, we can ignore the marginal likelihood, which is the same for every class, and assign classes based on the likelihood and the prior.

Classification

  • An alternative way of expressing this in the two-class case: if

\[ P(C = k | D) > \frac{1}{2}\]

  • then assign the document to class \(k\).
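The classification rule can be sketched in a few lines of base R. The word probabilities and the document below are hypothetical values chosen for illustration. The scores are summed in logs, a common way to avoid numerical underflow when many small probabilities are multiplied. The posterior is obtained here by normalising over the two classes.

```r
# Hypothetical word probabilities, assumed to be already estimated;
# rows are classes, columns are words
p_w_given_c <- rbind(
  "1" = c(tax = 0.5, great = 0.4, bad = 0.1),
  "0" = c(tax = 0.3, great = 0.1, bad = 0.6)
)
prior <- c("1" = 0.5, "0" = 0.5)

doc <- c("tax", "great", "great")   # words of the document to classify

# log P(C = k) + sum_i log P(w_i | C = k), one score per class
log_score <- log(prior) +
  rowSums(log(p_w_given_c[, doc, drop = FALSE]))

# Arg max over classes
predicted <- names(which.max(log_score))

# Normalising over the classes recovers P(C = k | D)
posterior <- exp(log_score) / sum(exp(log_score))
round(posterior, 3)
```

With two classes, picking the larger score and checking whether the posterior exceeds one half give the same answer.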

Laplace Smoothing

  • Words with zero probability can seriously damage the performance of the classifier.

  • To correct this problem we implement a Laplace smoother to ensure that there are no zero probability words.

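  • This amounts to adding 1 to each word count and adding the vocabulary size \(|V|\) to the denominator, so that the smoothed probabilities still sum to one, e.g.:

\[P(w_{i} | C = 1) = \frac{(\# w_{i} \in C_{1}) + 1}{(\# \mathbf{w} \in C_{1}) + |V|}\]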
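A small numerical sketch of add-one smoothing, using made-up counts for a three-word vocabulary. In the usual add-one form the denominator grows by the vocabulary size, which keeps the smoothed values a proper probability distribution:

```r
# Hypothetical word counts in class 1; "bad" never appears
counts_c1 <- c(tax = 3, great = 2, bad = 0)
V <- length(counts_c1)   # vocabulary size

# Without smoothing, "bad" has probability zero and would
# wipe out the whole product of word probabilities
unsmoothed <- counts_c1 / sum(counts_c1)

# Add 1 to every count; add V to the total so the result sums to one
smoothed <- (counts_c1 + 1) / (sum(counts_c1) + V)

rbind(unsmoothed, smoothed)
```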

Example: Tweet Sentiment

Recent Tweet from @POTUS: "We are going to reduce your taxes big league…I want to start that process so quickly…We've got to start the tax reductions."

  • \(C = \{+,-\}\)
  • \(N = 1000\) tweets
  • 500 \(+\) tweets, 500 \(-\) tweets

Example: Tweet Sentiment

Cleaned Tweet: "reduc tax big league start proces quick start tax reduc."


Recall from last time

summary(trumptweets$Retweets)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3030    5424    7597    9366  349900
viraltweets<-ifelse(trumptweets$Retweets > 9366, 1,0)
nonviraltweets<-ifelse(trumptweets$Retweets < 3030, 1,0)
  • Let's say we were interested in trying to figure out what makes a tweet go viral.

  • We explore the difference in word usage between high retweet rate tweets and low retweet rate tweets.
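The cut-offs 9366 and 3030 are the third and first quartiles from the summary output. A sketch of computing them instead of typing them in, using a made-up retweet vector in place of `trumptweets$Retweets`:

```r
# Made-up retweet counts standing in for trumptweets$Retweets
retweets <- c(0, 1200, 3100, 5400, 7600, 9400, 25000, 349900)

# First and third quartiles, named "25%" and "75%"
q <- quantile(retweets, probs = c(0.25, 0.75))

# Top quarter is labelled viral, bottom quarter non-viral
viral    <- ifelse(retweets > q[["75%"]], 1, 0)
nonviral <- ifelse(retweets < q[["25%"]], 1, 0)

data.frame(retweets, viral, nonviral)
```

Computing the thresholds this way keeps the labels in step with the data if the set of tweets changes.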

Pre-processing pipeline

library(tm)         # Corpus, tm_map and the cleaning transformations
library(SnowballC)  # stemming backend used by stemDocument

text_cleaner<-function(corpus, rawtext){
  # If the input holds status objects with a getText() method,
  # extract the text before any string conversion
  if(rawtext == TRUE){
    corpus = lapply(corpus, function(t) t$getText())
  }
  # One character string per document
  tempcorpus = unlist(lapply(corpus,toString))
  # Drop bytes that are not valid ASCII, then lowercase
  tempcorpus<-iconv(tempcorpus, "ASCII", "UTF-8", sub="")
  tempcorpus<-tolower(tempcorpus)
  tempcorpus<-Corpus(VectorSource(tempcorpus))
  # Standard tm cleaning steps
  tempcorpus<-tm_map(tempcorpus, removePunctuation)
  tempcorpus<-tm_map(tempcorpus, removeNumbers)
  tempcorpus<-tm_map(tempcorpus, removeWords, stopwords("english"))
  tempcorpus<-tm_map(tempcorpus, stemDocument)
  tempcorpus<-tm_map(tempcorpus, stripWhitespace)
  return(tempcorpus)
}

Plot a word cloud

Wordcloud for High Retweet Trump Tweets

Plot a word cloud

Wordcloud for Low Retweet Trump Tweets

Naive Bayes in R with Trump's Tweets

  • The pre-processing steps have already been done; we are now working with the Document-Term Matrix.
train=sample(1:dim(trumptweets)[1],
             dim(trumptweets)[1]*0.75)
trainX = dtm[train,]
testX = dtm[-train,]
trainY = viraltweets[train]
testY = viraltweets[-train]
  • This is a 75% training, 25% test split.
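Because `sample()` draws at random, the split changes on every run unless a seed is set. A minimal sketch of a reproducible 75/25 split follows. The seed value and the document count `n` are arbitrary stand-ins, not values from the original analysis.

```r
set.seed(2017)   # arbitrary seed, chosen only for illustration
n <- 20          # stand-in for dim(trumptweets)[1]

# Row indices for the training set (75% of the documents)
train <- sample(seq_len(n), size = floor(0.75 * n))
# Everything not drawn for training is held out for testing
test <- setdiff(seq_len(n), train)

length(train)
length(test)
```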

Increasing accuracy by identifying common words

  • We may be able to improve accuracy by identifying commonly used words.
ten_words<-
  findFreqTerms(trainX,10)
ten_words[(length(ten_words)-20):length(ten_words)]
##  [1] "win"       "winner"    "wisconsin" "without"   "woman"    
##  [6] "women"     "won"       "wonder"    "wont"      "word"     
## [11] "work"      "world"     "wors"      "worst"     "wow"      
## [16] "wrong"     "year"      "yesterday" "yet"       "york"     
## [21] "zero"

Create a Document-Term Matrix with only the frequent words

  • Notice that we're creating a new Document-Term Matrix from the original cleaned corpus
fword_train <- 
  DocumentTermMatrix(newcorpus[train],
  control=list(dictionary = ten_words))

fword_test <- 
  DocumentTermMatrix(newcorpus[-train],
  control=list(dictionary = ten_words))

Conversion of Document Term Matrix to Counts

Here Naive Bayes is fit on whether each word is present or absent in a document, so we convert every count greater than 0 to 1 and leave 0 as 0.

counts <- function(x) {
  y <- ifelse(x > 0, 1,0)
  y <- factor(y, levels=c(0,1))
  y
}

Convert Document Term Matrix to Counts

fword_train <- apply(fword_train, 2, counts)
fword_test <- apply(fword_test, 2, counts)
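To see what this conversion does, here is a sketch on a small made-up count matrix (the matrix `m` is hypothetical). Because `counts` returns a factor, `apply()` coerces the result to a character matrix of "0" and "1" labels. The last line shows a numeric alternative.

```r
# Small made-up count matrix: rows are documents, columns are terms
m <- matrix(c(0, 2, 1,
              3, 0, 0),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("doc1", "doc2"), c("tax", "win", "wow")))

# Same helper as in the slides: any positive count becomes 1
counts <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  factor(y, levels = c(0, 1))
}

# apply() coerces the factor results to a character matrix of "0"/"1"
indicators <- apply(m, 2, counts)

# A numeric equivalent, if 0/1 numbers are needed instead of labels
indicators_num <- (m > 0) * 1
```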

Estimating the Naive Bayes Model

viral_classifier <- 
  naiveBayes(x=fword_train,y=factor(trainY))
  • We will use the function "naiveBayes" in the "e1071" package.

  • The "viral_classifier" object is now the trained classifier on the training data.

Apply the trained naive Bayes classifier to the test data

viral_test_pred <- 
  predict(viral_classifier, newdata=fword_test)
# Let's see how this looks
confusion = table(testY,viral_test_pred)
confusion
##      viral_test_pred
## testY   0   1
##     0 645  90
##     1 126 119

Confusion matrix

Confusion Matrix for Trump's Tweets
Ground Truth   Classified Viral   Classified Not Viral
Viral          Correct            False (-)
Not Viral      False (+)          Correct
  • One way to display this information is in a "confusion matrix" like the one shown above.

  • If you are writing a paper using a classifier, always include the confusion matrix.

Sensitivity, specificity and accuracy

  • Class specific performance is a very important aspect of classifiers.

  • Ideally, you want to keep both false negatives and false positives as low as possible.

Sensitivity, specificity and accuracy

  • Accuracy - % of documents that are correctly classified: \[ \frac{\text{# docs correctly classified}}{\text{# of docs classified}} \]

  • Sensitivity is the % of positives that are correctly identified: \[ \frac{\text{# of positives identified}}{\text{# of positives}} \]

  • Specificity is the % of negatives that are correctly identified: \[ \frac{\text{# of negatives identified}}{\text{# of negatives}} \]

Sensitivity, specificity and accuracy

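  • It is very easy to have a high accuracy rate, and a high sensitivity or specificity, but still have a poor classifier: when classes are imbalanced, always predicting the majority class scores well on accuracy while missing every case in the minority class.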

  • Lesson from classifying violence in religious texts.
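A quick way to see the problem is to score a "classifier" that never looks at the text. This sketch reuses the class balance of the test set above (735 non-viral, 245 viral) and always predicts non-viral:

```r
# Class balance taken from the test set above: 735 non-viral, 245 viral
truth <- factor(c(rep(0, 735), rep(1, 245)), levels = c(0, 1))
# A "classifier" that ignores the text and always predicts non-viral
pred <- factor(rep(0, length(truth)), levels = c(0, 1))

confusion <- table(truth, pred)

accuracy    <- sum(diag(confusion)) / sum(confusion)
sensitivity <- confusion[2, 2] / sum(confusion[2, ])
specificity <- confusion[1, 1] / sum(confusion[1, ])

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

Accuracy comes out at 75% and specificity at 100%, yet not a single viral tweet is found, which is why any real classifier should be compared against this kind of baseline.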

Naive Bayes With Trump's Tweets

Step 5: Calculate accuracy, specificity and sensitivity

# Correct predictions sit on the diagonal of the confusion matrix
accuracy<-(confusion[1,1]+confusion[2,2])/sum(confusion)
accuracy
## [1] 0.7795918

Naive Bayes With Trump's Tweets

Step 5: Calculate accuracy, specificity and sensitivity

specificity<-confusion[1,1]/sum(confusion[1,])
specificity
## [1] 0.877551

Naive Bayes With Trump's Tweets

Step 5: Calculate accuracy, specificity and sensitivity

sensitivity<-confusion[2,2]/sum(confusion[2,])
sensitivity
## [1] 0.4857143

Support vector machines

  • One of the more established machine learning methods.

  • Introduced in its modern form around 1992 by Vapnik and colleagues.

  • Theoretically well motivated: a product of statistical learning theory developed since the 1960s.

  • Good performance in many domains (image recognition, text data etc.)

Support vector machine basics

  • Support vector machines are in many ways similar to regression.

  • But instead of fitting a line, support vector machines fit a maximally separating hyperplane between a set of points.

Support vector machines

SVM Maximally separating hyperplane

  • Support vector machines have nice properties

  • The optimization problem is convex, and the decision boundary can be non-linear with different kernels.

How do SVMs work?

\[ \theta^{T}x - \alpha = 0 \]

  • A hyperplane can be written as above.

  • SVMs involve estimating weights \(\theta\) that define a hyperplane which separates two classes.

  • But there are lots of different hyperplanes you can estimate.

  • Which one to choose?

SVM hyperplane estimation

\[ \begin{aligned} \theta^{T}x - \alpha & = 0 \\ \theta^{T}x_{i} - \alpha & \geq 1 &~\text{if}~y_{i} =1 \\ \theta^{T}x_{i} - \alpha & \leq -1 &~\text{if}~y_{i}=-1 \\ \end{aligned} \]

  • SVMs estimate hyperplanes which leave the maximum margin between classes.

SVM hyperplane estimation

Margin is: \[ \frac{2}{||\theta||} \]

The maximum margin is when \(||\theta||\) is at a minimum.

SVM hyperplane estimation

Minimize \[||\theta|| \]

subject to

\[y_{i}(\theta^{T}x_{i} - \alpha) \geq 1 \quad \text{for all}~i \]

  • Hyperplane estimation is a constrained optimization problem.

  • Solved using Lagrange multipliers.
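The link between the constraints and the margin can be checked numerically. The sketch below does not solve the optimization problem. It only takes two hand-picked candidate hyperplanes for a made-up, linearly separable data set and shows that both satisfy the constraints, while the one with the smaller norm has the wider margin. All values and helper names are hypothetical.

```r
# Hypothetical, linearly separable toy data (two features per point)
X <- rbind(c(1, 1), c(2, 0.5), c(-1, -1), c(-0.5, -2))
y <- c(1, 1, -1, -1)

# Check y_i * (theta' x_i - alpha) >= 1 for every point
satisfies_constraints <- function(theta, alpha) {
  all(y * (as.vector(X %*% theta) - alpha) >= 1)
}
# Width of the margin implied by theta
margin <- function(theta) 2 / sqrt(sum(theta^2))

theta_a <- c(1, 1)       # feasible, but with slack on every point
theta_b <- c(0.5, 0.5)   # also feasible, with a smaller norm

satisfies_constraints(theta_a, alpha = 0)   # TRUE
satisfies_constraints(theta_b, alpha = 0)   # TRUE
c(margin_a = margin(theta_a), margin_b = margin(theta_b))
```

Shrinking the weights any further, for example to `c(0.4, 0.4)`, violates the constraint for the point `(1, 1)`, so `theta_b` sits right at the boundary of what is allowed along this direction.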

SVM Kernel Trick

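  • Although SVMs are technically linear models, you can estimate nonlinear SVMs with something known as the kernel trick.

  • This works by replacing the dot products between observations with a kernel function, which implicitly maps the data into a higher-dimensional space where a linear separator may exist.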

  • Changing the kernel can drastically change performance of the SVM.
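A kernel is simply a function of two observations. The sketch below writes out three common kernels in base R and then checks the idea behind the trick on a degree-2 polynomial kernel: evaluating the kernel gives the same number as an explicit dot product in a larger feature space. The function names, the feature map `phi`, and the default `gamma` and `coef0` values are assumptions made for illustration.

```r
# Three common kernels, written out by hand
linear_kernel  <- function(u, v) sum(u * v)
radial_kernel  <- function(u, v, gamma = 1) exp(-gamma * sum((u - v)^2))
sigmoid_kernel <- function(u, v, gamma = 1, coef0 = 0) {
  tanh(gamma * sum(u * v) + coef0)
}

# Degree-2 polynomial kernel and the feature map it corresponds to
# (for two-dimensional inputs)
poly2_kernel <- function(u, v) sum(u * v)^2
phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)

u <- c(1, 2)
v <- c(3, 4)

poly2_kernel(u, v)       # kernel evaluated in the original 2-d space
sum(phi(u) * phi(v))     # same value via the explicit 3-d feature map
```

The kernel never builds `phi(u)` or `phi(v)`, which is what makes high-dimensional (or, for the radial kernel, infinite-dimensional) feature spaces practical.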

Support vector machines in R

fword_train <- 
  DocumentTermMatrix(newcorpus[train],
  control=list(dictionary = ten_words))

fword_test <- 
  DocumentTermMatrix(newcorpus[-train],
  control=list(dictionary = ten_words))

Support Vector Machines in R

  • SVM with the default kernel (radial basis, for svm in the e1071 package)
model <- svm(x=as.matrix(fword_train),
             y=factor(trainY))
 
predictedY <- predict(model,as.matrix(fword_test))
 
confusion = table(testY, predictedY)
confusion
##      predictedY
## testY   0   1
##     0 726   9
##     1 212  33

Support Vector Machines in R

  • SVM with the sigmoid kernel
model <- svm(x=as.matrix(fword_train),
             y=factor(trainY), kernel="sigmoid")
 
predictedY <- predict(model,as.matrix(fword_test))
 
confusion = table(testY, predictedY)
confusion
##      predictedY
## testY   0   1
##     0 696  39
##     1 196  49
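To compare the three classifiers on the same footing, the confusion matrices printed above can be fed through one small helper. This is a sketch: the matrices are copied by hand from the output in these slides, and `make_cm` and `metrics` are hypothetical helper names.

```r
# Confusion matrices copied from the output above
# (rows: true class, columns: predicted class)
make_cm <- function(tn, fp, fn, tp) {
  matrix(c(tn, fp, fn, tp), nrow = 2, byrow = TRUE,
         dimnames = list(testY = c("0", "1"), predicted = c("0", "1")))
}
cms <- list(
  naive_bayes = make_cm(645, 90, 126, 119),
  svm_default = make_cm(726, 9, 212, 33),
  svm_sigmoid = make_cm(696, 39, 196, 49)
)

metrics <- function(cm) {
  c(accuracy    = sum(diag(cm)) / sum(cm),
    sensitivity = cm["1", "1"] / sum(cm["1", ]),
    specificity = cm["0", "0"] / sum(cm["0", ]))
}

# One column per classifier
round(sapply(cms, metrics), 3)
```

All three classifiers have similar accuracy (roughly 76% to 78%), but the SVMs get there by predicting "not viral" far more often: their sensitivity (about 0.13 and 0.20) is well below the 0.49 of naive Bayes.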