We are often interested in constructing a numerical measure of the sentiment of a document.
Sentiment can be positive/negative: "This movie is great!" (+) "This movie is typical of the ones that Nicholas Cage is in" (-)
4/05/2017
"God, Nicholas Cage is such a bad actor!" (Disgust) "I wish I could just reach into the screen and throttle Nicholas Cage" (Anger) "So depressed that I'm watching my second Nicholas Cage movie this week" (Sadness)
Eg) "@AmbassadorRice, you should do the right thing. Hire a lawyer and surrender yourself to the @FBI. #ObamaGate"
Eg) Coverage of the Trump inauguration: "Why the paltry crowd for Trump's inaugural matters" | MSNBC www.msnbc.com/rachel-maddow…/why-the-paltry-crowd-trumps-inaugural-matters
"31 million tune in to witness Trump inauguration, Fox News most …" www.washingtontimes.com/…/31-million-tune-in-to-witness-trump-inauguration-f/
Supervised machine learning - using naive Bayes, SVM, LSTMs or CNNs (neural networks)
Natural language processing
Any supervised ML method that we have learned can be used for sentiment analysis.
For this class, we will learn how to do sentiment analysis using naive Bayes and SVMs.
We will not only learn how to classify documents by sentiment, but also how to construct more fine-grained sentiment metrics.
Supervised - Identify a source of labeled data OR label your own data.
Semi-supervised - Use labeled data from other sources together with your own unlabeled data.
It is possible under some conditions to significantly improve model accuracy without inducing overfitting.
Semi-supervised learning can oftentimes help with this.
With semi-supervised learning we make use of both labeled and unlabeled data.
There are many ways to do semi-supervised learning; this is just one.
1) Identify a source of labeled data.
2) Train a model on it.
3) Apply the model to the unlabeled data.
4) Use the predicted labels plus human judgement on a subsample of the unlabeled data, and retrain the model with the newly labeled data.
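A minimal self-training sketch of these steps in base R, on synthetic data. Here `glm` stands in for whichever supervised learner you use, and the 0.9/0.1 confidence cutoffs are illustrative (in practice a human would also verify a subsample at step 4):

```r
set.seed(1)
# Synthetic labeled data (step 1: stands in for an external labeled source)
x_lab <- rnorm(100)
y_lab <- as.integer(x_lab + rnorm(100, sd = 0.5) > 0)
x_unlab <- rnorm(500)  # our own unlabeled data

# 2) Train a model on the labeled source
model <- glm(y_lab ~ x_lab, family = binomial)

# 3) Apply the model to the unlabeled data
p_unlab <- predict(model, newdata = data.frame(x_lab = x_unlab), type = "response")

# 4) Keep only confident predictions (in practice, also verify a
#    human-judged subsample), then retrain on the enlarged data set
confident <- p_unlab > 0.9 | p_unlab < 0.1
x_new <- c(x_lab, x_unlab[confident])
y_new <- c(y_lab, as.integer(p_unlab[confident] > 0.5))
model2 <- glm(y_new ~ x_new, family = binomial)
```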
\[ P(C = 1|D) = \frac{P(w_{1} \cap w_{2} \cap \cdots \cap w_{k} | C = 1) P(C = 1)}{P(w_{1} \cap w_{2} \cap \cdots \cap w_{k})} \]
Likelihood: \[P(D|C = 1) = \prod_{i=1}^W P(w_{i}|C =1)\] Prior: \[P(C = 1)= \frac{\# D \in C_{1}}{\# D \in C_{1},C_{2}}\]
Marginal likelihood: \[ P(D) = \prod_{i=1}^W P(w_{i}) \]
If we assume that the words are independent conditional on a document class then:
\[ P(C = 1|D) = \frac{[P(w_{1}|C=1)P(w_{2}|C=1)\cdots P(w_{k}| C = 1)] P(C = 1)}{P(w_{1})P(w_{2})\cdots P(w_{k})} \]
\[P(w_{i} | C = 1) = \frac{\# w_{i} \in C_{1}}{\# \mathbf{w} \in C_{1}}\] \[P(C = 1)= \frac{\# D \in C_{1}}{\# D \in C_{1},C_{2}}\] \[P(w_{i})= \frac{\# w_{i} \in C_{1},C_{2}}{\# \mathbf{w} \in C_{1},C_{2}}\]
\[ \hat{C} = \arg\max_{k}\, P(C = k)\prod_{i=1}^W P(w_{i}|C = k) \] - For classification purposes, we can ignore the marginal likelihood and assign classes based on the likelihood and the prior.
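A toy worked example of this decision rule, with made-up word counts:

```r
# Made-up training counts over a 3-word vocabulary
counts_pos <- c(great = 8, bad = 1, movie = 6)  # word counts in positive docs
counts_neg <- c(great = 1, bad = 7, movie = 7)  # word counts in negative docs
prior <- c(pos = 0.5, neg = 0.5)                # equal numbers of docs per class

# P(w_i | C): word proportions within each class
p_w_pos <- counts_pos / sum(counts_pos)
p_w_neg <- counts_neg / sum(counts_neg)

# Score the document "great movie" for each class: P(C) * prod_i P(w_i | C)
doc <- c("great", "movie")
score_pos <- prior["pos"] * prod(p_w_pos[doc])
score_neg <- prior["neg"] * prod(p_w_neg[doc])

# Assign the class with the larger score; the marginal likelihood
# P(w_1)P(w_2)... is the same for both classes, so it drops out
c(pos = score_pos, neg = score_neg)
```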
Alternatively, assign class \(k\) when its posterior probability exceeds the uniform threshold \[ P(C = k | D) > \frac{1}{K} \] where \(K\) is the number of classes.
Assume that we have a two-class sentiment prediction problem where the classes are Positive and Negative.
Naive Bayes produces two outputs for each document:
A class label: \(k \in \{+,-\}\), chosen by \(\hat{C} = \arg\max_{k} P(C = k)\prod_{i=1}^W P(w_{i}|C = k)\)
Probabilities for each class label: \(P(C =+| D)\), \(P(C = -| D)\).
Using only classes can be useful for a number of reasons.
Use the class labels \(S = \{d_{1+}, d_{2+}, d_{3-}, \ldots\}\) as an independent or a dependent variable for inference in a statistical model:
As dependent variable:
\[ \mathrm{logit}(E[S|X]) = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \cdots \] As independent variable:
\[ Y = \beta_{0} + \beta_{1}S + \beta_{2}x_{2} + \beta_{3}x_{3} + \cdots \]
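A minimal sketch of the independent-variable case on synthetic data; `S`, `x2`, `x3`, and the coefficient values are all made up for illustration:

```r
set.seed(42)
n  <- 200
S  <- rbinom(n, 1, 0.5)  # predicted sentiment label (1 = positive), hypothetical
x2 <- rnorm(n)
x3 <- rnorm(n)
Y  <- 1 + 2 * S + 0.5 * x2 - x3 + rnorm(n)  # true effect of sentiment is 2

# Sentiment as an independent variable in a linear model
fit <- lm(Y ~ S + x2 + x3)
coef(fit)["S"]  # estimate should land near the true value of 2
```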
We are more certain that a document is positive if:
\[ P(+|D) = 0.9 \implies + \] Than if:
\[ P(+|D) = 0.6 \implies + \] Even though both documents are labelled as positive.
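One simple way to turn these posteriors into a fine-grained score (an illustration, not a standard from the lecture) is the difference \(P(+|D) - P(-|D)\), which lives on \([-1, 1]\):

```r
# Hypothetical posterior probabilities for three documents, all labeled +
p_pos <- c(0.9, 0.6, 0.51)

# Score on [-1, 1]: P(+|D) - P(-|D); with two classes P(-|D) = 1 - P(+|D)
score <- p_pos - (1 - p_pos)
score  # the first document scores far higher than the third
```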
Eg) Does positive sentiment of Congressional speeches correlate with vote share?
# Load the training data
library(dplyr)  # for glimpse()
data <- read.csv("/Users/jason/Downloads/movie-pang02.csv", stringsAsFactors = FALSE)
glimpse(data)
## Observations: 2,000 ## Variables: 2 ## $ class <chr> "Pos", "Pos", "Pos", "Pos", "Pos", "Pos", "Pos", "Pos", ... ## $ text <chr> " films adapted from comic books have had plenty of succ...
# Clean the data
library(tm)  # for DocumentTermMatrix()
reviews <- data$text
newcorpus <- text_cleaner(reviews, rawtext = FALSE)
sentiment <- data$class

# Create a document-term matrix
dtm <- DocumentTermMatrix(newcorpus)
dtm <- removeSparseTerms(dtm, 0.99)  # Reduce sparsity
# Split sample into training and test (75/25)
train <- sample(1:length(reviews), length(reviews) * 0.75)
dtm_mat <- as.matrix(dtm)
trainX <- dtm_mat[train, ]
testX  <- dtm_mat[-train, ]
trainY <- sentiment[train]
testY  <- sentiment[-train]
The naive Bayes classifier we will use treats predictors as categorical, so we convert word counts to binary presence/absence indicators: any count greater than 0 becomes 1.
counts <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels = c(0, 1))
  y
}
fword_train <- apply(trainX, 2, counts)
fword_test <- apply(testX, 2, counts)
We will use the function "naiveBayes" in the "e1071" package.
library(e1071)
viral_classifier <- naiveBayes(x = fword_train, y = factor(trainY))
The "viral_classifier" object is now the classifier trained on the training data.
viral_test_pred <- predict(viral_classifier, newdata=fword_test)
# Let's see how this looks
confusion <- table(testY, viral_test_pred)
confusion
## viral_test_pred ## testY Neg Pos ## Neg 205 38 ## Pos 63 194
accuracy <- (confusion[1,1] + confusion[2,2]) / sum(confusion)
accuracy
## [1] 0.798
specificity <- confusion[1,1] / sum(confusion[1,])
specificity
## [1] 0.8436214
sensitivity <- confusion[2,2] / sum(confusion[2,])
sensitivity
## [1] 0.7548638
trumptweets <- read.csv("https://www.ocf.berkeley.edu/~janastas/trump-tweet-data.csv")
trumptweets <- trumptweets[1:10, ]
trumptweets <- trumptweets$Text
cleantweets <- text_cleaner(trumptweets, rawtext = FALSE)
dtm <- DocumentTermMatrix(cleantweets)
dtm <- removeSparseTerms(dtm, 0.99)  # Reduce sparsity
# Note: strictly, the same counts() transformation applied to the
# training data should also be applied here
trump_dtm <- as.matrix(dtm)
trump_tweet_pred <- predict(viral_classifier, newdata=trump_dtm, type="raw") trump_tweet_pred
## Neg Pos ## [1,] 1.000000e+00 2.985154e-253 ## [2,] 7.163240e-177 1.000000e+00 ## [3,] 8.867135e-40 1.000000e+00 ## [4,] 1.000000e+00 7.963089e-44 ## [5,] 1.000000e+00 5.679855e-244 ## [6,] 1.000000e+00 1.319742e-265 ## [7,] 1.000000e+00 4.563247e-272 ## [8,] 4.499153e-33 1.000000e+00 ## [9,] 1.000000e+00 8.356058e-175 ## [10,] 1.000000e+00 0.000000e+00
trump_tweet_pred <- predict(viral_classifier, newdata=trump_dtm, type="class") trump_tweet_pred
## [1] Neg Pos Pos Neg Neg Neg Neg Pos Neg Neg ## Levels: Neg Pos