4/05/2017

Sentiment Analysis

  • We are often interested in constructing a numerical measure of the sentiment of a document.

  • Sentiment can be positive/negative: "This movie is great!" (+) "This movie is typical of the ones that Nicholas Cage is in" (-)

Sentiment Analysis

  • Sentiment can be emotional states:

"God, Nicholas Cage is such a bad actor!" (Disgust)

"I wish I could just reach into the screen and throttle Nicholas Cage" (Anger)

"So depressed that I'm watching my second Nicholas Cage movie this week" (Sadness)

Sentiment analysis for social science research

  • In a social science context, we might be interested in figuring out whether a tweet mentioning a politician or agency is favorable or unfavorable.

Eg) "@AmbassadorRice, you should do the right thing. Hire a lawyer and surrender yourself to the @FBI. #ObamaGate"

Sentiment analysis for social science research

  • We might be interested in figuring out whether news coverage about an event is positive or negative, often called "spin".

Eg) Coverage of the Trump inauguration

"Why the paltry crowd for Trump's inaugural matters" | MSNBC www.msnbc.com/rachel-maddow…/why-the-paltry-crowd-trumps-inaugural-matters

"31 million tune in to witness Trump inauguration, Fox News most …" www.washingtontimes.com/…/31-million-tune-in-to-witness-trump-inauguration-f/

Means of measuring sentiment

  1. Supervised machine learning - using naive Bayes, SVM, LSTMs or CNNs (neural networks)

  2. Natural language processing

  • Create a dictionary of words associated with each sentiment.
  • Calculate a sentiment score for each document using the dictionary.
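The dictionary approach can be sketched in a few lines of base R. This is only an illustration: the word lists and the function name `dictionary_score` are made up for the example and do not come from a published lexicon. The score is the number of positive words minus the number of negative words, divided by the number of tokens.

```r
# Hypothetical mini-dictionary; a real analysis would use a published lexicon
pos_words <- c("great", "good", "excellent", "love")
neg_words <- c("bad", "terrible", "awful", "hate")

dictionary_score <- function(text) {
  # Lowercase, replace anything that is not a letter or space, split on spaces
  clean  <- gsub("[^a-z ]", " ", tolower(text))
  tokens <- strsplit(clean, " +")[[1]]
  tokens <- tokens[tokens != ""]
  if (length(tokens) == 0) return(0)
  # (positive hits - negative hits) / number of tokens
  (sum(tokens %in% pos_words) - sum(tokens %in% neg_words)) / length(tokens)
}

docs <- c("This movie is great!",
          "God, Nicholas Cage is such a bad actor!")
scores <- sapply(docs, dictionary_score, USE.NAMES = FALSE)
scores
```

Scores above zero indicate more positive than negative dictionary words; dividing by document length keeps long and short documents comparable.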

Naive Bayes for sentiment analysis

  • Any supervised ML method that we learned can be used for sentiment analysis.

  • For this class, we will learn how to do sentiment analyses using naive Bayes and SVMs.

  • We will not only learn how to classify documents by sentiment, we will also learn how to construct more fine-grained sentiment metrics.

Sentiment analysis with ML: Step 1 - Identify your training data

  • You can identify your training data in two ways:
  1. Supervised - Identify a source of labeled data OR label your own data

  2. Semi-supervised - Use both other labeled sources of data and your own data.

Semi-supervised sentiment analysis

  • It is possible under some conditions to significantly improve model accuracy without inducing overfitting.

  • Semi-supervised learning can oftentimes help with this.

  • With semi-supervised learning we make use of both labeled and unlabeled data.

  • There are many ways to do semi-supervised learning; this is just one.

Semi-supervised sentiment analysis: steps

  1. Identify a source of labeled data.

  2. Train model.

  3. Apply model to unlabelled data.

  4. Combine the predicted labels with human judgement on a subsample of the unlabelled data, then retrain the model with the newly labelled data.
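The steps above can be sketched with a toy example. Everything here is an assumption made for illustration: the tiny 0/1 document-term matrices, the helper names `nb_train` and `nb_posterior`, and the "human review", which is simulated by accepting the model's labels for two documents. It uses a minimal presence/absence naive Bayes written in base R rather than any particular package.

```r
# Minimal Bernoulli naive Bayes on a 0/1 document-term matrix (base R only)
nb_train <- function(X, y) {
  classes <- sort(unique(y))
  prior   <- sapply(classes, function(k) mean(y == k))
  # P(word present | class), with add-one smoothing to avoid zero probabilities
  p_word  <- sapply(classes, function(k)
    (colSums(X[y == k, , drop = FALSE]) + 1) / (sum(y == k) + 2))
  list(classes = classes, prior = prior, p_word = p_word)
}

nb_posterior <- function(model, X) {
  # log P(C = k) + sum over words of log P(word present/absent | C = k)
  log_post <- X %*% log(model$p_word) + (1 - X) %*% log(1 - model$p_word)
  log_post <- sweep(log_post, 2, log(model$prior), "+")
  post <- exp(log_post - apply(log_post, 1, max))
  post / rowSums(post)
}

vocab <- c("great", "bad", "boring", "fun")
# Step 1: labeled data (made up)
X_lab <- matrix(c(1, 0, 0, 1,
                  1, 0, 0, 0,
                  0, 1, 1, 0,
                  0, 1, 0, 0), ncol = 4, byrow = TRUE,
                dimnames = list(NULL, vocab))
y_lab <- c("Pos", "Pos", "Neg", "Neg")
# Unlabeled data (made up)
X_unl <- matrix(c(0, 0, 0, 1,
                  0, 0, 1, 0,
                  1, 0, 0, 1), ncol = 4, byrow = TRUE,
                dimnames = list(NULL, vocab))

model <- nb_train(X_lab, y_lab)               # Step 2: train
post  <- nb_posterior(model, X_unl)           # Step 3: apply to unlabeled data
pred  <- model$classes[max.col(post, ties.method = "first")]

# Step 4: a human checks a subsample; here the (hypothetical) reviewer
# agrees with the model on documents 1 and 2, and we retrain with them added
reviewed     <- c(1, 2)
human_labels <- pred[reviewed]
X_aug  <- rbind(X_lab, X_unl[reviewed, , drop = FALSE])
y_aug  <- c(y_lab, human_labels)
model2 <- nb_train(X_aug, y_aug)
pred
```

In practice the reviewer would correct some of the predicted labels, and the loop could be repeated until accuracy on held-out data stops improving.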

Review of naive Bayes: model

\[ P(C = 1|D) = \frac{P(w_{1} \cap w_{2} \cap \cdots \cap w_{k} | C = 1) P(C = 1)}{P(w_{1} \cap w_{2} \cap \cdots \cap w_{k})} \]

Likelihood and marginal likelihood

Likelihood: \[P(D|C = 1) = \prod_{i=1}^W P(w_{i}|C =1)\]

Prior: \[P(C = 1)= \frac{\# D \in C_{1}}{\# D \in C_{1},C_{2}}\]

Marginal likelihood: \[ P(D) = \prod_{i=1}^W P(w_{i}) \]

Assumptions

If we assume that the words are independent conditional on a document class then:

\[ P(C = 1|D) = \frac{[P(w_{1}|C=1)P(w_{2}|C=1)\cdots P(w_{k}| C = 1)] P(C = 1)}{P(w_{1})P(w_{2})\cdots P(w_{k})} \]

Where

\[P(w_{i} | C = 1) = \frac{\# w_{i} \in C_{1}}{\# \mathbf{w} \in C_{1}}\]

\[P(C = 1)= \frac{\# D \in C_{1}}{\# D \in C_{1},C_{2}}\]

\[P(w_{i})= \frac{\# w_{i} \in C_{1},C_{2}}{\# \mathbf{w} \in C_{1},C_{2}}\]
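These count-based estimates are easy to compute by hand. The sketch below uses made-up word and document counts for two classes (the numbers and object names are illustrative only) and evaluates each of the three ratios.

```r
# Toy word counts per class (made-up numbers for illustration)
word_counts <- matrix(c(8, 2,    # "great"
                        1, 6,    # "bad"
                        3, 4),   # "movie"
                      ncol = 2, byrow = TRUE,
                      dimnames = list(c("great", "bad", "movie"),
                                      c("C1", "C2")))
doc_counts <- c(C1 = 30, C2 = 20)   # number of documents in each class

# P(w_i | C = 1): count of w_i in class 1 / total words in class 1
p_w_given_c1 <- word_counts[, "C1"] / sum(word_counts[, "C1"])
# P(C = 1): documents in class 1 / all documents
p_c1 <- doc_counts[["C1"]] / sum(doc_counts)
# P(w_i): count of w_i in both classes / all words in both classes
p_w <- rowSums(word_counts) / sum(word_counts)

p_w_given_c1
p_c1
p_w
```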

Classification

\[ \hat{C} = \arg\max_{k} P(C = k)\prod_{i=1}^W P(w_{i}|C =k) \]

  • For classification purposes, we can ignore the marginal likelihood and assign classes based on the likelihood and the prior, because the marginal likelihood is the same for every class.
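Because a product of many small probabilities quickly underflows, the comparison is usually done on sums of logs, which leaves the winning class unchanged. A minimal sketch, with made-up priors and word likelihoods for a single three-word document:

```r
# Made-up class priors and word likelihoods P(w_i | C = k) for one document
prior <- c(Pos = 0.5, Neg = 0.5)
lik <- rbind(Pos = c(0.20, 0.05, 0.10),
             Neg = c(0.02, 0.05, 0.30))

# log P(C = k) + sum_i log P(w_i | C = k)
log_score <- log(prior) + rowSums(log(lik))

# Assign the class with the largest score
predicted <- names(which.max(log_score))

# Normalising the scores recovers the posterior probabilities
posterior <- exp(log_score - max(log_score))
posterior <- posterior / sum(posterior)

predicted
posterior
```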

Classification

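  • With two classes, an alternative means of expressing this is: if

\[ P(C = k | D) > \frac{1}{2}\]

  • then assign the document to class k. With more than two classes, assign the class with the largest posterior probability.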

Using naive Bayes to assess sentiment

  • Assume that we have a two class sentiment prediction problem where the classes are Positive and Negative

  • Naive Bayes produces two outputs for each document:

  1. A class label \(k \in \{+,-\}\): \(\hat{C} = \arg\max_{k} P(C = k)\prod_{i=1}^W P(w_{i}|C =k)\)

  2. Probabilities for each class label: \(P(C =+| D)\), \(P(C = -| D)\).

  • Both can be used to measure sentiment for documents in a corpus.

Using classes to measure sentiment

  • Using only classes can be useful for a number of reasons.

  • Use the class labels \(S = \{d_{1+},d_{2+},d_{3-},\ldots\}\) as an independent or a dependent variable for inference in a statistical model:

As dependent variable:

\[ \operatorname{logit}(E[S|X]) = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \cdots \]

As independent variable:

\[ Y = \beta_{0} + \beta_{1}S + \beta_{2}x_{2} + \beta_{3}x_{3} + \cdots \]

  • etc
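Both uses map directly onto standard R model calls. The sketch below runs on simulated data only: the covariate `x1`, the outcome `Y`, and the coefficients used to generate them are invented for illustration, and `S` stands in for predicted sentiment labels coded 1 (positive) and 0 (negative).

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)                                  # simulated covariate
# Simulated sentiment labels: 1 = positive, 0 = negative
S  <- rbinom(n, 1, plogis(-0.5 + 1.2 * x1))

# Sentiment as the dependent variable: logistic regression
fit_dep <- glm(S ~ x1, family = binomial)

# Sentiment as an independent variable: linear regression
Y <- 2 + 0.8 * S + 0.5 * x1 + rnorm(n)          # simulated outcome
fit_ind <- lm(Y ~ S + x1)

coef(fit_dep)
coef(fit_ind)
```

Keep in mind that predicted labels contain classification error, so coefficients on `S` inherit some measurement error from the classifier.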

Using conditional probabilities you can construct sentiment scores

  • Conditional probabilities estimated by the naive Bayes classifier provide a good measure of the "strength" of a classification.

We are more certain that a document is positive if:

\[ P(+|D) = 0.9 \implies + \]

than if:

\[ P(+|D) = 0.6 \implies + \]

even though both documents are labelled as positive.

Using conditional probabilities you can construct sentiment scores

  • This is useful for inference when you want to construct a variable that is a continuous measure of the sentiment of a document.

Eg) Does positive sentiment of Congressional speeches correlate with vote share?
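A short sketch of the difference between labels and scores, using made-up posterior probabilities for five documents. Two common choices for a continuous score are the probability \(P(+|D)\) itself or its log-odds; which one to use is a modelling choice, not something fixed by naive Bayes.

```r
# Made-up posterior probabilities P(+ | D) for five documents
p_pos <- c(0.90, 0.60, 0.35, 0.99, 0.05)

# Class labels keep only which side of 0.5 each document falls on
label <- ifelse(p_pos > 0.5, "Pos", "Neg")

# Continuous sentiment scores: the probability itself, or its log-odds
log_odds <- log(p_pos / (1 - p_pos))

data.frame(p_pos, label, log_odds)
```

Documents 1 and 2 share the label "Pos" but receive clearly different scores, which is the extra information a regression on the score can use.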

Example using movie reviews

# Load the training data
library(dplyr) # provides glimpse()
data<- read.csv("/Users/jason/Downloads/movie-pang02.csv", stringsAsFactors = FALSE)
glimpse(data)
## Observations: 2,000
## Variables: 2
## $ class <chr> "Pos", "Pos", "Pos", "Pos", "Pos", "Pos", "Pos", "Pos", ...
## $ text  <chr> " films adapted from comic books have had plenty of succ...

Clean the reviews and place in a DTM

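# Clean the data
library(tm) # provides DocumentTermMatrix() and removeSparseTerms()
reviews<-data$text
# text_cleaner() is not part of tm; it is assumed to be a user-defined
# helper that returns a cleaned corpus
newcorpus<-text_cleaner(reviews,rawtext=FALSE)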
sentiment<-data$class
# Create a document term matrix
dtm <- DocumentTermMatrix(newcorpus)
dtm = removeSparseTerms(dtm, 0.99) # Reduce sparsity

Train the model and assess relevant statistics

# Split sample into training and test (75/25)
train=sample(1:length(reviews),
             length(reviews)*0.75)
dtm_mat<-as.matrix(dtm)
trainX = dtm_mat[train,]
testX = dtm_mat[-train,]
trainY = sentiment[train]
testY = sentiment[-train]

Conversion of Document Term Matrix to Counts

This naive Bayes setup treats each word as either present or absent in a document, so we need to transform every count greater than 0 to 1.

counts <- function(x) {
  y <- ifelse(x > 0, 1,0)
  y <- factor(y, levels=c(0,1))
  y
}

Convert Document Term Matrix to Counts

fword_train <- apply(trainX, 2, counts)
fword_test <- apply(testX, 2, counts)
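To see what the conversion does, here it is applied to a small made-up matrix (the documents and terms are illustrative). The `counts` function is repeated so the snippet runs on its own.

```r
# Same conversion as above: any positive count becomes 1, zero stays 0
counts <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  factor(y, levels = c(0, 1))
}

# Toy document-term matrix with raw word counts (made up)
toy <- matrix(c(3, 0, 1,
                0, 2, 0), nrow = 2, byrow = TRUE,
              dimnames = list(c("doc1", "doc2"),
                              c("great", "bad", "movie")))

# Apply column by column, as is done for trainX and testX
toy_binary <- apply(toy, 2, counts)
toy_binary
```

The count of 3 for "great" in the first document and the count of 1 for "movie" both end up as 1, so word frequency within a document no longer matters, only presence.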

Estimating the Naive Bayes Model

library(e1071) # provides naiveBayes()
viral_classifier <- 
  naiveBayes(x=fword_train,y=factor(trainY))
  • We will use the function "naiveBayes" in the "e1071" package.

  • The "viral_classifier" object is now the trained classifier on the training data.

Apply the trained naive Bayes classifier to the test data

viral_test_pred <- 
  predict(viral_classifier, newdata=fword_test)
# Let's see how this looks
confusion = table(testY,viral_test_pred)
confusion
##      viral_test_pred
## testY Neg Pos
##   Neg 205  38
##   Pos  63 194

Calculate stats

accuracy<-(confusion[1,1]+confusion[2,2])/sum(confusion)
accuracy
## [1] 0.798
specificity<-confusion[1,1]/sum(confusion[1,])
specificity
## [1] 0.8436214
sensitivity<-confusion[2,2]/sum(confusion[2,])
sensitivity
## [1] 0.7548638
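The same statistics can be wrapped in a small reusable function. This is a sketch: the function name `classification_stats` is hypothetical, the confusion matrix is re-entered by hand from the output above, and "Pos" is treated as the positive class. Precision is added as one more commonly reported quantity.

```r
# Confusion matrix copied from the output above: rows = truth, columns = prediction
confusion <- matrix(c(205, 38,
                      63, 194), nrow = 2, byrow = TRUE,
                    dimnames = list(testY = c("Neg", "Pos"),
                                    pred  = c("Neg", "Pos")))

classification_stats <- function(cm, positive = "Pos", negative = "Neg") {
  tp <- cm[positive, positive]   # true positives
  tn <- cm[negative, negative]   # true negatives
  fp <- cm[negative, positive]   # negatives predicted as positive
  fn <- cm[positive, negative]   # positives predicted as negative
  c(accuracy    = (tp + tn) / sum(cm),
    sensitivity = tp / (tp + fn),   # true positive rate
    specificity = tn / (tn + fp),   # true negative rate
    precision   = tp / (tp + fp))
}

stats <- classification_stats(confusion)
round(stats, 3)
```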

Trained sentiment model can now be applied to other things

  • Eg) Trump's Tweets
trumptweets <- read.csv("https://www.ocf.berkeley.edu/~janastas/trump-tweet-data.csv")
trumptweets<-trumptweets[1:10,]
trumptweets<-trumptweets$Text
cleantweets<-text_cleaner(trumptweets, rawtext = FALSE)
dtm <- DocumentTermMatrix(cleantweets)
dtm = removeSparseTerms(dtm, 0.99) # Reduce sparsity

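trump_dtm<-as.matrix(dtm)
# Note: the classifier was trained on 0/1 presence indicators from counts(),
# so new data would normally be converted the same way before predicting,
# e.g. apply(trump_dtm, 2, counts)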
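A classifier only knows the terms it saw during training, so a matrix built from new text generally needs the same columns as the training matrix. A minimal sketch of one way to line them up, using a hypothetical helper `align_columns` and made-up terms: terms missing from the new text get a count of 0, and terms the model never saw are dropped.

```r
# Hypothetical helper: reshape a new document-term matrix to the training vocabulary
align_columns <- function(new_mat, train_terms) {
  aligned <- matrix(0, nrow = nrow(new_mat), ncol = length(train_terms),
                    dimnames = list(rownames(new_mat), train_terms))
  # Copy over only the terms that appear in both vocabularies
  shared <- intersect(colnames(new_mat), train_terms)
  aligned[, shared] <- new_mat[, shared, drop = FALSE]
  aligned
}

train_terms <- c("great", "bad", "movie")          # made-up training vocabulary
new_mat <- matrix(c(1, 2,
                    0, 1), nrow = 2, byrow = TRUE,
                  dimnames = list(NULL, c("movie", "tremendous")))

aligned <- align_columns(new_mat, train_terms)
aligned
```

With the real data, `train_terms` would be `colnames(trainX)` and `new_mat` would be `trump_dtm`.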

Classify the first 10 tweets and get class probabilities

  • Let's classify the first 10 Tweets
trump_tweet_pred <- 
  predict(viral_classifier, 
          newdata=trump_dtm, type="raw")

trump_tweet_pred
##                 Neg           Pos
##  [1,]  1.000000e+00 2.985154e-253
##  [2,] 7.163240e-177  1.000000e+00
##  [3,]  8.867135e-40  1.000000e+00
##  [4,]  1.000000e+00  7.963089e-44
##  [5,]  1.000000e+00 5.679855e-244
##  [6,]  1.000000e+00 1.319742e-265
##  [7,]  1.000000e+00 4.563247e-272
##  [8,]  4.499153e-33  1.000000e+00
##  [9,]  1.000000e+00 8.356058e-175
## [10,]  1.000000e+00  0.000000e+00

Classify the first 10 tweets and get classes

trump_tweet_pred <- 
  predict(viral_classifier, 
          newdata=trump_dtm, type="class")

trump_tweet_pred
##  [1] Neg Pos Pos Neg Neg Neg Neg Pos Neg Neg
## Levels: Neg Pos