2/23/2017

From last time…

  1. Learned how to acquire text data using APIs.

  2. Learned how to clean and prepare text data for analysis.

  3. Built a document-term matrix.

For today

  1. Further text processing
  • Sparsity Reduction
  • TF-IDF Matrix
  2. Building supervised machine learning classifiers with text data.
  • Regularized Logistic Regression
  • Naive Bayes
  3. Assessing the performance of classifiers.

Building a pipeline

library(tm)        # corpus handling and cleaning functions
library(SnowballC) # stemming backend for stemDocument

text_cleaner <- function(corpus){
  # Extract the raw text from each tweet (twitteR status object)
  tempcorpus <- lapply(corpus, function(t) t$getText())
  # Lower-case everything
  tempcorpus <- lapply(tempcorpus, tolower)
  # Rebuild as a tm corpus
  tempcorpus <- Corpus(VectorSource(tempcorpus))
  # Strip punctuation, extra whitespace, and numbers
  tempcorpus <- tm_map(tempcorpus, removePunctuation)
  tempcorpus <- tm_map(tempcorpus, stripWhitespace)
  tempcorpus <- tm_map(tempcorpus, removeNumbers)
  # Drop common English stop words
  tempcorpus <- tm_map(tempcorpus, removeWords, stopwords("english"))
  # Stem each word to its root
  tempcorpus <- tm_map(tempcorpus, stemDocument)
  return(tempcorpus)
}
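
A usage sketch, assuming tweets is a list of twitteR status objects pulled from the API in the last lecture:

# Hypothetical input: 'tweets' is a list of twitteR status objects
newcorpus <- text_cleaner(tweets)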

Building a document-term matrix

  • Need to go from texts \(\rightarrow\) numbers.

  • Let \(d\) be the number of documents and \(w\) the number of unique words in a corpus.

  • Create a matrix \(\Delta \in \mathbb{N}^{d \times w}\).

  • Rows are documents, columns are words.

  • This is called the document-term matrix.

Building the document-term matrix
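
The matrix itself can be built from the cleaned corpus with tm's DocumentTermMatrix(); a minimal sketch:

# Build a term-frequency document-term matrix from the cleaned corpus
dtm <- DocumentTermMatrix(newcorpus)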

inspect(dtm[1:5, 1:5])
## <<DocumentTermMatrix (documents: 5, terms: 5)>>
## Non-/sparse entries: 0/25
## Sparsity           : 100%
## Maximal term length: 14
## Weighting          : term frequency (tf)
## 
##     Terms
## Docs abandon abc abcdonaldtrump abcpolit abdeslam
##    1       0   0              0        0        0
##    2       0   0              0        0        0
##    3       0   0              0        0        0
##    4       0   0              0        0        0
##    5       0   0              0        0        0
newcorpus[[2]]$content
## [1] " done even better elect possibl winner base popular vote campaign differ"

Sparse Document Term Matrices

  • In numerical analysis, a sparse matrix is a matrix in which most of the elements are zero.

  • A dense matrix is a matrix in which most of the elements are nonzero.

  • Matrices in text analysis problems tend to be very sparse.

  • This implies that they have many parameters that are uninformative.
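
A quick way to check just how sparse a document-term matrix is: count the zero entries directly. A minimal sketch, assuming the dtm built earlier:

# Densify a small slice only (the full matrix may be large)
m <- as.matrix(dtm[1:100, ])
sum(m == 0) / length(m)  # proportion of zero entries = sparsity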

Sparsity reduction

  • Sparsity can be reduced by removing terms that occur very infrequently, i.e., terms that appear in only a small share of the documents.

  • This tends both to reduce overfitting and to improve the predictive performance of the model.

Sparsity reduction

dtm <- removeSparseTerms(dtm, 0.95)
dtm
## <<DocumentTermMatrix (documents: 3917, terms: 16)>>
## Non-/sparse entries: 5676/56996
## Sparsity           : 91%
## Maximal term length: 21
## Weighting          : term frequency (tf)
  • Here we are reducing the document-term matrix so that every retained term appears in at least 5% of the documents, i.e., no term column is more than 95% zeros.
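
The threshold is a tuning choice: the lower it is, the more terms are dropped. A sketch comparing thresholds, assuming the unreduced dtm:

# A term kept at 0.99 appears in at least 1% of documents;
# at 0.90 it must appear in at least 10% of documents.
dim(removeSparseTerms(dtm, 0.99))
dim(removeSparseTerms(dtm, 0.90))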

TF-IDF: Term-Frequency, Inverse Document Frequency

  • The document term matrix only contains the counts of each word in each document.

  • This is not the most informative measure of how important a word is in a document.

  • We can construct a much better measure by weighting the term frequencies by a metric of how important the term is across the corpus.

TF-IDF: Term-Frequency, Inverse Document Frequency

Term frequency: the number of times term \(t\) appears in document \(d\), normalized by the document's length \(w_{d}\):

\[TF_{t,d} = \frac{\sum_{i=1}^{w_{d}}1(w_{i} = t)}{w_{d}}\]

Inverse document frequency: measures the importance of a term in a corpus. It is the log of the total number of documents \(N\) divided by the number of documents containing the term \(t\):

\[IDF_{t} = \ln\left(\frac{N}{\sum_{j=1}^{N}1(t \in d_{j})}\right)\]

TF-IDF Matrix

\[TF\text{-}IDF_{t,d} = TF_{t,d} \times IDF_{t}\] \[TF\text{-}IDF \in \mathbb{R}^{d \times w}\]
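
As a quick worked example: if a term appears 2 times in a 10-word document and occurs in 5 of a corpus's 100 documents, then

\[ TF = \frac{2}{10} = 0.2, \qquad IDF = \ln\left(\frac{100}{5}\right) \approx 3.0, \qquad TF\text{-}IDF \approx 0.2 \times 3.0 = 0.6 \]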

TF-IDF Matrix

dtm <- DocumentTermMatrix(newcorpus, control = list(weighting = weightTfIdf))
## Warning in weighting(x): empty document(s): 2513
dtm <- removeSparseTerms(dtm, 0.95)
inspect(dtm[1:5,4:8])
## <<DocumentTermMatrix (documents: 5, terms: 5)>>
## Non-/sparse entries: 5/20
## Sparsity           : 80%
## Maximal term length: 7
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
## 
##     Terms
## Docs       get     great   hillari just make
##    1 0.0000000 0.0000000 0.2606815    0    0
##    2 0.0000000 0.0000000 0.0000000    0    0
##    3 0.0000000 0.0000000 0.2406291    0    0
##    4 0.2921460 0.4064887 0.0000000    0    0
##    5 0.3146187 0.0000000 0.0000000    0    0
  • As shown above, we can easily construct this in R by passing weightTfIdf as the weighting option to DocumentTermMatrix.

Some fun with words

summary(trumptweets$Retweets)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3030    5424    7597    9366  349900
# "Viral" = above the 3rd quartile; "non-viral" = below the 1st quartile
viraltweets <- ifelse(trumptweets$Retweets > 9366, 1, 0)
nonviraltweets <- ifelse(trumptweets$Retweets < 3030, 1, 0)
  • Suppose we are interested in figuring out what makes a tweet go viral.

  • We can explore differences in word usage between high-retweet and low-retweet tweets.

Plot a word cloud

Wordcloud for High Retweet Trump Tweets

Plot a word cloud

Wordcloud for Low Retweet Trump Tweets
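
The clouds themselves appear as images in the slides; a minimal sketch of how such a figure could be generated with the wordcloud package, assuming the dtm and viraltweets objects from the earlier slides:

library(wordcloud)
# Sum each term's weight over the high-retweet tweets only
highfreq <- colSums(as.matrix(dtm)[which(viraltweets == 1), ])
wordcloud(words = names(highfreq), freq = highfreq,
          max.words = 50, random.order = FALSE)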

Supervised machine learning with text data

  • The purpose of all of these steps was to prepare us to build classifiers using supervised machine learning methods.

  • Recall that supervised machine learning methods are based upon human classification of data.

  • The overall goal of supervised machine learning methods is to minimize both the variance and bias of a classifier.

  • In other words, we want to produce a classifier that performs as well as possible according to an objective standard.

Step back - Assessing the performance of classifiers

  • Imagine that we built a classifier to predict which tweets are likely to "go viral."

  • Such a classifier can make two types of errors:
  1. It can incorrectly classify a tweet as one that will "go viral" when it does not go viral (a false positive).

  2. It can incorrectly classify a tweet as one that will "not go viral" when it does go viral (a false negative).

Confusion matrix

                 Actual: 1         Actual: 0
  Predicted: 1   true positive     false positive
  Predicted: 0   false negative    true negative

  • A standard way to display this information is a "confusion matrix," like the one shown above.

  • If you are writing a paper using a classifier, always include the confusion matrix.

Sensitivity, specificity and accuracy

  • Class-specific performance is a very important aspect of evaluating classifiers.

  • Ideally, you want to keep both false negatives and false positives as low as possible.

Sensitivity, specificity and accuracy

  • Accuracy: the % of documents that are correctly classified: \[ \frac{\text{# of docs correctly classified}}{\text{# of docs classified}} \]

  • Sensitivity: the % of actual positives that are correctly identified: \[ \frac{\text{# of true positives}}{\text{# of actual positives}} \]

  • Specificity: the % of actual negatives that are correctly identified: \[ \frac{\text{# of true negatives}}{\text{# of actual negatives}} \]
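
For concreteness: with 90 true positives, 10 false negatives, 20 false positives, and 880 true negatives out of 1,000 documents,

\[ \text{Accuracy} = \frac{90 + 880}{1000} = 0.97, \quad \text{Sensitivity} = \frac{90}{90 + 10} = 0.90, \quad \text{Specificity} = \frac{880}{880 + 20} \approx 0.98 \]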

Sensitivity, specificity and accuracy

  • It is easy to have a very high accuracy rate, and high sensitivity or specificity, and still have a poor classifier: with imbalanced classes, always predicting the majority class yields high accuracy.

  • Lesson from classifying violence in religious texts.

Classifying Text Data With Logistic Regression

\[ \text{logit}(E[C \mid W, X]) = \theta_{0} + \theta_{1}w_{1} + \cdots + \theta_{n}w_{n} \]

  • Once we have created the Document-Term Matrix, it's easy to perform text classification using logistic regression.

  • The dependent variable is the class label and the independent variables are the words and other features of the document.
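
With a modest number of term columns, this can be fit with base R's glm; a minimal sketch, assuming the reduced dtm and the viraltweets label from the earlier slides (the next slides explain why plain logistic regression is fragile with many terms):

# Hypothetical unregularized fit on the term features
df <- data.frame(viral = viraltweets, as.matrix(dtm))
fit <- glm(viral ~ ., data = df, family = binomial)
summary(fit)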

Classifying Text Data With Logistic Regression

\[ P(C_{i} = 1 \mid W_{i}, X_{i}) = \frac{\exp(\theta^{T} W_{i})}{1 + \exp(\theta^{T} W_{i})} \] Assign the class label \(C_{k}\) such that \[ C_{k} = \arg\max_{k} P(C_{i} = k \mid W_{i}, X_{i}) \]

Sparse Logistic Regression

\[ \arg\min_{\theta} \sum_{i=1}^{m} \log\left[1+\exp(-c_{i}h_{\theta}(w_{i}))\right] + \lambda \sum_{j=1}^{n} \theta_{j}^{2} \]

where \(c_{i} \in \{-1,+1\}\) is the class label and \(h_{\theta}(w) = \theta^{T}w\).

  • Unfortunately, ordinary logistic regression breaks down when the number of observations \(m\) is close to (or smaller than) the number of parameter estimates \(n\).

  • This is often the case when we are dealing with text data.

  • This issue can be solved, however, by adding a regularization penalty to the logistic regression cost function (above, an \(L_{2}\) ridge penalty; an \(L_{1}\) LASSO-type penalty also works and additionally shrinks coefficients exactly to zero).

  • Regularized logistic regression is often called sparse logistic regression.

Sparse Logistic Regression With Trump's Tweets

  • Assume our outcome is whether a tweet will go viral or not.

  • Let's use only the terms of the TF-IDF matrix as features.

Sparse Logistic Regression With Trump's Tweets

Step 1: Divide the data into training and test sets.

# Randomly assign half of the tweets to the training set
train <- sample(1:dim(trumptweets)[1],
                dim(trumptweets)[1]*0.5)
# Convert the document-term matrix into an ordinary matrix
dtm_mat <- as.matrix(dtm)
trainX <- dtm_mat[train, ]
testX <- dtm_mat[-train, ]
trainY <- viraltweets[train]
testY <- viraltweets[-train]

Sparse Logistic Regression With Trump's Tweets

Step 2: Train the model on the training set. This is a regularized logistic regression model with an \(L_{2}\) penalty.

library(LiblineaR)
# type=7 selects L2-regularized logistic regression
m <- LiblineaR(data=trainX, target=trainY,
               type=7, bias=TRUE, verbose=FALSE)

Sparse Logistic Regression With Trump's Tweets

Step 3: Make a prediction using the test data.

# Predicted class labels for the held-out tweets
p <- predict(m, testX)

Sparse Logistic Regression With Trump's Tweets

Step 4: Display the confusion matrix.

# Rows are predicted classes; columns are actual classes
confusion <- table(p$predictions, testY)
confusion
##    testY
##        0    1
##   0 1433  459
##   1   36   31

Sparse Logistic Regression With Trump's Tweets

Step 5: Calculate accuracy, sensitivity, and specificity.

# Compute the metrics directly from the confusion matrix,
# treating "viral" (1) as the positive class
accuracy <- sum(diag(confusion))/sum(confusion)
accuracy
## [1] 0.7473201
sensitivity <- confusion["1","1"]/sum(confusion[,"1"])
sensitivity
## [1] 0.06326531
specificity <- confusion["0","0"]/sum(confusion[,"0"])
specificity
## [1] 0.9754935

  • Note the very low sensitivity: the classifier almost never catches the tweets that actually go viral, even though the overall accuracy looks respectable. This is exactly the pitfall discussed earlier.