Learned how to acquire text data using APIs.
Learned how to clean and prepare text data for analysis.
Built a document-term matrix.
library(tm)

# Clean a list of tweets and return a tm corpus ready for analysis
text_cleaner <- function(corpus) {
  tempcorpus <- lapply(corpus, function(t) t$getText())  # extract the raw tweet text
  tempcorpus <- lapply(tempcorpus, tolower)              # lowercase everything
  tempcorpus <- Corpus(VectorSource(tempcorpus))
  tempcorpus <- tm_map(tempcorpus, removePunctuation)
  tempcorpus <- tm_map(tempcorpus, stripWhitespace)
  tempcorpus <- tm_map(tempcorpus, removeNumbers)
  tempcorpus <- tm_map(tempcorpus, removeWords, stopwords("english"))
  tempcorpus <- tm_map(tempcorpus, stemDocument)         # reduce words to stems (requires SnowballC)
  return(tempcorpus)
}
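As a usage sketch (assuming the tweets were pulled with the twitteR package, so that tweets is a hypothetical list of status objects exposing $getText()):

# Hypothetical usage: tweets is a list of twitteR status objects
newcorpus <- text_cleaner(tweets)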
Need to go from texts \(\rightarrow\) numbers.
Let \(d\) be the number of documents in a corpus and \(w\) the number of unique words (terms) in the corpus.
Create a matrix \(\Delta \in \mathbb{N}^{d \times w}\).
Rows are documents, columns are words.
This is called the document-term matrix.
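A minimal sketch of constructing the document-term matrix with the tm package, assuming newcorpus is the cleaned corpus returned by text_cleaner():

# Build the document-term matrix: one row per document, one column per term
dtm <- DocumentTermMatrix(newcorpus)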
inspect(dtm[1:5, 1:5])
## <<DocumentTermMatrix (documents: 5, terms: 5)>>
## Non-/sparse entries: 0/25
## Sparsity           : 100%
## Maximal term length: 14
## Weighting          : term frequency (tf)
## 
##     Terms
## Docs abandon abc abcdonaldtrump abcpolit abdeslam
##    1       0   0              0        0        0
##    2       0   0              0        0        0
##    3       0   0              0        0        0
##    4       0   0              0        0        0
##    5       0   0              0        0        0
newcorpus[[2]]$content
## [1] " done even better elect possibl winner base popular vote campaign differ"
In numerical analysis, a sparse matrix is a matrix in which most of the elements are zero.
Matrices in text analysis problems tend to be very sparse.
This implies that they have many parameters that are uninformative.
Sparsity can be reduced by removing terms that occur very infrequently (i.e., terms that appear in only a few documents).
This tends to have the effect of both reducing overfitting and improving the predictive abilities of the model.
dtm <- removeSparseTerms(dtm, 0.95)
dtm
## <<DocumentTermMatrix (documents: 3917, terms: 16)>>
## Non-/sparse entries: 5676/56996
## Sparsity           : 91%
## Maximal term length: 21
## Weighting          : term frequency (tf)
The document-term matrix only contains the raw count of each word in each document.
This is not the most informative measure of how important a word is in a document.
We can construct a much better measure by weighting the term frequencies by a metric of how important a term is in the corpus.
Term frequency measures how often term \(t\) appears in document \(d\), as a share of the \(w_{d}\) words in the document:
\[TF_{t,d} = \frac{\sum_{i=1}^{w_{d}}1(w_{i} = t)}{w_{d}}\]
Inverse document frequency measures the importance of a term in a corpus. It is the log of the total number of documents \(N\) divided by the number of documents containing the term \(t\):
\[IDF_{t} = \ln\left(\frac{N}{\sum_{j=1}^{N}1(t \in d_{j})}\right)\]
The TF-IDF weight is the product of the two:
\[TF\text{-}IDF_{t,d} = TF_{t,d} \times IDF_{t}\]
\[TF\text{-}IDF \in \mathbb{R}^{d \times w}\]
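To make the formulas concrete, here is a toy calculation for a hypothetical three-document corpus in which the term "vote" appears once in a four-word document and occurs in two of the three documents:

tf <- 1/4        # term appears 1 time among the 4 words of the document
idf <- log(3/2)  # N = 3 documents, 2 of them contain the term
tf * idf         # TF-IDF weight: about 0.101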
dtm <- DocumentTermMatrix(newcorpus, control = list(weighting = weightTfIdf))
## Warning in weighting(x): empty document(s): 2513
dtm <- removeSparseTerms(dtm, 0.95)
inspect(dtm[1:5, 4:8])
## <<DocumentTermMatrix (documents: 5, terms: 5)>>
## Non-/sparse entries: 5/20
## Sparsity           : 80%
## Maximal term length: 7
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
## 
##     Terms
## Docs       get     great   hillari just make
##    1 0.0000000 0.0000000 0.2606815    0    0
##    2 0.0000000 0.0000000 0.0000000    0    0
##    3 0.0000000 0.0000000 0.2406291    0    0
##    4 0.2921460 0.4064887 0.0000000    0    0
##    5 0.3146187 0.0000000 0.0000000    0    0
summary(trumptweets$Retweets)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3030    5424    7597    9366  349900
# Viral: retweets above the 3rd quartile; non-viral: below the 1st quartile
viraltweets <- ifelse(trumptweets$Retweets > 9366, 1, 0)
nonviraltweets <- ifelse(trumptweets$Retweets < 3030, 1, 0)
Let's say we were interested in trying to figure out what makes a tweet go viral.
We can explore the difference in word usage between tweets with high retweet counts and those with low retweet counts.
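One quick way to do this (a sketch, assuming dtm is the TF-IDF-weighted document-term matrix built above) is to compare average term weights across the two groups:

# Average TF-IDF weight of each term among viral vs. non-viral tweets
dtm_mat <- as.matrix(dtm)
viral_means <- colMeans(dtm_mat[viraltweets == 1, ])
nonviral_means <- colMeans(dtm_mat[nonviraltweets == 1, ])
# Terms with the largest usage gap between the two groups
sort(viral_means - nonviral_means, decreasing = TRUE)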
The purpose of all of these steps was to prepare us to build classifiers using supervised machine learning methods.
Recall that supervised machine learning methods are based upon human classification of data.
The overall goal of supervised machine learning methods is to minimize both the variance and bias of a classifier.
In other words we want to produce a classifier that produces the best results according to an objective standard.
Imagine that we built a classifier to figure out which tweets were likely to "go viral"
It can incorrectly classify a tweet as one that will "go viral" when it does not go viral (false positive).
It can incorrectly classify a tweet as one that will "not go viral" when it does go viral (false negative).
\begin{tabular}{l|cc}
 & \multicolumn{2}{c}{Actual} \\
Predicted & 1 & 0 \\
\hline
1 & true positive & false positive \\
0 & false negative & true negative \\
\end{tabular}

A means of displaying this information is a "confusion matrix" like the one shown above.
If you are writing a paper using a classifier, always include the confusion matrix.
Class specific performance is a very important aspect of classifiers.
Ideally, you want to keep both false negatives and false positives as low as possible.
Accuracy - % of documents that are correctly classified: \[ \frac{\text{# docs correctly classified}}{\text{# of docs classified}} \]
Sensitivity is the % of positives that are correctly identified: \[ \frac{\text{# of true positives}}{\text{# of actual positives}} \]
Specificity is the % of negatives that are correctly identified: \[ \frac{\text{# of true negatives}}{\text{# of actual negatives}} \]
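These three quantities can be read straight off a confusion matrix. A small helper (a sketch, assuming a 2x2 table with rows = predicted class and columns = actual class, each ordered 0 then 1):

# Compute accuracy, sensitivity, and specificity from a 2x2 confusion table
classifier_metrics <- function(confusion) {
  tn <- confusion[1, 1]; fn <- confusion[1, 2]
  fp <- confusion[2, 1]; tp <- confusion[2, 2]
  c(accuracy = (tp + tn) / sum(confusion),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}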
It is very easy to have a high accuracy rate and high sensitivity (or specificity) but a poor classifier. For example, if only 5% of tweets go viral, a classifier that always predicts "not viral" achieves 95% accuracy and 100% specificity, yet 0% sensitivity.
\[ \text{logit}(E[C \mid W,X]) = \theta_{0} + \theta_{1}w_{1} + \cdots + \theta_{n}w_{n} \]
Once we have created the Document-Term Matrix, it's easy to perform text classification using logistic regression.
The dependent variable is the class label and the independent variables are the words and other features of the document.
\[ P(C_{i} = 1 \mid W_{i},X_{i}) = \frac{\exp(\theta W_{i})}{1 + \exp(\theta W_{i})} \]
Assign the class label \(C_{k}\) such that
\[ C_{k} = \arg\max_{k} P(C_{i} = k \mid W_{i},X_{i}) \]
\[ \arg\min_{\theta} \sum_{i=1}^{m} \log\left[1+\exp\left(-c_{i}h_{\theta}(w_{i})\right)\right] + \lambda \sum_{j=1}^{n} \theta_{j}^{2} \]
Unfortunately, ordinary logistic regression breaks down when the number of parameter estimates \(p\) approaches the number of observations \(m\).
This is often the case when we are dealing with text data.
This issue can be solved, however, by adding a regularization term to the logistic regression cost function (the penalty shown above is the ridge, or \(L_{2}\), penalty; the LASSO uses an \(L_{1}\) penalty instead).
Regularized logistic regression applied to high-dimensional text features is often called sparse logistic regression.
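For illustration only (this is not the approach used below, which relies on LiblineaR), an \(L_{1}\)-penalized LASSO logistic regression can be fit with the glmnet package; x and y here are hypothetical placeholders for a document-term feature matrix and a 0/1 label vector:

library(glmnet)
# alpha = 1 gives the LASSO (L1) penalty; cv.glmnet chooses lambda by cross-validation
fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
predict(fit, newx = x, s = "lambda.min", type = "class")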
Assume our outcome is whether a tweet will go viral or not.
Let's use only the terms of the TF-IDF matrix as features.
Step 1: Divide the data into training and testing.
# Randomly split the observations 50/50 into training and test sets
train <- sample(1:dim(trumptweets)[1], dim(trumptweets)[1] * 0.5)
dtm_mat <- as.matrix(dtm)
trainX <- dtm_mat[train, ]
testX <- dtm_mat[-train, ]
trainY <- viraltweets[train]
testY <- viraltweets[-train]
Step 2: Train the model on the training set. This is an \(L_{2}\)-regularized logistic regression model.
library(LiblineaR)
# type = 7: L2-regularized logistic regression (dual)
m <- LiblineaR(data = trainX, target = trainY, type = 7, bias = TRUE, verbose = FALSE)
Step 3: Make a prediction using the test data.
p=predict(m,testX)
Step 4: Display the confusion matrix.
confusion <- table(p$predictions, testY)
confusion
##    testY
##        0    1
##   0 1438  439
##   1   43   39
Step 5: Calculate accuracy, sensitivity, and specificity.
accuracy <- (1438 + 39) / sum(confusion)
accuracy
## [1] 0.7539561
sensitivity <- 39 / (439 + 39)
sensitivity
## [1] 0.08158996
specificity <- 1438 / (1438 + 43)
specificity
## [1] 0.9709656