Learned how to acquire text data using APIs.
Learned how to clean and prepare text data for analysis.
Built a document-term matrix.
library(tm)

# Clean a list of tweets and return a tm corpus ready for analysis
text_cleaner <- function(corpus) {
  tempcorpus <- lapply(corpus, function(t) t$getText())  # extract the raw tweet text
  tempcorpus <- lapply(tempcorpus, tolower)              # lowercase everything
  tempcorpus <- Corpus(VectorSource(tempcorpus))
  tempcorpus <- tm_map(tempcorpus, removePunctuation)
  tempcorpus <- tm_map(tempcorpus, stripWhitespace)
  tempcorpus <- tm_map(tempcorpus, removeNumbers)
  tempcorpus <- tm_map(tempcorpus, removeWords, stopwords("english"))
  tempcorpus <- tm_map(tempcorpus, stemDocument)         # reduce words to stems (requires SnowballC)
  return(tempcorpus)
}
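As a usage sketch (assuming the tweets were pulled with the twitteR package, so that tweets is a hypothetical list of status objects exposing $getText()):

# Hypothetical usage: tweets is a list of twitteR status objects
newcorpus <- text_cleaner(tweets)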
Need to go from texts \(\rightarrow\) numbers.
Let \(d\) be the number of documents in a corpus and \(w\) the number of unique words (terms) in the corpus.
Create a matrix \(\Delta \in \mathbb{N}^{d \times w}\).
Rows are documents, columns are words.
This is called the document-term matrix.
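A minimal sketch of constructing the document-term matrix with the tm package, assuming newcorpus is the cleaned corpus returned by text_cleaner():

# Build the document-term matrix: one row per document, one column per term
dtm <- DocumentTermMatrix(newcorpus)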
inspect(dtm[1:5, 1:5])
## <<DocumentTermMatrix (documents: 5, terms: 5)>>
## Non-/sparse entries: 0/25
## Sparsity           : 100%
## Maximal term length: 14
## Weighting          : term frequency (tf)
## 
##     Terms
## Docs abandon abc abcdonaldtrump abcpolit abdeslam
##    1       0   0              0        0        0
##    2       0   0              0        0        0
##    3       0   0              0        0        0
##    4       0   0              0        0        0
##    5       0   0              0        0        0
newcorpus[[2]]$content
## [1] " done even better elect possibl winner base popular vote campaign differ"
In numerical analysis, a sparse matrix is a matrix in which most of the elements are zero.
Matrices in text analysis problems tend to be very sparse.
This implies that they have many parameters that are uninformative.
Sparsity can be reduced by removing terms that occur very infrequently (i.e., terms that appear in only a few documents).
This tends to have the effect of both reducing overfitting and improving the predictive abilities of the model.
dtm <- removeSparseTerms(dtm, 0.95)
dtm
## <<DocumentTermMatrix (documents: 3917, terms: 16)>>
## Non-/sparse entries: 5676/56996
## Sparsity           : 91%
## Maximal term length: 21
## Weighting          : term frequency (tf)
The document-term matrix only contains the raw count of each word in each document.
This is not the most informative measure of how important a word is in a document.
We can construct a much better measure by weighting the term frequencies by a metric of how important a term is in the corpus.
Term frequency measures how often term \(t\) appears in document \(d\), as a share of the \(w_{d}\) words in the document:
\[TF_{t,d} = \frac{\sum_{i=1}^{w_{d}}1(w_{i} = t)}{w_{d}}\]
Inverse document frequency measures the importance of a term in a corpus. It is the log of the total number of documents \(N\) divided by the number of documents containing the term \(t\):
\[IDF_{t} = \ln\left(\frac{N}{\sum_{j=1}^{N}1(t \in d_{j})}\right)\]
The TF-IDF weight is the product of the two:
\[TF\text{-}IDF_{t,d} = TF_{t,d} \times IDF_{t}\]
\[TF\text{-}IDF \in \mathbb{R}^{d \times w}\]
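To make the formulas concrete, here is a toy calculation for a hypothetical three-document corpus in which the term "vote" appears once in a four-word document and occurs in two of the three documents:

tf <- 1/4        # term appears 1 time among the 4 words of the document
idf <- log(3/2)  # N = 3 documents, 2 of them contain the term
tf * idf         # TF-IDF weight: about 0.101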
dtm <- DocumentTermMatrix(newcorpus, control = list(weighting = weightTfIdf))
## Warning in weighting(x): empty document(s): 2513
dtm <- removeSparseTerms(dtm, 0.95)
inspect(dtm[1:5, 4:8])
## <<DocumentTermMatrix (documents: 5, terms: 5)>>
## Non-/sparse entries: 5/20
## Sparsity           : 80%
## Maximal term length: 7
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
## 
##     Terms
## Docs       get     great   hillari just make
##    1 0.0000000 0.0000000 0.2606815    0    0
##    2 0.0000000 0.0000000 0.0000000    0    0
##    3 0.0000000 0.0000000 0.2406291    0    0
##    4 0.2921460 0.4064887 0.0000000    0    0
##    5 0.3146187 0.0000000 0.0000000    0    0
summary(trumptweets$Retweets)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3030    5424    7597    9366  349900
# Viral: retweets above the 3rd quartile; non-viral: below the 1st quartile
viraltweets <- ifelse(trumptweets$Retweets > 9366, 1, 0)
nonviraltweets <- ifelse(trumptweets$Retweets < 3030, 1, 0)
Let's say we were interested in trying to figure out what makes a tweet go viral.
We can explore the difference in word usage between tweets with high retweet counts and those with low retweet counts.
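One quick way to do this (a sketch, assuming dtm is the TF-IDF-weighted document-term matrix built above) is to compare average term weights across the two groups:

# Average TF-IDF weight of each term among viral vs. non-viral tweets
dtm_mat <- as.matrix(dtm)
viral_means <- colMeans(dtm_mat[viraltweets == 1, ])
nonviral_means <- colMeans(dtm_mat[nonviraltweets == 1, ])
# Terms with the largest usage gap between the two groups
sort(viral_means - nonviral_means, decreasing = TRUE)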
The purpose of all of these steps was to prepare us to build classifiers using supervised machine learning methods.
Recall that supervised machine learning methods are based upon human classification of data.
The overall goal of supervised machine learning methods is to minimize both the variance and bias of a classifier.
In other words we want to produce a classifier that produces the best results according to an objective standard.
Imagine that we built a classifier to figure out which tweets were likely to "go viral"
It can incorrectly classify a tweet as one that will "go viral" when it does not go viral (false positive).
It can incorrectly classify a tweet as one that will "not go viral" when it does go viral (false negative).
\begin{tabular}{l|cc}
 & \multicolumn{2}{c}{Actual} \\
Predicted & 1 & 0 \\
\hline
1 & true positive & false positive \\
0 & false negative & true negative \\
\end{tabular}

A means of displaying this information is a "confusion matrix" like the one shown above.
If you are writing a paper using a classifier, always include the confusion matrix.
Class specific performance is a very important aspect of classifiers.
Ideally, you want to keep both false negatives and false positives as low as possible.
Accuracy - % of documents that are correctly classified: \[ \frac{\text{# docs correctly classified}}{\text{# of docs classified}} \]
Sensitivity is the % of positives that are correctly identified: \[ \frac{\text{# of true positives}}{\text{# of actual positives}} \]
Specificity is the % of negatives that are correctly identified: \[ \frac{\text{# of true negatives}}{\text{# of actual negatives}} \]
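These three quantities can be read straight off a confusion matrix. A small helper (a sketch, assuming a 2x2 table with rows = predicted class and columns = actual class, each ordered 0 then 1):

# Compute accuracy, sensitivity, and specificity from a 2x2 confusion table
classifier_metrics <- function(confusion) {
  tn <- confusion[1, 1]; fn <- confusion[1, 2]
  fp <- confusion[2, 1]; tp <- confusion[2, 2]
  c(accuracy = (tp + tn) / sum(confusion),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}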
It is very easy to have a high accuracy rate and high sensitivity (or specificity) but a poor classifier. For example, if only 5% of tweets go viral, a classifier that always predicts "not viral" achieves 95% accuracy and 100% specificity, yet 0% sensitivity.
\[ \text{logit}(E[C \mid W,X]) = \theta_{0} + \theta_{1}w_{1} + \cdots + \theta_{n}w_{n} \]
Once we have created the Document-Term Matrix, it's easy to perform text classification using logistic regression.
The dependent variable is the class label and the independent variables are the words and other features of the document.
\[ P(C_{i} = 1 \mid W_{i},X_{i}) = \frac{\exp(\theta W_{i})}{1 + \exp(\theta W_{i})} \]
Assign the class label \(C_{k}\) such that
\[ C_{k} = \arg\max_{k} P(C_{i} = k \mid W_{i},X_{i}) \]
\[ \arg\min_{\theta} \sum_{i=1}^{m} \log\left[1+\exp\left(-c_{i}h_{\theta}(w_{i})\right)\right] + \lambda \sum_{j=1}^{n} \theta_{j}^{2} \]
Unfortunately, ordinary logistic regression breaks down when the number of parameter estimates \(p\) approaches the number of observations \(m\).
This is often the case when we are dealing with text data.
This issue can be solved, however, by adding a regularization term to the logistic regression cost function (the penalty shown above is the ridge, or \(L_{2}\), penalty; the LASSO uses an \(L_{1}\) penalty instead).
Regularized logistic regression applied to high-dimensional text features is often called sparse logistic regression.
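For illustration only (this is not the approach used below, which relies on LiblineaR), an \(L_{1}\)-penalized LASSO logistic regression can be fit with the glmnet package; x and y here are hypothetical placeholders for a document-term feature matrix and a 0/1 label vector:

library(glmnet)
# alpha = 1 gives the LASSO (L1) penalty; cv.glmnet chooses lambda by cross-validation
fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
predict(fit, newx = x, s = "lambda.min", type = "class")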
Assume our outcome is whether a tweet will go viral or not.
Let's use only the terms of the TF-IDF matrix as features.
Step 1: Divide the data into training and testing.
# Randomly split the observations 50/50 into training and test sets
train <- sample(1:dim(trumptweets)[1], dim(trumptweets)[1] * 0.5)
dtm_mat <- as.matrix(dtm)
trainX <- dtm_mat[train, ]
testX <- dtm_mat[-train, ]
trainY <- viraltweets[train]
testY <- viraltweets[-train]
Step 2: Train the model on the training set. This is an \(L_{2}\)-regularized logistic regression model.
library(LiblineaR)
# type = 7: L2-regularized logistic regression (dual)
m <- LiblineaR(data = trainX, target = trainY, type = 7, bias = TRUE, verbose = FALSE)
Step 3: Make a prediction using the test data.
p=predict(m,testX)
Step 4: Display the confusion matrix.
confusion <- table(p$predictions, testY)
confusion
##    testY
##        0    1
##   0 1438  439
##   1   43   39
Step 5: Calculate accuracy, sensitivity, and specificity.
accuracy <- (1438 + 39) / sum(confusion)
accuracy
## [1] 0.7539561
sensitivity <- 39 / (439 + 39)
sensitivity
## [1] 0.08158996
specificity <- 1438 / (1438 + 43)
specificity
## [1] 0.9709656