Learned how to build a TF-IDF matrix.
Ran one type of supervised learning classifier: regularized logistic regression.
3/16/2017
More supervised learning with text: Naive Bayes and Support Vector Machines.
Cross-validation for supervised machine learning.
Generative classifiers: learn a model of the joint probability \(p(x,y)\), use Bayes' rule to calculate the posterior \(p(y|x)\), and then pick the most likely label \(y\) (e.g., Naive Bayes and Bayesian methods in general).
Discriminative classifiers: model the posterior \(p(y|x)\) directly. (logistic regression, SVMs etc.)
Ng and Jordan (2002) find that generative classifiers (naive Bayes) approach their asymptotic error faster than discriminative classifiers (logistic regression), so they can perform better when training data are scarce, even though their asymptotic error tends to be higher.
In my personal experience, the results have been mixed.
\[ P(C = k|D) = \frac{P(D|C = k)P(C=k)}{P(D)} \]
Given a document D, we want to figure out the probability of the document belonging to a class C.
We can do this by using Bayes' theorem to directly calculate class probabilities given the words in a document.
\[P(C = k|D)\] - is known as the posterior
\[P(D |C = k)\] - is known as the likelihood
\[P(C = k)\] - is known as the prior
\[P(D)\] - is known as the marginal likelihood or evidence.
\[ \pi(C | D) = \frac{f_{D|C}(D|C)\pi(C)}{\int_{\Theta} f_{D|C}(D|C)\pi(C)\,dC} \]
\[ P(C = 1|D) = \frac{P(w_{1} \cap w_{2} \cap \cdots \cap w_{W} | C = 1) P(C = 1)}{P(w_{1} \cap w_{2} \cap \cdots \cap w_{W})} \]
Likelihood: \[P(D|C = 1) = \prod_{i=1}^W P(w_{i}|C =1)\]
Prior: \[P(C = 1)= \frac{\# D \in C_{1}}{\# D \in C_{1},C_{2}}\]
Marginal likelihood: \[ P(D) = \prod_{i=1}^W P(w_{i}) \]
If we assume that the words are independent conditional on a document class then:
\[ P(C = 1|D) = \frac{[P(w_{1}|C=1)P(w_{2}|C=1)\cdots P(w_{W}| C = 1)] P(C = 1)}{P(w_{1})P(w_{2})\cdots P(w_{W})} \]
\[P(w_{i} | C = 1) = \frac{\# w_{i} \in C_{1}}{\# \mathbf{w} \in C_{1}}\]
\[P(C = 1)= \frac{\# D \in C_{1}}{\# D \in C_{1},C_{2}}\]
\[P(w_{i})= \frac{\# w_{i} \in C_{1},C_{2}}{\# \mathbf{w} \in C_{1},C_{2}}\]
\[ \hat{C} = \arg\max_{k} P(C = k)\prod_{i=1}^W P(w_{i}|C =k) \] - For classification purposes, we can ignore the marginal likelihood, which is the same for every class, and assign classes based on the likelihood and the prior.
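\[ P(C = k | D) > \frac{1}{K}\] - With \(K\) classes, the most probable class has a posterior of at least \(1/K\); with two classes, we assign \(C = k\) when its posterior exceeds \(1/2\).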
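The classification rule above can be sketched in a few lines of base R. This is a minimal illustration, not the lecture's tweet analysis: the word counts, the document counts, and the names `word_counts`, `docs_per_class`, and `new_doc` are all made up for the example. The posterior here is obtained by normalizing the scores over the classes.

```r
# Toy word counts per class (hypothetical numbers, not from the tweet data).
word_counts <- rbind(
  viral    = c(tax = 6, win = 3, fake = 1),
  nonviral = c(tax = 2, win = 2, fake = 6)
)
docs_per_class <- c(viral = 4, nonviral = 6)

# Prior: share of training documents in each class.
prior <- docs_per_class / sum(docs_per_class)

# Likelihood of each word given the class:
# count of the word in the class / total words in the class.
likelihood <- word_counts / rowSums(word_counts)

# Score a new document containing "tax" and "win":
# prior times the product of the word likelihoods, class by class.
new_doc <- c("tax", "win")
score <- prior * apply(likelihood[, new_doc, drop = FALSE], 1, prod)

# Normalizing over classes gives posteriors; the argmax is unchanged.
posterior <- score / sum(score)
predicted <- names(which.max(score))
```

With these toy numbers the scores are 0.072 for `viral` and 0.024 for `nonviral`, so the document is assigned to `viral` with a posterior of 0.75.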
Words with zero probability can seriously damage the performance of the classifier.
To correct this problem we implement a Laplace smoother to ensure that there are no zero probability words.
This amounts to adding 1 to each word count, so the denominator grows by the vocabulary size \(|V|\); e.g.,
\[P(w_{i} | C = 1) = \frac{(\# w_{i} \in C_{1}) + 1}{(\# \mathbf{w} \in C_{1}) + |V|}\]
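A small base R sketch of the zero-probability problem and the smoother, assuming the standard add-one form in which 1 is added to every word count and the denominator therefore grows by the vocabulary size. The counts are hypothetical.

```r
# Hypothetical word counts for one class; "zero" never appears in it.
counts <- c(tax = 6, win = 4, zero = 0)

# Unsmoothed estimate: the unseen word gets probability 0,
# which zeroes out the whole product of word likelihoods.
unsmoothed <- counts / sum(counts)

# Add-one (Laplace) smoothing: add 1 to every count, so the
# denominator grows by the vocabulary size V.
V <- length(counts)
smoothed <- (counts + 1) / (sum(counts) + V)
```

The smoothed estimates are 7/13, 5/13, and 1/13: every word has a positive probability and the estimates still sum to 1.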
Recent Tweet from @POTUS: "We are going to reduce your taxes big league…I want to start that process so quickly…We've got to start the tax reductions."
Cleaned Tweet: "reduc tax big league start proces quick start tax reduc."
summary(trumptweets$Retweets)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       0    3030    5424    7597    9366  349900
viraltweets <- ifelse(trumptweets$Retweets > 9366, 1, 0)
nonviraltweets <- ifelse(trumptweets$Retweets < 3030, 1, 0)
Let's say we were interested in trying to figure out what makes a tweet go viral.
We explore the difference in word usage between high retweet rate tweets and low retweet rate tweets.
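library(tm)  # stemDocument also needs the SnowballC package installed

# Clean a collection of tweets and return a tm corpus.
# Set rawtext = TRUE when `corpus` holds status objects with a getText() method.
text_cleaner <- function(corpus, rawtext) {
  # Extract the text first; after toString() the elements are plain strings.
  if (rawtext == TRUE) {
    corpus <- lapply(corpus, function(t) t$getText())
  }
  tempcorpus <- lapply(corpus, toString)
  # Drop characters that cannot be converted.
  for (i in seq_along(tempcorpus)) {
    tempcorpus[[i]] <- iconv(tempcorpus[[i]], "ASCII", "UTF-8", sub = "")
  }
  tempcorpus <- lapply(tempcorpus, tolower)
  tempcorpus <- Corpus(VectorSource(tempcorpus))
  tempcorpus <- tm_map(tempcorpus, removePunctuation)
  tempcorpus <- tm_map(tempcorpus, removeNumbers)
  tempcorpus <- tm_map(tempcorpus, removeWords, stopwords("english"))
  tempcorpus <- tm_map(tempcorpus, stemDocument)
  tempcorpus <- tm_map(tempcorpus, stripWhitespace)
  return(tempcorpus)
}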
train <- sample(1:dim(trumptweets)[1], dim(trumptweets)[1] * 0.75)
trainX <- dtm[train, ]
testX <- dtm[-train, ]
trainY <- viraltweets[train]
testY <- viraltweets[-train]
ten_words<- findFreqTerms(trainX,10)
ten_words[(length(ten_words)-20):length(ten_words)]
##  [1] "win"       "winner"    "wisconsin" "without"   "woman"
##  [6] "women"     "won"       "wonder"    "wont"      "word"
## [11] "work"      "world"     "wors"      "worst"     "wow"
## [16] "wrong"     "year"      "yesterday" "yet"       "york"
## [21] "zero"
fword_train <- DocumentTermMatrix(newcorpus[train], control = list(dictionary = ten_words))
fword_test <- DocumentTermMatrix(newcorpus[-train], control = list(dictionary = ten_words))
This Naive Bayes model uses the presence or absence of each word rather than its count, so we need to transform every count greater than 0 to 1.
counts <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels = c(0, 1))
  y
}
fword_train <- apply(fword_train, 2, counts)
fword_test <- apply(fword_test, 2, counts)
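To see what the column-wise conversion does, here is a small sketch on a hypothetical 3 x 2 count matrix (the matrix `toy` and its values are invented; `counts` is repeated so the snippet runs on its own). Note that `apply()` coerces factor results to a character array, so the output holds the strings "0" and "1".

```r
# Same helper as above: any positive count becomes 1, everything else 0.
counts <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  factor(y, levels = c(0, 1))
}

# Hypothetical document-term counts: 3 documents, 2 terms.
toy <- matrix(c(0, 2, 1,
                0, 0, 3),
              nrow = 3,
              dimnames = list(NULL, c("tax", "win")))

# Apply the helper to each column (term).
binary <- apply(toy, 2, counts)
binary
```

The `tax` column becomes "0", "1", "1" and the `win` column becomes "0", "0", "1": counts of 2 and 3 are treated the same as a count of 1.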
viral_classifier <- naiveBayes(x=fword_train,y=factor(trainY))
We will use the function "naiveBayes" in the "e1071" package.
The "viral_classifier" object is now the trained classifier on the training data.
viral_test_pred <- predict(viral_classifier, newdata=fword_test)
# Let's see how this looks
confusion <- table(testY, viral_test_pred)
confusion
##      viral_test_pred
## testY   0   1
##     0 645  90
##     1 126 119
Ground Truth | Classified Viral | Classified Not Viral |
---|---|---|
Viral | Correct (true positive) | False negative |
Not Viral | False positive | Correct (true negative) |
One way to display this information is in a "confusion matrix" like the one shown above.
If you are writing a paper using a classifier, always include the confusion matrix.
Class specific performance is a very important aspect of classifiers.
Ideally, you want to keep both false negatives and false positives as low as possible.
Accuracy - % of documents that are correctly classified: \[ \frac{\text{# docs correctly classified}}{\text{# of docs classified}} \]
Sensitivity is the % of positives that are correctly identified: \[ \frac{\text{# of positives identified}}{\text{# of positives}} \]
Specificity is the % of negatives that are correctly identified: \[ \frac{\text{# of negatives identified}}{\text{# of negatives}} \]
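It is very easy to get a high accuracy rate and a high sensitivity or specificity from a poor classifier. For example, a classifier that always predicts the majority class can score well on accuracy and perfectly on one of the two rates.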
Step 5: Calculate accuracy, specificity and sensitivity
accuracy <- (confusion[1,1] + confusion[2,2]) / sum(confusion)
accuracy
## [1] 0.7795918
specificity <- confusion[1,1] / sum(confusion[1,])
specificity
## [1] 0.877551
sensitivity <- confusion[2,2] / sum(confusion[2,])
sensitivity
## [1] 0.4857143
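The three calculations can also be wrapped in one helper and set against a naive baseline. This is a sketch: `classifier_metrics` is a hypothetical helper name, and the confusion matrix is typed in by hand from the Naive Bayes output above rather than taken from the fitted model.

```r
# Confusion matrix copied from the Naive Bayes output
# (rows = true class, columns = predicted class).
confusion <- matrix(c(645, 126, 90, 119), nrow = 2,
                    dimnames = list(testY = c("0", "1"),
                                    pred  = c("0", "1")))

# Hypothetical helper: accuracy, specificity and sensitivity in one call.
classifier_metrics <- function(cm) {
  c(accuracy    = sum(diag(cm)) / sum(cm),
    specificity = cm["0", "0"] / sum(cm["0", ]),
    sensitivity = cm["1", "1"] / sum(cm["1", ]))
}

nb_metrics <- classifier_metrics(confusion)

# Baseline: always predict the majority class ("0", not viral).
baseline_accuracy <- max(rowSums(confusion)) / sum(confusion)
```

The baseline reaches 0.75 accuracy without looking at a single word, so the classifier's 0.78 is a smaller gain than it first appears; the sensitivity of 0.49 shows where it struggles.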
One of the oldest machine learning methods.
Introduced around 1992 by Vapnik.
Theoretically well motivated - the product of statistical learning theory since the 60s.
Good performance in many domains (image recognition, text data etc.)
Support vector machines are in many ways similar to regression.
But instead of fitting a line, support vector machines fit a maximally separating hyperplane between a set of points.
Support vector machines have nice properties:
The optimization problem is convex, and different kernels allow non-linear decision boundaries.
\[ \theta^{T}x - \alpha = 0 \]
Hyperplane can be written as the above.
SVMs involve estimating weights \(\theta\) that define a hyperplane which separates two classes.
But there are lots of different hyperplanes you can estimate.
Which one to choose?
\[ \begin{aligned} \theta^{T}x - \alpha & = 0 \\ \theta^{T}x_{i} - \alpha & \geq 1 &~\text{if}~y_{i} =1 \\ \theta^{T}x_{i} - \alpha & \leq -1 &~\text{if}~y_{i}=-1 \\ \end{aligned} \]
Margin is: \[ \frac{2}{||\theta||} \]
The maximum margin is when \(||\theta||\) is at a minimum.
Minimize \[||\theta|| \] subject to
\[y_{i}(\theta^{T}x_{i} - \alpha) \geq 1 \quad \text{for all}~i \]
Hyperplane estimation is a constrained optimization problem.
Estimated using Lagrangians.
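To make the constraint and the margin concrete, here is a base R sketch on four hypothetical points. It does not solve the optimization problem; it only checks two hand-picked candidate hyperplanes against the constraint and compares their margins. All values (`X`, `y`, `theta`, `theta_small`, `alpha`) are invented for illustration.

```r
# Hypothetical 2-D toy data: two points per class.
X <- rbind(c(2, 2), c(3, 1), c(-2, -2), c(-1, -3))
y <- c(1, 1, -1, -1)

# Margin width between the two supporting hyperplanes: 2 / ||theta||.
margin_width <- function(theta) 2 / sqrt(sum(theta^2))

# Constraint check: y_i * (theta^T x_i - alpha) >= 1 for every point.
is_feasible <- function(theta, alpha) {
  all(y * (X %*% theta - alpha) >= 1)
}

alpha <- 0

# Candidate 1: satisfies the constraints with room to spare.
theta <- c(0.5, 0.5)
feasible <- is_feasible(theta, alpha)
margin <- margin_width(theta)

# Candidate 2: smaller norm, constraints hold with equality,
# so the margin is wider.
theta_small <- c(0.25, 0.25)
feasible_small <- is_feasible(theta_small, alpha)
margin_small <- margin_width(theta_small)
```

Both candidates separate the points, but the one with the smaller norm has the wider margin (about 5.66 versus 2.83), which is why the problem is posed as minimizing the norm subject to the constraints.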
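Although SVMs are technically linear models, you can estimate nonlinear SVMs with something known as the kernel trick.
This is done by replacing the inner products between observations with a kernel function, which implicitly maps the data into a higher-dimensional space where a separating hyperplane is fit.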
Changing the kernel can drastically change performance of the SVM.
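A small base R check of the idea behind the kernel trick, using a degree-2 polynomial kernel as the illustrative case (the lecture's models use other kernels). The feature map `phi`, the function `poly_kernel`, and the vectors `x` and `z` are chosen only for this example.

```r
# Explicit degree-2 feature map for a 2-D input (illustrative choice).
phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)

# Degree-2 polynomial kernel, computed in the original 2-D space.
poly_kernel <- function(x, z) sum(x * z)^2

x <- c(1, 2)
z <- c(3, -1)

explicit <- sum(phi(x) * phi(z))  # inner product in the mapped 3-D space
implicit <- poly_kernel(x, z)     # same quantity without computing phi
```

Both quantities are equal up to floating-point rounding, so the kernel gives the inner product in the larger feature space without ever constructing it.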
fword_train <- DocumentTermMatrix(newcorpus[train], control = list(dictionary = ten_words))
fword_test <- DocumentTermMatrix(newcorpus[-train], control = list(dictionary = ten_words))
# svm() comes from the e1071 package; its default kernel is radial.
model <- svm(x = as.matrix(fword_train), y = factor(trainY))
predictedY <- predict(model, as.matrix(fword_test))
confusion <- table(testY, predictedY)
confusion
##      predictedY
## testY   0   1
##     0 726   9
##     1 212  33
model <- svm(x = as.matrix(fword_train), y = factor(trainY), kernel = "sigmoid")
predictedY <- predict(model, as.matrix(fword_test))
confusion <- table(testY, predictedY)
confusion
##      predictedY
## testY   0   1
##     0 696  39
##     1 196  49
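The three confusion matrices can be put side by side to compare the classifiers. This sketch types the matrices in by hand from the printed outputs; the list names are labels chosen here, with `svm_default` standing for the first SVM call, which uses the e1071 default kernel.

```r
# Confusion matrices copied from the printed outputs
# (rows = true class 0/1, columns = predicted class 0/1).
results <- list(
  naive_bayes = matrix(c(645, 126, 90, 119), nrow = 2),
  svm_default = matrix(c(726, 212, 9, 33), nrow = 2),
  svm_sigmoid = matrix(c(696, 196, 39, 49), nrow = 2)
)

# One column of metrics per classifier.
comparison <- sapply(results, function(cm) {
  c(accuracy    = sum(diag(cm)) / sum(cm),
    specificity = cm[1, 1] / sum(cm[1, ]),
    sensitivity = cm[2, 2] / sum(cm[2, ]))
})
round(comparison, 3)
```

Accuracy is similar across the three (roughly 0.76 to 0.78), but sensitivity differs sharply: about 0.49 for Naive Bayes, 0.13 for the default SVM, and 0.20 with the sigmoid kernel. On this split the SVMs gain specificity by rarely predicting the viral class.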