  • Getting the most out of topic modeling.

  • Multidimensional scaling of texts.

  • Introduction to Computer Vision and Image Analysis

Getting the most out of topic models

  • Topic models give us two types of output, which support tasks such as summarizing a corpus and comparing documents.

  • Output 1: topics in a corpus.

  • Output 2: topic proportions for each document.

Topics in a corpus - corpus structure

  • Estimating a topic model over a corpus allows us to get a sense of how a set of documents is structured.

  • Let's do an example with the Associated Press articles.

Associated Press articles

library(topicmodels)
data("AssociatedPress")
AssociatedPress
## <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
## Non-/sparse entries: 302031/23220327
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)
ap_lda <- LDA(AssociatedPress, k = 5, control = list(seed = 1234))

Getting the most out of topic models

terms(ap_lda, k=10) # top 10 words for each topic
##       Topic 1      Topic 2      Topic 3   Topic 4  Topic 5     
##  [1,] "percent"    "bush"       "million" "i"      "government"
##  [2,] "year"       "soviet"     "new"     "people" "police"    
##  [3,] "million"    "president"  "company" "two"    "court"     
##  [4,] "billion"    "i"          "market"  "police" "people"    
##  [5,] "new"        "united"     "stock"   "years"  "two"       
##  [6,] "report"     "states"     "billion" "new"    "state"     
##  [7,] "last"       "new"        "percent" "three"  "case"      
##  [8,] "years"      "house"      "year"    "city"   "years"     
##  [9,] "workers"    "dukakis"    "york"    "like"   "south"     
## [10,] "department" "government" "dollar"  "school" "attorney"

Let's name topics 1-5

terms(ap_lda, k=15) # top 15 words for each topic
##       Topic 1      Topic 2      Topic 3   Topic 4    Topic 5     
##  [1,] "percent"    "bush"       "million" "i"        "government"
##  [2,] "year"       "soviet"     "new"     "people"   "police"    
##  [3,] "million"    "president"  "company" "two"      "court"     
##  [4,] "billion"    "i"          "market"  "police"   "people"    
##  [5,] "new"        "united"     "stock"   "years"    "two"       
##  [6,] "report"     "states"     "billion" "new"      "state"     
##  [7,] "last"       "new"        "percent" "three"    "case"      
##  [8,] "years"      "house"      "year"    "city"     "years"     
##  [9,] "workers"    "dukakis"    "york"    "like"     "south"     
## [10,] "department" "government" "dollar"  "school"   "attorney"  
## [11,] "federal"    "campaign"   "bank"    "time"     "last"      
## [12,] "prices"     "party"      "inc"     "just"     "trial"     
## [13,] "program"    "committee"  "trading" "children" "judge"     
## [14,] "government" "congress"   "corp"    "first"    "officials" 
## [15,] "oil"        "reagan"     "share"   "day"      "prison"

Which documents were in which topics?

  • This gives us an N×K matrix of the topic proportions for each document.

posterior_inference <- posterior(ap_lda)
posterior_topic_dist<-posterior_inference$topics # This is the distribution of topics for each document
dim(posterior_topic_dist)
## [1] 2246    5

Which documents were in which topics?

  • It's easy to find the documents in which topic 2 accounts for more than half of the content.

topic_2_docs<-which(posterior_topic_dist[,2] > 0.50)
head(topic_2_docs, 10)
##  [1]  2  6  8 13 14 18 27 32 39 50

Let's look at some of the words in two documents from the same topic

library(tidytext)  # provides tidy() for DocumentTermMatrix objects
library(wordcloud)
ap_td <- tidy(AssociatedPress)
wordcloud(sample(ap_td[ap_td$document==6,]$term,10), xlab = "Document 6")

Let's look at some of the words in two documents from the same topic

wordcloud(sample(ap_td[ap_td$document==8,]$term,10), xlab = "Document 8")

What % of documents fall under each topic?

  • To understand more about a corpus we might be interested in what % of documents fall under each topic.

  • This is especially useful for understanding things like news coverage.

  • Here, we can find out (roughly) what % of Associated Press articles are in each topic.

How much coverage did the Associated Press devote to each topic?


Topics<-c("Topic 1", "Topic 2","Topic 3","Topic 4","Topic 5")
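The coverage shares described above can be computed in more than one way. The sketch below is not the code from the original slides: it uses a small simulated matrix `theta` in place of `posterior_topic_dist`, and the names `theta`, `dominant_topic`, `share_by_assignment`, and `share_by_average` are made up for illustration. It shows two plausible definitions of "% of documents in each topic".

```r
# Simulated topic proportions: 6 documents, 5 topics (each row sums to 1).
# In the slides this role is played by posterior_topic_dist.
set.seed(1234)
raw <- matrix(rexp(6 * 5), nrow = 6, ncol = 5)
theta <- raw / rowSums(raw)

topic_labels <- paste("Topic", 1:5)

# Option 1: assign each document to its highest-proportion topic, then tabulate.
dominant_topic <- factor(apply(theta, 1, which.max),
                         levels = 1:5, labels = topic_labels)
share_by_assignment <- prop.table(table(dominant_topic)) * 100

# Option 2: average the proportions, so mixed-topic documents stay mixed.
share_by_average <- colMeans(theta) * 100
names(share_by_average) <- topic_labels

print(round(share_by_assignment, 1))
print(round(share_by_average, 1))
```

Option 1 answers "which topic dominates each article", while Option 2 answers "how much of the total text belongs to each topic". Either vector could be passed to `barplot()` to draw the coverage chart.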

Document similarity

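  • Topic proportions from topic models can also be used to compare documents by how similar they are.

\[ \text{document-similarity}_{d,f} = \sum_{k=1}^{K}\left(\sqrt{\theta_{d,k}} + \sqrt{\theta_{f,k}}\right)^2 \]

  • Here \(\theta_{d,k}\) is the proportion of topic \(k\) in document \(d\). Larger values indicate more similar documents.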

Document similarity function

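# Sum over topics of the squared sum of square-root proportions
doc_similarity <- function(doc1, doc2) sum((sqrt(doc1) + sqrt(doc2))^2)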
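To get a feel for the scale of this measure, here is a small sketch on made-up topic proportions. Because each document's proportions sum to 1, the sum expands to 2 + 2 Σ sqrt(θ_d,k θ_f,k), so the value should lie between 2 (no shared topics) and 4 (identical proportions). The helper `toy_similarity` and the vectors `doc_a`, `doc_b`, `doc_c` are hypothetical names used only for this example.

```r
# Hypothetical helper for this example; it mirrors the formula on the slide.
toy_similarity <- function(doc1, doc2) {
  sum((sqrt(doc1) + sqrt(doc2))^2)
}

doc_a <- c(0.7, 0.2, 0.1)  # made-up topic proportions
doc_b <- c(0.6, 0.3, 0.1)  # close to doc_a
doc_c <- c(0.0, 0.1, 0.9)  # mostly a different topic

sim_ab <- toy_similarity(doc_a, doc_b)
sim_ac <- toy_similarity(doc_a, doc_c)

# Boundary cases
sim_aa <- toy_similarity(doc_a, doc_a)            # identical documents
sim_disjoint <- toy_similarity(c(1, 0), c(0, 1))  # no topic in common

print(c(a_vs_b = sim_ab, a_vs_c = sim_ac,
        identical = sim_aa, disjoint = sim_disjoint))
```

Since higher values mean more similar documents, the most similar document is found with a maximum rather than a minimum.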

Using topic proportions, calculate AP article similarity

  • Let's figure out which article is most similar to article 1.

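similarity <- rep(NA, nrow(posterior_topic_dist)) # entry 1 stays NA: article 1 is not compared with itself
for(i in 2:nrow(posterior_topic_dist)){
      similarity[i] <- doc_similarity(posterior_topic_dist[1,], posterior_topic_dist[i,])
}
which.max(similarity) # larger values mean more similar; NAs are ignored
## [1] 1523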

Using topic proportions, calculate AP article similarity

  • It seems as if article 1523 is the most similar.

  • What words does it have?

Using topic proportions, calculate AP article similarity

par(mfrow = c(1,2))

Document clustering and multidimensional scaling

  • We are often interested in how similar a set of documents are, for a variety of reasons.

  • We may be interested in identifying latent features of documents.

  • We may be interested in scoring documents to see how these scores relate to other features of the documents.

Hierarchical Clustering (HC)

  • As the name suggests, HC builds a hierarchy of clusters.

  • Two types of hierarchical clustering:

  1. Agglomerative - "Bottom up". Each observation starts in its own cluster, and the closest clusters are merged step by step based on a distance metric.

  2. Divisive - "Top down". All observations start in a single cluster, which is split apart step by step.

Hierarchical clustering algorithm

  • HC, like multidimensional scaling, is based on some measure of "distance" between two observations.

  • Algorithm proceeds by grouping observations by distance.
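The grouping-by-distance idea can be seen on a handful of points. This is a minimal sketch with made-up one-dimensional data and complete linkage (an arbitrary choice for the example); `x`, `d_toy`, and `hc_toy` are illustrative names, not objects from the slides.

```r
x <- c(0, 1, 10, 12, 30)   # five made-up observations on a line
d_toy <- dist(x)           # pairwise Euclidean distances
hc_toy <- hclust(d_toy, method = "complete")

# One row per merge; negative entries are single observations,
# positive entries refer to clusters formed at an earlier step.
print(hc_toy$merge)

# Distance at which each merge happened: 1, 2, 12, 30
print(hc_toy$height)
```

The closest pair (observations 1 and 2) is merged first, then observations 3 and 4, then those two clusters, and finally the outlying fifth point.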

Types of distance

Euclidean

\[\|x_{1}-x_{2} \|_2 = \sqrt{\sum_i (x_{1i}-x_{2i})^2}\]

Types of distance

Squared Euclidean

\[\|x_{1}-x_{2} \|_2^2 = \sum_i (x_{1i}-x_{2i})^2\]

Types of distance

Manhattan

\[\|x_{1}-x_{2} \|_1 = \sum_i |x_{1i}-x_{2i}|\]

  • etc
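As a quick check of the three formulas, the sketch below computes each distance for two made-up points using base R's `dist()`. The object names are illustrative only.

```r
pts <- rbind(x1 = c(0, 0), x2 = c(3, 4))  # two made-up observations

euclidean <- as.numeric(dist(pts, method = "euclidean"))  # sqrt(3^2 + 4^2) = 5
squared_euclidean <- euclidean^2                          # 3^2 + 4^2 = 25
manhattan <- as.numeric(dist(pts, method = "manhattan"))  # |3| + |4| = 7

print(c(euclidean = euclidean,
        squared_euclidean = squared_euclidean,
        manhattan = manhattan))
```

`dist()` has no built-in squared Euclidean option, so the sketch squares the Euclidean result.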

Linkage criterion and dendrograms

  • Hierarchical clustering decides which clusters to merge based on a linkage criterion.

  • A linkage criterion defines the distance between two sets of observations as a function of the pairwise distances between their members. Complete linkage, for example, uses the maximum pairwise distance.

  • Ward's criterion instead merges the pair of clusters that gives the smallest increase in within-cluster variance.

  • This is what we use below to cluster Trump's tweets.
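To see how the linkage criterion changes the result, the following sketch runs `hclust()` on the same made-up points with three different linkages and compares the merge heights. The data and object names are assumptions for illustration.

```r
x <- c(0, 2, 7, 11, 20)  # made-up observations on a line
d_link <- dist(x)

methods <- c("single", "complete", "average")

# Each column holds the four merge heights for one linkage criterion.
heights <- sapply(methods, function(m) hclust(d_link, method = m)$height)
print(heights)
```

All three start by merging the same closest pairs, but they disagree on how far apart the resulting clusters are: single linkage uses the nearest members, complete linkage the farthest, and average linkage the mean of all pairwise distances.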

Clustering with the Iris Dataset

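# Distance matrix on the four numeric measurements (assumed: d is not defined on the slide)
d <- dist(iris[, 1:4])
#run hierarchical clustering using Ward’s method
clusters <- hclust(d,method="ward.D")
# Plot the dendrogram to figure out how many clusters
plot(clusters, hang=-1)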

Clustering with the Iris Dataset

  • Clear separation between each species

First 100 of Trump's tweets

# Create a distance object using the dtm
d <- dist(dtm_mat)
#run hierarchical clustering using Ward’s method
clusters <- hclust(d,method="ward.D")
# Plot the dendrogram to figure out how many clusters
plot(clusters, hang=-1)

Interpreting the dendrogram

  • Looking at the dendrogram, we have to figure out how many clusters there are based on how well the tree separates.

Dendrogram of Trump's Tweets

plot(clusters, hang=-1,main="Dendrogram of Trump's Tweets")

Label documents

  • Looks like we have two well-separated clusters here.

  • We can now label the documents by cutting the tree.

##  1  2  3  4  5  6  7  8  9 10 
##  1  1  1  1  1  1  1  1  1  1
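Cutting the tree is typically done with `cutree()`. The sketch below is not the code from the slides: it uses a tiny made-up document-term matrix (`toy_dtm`, with invented terms and counts) that contains two obvious groups, and asks for two clusters.

```r
# Made-up document-term counts: docs 1-3 use one vocabulary, docs 4-6 another.
toy_dtm <- rbind(
  doc1 = c(5, 4, 0, 0),
  doc2 = c(4, 5, 0, 0),
  doc3 = c(6, 5, 1, 0),
  doc4 = c(0, 0, 5, 4),
  doc5 = c(0, 1, 4, 5),
  doc6 = c(0, 0, 6, 5)
)
colnames(toy_dtm) <- c("economy", "jobs", "court", "trial")

toy_clusters <- hclust(dist(toy_dtm), method = "ward.D")

# Cut the tree into k = 2 groups; the result is one cluster label per document.
toy_labels <- cutree(toy_clusters, k = 2)
print(toy_labels)
print(table(toy_labels))  # how many documents fall in each cluster
```

The labels can then be attached to the original documents, for example as a new column, to inspect what each cluster contains.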


What do these clusters look like?


Multidimensional Scaling

  • There are occasions in which we might be interested in scoring documents on how similar they are.

  • This can be done by performing multidimensional scaling on the Document-Term Matrix.

Multidimensional Scaling of Members of Congress

  • Members of Congress are scored by collapsing the matrix of roll call votes into a small number of dimensions.

  • Members of Congress are rows (observations)

  • Bills voted on are columns.
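The roll call setup can be illustrated with a toy vote matrix. The data below are invented (six hypothetical members, eight hypothetical bills, 1 = yea, 0 = nay), not real congressional votes, and the one-dimensional scaling is only a sketch of the idea.

```r
# Rows are members, columns are bills. Members A1-A3 and B1-B3 mostly vote
# on opposite sides, with a few defections.
votes <- rbind(
  A1 = c(1, 1, 1, 1, 0, 0, 0, 0),
  A2 = c(1, 1, 1, 1, 0, 0, 0, 1),
  A3 = c(1, 1, 1, 0, 0, 0, 0, 0),
  B1 = c(0, 0, 0, 0, 1, 1, 1, 1),
  B2 = c(0, 0, 0, 1, 1, 1, 1, 1),
  B3 = c(0, 0, 0, 0, 1, 1, 1, 0)
)
colnames(votes) <- paste0("bill", 1:8)

# Collapse the eight bill columns into a single score per member.
vote_fit <- cmdscale(dist(votes), k = 1)
scores <- vote_fit[, 1]
print(round(sort(scores), 2))
```

Members who vote alike receive nearby scores, so the two blocs end up on opposite sides of zero. The sign of the dimension is arbitrary; only the relative positions carry meaning.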

Multidimensional Scaling

  • Like hierarchical clustering, scoring is based on the distance between observations.

  • The researcher must choose the number of dimensions they believe the data fall along.
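One rough aid when choosing the number of dimensions is the goodness-of-fit value that `cmdscale()` reports when `eig = TRUE`. The sketch below uses simulated data with two true underlying dimensions plus a little noise. All object names are hypothetical, and this is only one heuristic, not a rule from the slides.

```r
set.seed(42)
latent <- matrix(rnorm(20 * 2), nrow = 20)   # 20 observations, 2 true dimensions
loadings <- matrix(rnorm(2 * 5), nrow = 2)   # map them into 5 observed variables
noise <- matrix(rnorm(20 * 5, sd = 0.05), nrow = 20)
observed <- latent %*% loadings + noise

d_sim <- dist(observed)

# GOF[1] is the share of the (absolute) eigenvalue total captured by k dimensions.
gof <- sapply(1:4, function(k) cmdscale(d_sim, k = k, eig = TRUE)$GOF[1])
names(gof) <- paste0("k=", 1:4)
print(round(gof, 3))
```

In data like these, the fit should rise sharply up to k = 2 and then level off, which suggests that two dimensions are enough. With real text data the elbow is usually less clear-cut, so substantive judgment still matters.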

Multidimensional Scaling

d <- dist(dtm_mat) # euclidean distances between the rows
fit <- cmdscale(d,eig=TRUE, k=2) # k is the number of dim
fit # view results

Plot first and second dimensions

Dim1 <- fit$points[,1]
Dim2 <- fit$points[,2]

plot(Dim1, Dim2, xlab="First Dimension", ylab="Second Dimension", 
  main="Metric MDS", type="n")
text(Dim1, Dim2, labels = row.names(dtm_mat), cex=.7)