For today

  1. Building a corpus using APIs.

  2. Pre-processing text data: tokenization, stemming, removing stop words.

  3. Document-term matrix and TF-IDF.

Over the next few weeks

  1. Supervised learning with text data
  2. Unsupervised learning with text data

Intro to natural language processing

NLP Applications (Real World)

NLP Applications (Social Science)

NLP Basic Terminology

\[ \text{word} \subset \text{document} \subset \text{corpus} \]

Example: Measuring political ideology from Tweets

Barberá, P. (2015). “Birds of the Same Feather Tweet Together: Bayesian Ideal Point Estimation Using Twitter Data.” Political Analysis 23(1): 76–91.

Acquiring text data

  1. Online corpora - many packages in R and Python ship with, or can download, ready-made corpora containing thousands of texts.

  2. Build your own corpora

Online corpora

library(tm)
# Install the Reuters-21578 corpus from the WU data cube repository
install.packages("tm.corpus.Reuters21578", 
                 repos = "http://datacube.wu.ac.at")
library(tm.corpus.Reuters21578)
data(Reuters21578)

Reuters Corpus

inspect(Reuters21578[1:2])
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 2
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  16
## Content:  chars: 2860
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  16
## Content:  chars: 438

Reuters Corpus

Reuters21578[[1]]$content
## [1] "Showers continued throughout the week in\nthe Bahia cocoa zone, alleviating the drought since early\nJanuary and improving prospects for the coming temporao,\nalthough normal humidity levels have not been restored,\nComissaria Smith said in its weekly review.\n    The dry period means the temporao will be late this year.\n    Arrivals for the week ended February 22 were 155,221 bags\nof 60 kilos making a cumulative total for the season of 5.93\nmln against 5.81 at the same stage last year. Again it seems\nthat cocoa delivered earlier on consignment was included in the\narrivals figures.\n    Comissaria Smith said there is still some doubt as to how\nmuch old crop cocoa is still available as harvesting has\npractically come to an end. With total Bahia crop estimates\naround 6.4 mln bags and sales standing at almost 6.2 mln there\nare a few hundred thousand bags still in the hands of farmers,\nmiddlemen, exporters and processors.\n    There are doubts as to how much of this cocoa would be fit\nfor export as shippers are now experiencing dificulties in\nobtaining +Bahia superior+ certificates.\n    In view of the lower quality over recent weeks farmers have\nsold a good part of their cocoa held on consignment.\n    Comissaria Smith said spot bean prices rose to 340 to 350\ncruzados per arroba of 15 kilos.\n    Bean shippers were reluctant to offer nearby shipment and\nonly limited sales were booked for March shipment at 1,750 to\n1,780 dlrs per tonne to ports to be named.\n    New crop sales were also light and all to open ports with\nJune/July going at 1,850 and 1,880 dlrs and at 35 and 45 dlrs\nunder New York july, Aug/Sept at 1,870, 1,875 and 1,880 dlrs\nper tonne FOB.\n    Routine sales of butter were made. March/April sold at\n4,340, 4,345 and 4,350 dlrs.\n    April/May butter went at 2.27 times New York May, June/July\nat 4,400 and 4,415 dlrs, Aug/Sept at 4,351 to 4,450 dlrs and at\n2.27 and 2.28 times New York Sept and Oct/Dec at 4,480 dlrs and\n2.27 times New York Dec, Comissaria Smith said.\n    Destinations were the U.S., Covertible currency areas,\nUruguay and open ports.\n    Cake sales were registered at 785 to 995 dlrs for\nMarch/April, 785 dlrs for May, 753 dlrs for Aug and 0.39 times\nNew York Dec for Oct/Dec.\n    Buyers were the U.S., Argentina, Uruguay and convertible\ncurrency areas.\n    Liquor sales were limited with March/April selling at 2,325\nand 2,380 dlrs, June/July at 2,375 dlrs and at 1.25 times New\nYork July, Aug/Sept at 2,400 dlrs and at 1.25 times New York\nSept and Oct/Dec at 1.25 times New York Dec, Comissaria Smith\nsaid.\n    Total Bahia sales are currently estimated at 6.13 mln bags\nagainst the 1986/87 crop and 1.06 mln bags against the 1987/88\ncrop.\n    Final figures for the period to February 28 are expected to\nbe published by the Brazilian Cocoa Trade Commission after\ncarnival which ends midday on February 27.\n Reuter\n"

Corpora in Python

import nltk
# nltk.download('inaugural')  # download the corpus once if not already installed
from nltk.corpus import inaugural
inaugural.fileids()
[u'1789-Washington.txt', u'1793-Washington.txt', u'1797-Adams.txt', u'1801-Jefferson.txt', u'1805-Jefferson.txt', u'1809-Madison.txt' ...]

Building your own corpora using APIs

Building your own corpora using APIs

Political science APIs

Extracting data from APIs using R

Extracting Tweets using TwitteR

library(twitteR)
# Replace the placeholders below with your own Twitter API credentials
setup_twitter_oauth(
  consumer_key="YOUR_CONSUMER_KEY", 
  consumer_secret="YOUR_CONSUMER_SECRET", 
  access_token="YOUR_ACCESS_TOKEN", 
  access_secret="YOUR_ACCESS_SECRET")
## [1] "Using direct authentication"
searchTwitter("@universityofga")[1:2]
## [[1]]
## [1] "MikeWithout_Ike: Come look at some Black History today at Tate's intersection..... and yes food will be provided @UGABMLS #bmls… https://t.co/LjvoEpExUy"
## 
## [[2]]
## [1] "SHAAREASSIST: @hargrettlibrary @universityofga @ugaalumniassoc we are asking...step up and step out and be a sanctuary to those who need you #Love"

Extracting Tweets using TwitteR

Tweets are saved as a list in R

UGATweets = searchTwitter("UGA")[1:2]
UGATweets[[1]]
## [1] "LaneNuclear: RT @UTKPrideCenter: Will you reinstate funding for the Pride Center comparable to schools like UGA, UNC, NSCU, UF, UK? #happybdayPride #7ye…"

Example 2: Tapping into APIs using “jsonlite”

library(jsonlite)
# Bills with the term "refugee" in them
bills<-fromJSON("https://www.govtrack.us/api/v2/bill?q=refugee") 
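
The response comes back as nested R lists and data frames. Before pulling out specific fields (next slide), a quick way to see what is available:

# Peek at the structure of the API response
names(bills)            # top-level components of the JSON response
names(bills$objects)    # fields available for each bill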

Example 2: Tapping into APIs using “jsonlite”

# We can retrieve the title and other information about these bills here.
# I'm creating a data frame with the bill title, bill id, bill sponsor
# id, and the bill sponsor's gender.

billtitles<-bills$objects$title
billid<-bills$objects$id
sponsorid<-bills$objects$sponsor$id
sponsorgender<-bills$objects$sponsor$gender

refugeebilldat<-
  data.frame(billtitles,billid,sponsorid,sponsorgender)

refugeebilldat[1:2,1:2]
##                                                                                                billtitles
## 1                                                                  S. 2842 (101st): Refugee Repayment Act
## 2 H.R. 6428 (94th): A bill to amend the Migration and Refugee Assistance to refugees from Southeast Asia.
##   billid
## 1  95731
## 2 216254
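
By default the query returns only the first page of matching bills. A hedged sketch for pulling more results, assuming the govtrack API accepts limit and offset query parameters (an assumption, not shown in the slides):

# Hypothetical: request a larger page of results, then the next page;
# the limit/offset parameter names are assumptions about the govtrack API
bills_p1 <- fromJSON("https://www.govtrack.us/api/v2/bill?q=refugee&limit=100")
bills_p2 <- fromJSON("https://www.govtrack.us/api/v2/bill?q=refugee&limit=100&offset=100")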

Pre-processing text data

Pre-processing text data steps

  1. Tokenization - splitting the document into tokens, which can be single words or n-grams (multi-word phrases); see the short sketch after this list.

  2. Formatting - removing punctuation and numbers, normalizing case and whitespace.

  3. Stop words - removing very common function words ("stop words") such as "the", "and", "of".

  4. Stemming - reducing words to a common root by stripping suffixes (e.g., "addressing" becomes "address").
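
A minimal sketch of what tokenization produces, using an assumed example sentence (not from the slides):

# Assumed example sentence
sentence <- "The quick brown fox jumps over the lazy dog"
# Unigram (single word) tokens: split on whitespace
words <- unlist(strsplit(tolower(sentence), "\\s+"))
words
# Bigrams: paste adjacent pairs of words, e.g. "the quick", "quick brown", ...
bigrams <- paste(head(words, -1), tail(words, -1))
bigrams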

Tokenization

Tokenization example: Tweets mentioning "@POTUS"

# Search Tweets with "@POTUS" 
potustweets = searchTwitter("@POTUS",n=5)
# Extract ONLY TEXT from the tweets
potustweets = lapply(potustweets, function(t) t$getText())
# Emojis break our downstream text functions, so strip any
# characters that cannot be converted to plain text
for(i in 1:length(potustweets)){
  potustweets[[i]]<-iconv(potustweets[[i]], "ASCII", "UTF-8", sub="")
}
# We lowercase all the tweets at this stage because of an
# encoding quirk in the "tm" package
potustweets = lapply(potustweets, toString)
potustweets = lapply(potustweets, tolower)
# Let's see the first two
potustweets[1:2]
## [[1]]
## [1] "@potus @usatoday how about addressing all crime and not just singling out immigrants. oh, right, that's not in the nazi playbook."
## 
## [[2]]
## [1] "rt @atlaswiseman: @foxnews @trumpinaugural @jackieevancho @potus sure she/he will get over it"

Tokenization example: Tweets mentioning "@POTUS"

library(tm)
potuscorpus<-Corpus(VectorSource(potustweets))
potuscorpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 5

Formatting

Formatting example: Tweets mentioning "@POTUS"

potuscorpus<-tm_map(potuscorpus,
                    removePunctuation)
potuscorpus<-tm_map(potuscorpus,
                    stripWhitespace)
# Did it work?
potuscorpus[[1]]$content
## [1] "potus usatoday how about addressing all crime and not just singling out immigrants oh right thats not in the nazi playbook"
#vs
potustweets[[1]]
## [1] "@potus @usatoday how about addressing all crime and not just singling out immigrants. oh, right, that's not in the nazi playbook."

Formatting example: Tweets mentioning "@POTUS"

potuscorpus<-tm_map(potuscorpus,
                    removeNumbers)

potuscorpus[[1]]$content
## [1] "potus usatoday how about addressing all crime and not just singling out immigrants oh right thats not in the nazi playbook"

Stop words

Stop word removal example: Tweets mentioning "@POTUS"

potuscorpus<-tm_map(potuscorpus,
                     removeWords, stopwords("english"))

potuscorpus[[1]]$content
## [1] "potus usatoday   addressing  crime   just singling  immigrants oh right thats    nazi playbook"
potustweets[[1]]
## [1] "@potus @usatoday how about addressing all crime and not just singling out immigrants. oh, right, that's not in the nazi playbook."

Stemming

Stemming example: Tweets mentioning "@POTUS"

potuscorpus<-tm_map(potuscorpus, 
                    stemDocument)

potuscorpus[[1]]$content
## [1] "potus usatoday   address  crime   just singl  immigr oh right that    nazi playbook"
potustweets[[1]]
## [1] "@potus @usatoday how about addressing all crime and not just singling out immigrants. oh, right, that's not in the nazi playbook."

Building a pipeline

Building a pipeline

# Wrap all of the pre-processing steps above into one reusable function
text_cleaner<-function(corpus){
  # Tokenize the raw text into a tm corpus
  tempcorpus<-Corpus(VectorSource(corpus))
  # Formatting: remove punctuation, extra whitespace, and numbers
  tempcorpus<-tm_map(tempcorpus,
                    removePunctuation)
  tempcorpus<-tm_map(tempcorpus,
                    stripWhitespace)
  tempcorpus<-tm_map(tempcorpus,
                    removeNumbers)
  # Remove English stop words
  tempcorpus<-tm_map(tempcorpus,
                     removeWords, stopwords("english"))
  # Stem the remaining words
  tempcorpus<-tm_map(tempcorpus, 
                    stemDocument)
  return(tempcorpus)
}

potuscorpus<-text_cleaner(potustweets)
potuscorpus[[1]]$content
## [1] "potus usatoday   address  crime   just singl  immigr oh right that    nazi playbook"
potuscorpus[[2]]$content
## [1] "rt atlaswiseman foxnew trumpinaugur jackieevancho potus sure shehe will get "

From words to vectors

Building the document-term matrix

# Each row is a document (tweet); each column is a term
dtm <- DocumentTermMatrix(potuscorpus)
dtm
## <<DocumentTermMatrix (documents: 5, terms: 50)>>
## Non-/sparse entries: 56/194
## Sparsity           : 78%
## Maximal term length: 15
## Weighting          : term frequency (tf)
inspect(dtm[1:5, 1:5])
## <<DocumentTermMatrix (documents: 5, terms: 5)>>
## Non-/sparse entries: 5/20
## Sparsity           : 80%
## Maximal term length: 12
## Weighting          : term frequency (tf)
## 
##     Terms
## Docs address atlaswiseman attempt author away
##    1       1            0       0      0    0
##    2       0            1       0      0    0
##    3       0            0       0      1    0
##    4       0            0       0      0    1
##    5       0            0       1      0    0
potuscorpus[[2]]$content
## [1] "rt atlaswiseman foxnew trumpinaugur jackieevancho potus sure shehe will get "