Building a corpus using APIs.
Pre-processing text data: tokenization, stemming, removing stop words.
Document-term matrix and TF-IDF.
NLP refers to a set of methods that map natural language units (words, sentences, paragraphs, etc.) into a machine-readable form.
Once we can figure out how to represent language to a machine, we can then use statistical/mathematical tools to learn things about texts (without having to read them!).
Spam filters
Speech recognition (Siri)
Machine translation
Information retrieval (search engines)
Artificial intelligence
\[ word \subset document \subset corpus\]
document - A collection of words, usually the unit of observation.
corpus - A collection of documents or a single document.
The corpus can be thought of as our entire dataset.
The document can be thought of as an observation.
Barberá, P., 2013. "Birds of the Same Feather Tweet Together: Bayesian Ideal Point Estimation Using Twitter Data." Political Analysis.
Measuring the political ideology of Twitter users from tweets and patterns of following/followers.
document - Each tweet is a document.
corpus - All of the tweets that are analyzed are the corpus.
Online corpora - There are many built-in packages in R and in Python that you can load, giving access to thousands of texts.
library(tm)
install.packages("tm.corpus.Reuters21578",
                 repos = "http://datacube.wu.ac.at")
library(tm.corpus.Reuters21578)
data(Reuters21578)
There are thousands of online corpora that are available.
R is less convenient for accessing online corpora, since far fewer ready-made collections are packaged for it.
Here we are accessing the Reuters-21578 corpus, a collection of 21,578 articles from the Reuters newswire in 1987.
inspect(Reuters21578[1:2])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 2
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 16
## Content: chars: 2860
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 16
## Content: chars: 438
Reuters21578[[1]]$content
## [1] "Showers continued throughout the week in\nthe Bahia cocoa zone, alleviating the drought since early\nJanuary and improving prospects for the coming temporao,\nalthough normal humidity levels have not been restored,\nComissaria Smith said in its weekly review.\n The dry period means the temporao will be late this year.\n Arrivals for the week ended February 22 were 155,221 bags\nof 60 kilos making a cumulative total for the season of 5.93\nmln against 5.81 at the same stage last year. Again it seems\nthat cocoa delivered earlier on consignment was included in the\narrivals figures.\n Comissaria Smith said there is still some doubt as to how\nmuch old crop cocoa is still available as harvesting has\npractically come to an end. With total Bahia crop estimates\naround 6.4 mln bags and sales standing at almost 6.2 mln there\nare a few hundred thousand bags still in the hands of farmers,\nmiddlemen, exporters and processors.\n There are doubts as to how much of this cocoa would be fit\nfor export as shippers are now experiencing dificulties in\nobtaining +Bahia superior+ certificates.\n In view of the lower quality over recent weeks farmers have\nsold a good part of their cocoa held on consignment.\n Comissaria Smith said spot bean prices rose to 340 to 350\ncruzados per arroba of 15 kilos.\n Bean shippers were reluctant to offer nearby shipment and\nonly limited sales were booked for March shipment at 1,750 to\n1,780 dlrs per tonne to ports to be named.\n New crop sales were also light and all to open ports with\nJune/July going at 1,850 and 1,880 dlrs and at 35 and 45 dlrs\nunder New York july, Aug/Sept at 1,870, 1,875 and 1,880 dlrs\nper tonne FOB.\n Routine sales of butter were made. March/April sold at\n4,340, 4,345 and 4,350 dlrs.\n April/May butter went at 2.27 times New York May, June/July\nat 4,400 and 4,415 dlrs, Aug/Sept at 4,351 to 4,450 dlrs and at\n2.27 and 2.28 times New York Sept and Oct/Dec at 4,480 dlrs and\n2.27 times New York Dec, Comissaria Smith said.\n Destinations were the U.S., Covertible currency areas,\nUruguay and open ports.\n Cake sales were registered at 785 to 995 dlrs for\nMarch/April, 785 dlrs for May, 753 dlrs for Aug and 0.39 times\nNew York Dec for Oct/Dec.\n Buyers were the U.S., Argentina, Uruguay and convertible\ncurrency areas.\n Liquor sales were limited with March/April selling at 2,325\nand 2,380 dlrs, June/July at 2,375 dlrs and at 1.25 times New\nYork July, Aug/Sept at 2,400 dlrs and at 1.25 times New York\nSept and Oct/Dec at 1.25 times New York Dec, Comissaria Smith\nsaid.\n Total Bahia sales are currently estimated at 6.13 mln bags\nagainst the 1986/87 crop and 1.06 mln bags against the 1987/88\ncrop.\n Final figures for the period to February 28 are expected to\nbe published by the Brazilian Cocoa Trade Commission after\ncarnival which ends midday on February 27.\n Reuter\n"
import nltk
from nltk.corpus import inaugural
inaugural.fileids()
[u'1789-Washington.txt', u'1793-Washington.txt', u'1797-Adams.txt', u'1801-Jefferson.txt', u'1805-Jefferson.txt', u'1809-Madison.txt' ...]
In general, I recommend using Python if you want to access ready-to-go corpora.
This can be easily done with the NLTK package.
NLTK has thousands of corpora ready to go but this requires more advanced programming in Python which we do not have time for in this class.
For more details see: http://www.nltk.org/book/ch02.html
In general, ready-to-go corpora are not that useful.
They contain old documents and are unlikely to have the info you want.
For the most part, when you’re doing text analysis, you’ll have to build your own corpora.
The easiest way to do this is to extract data from one of the MILLIONS of APIs out there.
APIs were originally designed to allow developers of apps to have constant streaming access to data.
But they are a treasure trove of information of immeasurable use to social scientists who know how to tap into them.
GovTrack.us API - Get any information about Congress (bills, legislators, voting) over several Congresses.
OpenStates API - Tons of information about state legislators, bills, voting, etc.
Opensecrets.org - Money in politics and campaign finance database.
Twitter API - Get streaming tweets from Twitter with user information etc.
Three packages are very useful for this purpose: twitteR, httr, and jsonlite.
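For APIs that do not require OAuth, httr alone is often enough. Here is a minimal sketch (using the same GovTrack endpoint that appears later in these notes) of fetching and parsing JSON with an explicit GET request:

```r
library(httr)
library(jsonlite)

# Query a REST API that needs no authentication; the query list
# becomes URL parameters (?q=refugee)
resp <- GET("https://www.govtrack.us/api/v2/bill",
            query = list(q = "refugee"))
# Parse the JSON body into nested R lists/data frames
bills <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
```

fromJSON() turns the response body into nested R lists and data frames, which is exactly what the jsonlite example with GovTrack does in one step.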
You first must establish OAuth credentials if you would like to access Tweets.
You can find out how to do so here: Getting OAuth Credentials.
library(twitteR)
# Replace the placeholders below with your own OAuth credentials
setup_twitter_oauth(
  consumer_key    = "YOUR_CONSUMER_KEY",
  consumer_secret = "YOUR_CONSUMER_SECRET",
  access_token    = "YOUR_ACCESS_TOKEN",
  access_secret   = "YOUR_ACCESS_SECRET")
## [1] "Using direct authentication"
searchTwitter("@universityofga")[1:2]
## [[1]]
## [1] "MikeWithout_Ike: Come look at some Black History today at Tate's intersection..... and yes food will be provided @UGABMLS #bmls… https://t.co/LjvoEpExUy"
##
## [[2]]
## [1] "SHAAREASSIST: @hargrettlibrary @universityofga @ugaalumniassoc we are asking...step up and step out and be a sanctuary to those who need you #Love"
Tweets are saved as a list in R.
UGATweets = searchTwitter("UGA")[1:2]
UGATweets[[1]]
## [1] "LaneNuclear: RT @UTKPrideCenter: Will you reinstate funding for the Pride Center comparable to schools like UGA, UNC, NSCU, UF, UK? #happybdayPride #7ye…"
library(jsonlite)
# Bills with the term "refugee" in them
bills<-fromJSON("https://www.govtrack.us/api/v2/bill?q=refugee")
# We can retrieve the title and other information about these bills.
# Here I'm creating a data frame with the bill title, bill id,
# bill sponsor id, and the bill sponsor's gender.
billtitles <- bills$objects$title
billid <- bills$objects$id
sponsorid <- bills$objects$sponsor$id
sponsoridgender <- bills$objects$sponsor$gender
refugeebilldat <-
  data.frame(billtitles, billid, sponsorid, sponsoridgender)
refugeebilldat[1:2,1:2]
## billtitles
## 1 S. 2842 (101st): Refugee Repayment Act
## 2 H.R. 6428 (94th): A bill to amend the Migration and Refugee Assistance to refugees from Southeast Asia.
## billid
## 1 95731
## 2 216254
Now that we have the data, we need to prepare it for analysis.
This involves a few steps that take raw strings and break them down into analyzable units.
Tokenization - splits the document into tokens, which can be words or n-grams (phrases).
Formatting - standardizing punctuation, numbers, case, and spacing.
Stop words - removal of very common "stop words" ("the", "and", "is", etc.).
Stemming - removal of certain types of suffixes.
“Bag of words” model - most text analysis methods treat documents as a big bunch of words or terms.
Order is generally not taken into account, just word and term frequencies.
There are ways to parse documents into n-grams or words, but we'll stick with words for now.
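As a quick illustration of the bag-of-words idea, here is a minimal sketch in base R (the sentence is a made-up example): tokenize on whitespace, lowercase, and count term frequencies, discarding word order.

```r
# A made-up example sentence
sentence <- "The cat sat on the mat because the mat was warm"
# Tokenization: split on whitespace, then lowercase
tokens <- tolower(strsplit(sentence, "\\s+")[[1]])
# Bag of words: term frequencies only, word order discarded
table(tokens)
```

The table counts "the" three times and "mat" twice; everything the bag-of-words model knows about this sentence is in those frequencies.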
# Search Tweets with "@POTUS"
potustweets = searchTwitter("@POTUS", n=5)
# Extract ONLY the TEXT from the tweets
potustweets = lapply(potustweets, function(t) t$getText())
# Emojis break all of our functions, so we have to make sure
# that everything is in UTF-8 encoding
for(i in 1:length(potustweets)){
  potustweets[[i]] <- iconv(potustweets[[i]], "ASCII", "UTF-8", sub="")
}
# We have to put all the tweets in lowercase at this stage
# because of a quirk in the "tm" package
potustweets = lapply(potustweets, toString)
potustweets = lapply(potustweets, tolower)
# Let's see the first two
potustweets[1:2]
## [[1]]
## [1] "@potus @usatoday how about addressing all crime and not just singling out immigrants. oh, right, that's not in the nazi playbook."
##
## [[2]]
## [1] "rt @atlaswiseman: @foxnews @trumpinaugural @jackieevancho @potus sure she/he will get over it"
library(tm)
potuscorpus<-Corpus(VectorSource(potustweets))
potuscorpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 5
To analyze texts we must make sure that all words are in the same format.
punctuation - we have to get rid of all punctuation (",.?!:" etc.).
numbers - numbers should be removed (years too!).
case - all words should be lower case.
potuscorpus <- tm_map(potuscorpus, removePunctuation)
potuscorpus <- tm_map(potuscorpus, stripWhitespace)
# Did it work?
potuscorpus[[1]]$content
## [1] "potus usatoday how about addressing all crime and not just singling out immigrants oh right thats not in the nazi playbook"
#vs
potustweets[[1]]
## [1] "@potus @usatoday how about addressing all crime and not just singling out immigrants. oh, right, that's not in the nazi playbook."
potuscorpus <- tm_map(potuscorpus, removeNumbers)
potuscorpus[[1]]$content
## [1] "potus usatoday how about addressing all crime and not just singling out immigrants oh right thats not in the nazi playbook"
Stop words are simply words that are removed during text processing.
They tend to be very common words: "the", "and", "is", etc.
These common words can cause problems for machine learning algorithms and search engines because they add noise.
BEWARE: Each package defines a different list of stop words, and removal can sometimes decrease the performance of supervised machine learning classifiers.
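Before removing them, it can be worth inspecting the list tm will actually use (a quick sketch; stopwords() is tm's built-in list):

```r
library(tm)
# How many English stop words does tm define, and what do they look like?
length(stopwords("english"))
head(stopwords("english"))
```

The list is dominated by pronouns, articles, and auxiliary verbs; if any of these matter for your application (e.g. "not" for sentiment), keep them.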
potuscorpus <- tm_map(potuscorpus, removeWords, stopwords("english"))
potuscorpus[[1]]$content
## [1] "potus usatoday addressing crime just singling immigrants oh right thats nazi playbook"
potustweets[[1]]
## [1] "@potus @usatoday how about addressing all crime and not just singling out immigrants. oh, right, that's not in the nazi playbook."
In linguistics, stemming is the process of reducing words to their stems.
“argue”, “argued”, “argues”, “arguing”, and “argus” reduce to the stem “argu”
This is especially useful for unsupervised machine learning algorithms but may introduce issues in supervised machine learning.
For example, "cats" and "catty" may both be reduced to stems resembling "cat", conflating words with very different meanings.
potuscorpus <- tm_map(potuscorpus, stemDocument)
potuscorpus[[1]]$content
## [1] "potus usatoday address crime just singl immigr oh right that nazi playbook"
potustweets[[1]]
## [1] "@potus @usatoday how about addressing all crime and not just singling out immigrants. oh, right, that's not in the nazi playbook."
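Under the hood, tm's stemDocument uses the Porter stemmer from the SnowballC package (assuming it is installed); you can call it directly to reproduce the "argu" example above:

```r
library(SnowballC)
# The Porter stemmer reduces all five forms to the same stem
wordStem(c("argue", "argued", "argues", "arguing", "argus"))
# [1] "argu" "argu" "argu" "argu" "argu"
```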
Instead of going through each procedure individually, let’s just create a pipeline.
The pipeline takes in raw text and outputs clean text.
We can incorporate everything we did above into one function.
text_cleaner <- function(corpus){
  tempcorpus <- Corpus(VectorSource(corpus))
  tempcorpus <- tm_map(tempcorpus, removePunctuation)
  tempcorpus <- tm_map(tempcorpus, stripWhitespace)
  tempcorpus <- tm_map(tempcorpus, removeNumbers)
  tempcorpus <- tm_map(tempcorpus, removeWords, stopwords("english"))
  tempcorpus <- tm_map(tempcorpus, stemDocument)
  return(tempcorpus)
}
potuscorpus<-text_cleaner(potustweets)
potuscorpus[[1]]$content
## [1] "potus usatoday address crime just singl immigr oh right that nazi playbook"
potuscorpus[[2]]$content
## [1] "rt atlaswiseman foxnew trumpinaugur jackieevancho potus sure shehe will get "
Need to go from texts \(\rightarrow\) numbers.
If \(d =\) the number of documents in a corpus and \(w =\) the number of unique words in it,
we create a matrix \(\Delta \in \mathbb{N}^{d \times w}\).
Rows are documents, columns are words.
This is called the document-term matrix.
dtm <- DocumentTermMatrix(potuscorpus)
dtm
## <<DocumentTermMatrix (documents: 5, terms: 50)>>
## Non-/sparse entries: 56/194
## Sparsity : 78%
## Maximal term length: 15
## Weighting : term frequency (tf)
inspect(dtm[1:5, 1:5])
## <<DocumentTermMatrix (documents: 5, terms: 5)>>
## Non-/sparse entries: 5/20
## Sparsity : 80%
## Maximal term length: 12
## Weighting : term frequency (tf)
##
## Terms
## Docs address atlaswiseman attempt author away
## 1 1 0 0 0 0
## 2 0 1 0 0 0
## 3 0 0 0 1 0
## 4 0 0 0 0 1
## 5 0 0 1 0 0
potuscorpus[[2]]$content
## [1] "rt atlaswiseman foxnew trumpinaugur jackieevancho potus sure shehe will get "
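The matrix above uses raw term-frequency (tf) weighting. The outline also mentions TF-IDF, which down-weights terms that appear in many documents; the same DocumentTermMatrix constructor accepts it through the control argument. A sketch with a made-up three-document corpus:

```r
library(tm)

# Made-up toy corpus of three short documents
toy <- Corpus(VectorSource(c("cocoa prices rose",
                             "cocoa sales fell",
                             "prices fell sharply")))
# Swap the default tf weighting for tf-idf
dtm_tfidf <- DocumentTermMatrix(toy, control = list(weighting = weightTfIdf))
inspect(dtm_tfidf)
```

Terms like "cocoa" that appear in two of the three documents receive lower weights than terms like "rose" or "sharply", which each appear in only one.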