2/23/2017

For today

  1. Building a corpus using APIs.

  2. Pre-processing text data: tokenization, stemming, removing stop words.

  3. Document-term matrix and TF-IDF.

Over the next few weeks

  1. Supervised learning with text data
  • Naive Bayes
  • Support vector machines
  • Neural networks
  2. Unsupervised learning with text data
  • K-means clustering
  • Latent semantic analysis
  • Latent Dirichlet allocation (topic modeling)

Intro to natural language processing

  • NLP is commonly understood as a set of methods that map natural language units (words, sentences, paragraphs, etc.) into a machine-readable form.

  • Once we can figure out how to represent language to a machine, we can then use statistical/mathematical tools to learn things about texts (without having to read them!).

NLP Applications (Real World)

  • SPAM Filters

  • Speech recognition (Siri)

  • Machine translation

  • Information retrieval (search engines)

  • Artificial intelligence

NLP Applications (Social Science)

  • Political sentiment and media bias - Soroka, Stuart and Lori Young. 2012. "Affective News: The Automated Coding of Sentiment in Political Texts." Political Communication 29: 205-231.

  • Identification of politically relevant features of texts - Monroe, Burt, Michael Colaresi, and Kevin Quinn. 2008. "Fightin' Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict." Political Analysis 16(4).

  • Measuring expressed agendas in political texts - Grimmer, Justin. 2010. "A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases." Political Analysis 18(1): 1-35.

  • And much much more…

NLP Basic Terminology

\[ word \subset document \subset corpus\]

  • document - A collection of words, usually the unit of observation.

  • corpus - A collection of documents or a single document.

  • The corpus can be thought of as our entire dataset.

  • The document can be thought of as a single observation.

Example: Measuring political ideology from Tweets

Barberá, Pablo. 2013. "Birds of the Same Feather Tweet Together: Bayesian Ideal Point Estimation Using Twitter Data." Political Analysis.

  • Measuring the political ideology of Twitter users from tweets and patterns of following/followers.

  • document - Each tweet is a document.

  • corpus - All of the tweets that are analyzed are the corpus.

Acquiring text data

  1. Online corpora - Many packages in R and Python come with built-in corpora containing thousands of texts that you can load and access directly.

  2. Build your own corpora

Online corpora

library(tm)
# Install and load the Reuters-21578 corpus from the WU data repository
install.packages("tm.corpus.Reuters21578", 
                 repos = "http://datacube.wu.ac.at")
library(tm.corpus.Reuters21578)
data(Reuters21578)
  • There are thousands of online corpora that are available.

  • R is less convenient for accessing ready-made online corpora; relatively few are packaged for it, and they are not as easy to acquire as in Python.

  • Here we are accessing the Reuters-21578 corpus, a collection of 21,578 articles from the Reuters newswire in 1987.

Reuters Corpus

  • Let's use the tm package to see what these articles look like.
inspect(Reuters21578[1:2])
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 2
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  16
## Content:  chars: 2860
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  16
## Content:  chars: 438

Reuters Corpus

  • We can also read the articles to see how they're structured in R.
Reuters21578[[1]]$content
## [1] "Showers continued throughout the week in\nthe Bahia cocoa zone, alleviating the drought since early\nJanuary and improving prospects for the coming temporao,\nalthough normal humidity levels have not been restored,\nComissaria Smith said in its weekly review.\n    The dry period means the temporao will be late this year.\n    Arrivals for the week ended February 22 were 155,221 bags\nof 60 kilos making a cumulative total for the season of 5.93\nmln against 5.81 at the same stage last year. Again it seems\nthat cocoa delivered earlier on consignment was included in the\narrivals figures.\n    Comissaria Smith said there is still some doubt as to how\nmuch old crop cocoa is still available as harvesting has\npractically come to an end. With total Bahia crop estimates\naround 6.4 mln bags and sales standing at almost 6.2 mln there\nare a few hundred thousand bags still in the hands of farmers,\nmiddlemen, exporters and processors.\n    There are doubts as to how much of this cocoa would be fit\nfor export as shippers are now experiencing dificulties in\nobtaining +Bahia superior+ certificates.\n    In view of the lower quality over recent weeks farmers have\nsold a good part of their cocoa held on consignment.\n    Comissaria Smith said spot bean prices rose to 340 to 350\ncruzados per arroba of 15 kilos.\n    Bean shippers were reluctant to offer nearby shipment and\nonly limited sales were booked for March shipment at 1,750 to\n1,780 dlrs per tonne to ports to be named.\n    New crop sales were also light and all to open ports with\nJune/July going at 1,850 and 1,880 dlrs and at 35 and 45 dlrs\nunder New York july, Aug/Sept at 1,870, 1,875 and 1,880 dlrs\nper tonne FOB.\n    Routine sales of butter were made. March/April sold at\n4,340, 4,345 and 4,350 dlrs.\n    April/May butter went at 2.27 times New York May, June/July\nat 4,400 and 4,415 dlrs, Aug/Sept at 4,351 to 4,450 dlrs and at\n2.27 and 2.28 times New York Sept and Oct/Dec at 4,480 dlrs and\n2.27 times New York Dec, Comissaria Smith said.\n    Destinations were the U.S., Covertible currency areas,\nUruguay and open ports.\n    Cake sales were registered at 785 to 995 dlrs for\nMarch/April, 785 dlrs for May, 753 dlrs for Aug and 0.39 times\nNew York Dec for Oct/Dec.\n    Buyers were the U.S., Argentina, Uruguay and convertible\ncurrency areas.\n    Liquor sales were limited with March/April selling at 2,325\nand 2,380 dlrs, June/July at 2,375 dlrs and at 1.25 times New\nYork July, Aug/Sept at 2,400 dlrs and at 1.25 times New York\nSept and Oct/Dec at 1.25 times New York Dec, Comissaria Smith\nsaid.\n    Total Bahia sales are currently estimated at 6.13 mln bags\nagainst the 1986/87 crop and 1.06 mln bags against the 1987/88\ncrop.\n    Final figures for the period to February 28 are expected to\nbe published by the Brazilian Cocoa Trade Commission after\ncarnival which ends midday on February 27.\n Reuter\n"

Corpora in Python

import nltk
from nltk.corpus import inaugural
# List the documents (one file per inaugural address) in the corpus;
# the first time, this may require nltk.download('inaugural')
inaugural.fileids()
[u'1789-Washington.txt', u'1793-Washington.txt', u'1797-Adams.txt', u'1801-Jefferson.txt', u'1805-Jefferson.txt', u'1809-Madison.txt' ...]
  • In general, I recommend using Python if you want to access ready-to-go corpora.

  • This can be easily done with the NLTK package.

  • NLTK has many corpora ready to go, but using them requires more advanced Python programming than we have time for in this class.

  • For more details see: http://www.nltk.org/book/ch02.html

Building your own corpora using APIs

  • In general, ready-to-go corpora are not that useful.

  • They tend to contain older documents and are unlikely to have the information you want.

  • For the most part, when you're doing text analysis, you'll have to build your own corpora.

Building your own corpora using APIs

  • The easiest way to do this is to extract data from one of the MILLIONS of APIs out there.

  • APIs were originally designed to allow developers of apps to have constant streaming access to data.

  • But they are a treasure trove of information of immeasurable use to social scientists who know how to tap into them.

Political science APIs

  • GovTrack.us API - Get any information about Congress (bills, legislators, voting) over several Congresses.

  • OpenStates API - Tons of information about state legislators, bills, voting, etc.

  • Opensecrets.org - Money in politics and campaign finance database.

  • Twitter API - Get streaming tweets from Twitter with user information etc.

Extracting data from APIs using R

  • Three packages are very useful for this purpose: twitteR, httr, and jsonlite (a short httr example is sketched below).

  • You first must establish OAuth credentials if you would like to access Tweets.

  • You can find out how to do so here: Getting OAuth Credentials
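As a quick illustration of the general pattern with httr, here is a minimal sketch that queries the GovTrack endpoint used later in these slides (the query string is the same "refugee" search; the fields of the parsed result are not shown here):

library(httr)
# Send a GET request to the GovTrack bill-search endpoint
resp <- GET("https://www.govtrack.us/api/v2/bill", query = list(q = "refugee"))
# Check the HTTP status code and parse the JSON body into R lists
status_code(resp)
bills <- content(resp, as = "parsed")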

Extracting Tweets using TwitteR

library(twitteR)
# Authenticate with the credentials from your Twitter app
# (placeholders shown; substitute your own keys and tokens)
setup_twitter_oauth(
  consumer_key="YOUR_CONSUMER_KEY", 
  consumer_secret="YOUR_CONSUMER_SECRET", 
  access_token="YOUR_ACCESS_TOKEN", 
  access_secret="YOUR_ACCESS_SECRET")
## [1] "Using direct authentication"
searchTwitter("@universityofga")[1:2]
## [[1]]
## [1] "MikeWithout_Ike: Come look at some Black History today at Tate's intersection..... and yes food will be provided @UGABMLS #bmls… https://t.co/LjvoEpExUy"
## 
## [[2]]
## [1] "SHAAREASSIST: @hargrettlibrary @universityofga @ugaalumniassoc we are asking...step up and step out and be a sanctuary to those who need you #Love"

Extracting Tweets using TwitteR

Tweets are saved as a list in R

UGATweets = searchTwitter("UGA")[1:2]
UGATweets[[1]]
## [1] "universityofga: Shreya Ganeshan is committed to clean energy innovation\xed\xa0\xbd\xed\xb2\xa1\nhttps://t.co/MZoSrEmloI | #MyCommitment https://t.co/YX88CMhm1o"

Example 2: Tapping into APIs using "jsonlite"

library(jsonlite)
# Bills with the term "refugee" in them
bills<-fromJSON("https://www.govtrack.us/api/v2/bill?q=refugee") 

Example 2: Tapping into APIs using "jsonlite"

# We can retrieve the title and other information about these bills here.
# I'm creating a data frame with the bill title, bill id, bill sponsor
# id, and the bill sponsor's gender

billtitles<-bills$objects$title
billid<-bills$objects$id
sponsorid<-bills$objects$sponsor$id
sponsorgender<-bills$objects$sponsor$gender

refugeebilldat<-
  data.frame(billtitles,billid,sponsorid,sponsorgender)

refugeebilldat[1:2,1:2]
##                                                                                                billtitles
## 1                                                                  S. 2842 (101st): Refugee Repayment Act
## 2 H.R. 6428 (94th): A bill to amend the Migration and Refugee Assistance to refugees from Southeast Asia.
##   billid
## 1  95731
## 2 216254

Pre-processing text data

  • Now that we have the data, we need to prepare it for analysis.

  • This involves a few steps that take strings and break them down into analyzable units.

Pre-processing text data steps

  1. Tokenization - splits the document into tokens which can be words or n-grams (phrases).

  2. Formatting - punctuation, numbers, case, spacing.

  3. Stop words - removal of "stop words"

  4. Stemming - removal of certain types of suffixes.

Tokenization

  • "Bag of words" model - most text analysis methods treat documents as a big bunch of words or terms.

  • Order is generally not taken into account, just word and term frequencies.

  • There are ways to parse documents into n-grams or words, but we'll stick with words for now (see the short sketch below).
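To see the difference between word tokens and n-grams, here is a minimal base-R sketch (illustrative only; packages such as tm and RWeka provide proper tokenizers):

# Split a toy sentence into word tokens (unigrams)
text  <- "we love text analysis"
words <- unlist(strsplit(text, " "))
words    # "we" "love" "text" "analysis"
# Build bigrams by pasting each word to the word that follows it
bigrams <- paste(head(words, -1), tail(words, -1))
bigrams  # "we love" "love text" "text analysis"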

Tokenization example: Tweets mentioning "@POTUS"

# Search Tweets with "@POTUS" 
potustweets = searchTwitter("@POTUS",n=5)
# Extract ONLY the TEXT from the tweets
potustweets = lapply(potustweets, function(t) t$getText())
# Emojis screw up all of our functions, so we have to make sure
# that everything is in UTF-8 encoding
for(i in 1:length(potustweets)){
  potustweets[[i]]<-iconv(potustweets[[i]], "ASCII", "UTF-8", sub="")
}
# We have to put all the tweets in lowercase at this stage
# b/c of a screwy problem w/ the "tm" package
potustweets = lapply(potustweets, toString)
potustweets = lapply(potustweets, tolower)
# Let's see the first two
potustweets[1:2]
## [[1]]
## [1] "rt @foxnews: .@potus: \"we're going to have a good relationship with mexico, i hope, and if we don't, we don't.\" https://t.co/d65ustfpgw"
## 
## [[2]]
## [1] "rt @opensecretsdc: founding member of mar-a-lago is @potus pick for ambassador to dominican republic.\n#trump org considering deal there htt"

Tokenization example: Tweets mentioning "@POTUS"

library(tm)
potuscorpus<-Corpus(VectorSource(potustweets))
potuscorpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 5
  • Let's use the "tm" package to create a corpus, which effectively tokenizes each of the documents (Tweets).

Formatting

  • To analyze texts we must make sure that all words are in the same format.

  • punctuation - we have to get rid of all punctuation: ",.?!:" etc.

  • numbers - numbers should be removed (years too!).

  • case - all words should be lower case.

Formatting example: Tweets mentioning "@POTUS"

potuscorpus<-tm_map(potuscorpus,
                    removePunctuation)
potuscorpus<-tm_map(potuscorpus,
                    stripWhitespace)
# Did it work?
potuscorpus[[1]]$content
## [1] "rt foxnews potus were going to have a good relationship with mexico i hope and if we dont we dont httpstcod65ustfpgw"
#vs
potustweets[[1]]
## [1] "rt @foxnews: .@potus: \"we're going to have a good relationship with mexico, i hope, and if we don't, we don't.\" https://t.co/d65ustfpgw"

Formatting example: Tweets mentioning "@POTUS"

potuscorpus<-tm_map(potuscorpus,
                    removeNumbers)

potuscorpus[[1]]$content
## [1] "rt foxnews potus were going to have a good relationship with mexico i hope and if we dont we dont httpstcodustfpgw"
  • Let's get rid of numbers.

Stop words

  • Stop words are simply words that are removed during text processing.

  • They tend to be words that are very common "the", "and", "is" etc.

  • These common words can cause problems for machine learning algorithms and search engines because they add noise.

  • BEWARE: each package defines a different list of stop words, and removal can sometimes decrease the performance of supervised machine learning classifiers (you can inspect a package's list directly, as sketched below).
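Because every package ships its own list, it is worth inspecting the list before removing anything. A quick check of tm's default English stop word list:

library(tm)
# How many stop words will tm remove by default?
length(stopwords("english"))
# Peek at the list itself (common words such as "the", "and", "is")
head(stopwords("english"), 20)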

Stop word removal example: Tweets mentioning "@POTUS"

potuscorpus<-tm_map(potuscorpus,
                     removeWords, stopwords("english"))

potuscorpus[[1]]$content
## [1] "rt foxnews potus  going    good relationship  mexico  hope    dont  dont httpstcodustfpgw"
potustweets[[1]]
## [1] "rt @foxnews: .@potus: \"we're going to have a good relationship with mexico, i hope, and if we don't, we don't.\" https://t.co/d65ustfpgw"

Stemming

  • In linguistics, stemming is the process of reducing words to their stems.

  • "argue", "argued", "argues", "arguing", and "argus" reduce to the stem "argu"

  • This is especially useful for unsupervised machine learning algorithms but may introduce issues in supervised machine learning.

  • For example "cats" and "catty" would both be reduced to the term "cat".

Stemming example: Tweets mentioning "@POTUS"

potuscorpus<-tm_map(potuscorpus, 
                    stemDocument)

potuscorpus[[1]]$content
## [1] "rt foxnew potus  go    good relationship  mexico  hope    dont  dont httpstcodustfpgw"
potustweets[[1]]
## [1] "rt @foxnews: .@potus: \"we're going to have a good relationship with mexico, i hope, and if we don't, we don't.\" https://t.co/d65ustfpgw"

Building a pipeline

  • Instead of going through each procedure individually, let's just create a pipeline.

  • The pipeline takes in raw text and outputs clean text.

  • We can incorporate everything we did above into one function.

Building a pipeline

text_cleaner<-function(corpus){
  # Turn the raw character vector into a tm corpus (tokenization)
  tempcorpus<-Corpus(VectorSource(corpus))
  # Formatting: strip punctuation, extra whitespace, and numbers
  tempcorpus<-tm_map(tempcorpus,
                    removePunctuation)
  tempcorpus<-tm_map(tempcorpus,
                    stripWhitespace)
  tempcorpus<-tm_map(tempcorpus,
                    removeNumbers)
  # Remove English stop words
  tempcorpus<-tm_map(tempcorpus,
                     removeWords, stopwords("english"))
  # Stem each document
  tempcorpus<-tm_map(tempcorpus, 
                    stemDocument)
  return(tempcorpus)
}

potuscorpus<-text_cleaner(potustweets)
potuscorpus[[1]]$content
## [1] "rt foxnew potus  go    good relationship  mexico  hope    dont  dont httpstcodustfpgw"
potuscorpus[[2]]$content
## [1] "rt opensecretsdc found member  maralago  potus pick  ambassador  dominican republ trump org consid deal  htt"

From words to vectors

  • Need to go from texts \(\rightarrow\) numbers.

  • If \(d\) is the number of documents in a corpus and \(w\) is the number of distinct words (terms) in the corpus,

  • we create a matrix \(\Delta \in \mathbb{N}^{d \times w}\).

  • Rows are documents, columns are words.

  • This is called the document-term matrix.

Building the document-term matrix

dtm <- DocumentTermMatrix(potuscorpus)
dtm
## <<DocumentTermMatrix (documents: 5, terms: 53)>>
## Non-/sparse entries: 59/206
## Sparsity           : 78%
## Maximal term length: 17
## Weighting          : term frequency (tf)

Building the document-term matrix

inspect(dtm[1:5, 1:5])
## <<DocumentTermMatrix (documents: 5, terms: 5)>>
## Non-/sparse entries: 5/20
## Sparsity           : 80%
## Maximal term length: 10
## Weighting          : term frequency (tf)
## 
##     Terms
## Docs accomplish address aggress ambassador american
##    1          0       0       0          0        0
##    2          0       0       0          1        0
##    3          0       0       0          0        0
##    4          1       0       0          0        1
##    5          0       1       1          0        0
potuscorpus[[2]]$content
## [1] "rt opensecretsdc found member  maralago  potus pick  ambassador  dominican republ trump org consid deal  htt"