Introduction to R

Loading the 2016 Election Data in R

For these exercises, the 2016 Presidential Election data file can be found here: http://scholar.harvard.edu/files/janastas/files/president-long.csv. The original data files were provided by Stephen Pettigrew and can be found here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/MLLQDH.

Loading libraries

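For this exercise we’re going to load real election data from the 2016 election into R and explore some of the variables a bit. The first thing you need to do is load a package using the “library()” function. A package has to be installed once (with “install.packages()”) before it can be loaded. The “foreign” package loaded below reads data files written by other statistical software; the CSV file used here is read with “read.csv()”, which is available in R without loading any package. If you want to know more about what a function does, you can put a “?” in front of its name.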

library(foreign) 

?library
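Installing a package and loading it are separate steps, which can be sketched as follows. This is a minimal sketch; it assumes an internet connection and a configured CRAN mirror in case the package actually has to be installed.

```r
# Install the package only if it is not already available (needed once per machine).
if (!requireNamespace("foreign", quietly = TRUE)) {
  install.packages("foreign")
}

# Load the package (needed once per R session).
library(foreign)

# read.csv() ships with R itself (in the utils package),
# so it is available even without a library() call.
exists("read.csv")
```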

Accessing data from your working directory

To access your data files you have two options. Option 1 involves finding the path to the directory that your data file is contained in, setting your directory, and importing the data file by name.

Option 2 involves finding the direct path of the file and importing the file using the entire path.

Option 1: Find the working directory of your files

The 2016 election data that I want to load is stored in a folder called “data”. Below I found the path to the folder and stored it as the string object “directory”.

directory = "/Users/jason/Dropbox/Classes-2016-2017-UGA/Spring 2017 Classes/Applied Politics/Lectures/01-10/data"

setwd(directory) # This sets your working directory to the path specified.

list.files(directory) # This lists the files in the directory

## [1] "election-data.r"    "president-long.csv"

president2016data = read.csv("president-long.csv") # This reads the data into R

# Check that the data was loaded correctly by exploring its dimensions
dim(president2016data) # There should be 54,182 rows and 8 columns.

## [1] 54182     8

Option 2: Find the direct path of your file

The 2016 election data is called “president-long.csv”. We can point directly to that file using the full path of that file.

datafile = "/Users/jason/Dropbox/Classes-2016-2017-UGA/Spring 2017 Classes/Applied Politics/Lectures/01-10/data/president-long.csv"

president2016data = read.csv(datafile)

# Check that the data was loaded correctly by exploring its dimensions
dim(president2016data) # There should be 54,182 rows and 8 columns.

## [1] 54182     8
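With either option, a mistyped path is the most common reason for a failed import. The sketch below shows one way to build a path and check it before reading. So that it runs anywhere, it writes a tiny made-up CSV to a temporary folder first; the names example.dir, example.file and example.data are hypothetical and not part of the election data.

```r
# Hypothetical example: write a tiny CSV to a temporary folder so the sketch runs anywhere.
example.dir = tempdir()
example.file = file.path(example.dir, "example-votes.csv") # file.path() joins folder and file name
write.csv(data.frame(state = c("AK", "AK"), votes = c(73, 2573)),
          example.file, row.names = FALSE)

# Confirm that the file exists before trying to read it.
if (!file.exists(example.file)) {
  stop("File not found: ", example.file)
}

example.data = read.csv(example.file)
dim(example.data) # Number of rows, then number of columns
```

The same file.exists() check can be applied to “datafile” before calling read.csv().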

Manipulating data

Accessing variables

Let’s see if we loaded the data correctly by checking if the variable names are correct.

names(president2016data)

## [1] "state"        "jurisdiction" "fipscode"     "office"      
## [5] "candidate"    "party"        "votes"        "totalvotes"

If we want to get a sense of what the first few observations of the data look like, we can use the “head()” function.

head(president2016data)

##   state jurisdiction fipscode    office                candidate
## 1    AK   District 1       NA President           Darrell Castle
## 2    AK   District 1       NA President          Hillary Clinton
## 3    AK   District 1       NA President Rocky Roque de la Fuente
## 4    AK   District 1       NA President             Gary Johnson
## 5    AK   District 1       NA President               Jill Stein
## 6    AK   District 1       NA President             Donald Trump
##          party votes totalvotes
## 1 Constitution    73       6638
## 2     Democrat  2573       6638
## 3         <NA>    28       6638
## 4  Libertarian   416       6638
## 5        Green   143       6638
## 6   Republican  3180       6638

Looks good! Now we can use the “attach()” function to access the variables directly by name, without the “president2016data$” prefix.

attach(president2016data)
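Because attach() places the columns on R’s search path, their names can collide with other objects of the same name. A minimal sketch of two alternatives follows, using a small made-up data frame (“toy” is a hypothetical name; its values are taken from the first rows printed above).

```r
# Hypothetical toy data frame with the same column names as the election data.
toy = data.frame(candidate = c("Hillary Clinton", "Donald Trump"),
                 votes = c(2573, 3180),
                 totalvotes = c(6638, 6638))

# Option A: refer to a column with $ (no attach() needed).
toy$votes

# Option B: evaluate an expression inside the data frame using with().
with(toy, votes / totalvotes)

# If attach() was used, detach() removes the data frame from the search path again.
attach(toy)
detach(toy)
```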

Tabulating variables by category

We can explore some of the variables with the “table()” function. Remember, each observation describes one presidential candidate in one precinct: it records the state, the precinct (jurisdiction), the candidate’s votes (votes), and the total votes cast in the precinct (totalvotes).

table(state) # This tells us how many observations (precinct-candidates) we have per state

## state
##   AK   AL   AR   AZ   CA   CO   CT   DC   DE   FL   GA   HI   IA   ID   IL 
##  287  335  600  300  580 1408 4056   56   12  804  477   20 1287 1100 2856 
##   IN   KS   KY   LA   MA   MD   ME   MI   MN   MO   MS   MT   NC   ND   NE 
## 1564   25 3480  832 4212 1368 4797 1079  870 1160    7  280  500  371  465 
##   NH   NJ   NM   NV   NY   OH   OK   OR   PA   RI   SC   SD   TN   TX   UT 
## 1536  189  264  102 2418 2024  231  180  335  234  322  264 1425 4318  725 
##   VA   VT   WA   WI   WV   WY 
##  798 1722  273 1152  275  207

table(candidate) # This tells us how many observations we have per presidential candidate in the entire dataset.

## candidate
##                Ajay Sood               All Others           Alyson Kennedy 
##                        1                      351                      399 
##            Ameer Flippin        Andrew D. Basiago       Anthony J. Valdiva 
##                       24                      310                      120 
##    Anthony Tony Valdivia           Arantxa Aranja              Ariel Cohen 
##                      481                       62                       62 
##      Author C. Brumfield         Barbara Whitaker          Barry Kirschner 
##                       98                       62                       88 
##             Ben Hartnell Bernard "Bernie" Sanders              Blank Votes 
##                      490                       58                      413 
##              BLANK VOTES          Bradford Lyttle        Brown,\xa0 Ray C. 
##                      533                       64                       92 
##          Bruce E. Jaynes  Cathy Johnson Pendleton             Cherunda Fox 
##                       88                       24                     1740 
##           Chris Keniston               Chris Lacy               Coop Smith 
##                      514                       44                       24 
##                   Cooper              Craig Ellis                 Cummings 
##                      169                      120                      169 
##             Dale Steffes             Dan R. Vacek            Dana E. Brown 
##                      254                      186                       24 
##       Daniel Paul Zutler           Darrell Castle             Darryl Perry 
##                      120                     2817                        1 
##          Darryl W. Perry           David G. Stack            David Librace 
##                       44                       98                       24 
##           David Limbaugh              David Perry                    Deame 
##                      168                      120                      169 
##        Delano Steinacker         Demetra Wysinger    Denny Carroll Jackson 
##                       15                      127                      120 
##             Donald Trump               Doug Terry       Douglas W. Thomson 
##                     4564                       24                       88 
##        Duff Cooper Smith             Dustin Baird          Emidio Soltysik 
##                       99                       29                      680 
##             Esther Welsh            Evan McMullin 
##                       62                     2832 
##  [ reached getOption("max.print") -- omitted 131 entries ]
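When a table has many categories, as the candidate table does, it can be easier to read after sorting it and keeping only the largest entries. The following is a minimal sketch on a made-up vector (“toy.candidate” is hypothetical); the same calls could be applied to “candidate”.

```r
# Hypothetical toy vector standing in for the candidate column.
toy.candidate = c("Donald Trump", "Hillary Clinton", "Donald Trump",
                  "Gary Johnson", "Hillary Clinton", "Donald Trump")

counts = table(toy.candidate)

# Sort from most to least frequent and keep only the top two.
top.counts = head(sort(counts, decreasing = TRUE), 2)
top.counts
```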

Creating variables

Let’s calculate the vote share in each precinct for each candidate:

vote.share = votes/totalvotes
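To see what this division does, here is a small sketch that repeats it on the six Alaska District 1 rows printed by head() earlier, typed in by hand. The data frame name “ak” is hypothetical.

```r
# The six Alaska District 1 rows shown by head(), typed in by hand.
ak = data.frame(
  candidate  = c("Darrell Castle", "Hillary Clinton", "Rocky Roque de la Fuente",
                 "Gary Johnson", "Jill Stein", "Donald Trump"),
  votes      = c(73, 2573, 28, 416, 143, 3180),
  totalvotes = rep(6638, 6)
)

# Division is element-wise: each row's votes is divided by that row's totalvotes.
ak$vote.share = ak$votes / ak$totalvotes

round(ak$vote.share, 3)
```

Storing the result as a column of the data frame, as done here, keeps each vote share next to the row it belongs to.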

We might be interested in creating a separate variable for each candidate by subsetting the “vote.share” variable by candidate. Let’s define variables that contain the vote share for only Donald Trump, Hillary Clinton, and Gary Johnson.

vote.share.johnson = vote.share[candidate == "Gary Johnson"] # Precinct vote share for Johnson
vote.share.clinton = vote.share[candidate == "Hillary Clinton"] # Precinct vote share for Clinton
vote.share.trump = vote.share[candidate == "Donald Trump"] # Precinct vote share for Trump

Summary statistics

In R, it’s easy to calculate averages and other summary statistics. Now let’s calculate the average vote share for each candidate across precincts:

mean(vote.share.johnson)

## [1] 0.03585668

mean(vote.share.clinton)

## [1] 0.3620333

mean(vote.share.trump)

## [1] 0.5631486
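Note that these are averages of precinct-level shares, so every precinct counts equally regardless of its size. They are generally not the same as a candidate’s overall share of all votes cast. The sketch below illustrates the difference on made-up numbers (the data frame “toy” and candidates “A” and “B” are hypothetical) and uses tapply() to compute a statistic for every candidate at once.

```r
# Hypothetical toy data: two precincts of very different size, two candidates.
toy = data.frame(
  candidate  = c("A", "B", "A", "B"),
  votes      = c(90, 10, 400, 600),
  totalvotes = c(100, 100, 1000, 1000)
)
toy$vote.share = toy$votes / toy$totalvotes

# Average precinct vote share per candidate (each precinct counts equally).
mean.share = tapply(toy$vote.share, toy$candidate, mean, na.rm = TRUE)

# Pooled vote share per candidate (each vote counts equally).
pooled.share = tapply(toy$votes, toy$candidate, sum) /
  tapply(toy$totalvotes, toy$candidate, sum)

mean.share
pooled.share
```

In this toy example candidate A wins the small precinct by a wide margin, so A’s average precinct share is higher than A’s pooled share.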

What if we want to know other information, such as the minimum and maximum? We can use the “summary()” function:

summary(vote.share.johnson)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.02420 0.03478 0.03586 0.04558 0.40000

summary(vote.share.clinton)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2344  0.3500  0.3620  0.4724  1.0000

summary(vote.share.trump)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.4431  0.5759  0.5631  0.7068  1.0000

Visualizing data

Let’s visualize the distribution of precinct vote shares for each candidate:

hist(vote.share.clinton, 
     main="Distribution of Precinct Level Vote Share for Clinton",
     xlab="Precinct vote share",
     col=blues9)

hist(vote.share.trump, 
     main="Distribution of Precinct Level Vote Share for Trump",
     xlab="Precinct vote share",
     col=blues9)

hist(vote.share.johnson, 
     main="Distribution of Precinct Level Vote Share for Johnson",
     xlab="Precinct vote share",
     col=blues9)
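By default hist() chooses its own bins for each plot, which can make histograms for different candidates harder to compare. One option is to pass the same “breaks” to every call. The sketch below assumes vote shares between 0 and 1 and uses simulated values (“toy.share” is hypothetical) only so that it runs on its own.

```r
# Hypothetical vote shares, simulated only so that the sketch runs on its own.
set.seed(1)
toy.share = runif(200)

# Fixed breaks from 0 to 1 in steps of 0.05 give every histogram the same bins.
share.breaks = seq(0, 1, by = 0.05)

h = hist(toy.share,
         breaks = share.breaks,
         main = "Distribution of Precinct Level Vote Share (toy data)",
         xlab = "Precinct vote share",
         col = "steelblue")

sum(h$counts) # Every observation falls into exactly one bin
```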

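Let’s look at distributions within states, using California as an example. Both conditions are applied to the full “vote.share” vector, because “state” has one entry per row of the data and cannot be used to index the shorter candidate-specific vectors.

hist(vote.share[candidate == "Hillary Clinton" & state == "CA"], 
     main="Clinton, California Precincts",
     xlab="Precinct vote share",
     col=blues9)

hist(vote.share[candidate == "Donald Trump" & state == "CA"],
     main="Trump, California Precincts",
     xlab="Precinct vote share",
     col=blues9)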
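Selecting rows by candidate and state at the same time works reliably when the logical index has the same length as the vector being subset. A minimal sketch on made-up data follows (the data frame “toy” and its numbers are hypothetical).

```r
# Hypothetical toy data: three precincts, two candidates each.
toy = data.frame(
  state      = c("CA", "CA", "CA", "CA", "GA", "GA"),
  candidate  = c("Hillary Clinton", "Donald Trump", "Hillary Clinton",
                 "Donald Trump", "Hillary Clinton", "Donald Trump"),
  votes      = c(60, 40, 70, 30, 45, 55),
  totalvotes = c(100, 100, 100, 100, 100, 100)
)
toy$vote.share = toy$votes / toy$totalvotes

# Both conditions are evaluated on the full data,
# so the logical index has one entry per row.
clinton.ca = toy$vote.share[toy$candidate == "Hillary Clinton" & toy$state == "CA"]
clinton.ca

# subset() expresses the same filter on the data frame itself.
clinton.ca.alt = subset(toy, candidate == "Hillary Clinton" & state == "CA")$vote.share
```

Indexing a candidate-specific vector with a condition on the full-length “state” column would instead mix up rows and introduce missing values.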