For these exercises, the 2016 Presidential Election Data file can be found here: http://scholar.harvard.edu/files/janastas/files/president-long.csv. The original data files were provided by Stephen Pettigrew and can be found here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/MLLQDH.
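For this exercise we’re going to load real data from the 2016 election into R and explore some of the variables a bit. The first thing that you need to do is load a package, which makes its functions available, using the “library()” function. Here we load “foreign”, a package for reading data files saved by other statistical software such as Stata or SPSS. A package that is not yet installed has to be installed once with “install.packages()” before “library()” can load it. If you want to know more about what a function does, you can put a “?” in front of its name.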
library(foreign)
?library
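Because “library()” stops with an error when a package is not installed, it can help to check first. The lines below are a minimal sketch of that check; the variable names “pkg” and “loaded” are only for illustration.

```r
# Name of the package we want to use (stored in a variable for illustration)
pkg <- "foreign"

# requireNamespace() returns TRUE if the package is installed, FALSE otherwise
if (requireNamespace(pkg, quietly = TRUE)) {
  # character.only = TRUE tells library() that pkg holds the name as a string
  library(pkg, character.only = TRUE)
  loaded <- TRUE
} else {
  message("Package '", pkg, "' is not installed. ",
          "Run install.packages(\"", pkg, "\") once, then try again.")
  loaded <- FALSE
}
```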
To access your data files you have two options. Option 1 involves finding the path to the directory that your data file is contained in, setting your directory, and importing the data file by name.
Option 2 involves finding the direct path of the file and importing the file using the entire path.
The 2016 election data that I want to load is stored in a folder called “data”. Below I found the path to the folder and stored it as the string object “directory”.
directory = "/Users/jason/Dropbox/Classes-2016-2017-UGA/Spring 2017 Classes/Applied Politics/Lectures/01-10/data"
setwd(directory) # This sets your working directory to the path specified.
list.files(directory) # This lists the files in the directory
## [1] "election-data.r" "president-long.csv"
president2016data = read.csv("president-long.csv") # This reads the data into R
# Check that the data was loaded correctly by exploring its dimensions
dim(president2016data) # There should be 54,182 rows and 8 columns.
## [1] 54182 8
The 2016 election data is called “president-long.csv”. We can point directly to that file using the full path of that file.
datafile = "/Users/jason/Dropbox/Classes-2016-2017-UGA/Spring 2017 Classes/Applied Politics/Lectures/01-10/data/president-long.csv"
president2016data = read.csv(datafile)
# Check that the data was loaded correctly by exploring its dimensions
dim(president2016data) # There should be 54,182 rows and 8 columns.
## [1] 54182 8
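Both options should give you exactly the same data. Here is a small sketch you can run on any machine: it writes a tiny made-up file (“toy-election.csv”, a hypothetical stand-in for “president-long.csv”) to a temporary folder and then reads it back both ways.

```r
# Create a tiny made-up data file in a temporary folder (hypothetical values)
toy_dir  <- tempdir()
toy_path <- file.path(toy_dir, "toy-election.csv")  # file.path() builds the full path
toy <- data.frame(state = c("AK", "AK"),
                  candidate = c("A", "B"),
                  votes = c(10, 30))
write.csv(toy, toy_path, row.names = FALSE)

# Option 1: set the working directory, then read the file by name
old_wd  <- setwd(toy_dir)   # setwd() returns the previous working directory
by_name <- read.csv("toy-election.csv")
setwd(old_wd)               # go back to where we started

# Option 2: read the file using its full path
by_path <- read.csv(toy_path)

# The two data frames should be the same
identical(by_name, by_path)
```

Option 2 does not change your working directory, which can be convenient when your data and your scripts live in different folders.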
Let’s see if we loaded the data correctly by checking if the variable names are correct.
names(president2016data)
## [1] "state" "jurisdiction" "fipscode" "office"
## [5] "candidate" "party" "votes" "totalvotes"
If we want to get a sense of what the first few observations of the data look like, we can use the “head()” function:
head(president2016data)
## state jurisdiction fipscode office candidate
## 1 AK District 1 NA President Darrell Castle
## 2 AK District 1 NA President Hillary Clinton
## 3 AK District 1 NA President Rocky Roque de la Fuente
## 4 AK District 1 NA President Gary Johnson
## 5 AK District 1 NA President Jill Stein
## 6 AK District 1 NA President Donald Trump
## party votes totalvotes
## 1 Constitution 73 6638
## 2 Democrat 2573 6638
## 3 <NA> 28 6638
## 4 Libertarian 416 6638
## 5 Green 143 6638
## 6 Republican 3180 6638
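The output above contains some missing values (“NA”) in the “fipscode” and “party” columns. A quick way to see how many there are per column is sketched below on a toy data frame with made-up values; the same two calls could be applied to “president2016data”.

```r
# Toy data frame with hypothetical values, shaped like the rows above
toy <- data.frame(
  candidate = c("Candidate A", "Candidate B", "Candidate C"),
  party     = c("Party X", NA, "Party Y"),
  fipscode  = c(NA, NA, NA),
  votes     = c(73, 28, 416)
)

# str() shows each column's type and its first few values
str(toy)

# is.na() marks missing entries; colSums() counts them per column
missing_per_column <- colSums(is.na(toy))
missing_per_column
```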
Looks good! Now we can use the “attach()” function to access the variables directly by name:
attach(president2016data)
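“attach()” lets us type “votes” instead of “president2016data$votes”. The same columns can also be reached without attaching anything, which is sketched here on a toy data frame with made-up numbers.

```r
# Toy data frame (hypothetical values)
toy <- data.frame(votes = c(10, 30, 60), totalvotes = c(100, 100, 100))

# Option A: name the data frame explicitly with $
share_dollar <- toy$votes / toy$totalvotes

# Option B: with() evaluates the expression using the data frame's columns
share_with <- with(toy, votes / totalvotes)

# Both give the same result
identical(share_dollar, share_with)
```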
We can explore some of the variables with the “table()” function. Remember, each observation contains information about the state, the precinct (district), the votes for one presidential candidate in that precinct, and the total votes cast in that precinct (totalvotes).
table(state) # This tells us how many observations (precinct-candidates) we have per state
## state
## AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL
## 287 335 600 300 580 1408 4056 56 12 804 477 20 1287 1100 2856
## IN KS KY LA MA MD ME MI MN MO MS MT NC ND NE
## 1564 25 3480 832 4212 1368 4797 1079 870 1160 7 280 500 371 465
## NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT
## 1536 189 264 102 2418 2024 231 180 335 234 322 264 1425 4318 725
## VA VT WA WI WV WY
## 798 1722 273 1152 275 207
table(candidate) # This tells us how many observations we have per presidential candidate in the entire dataset.
## candidate
## Ajay Sood All Others Alyson Kennedy
## 1 351 399
## Ameer Flippin Andrew D. Basiago Anthony J. Valdiva
## 24 310 120
## Anthony Tony Valdivia Arantxa Aranja Ariel Cohen
## 481 62 62
## Author C. Brumfield Barbara Whitaker Barry Kirschner
## 98 62 88
## Ben Hartnell Bernard "Bernie" Sanders Blank Votes
## 490 58 413
## BLANK VOTES Bradford Lyttle Brown,\xa0 Ray C.
## 533 64 92
## Bruce E. Jaynes Cathy Johnson Pendleton Cherunda Fox
## 88 24 1740
## Chris Keniston Chris Lacy Coop Smith
## 514 44 24
## Cooper Craig Ellis Cummings
## 169 120 169
## Dale Steffes Dan R. Vacek Dana E. Brown
## 254 186 24
## Daniel Paul Zutler Darrell Castle Darryl Perry
## 120 2817 1
## Darryl W. Perry David G. Stack David Librace
## 44 98 24
## David Limbaugh David Perry Deame
## 168 120 169
## Delano Steinacker Demetra Wysinger Denny Carroll Jackson
## 15 127 120
## Donald Trump Doug Terry Douglas W. Thomson
## 4564 24 88
## Duff Cooper Smith Dustin Baird Emidio Soltysik
## 99 29 680
## Esther Welsh Evan McMullin
## 62 2832
## [ reached getOption("max.print") -- omitted 131 entries ]
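The candidate table is long, and some labels appear more than once with different spellings (for example “Blank Votes” and “BLANK VOTES”). A minimal sketch, using a handful of made-up labels, of how to sort a table by count and how to merge labels that differ only in capitalization:

```r
# Hypothetical labels imitating the spelling variants seen in the output above
cand_labels <- c("Blank Votes", "BLANK VOTES", "Donald Trump", "Donald Trump",
                 "Donald Trump", "Hillary Clinton", "Hillary Clinton")

# Sort the counts from largest to smallest
sorted_counts <- sort(table(cand_labels), decreasing = TRUE)
sorted_counts

# Upper-casing merges labels that differ only in capitalization
merged_counts <- table(toupper(cand_labels))
merged_counts
```

Upper-casing only handles capitalization; variants such as “Darryl Perry” and “Darryl W. Perry” would still need to be recoded by hand.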
Let’s calculate the vote share in each precinct for each candidate:
vote.share = votes/totalvotes
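To see what this division does, here it is on the six rows printed by “head()” earlier. The numbers are copied from that output; the object names ending in “_d1” are just for illustration.

```r
# Votes and precinct total copied from the six rows shown by head() above
votes_d1      <- c(73, 2573, 28, 416, 143, 3180)
totalvotes_d1 <- rep(6638, 6)

# Division is element-wise: each candidate's votes over the precinct total
share_d1 <- votes_d1 / totalvotes_d1
round(share_d1, 3)
```

These six shares need not add up to 1, because “head()” shows only the first six rows of the data.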
We might be interested in creating variables for each candidate by subsetting the “vote.share” variable by candidate. Let’s define variables which contain the vote share for only Donald Trump, Hillary Clinton and Gary Johnson.
vote.share.johnson = vote.share[candidate == "Gary Johnson"] # Precinct vote share for Johnson
vote.share.clinton = vote.share[candidate=="Hillary Clinton"] # Precinct vote share for Clinton
vote.share.trump = vote.share[candidate == "Donald Trump"] # Precinct vote share for Trump
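The square brackets above use a logical vector (TRUE/FALSE) that is as long as the data and keep only the positions that are TRUE. A tiny sketch with made-up values:

```r
cand  <- c("A", "B", "A", "B")        # hypothetical candidate labels
share <- c(0.60, 0.40, 0.25, 0.75)    # hypothetical vote shares

# One TRUE/FALSE per observation
is_a <- cand == "A"
is_a

# Keep only the shares where is_a is TRUE
share_a <- share[is_a]
share_a

# The subset is shorter than the original vector
length(share)
length(share_a)
```

Note that the subset has fewer elements than the original vector, so it no longer lines up row by row with the other variables in the data.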
In R, it’s easy to calculate averages and other summary statistics. Now let’s calculate the average vote share for each candidate across precincts:
mean(vote.share.johnson)
## [1] 0.03585668
mean(vote.share.clinton)
## [1] 0.3620333
mean(vote.share.trump)
## [1] 0.5631486
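Keep in mind that an average of precinct vote shares counts every precinct equally, no matter how many people voted there, so it generally differs from a candidate's overall share of all votes. The sketch below shows the difference on a made-up data frame (one small and one large precinct) and uses “tapply()” to compute a statistic for every candidate in one call. It illustrates the idea only and says nothing about the real results.

```r
# Two hypothetical precincts: a small one (100 votes) and a large one (1000 votes)
toy <- data.frame(
  candidate  = c("A", "B", "A", "B"),
  votes      = c(90, 10, 400, 600),
  totalvotes = c(100, 100, 1000, 1000)
)
toy$share <- toy$votes / toy$totalvotes

# Mean of the precinct vote shares, one value per candidate
mean_share <- tapply(toy$share, toy$candidate, mean)
mean_share

# Overall vote share: all votes for the candidate over all votes cast
overall_share <- tapply(toy$votes, toy$candidate, sum) /
  tapply(toy$totalvotes, toy$candidate, sum)
overall_share
```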
What if we want to know other information, like the minimum, the maximum, and so on? We can use the “summary()” function:
summary(vote.share.johnson)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.02420 0.03478 0.03586 0.04558 0.40000
summary(vote.share.clinton)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2344 0.3500 0.3620 0.4724 1.0000
summary(vote.share.trump)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.4431 0.5759 0.5631 0.7068 1.0000
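The result of “summary()” can be stored and its pieces pulled out by name, and “quantile()” gives any other percentile. A minimal sketch on a made-up vector of vote shares:

```r
share <- c(0.02, 0.03, 0.035, 0.05, 0.40)  # hypothetical vote shares

# Store the summary and extract single values by their labels
s <- summary(share)
s[["Median"]]
s[["Max."]]

# quantile() returns any percentile, here the 90th
q90 <- quantile(share, probs = 0.90)
q90
```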
Let’s visualize the distribution of vote share for each candidate:
hist(vote.share.clinton,
main="Distribution of Precinct Level Vote Share for Clinton",
xlab="Vote share precinct",
col=blues9)
hist(vote.share.trump,
main="Distribution of Precinct Level Vote Share for Trump",
xlab="",
col=blues9)
hist(vote.share.johnson,
main="Distribution of Precinct Level Vote Share for Johnson",
xlab="",
col=blues9)
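Histograms are easier to compare when they share the same bins and sit side by side. The sketch below does this with simulated numbers (not the election data); “share_a” and “share_b” are hypothetical stand-ins for the candidate vote share variables.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated vote shares between 0 and 1 (hypothetical, not the election data)
share_a <- runif(200)
share_b <- runif(200)

# Shared bin edges so the two histograms are directly comparable
bins <- seq(0, 1, by = 0.05)

old_par <- par(mfrow = c(1, 2))  # one row, two panels
h_a <- hist(share_a, breaks = bins, main = "Candidate A",
            xlab = "Precinct vote share", col = "steelblue")
h_b <- hist(share_b, breaks = bins, main = "Candidate B",
            xlab = "Precinct vote share", col = "steelblue")
par(old_par)  # restore the previous plot layout
```

“hist()” also returns the bin counts, so “h_a$counts” can be inspected if you want the numbers behind the picture.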
Let’s look at distributions within states, California for example:
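# Subset the full-length vote.share by candidate and state together;
# vote.share.clinton is shorter than state, so the two would not line up.
hist(vote.share[candidate == "Hillary Clinton" & state == "CA"],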
main="Clinton, California Precincts",
xlab= " ",
col=blues9)
# Subset the full-length vote.share by candidate and state together
hist(vote.share[candidate == "Donald Trump" & state == "CA"],
main="Trump, California Precincts",
xlab= " ",
col=blues9)
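Picking out one candidate within one state means combining two conditions with “&”, both evaluated on vectors of the same length as the data. If you need this for many candidates and states, it can be wrapped in a small function. The function “share_for()” and the toy data frame below are hypothetical and only sketch the idea.

```r
# Toy data frame with hypothetical values
toy <- data.frame(
  state     = c("CA", "CA", "GA", "GA"),
  candidate = c("A", "B", "A", "B"),
  share     = c(0.6, 0.4, 0.3, 0.7)
)

# Hypothetical helper: vote shares for one candidate within one state
share_for <- function(data, cand, st) {
  # Both conditions are checked row by row on the full data frame
  keep <- data$candidate == cand & data$state == st
  data$share[keep]
}

share_for(toy, "A", "CA")
share_for(toy, "B", "GA")
```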