This document contains code to help you get started with linear regression in R. We will specifically explore the relationship between poverty and vote share for President Obama in US counties in 2012. Let’s first load the data directly into R from my UC Berkeley website.
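library(foreign)
# Read the county-level data straight from the URL (read.csv() is part of base R)
data <- read.csv("https://www.ocf.berkeley.edu/~janastas/data/votes-trunc.csv")
# Attach the data frame so columns such as Poverty and Obama can be used by name
attach(data)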
## The following objects are masked from data (pos = 3):
##
## age65plus, area_name, Black, Density, Edu_batchelors,
## Edu_highschool, Hispanic, Income, NonEnglish, Obama,
## pctdem2016, population_change, population2010, population2014,
## Poverty, Romney, state_abbr, total_votes_2016, votes_dem_2016,
## votes_gop_2016, White
What variables are available in this dataset?
names(data)
## [1] "area_name" "pctdem2016" "votes_dem_2016"
## [4] "votes_gop_2016" "total_votes_2016" "state_abbr"
## [7] "Obama" "Romney" "population2014"
## [10] "population2010" "population_change" "age65plus"
## [13] "White" "Black" "Hispanic"
## [16] "NonEnglish" "Edu_highschool" "Edu_batchelors"
## [19] "Income" "Poverty" "Density"
We are mostly interested in the relationship between poverty rates (Poverty) and vote share for Obama in 2012 (Obama), so we would like to create a scatterplot of these two variables. However, we can also create a matrix of scatterplots to explore the relationships among several variables at once.
Let’s explore the relationships between Obama’s vote share in 2012 and several county-level demographics: Black, Hispanic, Income and Poverty.
pairs(Obama~Black+Hispanic + Income + Poverty,
col="slategray4")
Let’s focus on the relationship between Poverty and Obama’s vote share.
plot(x = Poverty,y = Obama,
xlab = "Poverty Rate by County",
ylab = "% Voting for Obama in 2012",
col="slategray4")
\[ \%\,\text{Democrat} = a + b \times \text{Poverty} \]
To understand the relationship between % Democrat (here, Obama’s 2012 vote share, stored in the variable Obama) and Poverty, we model % Democrat as a linear function of Poverty and use regression to estimate the y-intercept a and the slope b. This is easily done with the “lm()” function.
model1<-lm(Obama~Poverty)
model1
##
## Call:
## lm(formula = Obama ~ Poverty)
##
## Coefficients:
## (Intercept) Poverty
## 0.310157 0.004476
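To make these estimates concrete, the fitted line can be evaluated by hand. The sketch below plugs the printed coefficients into the equation a + b * Poverty. The helper name predict_obama and the poverty rates of 10 and 20 are illustrative assumptions, and the interpretation of the slope assumes Poverty is measured in percentage points.

```r
# Coefficients copied from the lm() output above
a <- 0.310157  # intercept: predicted Obama share when Poverty = 0
b <- 0.004476  # slope: change in predicted Obama share per one-unit rise in Poverty

# Hypothetical helper (not part of the original analysis):
# evaluates the fitted line a + b * Poverty
predict_obama <- function(poverty) {
  a + b * poverty
}

# Predicted vote share at two illustrative poverty rates
predict_obama(c(10, 20))
```

With the fitted model object, predict(model1, newdata = data.frame(Poverty = c(10, 20))) should give the same values up to rounding of the printed coefficients.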
We can then use the “summary()” function to get more information about the estimated model, including coefficient estimates, standard errors, t-statistics and p-values.
summary(model1)
##
## Call:
## lm(formula = Obama ~ Poverty)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41266 -0.10667 -0.01310 0.09514 0.54856
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.310157 0.007188 43.15 <2e-16 ***
## Poverty 0.004476 0.000401 11.16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1452 on 3110 degrees of freedom
## Multiple R-squared: 0.03851, Adjusted R-squared: 0.0382
## F-statistic: 124.6 on 1 and 3110 DF, p-value: < 2.2e-16
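The columns of this table are related by simple arithmetic, which the sketch below reproduces from the printed numbers. The variable names are illustrative, and the R-squared identity used here holds for a regression with a single predictor.

```r
# Values copied from the summary() output above
estimate  <- c(Intercept = 0.310157, Poverty = 0.004476)
std_error <- c(Intercept = 0.007188, Poverty = 0.000401)

# Each t value is the estimate divided by its standard error
t_value <- estimate / std_error
round(t_value, 2)

# F-statistic and residual degrees of freedom from the last line of the output
f_stat   <- 124.6
df_resid <- 3110

# With one predictor, R-squared = F / (F + residual degrees of freedom)
r_squared <- f_stat / (f_stat + df_resid)
round(r_squared, 4)
```

The recovered values agree with the printed t values and Multiple R-squared up to rounding. An R-squared of about 0.039 means that Poverty alone accounts for roughly 4% of the county-level variation in Obama’s vote share, even though the slope is estimated precisely.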
We can use the coefficient estimates from the model to add the fitted regression line to the scatterplot.
plot(x = Poverty,y = Obama,
xlab = "Poverty Rate by County",
ylab = "% Voting for Obama in 2012",
col="slategray4")
abline(a = model1$coefficients[1],
b=model1$coefficients[2],
lty=2,
col="red")
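As a possible alternative to the calls above, the sketch below fits the same kind of model without attach() by passing the data frame through the data argument, and hands the fitted model directly to abline(). The data frame sim and the object model_sim are hypothetical names, and the values are simulated purely for illustration; they are not the county data.

```r
# Simulated stand-in data (an assumption, not the county data set)
set.seed(1)
sim <- data.frame(Poverty = runif(200, min = 5, max = 40))
sim$Obama <- 0.31 + 0.0045 * sim$Poverty + rnorm(200, sd = 0.05)

# Fit the model with data = instead of relying on attach()
model_sim <- lm(Obama ~ Poverty, data = sim)

# coef() returns the named intercept and slope estimates
coef(model_sim)

# Scatterplot of the simulated data
plot(Obama ~ Poverty, data = sim,
     xlab = "Poverty Rate (simulated)",
     ylab = "Vote share (simulated)",
     col = "slategray4")

# abline() accepts the fitted lm object and draws its regression line
abline(model_sim, lty = 2, col = "red")
```

Passing data = keeps variable lookups explicit, which avoids the masking messages that attach() can produce when a data frame is attached more than once.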