Linear Regression in R

This document contains code to help you get started with linear regression in R. We will specifically explore the relationship between poverty and vote share for President Obama in US counties in 2012. Let’s first load the data into R directly from my UC Berkeley website.

library(foreign)
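# Download the county-level CSV file and read it into a data frame
data<-read.csv("https://www.ocf.berkeley.edu/~janastas/data/votes-trunc.csv")
# Put the columns on the search path so they can be used by name (e.g. Poverty)
attach(data)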
## The following objects are masked from data (pos = 3):
## 
##     age65plus, area_name, Black, Density, Edu_batchelors,
##     Edu_highschool, Hispanic, Income, NonEnglish, Obama,
##     pctdem2016, population_change, population2010, population2014,
##     Poverty, Romney, state_abbr, total_votes_2016, votes_dem_2016,
##     votes_gop_2016, White
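
The masking message above typically appears when a data frame with the same column names is already attached, for example when attach(data) was run earlier in the same session. One way to avoid it is to skip attach() and refer to the data frame explicitly. The following is a minimal sketch on a small made-up data frame; the name toy and its values are hypothetical and are not taken from the county data.

```r
# Hypothetical toy data frame standing in for the county data
toy <- data.frame(Poverty = c(10, 15, 20, 25, 30),
                  Obama   = c(0.35, 0.38, 0.36, 0.45, 0.44))

# Option 1: refer to a column explicitly with $
mean_poverty <- mean(toy$Poverty)

# Option 2: evaluate an expression inside the data frame with with()
r_toy <- with(toy, cor(Poverty, Obama))

# Option 3: modelling functions such as lm() accept a data argument
fit_toy <- lm(Obama ~ Poverty, data = toy)
coef(fit_toy)
```

The rest of this document relies on attach(data), so the commands below work as written; the data argument is simply an alternative that keeps the search path clean.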

What variables are available in this dataset?

names(data)
##  [1] "area_name"         "pctdem2016"        "votes_dem_2016"   
##  [4] "votes_gop_2016"    "total_votes_2016"  "state_abbr"       
##  [7] "Obama"             "Romney"            "population2014"   
## [10] "population2010"    "population_change" "age65plus"        
## [13] "White"             "Black"             "Hispanic"         
## [16] "NonEnglish"        "Edu_highschool"    "Edu_batchelors"   
## [19] "Income"            "Poverty"           "Density"

We are mostly interested in the relationship between poverty rates (Poverty) and vote share for Obama in 2012 (Obama), so we will want a scatterplot of these two variables. However, we can also create a matrix of scatterplots to explore the relationships among several variables at once.

Let’s explore the relationships between Obama’s vote share in 2012 and several county-level demographics: Black, Hispanic, Income and Poverty.

pairs(Obama~Black+Hispanic + Income + Poverty, 
      col="slategray4")
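
A scatterplot matrix pairs naturally with a correlation matrix, which puts a number on each panel. The sketch below uses simulated data rather than the county file; the data frame sim, its columns and the coefficients used to generate them are all assumptions made for illustration.

```r
set.seed(1)
n <- 200

# Simulated stand-ins for three of the county variables (not real data)
sim <- data.frame(Poverty = runif(n, 5, 40))
sim$Income <- 80 - 1.2 * sim$Poverty + rnorm(n, sd = 5)
sim$Obama  <- 0.3 + 0.005 * sim$Poverty + rnorm(n, sd = 0.1)

# Pairwise Pearson correlations between all columns
cor_mat <- cor(sim)
round(cor_mat, 2)

# The same formula interface as above, pointed at the simulated data
pairs(Obama ~ Income + Poverty, data = sim, col = "slategray4")
```

With the real data, cor() would need to be restricted to numeric columns, since area_name and state_abbr are text. If any values are missing, the use argument of cor() (for example use = "complete.obs") controls how they are handled.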

Let’s focus on the relationship between Poverty and Obama’s vote share.

plot(x = Poverty,y = Obama,
     xlab = "Poverty Rate by County", 
     ylab = "% Voting for Obama in 2012",
     col="slategray4")

Estimating the y-intercept and slope

\[ \%\,\text{Democrat} = a + b \cdot \text{Poverty} \]

To understand the relationship between % Democrat and Poverty, we model % Democrat (measured here by the Obama variable) as a linear function of Poverty and use regression to estimate the y-intercept a and slope b. This is easily done using the “lm()” function.

model1<-lm(Obama~Poverty)
model1
## 
## Call:
## lm(formula = Obama ~ Poverty)
## 
## Coefficients:
## (Intercept)      Poverty  
##    0.310157     0.004476
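
For a regression with a single predictor, the least-squares estimates have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept is the mean of y minus the slope times the mean of x. The sketch below checks this against lm() on simulated data; x, y and the generating values are assumptions for illustration, not the county data.

```r
set.seed(42)

# Simulated predictor and outcome (not real data)
x <- runif(100, 0, 40)
y <- 0.3 + 0.005 * x + rnorm(100, sd = 0.1)

# Closed-form least-squares estimates for one predictor
b_hat <- cov(x, y) / var(x)          # slope
a_hat <- mean(y) - b_hat * mean(x)   # y-intercept

# The same estimates from lm()
fit <- lm(y ~ x)

c(intercept = a_hat, slope = b_hat)
coef(fit)
```

The two printed vectors should agree up to floating-point rounding.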

We can then use the “summary()” function to get even more information about the model we estimated, including coefficient estimates, standard errors, t-statistics and p-values.

summary(model1)
## 
## Call:
## lm(formula = Obama ~ Poverty)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41266 -0.10667 -0.01310  0.09514  0.54856 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.310157   0.007188   43.15   <2e-16 ***
## Poverty     0.004476   0.000401   11.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1452 on 3110 degrees of freedom
## Multiple R-squared:  0.03851,    Adjusted R-squared:  0.0382 
## F-statistic: 124.6 on 1 and 3110 DF,  p-value: < 2.2e-16
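
Several numbers in this output are tied together by simple arithmetic, which the sketch below reproduces from the printed values. The prediction at a poverty rate of 20 assumes that Poverty is recorded in percentage points and Obama as a proportion between 0 and 1; the magnitudes above suggest this, but the source does not state it.

```r
# Values copied from the summary(model1) output above
a_hat <- 0.310157   # intercept estimate
b_hat <- 0.004476   # slope estimate for Poverty
se_b  <- 0.000401   # standard error of the slope
r2    <- 0.03851    # multiple R-squared
df    <- 3110       # residual degrees of freedom

# t value = estimate / standard error (about 11.16)
t_b <- b_hat / se_b

# With one predictor, the F-statistic is the squared t value (about 124.6)
f_from_t <- t_b^2

# With one predictor, R-squared is the squared correlation (r is about 0.196)
r_xy <- sqrt(r2)

# Approximate 95% confidence interval for the slope
ci_b <- b_hat + c(-1, 1) * qt(0.975, df) * se_b

# Fitted vote share at Poverty = 20 (ASSUMPTION: Poverty is in percentage points)
pred_20 <- a_hat + b_hat * 20

c(t = t_b, F = f_from_t, r = r_xy, pred_20 = pred_20)
ci_b
```

Read this way, the slope says that a one-unit increase in Poverty goes with an increase of about 0.0045 in Obama’s vote share, while the R-squared shows that Poverty alone accounts for under 4% of the variation across counties.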

We can use the coefficient estimates from the model to add the fitted regression line to the scatterplot.

plot(x = Poverty,y = Obama,
     xlab = "Poverty Rate by County", 
     ylab = "% Voting for Obama in 2012",
     col="slategray4")

abline(a = model1$coefficients[1], 
       b=model1$coefficients[2],
       lty=2,
       col="red")
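
As a possible shortcut, abline() also accepts a fitted lm object directly, and predict() returns fitted values at chosen predictor values. The sketch below shows both on simulated data; toy, fit and grid are hypothetical names, and the values are not taken from the county data.

```r
set.seed(7)

# Simulated stand-in for the county data (not real data)
toy <- data.frame(Poverty = runif(150, 5, 40))
toy$Obama <- 0.3 + 0.005 * toy$Poverty + rnorm(150, sd = 0.1)

fit <- lm(Obama ~ Poverty, data = toy)

# Scatterplot with the fitted line taken straight from the model object
plot(Obama ~ Poverty, data = toy, col = "slategray4")
abline(fit, lty = 2, col = "red")

# Fitted values at three chosen poverty rates
grid <- data.frame(Poverty = c(10, 20, 30))
pred <- predict(fit, newdata = grid)
pred
```

Passing the model object draws the same line as supplying the intercept and slope by hand, and it avoids indexing into the coefficient vector.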