April 11, 2017

For Today

  • Interpreting linear regression with examples in R.

  • Significance testing for linear regression.

  • Confidence intervals for linear regression.

Simple linear regression

  • With simple linear regression, we are primarily interested in predicting a dependent variable from a single independent variable.

  • For example, we might be interested in the relationship between Democratic voting and poverty.

  • Are areas with higher poverty rates more likely to vote Democratic?

Simple linear regression model

\[ \%Democrat = a + b*Poverty\]

  • % Democrat is the dependent variable; in this case, it is the share of people who voted for Obama in each US county in 2012.

  • Poverty is the independent variable: the poverty rate in each US county.

Plot the relationship using a scatter plot

plot(x = Poverty,y = Obama,xlab = "Poverty Rate by County", ylab = "% Voting for Obama in 2012")

Fit a line using linear regression

model1<-lm(Obama~Poverty)
model1
## 
## Call:
## lm(formula = Obama ~ Poverty)
## 
## Coefficients:
## (Intercept)      Poverty  
##    0.310157     0.004476
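
The fitted line can be drawn on top of the scatter plot with abline(). The county data are not reproduced in these notes, so the sketch below uses a simulated stand-in data set. The seed, the sample size, the simulated coefficients, and the names poverty_sim, obama_sim, and fit_sim are all assumptions made for illustration, not the real data.

```r
# Simulated stand-in data (NOT the real county data)
set.seed(1)
n <- 500
poverty_sim <- runif(n, min = 0, max = 40)                        # poverty rate, in percent
obama_sim   <- 0.31 + 0.0045 * poverty_sim + rnorm(n, sd = 0.15)  # vote share, as a proportion

# Fit the simple linear regression
fit_sim <- lm(obama_sim ~ poverty_sim)

# Scatter plot with the fitted regression line added on top
plot(x = poverty_sim, y = obama_sim,
     xlab = "Poverty Rate (simulated)", ylab = "Vote share (simulated)")
abline(fit_sim, col = "red", lwd = 2)

# Intercept (a) and slope (b) of the fitted line
coef(fit_sim)
```

Passing the fitted model to abline() uses the estimated intercept and slope directly, so the line on the plot is the same line that lm() reports.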

Use the summary function to get more information

summary(model1)
## 
## Call:
## lm(formula = Obama ~ Poverty)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41266 -0.10667 -0.01310  0.09514  0.54856 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.310157   0.007188   43.15   <2e-16 ***
## Poverty     0.004476   0.000401   11.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1452 on 3110 degrees of freedom
## Multiple R-squared:  0.03851,    Adjusted R-squared:  0.0382 
## F-statistic: 124.6 on 1 and 3110 DF,  p-value: < 2.2e-16

Breakdown of output

            Estimate  Std. Error  t-value  \(Pr(>|t|)\)
(Intercept) 0.310157  0.007188    43.15    <2e-16 ***
Poverty     0.004476  0.000401    11.16    <2e-16 ***

  • Intercept (y-intercept): \(= a\) .

  • Poverty (slope): \(= b\) .

  • Std. Error: standard error of each coefficient estimate.

  • t-value: test statistic for the null hypothesis that the coefficient is zero (Estimate divided by Std. Error).

  • \(Pr(>|t|)\): P-value.
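
Each column of the coefficient table can also be pulled out of the model object and checked by hand. The sketch below assumes a small simulated data set (the names x, y, and fit are placeholders, not the county data) and uses coef(summary(fit)), which returns the coefficient table as a matrix.

```r
# Simulated stand-in data (an assumption for illustration)
set.seed(2)
x <- runif(200, min = 0, max = 40)
y <- 0.3 + 0.005 * x + rnorm(200, sd = 0.15)
fit <- lm(y ~ x)

# Coefficient table as a matrix: one row per coefficient
coef_table <- coef(summary(fit))
coef_table

# Pull out the pieces for the slope
b    <- coef_table["x", "Estimate"]
se_b <- coef_table["x", "Std. Error"]
t_b  <- coef_table["x", "t value"]
p_b  <- coef_table["x", "Pr(>|t|)"]

# The t value is the estimate divided by its standard error
all.equal(t_b, b / se_b)

# The p-value is the two-sided tail area of a t distribution
all.equal(p_b, 2 * pt(abs(t_b), df = df.residual(fit), lower.tail = FALSE))
```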

Intercept and Slope

\[ \%Democrat = 0.31 + 0.004*Poverty\]

  • These are slope and y-intercept estimates of the line that is fit through the scatterplot above.

  • Why do we call these estimates?

Parameters and statistics in linear regression

  • \(a\) and \(b\) are statistics: they are computed from the data we happen to have.

  • \(\alpha\) and \(\beta\) are the parameters that we are estimating with \(a\) and \(b\).

  • In other words, \(\alpha\) and \(\beta\) are the "true" population values of the y-intercept and slope, which we could only know if we had all of the data.

  • What does this mean in practice in simple linear regression?
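
One way to see the difference between a parameter and a statistic is to simulate many samples from a population whose parameters we choose ourselves. In the sketch below, the "true" values alpha and beta, the noise level, and the helper one_sample_slope() are assumptions made up for the illustration.

```r
set.seed(3)

# Assumed "true" population parameters (chosen for the simulation)
alpha <- 0.31
beta  <- 0.0045

# Draw one sample of size n, fit the regression, and return the slope estimate b
one_sample_slope <- function(n = 100) {
  x <- runif(n, min = 0, max = 40)
  y <- alpha + beta * x + rnorm(n, sd = 0.15)
  coef(lm(y ~ x))[["x"]]
}

# Repeat the sampling 1000 times: each sample gives a different b
slopes <- replicate(1000, one_sample_slope())

mean(slopes)    # centered near the true beta
sd(slopes)      # sample-to-sample variability of b
range(slopes)   # individual estimates can be well above or below beta
```

Every sample produces a different \(b\) even though \(\beta\) never changes, which is why \(b\) is called an estimate.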

Relationship between the dependent and independent variable

  • The relationship between the dependent and the independent variable is summarized entirely by the estimate of the slope, \(b\).

  • But remember that \(b\) is only an estimate of the "true" slope parameter \(\beta\).

  • So even if \(b\) is not zero, \(\beta\) could possibly be zero.

  • If \(\beta\) is zero, the independent variable is not linearly related to the dependent variable.

Significance testing and confidence intervals for \(\beta\)

\[ \begin{aligned} H_{0}:& \beta = 0 \\ H_{a}:& \beta \neq 0 \end{aligned} \]

  • Because we want to know whether the true, population parameter \(\beta\) is equal to zero, we have to conduct a significance test.

Significance test for \(\beta\)

Step 1: Specify the null and alternative hypotheses.

\[ \begin{aligned} H_{0}:& \beta = 0 \\ H_{a}:& \beta \neq 0 \end{aligned} \]

Step 2: Calculate the test statistic.

\[ t_{df} = \frac{Observed-Expected}{SE(b)} = \frac{b - 0}{SE(b)} = \frac{b}{SE(b)} \]

Step 3: Find the two-sided p-value.

\[ \text{p-value} = 2*P(T >|t_{df}|) \]

Step 4: Make a decision (reject/do not reject \(H_{0}\))

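\[ \begin{aligned} \text{p-value} & < \alpha = 0.05 \implies \text{reject } H_{0} \\ \text{p-value} & \geq \alpha = 0.05 \implies \text{do not reject } H_{0} \end{aligned} \]

  • Here \(\alpha\) is the significance level of the test, not the intercept parameter from the regression model.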
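
The four steps can be collected into a small helper. This is a minimal sketch: the function name slope_t_test(), its argument names, and the numbers in the example call are made up for illustration and do not come from the county data.

```r
# Two-sided t test of H0: beta = 0, given a slope estimate and its standard error
slope_t_test <- function(b, se_b, df, sig_level = 0.05) {
  # Step 2: test statistic (observed minus expected under H0, over the SE)
  t_stat <- (b - 0) / se_b

  # Step 3: two-sided p-value from the t distribution with df degrees of freedom
  p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

  # Step 4: compare the p-value with the significance level
  decision <- if (p_value < sig_level) "reject H0" else "do not reject H0"

  list(t = t_stat, p_value = p_value, decision = decision)
}

# Made-up example: a slope of 0.5 with a standard error of 0.4 and 30 df
example <- slope_t_test(b = 0.5, se_b = 0.4, df = 30)
example
```

In this made-up example the slope is only 1.25 standard errors away from zero, so the test does not reject \(H_{0}\).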

Poverty and Democratic Vote Share in 2012

Model: \[ \% Democrat = 0.310 + 0.004*Poverty \]

  • We are interested in figuring out whether the true relationship/slope (\(\beta\)) between Poverty and % Democrat is zero.

  • A slope of zero would mean that there is no linear relationship, or correlation, between the poverty rate and Democratic voting.

Test for relationship between poverty and democratic vote share: Step 1

Specify the null and alternative hypotheses.

\[ \begin{aligned} H_{0}:& \beta_{poverty} = 0 \\ H_{a}:& \beta_{poverty} \neq 0 \end{aligned} \]

Test for relationship between poverty and democratic vote share: Step 2

            Estimate  Std. Error  t-value  \(Pr(>|t|)\)
(Intercept) 0.310157  0.007188    43.15    <2e-16 ***
Poverty     0.004476  0.000401    11.16    <2e-16 ***

Calculate the test statistic, using the unrounded values from the table.

\[ t_{df} = \frac{b}{SE(b)} = \frac{0.004476}{0.000401} = 11.16 \]

  • \(b = 0.004476\), \(SE(b) = 0.000401\), \(df = 3110\)
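
The same calculation in R, using the estimate and standard error reported in the summary output. The variable names are placeholders chosen for this sketch.

```r
# Values copied from the summary(model1) output
b    <- 0.004476
se_b <- 0.000401

# Test statistic for H0: beta = 0
t_stat <- b / se_b
round(t_stat, 2)            # 11.16, matching the "t value" column

# Rounding the inputs first gives a noticeably different answer
round(0.004 / 0.0004, 2)    # 10
```

Because the standard error is so small, rounding it before dividing shifts the result by more than a full unit, so it is safer to carry the unrounded values through the calculation.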

Test for relationship between poverty and democratic vote share: Step 3

            Estimate  Std. Error  t-value  \(Pr(>|t|)\)
(Intercept) 0.310157  0.007188    43.15    <2e-16 ***
Poverty     0.004476  0.000401    11.16    <2e-16 ***

Find the two-sided p-value.

\[ \text{p-value} = 2*P(T >|11.16|) < 2 \times 10^{-16} \approx 0 \]
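
The p-value can be computed directly with pt(), the cumulative distribution function of the t distribution. The sketch below plugs in the t value and degrees of freedom from the summary output; the variable names are placeholders.

```r
# Test statistic and degrees of freedom from the summary output
t_stat <- 11.16
df     <- 3110

# Two-sided p-value: twice the upper-tail area beyond |t|
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
p_value

# Consistent with the "<2e-16" shown in the coefficient table
p_value < 2e-16
```

Using lower.tail = FALSE asks for the upper tail directly. Computing 1 - pt(...) instead would round to exactly zero for a tail area this small.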

Test for relationship between poverty and democratic vote share: Step 4

            Estimate  Std. Error  t-value  \(Pr(>|t|)\)
(Intercept) 0.310157  0.007188    43.15    <2e-16 ***
Poverty     0.004476  0.000401    11.16    <2e-16 ***

Make a decision (reject/do not reject \(H_{0}\))

\[ \begin{aligned} \text{p-value} & < \alpha = 0.05 \implies \text{reject } H_{0} \\ \text{p-value} & \geq \alpha = 0.05 \implies \text{do not reject } H_{0} \end{aligned} \]

Conclusions

\[ \% Democrat = 0.310 + 0.004*Poverty \]

  • Here we clearly reject \(H_{0}\) in favor of the alternative.

  • This means that there is evidence that Poverty is indeed positively correlated with % Democrat.

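  • In fact, if we go back to the original model, we find that each 1 unit increase in the poverty rate is associated with an average increase of about 0.004 in Democratic vote share. Vote share is recorded here as a proportion (the intercept of 0.310 corresponds to 31%), so this is about 0.4 percentage points.

  • So, for example, a county with a poverty rate that is 10 units higher is predicted to have an average Democratic vote share that is about 4 percentage points higher.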
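
The fitted equation can be used to compare predicted vote shares at different poverty rates. This sketch uses the coefficients from the summary output; the helper predict_share() and the poverty rates of 10 and 20 are hypothetical choices for illustration, and it assumes vote share is on a 0 to 1 scale, as the intercept of 0.31 suggests.

```r
# Coefficients from the summary(model1) output
a <- 0.310157   # intercept
b <- 0.004476   # slope on Poverty

# Predicted Democratic vote share (as a proportion) for a given poverty rate
predict_share <- function(poverty) a + b * poverty

share_10 <- predict_share(10)
share_20 <- predict_share(20)
c(share_10, share_20)

# Difference for a 10 unit gap in poverty, expressed in percentage points
100 * (share_20 - share_10)
```

The difference depends only on the slope: a 10 unit gap in poverty always corresponds to 10 times \(b\), whatever the starting poverty rate.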

Confidence intervals for the slope

  • Recall that \(b\) is just an estimate of \(\beta\).

  • We might be interested in figuring out the range of plausible values of \(\beta\), given our estimate \(b\).

  • In experiments, this gives us a plausible range for the effect size.

  • In an observational study like this one, it gives us a sense of the range of values that the relationship between the independent and dependent variable can take on.

Confidence intervals for the slope

\[ CI \text{ for } \beta: Estimate \pm MOE \implies b \pm t_{c,df}*SE(b) \]

  • We estimate confidence intervals for the slope in the same way that we estimate them for a mean.
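
The formula can be checked against R's built-in confint() function. The sketch below assumes a simulated data set (x, y, and fit are placeholder names, not the county data) and builds the interval by hand from the estimate, its standard error, and the critical t value.

```r
# Simulated stand-in data (an assumption for illustration)
set.seed(4)
x <- runif(300, min = 0, max = 40)
y <- 0.3 + 0.005 * x + rnorm(300, sd = 0.15)
fit <- lm(y ~ x)

# Estimate and standard error of the slope
est <- coef(summary(fit))["x", c("Estimate", "Std. Error")]

# Critical value: a 95% interval leaves 2.5% in each tail
t_crit <- qt(0.975, df = df.residual(fit))

# Estimate plus or minus the margin of error
manual_ci <- est[["Estimate"]] + c(-1, 1) * t_crit * est[["Std. Error"]]
manual_ci

# Built-in equivalent
confint(fit, "x", level = 0.95)
```

The two results agree because confint() applies the same estimate plus or minus critical value times standard error calculation to a fitted lm object.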

95% Confidence interval for slope on Poverty

            Estimate  Std. Error  t-value  \(Pr(>|t|)\)
(Intercept) 0.310157  0.007188    43.15    <2e-16 ***
Poverty     0.004476  0.000401    11.16    <2e-16 ***

\[ 95\% CI \text{ for } \beta: Estimate \pm MOE \implies b \pm t_{c,df}*SE(b) \]

  • \(t_{c,df}=t_{0.95,3110} = 1.96\)

  • \(SE(b) = 0.0004\)
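
The critical value comes from the t distribution with 3110 degrees of freedom and can be looked up with qt(). A 95% interval leaves 2.5% in each tail, so the quantile requested is 0.975. The comparison with 10 degrees of freedom is an extra illustration and is not part of the county analysis.

```r
# Critical value for a 95% confidence interval with 3110 degrees of freedom
t_crit <- qt(0.975, df = 3110)
round(t_crit, 2)   # 1.96

# With this many degrees of freedom, t is nearly identical to the normal distribution
qnorm(0.975)

# With few degrees of freedom the critical value is noticeably larger
qt(0.975, df = 10)
```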

95% Confidence interval for slope on Poverty

            Estimate  Std. Error  t-value  \(Pr(>|t|)\)
(Intercept) 0.310157  0.007188    43.15    <2e-16 ***
Poverty     0.004476  0.000401    11.16    <2e-16 ***

\[ 95\% CI \text{ for } \beta: 0.004 \pm 1.96*0.0004 = (0.003,0.005) \]

  • \(t_{c,df}=t_{0.95,3110} = 1.96\)

  • \(SE(b) = 0.0004\)
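
The interval can be reproduced in R from the values in the summary output. This sketch keeps the unrounded estimate and standard error, so the endpoints carry one more decimal place than the hand calculation; the variable names are placeholders.

```r
# Values copied from the summary(model1) output
b    <- 0.004476
se_b <- 0.000401
df   <- 3110

# Critical value and margin of error for a 95% interval
t_crit <- qt(0.975, df = df)
moe    <- t_crit * se_b

# Lower and upper limits of the interval
ci <- b + c(-1, 1) * moe
round(ci, 4)   # 0.0037 0.0053
```

The interval does not contain zero, which agrees with the decision to reject \(H_{0}\) at the 0.05 level.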

Interpretation

\[ 95\% CI \text{ for } \beta: 0.004 \pm 1.96*0.0004 = (0.003,0.005) \]

  • We are 95% confident that the true slope \(\beta\) is between 0.003 and 0.005.

  • What this means substantively is that for a 1 unit increase in the poverty rate (Poverty), we would expect to find an increase in county-level Democratic vote share of between 0.003 and 0.005 on average, that is, between 0.3 and 0.5 percentage points when vote share is measured as a proportion.