Interpreting linear regression with examples in R.
Significance testing for linear regression.
Confidence intervals for linear regression.
April 11, 2017
With simple linear regression, we are primarily interested in predicting a dependent variable from an independent variable.
For example, we might be interested in the relationship between Democratic voting and poverty.
Are areas with higher poverty rates more likely to vote Democrat?
\[ \%Democrat = a + b*Poverty\]
% Democrat is the dependent variable; in this case, it is the percentage of people who voted for Obama in each US county in 2012.
Poverty is the poverty rate in each US county.
plot(x = Poverty,y = Obama,xlab = "Poverty Rate by County", ylab = "% Voting for Obama in 2012")
model1 <- lm(Obama ~ Poverty)
model1
## 
## Call:
## lm(formula = Obama ~ Poverty)
## 
## Coefficients:
## (Intercept)      Poverty  
##    0.310157     0.004476
summary(model1)
## 
## Call:
## lm(formula = Obama ~ Poverty)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41266 -0.10667 -0.01310  0.09514  0.54856 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.310157   0.007188   43.15   <2e-16 ***
## Poverty     0.004476   0.000401   11.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1452 on 3110 degrees of freedom
## Multiple R-squared:  0.03851, Adjusted R-squared:  0.0382 
## F-statistic: 124.6 on 1 and 3110 DF,  p-value: < 2.2e-16
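A natural next step is to overlay the fitted line on the scatterplot with abline(). Since the county data set itself isn't included here, this sketch simulates stand-in Poverty and Obama vectors of roughly the same shape; the variable and model names match those used above, but the simulated numbers will not reproduce the output exactly.

```r
# Simulated stand-in for the county data (the real data set isn't shown here)
set.seed(1)
Poverty <- runif(3112, 2, 40)                       # poverty rate by county
Obama   <- 0.31 + 0.0045 * Poverty + rnorm(3112, sd = 0.145)

model1 <- lm(Obama ~ Poverty)
plot(x = Poverty, y = Obama,
     xlab = "Poverty Rate by County",
     ylab = "% Voting for Obama in 2012")
abline(model1, col = "red", lwd = 2)                # draw the fitted line
```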
 | Estimate | Std. Error | t-value | \(Pr(>|t|)\) |
---|---|---|---|---|
(Intercept) | 0.310157 | 0.007188 | 43.15 | <2e-16*** |
Poverty | 0.004476 | 0.000401 | 11.16 | <2e-16*** |
Intercept (y-intercept): \(= a\).
Poverty (slope): \(= b\).
Std. Error: standard error of each coefficient estimate.
t-value: test statistic for each coefficient.
\(Pr(>|t|)\): p-value for each coefficient.
\[ \%Democrat = 0.31 + 0.004*Poverty\]
These are slope and y-intercept estimates of the line that is fit through the scatterplot above.
Why do we call these estimates?
\(a\) and \(b\) are statistics, computed from our sample.
\(\alpha\) and \(\beta\) are the parameters that we are estimating with \(a\) and \(b\).
In other words, \(\alpha\) and \(\beta\) are the ``true'' population values of the y-intercept and slope, which we could only compute if we had data for the entire population.
What does this mean in practice in simple linear regression?
The relationship between the dependent and the independent variable is determined entirely by the estimate of the slope, \(b\).
But remember b is only an estimate of the ``true'' slope parameter \(\beta\).
So even if \(b\) is not zero, \(\beta\) could still be zero.
If \(\beta\) were zero, the independent variable would not be related to the dependent variable.
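One way to see this sampling variability is a quick simulation: draw many samples from a population whose true slope \(\beta\) is exactly zero, and watch the estimated slope \(b\) bounce around zero without ever being exactly zero. (Illustrative values only; this is not the county data.)

```r
# Simulate 1000 samples from a population where the true slope beta = 0
set.seed(42)
b_hat <- replicate(1000, {
  x <- runif(100, 2, 40)                       # hypothetical poverty rates
  y <- 0.31 + 0 * x + rnorm(100, sd = 0.145)   # true slope beta is 0
  coef(lm(y ~ x))["x"]                         # fitted slope b for this sample
})
mean(b_hat)    # close to 0 on average
range(b_hat)   # but each individual estimate b misses 0
```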
\[ \begin{aligned} H_{0}:& \beta = 0 \\ H_{a}:& \beta \neq 0 \end{aligned} \]
Step 1: Specify the null and alternative hypotheses. \[ \begin{aligned} H_{0}:& \beta = 0 \\ H_{a}:& \beta \neq 0 \end{aligned} \] Step 2: Calculate the test statistic. \[ t_{df} = \frac{Observed-Expected}{SE(b)} = \frac{b - 0}{SE(b)} = \frac{b}{SE(b)} \]
Step 3: Find the two-sided p-value.
\[ \text{p-value} = 2 \cdot P(T >|t_{df}|) \]
Step 4: Make a decision (reject/do not reject \(H_{0}\))
\[ \begin{aligned} \text{p-value} & < \alpha = 0.05 \implies \text{reject } H_{0} \\ \text{p-value} & \geq \alpha = 0.05 \implies \text{do not reject } H_{0} \\ \end{aligned} \]
(Here \(\alpha\) denotes the significance level, not the intercept parameter from before.)
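The four steps can also be carried out by hand in R from the pieces of summary(model1). This sketch refits the model on simulated stand-in data (the county data set isn't included here), so the numbers will differ from the output above:

```r
# Simulated stand-in for the county data
set.seed(1)
Poverty <- runif(3112, 2, 40)
Obama   <- 0.31 + 0.0045 * Poverty + rnorm(3112, sd = 0.145)
model1  <- lm(Obama ~ Poverty)

b      <- coef(summary(model1))["Poverty", "Estimate"]
se     <- coef(summary(model1))["Poverty", "Std. Error"]
t_stat <- b / se                                       # Step 2: test statistic
df     <- df.residual(model1)                          # n - 2 degrees of freedom
p      <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)  # Step 3: two-sided p-value
p < 0.05                                               # Step 4: TRUE -> reject H0
```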
Model: \[ \% Democrat = 0.310 + 0.004*Poverty \]
We are interested in figuring out whether the true relationship/slope (\(\beta\)) between Poverty and % Democrat is zero.
This would imply that there is no relationship, or correlation, between the poverty rate and Democratic voting.
Specify the null and alternative hypotheses \[ \begin{aligned} H_{0}:& \beta_{poverty} = 0 \\ H_{a}:& \beta_{poverty} \neq 0 \end{aligned} \]
Calculate the test statistic (using the unrounded estimates from the table). \[ t_{df} = \frac{b}{SE(b)} = \frac{0.004476}{0.000401} \approx 11.16 \]
Find the two-sided p-value.
\[ \text{p-value} = 2 \cdot P(T > 11.16) \approx 0 \]
Make a decision (reject/do not reject \(H_{0}\))
\[ \begin{aligned} \text{p-value} & < \alpha = 0.05 \implies \text{reject } H_{0} \\ \text{p-value} & \geq \alpha = 0.05 \implies \text{do not reject } H_{0} \\ \end{aligned} \]
\[ \% Democrat = 0.310 + 0.004*Poverty \]
Here we clearly reject \(H_{0}\) in favor of the alternative.
This means that there is evidence that Poverty is indeed positively correlated with % Democrat.
In fact, if we go back to the original model, we find that for each 1-point increase in the poverty rate, the predicted Democratic vote share increases by 0.004, i.e. about 0.4 percentage points.
So, for example, a county with a poverty rate that is 10 points higher will have an average Democratic vote share that is roughly 4 percentage points higher.
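A quick arithmetic check of that interpretation, using the slope reported in the output above (the 4-point figure comes from the rounded slope 0.004; the unrounded slope gives about 4.5 percentage points):

```r
b     <- 0.004476   # unrounded slope from summary(model1) above
delta <- 10 * b     # predicted change for a 10-point higher poverty rate
delta               # 0.04476, i.e. about 4.5 percentage points
```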
Recall that \(b\) is just an estimate of \(\beta\).
We might be interested in figuring out what the range in values of \(\beta\) is given \(b\).
In experiments, this will give us a range of the effect size.
In an observational study like this one, it gives us a sense of the range of values that the relationship between the independent and dependent value can take on.
\[ 95\% CI \text{ for } \beta: Estimate \pm MOE \implies b \pm t_{c,df}*SE(b) \]
\(t_{c,df}=t_{0.95,3110} = 1.96\)
\(SE(b) = 0.0004\)
\[ 95\% CI \text{ for } \beta: 0.004 \pm 1.96*0.0004 = (0.003,0.005) \]
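The same interval can be computed by hand with qt(), or directly with confint(). This sketch again refits the model on simulated stand-in data, since the county data set isn't included, so the exact endpoints will differ from the (0.003, 0.005) above:

```r
# Simulated stand-in for the county data
set.seed(1)
Poverty <- runif(3112, 2, 40)
Obama   <- 0.31 + 0.0045 * Poverty + rnorm(3112, sd = 0.145)
model1  <- lm(Obama ~ Poverty)

b  <- coef(summary(model1))["Poverty", "Estimate"]
se <- coef(summary(model1))["Poverty", "Std. Error"]
tc <- qt(0.975, df.residual(model1))      # critical value, about 1.96
c(b - tc * se, b + tc * se)               # 95% CI by hand
confint(model1, "Poverty", level = 0.95)  # built-in equivalent
```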
We are 95% confident that the true slope \(\beta\) is between 0.003 and 0.005.
Substantively, this means that for a 1-point increase in the poverty rate (Poverty) we would expect, on average, between a 0.3 and a 0.5 percentage-point increase in county-level Democratic voting.