March 30, 2017

For Today

  • Linear Relationships
  • Least Squares Prediction Equation
  • Linear Regression Model

Relationships between variables

  • What is the relationship between gun ownership and crime?

  • What is the relationship between smoking and cancer?

  • What is the relationship between ethnic and racial diversity and trust?

Putnam's Diversity and Trust Study

Linear relationships

  • Linear relationships are relationships between two or more variables that take a specific functional form.

  • \(y =\) response variable – levels of trust

  • \(x =\) explanatory variable – measure of diversity

Linear functions

\[ y = \alpha + \beta x \]

  • y-intercept \(\alpha\)

  • slope \(\beta\)
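To see what the intercept and slope do, we can evaluate a linear function in R. The values of \(\alpha\) and \(\beta\) below are chosen purely for illustration:

```r
# A linear function y = alpha + beta * x (illustrative values)
alpha <- 2    # y-intercept: the value of y when x = 0
beta  <- 0.5  # slope: the change in y for a one-unit increase in x

x <- 0:4
y <- alpha + beta * x
y
## [1] 2.0 2.5 3.0 3.5 4.0
```

Each one-unit step in \(x\) raises \(y\) by exactly \(\beta = 0.5\), and at \(x = 0\) we get \(y = \alpha = 2\).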

Linear functions

  • Positive relationships

  • Negative relationships

Linear functions and models

  • A linear function provides a model for the relationship between two variables.

  • Given data on any two variables, we can estimate a linear function by choosing \(\alpha\) and \(\beta\) so that the line fits a scatterplot of the data.

Scatterplots and linear relationships

trumptweets <- read.csv("https://www.ocf.berkeley.edu/~janastas/trump-tweet-data.csv")
attach(trumptweets)
trumptweets[1:5,1]
## [1] I have not heard any of the pundits or commentators discussing the fact that I spent FAR LESS MONEY on the win than Hillary on the loss!    
## [2] I would have done even better in the election, if that is possible, if the winner was based on popular vote - but would campaign differently
## [3] Campaigning to win the Electoral College is much more difficult & sophisticated than the popular vote. Hillary focused on the wrong states! 
## [4] Yes, it is true - Carlos Slim, the great businessman from Mexico, called me about getting together for a meeting. We met, HE IS A GREAT GUY!
## [5] especially how to get people, even with an unlimited budget, out to vote in the vital swing states ( and more). They focused on wrong states
## 31058 Levels:  ...

Scatterplots and linear relationships

plot(Retweets,Favorites,xlab = "Trump Retweets",ylab = "Trump Favorites")

Least Squares Prediction Equation

  • How do we fit a model to this data if we are interested in using retweets to explain favorites?

  • Model: \[y = \alpha + \beta x\]

  • Prediction Equations: \[\hat{y} = a + bx\]

Least Squares Prediction Equation

\[ b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum(x-\bar{x})^2} \]

\[ a = \bar{y} - b\bar{x} \]
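These formulas can be verified by hand in R. The tiny dataset below is made up for illustration (it is not the tweet data); the hand-computed \(a\) and \(b\) match what lm() returns:

```r
# Made-up data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# Slope: sum of cross-deviations over sum of squared x-deviations
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: a = ybar - b * xbar
a <- mean(y) - b * mean(x)

c(a = a, b = b)
##    a    b 
## 0.14 1.96

# lm() gives the same estimates
coef(lm(y ~ x))
```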

Example: Predicting Favorites from Retweets

model.1<-lm(Favorites~Retweets)
summary(model.1)
## 
## Call:
## lm(formula = Favorites ~ Retweets)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -253188    -445    -274    -251  118566 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.587e+02  2.649e+01   9.767   <2e-16 ***
## Retweets    2.316e+00  5.513e-03 420.209   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4515 on 31173 degrees of freedom
## Multiple R-squared:  0.8499, Adjusted R-squared:  0.8499 
## F-statistic: 1.766e+05 on 1 and 31173 DF,  p-value: < 2.2e-16

Example: Predicting Favorites from Retweets

  • \(a = 258.7\)

  • \(b = 2.316\)

Example: Predicting Favorites from Retweets

plot(Retweets,Favorites,xlab = "Trump Retweets",ylab = "Trump Favorites")
abline(a = 258.7,b = 2.316 )

How do outliers affect prediction?

  • Outliers can often do serious damage to our ability to predict the data accurately.

  • Let's re-estimate the model without some outliers and replot the data.

Trump tweet model w/o some outliers

model.2<-lm(Favorites[Retweets<50000]~Retweets[Retweets<50000])
summary(model.2)
## 
## Call:
## lm(formula = Favorites[Retweets < 50000] ~ Retweets[Retweets < 
##     50000])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -103751    -147     164     202   90100 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -2.024e+02  1.928e+01   -10.5   <2e-16 ***
## Retweets[Retweets < 50000]  2.728e+00  5.403e-03   504.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3209 on 31150 degrees of freedom
## Multiple R-squared:  0.8911, Adjusted R-squared:  0.8911 
## F-statistic: 2.549e+05 on 1 and 31150 DF,  p-value: < 2.2e-16

Trump tweet model w/o some outliers

plot(Retweets[Retweets<50000],Favorites[Retweets<50000],
     xlab = "Trump Retweets",ylab = "Trump Favorites")
abline(a = -202.4,b = 2.728 )

Two models

Model 1:
\[ \hat{Favorites} = 258.7 + 2.316 * Retweets\]

Model 2: \[ \hat{Favorites} = -202.4 + 2.728 * Retweets\]

Both can be used to predict the number of Favorites given the number of Retweets

How many favorites should Trump's tweet get if it receives 1000 retweets (Model 1)?

\[ \hat{Favorites} = 258.7 + 2.316 * Retweets\] \[ \hat{Favorites} = 258.7 + 2.316 * 1000 = 2574.7 \approx 2575 \]

Both can be used to predict the number of Favorites given the number of Retweets

How many favorites should Trump's tweet get if it receives 1000 retweets (Model 2)?

\[ \hat{Favorites} = -202.4 + 2.728 * Retweets\]

\[ \hat{Favorites} = -202.4 + 2.728 * 1000 = 2525.6\]
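We can check both predictions by plugging 1000 retweets into each fitted equation in R:

```r
# Predicted favorites for a tweet with 1000 retweets
model1.pred <- 258.7 + 2.316 * 1000
model2.pred <- -202.4 + 2.728 * 1000
c(model1.pred, model2.pred)
## [1] 2574.7 2525.6
```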

Which model does a better job at predicting the number of favorites?

  • To understand this, we have to understand the concept of residuals or prediction error.

\[ Residual_{i} = y_{i} - \hat{y}_{i}\]

  • Residuals measure how far the predicted value is from the actual value.

Predictive ability of models depends on the size of the residuals

  • The equation \[ Residual_{i} = y_{i} - \hat{y}_{i}\] measures the residual for only one observation.

  • To understand how good the model is we want to measure the residuals from all observations in the data.

Sum of squared errors

\[ SSE = \sum_{i = 1}^N (y_{i} - \hat{y}_{i})^2\]

  • The model with the lower SSE is better at predicting the response variable, on average.

Residuals for the Trump tweet models: Model 1

# Model 1
Residuals.Model.1 = Favorites - predict(model.1)
plot(Favorites, Residuals.Model.1)

Residuals for the Trump tweet models: Model 2

# Model 2
Residuals.Model.2 = Favorites[Retweets<50000] - predict(model.2)
plot(Favorites[Retweets<50000], Residuals.Model.2)

SSE for models 1 and 2

Model 1

SSE.Model.1 = sum(Residuals.Model.1^2)
SSE.Model.1
## [1] 635433919622

Model 2

SSE.Model.2 = sum(Residuals.Model.2^2)
SSE.Model.2
## [1] 320803091053

Root Mean Squared Error

  • Clearly the residuals for Model 2 are smaller, but it is more useful to present the error in a standardized way.

  • The root mean squared error (RMSE) is what R reports in the summary output as the residual standard error.

\[ RMSE = \sqrt{ \frac{SSE}{N-2}} \]

RMSE for models 1 and 2

Model 1

N = length(Favorites)
MSE.Model.1 = sqrt(SSE.Model.1/(N-2))
MSE.Model.1
## [1] 4514.877

Model 2

N = length(Favorites[Retweets<50000])
MSE.Model.2 = sqrt(SSE.Model.2/(N-2))
MSE.Model.2
## [1] 3209.152