April 4, 2017

For Today

  • Linear Relationships
  • Least Squares Prediction Equation
  • Linear Regression Model

Relationships between variables

  • What is the relationship between gun ownership and crime?

  • What is the relationship between smoking and cancer?

  • What is the relationship between ethnic and racial diversity and trust?

Putnam's Diversity and Trust Study

Linear relationships

  • Linear relationships are relationships between two or more variables that take a particular functional form.

  • \(y =\) response variable – Levels of trust.

  • \(x =\) explanatory variable – measure of diversity

Linear functions

\[ y = \alpha + \beta x \]

  • y-intercept \(\alpha\)

  • slope \(\beta\)
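As a quick illustration, here is a linear function evaluated in R; the values of \(\alpha\) and \(\beta\) are made up for this example:

```r
alpha <- 2               # hypothetical y-intercept
beta  <- 0.5             # hypothetical slope
x <- c(0, 1, 2, 10)
y <- alpha + beta * x    # the linear function y = alpha + beta*x
y
## [1] 2.0 2.5 3.0 7.0
```

At \(x = 0\), \(y\) equals the intercept; each one-unit increase in \(x\) raises \(y\) by the slope.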

Linear functions

  • Positive relationships

  • Negative relationships

Linear functions and models

  • A linear function provides a model for the relationship between two variables.

  • Given any two variables, we can estimate a linear function by choosing \(\alpha\) and \(\beta\) so that the line fits the scatterplot as closely as possible.
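A minimal simulated sketch of this idea (the "true" \(\alpha = 1\) and \(\beta = 2\) here are invented for the illustration): generate points scattered around a line, then let `lm()` recover the coefficients from the scatter.

```r
set.seed(1)                           # reproducible noise
x <- runif(100, 0, 10)                # explanatory variable
y <- 1 + 2 * x + rnorm(100, sd = 1)   # true line plus random scatter
coef(lm(y ~ x))                       # estimates should land near (1, 2)
```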

Scatterplots and linear relationships

trumptweets <- read.csv("https://www.ocf.berkeley.edu/~janastas/trump-tweet-data.csv")
attach(trumptweets)
trumptweets[1:5,1]
## [1] I have not heard any of the pundits or commentators discussing the fact that I spent FAR LESS MONEY on the win than Hillary on the loss!    
## [2] I would have done even better in the election, if that is possible, if the winner was based on popular vote - but would campaign differently
## [3] Campaigning to win the Electoral College is much more difficult & sophisticated than the popular vote. Hillary focused on the wrong states! 
## [4] Yes, it is true - Carlos Slim, the great businessman from Mexico, called me about getting together for a meeting. We met, HE IS A GREAT GUY!
## [5] especially how to get people, even with an unlimited budget, out to vote in the vital swing states ( and more). They focused on wrong states
## 31058 Levels:  ...

Scatterplots and linear relationships

plot(Retweets,Favorites,xlab = "Trump Retweets",ylab = "Trump Favorites")

Least Squares Prediction Equation

  • How do we fit a model to this data if we are interested in using retweets to explain favorites?

  • Model: \[y = \alpha + \beta x\]

  • Prediction Equations: \[\hat{y} = a + bx\]

Least Squares Prediction Equation

\[ b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum(x-\bar{x})^2} \]

\[ a = \bar{y} - b\bar{x} \]
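On a small made-up data set, the two formulas can be computed directly and checked against R's built-in `lm()`:

```r
# Tiny invented data set, just to illustrate the formulas
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
a <- mean(y) - b * mean(x)                                      # intercept
c(a, b)   # a = 1.8, b = 0.8

coef(lm(y ~ x))   # lm() returns the same least squares estimates
```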

Example: Predicting Favorites from Retweets

model.1<-lm(Favorites~Retweets)
summary(model.1)
## 
## Call:
## lm(formula = Favorites ~ Retweets)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -253188    -445    -274    -251  118566 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.587e+02  2.649e+01   9.767   <2e-16 ***
## Retweets    2.316e+00  5.513e-03 420.209   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4515 on 31173 degrees of freedom
## Multiple R-squared:  0.8499, Adjusted R-squared:  0.8499 
## F-statistic: 1.766e+05 on 1 and 31173 DF,  p-value: < 2.2e-16

Example: Predicting Favorites from Retweets

  • \(a = 258.7\)

  • \(b = 2.316\)

Example: Predicting Favorites from Retweets

plot(Retweets,Favorites,xlab = "Trump Retweets",ylab = "Trump Favorites")
abline(a = 258.7,b = 2.316 )

How do outliers affect prediction?

  • Outliers can often do serious damage to our ability to accurately predict the data.

  • Let's re-estimate the model without some outliers and replot the data.

Trump tweet model w/o some outliers

model.2<-lm(Favorites[Retweets<50000]~Retweets[Retweets<50000])
summary(model.2)
## 
## Call:
## lm(formula = Favorites[Retweets < 50000] ~ Retweets[Retweets < 
##     50000])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -103751    -147     164     202   90100 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -2.024e+02  1.928e+01   -10.5   <2e-16 ***
## Retweets[Retweets < 50000]  2.728e+00  5.403e-03   504.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3209 on 31150 degrees of freedom
## Multiple R-squared:  0.8911, Adjusted R-squared:  0.8911 
## F-statistic: 2.549e+05 on 1 and 31150 DF,  p-value: < 2.2e-16

Trump tweet model w/o some outliers

plot(Retweets[Retweets<50000],Favorites[Retweets<50000],
     xlab = "Trump Retweets",ylab = "Trump Favorites")
abline(a = -202.4,b = 2.728 )

Two models

Model 1:
\[ \hat{Favorites} = 258.7 + 2.316 * Retweets\]

Model 2: \[ \hat{Favorites} = -202.4 + 2.728 * Retweets\]

Both can be used to predict the number of Favorites given the number of Retweets

How many favorites should Trump's tweet get if it receives 1000 retweets (Model 1)?

\[ \hat{Favorites} = 258.7 + 2.316 * Retweets\] \[ \hat{Favorites} = 258.7 + 2.316 * 1000 = 2575 \]

Both can be used to predict the number of Favorites given the number of Retweets

How many favorites should Trump's tweet get if it receives 1000 retweets (Model 2)?

\[ \hat{Favorites} = -202.4 + 2.728 * Retweets\]

\[ \hat{Favorites} = -202.4 + 2.728 * 1000 = 2525.6 \approx 2526\]
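Checking both predictions in R:

```r
yhat.1 <- 258.7 + 2.316 * 1000    # Model 1 prediction at 1000 retweets
yhat.2 <- -202.4 + 2.728 * 1000   # Model 2 prediction at 1000 retweets
c(yhat.1, yhat.2)
## [1] 2574.7 2525.6
```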

Which model does a better job at predicting the number of favorites?

  • To understand this, we have to understand the concept of residuals or prediction error.

\[ Residual_{i} = y_{i} - \hat{y}_{i}\]

  • Residuals measure how far the predicted value is from the actual value.

Predictive ability of models depends on the size of the residuals

  • The equation \(Residual_{i} = y_{i} - \hat{y}_{i}\) measures the residual for only one observation.

  • To understand how good the model is, we want to measure the residuals from all observations in the data.

Sum of squared errors

\[ SSE = \sum_{i = 1}^N (y_{i} - \hat{y}_{i})^2\]

  • The model with the lower SSE is better at predicting the response variable, on average.

Residuals for the Trump tweet models: Model 1

# Model 1
Residuals.Model.1 = Favorites - predict(model.1)
plot(Favorites, Residuals.Model.1)

Residuals for the Trump tweet models: Model 2

# Model 2
Residuals.Model.2 = Favorites[Retweets<50000] - predict(model.2)
plot(Favorites[Retweets<50000], Residuals.Model.2)

SSE for models 1 and 2

Model 1

SSE.Model.1 = sum(Residuals.Model.1^2)
SSE.Model.1
## [1] 635433919622

Model 2

SSE.Model.2 = sum(Residuals.Model.2^2)
SSE.Model.2
## [1] 320803091053

Mean Squared Error

  • Clearly the residuals for Model 2 are smaller, but it is more useful to present the error in a standardized way, on the scale of the response variable.

\[ MSE = \sqrt{ \frac{SSE}{n-2}} \]

MSE for models 1 and 2

Model 1

N=length(Favorites)
MSE.Model.1 = sqrt(SSE.Model.1/(N - 2))
MSE.Model.1
## [1] 4514.732

Model 2

N = length(Favorites[Retweets<50000])
MSE.Model.2 = sqrt(SSE.Model.2/(N - 2))
MSE.Model.2
## [1] 3209.048

The linear regression model

  • deterministic model - For the model \(y = \alpha + \beta x\), each value of \(x\) corresponds to exactly one value of \(y\).

  • This would be unrealistic in the social science world.

  • E.g., if \(x =\) years of education and \(y =\) income, we can't say that a person with 12 years of education will earn exactly $30,000.

The linear regression model

  • What we would like to say is that the person has some predicted income that can vary w/in a certain range.

  • A probabilistic model allows for variability in the predicted outcome.

  • Linear regression is considered a probabilistic model.

Expected value

  • Expected value is another way of formally expressing the mean.

  • It is denoted by "E()".

  • Thus, E(x) is the mean of variable x.

  • E(y) is the mean of the variable y.
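In R, the expected value of an observed variable is just its sample mean (the values here are made up):

```r
x <- c(2, 4, 6, 8)   # made-up values of a variable x
mean(x)              # E(x)
## [1] 5
```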

Linear regression model as a probabilistic model

\[ E(y) = \alpha + \beta *x \]

  • Expressed in this way, we make it clear that linear regression predicts an average value of \(y\), around which individual observations vary.

Mean predictions

  • Thought of in this way, linear regression gives you predictions that are averages.

  • Because of this, we can use all of the tools that we used for means (confidence intervals, significance testing etc.) for regression predictions.

Example: Predicting Favorites from Retweets using Trump's Tweets

\[ \hat{Favorites}= E(Favorites) = 258.7 + 2.316 * Retweets\]

  • here \(a = 258.7\), \(b = 2.316\), and \(x = Retweets\)

  • This implies that, on average, the number of favorites goes up by 2.316 for each retweet.

  • Also implies that if the number of retweets was 0, the average number of favorites would be about 259.

Example: Predicting Favorites from Retweets using Trump's Tweets

How many favorites should Trump's tweet get if it receives 1000 retweets?

\[E(Favorites) = 258.7 + 2.316 * 1000 = 2575 \]

  • Using the linear regression model, we see that the average prediction is 2575 favorites.

  • What is the range of this prediction?

  • To figure this out, we can estimate the standard deviation of the predictions and use the 68-95-99.7 rule.

Standard deviation of the regression model

\[ s = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i = 1}^{n}(y_{i}-\hat{y}_{i})^2}{n-2}} \]

  • Does this look familiar?

Mean Squared Error

\[ MSE = \sqrt{ \frac{SSE}{n-2}} \]

  • The standard deviation of the regression model is the same quantity as the mean squared error, which is used to measure how "good" the model is.

Standard deviation for the model is thus

Model 1

N=length(Favorites)
MSE.Model.1 = sqrt(SSE.Model.1/(N - 2))
MSE.Model.1
## [1] 4514.732

What is the variability around the prediction?

\[s = 4515\]

  • Using the 68-95-99.7 rule, we can say that about 95% of the favorite counts for tweets with 1000 retweets fall between

\[ 2575 \pm 2*4515 = (-6455, 11605) \]

  • Since the number of favorites cannot be negative, the practical range is about (0, 11605).
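The same interval in R, using the Model 1 estimates from above:

```r
yhat <- 258.7 + 2.316 * 1000   # predicted favorites at 1000 retweets
s <- 4515                      # standard deviation of the regression model
c(yhat - 2 * s, yhat + 2 * s)  # roughly -6455.3 and 11604.7
```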

What about predicted favorites for 10000 retweets?

\[E(Favorites) = 258.7 + 2.316 * 10000 = 23418 \]

\[ 23418 \pm 2*4515 = (14388, 32448) \]

Midterm