March 30, 2017

For Today

  • Linear Relationships
  • Least Squares Prediction Equation
  • Linear Regression Model

Relationships between variables

  • What is the relationship between gun ownership and crime?

  • What is the relationship between smoking and cancer?

  • What is the relationship between ethnic and racial diversity and trust?

Putnam's Diversity and Trust Study

Linear relationships

  • Linear relationships are relationships between two or more variables that take a specific functional form.

  • \(y =\) response variable – levels of trust

  • \(x =\) explanatory variable – measure of diversity

Linear functions

\[ y = \alpha + \beta x \]

  • y-intercept \(\alpha\)

  • slope \(\beta\)
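To see what the intercept and slope do, we can evaluate a linear function in R. The values of \(\alpha\) and \(\beta\) below are chosen purely for illustration:

```r
# A linear function y = alpha + beta * x (illustrative values)
alpha <- 2    # y-intercept: the value of y when x = 0
beta  <- 0.5  # slope: the change in y for a one-unit increase in x

x <- 0:4
y <- alpha + beta * x
y
## [1] 2.0 2.5 3.0 3.5 4.0
```

Each one-unit step in \(x\) raises \(y\) by exactly \(\beta = 0.5\), and at \(x = 0\) we get \(y = \alpha = 2\).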

Linear functions

  • Positive relationships

  • Negative relationships

Linear functions and models

  • A linear function provides a model for the relationship between two variables.

  • Given data on any two variables, we can estimate a linear function by choosing \(\alpha\) and \(\beta\) so that the line fits a scatterplot of the data.

Scatterplots and linear relationships

trumptweets <- read.csv("https://www.ocf.berkeley.edu/~janastas/trump-tweet-data.csv")
attach(trumptweets)
trumptweets[1:5,1]
## [1] I have not heard any of the pundits or commentators discussing the fact that I spent FAR LESS MONEY on the win than Hillary on the loss!    
## [2] I would have done even better in the election, if that is possible, if the winner was based on popular vote - but would campaign differently
## [3] Campaigning to win the Electoral College is much more difficult & sophisticated than the popular vote. Hillary focused on the wrong states! 
## [4] Yes, it is true - Carlos Slim, the great businessman from Mexico, called me about getting together for a meeting. We met, HE IS A GREAT GUY!
## [5] especially how to get people, even with an unlimited budget, out to vote in the vital swing states ( and more). They focused on wrong states
## 31058 Levels:  ...

Scatterplots and linear relationships

plot(Retweets,Favorites,xlab = "Trump Retweets",ylab = "Trump Favorites")

Least Squares Prediction Equation

  • How do we fit a model to this data if we are interested in using retweets to explain favorites?

  • Model: \[y = \alpha + \beta x\]

  • Prediction Equations: \[\hat{y} = a + bx\]

Least Squares Prediction Equation

\[ b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum(x-\bar{x})^2} \]

\[ a = \bar{y} - b\bar{x} \]
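These formulas can be verified by hand in R. The tiny dataset below is made up for illustration (it is not the tweet data); the hand-computed \(a\) and \(b\) match what lm() returns:

```r
# Made-up data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# Slope: sum of cross-deviations over sum of squared x-deviations
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: a = ybar - b * xbar
a <- mean(y) - b * mean(x)

c(a = a, b = b)
##    a    b 
## 0.14 1.96

# lm() gives the same estimates
coef(lm(y ~ x))
```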

Example: Predicting Favorites from Retweets

model.1<-lm(Favorites~Retweets)
summary(model.1)
## 
## Call:
## lm(formula = Favorites ~ Retweets)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -253188    -445    -274    -251  118566 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.587e+02  2.649e+01   9.767   <2e-16 ***
## Retweets    2.316e+00  5.513e-03 420.209   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4515 on 31173 degrees of freedom
## Multiple R-squared:  0.8499, Adjusted R-squared:  0.8499 
## F-statistic: 1.766e+05 on 1 and 31173 DF,  p-value: < 2.2e-16

Example: Predicting Favorites from Retweets

  • \(a = 258.7\)

  • \(b = 2.316\)

Example: Predicting Favorites from Retweets

plot(Retweets,Favorites,xlab = "Trump Retweets",ylab = "Trump Favorites")
abline(a = 258.7,b = 2.316 )

How do outliers affect prediction?

  • Outliers can often do serious damage to our ability to predict the data accurately.

  • Let's re-estimate the model without some outliers and replot the data.

Trump tweet model w/o some outliers

model.2<-lm(Favorites[Retweets<50000]~Retweets[Retweets<50000])
summary(model.2)
## 
## Call:
## lm(formula = Favorites[Retweets < 50000] ~ Retweets[Retweets < 
##     50000])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -103751    -147     164     202   90100 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -2.024e+02  1.928e+01   -10.5   <2e-16 ***
## Retweets[Retweets < 50000]  2.728e+00  5.403e-03   504.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3209 on 31150 degrees of freedom
## Multiple R-squared:  0.8911, Adjusted R-squared:  0.8911 
## F-statistic: 2.549e+05 on 1 and 31150 DF,  p-value: < 2.2e-16

Trump tweet model w/o some outliers

plot(Retweets[Retweets<50000],Favorites[Retweets<50000],
     xlab = "Trump Retweets",ylab = "Trump Favorites")
abline(a = -202.4,b = 2.728 )

Two models

Model 1:
\[ \hat{Favorites} = 258.7 + 2.316 * Retweets\]

Model 2: \[ \hat{Favorites} = -202.4 + 2.728 * Retweets\]

Both can be used to predict the number of Favorites given the number of Retweets

How many favorites should Trump's tweet get if it receives 1000 retweets (Model 1)?

\[ \hat{Favorites} = 258.7 + 2.316 * Retweets\] \[ \hat{Favorites} = 258.7 + 2.316 * 1000 = 2574.7 \approx 2575 \]

Both can be used to predict the number of Favorites given the number of Retweets

How many favorites should Trump's tweet get if it receives 1000 retweets (Model 2)?

\[ \hat{Favorites} = -202.4 + 2.728 * Retweets\]

\[ \hat{Favorites} = -202.4 + 2.728 * 1000 = 2525.6\]
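We can check both predictions by plugging 1000 retweets into each fitted equation in R:

```r
# Predicted favorites for a tweet with 1000 retweets
model1.pred <- 258.7 + 2.316 * 1000
model2.pred <- -202.4 + 2.728 * 1000
c(model1.pred, model2.pred)
## [1] 2574.7 2525.6
```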

Which model does a better job at predicting the number of favorites?

  • To understand this, we have to understand the concept of residuals or prediction error.

\[ Residual_{i} = y_{i} - \hat{y}_{i}\]

  • Residuals measure how far the predicted value is from the actual value.

Predictive ability of models depends on the size of the residuals

  • The equation \[ Residual_{i} = y_{i} - \hat{y}_{i}\] measures the residual for only one observation.

  • To understand how good the model is we want to measure the residuals from all observations in the data.

Sum of squared errors

\[ SSE = \sum_{i = 1}^N (y_{i} - \hat{y}_{i})^2\]

  • The model with the lower SSE is better at predicting the response variable, on average.

Residuals for the Trump tweet models: Model 1

# Model 1
Residuals.Model.1 = Favorites - predict(model.1)
plot(Favorites, Residuals.Model.1)

Residuals for the Trump tweet models: Model 2

# Model 2
Residuals.Model.2 = Favorites[Retweets<50000] - predict(model.2)
plot(Favorites[Retweets<50000], Residuals.Model.2)

SSE for models 1 and 2

Model 1

SSE.Model.1 = sum(Residuals.Model.1^2)
SSE.Model.1
## [1] 635433919622

Model 2

SSE.Model.2 = sum(Residuals.Model.2^2)
SSE.Model.2
## [1] 320803091053

Root Mean Squared Error

  • Clearly the residuals for Model 2 are smaller, but it is more useful to present the error in a standardized way.

  • The root mean squared error (RMSE) is what R reports in the summary output as the residual standard error.

\[ RMSE = \sqrt{ \frac{SSE}{N-2}} \]

RMSE for models 1 and 2

Model 1

N = length(Favorites)
MSE.Model.1 = sqrt(SSE.Model.1/(N-2))
MSE.Model.1
## [1] 4514.877

Model 2

N = length(Favorites[Retweets<50000])
MSE.Model.2 = sqrt(SSE.Model.2/(N-2))
MSE.Model.2
## [1] 3209.152