- Linear Relationships
- Least Squares Prediction Equation
- Linear Regression Model
March 30, 2017
What is the relationship between gun ownership and crime?
What is the relationship between smoking and cancer?
What is the relationship between ethnic and racial diversity and trust?
Linear relationships are relationships between two or more variables that follow a specific functional form: a straight line.
\(y =\) response variable – Levels of trust.
\(x =\) explanatory variable – measure of diversity
\[ y = \alpha + \beta x \]
y-intercept \(\alpha\)
slope \(\beta\)
Positive relationships
Negative relationships
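As a small worked illustration (the numbers are made up for this example and do not come from any dataset), fix \(\alpha = 2\) and compare a positive slope with a negative one:
\[ \beta = 3: \quad y = 2 + 3x, \qquad x = 0 \Rightarrow y = 2, \qquad x = 1 \Rightarrow y = 5 \]
\[ \beta = -3: \quad y = 2 - 3x, \qquad x = 0 \Rightarrow y = 2, \qquad x = 1 \Rightarrow y = -1 \]
In both cases \(\alpha\) is the value of \(y\) when \(x = 0\), and \(\beta\) is the change in \(y\) for a one-unit increase in \(x\).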
A linear function provides a model for the relationship between two variables.
Given any two variables, we can estimate a linear function by estimating \(\alpha\) and \(\beta\) to fit a scatterplot.
trumptweets <- read.csv("https://www.ocf.berkeley.edu/~janastas/trump-tweet-data.csv")
attach(trumptweets)
trumptweets[1:5,1]
## [1] I have not heard any of the pundits or commentators discussing the fact that I spent FAR LESS MONEY on the win than Hillary on the loss!
## [2] I would have done even better in the election, if that is possible, if the winner was based on popular vote - but would campaign differently
## [3] Campaigning to win the Electoral College is much more difficult & sophisticated than the popular vote. Hillary focused on the wrong states!
## [4] Yes, it is true - Carlos Slim, the great businessman from Mexico, called me about getting together for a meeting. We met, HE IS A GREAT GUY!
## [5] especially how to get people, even with an unlimited budget, out to vote in the vital swing states ( and more). They focused on wrong states
## 31058 Levels: ...
plot(Retweets,Favorites,xlab = "Trump Retweets",ylab = "Trump Favorites")
How do we fit a model to this data if we are interested in using retweets to explain favorites?
Model: \[y = \alpha + \beta x\]
Prediction equation: \[\hat{y} = a + bx\]
\[ b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum(x-\bar{x})^2} \]
\[ a = \bar{y} - b\bar{x} \]
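The two formulas above can be checked by hand against `lm()`. The following is a minimal sketch on a five-point toy dataset invented for illustration (it is not the tweet data); the names `x`, `y`, `a`, `b`, and `fit` are chosen here for the example.

```r
# Toy data, invented for illustration only (not the tweet data)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Slope and intercept computed directly from the least squares formulas
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)

# The same estimates obtained from lm()
fit <- lm(y ~ x)

c(a = a, b = b)   # hand-computed intercept and slope
coef(fit)         # lm() estimates, which should agree
```

For this toy data the hand computation gives \(b = 0.6\) and \(a = 2.2\), and `coef(fit)` should report the same two numbers.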
model.1<-lm(Favorites~Retweets)
summary(model.1)
##
## Call:
## lm(formula = Favorites ~ Retweets)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -253188    -445    -274    -251  118566
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.587e+02  2.649e+01   9.767   <2e-16 ***
## Retweets    2.316e+00  5.513e-03 420.209   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4515 on 31173 degrees of freedom
## Multiple R-squared:  0.8499, Adjusted R-squared:  0.8499
## F-statistic: 1.766e+05 on 1 and 31173 DF,  p-value: < 2.2e-16
\(a = 258.7\)
\(b = 2.316\)
plot(Retweets,Favorites,xlab = "Trump Retweets",ylab = "Trump Favorites")
abline(a = 258.7,b = 2.316 )
Outliers can seriously damage our ability to predict the data accurately.
Let's re-estimate the model without some of the outliers and replot the data.
model.2<-lm(Favorites[Retweets<50000]~Retweets[Retweets<50000])
summary(model.2)
##
## Call:
## lm(formula = Favorites[Retweets < 50000] ~ Retweets[Retweets <
##     50000])
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -103751    -147     164     202   90100
##
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)
## (Intercept)                -2.024e+02  1.928e+01   -10.5   <2e-16 ***
## Retweets[Retweets < 50000]  2.728e+00  5.403e-03   504.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3209 on 31150 degrees of freedom
## Multiple R-squared:  0.8911, Adjusted R-squared:  0.8911
## F-statistic: 2.549e+05 on 1 and 31150 DF,  p-value: < 2.2e-16
plot(Retweets[Retweets<50000],Favorites[Retweets<50000], xlab = "Trump Retweets",ylab = "Trump Favorites")
abline(a = -202.4,b = 2.728 )
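An alternative way to drop observations is the `subset` argument of `lm()` together with `data`. The sketch below uses a toy data frame invented for illustration (the name `tweets`, its values, and the cutoff of 100 are all assumptions made for this example, not the lecture's data). One practical benefit is that the formula keeps plain column names, so `predict()` can be given new data.

```r
# Toy data, invented for illustration; the last point is an extreme outlier
tweets <- data.frame(
  Retweets  = c(10, 20, 30, 40, 50, 1000),
  Favorites = c(25, 45, 70, 85, 110, 500)
)

# Fit on all observations, outlier included
fit.all <- lm(Favorites ~ Retweets, data = tweets)

# Fit only on observations below a cutoff (100 is arbitrary here)
fit.trim <- lm(Favorites ~ Retweets, data = tweets, subset = Retweets < 100)

# Compare the two sets of estimates
coef(fit.all)
coef(fit.trim)

# Prediction for a new, hypothetical tweet with 60 retweets
new.tweet <- data.frame(Retweets = 60)
pred.trim <- predict(fit.trim, newdata = new.tweet)
pred.trim
```

In this toy example the single outlier pulls the slope of `fit.all` well below the slope of `fit.trim`, which is the same kind of shift seen between the two tweet models.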
Model 1: \[ \widehat{Favorites} = 258.7 + 2.316 \times Retweets\]
Model 2: \[ \widehat{Favorites} = -202.4 + 2.728 \times Retweets\]
How many favorites should Trump's tweet get if it receives 1000 retweets (Model 1)?
\[ \widehat{Favorites} = 258.7 + 2.316 \times Retweets\]
\[ \widehat{Favorites} = 258.7 + 2.316 \times 1000 = 2574.7 \approx 2575 \]
How many favorites should Trump's tweet get if it receives 1000 retweets (Model 2)?
\[ \widehat{Favorites} = -202.4 + 2.728 \times Retweets\]
\[ \widehat{Favorites} = -202.4 + 2.728 \times 1000 = 2525.6 \approx 2526\]
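The two predictions can be reproduced with a few lines of R. This is a minimal sketch: `predict_favorites` is a hypothetical helper written for this example (it is not part of the lecture code), and it uses the rounded coefficients reported in the two model summaries.

```r
# Hypothetical helper (not from the lecture): evaluate y-hat = a + b * x
predict_favorites <- function(a, b, retweets) {
  a + b * retweets
}

# Rounded coefficients from the two fitted models
pred.model.1 <- predict_favorites(a = 258.7,  b = 2.316, retweets = 1000)
pred.model.2 <- predict_favorites(a = -202.4, b = 2.728, retweets = 1000)

pred.model.1  # Model 1 prediction: 2574.7
pred.model.2  # Model 2 prediction: 2525.6
```

The intercept matters here: Model 2 has the steeper slope, but its negative intercept puts its prediction at 1000 retweets slightly below Model 1's.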
\[ Residual_{i} = y_{i} - \hat{y}_{i}\]
- Residuals measure how far the predicted value is from the actual value.
The equation \[ Residual_{i} = y_{i} - \hat{y}_{i}\] measures the residual for only one observation.
To understand how good the model is, we want to combine the residuals from all observations in the data.
\[ SSE = \sum_{i = 1}^N (y_{i} - \hat{y}_{i})^2\]
- The model with the lower SSE is better at predicting the response variable, on average.
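The SSE formula translates directly into R. The sketch below uses a five-point toy dataset invented for illustration (not the tweet data) and compares the hand-computed value with `deviance()`, which returns the residual sum of squares for an `lm` fit.

```r
# Toy data, invented for illustration only (not the tweet data)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

fit <- lm(y ~ x)

# Residual for each observation: actual value minus predicted value
res <- y - predict(fit)

# Sum of squared errors across all observations
sse <- sum(res^2)
sse

# deviance() on an lm fit gives the same residual sum of squares
all.equal(sse, deviance(fit))
```

For this toy data the residuals are -0.8, 0.6, 1.0, -0.6, and -0.2, so the SSE is 2.4.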
# Model 1
Residuals.Model.1 = Favorites - predict(model.1)
plot(Favorites, Residuals.Model.1)
# Model 2
Residuals.Model.2 = Favorites[Retweets<50000] - predict(model.2)
plot(Favorites[Retweets<50000], Residuals.Model.2)
Model 1
SSE.Model.1 = sum(Residuals.Model.1^2)
SSE.Model.1
## [1] 635433919622
Model 2
SSE.Model.2 = sum(Residuals.Model.2^2)
SSE.Model.2
## [1] 320803091053
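Root mean squared error (the square root of the MSE, which R reports as the residual standard error): \[ RMSE = \sqrt{ \frac{SSE}{N-2}} \]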
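When this formula is written in R, the divisor \(N - 2\) needs its own parentheses, because `SSE/N - 2` divides first and then subtracts 2. The sketch below uses toy data invented for illustration (not the tweet data) and checks the result against the residual standard error stored in `summary(fit)$sigma`.

```r
# Toy data, invented for illustration only (not the tweet data)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

fit <- lm(y ~ x)
sse <- sum(residuals(fit)^2)
n   <- length(y)

# Parentheses around (n - 2) so that SSE is divided by the degrees of freedom
rmse <- sqrt(sse / (n - 2))
rmse

# summary.lm stores the residual standard error in $sigma
all.equal(rmse, summary(fit)$sigma)
```

Subtracting 2 from \(N\) accounts for the two estimated coefficients, \(a\) and \(b\).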
Model 1
N = length(Favorites)
# Parentheses around (N - 2) so that SSE is divided by the degrees of freedom
MSE.Model.1 = sqrt(SSE.Model.1/(N - 2))
MSE.Model.1
## [1] 4514.877
Model 2
N = length(Favorites[Retweets<50000])
# Parentheses around (N - 2) so that SSE is divided by the degrees of freedom
MSE.Model.2 = sqrt(SSE.Model.2/(N - 2))
MSE.Model.2
## [1] 3209.152