- Linear Relationships
- Least Squares Prediction Equation
- Linear Regression Model
April 4, 2017
What is the relationship between gun ownership and crime?
What is the relationship between smoking and cancer?
What is the relationship between ethnic and racial diversity and trust?
Linear relationships are relationships between two or more variables that have a certain functional form.
\(y =\) response variable – Levels of trust.
\(x =\) explanatory variable – measure of diversity
\[ y = \alpha + \beta x \]
y-intercept \(\alpha\)
slope \(\beta\)
Positive relationships
Negative relationships
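As a minimal sketch of the two cases, the snippet below evaluates \(y = \alpha + \beta x\) once with a positive slope and once with a negative slope. The helper name `linear_fn` and the values of `alpha` and `beta` are made up for illustration.

```r
# Hypothetical helper: evaluate y = alpha + beta * x for a vector x.
linear_fn <- function(x, alpha, beta) alpha + beta * x

x <- 0:4
positive <- linear_fn(x, alpha = 1, beta = 2)    # positive relationship: y rises with x
negative <- linear_fn(x, alpha = 1, beta = -2)   # negative relationship: y falls with x

print(positive)  # 1 3 5 7 9
print(negative)  # 1 -1 -3 -5 -7
```

In both cases the y-intercept is the value of y at x = 0, and the slope is the change in y for a one-unit increase in x.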
A linear function provides a model for the relationship between two variables.
Given any two variables, we can estimate a linear function by estimating \(\alpha\) and \(\beta\) to fit a scatterplot.
trumptweets <- read.csv("https://www.ocf.berkeley.edu/~janastas/trump-tweet-data.csv")
attach(trumptweets)
trumptweets[1:5,1]
## [1] I have not heard any of the pundits or commentators discussing the fact that I spent FAR LESS MONEY on the win than Hillary on the loss!
## [2] I would have done even better in the election, if that is possible, if the winner was based on popular vote - but would campaign differently
## [3] Campaigning to win the Electoral College is much more difficult & sophisticated than the popular vote. Hillary focused on the wrong states!
## [4] Yes, it is true - Carlos Slim, the great businessman from Mexico, called me about getting together for a meeting. We met, HE IS A GREAT GUY!
## [5] especially how to get people, even with an unlimited budget, out to vote in the vital swing states ( and more). They focused on wrong states
## 31058 Levels: ...
plot(Retweets,Favorites,xlab = "Trump Retweets",ylab = "Trump Favorites")
How do we fit a model to this data if we are interested in using retweets to explain favorites?
Model: \[y = \alpha + \beta x\]
Prediction Equations: \[\hat{y} = a + bx\]
\[ b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum(x-\bar{x})^2} \]
\[ a = \bar{y} - b\bar{x} \]
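The two formulas above can be checked by hand against `lm()`. The sketch below uses simulated data, not the tweet data; the variables `x` and `y` and the values used to generate them are assumptions for illustration only.

```r
set.seed(1)
x <- runif(100, 0, 1000)                  # made-up explanatory variable
y <- 50 + 2 * x + rnorm(100, sd = 100)    # made-up response with noise

# Least squares slope and intercept, computed directly from the formulas
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)

# lm() should return the same two numbers
fit <- lm(y ~ x)
print(c(a = a, b = b))
print(coef(fit))
```

The hand-computed `a` and `b` agree with the `(Intercept)` and `x` coefficients up to floating-point error.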
model.1<-lm(Favorites~Retweets)
summary(model.1)
## 
## Call:
## lm(formula = Favorites ~ Retweets)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -253188    -445    -274    -251  118566 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.587e+02  2.649e+01   9.767   <2e-16 ***
## Retweets    2.316e+00  5.513e-03 420.209   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4515 on 31173 degrees of freedom
## Multiple R-squared:  0.8499, Adjusted R-squared:  0.8499 
## F-statistic: 1.766e+05 on 1 and 31173 DF,  p-value: < 2.2e-16
\(a = 258.7\)
\(b = 2.316\)
plot(Retweets,Favorites,xlab = "Trump Retweets",ylab = "Trump Favorites")
abline(a = 258.7,b = 2.316 )
Outliers can seriously damage our ability to predict the data accurately.
Let's re-estimate the model without the outlying tweets (those with 50,000 or more retweets) and replot the data.
model.2<-lm(Favorites[Retweets<50000]~Retweets[Retweets<50000])
summary(model.2)
## 
## Call:
## lm(formula = Favorites[Retweets < 50000] ~ Retweets[Retweets < 
##     50000])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -103751    -147     164     202   90100 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -2.024e+02  1.928e+01   -10.5   <2e-16 ***
## Retweets[Retweets < 50000]  2.728e+00  5.403e-03   504.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3209 on 31150 degrees of freedom
## Multiple R-squared:  0.8911, Adjusted R-squared:  0.8911 
## F-statistic: 2.549e+05 on 1 and 31150 DF,  p-value: < 2.2e-16
plot(Retweets[Retweets<50000],Favorites[Retweets<50000], xlab = "Trump Retweets",ylab = "Trump Favorites")
abline(a = -202.4,b = 2.728 )
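One possible alternative to indexing every variable is the `subset` argument of `lm()`, together with passing the fitted model straight to `abline()`. The sketch below is an illustration on simulated data; the data frame `tweets` and its two artificial outliers are invented here and are not the tweet data.

```r
set.seed(2)
# Made-up data: 200 ordinary points plus two artificial high-retweet outliers
tweets <- data.frame(Retweets = c(runif(200, 0, 40000), 80000, 120000))
tweets$Favorites <- 2.5 * tweets$Retweets + rnorm(202, sd = 3000)
tweets$Favorites[201:202] <- c(20000, 30000)

fit_all  <- lm(Favorites ~ Retweets, data = tweets)
fit_trim <- lm(Favorites ~ Retweets, data = tweets, subset = Retweets < 50000)

# Compare the two sets of coefficients side by side
print(rbind(all = coef(fit_all), trimmed = coef(fit_trim)))

# abline() accepts a fitted lm object, so the coefficients need not be retyped
plot(Favorites ~ Retweets, data = tweets)
abline(fit_all, lty = 2)   # dashed: fit with outliers
abline(fit_trim)           # solid: fit without outliers
```

Keeping the formula as `Favorites ~ Retweets` also keeps the coefficient names short in the summary output.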
Model 1:
\[ \hat{Favorites} = 258.7 + 2.316 * Retweets\]
Model 2:
\[ \hat{Favorites} = -202.4 + 2.728 * Retweets\]
How many favorites should Trump's tweet get if it receives 1000 retweets (Model 1)?
\[ \hat{Favorites} = 258.7 + 2.316 * Retweets\]
\[ \hat{Favorites} = 258.7 + 2.316 * 1000 \approx 2575 \]
How many favorites should Trump's tweet get if it receives 1000 retweets (Model 2)?
\[ \hat{Favorites} = -202.4 + 2.728 * Retweets\]
\[ \hat{Favorites} = -202.4 + 2.728 * 1000 \approx 2526\]
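The arithmetic for both models can be reproduced with a small helper. The function name `predict_favorites` is hypothetical; the coefficients are the rounded values reported in the two model summaries.

```r
# Hypothetical helper: plug a retweet count into y-hat = a + b * x
predict_favorites <- function(retweets, a, b) a + b * retweets

model1 <- predict_favorites(1000, a = 258.7,  b = 2.316)   # Model 1 coefficients
model2 <- predict_favorites(1000, a = -202.4, b = 2.728)   # Model 2 coefficients
print(c(model1 = model1, model2 = model2))  # 2574.7 2525.6
```

With the fitted object itself, `predict(model.1, newdata = data.frame(Retweets = 1000))` should give the same Model 1 answer up to rounding of the coefficients.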
\[ Residual_{i} = y_{i} - \hat{y}_{i}\]
- Residuals measure how far the predicted value is from the actual value.
The equation \[ Residual = y_{i} - \hat{y}_{i}\] measures residuals for only one observation.
To understand how good the model is we want to measure the residuals from all observations in the data.
\[ SSE = \sum_{i = 1}^N (y_{i} - \hat{y}_{i})^2\]
- The model with the lower SSE is better at predicting the response variable, on average.
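A minimal sketch of the SSE calculation on simulated data (the variables `x` and `y` are made up for illustration). It also compares the least squares line with a deliberately shifted line to show that the fitted line has the smaller SSE.

```r
set.seed(3)
x <- runif(50, 0, 10)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

res <- y - fitted(fit)     # residual_i = y_i - yhat_i
sse <- sum(res^2)          # sum of squared errors for the fitted line
print(sse)
print(deviance(fit))       # for lm fits, deviance() is the residual sum of squares

# Any other line, e.g. the fitted line shifted up by 1, has a larger SSE
sse_shifted <- sum((y - (fitted(fit) + 1))^2)
print(sse_shifted > sse)   # TRUE
```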
# Model 1
Residuals.Model.1 = Favorites - predict(model.1)
plot(Favorites, Residuals.Model.1)
# Model 2
Residuals.Model.2 = Favorites[Retweets<50000] - predict(model.2)
plot(Favorites[Retweets<50000], Residuals.Model.2)
Model 1
SSE.Model.1 = sum(Residuals.Model.1^2)
SSE.Model.1
## [1] 635433919622
Model 2
SSE.Model.2 = sum(Residuals.Model.2^2)
SSE.Model.2
## [1] 320803091053
\[ \sqrt{MSE} = \sqrt{ \frac{SSE}{n-2}} \]
Model 1
N = length(Favorites)
# Parenthesize the denominator so that SSE is divided by (N - 2)
MSE.Model.1 = sqrt(SSE.Model.1/(N - 2))
MSE.Model.1
## [1] 4514.877
Model 2
N = length(Favorites[Retweets<50000])
# Parenthesize the denominator so that SSE is divided by (N - 2)
MSE.Model.2 = sqrt(SSE.Model.2/(N - 2))
MSE.Model.2
## [1] 3209.152
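The quantity \(\sqrt{SSE/(n-2)}\) is what `summary()` reports as the residual standard error. The sketch below checks this on simulated data (made-up `x` and `y`) and shows why the denominator needs its own parentheses in R.

```r
set.seed(4)
x <- runif(30, 0, 10)
y <- 5 + 3 * x + rnorm(30, sd = 2)
fit <- lm(y ~ x)

n   <- length(y)
sse <- sum(residuals(fit)^2)

s_correct <- sqrt(sse / (n - 2))   # divide by (n - 2)
s_wrong   <- sqrt(sse / n - 2)     # operator precedence: (sse / n) - 2, a different number

print(s_correct)
print(summary(fit)$sigma)          # residual standard error reported by summary()
print(s_wrong)
```

With tens of thousands of observations the two versions differ only slightly, which makes the precedence slip easy to miss.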
Deterministic model - For the model \(y = \alpha + \beta x\), each value of x corresponds to exactly one value of y.
This would be unrealistic in the social science world.
E.g., if \(x =\) years of education and \(y =\) income, we can't say that a person with 12 years of education should have an income of exactly $30,000.
What we would like to say is that the person has some predicted income that can vary within a certain range.
A probabilistic model allows for variability in the predicted outcome.
Linear regression is considered a probabilistic model.
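The difference between the two kinds of model can be sketched with a small simulation. The values of `alpha`, `beta`, and the error standard deviation below are invented so that 12 years of education maps to a mean income of $30,000; they are not estimates from any data.

```r
set.seed(5)
educ  <- rep(12, 5)      # five people, each with 12 years of education
alpha <- 6000            # made-up intercept
beta  <- 2000            # made-up slope

# Deterministic: everyone with the same x gets exactly the same y
income_det <- alpha + beta * educ

# Probabilistic: y varies around the mean alpha + beta * x
income_prob <- alpha + beta * educ + rnorm(5, mean = 0, sd = 5000)

print(income_det)           # 30000 30000 30000 30000 30000
print(round(income_prob))   # five different incomes scattered around 30000
```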
Expected value is another way of formally expressing the mean.
It is denoted by "E()".
Thus, E(x) is the mean of variable x.
E(y) is the mean of the variable y.
\[ E(y) = \alpha + \beta *x \]
Viewed this way, linear regression gives predictions that are averages: \(\hat{y}\) estimates the mean of \(y\) among observations with a given value of \(x\).
Because of this, we can use all of the tools that we used for means (confidence intervals, significance testing etc.) for regression predictions.
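As one example of reusing these tools, `predict()` can return a confidence interval for the mean response at a chosen value of x. This is a sketch on simulated data; the data frame `d` and the numbers used to generate it are assumptions for illustration.

```r
set.seed(6)
d <- data.frame(x = runif(100, 0, 1000))
d$y <- 250 + 2.3 * d$x + rnorm(100, sd = 300)
fit <- lm(y ~ x, data = d)

# 95% confidence interval for E(y) at x = 500
ci <- predict(fit, newdata = data.frame(x = 500),
              interval = "confidence", level = 0.95)
print(ci)            # columns: fit (estimated mean), lwr, upr

# Confidence intervals for the intercept and slope themselves
print(confint(fit))
```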
\[ \hat{Favorites}= E(Favorites) = 258.7 + 2.316 * Retweets\]
Here \(\alpha = 258.7\), \(\beta = 2.316\), \(x = Retweets\).
This implies that, on average, the number of favorites goes up by 2.316 for each additional retweet.
It also implies that if the number of retweets was 0, the average number of favorites would be about 259.
How many favorites should Trump's tweet get if it receives 1000 retweets?
\[E(Favorites) = 258.7 + 2.316 * 1000 = 2575 \]
Using the linear regression model, we see that the average prediction is 2575 favorites.
What is the range of this prediction?
To figure this out, we can estimate the standard deviation of the observations around the prediction and use the 68-95-99.7 rule.
\[ s = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i = 1}^{n}(y_{i}-\hat{y}_{i})^2}{n-2}} \]
\[ \sqrt{MSE} = \sqrt{ \frac{SSE}{n-2}} \]
Model 1
N = length(Favorites)
# Parenthesize the denominator so that SSE is divided by (N - 2)
MSE.Model.1 = sqrt(SSE.Model.1/(N - 2))
MSE.Model.1
## [1] 4514.877
\[s = 4515\]
\[ 2575 \pm 2*4515 = (-6455, 11605) \]
Since a tweet cannot have a negative number of favorites, the plausible range is (0, 11605).
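The range calculation can be sketched as a small function. The name `approx_range` is hypothetical; `a`, `b`, and `s` are the rounded Model 1 values from the summary output, and truncating the lower bound at 0 is an assumption made because favorites cannot be negative.

```r
a <- 258.7   # Model 1 intercept (rounded)
b <- 2.316   # Model 1 slope (rounded)
s <- 4515    # Model 1 residual standard error (rounded)

# Hypothetical helper: prediction plus or minus k * s for a single retweet count,
# with the lower bound truncated at 0
approx_range <- function(retweets, k = 2) {
  center <- a + b * retweets
  c(lower = max(0, center - k * s), upper = center + k * s)
}

print(approx_range(1000))   # lower 0, upper 11604.7
```

Using k = 2 corresponds to the roughly 95% band of the 68-95-99.7 rule, which assumes the residuals are approximately normal.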
\[E(Favorites) = 258.7 + 2.316 * 10000 \approx 23419 \]
\[ 23419 \pm 2*4515 = (14389, 32449) \]
Midterm