[POLS 4150] Problem Set 2 Solutions

Election Forensics

library(foreign)

data<-read.csv("/Users/jason/Downloads/votes-trunc2.csv")

attach(data)

(a) Create a new variable called “obama-clinton-diff”, which is the difference between Hillary Clinton’s vote share in 2016 and Obama’s vote share in 2012. Report the mean, median, mode, 1st and 3rd quartiles, and standard deviation of this variable.

Solution

obamaclintondiff = pctdem2016 - Obama
summary(obamaclintondiff) # This will give you mean, median, first and third quartile

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.23440 -0.10210 -0.06295 -0.06787 -0.03353  0.09346

sd(obamaclintondiff) # This will give you the standard deviation

## [1] 0.04954128

# Note: the mode is a bit more challenging, so if you got it correct, you received extra credit. If you did not, no points were taken off.
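Base R has no built-in function for the statistical mode (`mode()` reports an object's storage type instead). One possible approach is sketched below, using a hypothetical helper `stat.mode` and a made-up toy vector rather than the course data. For a continuous variable such as a vote-share difference, most values occur only once, so some rounding is usually needed before the mode is informative. The two-decimal rounding here is an arbitrary choice.

```r
# Hypothetical helper (not part of base R): returns the most frequent value.
# If several values tie, the first one encountered is returned.
stat.mode = function(x) {
  ux = unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

toy = c(-0.061, -0.058, -0.102, -0.064, -0.031) # made-up differences
toy.mode = stat.mode(round(toy, 2))             # round to 2 decimals first
toy.mode
```

On the real data the call would be `stat.mode(round(obamaclintondiff, 2))`, with the number of decimals chosen to suit the question.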

(b) Create a boxplot of “obama-clinton-diff”. According to this boxplot, are there more negative outliers or more positive outliers? How would you interpret this in terms of comparisons between Hillary Clinton and President Obama? Based on this boxplot, is this a left-skewed, right-skewed or symmetric distribution?

Solution

boxplot(obamaclintondiff)

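According to this boxplot, this is a left-skewed distribution: the longer tail extends toward the negative side, which pulls the mean (-0.0679) below the median (-0.0630). The outliers are also mostly negative (four negative versus two positive; see part (c)), meaning there are more counties where Hillary Clinton did unusually poorly relative to Obama than counties where she did unusually well.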
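The outliers do not have to be counted by eye. The sketch below uses `boxplot.stats()`, which returns the whisker positions a boxplot draws, and compares the mean with the median as a rough check of skew direction. The vector `x` is made up for illustration and is not the course data.

```r
# Toy data (made up): two low outliers and one high outlier
x = c(-30, -20, 1:9, 25)

stats = boxplot.stats(x)$stats     # lower whisker, hinges, median, upper whisker
n.low  = sum(x < stats[1])         # points below the lower whisker
n.high = sum(x > stats[5])         # points above the upper whisker
c(low = n.low, high = n.high)

# A mean below the median points to a longer left tail (left skew)
mean(x) < median(x)
```

`boxplot.stats()` uses hinges rather than `quantile()`'s default quartiles, so on small samples its fences can differ slightly from a hand calculation with the 1.5*IQR rule.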

(c) Calculate the upper and lower bounds for outliers of the “obama-clinton-diff” variable (remember the 1.5*IQR rule). Report the state and county names that have negative outliers. Report the state and county names that have positive outliers. In two or three sentences, come up with an explanation about why Hillary Clinton did so well (poorly) versus Obama in these counties.

Solution

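# Quartiles are taken from the summary() output in part (a)
iqr = -0.03353 - -0.10210 # Q3 - Q1 (lowercase name avoids shadowing R's IQR() function)
upper.bound = -0.03353 + 1.5*iqr # Upper bound for the outliers
lower.bound = -0.10210 - 1.5*iqr # Lower bound for the outliers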

# States that have positive outliers
state_abbr[obamaclintondiff>upper.bound] # These happen to be Texas and Virginia

## [1] TX VA
## 50 Levels: AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA MA ... WY

# States that have negative outliers
state_abbr[obamaclintondiff<lower.bound] # These happen to be Iowa, Illinois, Kentucky and Missouri

## [1] IA IL KY MO
## 50 Levels: AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA MA ... WY

# Counties that have positive outliers
area_name[obamaclintondiff>upper.bound]

## [1] Parmer County    Arlington County
## 1846 Levels: Abbeville County Acadia Parish Accomack County ... Ziebach County

# Counties that have negative outliers
area_name[obamaclintondiff<lower.bound]

## [1] Howard County    Henderson County Elliott County   Clark County    
## 1846 Levels: Abbeville County Acadia Parish Accomack County ... Ziebach County

It was sufficient to describe the states or counties. Off the top of my head, it looks like Hillary Clinton did poorly compared to Obama in Rust Belt areas, while she did well compared to Obama in very liberal areas that are connected with the federal government (Arlington, VA, for example).
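As an alternative to typing the quartiles in by hand, the bounds can be computed with `quantile()` and the matching rows pulled from a data frame, which keeps each state with its county. This is only a sketch: the data frame `toy` and its column names are made up for illustration and are not the columns of the course file.

```r
# Made-up data; column names are hypothetical stand-ins for the course file
toy = data.frame(
  state  = c("AA", "BB", "CC", "DD", "EE", "FF", "GG"),
  county = paste("County", 1:7),
  diff   = c(-0.30, -0.08, -0.07, -0.06, -0.05, -0.04, 0.20)
)

q = quantile(toy$diff, c(0.25, 0.75)) # 1st and 3rd quartiles
iqr = q[2] - q[1]
lower.bound = q[1] - 1.5*iqr
upper.bound = q[2] + 1.5*iqr

# Keep state and county together so each outlier is identified unambiguously
negative.outliers = toy[toy$diff < lower.bound, c("state", "county")]
positive.outliers = toy[toy$diff > upper.bound, c("state", "county")]
negative.outliers
positive.outliers
```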

(d) Using the 68-95-99.7% rule, what is the range within which 68% of the counties fall in terms of the “obama-clinton-diff” variable? What does this tell us about Hillary Clinton’s popularity vs. Obama?

Solution

mean(obamaclintondiff) + sd(obamaclintondiff)

## [1] -0.01833111

mean(obamaclintondiff) - sd(obamaclintondiff)

## [1] -0.1174137

About 68% of observations fall within one standard deviation of the mean of “obamaclintondiff”, assuming the distribution is roughly normal. These bounds are (-0.117, -0.018). In other words, in about 68% of counties Hillary Clinton’s vote share was between 1.8 and 11.7 percentage points lower than Obama’s. This tells us that Hillary Clinton was less popular than Obama in the typical county.
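The 68% figure is an approximation that holds for roughly normal data, so it can be worth checking the actual share of observations within one standard deviation. The sketch below does this on simulated data; the seed, sample size and parameters are arbitrary choices for illustration.

```r
set.seed(1)                               # arbitrary seed, for reproducibility
x = rnorm(10000, mean = -0.07, sd = 0.05) # simulated, roughly normal data

# Share of observations within one standard deviation of the mean
share.within.1sd = mean(abs(x - mean(x)) <= sd(x))
share.within.1sd
```

On the course data the same check would be `mean(abs(obamaclintondiff - mean(obamaclintondiff)) <= sd(obamaclintondiff))`. A share far from 0.68 would suggest that the normal approximation is poor.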

(e) Calculate a 95% confidence interval for the mean of the “obama-clinton-diff” variable by calculating the mean, standard error and critical value \(t_{0.95}\). In words, explain what this interval means.

Solution

mean.obamaclintondiff = mean(obamaclintondiff)
mean.obamaclintondiff

## [1] -0.06787239

critical.value = qt(0.025,df = length(obamaclintondiff) - 1, lower.tail = FALSE)
critical.value

## [1] 1.960727

standard.error = sd(obamaclintondiff)/sqrt(length(obamaclintondiff))
standard.error

## [1] 0.0008880705

# 95% CI
mean.obamaclintondiff + critical.value * standard.error # Upper bound

## [1] -0.06613112

mean.obamaclintondiff - critical.value * standard.error # Lower bound

## [1] -0.06961365

This confidence interval means that we are 95% confident that the average county-level difference between Hillary Clinton’s and Obama’s vote share is between -6.96 and -6.61 percentage points; that is, intervals constructed this way capture the true mean in 95% of repeated samples. Because the entire interval lies below zero, we can say with a high degree of confidence that Hillary Clinton did significantly worse than Obama, on average, in US counties.
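The three steps above (mean, standard error, critical value) can be wrapped in a small function and checked against the interval that `t.test()` reports. This is a sketch only: `mean.ci` is a hypothetical helper, and the data are simulated rather than taken from the course file.

```r
# Hypothetical helper: t-based confidence interval for a mean
mean.ci = function(x, level = 0.95) {
  n = length(x)
  se = sd(x)/sqrt(n)                       # standard error of the mean
  crit = qt(1 - (1 - level)/2, df = n - 1) # two-sided critical value
  c(lower = mean(x) - crit*se, upper = mean(x) + crit*se)
}

set.seed(2)                                # arbitrary seed
toy = rnorm(50, mean = -0.07, sd = 0.05)   # simulated data
manual = mean.ci(toy)
builtin = as.numeric(t.test(toy)$conf.int) # interval reported by t.test()
manual
all.equal(unname(manual), builtin)         # the two should agree
```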

(f) Members of the Clinton team believe that, while Hillary Clinton did poorly in some areas relative to Obama, overall there really isn’t any difference, on average at the county level, between support that she received and support that Obama received in 2012. Using what you’ve learned about significance tests, set up a significance test to answer this question at the significance level \(\alpha = 0.05\). Be sure to clearly state the null and alternative hypothesis, test statistic and p–value. Is her team correct? Why or why not?

Solution

The null and alternative hypotheses are: \[ \begin{aligned} H_{0}: \mu_{ObamaClinton} = 0 \\ H_{a}: \mu_{ObamaClinton} \neq 0 \end{aligned} \]

# We can use R to conduct this test
t.test(obamaclintondiff,alternative="two.sided", mu = 0, conf.level = 0.95)

## 
##  One Sample t-test
## 
## data:  obamaclintondiff
## t = -76.427, df = 3111, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.06961365 -0.06613112
## sample estimates:
##   mean of x 
## -0.06787239

We know that \(\bar{x} = -0.0679\), \(se_{\bar{x}} = 0.000888\) and the test statistic is:

\[ t^{*} = \frac{Observed - Expected}{SE} = \frac{-0.0679 - 0}{0.000888} \approx -76.4 \]

R reports \(t = -76.427\) from the unrounded values. According to the t-test the p-value is \(p < 2.2 \times 10^{-16}\). It is clear, then, that \(p < \alpha = 0.05\). Thus we REJECT the null hypothesis. Hillary Clinton’s team is incorrect: on average, her county-level vote share was significantly lower than Obama’s.
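To see where the test statistic and p-value come from, the sketch below computes both by hand for a one-sample, two-sided test and compares them with `t.test()`. The data are simulated with an arbitrary seed and are not the course data.

```r
set.seed(3)                              # arbitrary seed
toy = rnorm(40, mean = -0.07, sd = 0.05) # simulated data

se = sd(toy)/sqrt(length(toy))           # standard error of the mean
t.star = (mean(toy) - 0)/se              # (observed - expected)/SE, with mu0 = 0
p.value = 2*pt(-abs(t.star), df = length(toy) - 1) # two-sided p-value

fit = t.test(toy, alternative = "two.sided", mu = 0)
c(by.hand = t.star, t.test = unname(fit$statistic))
c(by.hand = p.value, t.test = fit$p.value)
```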

Hillary’s 2020 Vision

(80 points) You are working for Hillary Clinton’s 2020 election campaign (“Hillary’s 2020 Vision”) and you are charged with trying to figure out what went wrong in the 2016 campaign. Specifically, the Clinton team is interested in identifying the demographics of places where President Obama did well in 2012 but Hillary Clinton did poorly in 2016.

i. Were there any differences in terms of average % white, % black and % Hispanic in counties where Hillary Clinton did better than Obama vs. counties where Hillary Clinton did worse than Obama?

ii. Were there any differences in terms of average income and % of people that have a bachelor’s degree in counties where Hillary Clinton did better than Obama vs. counties where Hillary Clinton did worse than Obama?

(a) Answer question (i.) using a series of significance tests at the \(\alpha = 0.05\) level. As part of your answer be sure to include: (1) a table with the average % white, % black and % Hispanic in counties where Hillary Clinton did better than Obama and the average % white, % black and % Hispanic in counties where Hillary Clinton did worse than Obama; (2) the null and alternative hypotheses for each test; (3) test statistics for each test; (4) conclusions (reject/fail to reject \(H_{0}\)) for each test.

Provide a campaigning recommendation (one paragraph or less) for the Clinton team based on your findings.

Solution

# Easiest way to start this is to create two variables
better.than.obama = obamaclintondiff > 0
worse.than.obama = obamaclintondiff < 0

Let’s find the average % white, black and Hispanic in counties where Hillary did worse than Obama and in counties where she did better than Obama:

# % White
mean(White[better.than.obama]) # % White where Clinton did better than Obama

## [1] 0.8049953

mean(White[worse.than.obama])  # % White where Clinton did worse than Obama

## [1] 0.8575852

# % Black
mean(Black[better.than.obama]) # % Black where Clinton did better than Obama

## [1] 0.1065377

mean(Black[worse.than.obama])  # % Black where Clinton did worse than Obama

## [1] 0.09201517

# % Hispanic
mean(Hispanic[better.than.obama]) # % Hispanic where Clinton did better than Obama

## [1] 0.1984717

mean(Hispanic[worse.than.obama])  # % Hispanic where Clinton did worse than Obama

## [1] 0.08257621
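The six means above can also be collected into the table the question asks for. The sketch below does this with `aggregate()` on a made-up data frame. The values and the `group` labels are illustrative assumptions, not the course data.

```r
# Made-up county-level data for illustration only
toy = data.frame(
  diff     = c(0.02, 0.01, -0.05, -0.08, -0.03),
  White    = c(0.70, 0.80, 0.90, 0.85, 0.95),
  Black    = c(0.15, 0.10, 0.05, 0.10, 0.02),
  Hispanic = c(0.20, 0.10, 0.05, 0.05, 0.02)
)

# Label each county; this toy data has no exact ties (diff == 0),
# which the better/worse indicators in the solution would exclude
toy$group = ifelse(toy$diff > 0, "Clinton better", "Clinton worse")

# One row per group, one column of means per demographic variable
means.table = aggregate(cbind(White, Black, Hispanic) ~ group,
                        data = toy, FUN = mean)
means.table
```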

The null and alternative hypotheses for each test are:

\[ \begin{aligned} H_{0}: \mu_{+Obama, group} = \mu_{-Obama, group} \\ H_{a}: \mu_{+Obama, group} \neq \mu_{-Obama, group} \end{aligned} \]

Where \(\mu_{-Obama, group}\) is the population average % for the group (white, black or Hispanic) where Clinton did WORSE than Obama and \(\mu_{+Obama, group}\) is the population average % for the group (white, black or Hispanic) where Clinton did better than Obama.

We can use R to automatically conduct t-tests for each of these. Let’s start with Whites:

# % White t.test
t.test(x= White[better.than.obama],
      y =White[worse.than.obama])

## 
##  Welch Two Sample t-test
## 
## data:  White[better.than.obama] and White[worse.than.obama]
## t = -5.1196, df = 250.19, p-value = 6.11e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.07282089 -0.03235889
## sample estimates:
## mean of x mean of y 
## 0.8049953 0.8575852

The test statistic calculated is :

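\[ t^{*} = \frac{Observed - Expected}{SE} = \frac{\bar{x}_{+Obama, White} - \bar{x}_{-Obama, White} - 0}{\sqrt{se_{+Obama, White}^{2}+ se_{-Obama, White}^{2}}} = \frac{0.805 - 0.858 - 0 }{\sqrt{0.01^2 + 0.001^2 }} \approx -5.27 \]

Because the means and standard errors are rounded here, this hand calculation differs slightly from the \(t = -5.1196\) that R reports.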

The p-value for this test statistic is \(P(|t| > |t^{*}|) = 6.11 \times 10^{-7}\). Clearly \(6.11 \times 10^{-7} < 0.05\), so we reject the null hypothesis in favor of the alternative. A quick look at the 95% confidence interval \(95\%CI: (-0.0728 , -0.0324)\) shows that the % white in counties where Hillary Clinton did better than Obama is lower than the % white in counties where she did worse. This is consistent with Hillary Clinton losing a significant share of white voters who voted for Obama, either due to lower turnout or votes for other candidates, although county-level averages cannot show how individual voters behaved.
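The formula for the two-sample statistic can be checked against R directly. The sketch below computes the Welch t statistic by hand from each group's standard error and compares it with `t.test()`, which uses the Welch version by default. The two samples are simulated with an arbitrary seed and made-up parameters.

```r
set.seed(4)                             # arbitrary seed
x = rnorm(30,  mean = 0.80, sd = 0.10)  # simulated "better than Obama" group
y = rnorm(200, mean = 0.86, sd = 0.08)  # simulated "worse than Obama" group

se.x = sd(x)/sqrt(length(x))            # standard error of each group mean
se.y = sd(y)/sqrt(length(y))
t.star = (mean(x) - mean(y) - 0)/sqrt(se.x^2 + se.y^2)

fit = t.test(x, y)                      # Welch two-sample t-test (default)
c(by.hand = t.star, t.test = unname(fit$statistic))
```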

Repeat this using R for % black and % Hispanic:

# % Black t.test
t.test(x= Black[better.than.obama],
      y =Black[worse.than.obama])

## 
##  Welch Two Sample t-test
## 
## data:  Black[better.than.obama] and Black[worse.than.obama]
## t = 1.5733, df = 252.51, p-value = 0.1169
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.003656567  0.032701694
## sample estimates:
##  mean of x  mean of y 
## 0.10653774 0.09201517

# % Hispanic t.test
t.test(x= Hispanic[better.than.obama],
      y =Hispanic[worse.than.obama])

## 
##  Welch Two Sample t-test
## 
## data:  Hispanic[better.than.obama] and Hispanic[worse.than.obama]
## t = 9.3019, df = 227.27, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.09134488 0.14044610
## sample estimates:
##  mean of x  mean of y 
## 0.19847170 0.08257621

Overall we see that counties in which Hillary Clinton did better than Obama had a lower % white and a higher % Hispanic than counties in which she did worse than Obama; the difference in % black was not statistically significant (p = 0.117). If I were advising Hillary Clinton on a 2020 run based only on this data, I would recommend that she focus on increasing her popularity among white swing voters or the Democratic base.

(b) Repeat part (a) to answer question (ii.).

Solution

# Income
mean(Income[better.than.obama]) # Avg income where Clinton did better than Obama

## [1] 30648.02

mean(Income[worse.than.obama])  # Avg income where Clinton did worse than Obama

## [1] 23047.06

Clinton did better than Obama in higher income areas.

# Bachelors degree
mean(Edu_batchelors[better.than.obama]) # Avg % of bachelor's degree holders where Clinton did better than Obama

## [1] 35.4533

mean(Edu_batchelors[worse.than.obama])  # Avg % of bachelor's degree holders where Clinton did worse than Obama

## [1] 18.59424

Clinton did better than Obama in areas with a higher % of people with bachelor’s degrees.

The hypothesis that you are testing is the same:

\[ \begin{aligned} H_{0}: \mu_{+Obama, group} = \mu_{-Obama, group} \\ H_{a}: \mu_{+Obama, group} \neq \mu_{-Obama, group} \end{aligned} \]

Except that now the “groups” are income and % with a bachelor’s degree.

# Income t.test
t.test(x= Income[better.than.obama],
      y =Income[worse.than.obama])

## 
##  Welch Two Sample t-test
## 
## data:  Income[better.than.obama] and Income[worse.than.obama]
## t = 11.548, df = 218.74, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  6303.741 8898.189
## sample estimates:
## mean of x mean of y 
##  30648.02  23047.06

# % Bachelors degree t.test
t.test(x= Edu_batchelors[better.than.obama],
      y =Edu_batchelors[worse.than.obama])

## 
##  Welch Two Sample t-test
## 
## data:  Edu_batchelors[better.than.obama] and Edu_batchelors[worse.than.obama]
## t = 19.063, df = 221.11, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  15.11616 18.60196
## sample estimates:
## mean of x mean of y 
##  35.45330  18.59424

Counties where Hillary Clinton did better than Obama had a higher average income and a higher % of bachelor’s degree holders than counties where she did worse than Obama. Both differences are statistically significant (p < 2.2e-16). If I were advising Hillary Clinton on a 2020 run, I would suggest that she focus on areas with lower incomes and a lower % of college degree holders.