Methodology

Note: All code can be found on GitHub.

Data Sources

Three data sources were used for the analysis:

  1. Yelp (Big Data): restaurant listings scraped for the five study neighborhoods
  2. DataSF (Open Data): restaurant health inspection data, 2015-2018
  3. American Community Survey (ACS): 2013-2017 five-year estimates

Data Acquisition

While the ACS and DataSF data were simple to acquire, getting the Yelp data required scraping. I wrote a scraper that returns the first 1,000 restaurant listings for a given geography in CSV format, and used it to collect all of the restaurants in our five chosen neighborhoods. Because each neighborhood has fewer than 1,000 restaurants, the cap did not affect the completeness of the data.
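As an illustration, the sketch below shows the general shape of such a scraper, assuming Yelp's public search pages paginate through a `start` query parameter. The CSS selectors, column names, and output file are hypothetical stand-ins for illustration, not the exact ones used in the scraper linked below.

```python
import csv
import time

import requests
from bs4 import BeautifulSoup

SEARCH_URL = "https://www.yelp.com/search"
HEADERS = {"User-Agent": "Mozilla/5.0 (research scraper)"}


def scrape_neighborhood(location, max_listings=1000, out_path="listings.csv"):
    """Collect up to `max_listings` restaurant listings for a location
    and write them to a CSV file."""
    rows = []
    # Yelp search results paginate in blocks of 10 via the `start` parameter.
    for start in range(0, max_listings, 10):
        resp = requests.get(
            SEARCH_URL,
            params={"find_desc": "Restaurants", "find_loc": location, "start": start},
            headers=HEADERS,
            timeout=30,
        )
        soup = BeautifulSoup(resp.text, "html.parser")
        # Assumed selectors for the listing card, name, address, and categories;
        # Yelp's markup changes often, so these are placeholders.
        cards = soup.select("div[data-testid='serp-ia-card']")
        if not cards:
            break  # no more results for this neighborhood
        for card in cards:
            name = card.select_one("h3 a")
            address = card.select_one("address")
            categories = [a.get_text(strip=True) for a in card.select("p a")]
            rows.append({
                "name": name.get_text(strip=True) if name else "",
                "address": address.get_text(strip=True) if address else "",
                "categories": ", ".join(categories),
            })
        time.sleep(1)  # be polite between requests

    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "address", "categories"])
        writer.writeheader()
        writer.writerows(rows)


scrape_neighborhood("Mission District, San Francisco, CA", out_path="mission.csv")
```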

The actual scraper and an example of its output CSV can be found on GitHub.

Data Cleaning & Wrangling

The scraper returned data that needed relatively little cleaning. However, two wrangling steps were needed to make the data usable for analysis.

The first step was properly geocoding the data. The scraper returned a street address for each listing; however, latitude and longitude coordinates were needed. For this, I used the Google Maps API with the GoogleGeocoder package to convert the addresses to coordinates.
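A minimal sketch of this step is shown below, using the official `googlemaps` Python client as a stand-in for the GoogleGeocoder package; the file names, column names, and API key placeholder are assumptions for illustration.

```python
import googlemaps
import pandas as pd

# Hypothetical setup: replace with your own Google Maps API key.
gmaps = googlemaps.Client(key="YOUR_API_KEY")

df = pd.read_csv("mission.csv")  # assumed output of the scraper sketch above


def geocode_address(address):
    """Return (lat, lng) for a street address, or (None, None) if not found."""
    results = gmaps.geocode(f"{address}, San Francisco, CA")
    if not results:
        return None, None
    loc = results[0]["geometry"]["location"]
    return loc["lat"], loc["lng"]


# Look up each address once and split the result into two columns.
coords = df["address"].apply(geocode_address)
df["latitude"] = coords.str[0]
df["longitude"] = coords.str[1]
df.to_csv("mission_geocoded.csv", index=False)
```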

The other issue was defining the cuisines. In the data, this information lives in the "categories" column; however, each restaurant could have anywhere from one to five cuisine categories, which would skew our analysis. To put the restaurants on equal footing, I kept only the first (primary) cuisine for each one.
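For example, keeping only the primary cuisine is a single pandas operation; the file and column names below are assumed from the earlier sketches.

```python
import pandas as pd

df = pd.read_csv("mission_geocoded.csv")

# "categories" holds 1-5 comma-separated cuisines, e.g. "Mexican, Bars, Breakfast & Brunch".
# Keep only the first (primary) entry so every restaurant contributes one cuisine.
df["primary_cuisine"] = (
    df["categories"].fillna("").str.split(",").str[0].str.strip()
)
df.to_csv("mission_cleaned.csv", index=False)
```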

An example of the data after the cleaning and wrangling process can be found here.