Introduction

The world is becoming saturated with data, both big and small, as we enter a new era of data collection and sensing. Never have there been such rich sources of data on almost anything you can imagine. Not only is this data abundant and accessible, but the level of detail now available is unlike anything seen before.

One of these rich datasets is the Yelp Dataset, a relatively “small” source of big data that is available for personal and educational use. An online service where users review businesses, Yelp has established itself as one of the largest sources of partially crowd-sourced data.

Source: Yelp. A screenshot from a Yelp page for a local pizza shop in Berkeley, California

Users can provide information on a business's type, location, and hours, as well as other attributes such as wheelchair accessibility and dietary accommodations. Users can also rate a business and supplement the rating with a text review.

The dataset includes millions of reviews for close to 200,000 businesses in 10 metropolitan areas. At an estimated 8.69 GB, it was not something I could easily analyze on my personal computer, a problem that cloud computing readily solves.

With this dataset, together with other sources such as Census data, we can examine relationships among a myriad of interesting metrics. How do inexpensive restaurants perform in high-income neighborhoods? How harshly do users review certain businesses based on their location? Can we predict the success of a business in an area based on the success of previous businesses? Do the reviews for a business differ based on the demographics of its location? Coming from a statistics and data science background, I found that these questions of prediction and regression piqued my interest and led me to look further into the data.

Within the Yelp Dataset, I was primarily interested in comparing different locations of the same business and looking at the user reviews for each. Given that the business is the same, what other confounding factors affect not only the rating a user gives, but also the content of their review? To answer this, I first had to clean the dataset and take the first steps in Exploratory Data Analysis (EDA).
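As a rough illustration of that starting point, here is a minimal sketch of how locations of the same business might be grouped by name. It assumes the business records are stored as line-delimited JSON with `name`, `city`, and `stars` fields; the sample records, the function names, and the two-location threshold are all hypothetical and chosen only for illustration. It is not necessarily the approach used later in this analysis.

```python
import json
from collections import defaultdict

# Hypothetical sample records, written in the line-delimited JSON layout
# assumed for the business file (one JSON object per line).
SAMPLE_LINES = [
    '{"business_id": "a1", "name": "Pizza Place", "city": "Phoenix", "stars": 4.0}',
    '{"business_id": "b2", "name": "Pizza Place", "city": "Las Vegas", "stars": 3.0}',
    '{"business_id": "c3", "name": "Corner Cafe", "city": "Phoenix", "stars": 4.5}',
]


def group_locations_by_name(lines):
    """Map each business name to the list of its location records."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        groups[record["name"]].append(record)
    return groups


def multi_location_businesses(groups, min_locations=2):
    """Keep only names that appear at min_locations or more locations."""
    return {
        name: records
        for name, records in groups.items()
        if len(records) >= min_locations
    }


chains = multi_location_businesses(group_locations_by_name(SAMPLE_LINES))
for name, records in chains.items():
    # Show the rating of each location side by side.
    ratings = {record["city"]: record["stars"] for record in records}
    print(name, ratings)
```

Because the function reads one line at a time, an open file handle could be passed in place of the sample list, so the full file would not need to fit in memory at once. Matching on the exact name is a simplification, since unrelated businesses can share a name and chains can be spelled inconsistently.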
