Bicycle sharing systems (BSS) provide people with free or rental bicycles suitable for short-distance trips in urban areas, thus reducing traffic congestion, air pollution, and noise. Many cities around the world have introduced BSS as a form of sustainable transport. These systems generate large amounts of transportation data, and mining it helps to understand the underlying city dynamics. This project aims to develop a method for analyzing BSS usage data to reveal urban mobility patterns, with Washington D.C. as a case study.

Data: I used trip data shared by Capital Bikeshare, metro DC’s bikeshare service, which operates 4,300 bikes and over 500 stations. After data cleaning, about 325,800 trips remain for September 2018. Cleaning removed trips that lasted less than one minute and started and ended at the same station, which most likely result from user misoperation.
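The cleaning rule can be sketched as a simple filter over trip records. This is a minimal, dependency-free sketch and not the original code: the column names `Duration` (in seconds), `Start station number` and `End station number` are assumptions about the trip-history file and should be checked against the actual data, and the rule is read here as dropping a trip only when both conditions hold.

```python
import csv
import io

MIN_DURATION_S = 60  # one minute, as stated in the text


def is_false_start(trip):
    """True for a trip shorter than one minute that returns to its start station."""
    too_short = float(trip["Duration"]) < MIN_DURATION_S
    same_station = trip["Start station number"] == trip["End station number"]
    return too_short and same_station


def clean_trips(rows):
    """Keep every trip that is not flagged as a false start."""
    return [trip for trip in rows if not is_false_start(trip)]


# Tiny made-up sample (assumed column names), only to show the filter in action.
sample_csv = """Duration,Start station number,End station number
45,31000,31000
45,31000,31001
900,31000,31000
1200,31002,31005
"""
trips = list(csv.DictReader(io.StringIO(sample_csv)))
cleaned = clean_trips(trips)
print(len(trips), "->", len(cleaned))
```

If the rule is instead meant to drop a trip when either condition holds, replacing `and` with `or` in `is_false_start` gives that stricter variant.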

Methods: Through preliminary exploratory data analysis (EDA), I first found that factors such as membership type and workday status strongly influence daily usage profiles. Next, I created a count series for each station that takes these factors into account, and then clustered the stations with the Global Alignment Kernel (GAK) K-means algorithm based on the similarity between count series. Finally, I interpreted the meaning of the clusters in terms of geography and demographics.

Results: The results reveal the attractiveness of stations and how the neighborhood characteristics of stations affect their mobility patterns. This method may serve as an aid for urban planning, BSS fleet management, and the siting of businesses and public spaces.

Count time series of stations

Based on the results of exploratory data analysis done in earlier coursework, rider membership (registered or casual) and day type (workday or non-workday) are strongly influential factors. To focus on local users, this project uses records of registered users only. Taking the difference between workdays and non-workdays into account, the raw data can be used to derive the following count statistics to describe station usage:

For each station s, a count vector describes the departure and arrival activity on workdays (w1) and non-workdays (w0) for each hour of the day (h0 to h23); the vector therefore has length 2 × 2 × 24 = 96. I used the median of each hourly count across all workdays or non-workdays to mitigate the effect of temporary social events.
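The construction of the length-96 vector can be sketched as follows. This is only an illustration under stated assumptions: the event tuples, the ordering of the four 24-hour blocks, and the rule that weekends are the only non-workdays (public holidays ignored) are all hypothetical choices, not taken from the original analysis.

```python
from collections import Counter
from datetime import date
from statistics import median

DIRECTIONS = ("departure", "arrival")
DAY_TYPES = ("workday", "non-workday")


def day_type(d):
    # Assumption: weekends are the only non-workdays; public holidays are ignored.
    return "workday" if d.weekday() < 5 else "non-workday"


def count_vector(events, station, all_dates):
    """Return the 96 median hourly counts for one station.

    events: iterable of (station, date, hour, direction) tuples.
    all_dates: every calendar date in the study period, so that hours
    without any trip enter the median as zeros.
    Assumed layout: departure/workday, departure/non-workday,
    arrival/workday, arrival/non-workday, 24 hours each.
    """
    counts = Counter(
        (direction, d, hour)
        for s, d, hour, direction in events
        if s == station
    )
    vector = []
    for direction in DIRECTIONS:
        for kind in DAY_TYPES:
            days = [d for d in all_dates if day_type(d) == kind]
            for hour in range(24):
                vector.append(median(counts[(direction, d, hour)] for d in days))
    return vector


# Made-up events: two departures at 8:00 on every workday, plus a one-off
# burst of 50 arrivals at 18:00 on a single day (a "temporary social event").
all_dates = [date(2018, 9, day) for day in range(1, 31)]
events = [("A", d, 8, "departure") for d in all_dates if d.weekday() < 5 for _ in range(2)]
events += [("A", date(2018, 9, 3), 18, "arrival")] * 50

vec = count_vector(events, "A", all_dates)
print(len(vec), vec[8], vec[48 + 18])
```

Because the median is taken over all days of a type, the one-off burst of arrivals does not show up in the profile, whereas a mean would carry part of it.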

Calculation of the similarity between count time series

Dynamic time warping (DTW) finds the optimal non-linear alignment between two time series, so, unlike the Euclidean distance, which compares them point by point, it can match similar shapes that are shifted in time. Assuming that the four 24-hour time series in the vector are independent, I reshaped the vector into a 4 x 24 matrix before comparison. Some stations show similar trends while their amplitudes differ considerably because of station capacity and location. To deal with this amplitude scaling issue, I transformed all time series so that each dimension has a mean of 0 and a standard deviation of 1. Fig. 1 shows the basic idea of calculating the similarity between two time series of 4 dimensions in this case.
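The normalization and comparison steps can be sketched in plain Python. This is a textbook dynamic-programming DTW and not necessarily the implementation used in the project. It treats each station as 24 time steps with 4 values each and uses one shared warping path, which is one possible reading of the 4 x 24 reshaping. The helper names `z_normalize`, `to_profile` and `dtw_distance` are hypothetical.

```python
import math
from statistics import mean, pstdev


def z_normalize(series):
    """Scale one 24-hour series to zero mean and unit standard deviation."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:  # flat series: return zeros instead of dividing by zero
        return [0.0] * len(series)
    return [(x - mu) / sigma for x in series]


def to_profile(vector96):
    """Split a length-96 vector into four z-normalized 24-hour series,
    then regroup them as 24 time steps with 4 values each."""
    parts = [z_normalize(vector96[k * 24:(k + 1) * 24]) for k in range(4)]
    return list(zip(*parts))


def dtw_distance(a, b):
    """Classic DTW between two sequences of points of equal dimension."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Squared Euclidean distance between the two 4-value points.
            d = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1]))
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


# Made-up example: same daily shape, but station B is ten times busier
# and station C peaks one hour later.
day = [0] * 24
day[8], day[17] = 10, 6
late = [0] + day[:-1]
station_a = day * 4
station_b = [10 * x for x in station_a]
station_c = late * 4

print(dtw_distance(to_profile(station_a), to_profile(station_b)))
print(dtw_distance(to_profile(station_a), to_profile(station_c)))
```

In this toy case both distances are (numerically) zero: normalization removes the amplitude difference, and warping absorbs the one-hour shift that a point-by-point Euclidean comparison would penalize.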

Fig. 1. Diagram of calculating the similarity between 4-dimensional count time series of stations using the Dynamic Time Warping (DTW) algorithm

Clustering stations based on similarity of usage profile

K-means is one of the most popular clustering algorithms and runs faster than hierarchical clustering on a large dataset such as this one. However, it cannot separate clusters that are not linearly separable in input space. Thus, in this project, I adopted the Global Alignment Kernel (GAK) K-means algorithm, which casts DTW distances and similarities as positive definite kernels for time series and makes the clustering more efficient. I tested GAK K-means on the dataset with the number of clusters varying from 2 to 10. To pick an appropriate number of clusters, I computed and compared the mean Silhouette Coefficient, a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation), and identified 5 as the critical point, which makes it a good candidate for the number of clusters. In addition, to extract the general trend of each cluster for further interpretation, I computed the average of each set of sequences using the DTW Barycenter Averaging (DBA) method. Fig. 2 illustrates the four scaled 24-hour count time series of the 5 station clusters with highlighted average sequences.
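The scoring step used to compare cluster counts can be sketched as below. It computes the mean Silhouette Coefficient from a precomputed distance matrix (for example pairwise DTW distances) and a list of cluster labels. It covers only the scoring, not the GAK K-means clustering itself, and the function name `mean_silhouette` is hypothetical. In the workflow described above, it would be evaluated once for each candidate number of clusters from 2 to 10.

```python
def mean_silhouette(dist, labels):
    """Mean Silhouette Coefficient from a precomputed distance matrix.

    dist[i][j] is the distance between items i and j;
    labels[i] is the cluster of item i. Needs at least two clusters.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # singleton cluster: score is conventionally 0
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the own cluster (cohesion)
        a = sum(dist[i][j] for j in own) / len(own)
        # b: mean distance to the nearest other cluster (separation)
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Made-up example: four items on a line, forming two obvious groups.
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]

good = mean_silhouette(dist, [0, 0, 1, 1])
bad = mean_silhouette(dist, [0, 1, 0, 1])
print(good, bad)
```

Values close to 1 indicate compact, well-separated clusters, while negative values indicate items that sit closer to another cluster than to their own.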

Fig. 2. The result of count time series clustering using the Global Alignment Kernel (GAK) K-means algorithm. Each row represents one cluster of stations, including four sets of scaled 24-hour count time series with highlighted average sequences calculated by the DTW Barycenter Averaging (DBA) method. Colors of rows correspond to clusters of stations in Fig. 3.

Interpretation of clusters of stations

To interpret the meaning of the clusters, I first created a map of all bike sharing stations colored by cluster (Fig. 3). The size of each circle describes the station's average hourly count, which can be seen as a scaling factor. From a geographical perspective, stations in the same cluster, which have similar temporal profiles, clearly also cluster spatially. For example, the green cluster is located around the National Mall, while the orange cluster is mainly in the Downtown area.

The blue cluster, mainly located in suburban areas, is underutilized: its stations show almost zero counts on workdays and only a few on non-workdays (Fig. 2). When overlaying this cluster with a map of green space, I found that most of its stations are near large parks such as Rock Creek Park and Anacostia Park, so I named it “suburban parks”. The green cluster, on the other hand, is mainly located near large green spaces in central Washington D.C., such as the National Mall and West Potomac Park. Its stations also serve as a main destination, with many more arrivals than departures in mid-afternoon on both workdays and non-workdays (Fig. 2).

I also incorporated socio-economic data from the Smart Location Database to further interpret the meaning of the clusters (Table 1) and to understand the relationship between neighborhood types and mobility patterns. For instance, both the red and yellow clusters show high departure peaks on workday mornings, which led me to hypothesize that they are residential areas; their high population density and fewer jobs confirm this. The average hourly count of the “dense housing” cluster is higher than that of the “housing” cluster (Fig. 3), and stations of the “housing” cluster rarely serve as destinations on either workdays or non-workdays. These differences can be explained by differences in population density, employment density, retail and entertainment service density, public transportation accessibility, and closeness to downtown.

Table 1. Mean of each cluster with respect to population density (number of inhabitants per acre), employment density (number of jobs per acre), retail and entertainment service (number of related jobs in grocery stores, restaurants, etc. per acre), and public transportation accessibility (proportion of block group within 0.25 mile of transit stops).

| Cluster name | Inhabitants / acre | Jobs / acre | Retail and entertainment jobs / acre | Proportion of block group within 0.25 mile of transit stops |
|---|---|---|---|---|
| “Housing” | 27 | 12 | 2 | 13% |
| “Dense Housing” | 39 | 23 | 5 | 27% |
| “Suburban Parks” | 16 | 25 | 4 | 12% |
| “Central Green Mix-use” | 17 | 101 | 15 | 36% |
| “Downtown Mix-use” | 22 | 160 | 17 | 54% |
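The per-cluster means in Table 1 amount to a group-and-average over stations once each station carries its cluster label and the attributes of its surrounding block group. A minimal sketch follows; the field names (`cluster`, `pop_density`, `job_density`) and all values are made-up placeholders, not the project's data.

```python
from collections import defaultdict
from statistics import mean


def cluster_means(stations, fields):
    """Average each field over the stations of every cluster."""
    grouped = defaultdict(list)
    for station in stations:
        grouped[station["cluster"]].append(station)
    return {
        cluster: {f: mean(s[f] for s in members) for f in fields}
        for cluster, members in grouped.items()
    }


# Placeholder records, only to show the shape of the computation.
stations = [
    {"cluster": "a", "pop_density": 10, "job_density": 4},
    {"cluster": "a", "pop_density": 20, "job_density": 6},
    {"cluster": "b", "pop_density": 30, "job_density": 100},
]
summary = cluster_means(stations, ["pop_density", "job_density"])
print(summary)
```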
