Predicting the Popularity of Bicycle Sharing Stations: An Accessibility-Based Approach
By mattwigway
I presented a paper about modeling the popularity of bikesharing stations at the California Geographical Society 2014 annual meeting in Los Angeles. I calculated accessibility measures to jobs and residents using Census and OpenStreetMap data and the open-source OpenTripPlanner network analysis suite. I used those as independent variables in regressions to try to explain the popularity of bikesharing stations. I used bikeshare popularity data from Washington, DC, San Francisco, and Minneapolis–St. Paul. The main goal of the modeling is to build models of station popularity that can be transferred from one city to another, and thus used as planning tools for new bikeshare systems.
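To make the idea of an accessibility measure concrete, here is a minimal sketch of one common form, a cumulative count of the opportunities reachable within a travel-time threshold. The function name, the travel times, and the job counts are hypothetical illustrations; in the study the travel times came from OpenTripPlanner, which is not called here.

```python
def cumulative_accessibility(travel_minutes, opportunities, threshold=60):
    """Sum the opportunities at destinations reachable within `threshold` minutes.

    travel_minutes: travel time from the station to each destination zone
                    (None marks a zone that cannot be reached).
    opportunities:  count of jobs or residents in each destination zone.
    """
    return sum(
        count
        for minutes, count in zip(travel_minutes, opportunities)
        if minutes is not None and minutes <= threshold
    )


# Hypothetical travel times (minutes) from one station to four zones
travel_minutes = [12.0, 45.5, 61.0, None]
jobs = [1500, 4200, 900, 300]

jobs_within_60 = cumulative_accessibility(travel_minutes, jobs, threshold=60)
jobs_within_30 = cumulative_accessibility(travel_minutes, jobs, threshold=30)
print(jobs_within_60, jobs_within_30)  # 5700 1500
```

Repeating this for each station, opportunity type, travel mode, and threshold yields the set of candidate predictors.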
I initially tried linear regression, using best-subset selection to choose a subset of the accessibility measures as predictors; ultimately only one variable, the number of jobs within 60 minutes of the station by walking and transit, was used. I used a log-transformed response to control heteroskedasticity. This model predicted fairly well in Washington, DC, where it was fit (R^2 = 0.68), but it did not transfer well: the test R^2 was 0.31 in Minneapolis/St. Paul and -0.15 in San Francisco. A negative test R^2 means the model's predictions have a larger squared error than simply predicting the mean, so the model adds variability rather than explaining any. The residuals were spatially autocorrelated in all of these models, with Moran’s \(I \approx 0.5\).
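The two diagnostics used above can be sketched in a few lines. This is an illustrative implementation with made-up numbers, not the code used for the paper. It shows why an out-of-sample R^2 can be negative and how Moran's \(I\) is computed from residuals and a spatial weights matrix. The simple neighbor weights here are an assumption.

```python
def r_squared(observed, predicted):
    """1 - SS_res / SS_tot; negative when predictions are worse than the mean."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def morans_i(values, weights):
    """Moran's I for `values` given a square spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    return (n / w_sum) * cross / sum(d * d for d in dev)


# Hypothetical popularity values and predictions from a badly transferred model
observed = [1.0, 2.0, 3.0, 4.0]
bad_predictions = [4.0, 3.0, 2.0, 1.0]
print(r_squared(observed, bad_predictions))  # -3.0

# Four stations in a row; each is a neighbor of the adjacent station(s)
weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered_residuals = [1.0, 1.0, -1.0, -1.0]
print(morans_i(clustered_residuals, weights))  # about 0.33
```

Positive values of Moran's \(I\) indicate that nearby stations tend to have similar residuals, which suggests the model is missing some spatially structured effect.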
Next I tried random forests, which seemed like a good choice because they tend to perform well in situations with highly correlated variables, which is the situation here: all of the accessibility measures are strongly correlated. The random forest fit the Washington, DC data considerably better than the linear model did (R^2 = 0.84), but again transfer performance was rocky. Moran’s \(I\) of the residuals was reduced to the point of no longer being statistically significant in DC. When transferred, Moran’s \(I\) is lower than with the linear model in San Francisco, but higher in Minneapolis. Ultimately, I suspect that the random forest model is too flexible and is fitting the Washington, DC data too closely.
The models are also likely misspecified. They include accessibility only to jobs and residents, but bikeshare is used for many purposes other than going to work, and thus many more accessibility measures should determine the popularity of a station. However, additional accessibility measures are likely to be highly correlated with those already present, which inflates the variance of the coefficient estimates, shrinking their t-statistics and reducing their statistical significance.
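The effect of correlated predictors on coefficient variance can be stated precisely. This is the standard ordinary least squares result under the usual assumptions (including constant error variance), not a formula taken from the paper. Here \(\hat\beta_j\) is the estimated coefficient on predictor \(j\), \(\sigma^2\) is the error variance, \(x_{ij}\) is the value of predictor \(j\) at station \(i\), and \(R_j^2\) is the R^2 obtained by regressing predictor \(j\) on all of the other predictors:

\[
\operatorname{Var}(\hat\beta_j) = \frac{\sigma^2}{\left(1 - R_j^2\right) \sum_{i} \left(x_{ij} - \bar{x}_j\right)^2}
\]

As \(R_j^2\) approaches 1, which is what happens when a new accessibility measure is nearly a linear combination of those already in the model, the factor \(1 / (1 - R_j^2)\) grows without bound, and the t-statistic for \(\hat\beta_j\) shrinks accordingly.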
Based on all of this, it seems like we need to pursue models that are inflexible and work well with highly correlated predictors. Two that seem to fit the bill are ridge regression and principal components regression. Ridge regression works by shrinking coefficient estimates towards zero, introducing some bias but also reducing the variance. Principal components regression works by computing the first k principal components of the predictors and using them as predictors in a regression. The first principal component is the direction along which the data vary the most; each subsequent component is the direction of greatest remaining variation that is orthogonal to the earlier ones. With highly correlated variables, a small number of principal components can capture most of the variation in the data. Both of these methods represent decreases in flexibility over ordinary linear regression. Applying these types of models is a topic for future research.
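As a rough illustration of the two techniques, here is a minimal NumPy sketch on synthetic data. None of this comes from the paper: the data are simulated to mimic three strongly correlated accessibility measures, and the function names, the penalty `alpha`, and the choice of `k` are assumptions made for the example.

```python
import numpy as np


def standardize(X):
    """Center each column and scale it to unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def ridge_fit(X, y, alpha):
    """Ridge coefficients on standardized predictors (alpha=0 gives OLS)."""
    Xs = standardize(X)
    yc = y - y.mean()
    p = Xs.shape[1]
    # Solve (X'X + alpha * I) beta = X'y
    return np.linalg.solve(Xs.T @ Xs + alpha * np.eye(p), Xs.T @ yc)


def pcr_fit(X, y, k):
    """Regress the centered response on the first k principal component scores."""
    Xs = standardize(X)
    yc = y - y.mean()
    # Rows of Vt are the principal directions, ordered by decreasing variance
    _, singular_values, Vt = np.linalg.svd(Xs, full_matrices=False)
    scores = Xs @ Vt[:k].T
    gamma, *_ = np.linalg.lstsq(scores, yc, rcond=None)
    explained = (singular_values[:k] ** 2).sum() / (singular_values ** 2).sum()
    # Map component coefficients back to the standardized predictors
    return Vt[:k].T @ gamma, explained


# Synthetic data: three noisy copies of one underlying signal
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = np.column_stack([base + 0.05 * rng.normal(size=200) for _ in range(3)])
y = base + 0.5 * rng.normal(size=200)

ols_coef = ridge_fit(X, y, alpha=0.0)
ridge_coef = ridge_fit(X, y, alpha=10.0)
pcr_coef, explained = pcr_fit(X, y, k=1)

print("OLS coefficients:  ", ols_coef)
print("Ridge coefficients:", ridge_coef)
print("PCR coefficients:  ", pcr_coef)
print("Variance captured by first component:", explained)
```

With predictors this strongly correlated, the unpenalized coefficients tend to be unstable, while the ridge and principal components fits spread the effect more evenly across the three measures. In practice `alpha` and `k` would be chosen by cross-validation.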
Ultimately, the results of this study are mixed. There is a significant connection between accessibility and bikeshare station popularity. The models predict fairly well in Washington, DC, the city for which they were fit, but do not transfer well. For a model to be useful as a new-system planning tool, it needs to transfer not only in form but also in parameters. However, future research with additional accessibility measures and inflexible statistical techniques seems promising.
For a more in-depth treatment, see the full paper. The slides from the conference presentation are available as well. I would like to thank Kostas Goulias in the UCSB Department of Geography for his help with this project. I would also like to thank Eric Fischer for his assistance with San Francisco bikeshare data. Any errors that remain are, of course, mine.
Update (May 4, 2014): I uploaded a new copy of the paper with a few corrections:
- I added a note about OpenTripPlanner as beta software (p. 4)
- I added a note about the units of the accessibility variables (p. 8)
- I added a description of the coefficients of the fit model (p. 8)
- I added a footnote about sampling (p. 8f)
- I added a note about how the San Francisco linear model has been flattened (p. 13)
- I added legends to maps that were missing them (pp. 16-21)
Update (August 24, 2014): The San Francisco accessibility measures were incorrectly calculated using California State Plane Zone 5 instead of Zone 3 projection. It is not believed that this introduced a significant amount of error, in relation to other sources of error such as geographic aggregation by centroid or stochastic variation in methods. The paper has not been modified.