Accessibility Analysis with Python and OpenTripPlanner
OpenTripPlanner is a great bit of software for both customer-facing tools and analysis. Until recently, it had the capability to perform batch queries, calculating an origin-destination matrix or an aggregate measure of accessibility. Configuring this functionality, however, was somewhat awkward, as it used a verbose XML format that was more suited to allowing developers to configure application components than as a user-facing interface (and I say that having been one of the last defenders of this approach on the OTP mailing list).
This batch analysis tool was removed as a side effect of a restructuring and simplification of the OpenTripPlanner codebase that has been ongoing for several months. Its absence sparked a debate on the opentripplanner-dev mailing list, which broke down roughly into two camps: one arguing for something that is purely a configuration file, the other arguing for “configuration files” that are simply scripts of some sort (I argued for both camps at one point or another). To make a long story short, there are now tentative plans to rebuild Batch Analyst using Java Preferences as a configuration file format.
In parallel, development has been ongoing on a web-based analytics framework. This is a very useful (and just plain neat) tool for accessibility analysis in a package driven by a graphical user interface, which is probably exactly what the majority of those doing accessibility analysis need. However, coming from a research background (quantitative/computational geography), I often want tools that I can script, commit to a git repo, and integrate with other tools. That said, work on this graphical interface to Analyst has driven a rethinking of how analysis is done in OTP and the creation of many useful components.
In some personal projects, I needed to be able to run batch jobs again, and I decided to try to build a quick-and-dirty Python library to call the OTP analysis functions. (To be fair, integrating OTP and Python was originally proposed by Tuukka Hastrup in the aforementioned thread.) The result is here. It’s a Jython library that wraps the functionality of OTP’s analysis functions in a hacker-friendly package. I decided to take a simple approach and build a library that does one thing and does it well: create origin-destination matrices. What you build around that is up to you. If you want a simple cumulative accessibility measure, you can count the destinations whose travel times fall below a threshold. If you want a more complicated accessibility measure, with distance decay and such, you can implement some Python code to do that.
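For example, given an origin-destination matrix of travel times, both kinds of measures take only a few lines of Python. This is a pure-Python sketch with made-up numbers, not part of the library’s API:

```python
import math

# Travel times in minutes: rows are origins, columns are destinations.
# These numbers are invented for illustration.
times = [
    [5, 12, 45],
    [30, 8, 60],
]

def cumulative(row, threshold):
    """Count destinations reachable within `threshold` minutes."""
    return sum(1 for t in row if t <= threshold)

def decayed(row, beta=0.1):
    """Sum of destinations weighted by a negative-exponential decay."""
    return sum(math.exp(-beta * t) for t in row)

within30 = [cumulative(row, 30) for row in times]
weighted = [decayed(row) for row in times]
```

The decay parameter `beta` controls how steeply far-away destinations are discounted; any other decay function could be dropped in its place.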
The map above is the result of a demonstration of this project. It shows the walking time to the nearest grocery store from every Census block in Chicago. Here’s how I made it. First, I downloaded the binary distribution of OTP’s master (development) version from here. I grabbed OpenStreetMap data for Chicago from mapzen’s metro extracts site, and Census blocks and grocery store locations from the City of Chicago Data Portal. I built an OTP graph using the standard methods. I then edited the grocery stores file to have only latitude and longitude columns (because, internally, OTP seems to try to convert the other columns to integers for use as inputs to aggregators). I then ran this code to perform the analysis. It must be run in Jython as opposed to standard Python, the OTP jar must be on the Jython classpath, and the opentripplanner-jython module must be in Jython’s Python search path somewhere. I ran it like so:
CLASSPATH=~/opentripplanner/otp-latest-master.jar jython accessibility.py
I’ll walk you through what the code does. It loads the graph which was previously built (which it expects to find in the graph subdirectory of the working directory), loads the destinations, links them to the graph, creates a batch processor with the origins, and then evaluates that batch processor on the destinations. The result of the call to BatchProcessor.eval() is an origin-destination matrix, with origins on the rows and destinations on the columns. Unfortunately, numpy is not available in Jython, so data is returned using the opentripplanner.batch.Matrix class.
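Since numpy isn’t available, results come back in a small dense-matrix class. Conceptually it behaves like the pure-Python stand-in below (illustrative only; the real opentripplanner.batch.Matrix may use different method names):

```python
class Matrix(object):
    """Illustrative stand-in for an origins-by-destinations table of
    travel times, as returned by BatchProcessor.eval()."""
    def __init__(self, rows, cols, fill=-1):
        # fill marks unreachable origin-destination pairs
        self.rows, self.cols = rows, cols
        self._data = [[fill] * cols for _ in range(rows)]

    def set(self, i, j, value):
        self._data[i][j] = value

    def get(self, i, j):
        return self._data[i][j]

    def row(self, i):
        """Travel times from origin i to every destination."""
        return list(self._data[i])
```

For instance, `m = Matrix(2, 3); m.set(0, 1, 42)` records a 42-minute trip from the first origin to the second destination.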
This tool helps eliminate a lot of the repeated computation in classic Batch Analyst runs. You load the graph only once, for example, and you could link the destinations only once if you were running the batch processor multiple times, say with different mode sets. You could calculate travel times to multiple destination sets without re-running the batch processor, simply by calling eval() more than once. Remember that adding destinations, or calculating accessibility for additional sets of destinations, is cheap; you’re just sampling points in the graph. Adding origins is expensive: for each origin, OTP builds a shortest path tree.
Under the hood, it uses the new Analyst framework, which calculates the travel time from each origin to every vertex in the graph and stores it in a time surface, which we can then sample inexpensively.
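The toy Dijkstra sketch below (plain Python, not OTP code) shows why this trade-off works: building the shortest path tree for an origin is the expensive step, but once it exists, every destination lookup is a cheap dictionary access:

```python
import heapq

# A toy graph: vertex -> list of (neighbor, travel time in minutes)
graph = {
    'a': [('b', 4), ('c', 2)],
    'b': [('d', 5)],
    'c': [('b', 1), ('d', 8)],
    'd': [],
}

def shortest_path_tree(graph, origin):
    """One Dijkstra run: travel time from origin to every vertex."""
    dist = {origin: 0}
    queue = [(0, origin)]
    while queue:
        d, v = heapq.heappop(queue)
        if d > dist.get(v, float('inf')):
            continue  # stale queue entry
        for w, cost in graph[v]:
            nd = d + cost
            if nd < dist.get(w, float('inf')):
                dist[w] = nd
                heapq.heappush(queue, (nd, w))
    return dist

spt = shortest_path_tree(graph, 'a')        # expensive: once per origin
times = [spt[dest] for dest in ('b', 'd')]  # cheap: just lookups
```

Adding more destinations only adds lookups against `spt`; adding another origin means another full `shortest_path_tree` run.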
One caveat is that this library doesn’t yet support profile routing, although OTP does. Profile routing is a much better way of doing general accessibility analysis on public transportation, answering general questions (e.g., how long does it typically take to get to work?) rather than extremely specific ones (if I leave right now, exactly how long will it take me to get to work today?).
Using GeoTools with Multiple User Accounts
I have a situation where I have multiple GeoTools applications being run on a server by different users. I was having a lot of issues with exceptions about not being able to decode CRS codes, even though I was sure I had the gt-epsg-hsql module included in my JAR, and had set up Maven correctly to include the dependency.

It turns out the issue was that gt-epsg-hsql extracts its HSQL database of projections to a Geotools directory inside the system temporary directory, and if there are multiple GeoTools apps running as different users, the first one to start gets the directory, and the remaining ones crash because they don’t have permissions to access it.

The workaround is to use a separate temporary directory for each user. The easy way to do this is TMPDIR=`mktemp -d` application, which creates a new unique temporary directory each time the application is started.
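If the applications are launched from a wrapper script, the same fix can be applied programmatically. The sketch below builds a java command line with a fresh per-user temporary directory; the JAR name is a placeholder, and it sets both the TMPDIR environment variable and the JVM’s java.io.tmpdir property to be safe:

```python
import os
import tempfile

def private_tmp_command(jar):
    """Build a java command line whose temporary directory is a fresh,
    per-user directory, so gt-epsg-hsql extracts its database somewhere
    this user owns."""
    tmp = tempfile.mkdtemp(prefix='geotools-')
    env = dict(os.environ, TMPDIR=tmp)
    cmd = ['java', '-Djava.io.tmpdir=' + tmp, '-jar', jar]
    return cmd, env
```

Launch it with, e.g., `cmd, env = private_tmp_command('application.jar')` followed by `subprocess.call(cmd, env=env)`.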
Running RStudio Server on Amazon EC2
R is a great environment for statistical computing; I’ve used it in a number of projects. RStudio is hands-down the best IDE for R (there is even less debate here than there is about emacs being the best editor). Sometimes, though, I find that I need to run analyses that require more computing power than my personal computer can provide (especially since my desktop is currently in storage in California and I’m in Illinois with a circa-2007 netbook with 2GB of RAM).
Amazon EC2 is a great solution for this type of issue; it allows you to spin up powerful computers on-demand and access them over ssh. You don’t get a graphical interface, though, which precludes running RStudio Desktop. However, RStudio provides a server tool that allows you to run R on a server and access it through your browser. Configuring it on EC2, however, is a wee bit tricky because most people use public key authentication to access their instances (which is good for security), while RStudio assumes that you can log in with a password. One solution is to switch to password authentication for your entire instance, but it’s possible (and more secure) to keep the public key authentication. Here’s how to do it.
- Spin up an EC2 instance. I like to use the ones from Ubuntu Cloud.
- You can now log in using your key pair like so: ssh -i ~/.ssh/key.pem ubuntu@<aws-ip>
- Install a recent version of R from CRAN by following the directions at http://cran.r-project.org/bin/linux/ubuntu/ (assuming you’re on Ubuntu).
- Download RStudio Server and install it. You may need to install the .deb with dpkg -i rather than a graphical package manager, because X11 is not available.
- Add the following line to /etc/rstudio/rserver.conf. This forces RStudio to listen only for connections on localhost, so that a public key is still needed to access it:
  www-address=127.0.0.1
- Restart RStudio Server: sudo restart rstudio-server
- Create a password for your login account: sudo passwd your-user-name. You won’t be able to SSH in with this (assuming that you only allow public key auth), but you’ll use it to log into RStudio.
- On your local machine, run ssh -N -L localhost:8787:localhost:8787 -i ~/.ssh/aws2.pem ubuntu@<aws-ip>. This forwards the RStudio Server session securely to your computer using SSH tunneling. Note that any user of your local machine can now access RStudio Server, although they’ll need the password you created in step 7.
- Go to http://localhost:8787/ and log in with the password you just created.
- Analyze away.
Note that you are just accessing RStudio on AWS, so you’ll need to have all of your data and R scripts on the server.
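If you set up these tunnels often, the ssh invocation from the tunneling step is easy to wrap in a small helper. A sketch, where the host and key path are placeholders:

```python
def rstudio_tunnel(host, key, port=8787):
    """Build the ssh command that forwards a remote RStudio Server
    to localhost:<port>. -N: run no remote command; -L: local forward."""
    forward = 'localhost:%d:localhost:%d' % (port, port)
    return ['ssh', '-N', '-L', forward, '-i', key, 'ubuntu@' + host]

# To actually open the tunnel:
# subprocess.Popen(rstudio_tunnel('203.0.113.5', '~/.ssh/aws2.pem'))
```

After launching it, RStudio is reachable at http://localhost:8787/ as in the steps above.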
Predicting the Popularity of Bicycle Sharing Stations: An Accessibility-Based Approach
I presented a paper about modeling the popularity of bikesharing stations at the California Geographical Society 2014 annual meeting in Los Angeles. I calculated accessibility measures to jobs and residents using Census and OpenStreetMap data and the open-source OpenTripPlanner network analysis suite. I used those as independent variables in regressions to try to explain the popularity of bikesharing stations. I used bikeshare popularity data from Washington, DC, San Francisco, and Minneapolis—St. Paul. The main goal of the modeling is to build models of station popularity that can be transferred from one city to another, and thus used as planning tools for new bikeshare systems.
I initially tried linear regression, using best-subset selection to choose a subset of the accessibility measures as predictors; ultimately only one variable, the number of jobs within 60 minutes of the station by walking and transit, was used. I used a log-transformed response to control heteroskedasticity. This model predicted fairly well (\(R^2 = 0.68\)), but it doesn’t transfer well (test \(R^2 = 0.31\) in Minneapolis/St. Paul and \(-0.15\) in San Francisco; a negative \(R^2\) indicates the model predicts worse than simply using the mean). The residuals were spatially autocorrelated in all of these models, with Moran’s \(I \approx 0.5\).
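For a residual vector \(z\) (deviations from the mean) and spatial weight matrix \(w\), Moran’s \(I = \frac{n}{\sum_{ij} w_{ij}} \cdot \frac{\sum_{ij} w_{ij} z_i z_j}{\sum_i z_i^2}\). A minimal numpy sketch, with a toy weight matrix rather than the actual station weights:

```python
import numpy as np

def morans_i(values, weights):
    """Moran's I: spatial autocorrelation of `values` given a dense
    spatial weight matrix with a zero diagonal."""
    z = np.asarray(values, dtype=float)
    z = z - z.mean()
    w = np.asarray(weights, dtype=float)
    n = len(z)
    # z @ (w @ z) computes sum_ij w_ij * z_i * z_j
    return (n / w.sum()) * (z @ (w @ z)) / (z @ z)

# Toy example: two neighboring pairs with similar residuals
# produce strong positive autocorrelation.
w = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]])
resid = [2.0, 1.8, -2.1, -1.9]
```

Values near +1 (as here) mean neighboring residuals are alike, i.e., the model is systematically missing something spatial.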
Next I tried random forests, which seemed like a good choice because they tend to perform well with highly-correlated variables, which is the situation here: all of the accessibility measures are strongly correlated. The random forest fit the Washington, DC data considerably better than the linear model did (\(R^2 = 0.84\)), but again transfer performance was rocky. Moran’s \(I\) was reduced to statistical insignificance in DC; when transferred, it is lower than with the linear model in San Francisco, but higher in Minneapolis. Ultimately, I suspect that the random forest model is too flexible and fits the Washington, DC data too closely.
The models are also likely misspecified. They include accessibility only to jobs and residents, but bikeshare is used for many purposes other than commuting, so many more accessibility measures likely determine the popularity of a station. However, additional accessibility measures are likely to be highly correlated with those already present, which increases the variance of the coefficients and decreases their t-statistics and statistical significance.
Based on all of this, it seems like we need to pursue models that are inflexible and work well with highly-correlated predictors. Two that seem to fit the bill are ridge regression and principal components regression. Ridge regression works by shrinking coefficient estimates towards zero, introducing some bias but also reducing the variance. Principal components regression works by creating \(k\) principal components and using them as predictors in a regression. The first principal component is the direction along which the data vary the most; with highly-correlated variables, a small number of principal components can capture most of the variation in the data. Both methods are less flexible than ordinary linear regression. Applying these types of models is a topic for future research.
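Both techniques have simple closed forms, sketched below in plain numpy on synthetic correlated predictors. The data here is random, purely to show the mechanics; it is not the study’s data:

```python
import numpy as np

rng = np.random.RandomState(0)
base = rng.normal(size=(100, 1))
# Five near-collinear "accessibility" predictors
X = base + 0.05 * rng.normal(size=(100, 5))
y = base[:, 0] + 0.1 * rng.normal(size=100)

def ridge_fit(X, y, alpha):
    """Closed-form ridge: (X'X + alpha*I)^-1 X'y (no intercept, for brevity).
    Larger alpha shrinks the coefficients harder toward zero."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

def pcr_fit(X, y, k):
    """Principal components regression: project centered X onto its first
    k principal components, then ordinary least squares on the scores."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T
    coef, *_ = np.linalg.lstsq(scores, y - y.mean(), rcond=None)
    return coef

b_ridge = ridge_fit(X, y, alpha=10.0)  # 5 shrunken coefficients
b_pcr = pcr_fit(X, y, k=1)             # 1 component carries most variance
```

With predictors this correlated, one principal component captures nearly all the variation, which is exactly the situation the accessibility measures present.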
Ultimately, the results of this study are mixed. There is a significant connection between accessibility and bikeshare station popularity. The models predict fairly well in Washington, DC, the city for which they were fit, but do not transfer well. For a model to be useful as a new-system planning tool, it needs to transfer not only in form but also in parameters. However, future research with additional accessibility measures and inflexible statistical techniques seems promising.
For a more in-depth treatment, see the full paper. The slides from the conference presentation are available as well. I would like to thank Kostas Goulias in the UCSB Department of Geography for his help with this project. I would also like to thank Eric Fischer for his assistance with San Francisco bikeshare data. Any errors that remain are, of course, mine.
Update (May 4, 2014): I uploaded a new copy of the paper with a few corrections:
- I added a note about OpenTripPlanner as beta software (p. 4)
- I added a note about the units of the accessibility variables (p. 8)
- I added a description of the coefficients of the fit model (p. 8)
- I added a footnote about sampling (p. 8f)
- I added a note about how the San Francisco linear model has been flattened (p. 13)
- I added legends to maps that were missing them (pp. 16-21)
Update (August 24, 2014): The San Francisco accessibility measures were incorrectly calculated using California State Plane Zone 5 instead of Zone 3 projection. It is not believed that this introduced a significant amount of error, in relation to other sources of error such as geographic aggregation by centroid or stochastic variation in methods. The paper has not been modified.
Identifying Patterns in Bikeshare Trip Making
Bay Area Bike Share has recently released their trip history data; the data file contains the origin station, end station, time and date, and user type (day-pass or subscriber) for all the trips taken on the system since its inauguration on August 29, 2013. For another project, I had calculated accessibility measures for each bikeshare station in the Bay Area Bike Share system (using the beta OpenTripPlanner transportation analysis suite; see this post and the attached paper). I wondered if there were different types of trips represented by different accessibility footprints (e.g., trips that have high job accessibility at one end but not the other may be commute trips).
To examine this, for each trip I took a ratio of start and end accessibilities in several categories: jobs and residents within 10 minutes of the stations by walking, jobs and residents within 30 and 60 minutes of the stations by walking and transit, and other bikeshare stations within 30 minutes by cycling. So, a ratio of 4 for jobs within 10 minutes indicates that there are 4 times as many jobs within 10 minutes of the start station as of the end station. I took natural logs so that scores where the end station is the more accessible one have the same magnitude (e.g., a ratio of 4 and a ratio of ¼ have the same magnitude, with opposite signs, when log-transformed).
I then applied k-means clustering (James et al. 2013, 386) to the generated accessibility scores for each trip to see if there were distinct trip fingerprints. I tried several different values for the number of clusters k, and settled on 4. This corresponds to only two distinct categories of trips, because return trips of round trips have the inverse footprint. The results of the clustering can be seen in the visualization. Along with the accessibility ratios, the percentage of casual users and the percentage of trips in each cluster made on the weekend are shown as well (these variables were not used for the clustering). Note that, while the scale is presented in terms of the original ratio values, the values are still log-scaled. I used R for the clustering, and D3 for the visualization.
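The fingerprint construction and clustering can be sketched as follows, in plain Python/numpy rather than the R used for the actual analysis, with invented toy trips:

```python
import numpy as np

def log_ratio(start_access, end_access):
    """Symmetric trip fingerprint: log(start/end), so a ratio of 4 and a
    ratio of 1/4 have equal magnitude and opposite sign."""
    return np.log(np.asarray(start_access) / np.asarray(end_access))

def kmeans(X, k, iters=50, seed=0):
    """Minimal Lloyd's algorithm (a sketch, not R's kmeans)."""
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute means
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Two toy trips: a "commute" (jobs concentrated at one end) and its return,
# each described by two accessibility ratios.
trips = np.array([log_ratio([4.0, 1.0], [1.0, 2.0]),
                  log_ratio([1.0, 2.0], [4.0, 1.0])])
labels, centers = kmeans(trips, 2)
```

The return trip’s fingerprint is exactly the negation of the outbound one, which is why round trips land in mirror-image clusters.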
Clusters 1 and 2 appear to represent commute trips, as they have high (in magnitude) scores for jobs within 10 minutes, indicating that one station is in an area with far more jobs than the other station. Clusters 3 and 4 represent other types of trips, with higher scores on resident population. These may be trips from transit stations to residential areas, or trips from residential areas to shopping areas. One interesting thing is that these trips are taken more on the weekends and by casual users (those with one- or three-day passes). Previous research has shown that, in Washington, DC, there is a statistically significant difference between all pairs of morning, midday, afternoon and evening weekday and weekend bikeshare trips.
All of the clusters are dominated by one of the 10-minute categories. This makes sense; these variables have the highest variance, and thus drive the clustering. I chose not to standardize the variables, because they are unitless, and there actually is more variance in the ten-minute ratios. The reason is not hard to discern: the ten-minute ratios are based on the number of jobs in a smaller area, and thus can vary more quickly in space. One can think of each measure’s cutoff as a kernel width; the accessibility measures with higher cutoffs are effectively wider kernels.
This project is descriptive and has produced some interesting observations. Tripmaking patterns differ between subscribers and casual users, and also differ between the weekdays and the weekends. Work-driven trips are made more by subscribers and on weekdays (unsurprisingly). There are probably additional dimensions that would provide more detail, especially in clusters 3 and 4. In particular, I’d like to calculate accessibility to retail and transit stations (it has been shown in Washington, DC, that people often use bikeshare to access retail, restaurants, and transit stations).