Dijkstra's Algorithm Visualization in the Browser. . .
Dijkstra’s algorithm is a method for finding the shortest path through a graph with non-negative edge weights. It is a relatively efficient algorithm, and is guaranteed to find the shortest path (unlike some heuristic algorithms). It’s a fairly simple algorithm, but also one that can be difficult to wrap one’s head around. Descriptions abound on the Internet, and there are a number of videos, and even a few interactive demonstrations, but I thought there needed to be a demonstration that was interactive, worked in modern browsers without a plugin, used a non-trivial graph, and was open source. So I wrote one. A prose description of the algorithm is there; I hope it’s easier to understand with the interactive component. Visualizing algorithms tends to make them easier to understand, as observed by Mike Bostock.Permalink to this post
Accessibility Analysis with Python and OpenTripPlanner. . .
OpenTripPlanner is a great bit of software for both customer-facing tools and analysis. Until recently, it had the capability to perform batch queries, calculating an origin-destination matrix or an aggregate measure of accessibility. Configuring this functionality, however, was somewhat awkward, as it used a verbose XML format that was more suited to allowing developers to configure application components than as a user-facing interface (and I say that having been one of the last defenders of this approach on the OTP mailing list).
This batch analysis tool was removed as a side effect of a restructuring and simplification of the OpenTripPlanner codebase that has been ongoing for several months. Its absence sparked a debate on the opentripplanner-dev mailing list, which broke down roughly into two camps: one camp arguing for something that is purely a configuration file, with another camp arguing for “configuration files” that are simply scripts of some sort (I argued for both camps at one point or another). Where that conversation lies now, to make a long story short, is that there are tentative plans to rebuild Batch Analyst using Java Preferences as a configuration file format.
In parallel with this development, development has been ongoing on a web-based analytics framework. This is a very useful (and just plain neat) tool for accessibility analysis in a graphical user interface driven package. This is exactly what is needed for probably the majority of those doing accessibility analysis. However, coming from a research background (quantitative/computational geography), I often want tools that I can script, commit my methods to a git repo, and integrate with other tools. That said, work on this graphical interface to Analyst has driven a rethinking of how analysis is done in OTP and the creation of many useful components.
In some personal projects, I needed to be able to run batch jobs again, and I decided to try to build a quick and dirty Python library to call the OTP analysis functions. (To be fair, integrating OTP and Python was originally proposed by Tuukka Hastrup in the aforementioned thread). The result is here. It’s a Jython library that wraps up the functionality of OTP’s analysis functions in a hacker-friendly library. I decided to take a simple approach and build a library that does one thing and one thing well: creates origin-destination matrices. What you build around that is up to you. If you want a simple cumulative accessibility measure, you can sum the number of links that are below a threshold. If you want to use a more complicated accessibility measure, with distance decays and such, you can just implement some Python code to do that.
The map above is the result of a demonstration of this project. It shows the walking time to the nearest grocery store from every Census block in Chicago. Here’s how I made it. First, I downloaded the binary distribution of OTP’s master (development) version from here. I grabbed OpenStreetMap data for Chicago from mapzen’s metro extracts site, and Census blocks and grocery store locations from the City of Chicago Data Portal. I built an OTP graph using the standard methods. I then edited the grocery stores file to have only latitude and longitude columns (because, internally, OTP seems to try to convert the other columns to integers for use as inputs to aggregators). I then ran this code to perform the analysis. It must be run in Jython as opposed to standard Python, the OTP jar must be on the Jython classpath, and the opentripplanner-jython module must be in Jython’s Python search path somewhere. I ran it like so:
CLASSPATH=~/opentripplanner/otp-latest-master.jar jython -J-Xmx8192m accessibility.py
-J-Xmx8192m tells the Java Virtual Machine to use 8GB of RAM. If you don’t have that much, you can
experiment with smaller numbers.
I’ll walk you through what the code does. It loads the graph which was previously built (which it expects to find in the graph subdirectory of the working directory), loads the destinations, links them to the graph, creates a batch processor with the origins, and then evaluates that batch processor on the destinations. The result of the call to BatchProcessor.eval() is an origin-destination matrix, with origins on the rows and destinations on the columns. Unfortunately, numpy is not available in Jython, so data is returned using the opentripplanner.batch.Matrix class.
This tool helps eliminate a lot of the repeated computation in classic batch analyst runs. You load the graph only once, for example, and you could link the destinations only once if you were running the batch processor multiple times, say with different mode sets. You could calculate travel times to multiple destination sets without re-running the batch processor, but by simply calling eval() more than once. Remember that adding additional destinations, or calculating accessibility for additional sets of destinations, is cheap; you’re just sampling points in the graph. Adding additional origins is expensive: for each origin, OTP builds a shortest path tree.
Under the hood, it uses the new Analyst framework, which calculates the travel time from each origin to every vertex in the graph and stores it in a time surface, which we can then sample inexpensively.
One caveat is that this library doesn’t yet support profile routing, although OTP does. Profile routing is a much better way of doing general accessibility analysis for general queries for public transportation (e.g. how long does it take to get to work) versus extremely specific queries (if I leave right now, how long exactly will it take me to get to work today, right now).
Update 2014-12-31: I added notes about memory consumption.Permalink to this post
Using GeoTools with Multiple User Accounts. . .
I have a situation where I have multiple GeoTools applications being run on a server by different users. I was having a lot of issues with exceptions about not being able to decode CRS codes, even though I was sure I had the
gt-epsg-hsql file included in my JAR, and had set up Maven correctly to include the dependency.
It turns out the issue was that the
gt-epsg-hsql extracts its hsql database of projections to
Geotools in the system temporary directory, and if there are multiple geotools apps running as different users, the first one to start gets the directory, and the remaining ones crash because they don’t have permissions to access it.
The workaround is to use separate temporary directories for each user. The easy way to do this is
TMPDIR=`mktemp -d` application, which creates a new unique temporary directory each time an application is started.
Running RStudio Server on Amazon EC2. . .
R is a great environment for statistical computing; I’ve used it in a number of projects. RStudio is hands-down the best IDE for R (there is even less debate here than there is about emacs being the best editor). Sometimes, though, I find that I need to run analyses that require more computing power than my personal computer can provide (especially since my desktop is currently in storage in California and I’m in Illinois with a circa-2007 netbook with 2GB of RAM).
Amazon EC2 is a great solution for this type of issue; it allows you to spin up powerful computers on-demand and access them over ssh. You don’t get a graphical interface, though, which precludes running RStudio Desktop. However, RStudio provides a server tool that allows you to run R on a server and access it through your browser. Configuring it on EC2, however, is a wee bit tricky because most people use public key authentication to access their instances (which is good for security), while RStudio assumes that you can log in with a password. One solution is to switch to password authentication for your entire instance, but it’s possible (and more secure) to keep the public key authentication. Here’s how to do it.
- Spin up an EC2 instance. I like to use the ones from Ubuntu Cloud.
- You can now log in using your key pair like so:
ssh -i ~/.ssh/key.pem ubuntu@<aws-ip>
- Install a recent version of R from CRAN by following the directions at http://cran.r-project.org/bin/linux/ubuntu/ (assuming you’re on Ubuntu).
- Download RStudio Server and install it. You may need to replace
dpkg -ibecause X11 is not available.
- Add the following line to
/etc/rstudio/rserver.conf. This forces RStudio to listen only for connections on localhost, so that a public key is still needed to access it. www-address=127.0.0.1
- Restart RStudio Server:
sudo restart rstudio-server
- Create a password for your login account:
sudo passwd your-user-name. You won’t be able to SSH in with this (assuming that you only allow public key auth), but you’ll use it to log into RStudio.
- On your local machine, run
ssh -N -L localhost:8787:localhost:8787 -i ~/.ssh/aws2.pem ubuntu@<aws-ip>. This forwards the RStudio Server session securely to your computer using SSH tunneling. Note that any user of your local machine can now access RStudio Server, although they’ll need the password you created in step 7.
- Go to http://localhost:8787/ and log in with the password you just created.
- Analyze away.
Note that you are just accessing RStudio on AWS, so you’ll need to have all of your data and R scripts on the server.Permalink to this post
Predicting the Popularity of Bicycle Sharing Stations: An Accessibility-Based Approach. . .
I presented a paper about modeling the popularity of bikesharing stations at the California Geographical Society 2014 annual meeting in Los Angeles. I calculated accessibility measures to jobs and residents using Census and OpenStreetMap data and the open-source OpenTripPlanner network analysis suite. I used those as independent variables in regressions to try to explain the popularity of bikesharing stations. I used bikeshare popularity data from Washington, DC, San Francisco, and Minneapolis—St. Paul. The main goal of the modeling is to build models of station popularity that can be transferred from one city to another, and thus used as planning tools for new bikeshare systems.
I initially tried linear regression, using best-subset selection to choose a subset of the accessibility measures as predictors; ultimately only one variable, the number of jobs within 60 minutes of the station by walking and transit, was used. I used a log-transformed response to control heteroskedasticity. This model predicted fairly well (R2 = 0.68), but it doesn’t transfer well (test R2 = 0.31 in Minneapolis/St. Paul and -0.15 in San Francisco, indicating that the model produces more variability rather than explaining any). The residuals were spatially autocorrelated in all of these models, with Moran’s \(I \approx 0.5\).
Next I tried random forests, which seemed like a good choice because they tend to perform well in situations with highly-correlated variables, which is the situation we have—all of the accessibility measures are strongly correlated. The random forest fit the Washington, DC data considerably better than the linear model did (R2 = 0.84), but again transfer performance was rocky. I has been reduced to being not statistically significant in DC. When transferred, I is lower than with the linear model in San Francisco, but higher in Minneapolis. Ultimately, I suspect that the random forest model is too flexible and is fitting the Washington, DC data too closely.
The models are also likely misspecified. They include accessibility only to jobs and residents, but bikeshare is used for many purposes other than going to work, and thus many more accessibility measures should determine the popularity of a station. However, additional accessibility measures are likely to be highly correlated with those already present, which increases the variance of the coefficients and decreases their t-statistics and statistical significances.
Based on all of this, it seems like we need to pursue models that are inflexible and work well with highly-correlated predictors. Two that seem to fit the bill are ridge regression and principal components regression. Ridge regression works by shrinking coefficient estimates towards zero, introducing some bias but also reducing the variance. Principal components regression works by creating k principal components and using them as predictors in a regression. A principal component is the vector along which the data vary the most. With highly-correlated variables, a small number of principal components can capture most of the variation in the data. Both of these methods represent decreases in flexibility over ordinary linear regression. Applying these types of models is a topic for future research.
Ultimately, the results of this study are mixed. There is a significant connection between accessibility and bikeshare station popularity. The models predict fairly well in Washington, DC, the city for which they were fit, but do not transfer well. For a model to be useful as a new-system planning tool, it needs to transfer not only in form but also in parameters. However, future research with additional accessibility measures and inflexible statistical techniques seems promising.
For a more in-depth treatment, see the full paper. The slides from the conference presentation are available as well. I would like to thank Kostas Goulias in the UCSB Department of Geography for his help with this project. I would also like to thank Eric Fischer for his assistance with San Francisco bikeshare data. Any errors that remain are, of course, mine.
Update (May 4, 2014): I uploaded a new copy of the paper with a few corrections:
- I added a note about OpenTripPlanner as beta software (p. 4)
- I added a note about the units of the accessibility variables (p. 8)
- I added a description of the coefficients of the fit model (p. 8)
- I added a footnote about sampling (p. 8f)
- I added a note about how the San Francisco linear model has been flattened (p. 13)
- I added legends to maps that were missing them (pp. 16-21)
Update (August 24, 2014): The San Francisco accessibility measures were incorrectly calculated using California State Plane Zone 5 instead of Zone 3 projection. It is not believed that this introduced a significant amount of error, in relation to other sources of error such as geographic aggregation by centroid or stochastic variation in methods. The paper has not been modified.Permalink to this post