My adopted city of Washington, DC, is shoveling out from under some of the last snow of the season. At the same
time, after years of debate, we’ve just enacted a law that levies fines on residents who don’t
clear their sidewalks.
There’s a very good reason to fine property owners for not shoveling their sidewalks. Snow shoveling
is what’s known as a multi-player prisoner’s dilemma.
If everyone shovels, everyone is better off (because they can get around more easily, don’t face the danger of falls, don’t
have to pay extra cleaning bills due to muddy/salty pant cuffs and skirt hems, &c.). However, any one person is better off if they
don’t shovel, because they’ve saved time (I actually like shoveling my sidewalk, but let’s assume that the utility for
shoveling is uniformly negative). Each individual isn’t affected much by the state of their own sidewalk, but rather by
the state of everyone else’s.
We can represent this issue graphically, as Thomas Schelling does in chapter 7 of Micromotives and
Macrobehavior. For simplicity’s sake, let’s assume that the only
benefit to shoveled sidewalks is a quicker walk to wherever you need to go. Let’s also assume that it takes five minutes
to shovel your sidewalk, and that you can save twenty minutes in transportation time if every sidewalk is shoveled, as
opposed to none. Finally, let’s assume that if half of the sidewalks are shoveled, you save half as much time: ten minutes.
The graph below shows how much time is saved by people who do not shovel (red line), and those who do shovel, as a function
of how many people choose to shovel. The time is the amount of time saved commuting, minus the time spent shoveling. As you
can see, at every point there is an incentive not to shovel, and instead take advantage of your neighbors' shoveled walks. However,
if everyone does that, you reach a situation where no one’s walks are shoveled, and no one is saving any time. If everyone’s
walks are shoveled, everyone saves time despite having to spend time shoveling (the right end of the blue line). Since the
second situation is clearly the preferable one, we need to have some system to encourage people to shovel. This could
take many forms: a general sense of community standards, public shaming of those who do not shovel,
or a regulation with fines attached.
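The two payoff curves under the toy numbers above are easy to compute. This is a minimal sketch of the linear model (my own illustration, not code from Schelling):

```python
SHOVEL_COST = 5    # minutes spent shoveling your own walk
MAX_SAVING = 20    # minutes saved if every sidewalk is shoveled

def payoff_nonshoveler(p):
    """Time saved by someone who does not shovel, when a fraction p
    of people shovel (commute saving assumed linear in p)."""
    return MAX_SAVING * p

def payoff_shoveler(p):
    """Time saved by a shoveler: the same commute saving, minus the
    fixed cost of shoveling."""
    return MAX_SAVING * p - SHOVEL_COST

# At every participation level, not shoveling beats shoveling by
# exactly the shoveling cost -- yet universal shoveling still beats
# universal defection:
for p in (0.0, 0.5, 1.0):
    print(p, payoff_nonshoveler(p), payoff_shoveler(p))
```

The gap between the two curves is constant (the shoveling cost), which is why the individual incentive never flips on its own.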
We can carry the economic theory further, and say that the amount of the fine has to be greater than the distance between
the two lines, but that’s not really a meaningful thing to do. For one thing, the distance between the lines is in units
of time, while the fine is in units of money, two things that convert in wildly different ways for different people. Also,
we must remember that the numbers we used were completely arbitrary and furthermore not uniform across the population.
Schelling also points out that there is a minimum viable coalition size, a minimum number of people who have to cooperate
to shovel their sidewalks in order to realize a net gain. This is the point where the payoff curve for shoveling reaches
the payoff that everyone would get if no one shoveled. In the figure above, once roughly 20% of people shovel,
those who choose to shovel are better off than they would have been had no one chosen to shovel. This isn’t really a meaningful number here for
several reasons. As mentioned before, the assumption of uniform arbitrary payoffs is invalid. Additionally, you are not
equally affected by every shoveled sidewalk. A geographically dispersed coalition may be no better than no coalition, if
no member of the coalition walks on the shoveled sidewalks of any other member. A geographically concentrated coalition may
be viable even at small numbers, because the sidewalks being shoveled are the sidewalks most relevant to the members.
(Consider a commercial strip where the businesses cooperate to shovel their stretch of road. Even though the coalition is
small, its members benefit because customers can walk up and down their commercial strip).
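With the linear model and the toy numbers from earlier, the minimum viable coalition is where the shovelers' payoff climbs back to zero, the payoff everyone gets when no one shovels. A quick check, under those same arbitrary assumptions:

```python
SHOVEL_COST = 5    # minutes to shovel (arbitrary, as before)
MAX_SAVING = 20    # minutes saved when everyone shovels

# Shovelers' payoff MAX_SAVING * p - SHOVEL_COST reaches zero at:
min_viable_fraction = SHOVEL_COST / MAX_SAVING
print(min_viable_fraction)  # 0.25 with these toy numbers
```

In other words, the break-even fraction is simply the ratio of the shoveling cost to the maximum saving; with different (and non-uniform) real-world payoffs, it would shift accordingly.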
Thus, fines for not shoveling are a way of encouraging a socially desirable outcome, when individual motivations
would result in a socially undesirable outcome.
Dijkstra’s algorithm is a
method for finding the shortest path through a graph
with non-negative edge weights. It is a relatively efficient algorithm, and is guaranteed to
find the shortest path (unlike some heuristic algorithms). It’s a fairly simple algorithm,
but also one that can be difficult to wrap one’s head around. Descriptions abound on the Internet,
and there are a number of videos, and even a few interactive demonstrations, but I thought there
needed to be a demonstration that was interactive, worked in modern browsers without a plugin,
used a non-trivial graph, and was open source. So I wrote one.
A prose description of the algorithm accompanies the demonstration; I hope it’s easier to understand with the interactive component.
Visualizing algorithms tends to make them easier to understand, as observed by Mike Bostock.
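For reference, the algorithm itself fits in a few lines. Here is a standard priority-queue implementation in Python (my own sketch, not the code behind the demonstration):

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from source in a graph given as
    {node: {neighbor: weight, ...}} with non-negative edge weights."""
    dist = {source: 0}
    pq = [(0, source)]                 # (distance, node) priority queue
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue                   # stale queue entry; node already settled
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd           # relax the edge u -> v
                heapq.heappush(pq, (nd, v))
    return dist

graph = {
    "a": {"b": 1, "c": 4},
    "b": {"c": 2, "d": 6},
    "c": {"d": 3},
    "d": {},
}
print(dijkstra(graph, "a"))  # {'a': 0, 'b': 1, 'c': 3, 'd': 6}
```

The non-negativity requirement matters: once a node is popped from the queue with its final distance, no later path can improve on it, which is exactly the guarantee that negative edges would break.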
OpenTripPlanner is a great bit of software for
both customer-facing tools and
analysis. Until recently, it had the
capability to perform batch queries, calculating an origin-destination matrix or an aggregate measure of
accessibility. Configuring this functionality, however, was somewhat
awkward, as it used a verbose XML format that was more suited to allowing developers to configure application
components than as a user-facing interface (and I say that having been one of the last defenders of this approach
on the OTP mailing list).
This batch analysis tool was removed as a side effect of a restructuring and simplification of the OpenTripPlanner
codebase that has been ongoing for several months. Its absence sparked
a debate on the opentripplanner-dev mailing list,
which broke down roughly into two camps: one arguing for something that is purely a configuration file,
and another arguing for “configuration files” that are simply scripts of some sort (I argued for both camps at
one point or another). Where that conversation lies now, to make a long story short, is that there are tentative
plans to rebuild Batch Analyst using Java Preferences as a configuration file format.
In parallel, development has been ongoing on a
web-based analytics framework. This is a very useful (and just plain
neat) tool for accessibility analysis in a package driven by a graphical user interface, which is exactly what
most people doing accessibility analysis need. However, coming from a research background
(quantitative/computational geography), I often want tools that I can script, whose methods I can commit to a git repo,
and that I can integrate with other tools. That said, work on this graphical interface to Analyst has driven a rethinking
of how analysis is done in OTP and the creation of many useful components.
In some personal projects, I needed to be able to run batch jobs again, and I decided to try to build a quick and dirty
Python library to call the OTP analysis functions. (To be fair, integrating OTP and Python was originally proposed by
Tuukka Hastrup in the aforementioned thread). The result is
here. It’s a Jython library
that wraps OTP’s analysis functionality in a hacker-friendly interface. I decided to take a simple
approach and build a library that does one thing and does it well: create origin-destination matrices. What you
build around that is up to you. If you want a simple cumulative accessibility measure, you can count the number of
destinations with travel times below a threshold. If you want to use a more complicated accessibility measure, with distance decays and such,
you can just implement some Python code to do that.
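Either measure is only a few lines of plain Python once you have travel times in hand. The list below stands in for one row of an origin-destination matrix (it is illustrative data, not output from the library's Matrix class):

```python
import math

# Hypothetical travel times (minutes) from one origin to each destination.
travel_times = [4.0, 11.0, 27.0, 35.0, 52.0]

# Cumulative measure: count destinations reachable within a threshold.
cumulative = sum(1 for t in travel_times if t <= 30)

# Gravity-style measure with a negative-exponential distance decay.
BETA = 0.05  # decay parameter; in real work this would be calibrated
gravity = sum(math.exp(-BETA * t) for t in travel_times)

print(cumulative, gravity)
```

Swapping in a different decay function, or weighting destinations by size (e.g. number of jobs), is just a matter of editing the comprehension.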
The map above is the result of a demonstration of this project. It shows the walking time to the nearest grocery store
from every Census block in Chicago. Here’s how I made it. First, I downloaded the binary
distribution of OTP’s master (development) version from here. I grabbed
OpenStreetMap data for Chicago from mapzen’s metro extracts site, and Census blocks
and grocery store locations from the City of Chicago Data Portal. I built an OTP
graph using the standard methods. I then edited the grocery stores file to have only latitude and longitude columns
(because, internally, OTP seems to try to convert the other columns to integers for use as inputs to aggregators).
I then ran this code to perform the analysis. It must be run
in Jython as opposed to standard Python, the OTP jar must be on the Jython classpath, and the opentripplanner-jython module
must be in Jython’s Python search path somewhere. I ran it like so:
The -J-Xmx8192m tells the Java Virtual Machine to use 8GB of RAM. If you don’t have that much, you can
experiment with smaller numbers.
I’ll walk you through what the code does. It loads the graph which was previously built (which it expects to find
in the graph subdirectory of the working directory), loads the destinations, links them to the graph, creates a batch
processor with the origins, and then evaluates that batch processor on the destinations. The result of the call to
BatchProcessor.eval() is an origin-destination matrix, with origins on the rows and destinations on the columns. Unfortunately,
numpy is not available in Jython, so data is returned using the opentripplanner.batch.Matrix class.
This tool helps eliminate a lot of the repeated computation in classic batch analyst runs. You load the graph only once,
for example, and you could link the destinations only once if you were running the batch processor multiple times,
say with different mode sets. You could calculate travel times to multiple destination sets without re-running the batch
processor, simply by calling eval() more than once. Remember that adding additional destinations, or calculating
accessibility for additional sets of destinations, is cheap; you’re just sampling points in the graph. Adding additional
origins is expensive: for each origin, OTP builds a shortest path tree.
Under the hood, it uses the new Analyst framework, which calculates the travel time from each origin to every vertex
in the graph and stores it in a time surface, which we can then sample inexpensively.
One caveat is that this library doesn’t yet support
profile routing, although OTP does. Profile routing
is a much better way of doing general accessibility analysis on public transportation, answering general queries (e.g. how long does it typically take to get to work?)
rather than extremely specific ones (if I leave right now, exactly how long will it take me to get to work today?).
Update 2014-12-31: I added notes about memory consumption.
I have a situation where I have multiple GeoTools applications being run on a server by different users. I was having a lot of issues with exceptions about not being able to decode CRS codes, even though I was sure I had the gt-epsg-hsql module included in my JAR and had set up Maven correctly to include the dependency.
It turns out the issue was that gt-epsg-hsql extracts its HSQL database of projections into the system temporary directory, and if there are multiple GeoTools apps running as different users, the first one to start claims the directory, and the remaining ones crash because they don’t have permission to access it.
The workaround is to use separate temporary directories for each user. The easy way to do this is TMPDIR=`mktemp -d` application, which creates a new unique temporary directory each time an application is started.
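If you launch the applications from a wrapper script rather than directly from a shell, the same workaround looks like this. Note that the JVM reads the java.io.tmpdir system property rather than TMPDIR, so it helps to set both; app.jar is a placeholder for your application:

```python
import os
import tempfile

def isolated_launch_env(prefix="geotools-"):
    """Build a fresh per-process temporary directory and an environment
    pointing at it, so each user's app extracts the EPSG database into
    its own directory instead of fighting over the shared system one."""
    tmpdir = tempfile.mkdtemp(prefix=prefix)
    env = dict(os.environ, TMPDIR=tmpdir)
    # The JVM itself honors java.io.tmpdir, so pass it explicitly too:
    java_cmd = ["java", "-Djava.io.tmpdir=" + tmpdir, "-jar", "app.jar"]
    return tmpdir, env, java_cmd
```

You would then start the application with something like subprocess.call(java_cmd, env=env), giving every launch its own temporary directory, just as the mktemp one-liner does.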
R is a great environment for statistical computing; I’ve used it in
a number of projects. RStudio is hands-down
the best IDE for R (there is even less debate here than there is about emacs being the best editor). Sometimes, though,
I find that I need to run analyses that require more computing power than my personal computer can provide (especially
since my desktop is currently in storage in California and I’m in Illinois with a circa-2007 netbook with 2GB of RAM).
Amazon EC2 is a great solution for this type of issue; it allows you to spin up powerful
computers on-demand and access them over ssh. You don’t get a graphical interface, though, which precludes running RStudio Desktop.
However, RStudio provides a server tool that allows you to run R on a server and access it through your browser. Configuring
it on EC2, however, is a wee bit tricky because most people use public key authentication to access their instances
(which is good for security), while RStudio assumes that you can log in with a password. One solution is to switch to
password authentication for your entire instance, but it’s possible (and more secure) to keep the public key authentication.
Here’s how to do it.
Spin up an EC2 instance. I like to use the ones from Ubuntu Cloud.
You can now log in using your key pair like so: ssh -i ~/.ssh/key.pem ubuntu@<aws-ip>
Install a recent version of R from CRAN by following the directions at http://cran.r-project.org/bin/linux/ubuntu/ (assuming you’re on Ubuntu).
Create a password for your login account: sudo passwd your-user-name. You won’t be able to SSH in with this (assuming that you only allow public key auth), but you’ll use it to
log into RStudio.
On your local machine, run ssh -N -L localhost:8787:localhost:8787 -i ~/.ssh/aws2.pem ubuntu@<aws-ip>. This forwards the RStudio Server session securely to your computer using SSH tunneling.
Note that any user of your local machine can now access RStudio Server, although they’ll need the password you created above.