R is a great environment for statistical computing; I’ve used it in a number of projects. RStudio is hands-down the best IDE for R (there is even less debate here than there is about emacs being the best editor). Sometimes, though, I find that I need to run analyses that require more computing power than my personal computer can provide (especially since my desktop is currently in storage in California and I’m in Illinois with a circa-2007 netbook with 2GB of RAM).
Amazon EC2 is a great solution for this type of issue; it allows you to spin up powerful computers on-demand and access them over ssh. You don’t get a graphical interface, though, which precludes running RStudio Desktop. However, RStudio provides a server tool that allows you to run R on a server and access it through your browser. Configuring it on EC2, however, is a wee bit tricky because most people use public key authentication to access their instances (which is good for security), while RStudio assumes that you can log in with a password. One solution is to switch to password authentication for your entire instance, but it’s possible (and more secure) to keep the public key authentication. Here’s how to do it.
1. SSH into your instance: ssh -i ~/.ssh/key.pem ubuntu@<aws-ip>
2. Download the RStudio Server package and install it with sudo dpkg -i; you want the server edition rather than the desktop one, because X11 is not available on the instance.
3. Add www-address=127.0.0.1 to /etc/rstudio/rserver.conf. This forces RStudio to listen only for connections on localhost, so that a public key is still needed to access it.
4. Restart the server: sudo restart rstudio-server
5. Set a password for your user with sudo passwd your-user-name. You won't be able to SSH in with this (assuming that you only allow public key auth), but you'll use it to log into RStudio.
6. On your local machine, run ssh -N -L localhost:8787:localhost:8787 -i ~/.ssh/aws2.pem ubuntu@<aws-ip>, then browse to http://localhost:8787. This forwards the RStudio Server session securely to your computer using SSH tunneling. Note that any user of your local machine can now access RStudio Server, although they'll need the password you created in the previous step.
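If you do this often, the tunnel options can live in your SSH client configuration instead of being retyped each time. A minimal sketch (the host alias is made up; substitute your own IP and key path):

```
# ~/.ssh/config -- hypothetical entry; adjust HostName, User, and key
Host rstudio-aws
    HostName <aws-ip>
    User ubuntu
    IdentityFile ~/.ssh/aws2.pem
    # forward local port 8787 to RStudio Server on the instance
    LocalForward 8787 localhost:8787
```

With this in place, ssh -N rstudio-aws establishes the same tunnel.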
Note that you are just accessing RStudio on AWS, so you’ll need to have all of your data and R scripts on the server.
I presented a paper about modeling the popularity of bikesharing stations at the California Geographical Society 2014 annual meeting in Los Angeles. I calculated accessibility measures to jobs and residents using Census and OpenStreetMap data and the open-source OpenTripPlanner network analysis suite. I used those as independent variables in regressions to try to explain the popularity of bikesharing stations. I used bikeshare popularity data from Washington, DC, San Francisco, and Minneapolis—St. Paul. The main goal of the modeling is to build models of station popularity that can be transferred from one city to another, and thus used as planning tools for new bikeshare systems.
I initially tried linear regression, using best-subset selection to choose a subset of the accessibility measures as predictors; ultimately only one variable, the number of jobs within 60 minutes of the station by walking and transit, was used. I used a log-transformed response to control heteroskedasticity. This model predicted fairly well (R² = 0.68), but it doesn’t transfer well (test R² = 0.31 in Minneapolis/St. Paul and -0.15 in San Francisco; a negative value indicates the model predicts worse than simply using the mean of the observed data). The residuals were spatially autocorrelated in all of these models, with Moran’s \(I \approx 0.5\).
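To make the negative test R² concrete: the statistic compares a model's squared error to that of simply predicting the mean, so a badly-transferred model can score below zero. A minimal sketch (in Python rather than the R used for the paper; the toy numbers are made up):

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.
    Negative when the model predicts worse than the mean of y_true."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

# Hypothetical example of a model that transfers badly
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([4.0, 1.0, 4.0, 1.0])   # predictions in a new city
print(r2_score(y_true, y_true))           # perfect fit -> 1.0
print(r2_score(y_true, y_pred))           # worse than the mean -> negative
```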
Next I tried random forests, which seemed like a good choice because they tend to perform well in situations with highly-correlated variables, which is the situation we have: all of the accessibility measures are strongly correlated. The random forest fit the Washington, DC data considerably better than the linear model did (R² = 0.84), but again transfer performance was rocky. Moran’s I was reduced to statistical insignificance in DC; when the model was transferred, Moran’s I was lower than with the linear model in San Francisco, but higher in Minneapolis. Ultimately, I suspect that the random forest model is too flexible and is fitting the Washington, DC data too closely.
The models are also likely misspecified. They include accessibility only to jobs and residents, but bikeshare is used for many purposes other than going to work, and thus many more accessibility measures should determine the popularity of a station. However, additional accessibility measures are likely to be highly correlated with those already present, which increases the variance of the coefficients and decreases their t-statistics and statistical significances.
Based on all of this, it seems like we need to pursue models that are inflexible and work well with highly-correlated predictors. Two that seem to fit the bill are ridge regression and principal components regression. Ridge regression works by shrinking coefficient estimates towards zero, introducing some bias but also reducing the variance. Principal components regression works by creating k principal components and using them as predictors in a regression. The first principal component is the direction along which the data vary the most; each subsequent component captures the most remaining variance while being orthogonal to those before it. With highly-correlated variables, a small number of principal components can capture most of the variation in the data. Both of these methods represent decreases in flexibility over ordinary linear regression. Applying these types of models is a topic for future research.
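As a sketch of how the two candidate techniques work (pure NumPy, with made-up toy data standing in for the correlated accessibility measures; this is not the paper's analysis):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two nearly-collinear predictors, like accessibility measures
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)          # strongly correlated with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(size=n)

# Ridge regression: the L2 penalty lam shrinks coefficients toward zero
lam = 1.0
p = X.shape[1]
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Principal components regression: regress y on the first k components
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 1
Z = Xc @ Vt[:k].T                            # scores on the first component
gamma = np.linalg.lstsq(Z, y - y.mean(), rcond=None)[0]

print(beta_ridge)   # two stable, moderate coefficients
print(gamma)        # a single coefficient on the dominant direction
```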
Ultimately, the results of this study are mixed. There is a significant connection between accessibility and bikeshare station popularity. The models predict fairly well in Washington, DC, the city for which they were fit, but do not transfer well. For a model to be useful as a new-system planning tool, it needs to transfer not only in form but also in parameters. However, future research with additional accessibility measures and inflexible statistical techniques seems promising.
For a more in-depth treatment, see the full paper. The slides from the conference presentation are available as well. I would like to thank Kostas Goulias in the UCSB Department of Geography for his help with this project. I would also like to thank Eric Fischer for his assistance with San Francisco bikeshare data. Any errors that remain are, of course, mine.
Update (May 4, 2014): I uploaded a new copy of the paper with a few corrections:
Bay Area Bike Share has recently released their trip history data; the data file contains the origin station, end station, time and date, and user type (day-pass or subscriber) for all the trips taken on the system since its inauguration on August 29, 2013. For another project, I had calculated accessibility measures for each bikeshare station in the Bay Area Bike Share system (using the beta OpenTripPlanner transportation analysis suite; see this post and the attached paper). I wondered if there were different types of trips represented by different accessibility footprints (e.g., trips that have high job accessibility at one end but not the other may be commute trips).
To examine this, for each trip I took a ratio of start and end accessibilities in several categories: jobs and residents within 10 minutes of the stations by walking, jobs and residents within 30 and 60 minutes of the stations by walking and transit, and other bikeshare stations within 30 minutes by cycling. So, a ratio of 4 for jobs within 10 minutes indicates that there are 4 times as many jobs within 10 minutes of the start station as of the end station. I took natural logs so that scores for trips whose end stations are more accessible have the same magnitude as their mirror images (e.g., ratios of 4 and ¼ have the same magnitude but opposite signs when log-transformed).
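The effect of the log transform can be checked in a couple of lines (the numbers are illustrative only):

```python
import math

# Ratio of start- to end-station job accessibility for a trip and its return
jobs_start, jobs_end = 400, 100
ratio_out = jobs_start / jobs_end        # 4.0: the start is job-rich
ratio_back = jobs_end / jobs_start       # 0.25: the return trip

# Raw ratios are asymmetric (4 vs 0.25); their logs are mirror images
score_out = math.log(ratio_out)
score_back = math.log(ratio_back)
print(score_out, score_back)             # equal magnitude, opposite sign
```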
I then applied k-means clustering (James et al. 2013, 386) to the generated accessibility scores for each trip to see if there were distinct trip fingerprints. I tried several different values for the number of clusters k, and settled on 4. This corresponds to only two distinct categories of trips, because the return leg of a round trip has the inverse footprint of the outbound leg. The results of the clustering can be seen in the visualization. Along with the accessibility ratios, the percentage of casual users and the percentage of trips in each cluster made on the weekend are shown as well (these variables were not used for the clustering). Note that, while the scale is presented in terms of the original ratio values, the values are still log-scaled. I used R for the clustering, and D3 for the visualization.
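The clustering itself was done in R; as a sketch of what k-means does with such mirrored log-ratio scores, here is a minimal Lloyd's-algorithm implementation on made-up data:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal Lloyd's algorithm: assign each point to its nearest
    centroid, then move each centroid to the mean of its points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Made-up log-ratio scores: a 'commute-like' blob and its mirror image,
# standing in for outbound trips and their returns
rng = np.random.default_rng(1)
outbound = rng.normal(loc=[1.4, -0.5], scale=0.1, size=(50, 2))
X = np.vstack([outbound, -outbound])   # return trips: the inverse footprint
labels, centers = kmeans(X, k=2)
```

The two recovered centroids are exact negations of each other, mirroring the outbound/return symmetry described above.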
Clusters 1 and 2 appear to represent commute trips, as they have high (in magnitude) scores for jobs within 10 minutes, indicating that one station is in an area with far more jobs than the other. Clusters 3 and 4 represent other types of trips, with higher scores on resident population. These may be trips from transit stations to residential areas, or trips from residential areas to shopping areas. Interestingly, these trips are taken more on the weekends and by casual users (those with one- or three-day passes). Previous research has shown that, in Washington, DC, there is a statistically significant difference between all pairs of morning, midday, afternoon and evening weekday and weekend bikeshare trips.
All of the clusters are dominated by one of the 10-minute categories. This makes sense; these variables have the highest variance, and thus drive the clustering. I chose to not standardize the variables, because they are unitless, and there actually is more variance in the ten-minute ratios. The reason is not hard to discern: the ten-minute ratios are based on the number of jobs in a smaller area, and thus can vary more quickly in space. One can think of the numbers as a kernel function; the accessibility measures with higher cutoffs are effectively wider kernels.
This project is descriptive and has produced some interesting observations. Tripmaking patterns differ between subscribers and casual users, and also differ between the weekdays and the weekends. Work-driven trips are made more by subscribers and on weekdays (unsurprisingly). There are probably additional dimensions that would provide more detail, especially in clusters 3 and 4. In particular, I’d like to calculate accessibility to retail and transit stations (it has been shown in Washington, DC, that people often use bikeshare to access retail, restaurants, and transit stations).
I like board games, and one of my favorites is Pandemic. The game consists of a board (pictured above) with a world map on it, with various cities highlighted, and a network between the cities. Disease breaks out randomly in the cities at the start of the game (using the shuffled infection deck) and then progresses using the same deck. Players cooperatively attempt to quell disease by moving between cities and treating disease. On each turn, players draw city cards; by collecting five of a particular color, they can cure a disease. Additional cards are drawn each turn from the infection deck to infect additional cities. Periodically, there are ‘epidemics’ in which the cards for the cities that have already been drawn are returned to the top of the infection deck. If a city is infected three times without being treated, and there is an additional infection, an ‘outbreak’ occurs and all of the cities connected to that city are infected.
The network is a major component of gameplay, so it seemed like network theory would be able to shed some light on a strategy for the game. I digitized the network from the game board using Gephi. I then calculated the eigenvector centrality and degree for each city using NetworkX.
Both degree and eigenvector centrality are measures of centrality, that is, how central a node is in the network. Degree is the simpler of the two; it is the number of connections (edges) each city (node) has. For example, Santiago is connected to only one city (Lima), so it has degree 1. Chicago is connected to five other cities (San Francisco, Los Angeles, Mexico City, Atlanta, and Montréal), so it has degree 5. The more other places a city is connected to, the theory goes, the more important it is.
Eigenvector centrality is a bit more complicated, but not much. As explained by Wikipedia, the centrality of each node is proportional to the sum of the centralities of the nodes around it. As it happens, the vector of all node centralities is the principal eigenvector of the network’s adjacency matrix, hence the name. This measure of centrality takes into account not only the number of connections of a city, but the number of connections of each of the cities it is connected to, and so on.
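Both measures can be sketched in a few lines of NumPy on a small made-up graph (this is not the Pandemic board itself; the power iteration converges to the principal eigenvector of the adjacency matrix):

```python
import numpy as np

# Adjacency matrix for a hypothetical five-city graph:
# edges 0-1, 0-2, 0-3, 1-2, 3-4 (city 0 is the hub, city 4 a dead end)
A = np.array([
    [0, 1, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

degree = A.sum(axis=1)              # number of connections per city

# Eigenvector centrality by power iteration: repeatedly replace each
# node's score with the sum of its neighbours' scores, then rescale
x = np.ones(len(A))
for _ in range(200):
    x = A @ x
    x = x / np.linalg.norm(x)

print(degree)   # the hub (city 0) has the highest degree
print(x)        # and also the highest eigenvector centrality
```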
Degree and eigenvector centrality are both theoretically applicable to different parts of gameplay. Degree is most important for preventing outbreaks. Except in rare double outbreaks (when an outbreak in one city causes an outbreak in a connected city), the severity of an outbreak is defined by the degree of the city in which it occurs. If there have been three infections in a city with a high degree, the players would be wise to treat that city ASAP.
Eigenvector centrality is more useful for building research stations. Throughout the game, the players can build research stations, which have multiple uses. The most important from a graph-theoretic standpoint is that players can move from research station to research station as if there were an edge between them. Thus, cities with research stations are much more accessible to players. If research stations are built in cities with high eigenvector centrality, the number of cities that can be reached will be maximized (i.e., one could go to the research station, and then to an adjacent city, and then to another adjacent city; the possibilities are maximized with research stations in cities with high eigenvector centrality). There are caveats, of course; Bangkok and Hong Kong both have high eigenvector centrality, but it probably wouldn’t make sense to build research stations in both cities as they are adjacent to each other.
The real question is whether this is useful for gameplay. Unfortunately I haven’t played the game since making these calculations, but it initially seems that the centrality measures confirm what most players had already figured out: building research stations and treating disease is most important in the most-connected cities.
While most players don’t think about (let alone calculate) eigenvector centrality during gameplay, they probably have thought about the degree of each city (if not by that name). As it turns out, degree and eigenvector centrality are fairly correlated (see scatterplot at right, made with R; correlation coefficient 0.58), so simply looking at degree gives one a fairly good picture of the centrality of a city.
Realistically, these measures of centrality don’t determine the absolute best strategy. Games tend to play out in a relatively small subset of the cities on the board, because each time there is an epidemic the cities already infected are placed back on the top of the deck to be infected again. Cities near the bottom of the deck rarely if ever come up. If there are no infections in Asia, it is likely not worth the effort to build research stations there despite the high centrality of many of the Asian cities. Building research stations is also constrained by the cards each player has and the need to balance research station construction with other tasks such as treating disease.
One interesting pattern in the centralities is that Asian cities have very high centralities, while cities in the global South are much less central. This suggests that disease can spread much more rapidly in the Asian cities (although this is somewhat counterbalanced by increased ability to reach the Asian cities from each other). Gameplay is very different when focused on the Asian cities than when focused on the global South. I have noticed this in gameplay (infections in the South don’t seem to cause as much trouble as other infections, though this is admittedly anecdotal). The centralities provide some theoretical justification for this observation.
One further analysis that could be undertaken would be to treat all research-station-to-research-station links as additional edges in the network, and determine which combinations of cities reduce the average shortest path distance between all pairs of nodes.
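A sketch of that analysis on a tiny hypothetical network (plain Python; breadth-first search gives shortest paths in an unweighted graph):

```python
from itertools import combinations

# Hypothetical mini-network: adjacency lists for six cities (0-5)
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3, 5], 5: [4]}

def avg_shortest_path(g):
    """Mean breadth-first-search distance over all unordered node pairs."""
    total = pairs = 0
    for src in g:
        dist = {src: 0}
        queue = [src]
        for node in queue:                  # queue grows during iteration
            for nb in g[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        for dst, d in dist.items():
            if dst > src:                   # count each pair once
                total += d
                pairs += 1
    return total / pairs

def with_edge(g, a, b):
    """Copy of g with an extra a-b edge (a research-station shortcut)."""
    g2 = {n: list(nbrs) for n, nbrs in g.items()}
    g2[a].append(b)
    g2[b].append(a)
    return g2

base = avg_shortest_path(graph)
# Which pair of 'research stations' shrinks the average distance the most?
best = min(combinations(graph, 2),
           key=lambda ab: avg_shortest_path(with_edge(graph, *ab)))
```

In this toy graph the best pair joins the two peripheral cities (0 and 5), the intuition being that shortcuts between far-apart, well-connected regions help the most.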
And finally, the centralities:
City               Eigenvector centrality   Degree
Ho Chi Minh City   0.252                    4
Correlation matrices show up often in papers and anywhere data is being analyzed. They are useful because they succinctly summarize the observed relationships between a set of variables; this also makes them very good for exploratory data analysis.
However, correlation matrices by themselves are still a bit difficult to interpret, as they are simply numbers. For example, here is the output of the R cor() function. There’s a lot of useful information there, but picking out the patterns takes effort.
           x1          x2          x3          x4         x5
x1  0.00000000  0.03297151  0.85017673 -0.69401590  0.5354154
x2  0.03297151  0.00000000  0.01985976 -0.02100622  0.1290689
x3  0.85017673  0.01985976  0.00000000 -0.61088013  0.5123067
x4 -0.69401590 -0.02100622 -0.61088013  0.00000000 -0.5308175
x5  0.53541535  0.12906890  0.51230666 -0.53081745  0.0000000
This data can also be displayed visually, in a color-coded matrix. Here is exactly the same data, displayed in visual form:
In particular, this improves on the numeric matrix with respect to Tufte’s 6th and 7th principles of data graphics: encouraging visual comparisons and “reveal[ing] the data at several levels of detail” (page 13). It is much easier to compare the correlations of different variables visually than by doing mental arithmetic on the numbers in the correlation matrix. The numeric matrix also presents the data at only a single level of detail. The visual display, on the other hand, uses colors to show the general patterns in the data, while still including the numbers to display the specific relationships.
This idea can be executed in many different data analysis environments, but I use R: a short function that, when called on a data frame, creates and displays the plot and returns the underlying correlation matrix.
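The original R function isn’t reproduced here, but the idea can be sketched in Python with NumPy and Matplotlib (the function name, styling, and toy data are my own assumptions, not the author’s code):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")            # render off-screen; incidental to the idea
import matplotlib.pyplot as plt

def cor_plot(data, names):
    """Show a correlation matrix as a color-coded grid with the numeric
    values overlaid, combining overview (color) and detail (numbers)."""
    r = np.corrcoef(data, rowvar=False)
    fig, ax = plt.subplots()
    ax.imshow(r, cmap="RdBu", vmin=-1, vmax=1)   # color encodes sign/strength
    ax.set_xticks(range(len(names)), names)
    ax.set_yticks(range(len(names)), names)
    for i in range(len(names)):
        for j in range(len(names)):
            ax.text(j, i, f"{r[i, j]:.2f}", ha="center", va="center")
    return r

# Hypothetical data: five mutually correlated columns
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
data = np.hstack([base + 0.5 * rng.normal(size=(100, 1)) for _ in range(5)])
r = cor_plot(data, [f"x{i}" for i in range(1, 6)])
```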
CC-BY-NC 4.0 by Matthew Wigginton Conway, 2011-2014. Created with Jekyll and Bootstrap.