Overanalyzing Board Games: Network Analysis and Pandemic

. . .

by mattwigway on March 26, 2014

The Pandemic board, 2nd ed. Copyright © 2012 Z-Man Games

I like board games, and one of my favorites is Pandemic. The game consists of a board (pictured above) with a world map on it, with various cities highlighted, and a network between the cities. Disease breaks out randomly in the cities at the start of the game (using the shuffled infection deck) and then progresses using the same deck. Players cooperatively attempt to quell disease by moving between cities and treating disease. On each turn, players draw city cards; by collecting five of a particular color, they can cure a disease. Additional cards are drawn each turn from the infection deck to infect additional cities. Periodically, there are ‘epidemics’ in which the cards for the cities that have already been drawn are returned to the top of the infection deck. If a city is infected three times without being treated, and there is an additional infection, an ‘outbreak’ occurs and all of the cities connected to that city are infected.

The network is a major component of gameplay, so it seemed like network theory would be able to shed some light on a strategy for the game. I digitized the network from the game board using Gephi. I then calculated the Eigenvector centrality and degree for each city using NetworkX.

Both degree and Eigenvector centrality are measures of centrality, that is how central a node is in the network. Degree is the simpler of the two; it is the number of connections (edges) each city (node) has. For example, Santiago is connected to only one city (Lima), so it has degree 1. Chicago is connected to five other cities (San Francisco, Los Angeles, Mexico City, Atlanta, and Montréal), so it has degree 5. The more other places a city is connected to, the theory goes, the more important it is.

Eigenvector centrality is a bit more complicated, but not much. As explained by Wikipedia, the centrality of each node is the scaled sum of the centralities of the nodes around it. As it happens, this is also the eigenvector of the adjacency matrix, hence the name. This measure of centrality takes into account not only the number of connections of a city, but the number of connections of each of the cities it is connected to, and so on.

Degree and eigenvector centrality are both theoretically applicable to different parts of gameplay. Degree is most important for preventing outbreaks. Except in rare double outbreaks (when an outbreak in one city causes an outbreak in a connected city), the severity of an outbreak is defined by the degree of the city in which it occurs. If there have been three infections in a city with a high degree, the players would be wise to treat that city ASAP.

Eigenvector centrality is more useful for building research stations. Throughout the game, the players can build research stations, which have multiple uses. The most important from a graph-theoretic standpoint is that players can move from research station to research station as if there were an edge between them. Thus, cities with research stations are much more accessible to players. If research stations are built in cities with high eigenvector centrality, the number of cities that can be reached will be maximized (i.e., one could go to the research station, and then to an adjacent city, and then to another adjacent city; the possibilities are maximized with research stations in cities with high eigenvector centrality). There are caveats, of course; Bangkok and Hong Kong both have high eigenvector centrality, but it probably wouldn’t make sense to build research stations in both cities as they are adjacent to each other.

The two centrality measures are correlated

The real question is whether this is useful for gameplay. Unfortunately I haven’t played the game since I’ve made these calculations, but it initially seems that the centrality measures confirm what most players had already figured out: building research stations and treating disease is most important in the most-connected cities.

While most players don’t think about (let alone calculate) eigenvector centrality during gameplay, they probably have thought about the degree of each city (if not by that name). As it turns out, degree and eigenvector centrality are fairly correlated (see scatterplot at right, made with R; correlation coefficient 0.58), so simply looking at degree gives one a fairly good picture of the centrality of a city.

Realistically, these measures of centrality don’t determine the absolute best strategy. Games tend to played out in a relatively small subset of the cities on the board, because each time there is an epidemic the cities already infected are placed back on the top of the deck to be infected again. Cities near the bottom of the deck rarely if ever come up. If there are no infections in Asia, it is likely not worth the effort to build research stations there despite the high centrality of many of the Asian cities. Building research stations is constrained by the cards each player has and the need to balance research station construction with other tasks such as treating disease.

One interesting pattern in the centralities is that Asian cities have very high centralities, while cities in the global South are much less central. This suggests that disease can spread much more rapidly in the Asian cities (although this is somewhat counterbalanced by increased ability to reach the Asian cities from each other). Gameplay is very different when focused on the Asian cities than when focused on the global South. I have noticed this in gameplay (infections in the South don’t seem to cause as much trouble as other infections, though this is admittedly anecdotal). The centralities provide some theoretical justification for this observation.

One further analysis that could be undertaken would be to treat all research-station-to-research-station links as additional edges in the network, and determine which combinations of cities reduce the average shortest path distance between all pairs of nodes.

And finally, the centralities:

CityEigenvector centralityDegree
Hong Kong0.3276
Ho Chi Minh City0.2524
San Francisco0.1144
Los Angeles0.0884
St. Petersburg0.0743
Mexico City0.0655
New York0.0554
São Paulo0.0404
Buenos Aires0.0202
Permalink to this post

Visual Correlation Matrices

. . .

by mattwigway on February 22, 2014

Correlation matrices show up often in papers and anywhere data is being analyzed. They are useful because they succinctly summarize the observed relationships between a set of variables; this also makes them very good for exploratory data analysis.

However, correlation matrices by themselves are still a bit difficult to interpret, as they are simply numbers. For example, here is the output of the R cor() function. There’s a lot of useful information there, but it’s still a bit difficult to interpret.

            x1          x2          x3          x4         x5
x1  0.00000000  0.03297151  0.85017673 -0.69401590  0.5354154
x2  0.03297151  0.00000000  0.01985976 -0.02100622  0.1290689
x3  0.85017673  0.01985976  0.00000000 -0.61088013  0.5123067
x4 -0.69401590 -0.02100622 -0.61088013  0.00000000 -0.5308175
x5  0.53541535  0.12906890  0.51230666 -0.53081745  0.0000000

This data can also be displayed visually, in a color-coded matrix. Here is exactly the same data, displayed in visual form:

color coded correlation matrix

In particular, this improves on Tufte’s 6th and 7th principles of data graphics: encouraging visual comparisons and “reveal[ing] the data at several levels of detail” (page 13). It is much easier to compare the correlations of different variables visually than by doing mental arithmetic to compare the numbers in the correlation matrix. The correlation matrix also presents the data only at a high level of specificity. The visual display, on the other hand, uses colors to display the general patterns in the data, while still having the numbers to diplay the specific relationships.

This idea can be executed in many different data analysis environments, but I use R. The R code used to produce the above plot follows. Calling the function corplot on a data frame will create and display the plot, and return the correlation matrix.

Permalink to this post

Schelling's Segregation Model in JavaScript

. . .

by mattwigway on December 28, 2013

Schelling’s segregation model is an interesting model of neighborhood dynamics developed by the economist Thomas C. Schelling. It’s an agent based model, in which agents of two groups (which could be based on income, political affiliation, race, &c.) are placed on a grid. There is some threshold for what percentage of an agent’s neighbors must be of the same group for it to be happy. For instance, agents might want 30% of their neighbors to be of the same group. If they are not happy, they move. This continues until all agents are happy.

What’s interesting about the model is that relatively low thresholds for individuals (e.g. 30%) end up leading to extreme segregation in the aggregate (Indeed, Schelling’s book is called Micromotives and Macrobehavior). This segregation can be easily seen in the above screenshot: the tolerance is set at 30%, but on average 77% of each agents neighbors are of the same group.

Browser-based interactive implementations of the model have been around for a while, but they all seem to require Java. This is often an obstacle, so I implemented an interactive version using JavaScript and D3. It should run with no plugins in Firefox, Chrome, Safari and possibly recent versions of Internet Explorer. My implementation is here.

A full description of the model can be found in Schelling’s book Micromotives and Macrobehavior, on pages 147—155 (in the 1978 edition anyhow).

Permalink to this post

Analyzing the Effects of Space and Time on Bikeshare Use

. . .

by mattwigway on December 14, 2013

Bikeshare systems have been taking off in the US of late. One of the first of these systems, Capital Bikeshare in Washington, DC, has been in operation since 2010. The automated bikeshare stations generate a wealth of information; the start and end stations and times of each trip are recorded, and are available to the public in anonymous form. This project used the approximately 4.5 million trips taken on the system from the fourth quarter of 2010 until the second quarter of 2013.

I was interested in how space and time affect the usage patterns of bikeshare systems. This data allows one to test the patterns statistically.

It is generally acknowledged that patterns of bikeshare use differ at different times of day. This makes sense; for instance, commuters may ride downtown in the morning and back to a metro station in the afternoon. To test this, eight time periods were defined: morning (6a–9a), midday (9a–3p), afternoon (3p–7p) and overnight (7p–9p) for both weekdays and weekends. These time periods match those used in the Metropolitan Washington Council of Governments travel model and add weekends. Each trip was assigned to one of these time periods, and origin-destination matrices were computed for each time period. The labels were then scrambled (preserving the number of trips in each time period, as well as the origin-destination matrix of the entire dataset). Origin-destination matrices for each randomized time period were then recomputed. Pairwise comparisons of time periods were then computed for both the observed and simulated data.

There is a statistically significant difference between every time period and every other time period. That is, the patterns of bikeshare use differ at different times of day and on the weekends. This is a driver of rebalancing: the system operator must move the bikes to meet differing demands throughout the day. The other driver of rebalancing would be if there is a general trend for the bikes to move to a certain area regardless of time period; this study did not address this question but it could be addressed with the data used.

Some stations are, of course, more popular than others. As it turns out, the popularities of the stations are spatially autocorrelated—-that is, nearby stations tend to have similar popularities. Moran’s I value is 0.78 (p < 0.05). This is not surprising; one can hypothesize several reasons for this finding. The most obvious is that there are certain areas that are more popular than others (for instance, downtown stations are probably more popular than stations in lower-density residential areas). Also, bikeshare trips require both a start and end station; stations that are popular likely have many trips to nearby stations, making those stations popular as well.

Further research could include looking more into the patterns of use by time period, attempting to determine the general flow of bikes at different times of day. One team has developed statistical models to inform rebalancing, however, they modeled each station individually as the trip-level data used in this project is not available where they were working in Chicago. This origin-destination matrices could potentially improve this type of model.

This research was undertaken in Dr. Stuart Sweeney’s Geography 172, Spatial Data Analysis, class in the Department of Geography at UCSB. For a more complete treatment of the project, see the full report.

Permalink to this post

Microaccessibility with OpenTripPlanner

. . .

by mattwigway on June 20, 2013

Analysis of accessibility is generally undertaken in large regions, such as metropolitan areas or entire countries. Frequently it also uses macro temporal scales, as in before-and-after analysis. This analysis instead looks at micro scales, both spatial and temporal. The study area is the University of California, Santa Barbara campus and the adjoining student community of Isla Vista.

I analyzed accessibility at every hour of a typical week, so that accessibility can be compared at different times of day and on different days. This has been done before, looking at accessibility at different times of day (page 8) in the Los Angeles area. I used tighter temporal scales (one hour instead of four chunks) and also analyzed accessibility over the entire week to allow the discernment of weekly cycles.

Only accessibility to eateries was analyzed. Data were obtained from OpenStreetMap for network data and from the UCSB Interactive Campus Map for data on eatery locations. Animations of accessibility over a typical week follow; in the darker blue areas more eateries are accessibile within five minutes' travel time. Five minutes was chosen as the cutoff because it is half of the walking time between the intersection of Pardall and Embarcadero Del Norte and the front of the University Center, two areas where many eateries are concentrated. A more systematic study would need to estimate this from travel data. Acessibility was analyzed for both walking and cycling.

Accessibility to eateries at different times of day by walking.

Accessibility to eateries at different times of day by bicycling

The two animated maps show the accessibility to eateries at different times of day by different modes. The bicycle map shows much more accessibility because with a bicycle one can reach many more opportunities in 5 minutes' time. A daily cycle can easily be determined, with most (but not all) businesses closing in the late evening and opening again in the morning, creating a pulsing accessibility. The eateries on campus (the eastern portion of the maps) do not have the same span of service as the eateries in Isla Vista. On the weekends, most of the campus eateries are not open at all.

There are a few limitations. OpenTripPlanner’s cycling mode currently does not support bicycle parking; at UCSB, there are many bicycle parking areas where one must park before going to one’s building. At a micro scale of analysis, correctness of the network is also very important because small absolute errors can be large relative to the total length of the trip; OpenStreetMap data was improved for this project but is still not perfect, especially given construction on campus.

Further research would use behavioral data to better estimate parameters for the accessibility measure, as well as to interpret the results. Sara Matthews analyzed mode choice in trips to Humboldt State University in the context of residential location. Accessibility could be used as a independent variable in a similar analysis of mode choice.

Even in the context of comprehensive transportation models such as SimAGENT (Southern California Association of Governments) and SF-CHAMP (San Francisco County Transportation Authority), accessibility measures rendered as maps such as these are valuable. They are understandable and thus can easily be presented to non-technical decisionmakers and to the public. They also generally have more of a descriptive rather than projective role; that is, they describe current situations rather than predicting future ones. Finally, they can play a role in individual decision support; Jarrett Walker has noted the usefulness of isochrones for decision support, and these accessibility measures can play the same role. Walk Score® has recently announced understandable accessibility maps; this makes these types of measures much more available.

For a more in-depth treatment, see the full report.

Special thanks to Dr. Konstadinos Goulias and Jae Lee in the GeoTrans lab at UCSB, and to Bryan Karaffa in the UCSB Department of Geography. Map data © OpenStreetMap contributors. Eatery data © UCSB Interactive Campus Map. These maps and analyses are the result of a research project and should not be used for decision support without additional consultation.
Permalink to this post

CC-BY-NC 4.0 by Matthew Wigginton Conway, 2011-2014. Created with Jekyll and Bootstrap.