No word yet from WMATA, but I did end up writing a script to grab NextBus’s routeconfig data (download here). Then I tried to match each NextBus-defined stop with the closest one in the GTFS dataset. Some stats*:
- NextBus’s dataset tracks 711 unique stop IDs. GTFS has 10,380.
- Using this function to measure distance, the average distance between matched stops is 164 feet. The smallest is 11 feet. The largest is 9/10ths of a kilometer (about 2,950 feet).
- 218 NextBus stops wound up sharing a GTFS stop with at least one other NextBus stop.
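For anyone who wants to try the same matching, here is a minimal sketch of the nearest-stop idea. It is not the script linked above: the haversine formula is my assumption for the distance function, the names `haversine_ft`, `match_stops` and `summarize` are made up for illustration, and the inputs are assumed to be plain dicts of stop ID to `(lat, lon)`. The "shared" count uses one reading of that stat: NextBus stops whose closest GTFS stop is also the closest for some other NextBus stop.

```python
import math
from collections import Counter

# Mean Earth radius (6,371 km) converted to feet.
EARTH_RADIUS_FT = 20_902_231


def haversine_ft(lat1, lon1, lat2, lon2):
    """Great-circle distance in feet between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_FT * math.asin(math.sqrt(a))


def match_stops(nextbus_stops, gtfs_stops):
    """Map each NextBus stop ID to (closest GTFS stop ID, distance in feet).

    Both arguments are dicts of stop_id -> (lat, lon).
    Brute force: every NextBus stop is compared with every GTFS stop.
    """
    matches = {}
    for nb_id, (nb_lat, nb_lon) in nextbus_stops.items():
        best_id, best_dist = None, float("inf")
        for g_id, (g_lat, g_lon) in gtfs_stops.items():
            d = haversine_ft(nb_lat, nb_lon, g_lat, g_lon)
            if d < best_dist:
                best_id, best_dist = g_id, d
        matches[nb_id] = (best_id, best_dist)
    return matches


def summarize(matches):
    """Mean/min/max match distance, plus how many NextBus stops share a match."""
    dists = [d for _, d in matches.values()]
    counts = Counter(g_id for g_id, _ in matches.values())
    shared = sum(1 for g_id, _ in matches.values() if counts[g_id] > 1)
    return {
        "mean_ft": sum(dists) / len(dists),
        "min_ft": min(dists),
        "max_ft": max(dists),
        "shared": shared,
    }
```

With 711 NextBus stops and 10,380 GTFS stops, the brute-force loop is only about 7.4 million distance calculations, so a spatial index shouldn’t be necessary at this scale.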
All in all, pretty bad: data quality at this level is clearly unusable. My GIS skills are weak, so this may be my own stupid fault. I’ll consult with some experts and see what I might be doing wrong. But the basic distance-matching idea is pretty straightforward, so I’m not terrifically optimistic. It’s possible that data quality is just going to really, really stink. Here’s hoping we can get a proper lookup table out of WMATA or NextBus. Otherwise I don’t see a great alternative to manual intervention.
* These numbers ignore the routes that NextBus tracks but which GTFS does not; those are B99, F99, L99, NH1, P99, REX, S80 and S91 (they appear to be shuttles and the like). I haven’t yet identified the routes that are in GTFS but not tracked by NextBus.
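Finding the mismatched routes in both directions comes down to a pair of set differences. A rough sketch, assuming the route identifiers from each source have already been pulled into plain lists of strings (the function name `route_differences` and the whitespace and case normalization are my own assumptions, not anything either feed guarantees):

```python
def route_differences(nextbus_routes, gtfs_routes):
    """Return (routes only in NextBus, routes only in GTFS), each sorted.

    Identifiers are stripped and upper-cased before comparing, on the
    assumption that the two feeds may differ in whitespace or case.
    """
    nb = {r.strip().upper() for r in nextbus_routes}
    gt = {r.strip().upper() for r in gtfs_routes}
    return sorted(nb - gt), sorted(gt - nb)
```

Called as `route_differences(nextbus_list, gtfs_list)`, the first returned list corresponds to the routes named in this footnote, and the second is the list I haven’t built yet.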