euclidean cities

I am currently reading Mapmatics: How We Navigate the World Through Numbers by Paulina Rowińska.^[1]

Chapter 4 — Distanced — starts by discussing homeomorphic representations that distort distances but preserve connections, such as Harry Beck's map of the London Underground. This leads into a brief section critiquing the isochrones on the maps dotted across London that aim to depict where you can walk to in 5 or 15 minutes, but actually show where a crow could fly to, because they don't pay any attention to the street network.

In some silly little corner of my silly little head, a silly little seed was planted.

By the time I'd finished reading the next section where euclidean distance and manhattan distance were introduced, that seed had taken root and grown into a fully formed silly little question.

what's the most euclidean city?

Buildings and railways and rivers and churches and other people's homes all have a habit of getting in the way. From where I am currently sitting, it is a mere ~400 metres to a bakery that'll bankrupt you as fast as you can say "two cinnamon buns please", but to get there I have to walk ~800 m. That's a ratio of 2.

From the train station to the climbing gym I used to frequent it's a short 2.15 km walk. But that's still 1.5 times further than the straight line distance of 1.44 km.

The aquarium is only...you get the idea.

A city that had straight roads from everywhere to everywhere else would be perfectly euclidean (and weird) (Fig 0d). I don't think such a city has been constructed, so where's the next best place? Which city on average has the lowest ratio of network distance to euclidean distance? Or, to twist and frame this question in a way that makes it sound slightly more useful, which city is most accommodating to the reluctant and lazy pedestrian or tired cycle courier?
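For the avoidance of doubt, the ratio is nothing more than network distance divided by straight-line distance. A minimal sketch using the two walks described above (the function name `detour_ratio` is made up for illustration):

```python
def detour_ratio(network_m: float, euclidean_m: float) -> float:
    """Network distance divided by straight-line distance (1.0 = perfectly direct)."""
    if euclidean_m <= 0:
        raise ValueError("euclidean distance must be positive")
    return network_m / euclidean_m


# the two walks described above, in metres
bakery = detour_ratio(800, 400)
gym = detour_ratio(2150, 1440)

print(f"bakery: {bakery:.2f}, gym: {gym:.2f}")  # bakery: 2.00, gym: 1.49
```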

Fig 0: Ljubljana, Slovenia: 250 m radius. Left to right, top to bottom: (a) all nodes in street network (n_nodes=184); (b) the street network as it is (n_edges=448); (c) the network as it would be if all streets were straight; (d) the perfect network where one would never have to travel further than necessary (n_edges=16,836, i.e. n_nodes * (n_nodes - 1) / 2).
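The edge count in panel (d) is the number of unordered pairs of nodes. A quick sanity check, assuming one undirected straight-line edge per pair:

```python
from itertools import combinations

n_nodes = 184  # node count quoted in the Fig 0 caption

# one undirected straight-line edge per unordered pair of nodes
n_edges_formula = n_nodes * (n_nodes - 1) // 2
n_edges_counted = sum(1 for _ in combinations(range(n_nodes), 2))

print(n_edges_formula, n_edges_counted)  # 16836 16836
```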

literature review

I'm saving the reading until after I've done this, because I don't want to discover that someone else has already done this, and in a way that is better than I could possibly manage. That would be demoralising, and also rob me of the joy of mucking about with: (a) an idea; (b) some data; and (c) a new-to-me plotting library.

This may become a future post.

method

data

Representative points for each european capital were taken from OpenStreetMap, and for each city, a 3 km buffer was drawn and the street network within this buffer extracted using the python library osmnx. The networks were reprojected into the local UTM zone and some network summary statistics calculated with osmnx.stats.basic_stats.


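import osmnx as ox
import pandas as pd
from tqdm import tqdm

# `capitals` is a GeoDataFrame of capital points (built earlier, not shown)
basic_stats = []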
for capital in tqdm(capitals.itertuples()):
    name = '_'.join(capital.name.split(' '))

    digraph = ox.graph_from_point(
        center_point=(capital.geometry.y, capital.geometry.x),
        dist=3_000,
        network_type='walk',
        simplify=True,
    )

    proj_graph = ox.projection.project_graph(digraph)

    ox.io.save_graphml(
        proj_graph,
        f"graphs/{name}.graphml"
    )
    basic_stats.append(
        {capital.name: ox.stats.basic_stats(proj_graph)}
    )

stats_df = pd.concat([pd.DataFrame.from_dict(s) for s in basic_stats], axis=1).T
stats_df.to_json('basic_stats.json')

sampling & routing

To estimate the network's average ratio of network distance to euclidean distance, pairs of nodes were drawn at random (without replacement) from an area within 1500 m of each city centre point, and both the shortest network path (osmnx.routing.shortest_path) and the straight line distance between them calculated. The motivation behind sampling nodes from a smaller area was to (a) mitigate the shortest route being via a node outside of the 3 km zone originally selected; and (b) keep it to an area that I assumed^[2] would be where rates of pedestrianism are greatest. In the spirit of not letting my laptop get too hot and bothered^[3], only 10% of the nodes were sampled from each city. This proportional sampling means that the number of nodes sampled, and subsequently the number of ratios calculated, varied for each city (table 0).


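import random

import geopandas as gpd
import osmnx as ox
from shapely import MultiLineString, line_merge

def sample_network(capital, bufferdist=1500, samplefrac=0.05):
    # samplefrac is the fraction of inner nodes used as *pairs*; each pair
    # consumes two nodes, so the default 0.05 samples 10% of the nodes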

    name = '_'.join(capital.name.split(' '))

    graph = ox.io.load_graphml(f"graphs/{name}.graphml")
    all_nodes, all_edges = ox.convert.graph_to_gdfs(graph)
    crs = all_nodes.crs

    # select and sample from nodes within `bufferdist` of capital point
    inner_buffer = (
        capitals.loc[capitals['name']==capital.name]
        .to_crs(crs)
        .buffer(bufferdist)
        .iloc[0]
    )

    inner_nodes = all_nodes[
        all_nodes.intersects(inner_buffer)
    ].index.tolist()

    n_samples = int(len(inner_nodes) * samplefrac)

    linestrings = []
    for _ in range(n_samples):
        # pop used nodes
        orig = inner_nodes.pop(random.randrange(len(inner_nodes)))
        dest = inner_nodes.pop(random.randrange(len(inner_nodes)))

        # but allow routing across whole graph - so route can leave
        # the inner_buffer if needed
        route = ox.routing.shortest_path(
            graph,
            orig,
            dest,
            weight='length'
        )

        route_gdf = ox.routing.route_to_gdf(
            graph,
            route,
            weight='length'
        )

        # merge route edges to single linestring
        ls = line_merge(MultiLineString(route_gdf['geometry'].tolist()))
        linestrings.append(ls)

    # make some stats, return reprojected
    gdf = gpd.GeoDataFrame(geometry=linestrings, crs=crs)
    gdf['network'] = gdf.length
    # straight-line distance between the two endpoints of each merged route
    gdf['euclidean'] = gdf.boundary.apply(
        lambda mp: mp.geoms[0].distance(mp.geoms[1])
    )
    gdf['ratio'] = gdf['network'] / gdf['euclidean']
    gdf['city'] = capital.name
    gdf['utm_crs'] = crs.to_epsg()

    return gdf.to_crs(4326)

And I think that's ok^[4].
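The sampling idea can be tried without downloading anything. The sketch below is a toy stand-in, not the osmnx code above: it assumes a made-up square grid of junctions 100 m apart, uses a hand-rolled Dijkstra in place of osmnx's routing, and the names `grid_graph`, `network_distance` and `sample_ratios` are all invented for illustration.

```python
import heapq
import math
import random


def grid_graph(size, spacing=100.0):
    """Toy street network: a size x size grid with `spacing` metres between junctions."""
    graph = {}
    for x in range(size):
        for y in range(size):
            neighbours = {}
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                x2, y2 = x + dx, y + dy
                if 0 <= x2 < size and 0 <= y2 < size:
                    neighbours[(x2, y2)] = spacing
            graph[(x, y)] = neighbours
    return graph


def network_distance(graph, orig, dest):
    """Length of the shortest path from orig to dest (Dijkstra)."""
    best = {orig: 0.0}
    queue = [(0.0, orig)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == dest:
            return dist
        if dist > best.get(node, math.inf):
            continue  # stale queue entry
        for nbr, length in graph[node].items():
            new = dist + length
            if new < best.get(nbr, math.inf):
                best[nbr] = new
                heapq.heappush(queue, (new, nbr))
    return math.inf


def sample_ratios(graph, spacing=100.0, samplefrac=0.05, seed=0):
    """Draw node pairs without replacement and return network/euclidean ratios."""
    rng = random.Random(seed)
    nodes = list(graph)
    n_pairs = int(len(nodes) * samplefrac)
    picked = rng.sample(nodes, 2 * n_pairs)  # no node is used twice
    ratios = []
    for orig, dest in zip(picked[::2], picked[1::2]):
        network = network_distance(graph, orig, dest)
        euclidean = math.dist(orig, dest) * spacing  # grid units to metres
        ratios.append(network / euclidean)
    return ratios


ratios = sample_ratios(grid_graph(10))
print(len(ratios), [round(r, 3) for r in ratios])
```

On a perfect grid every ratio falls between 1 (same street) and the square root of 2 (a diagonal trip), which makes it a handy baseline for the real cities.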

Table 0: number of node pairs sampled per city


city	n
Ankara	72
Valletta	76
Vaduz	76
City of San Marino	82
Andorra la Vella	87
Bucharest	136
Podgorica	149
Nicosia	154
Skopje	155
Monaco	157
Amsterdam	164
Sarajevo	167
Budapest	169
Lisbon	171
Pristina	171
Luxembourg	200
Chișinău	201
Stockholm	207
Belgrade	208
Reykjavik	231
Vatican City	243
Athens	247
Copenhagen	261
Sofia	264
Madrid	266
Zagreb	268
Dublin	275
Kyiv	276
Prague	278
Rome	284
Vilnius	293
Riga	294
Bern	299
Ljubljana	318
Vienna	330
Oslo	339
Brussels	343
Tallinn	345
Tirana	347
Minsk	360
London	375
Berlin	375
Bratislava	381
Paris	400
Moscow	490
Warsaw	517
Helsinki	848

statistics

To ascertain whether any difference between two given cities' street network euclidean-ness is statistically significant, some statistics were done...

This involves comparing the distribution of network:euclidean ratios for each city (e.g. Fig 1). ANOVA, or analysis of variance, is the tool for the job. However, it requires a few assumptions to be met:

independence
normality
homoscedasticity^[5]


Fig 1: Distribution of network:euclidean distance ratios across Ljubljana.

Independence is a safe assumption. These are different cities, separated by some distance, in different geographical and geological settings. Sure some urban planning practices, designs and themes might have been copied here and there, but, whatever. Check.

Normality was tested with scipy.stats.normaltest and they all came back as normal enough.^[6] Check.

Homoscedasticity. Fail. scipy.stats.bartlett suggested that these samples are unlikely to have equal variances.

So, no ANOVA today. Kruskal-Wallis, however, can be used as a substitute as it operates on the ranks of the data, rather than the data itself.
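To make "operates on the ranks" concrete, here is a minimal stdlib sketch of the Kruskal-Wallis H statistic. It is not the implementation used for this post: it skips the tie correction and the p-value (which comes from a chi-squared distribution), both of which a proper library routine such as scipy.stats.kruskal should handle. The three "cities" are made-up numbers.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_h(*groups):
    """Kruskal-Wallis H statistic, without the correction for ties."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)


# made-up ratios for three hypothetical cities
city_a = [1.10, 1.15, 1.20]
city_b = [1.25, 1.30, 1.35]
city_c = [1.40, 1.45, 9.99]  # one absurd detour

h_wild = kruskal_h(city_a, city_b, city_c)
h_tame = kruskal_h(city_a, city_b, [1.40, 1.45, 1.50])
print(round(h_wild, 3), round(h_tame, 3))  # 7.2 7.2
```

The absurd detour in `city_c` changes nothing, because only its rank (largest) matters, not its size.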

Kruskal-Wallis will only tell you if there is a difference between groups, but it won't tell you which cities are different. For that you need a bit of post-hoc analysis.^[7] Pairwise comparisons were carried out using Conover's test as implemented in scikit_posthocs with step-down Bonferroni adjustments to account for there being multiple comparisons.
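With the 47 cities in Table 0 there are 47 * 46 / 2 = 1,081 pairwise comparisons, hence the adjustment. Step-down Bonferroni is usually called Holm's method, and a minimal stdlib sketch of it looks something like this (my own illustration with made-up p-values, not the scikit_posthocs internals):

```python
def holm_adjust(p_values):
    """Step-down Bonferroni (Holm) adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # smallest p is multiplied by m, the next by m - 1, and so on
        candidate = min(1.0, (m - rank) * p_values[idx])
        # adjusted values must never decrease as the raw p-values increase
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03]  # made-up p-values
adjusted = holm_adjust(raw)
print([round(p, 3) for p in adjusted])  # [0.03, 0.06, 0.06]
```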

results

the headline

Madrid is the most euclidean; Valletta the least.

the detail

They are different (Fig 2). Well, some of them are at least. It is a spectrum. You could even say there is spatial heterogeneity.^[8]

Madrid, Brussels, Paris, and Oslo are the most euclidean, apparently, with median ratios of 1.195, 1.216, 1.222, and 1.228, respectively, and if Conover's test is to be believed, they are not significantly different from one another (p-value > whatever threshold you fancy, 0.05, 0.001) (Fig 2). At the other extreme: Valletta (2.440), Vaduz (1.397), and Budapest (1.375) were the least euclidean, and can all be considered similarly inefficient.


Fig 2: Heatmap of post-hoc pairwise comparisons using Conover's test. Small p-values allow us to reject the null hypothesis, and conclude that there is a difference between any given pair of cities, i.e., differences can be found in the purples around the edge of the figure. note: the colour scale is logarithmic.

Cities with larger average ratios also had greater variance^[9] (Fig 3). Even within the most spaghetti-like city, some routes will still be relatively direct. For example, in Stockholm you could get from the junction of Odengatan and Birger Jarlsgatan to the centre of Svensksundsparken on Skeppsholmen travelling only 9.7% further than a bird (euclidean: 2683 m, network: 2942 m; ratio: 1.097). It is worth noting that this is largely independent of distance, i.e. it is not just a few long and wiggly routes that drive the greater variance. There are long direct routes, and short circuitous ones. For each city the ratio was hastily regressed against euclidean distance, and no consistent pattern emerged, with r-squared values typically around 0.09.
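For anyone wanting to repeat the hasty regression, r-squared for a straight-line fit needs nothing beyond the standard library. A minimal sketch, with a made-up helper name and made-up numbers rather than the real samples:

```python
def r_squared(x, y):
    """Coefficient of determination for a least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


# made-up euclidean distances (m) and ratios for one hypothetical city
euclidean = [200, 500, 1000, 1500, 2500]
ratio = [1.6, 1.1, 1.5, 1.2, 1.4]

print(round(r_squared(euclidean, ratio), 3))
```

A value near 0 says distance explains almost none of the spread in the ratio; a value near 1 would mean the ratio is all but determined by how far apart the two nodes are.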


Fig 3: boxplot of network : euclidean distance ratios for each city, sorted by median. Note the logarithmic x-scale. Hover over a box for maximum, minimum, median and IQR values (given to too many decimal places for no good reason)

A quick comparison with all the statistics generated by osmnx.stats.basic_stats shows, mercifully, that I didn't re-invent the wheel here.^[10] Average street circuity is the metric with the greatest similarity to the ratio calculated here, and whilst they are correlated (r=0.33), they are different. ish. Andorra, with its hairpins, tops the circuity charts, whereas Budapest is amongst the least. The circuity average is the result of averaging the circuity^[11] for each edge, whereas the ratio calculated here is for a route across the network, involving multiple edges. It is understandable that these are different, since the angle at which streets intersect is not being accounted for in the former, whereas it is implicitly included in the latter.^[12]
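A toy example of why the two measures can disagree, assuming per-edge circuity means edge length divided by the straight-line distance between that edge's endpoints, as described above. Two dead-straight streets meeting at a right angle score a perfect circuity, yet the route along them is still a detour:

```python
import math

# three junctions joined by two perfectly straight streets at a right angle
a, b, c = (0.0, 0.0), (100.0, 0.0), (100.0, 100.0)
edges = [(a, b), (b, c)]

# each street is straight, so its length equals the distance between its ends
edge_lengths = [math.dist(u, v) for u, v in edges]
edge_circuities = [
    length / math.dist(u, v) for length, (u, v) in zip(edge_lengths, edges)
]
mean_circuity = sum(edge_circuities) / len(edge_circuities)

# the route a -> b -> c compared with the straight line a -> c
route_ratio = sum(edge_lengths) / math.dist(a, c)

print(f"{mean_circuity:.2f} {route_ratio:.2f}")  # 1.00 1.41
```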

why are the cities different?

To understand the differences in network : euclidean distance ratios between cities, it is necessary to view them in situ, in context (Fig 4).


Fig 4: Map showing routes between random nodes (solid lines) and the associated straight line paths (dashed). Hover over a route to see more details (note: out of respect for your internet speed, only 20% of samples are shown for each city)

The shortest, simplest, answer to explain away any differences is: water.

Valletta, the least euclidean in this cough "study", straddles two harbours^[13]; the Danube cuts Budapest in two, and it's wide (~350 m ish), and the (two) bridges (that fall within my little study area) are a kilometre apart. Whereas Paris, which is admirably euclidean, despite being centred on Île de la Cité in the Seine, manages to squeeze in 10 bridges in ~3 km. Stockholm's numerous islands and the-opposite-of-numerous bridges mean it has a relatively high median ratio (1.31), and a long tail.

A slightly longer, slightly more nuanced, answer is: water and hills. Vaduz illustrates this most clearly, with its hairpins on Fürst-Franz-Josef-Strasse. Andorra la Vella also has its fair share of wiggles. But they're both tiny, and that will be discussed shortly.

Brussels, Madrid, Sofia and Oslo have the tightest inter-quartile ranges, and Athens, Madrid, Ankara and Rome have the highest average streets per node, with no obvious relationship between the two to be found.

The qualities that can best tease apart the subtle and not-so-subtle differences are perhaps those that can't readily be described by a quantity^[14]: design and history. This is where I would like to talk about Haussmann's renovation of Paris, or how apt it is that Berlin and Helsinki are adjacent to one another in Fig 3, since it was Carl Ludvig Engel who was trained as a surveyor at the former before becoming the architect charged with reconstructing the latter. But I shan't, because I'm not remotely qualified to, and I respect you more than to just regurgitate Wikipedia at you. Mwah.

but

No self-respecting^[15] cough "study" would be complete^[16] without a limitations and caveats section. This is that.

sensitive

In version 1^[17] of this analysis, the sampling strategy differed in two key ways: firstly, a larger 5 km radius was used instead of 3 km; and secondly, nodes were sampled from anywhere within that 5 km, as opposed to the inner 1.5 km used here. Those results looked different. Stockholm was the least euclidean, Paris the most (with a median ratio of 1.14, 0.05 less than Madrid scored in this version). Budapest was somewhere in the middle, as was Vienna.

The motivation behind shrinking the area, in addition to the reasons in the method, was to allow smaller capitals (Andorra la Vella, Vaduz) to stand a fighting chance of being treated fairly. An option that was considered was to take the whole area of the capital, but for London would that mean taking all of Greater London, which feels wrong and would include Biggin Hill, or just the City of London, which also feels wrong? Applying a blanket radius was simpler than selecting the areas individually, manually.

So, treat these "results" accordingly.

perhaps wrong

When having a look at some of these results, Belgrade's anomalously long tail stood out. The Sava can be crossed as a pedestrian on both the Brankov and Gazela bridges, and OpenStreetMap shows that to be the case. However, it seems that either (a) the network is being incorrectly extracted from OpenStreetMap in the first instance; or (b) the bridges are lost during the simplification process. As such, many routes that cross the Sava are being heavily penalised as they detour south to the Ada Bridge.

So, treat these "results" accordingly.

likely incomplete

The search space isn't that large, so one option would be to skip the sampling altogether and either do every pair, or pick a single node at the centre and calculate network distances from that node to every other node (NetworkX has a function for just that).

conclusion

You're probably walking ~1.3x further than a crow is flying. On average. ish.

footnotes

[1] minor grumble: the UK edition has the least interesting cover-art. the german and italian covers are great.
[2] not very science-y of me, sorry
[3] coupled with some impatience, and not wanting to have to think too hard about writing efficient code
[4] as does the friend who i asked
[5] this is where i could have included a definition, but chose not to. sorry, you're on your own
[6] (ed.) the same cannot be said of the author
[7] while being mindful that at some point false positives are going to pop-up
[8] small grievance: studies that state this like it is surprising. what would be surprising is if everything was the same everywhere
[9] i.e. a bigger coefficient of variation
[10] i sort of did. but that's ok. wheels are great
[11] well, duh
[12] fighting the urge to scratch the what-angle-do-streets-typically-meet-at itch
[13] and efficiently highlights a shortcoming of this cough "study": there are ferries
[14] unfortunately
[15] not that i, or this study, are capable of that
[16] this also isn't
[17] yes, believe it or not this is the improved version 2