Local setup for working with Overture data

ivann.schlosser@ox.co.uk

Oxford Programme for Sustainable Infrastructure Systems (OPSIS)

Importing the data

The Overture website recommends several workflows for downloading the data. Of these, the python-based overturemaps CLI, available from pip, lets us work in a local and self-sufficient manner. It requires few arguments: four numeric values for the bounding box (bbox), the type of layer to extract, and the file format to write.

!overturemaps download --bbox=west,south,east,north -f geoparquet --type=segment -o tanzania_roads.geoparquet

More information on the values allowed in --type is available via the shell command overturemaps download --help. More methods to download Overture data are shown in the documentation.
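As a minimal sketch, the command above can also be assembled from python, which makes it easier to check the order of the bbox values before launching a download. The helper name build_download_command and the bounding box values are placeholders chosen for illustration; only the flags already shown above are used.

```python
import shlex


def build_download_command(west, south, east, north,
                           feature_type="segment",
                           output="tanzania_roads.geoparquet"):
    """Return the overturemaps CLI call as a list of arguments (hypothetical helper)."""
    # The bbox must be ordered west,south,east,north.
    if not (west < east and south < north):
        raise ValueError("expected west < east and south < north")
    bbox = f"{west},{south},{east},{north}"
    return ["overturemaps", "download", f"--bbox={bbox}",
            "-f", "geoparquet", f"--type={feature_type}", "-o", output]


# Placeholder bounding box, not the extent used in this document.
cmd = build_download_command(39.0, -7.0, 39.5, -6.5)
print(shlex.join(cmd))
```

The resulting list could be passed to subprocess.run(cmd, check=True) once overturemaps is installed.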

Working with the data

Once the data is stored locally as .geoparquet, we can work with it in python with duckdb.

import duckdb as db

roads = db.read_parquet("../tanzania_roads.geoparquet")

The data set is read as traditional parquet in which the geometry column is a blob.

┌──────────────────────┬──────────────┬────────────────────────────────────────────────────────────────────────────────┐
│          id          │    class     │                                    geometry                                    │
│       varchar        │   varchar    │                                      blob                                      │
├──────────────────────┼──────────────┼────────────────────────────────────────────────────────────────────────────────┤
│ 089962508d97ffff04…  │ path         │ \x00\x00\x00\x00\x02\x00\x00\x00\x02@<\xD1\xA5\x99\x82\x1C\xA5\xC01.1\x0D\xB…  │
│ 089962508d97ffff04…  │ path         │ \x00\x00\x00\x00\x02\x00\x00\x00\x08@<\xD1\xD2\x08u\xF0G\xC01.\x0A+\xD2\xEC\…  │
│ 088962508d9fffff04…  │ path         │ \x00\x00\x00\x00\x02\x00\x00\x00\x13@<\xD1\xA5\x99\x82\x1C\xA5\xC01.1\x0D\xB…  │
│ 089962508d83ffff04…  │ path         │ \x00\x00\x00\x00\x02\x00\x00\x00\x0B@<\xD2\x06\xEF\x07\x8A8\xC01-\xE34\x17^\…  │
│ 08496251ffffffff04…  │ secondary    │ \x00\x00\x00\x00\x02\x00\x00\x00c@<\xC7\xF8\xAA\xA1(\xDA\xC01&_\xB8\xCD;\x88…  │
│ 086962508fffffff04…  │ unclassified │ \x00\x00\x00\x00\x02\x00\x00\x00S@<\xD2\x16/\x16n\x01\xC01!\xD4\x96\xC5\xA9\…  │
│ 08496251ffffffff04…  │ unclassified │ \x00\x00\x00\x00\x02\x00\x00\x01\x05@<\xD2\x16/\x16n\x01\xC01!\xD4\x96\xC5\x…  │
│ 087962508bffffff04…  │ track        │ \x00\x00\x00\x00\x02\x00\x00\x00!@<\xD3\x9A\x93\x94\x9C\xDB\xC01!\xB0\xFA\x0…  │
│ 086962508fffffff04…  │ track        │ \x00\x00\x00\x00\x02\x00\x00\x00:@<\xD4\xB9\x9F\xC9^\x83\xC01\x1En\xE9\xC2\x…  │
│ 087962508bffffff04…  │ track        │ \x00\x00\x00\x00\x02\x00\x00\x003@<\xD1\xBA\x0B\xFF\xE8\x83\xC01\x1Ew\xA7\xD…  │
├──────────────────────┴──────────────┴────────────────────────────────────────────────────────────────────────────────┤
│ 10 rows                                                                                                    3 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
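The leading bytes of the blobs shown above are consistent with the well-known binary (WKB) encoding: a byte-order flag, a geometry type code (2 for a LineString) and a point count, followed by the coordinates. The sketch below illustrates that layout with the standard library only, on a hand-built sample rather than the real data. The function name parse_wkb_linestring is hypothetical, and it only handles plain 2D LineStrings.

```python
import struct


def parse_wkb_linestring(blob: bytes):
    """Decode a 2D WKB LineString into a list of (x, y) tuples (illustrative only)."""
    # Byte 0 is the byte-order flag: 0 = big-endian, 1 = little-endian.
    endian = ">" if blob[0] == 0 else "<"
    # Bytes 1-8 hold the geometry type code and the number of points.
    geom_type, n_points = struct.unpack_from(endian + "II", blob, 1)
    if geom_type != 2:
        raise ValueError(f"not a 2D LineString (type code {geom_type})")
    # Each point is a pair of 8-byte doubles, starting at byte 9.
    return [struct.unpack_from(endian + "dd", blob, 9 + 16 * i)
            for i in range(n_points)]


# Hand-built big-endian sample with two invented points.
sample = struct.pack(">BII", 0, 2, 2) + struct.pack(">dddd", 30.0, -17.0, 30.5, -17.5)
coords = parse_wkb_linestring(sample)
print(coords)
```

In practice the decoding is left to a library such as the duckdb spatial extension or geopandas, as done in the rest of this document.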

The duckdb loaders do not support reading geoparquet at the moment, but this feature is expected in the upcoming version. We stick to this format for its efficiency when storing large extracts.

To work further with the geometry, we install and load the duckdb spatial extension.

# installing and loading the extension.
db.install_extension("spatial")
db.load_extension("spatial")

This will allow us to work with the geometry column from within the database, bypassing the limitation of the parquet reader.

Basic interaction with the data

We continue with the duckdb package and its SQL syntax.

Counting values

db.sql("SELECT count(*) as N_segments,class FROM roads GROUP BY class;")

┌────────────┬───────────────┐
│ N_segments │     class     │
│   int64    │    varchar    │
├────────────┼───────────────┤
│      83290 │ unknown       │
│       3974 │ driveway      │
│     502882 │ track         │
│      69192 │ secondary     │
│    1128458 │ unclassified  │
│     148719 │ tertiary      │
│        319 │ sidewalk      │
│       2473 │ living_street │
│      27548 │ primary       │
│      44763 │ trunk         │
│         93 │ NULL          │
│        373 │ steps         │
│        457 │ pedestrian    │
│         63 │ motorway      │
│    1249548 │ path          │
│       1481 │ parking_aisle │
│        263 │ crosswalk     │
│        144 │ cycleway      │
│    1792660 │ residential   │
│      57044 │ footway       │
│        459 │ alley         │
│         29 │ bridleway     │
├────────────┴───────────────┤
│ 22 rows          2 columns │
└────────────────────────────┘

Data Manipulation

The advantage of working with duckdb is that intensive computations are performed outside the python environment, and all we need to do is collect the results.

Extracting a subset

# keeping only the primary roads and parsing the WKB geometry
ways = db.sql("Select id,ST_GeomFromWKB(geometry) as geometry,subtype,class from roads where class='primary';")

# intermediate step: transform the geometry into WKT and read the subset of data as a pandas DataFrame
ways_wkt = db.sql("select id, ST_AsText(geometry) as geometry, subtype, class from ways;").df()

# Finally, convert the geometry and create a geopandas GeoDataFrame.
import geopandas as gpd

ways_df = gpd.GeoDataFrame(ways_wkt
                          ,geometry=gpd.GeoSeries.from_wkt(ways_wkt["geometry"])
                          ,crs=4326
                          )
ways_df.head()

	id	geometry	subtype	class
0	088971928a1fffff047fb94fc9b6da63	LINESTRING (30.87546 -17.07291, 30.87616 -17.0...	road	primary
1	08497193ffffffff047fafebb6014c10	LINESTRING (30.86457 -17.04400, 30.86431 -17.0...	road	primary
2	08896269203fffff047daff58b6ae4c4	LINESTRING (30.84229 -17.03138, 30.84201 -17.0...	road	primary
3	089962692037ffff047fee1fd2e8c575	LINESTRING (30.84418 -17.03071, 30.84229 -17.0...	road	primary
4	08a962692022ffff047ffd55846af8dc	LINESTRING (30.84504 -17.03038, 30.84418 -17.0...	road	primary

The resulting types:

id            object
geometry    geometry
subtype       object
class         object
dtype: object
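The last conversion step can be tried on a hand-made table, without the extract or the database. The sketch below uses invented ids and coordinates, and assumes geopandas and pandas are installed; it is meant to show the WKT-to-geometry step in isolation.

```python
import geopandas as gpd
import pandas as pd

# Invented rows that mimic the shape of the query result above.
toy_wkt = pd.DataFrame({
    "id": ["a1", "b2"],
    "geometry": ["LINESTRING (30.0 -17.0, 30.1 -17.1)",
                 "LINESTRING (30.2 -17.2, 30.3 -17.3)"],
    "class": ["primary", "primary"],
})

# Parse the WKT strings into geometries, then build the GeoDataFrame.
geom = gpd.GeoSeries.from_wkt(toy_wkt["geometry"], crs=4326)
toy_gdf = gpd.GeoDataFrame(toy_wkt.drop(columns="geometry"), geometry=geom)

print(toy_gdf.dtypes)
```

The geometry column now has the geometry dtype, as in the output above, and the coordinate reference system is recorded on the GeoDataFrame.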

Plotting

[Figure: "Example segment class". x-axis: Longitude [deg]; y-axis: Latitude [deg].]

Other workflows

GeoPandas

Once the data is extracted, other options are available to work with it. GeoPandas converts the geometry column for us, so no extra steps are required.

ways_gpd = gpd.read_parquet("../tanzania_roads.geoparquet"
                        #  ,columns=["id","class","connector_ids","geometry"] # read in desired columns only.
                         )

Reading with this method is less efficient, however, so it is only recommended for relatively small data sets.

pyArrow

pyarrow is the reader that geopandas uses under the hood.

import pyarrow.parquet as pq

ways_arrow = pq.read_table("../tanzania_roads.geoparquet")

Still more tools

The vast python package ecosystem provides a wide range of tools that work with (geo)parquet and (geo)arrow file formats and specifications, among them:

geoarrow: a specification for representing geometries in the Arrow columnar format.
geoparquet: a specification for storing geometries in parquet files.