An opinionated way of working with OSM data • cppRosm

Introduction

Anyone who has used OpenStreetMap (OSM) data in R or Python might agree that the tagging system, while it provides great flexibility and extensive classification options, is sometimes a bit challenging to work with. This vignette explains an approach to the tagging system that aims to make it easy to interact with features, their associated tags and their geometries. The main ambition is to provide easy access to large OSM data sets.

OSM tags

A tag is a pair of values: the first element is called the \(key\) and the second is the corresponding \(value\). Together they form a tag, written as \(key=value\). The key corresponds to a broad classification, while the value is specific. For example, in \(amenity=restaurant\) the \(key\) is \(amenity\) and the \(value\) is \(restaurant\). A key generally admits several values.

Every OSM feature consists of a set of nodes and a set of tags. The nodes are geographically referenced, which means they have associated coordinates from which the geometry of a feature can be reconstructed. A detailed list of possible values, along with recommendations, is available in the following article: OSM map features.
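To make the \(key=value\) structure concrete, here is a minimal base R sketch that turns tag strings into a named character vector. The helper `parse_tags` is hypothetical and only serves as an illustration; it is not part of cppRosm.

```r
# Hypothetical helper (not a cppRosm function): split "key=value" strings
# into a named character vector whose names are the keys.
parse_tags <- function(x) {
  # locate the first "=" only, since a value may itself contain "="
  pos <- as.integer(regexpr("=", x, fixed = TRUE))
  keys <- substr(x, 1, pos - 1)
  values <- substr(x, pos + 1, nchar(x))
  stats::setNames(values, keys)
}

tags <- parse_tags(c("amenity=restaurant",
                     "cuisine=italian",
                     "addr:street=Via Roma"))

tags[["amenity"]]  # value stored under the `amenity` key
names(tags)        # all keys carried by this feature
```

A named vector is convenient here because a key appears at most once per feature, so looking up a value by its key is direct.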

Different data types

To develop appropriate tools for working with OSM data, it is important to look at what kind of data is actually there and in what format it is most useful. Two main types of data can be distinguished:

Network

First, one main use of OSM data is to obtain connected and routable road networks; an easy-to-use function called extract_graph does that. See the corresponding vignette for details. The road network data is grouped under the \(highway\) key. All categories of roads (residential, motorway, pedestrian, etc.) are values of this key.

Non network

The rest of the data in OSM is usually represented by either a point or a polygon with its set of associated tags. There can be multiple tags associated with a feature, and one might be interested in the values of a specific one. There is, however, an intrinsic hierarchy in the tags, which can be useful for extracting data in a user-friendly and exploitable way.

Tagging hierarchy

This package proposes a two-level hierarchy of tags. It helps extract the data into large data.tables or data.frames in which the high-level tags are added as column variables, while secondary tags are grouped into named lists and added in a separate column called attrs. The main consideration is that some tags add information to each other and are complementary for a given feature; these are grouped in the 2nd level of this schema. Other tags are mutually exclusive: for example, a feature with an \(amenity\) key will generally not have a \(healthcare\) key, since these are different types of features. Both can nevertheless carry the same 2nd level keys, such as the address, name, phone number or any other specific information.

1st level

The first level corresponds to the main tags. Those are taken from the following list:

 main_first_level <- c(
    "amenity"
    ,"craft"
    ,"healthcare"
    ,"historic"
    ,"sport"
    ,"natural"
    ,"shop"
    ,"tourism"
  )

2nd level

Everything else is left to the second level of tags. For example, \(addr:street\) is a specific key for the address of a feature; it is generally scarce. \(leisure\) is a key that overlaps a lot with the \(sport\) key. It generally contains more specific information on the type of sport.
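The split between the two levels can be sketched in a few lines of base R. This is only an illustration of the idea under simple assumptions (one main key per feature, tags held in a named character vector); the helper `split_tags` is hypothetical and does not reflect how the package is implemented internally.

```r
main_first_level <- c("amenity", "craft", "healthcare", "historic",
                      "sport", "natural", "shop", "tourism")

# Hypothetical helper (not a cppRosm function): separate the main tag of a
# feature from its secondary tags, which are kept as a named list.
split_tags <- function(tags, main_keys = main_first_level) {
  is_main <- names(tags) %in% main_keys
  list(
    key   = names(tags)[is_main][1],    # first main key, NA if none
    value = unname(tags[is_main])[1],   # its value, NA if none
    attrs = as.list(tags[!is_main])     # everything else, as a named list
  )
}

feature <- c(amenity = "restaurant",
             cuisine = "italian",
             `addr:street` = "Via Roma")

res <- split_tags(feature)
res$key
res$value
names(res$attrs)
```

Keeping the secondary tags in a list avoids creating one mostly empty column per possible key.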

Explanation

Because OSM data is crowd-sourced, there is bound to be an exception somewhere to any rule that can be imposed on the data. One can still observe certain patterns that are generally true, and this is what the two-level hierarchy of tags aims to capture. Some observations are:

1st level	2nd level
Tend to be mutually exclusive; provide a certain amount of information, but remain sufficiently broad	A lot more specific; complement each other
amenity, shop, tourism, etc.	addr:name, addr:street, cuisine, takeaway

Arguably, some tags don’t fall into either of these categories; you can still extract them with the extract_data function. One such tag is the one with the \(building\) key, which qualifies more as a layer of data, much like the road network.

Example

By default, only tags with keys \(amenity,shop,tourism\) will be extracted.

library(cppRosm)
library(sf)

## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE

test_file <- system.file(package = 'cppRosm','extdata','map.osm')

data <- cppRosm::extract_data(test_file)

## → Using main keys `amenity`,`shop`,`tourism`.

head(data) |> 
  knitr::kable(digits = 3)

id	key	value	lon	lat	attrs
227734022	amenity	fuel	10.513	43.850	Api-Ip , Q646807 , en:Anonima Petroli Italiana , yes , yes , yes , IP , Netti Santo Nicola , 28234 , MISE - Ministero Sviluppo Economico
227808627	amenity	fuel	10.495	43.841	Esso , Q867662 , yes , yes , Esso , Ciervo Giovanni, 37277
227817156	amenity	fuel	10.495	43.845	Api-Ip , yes , yes , Cobel - Commercio Benzina Lubrificanti - S.R.L. o Cobel - S.R.L., 31551 , no , yes
227817516	amenity	fuel	10.516	43.848	Api-Ip , 2022-09-04 , yes , yes , Giop , Cobel - Commercio Benzina Lubrificanti - S.R.L. o Cobel - S.R.L., yes , yes , yes , 22498 , no , yes
227821623	amenity	fuel	10.516	43.846	Agip , Q377915 , en:Agip , yes , yes , Agip , Fratelli Checchi - S.N.C., 39092
227821625	amenity	fuel	10.516	43.847	Api-Ip , yes , yes , Minghi Marco, 8149 , no , yes

The geometry is simplified to the centroid, whose coordinates are provided in the lon and lat columns of the data table. This is meant to save memory, especially for large data sets. A function to reconstruct the full geometry is provided, but since the full geometry is not always useful, it is omitted in a first extraction. The nodes that constitute a full geometry, if it is more complex than a point, are added to the attrs column as a data.frame. These geometries are always closed, meaning they are polygons.
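As a rough illustration of what a closed set of nodes looks like, the base R sketch below builds a toy node table and summarises it by a single point. The coordinates are invented, both helpers are hypothetical, and the vertex mean used here is only a simple stand-in for a centroid; the package may well compute it differently.

```r
# Toy node table for one closed way (a small square); coordinates are invented.
nodes <- data.frame(
  lon = c(10.500, 10.502, 10.502, 10.500, 10.500),
  lat = c(43.840, 43.840, 43.842, 43.842, 43.840)
)

# A closed geometry repeats its first node as its last node.
is_closed <- function(nd) {
  n <- nrow(nd)
  n > 3 && nd$lon[1] == nd$lon[n] && nd$lat[1] == nd$lat[n]
}

# Simple stand-in for a centroid: the mean of the distinct vertices
# (assumption: this is NOT necessarily how cppRosm derives lon/lat).
vertex_mean <- function(nd) {
  if (is_closed(nd)) nd <- nd[-nrow(nd), ]  # drop the repeated closing node
  c(lon = mean(nd$lon), lat = mean(nd$lat))
}

is_closed(nodes)
vertex_mean(nodes)
```

Storing one point per feature and leaving the node table aside is what keeps the first extraction light.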

With this data format, we can now easily manipulate all the POIs, or select specific keys, or specific values, or both.

# filter by key
data[key=="amenity",] |> head() |> 
  knitr::kable(digits = 3)

id	key	value	lon	lat	attrs
227734022	amenity	fuel	10.513	43.850	Api-Ip , Q646807 , en:Anonima Petroli Italiana , yes , yes , yes , IP , Netti Santo Nicola , 28234 , MISE - Ministero Sviluppo Economico
227808627	amenity	fuel	10.495	43.841	Esso , Q867662 , yes , yes , Esso , Ciervo Giovanni, 37277
227817156	amenity	fuel	10.495	43.845	Api-Ip , yes , yes , Cobel - Commercio Benzina Lubrificanti - S.R.L. o Cobel - S.R.L., 31551 , no , yes
227817516	amenity	fuel	10.516	43.848	Api-Ip , 2022-09-04 , yes , yes , Giop , Cobel - Commercio Benzina Lubrificanti - S.R.L. o Cobel - S.R.L., yes , yes , yes , 22498 , no , yes
227821623	amenity	fuel	10.516	43.846	Agip , Q377915 , en:Agip , yes , yes , Agip , Fratelli Checchi - S.N.C., 39092
227821625	amenity	fuel	10.516	43.847	Api-Ip , yes , yes , Minghi Marco, 8149 , no , yes

# filter by value
data[value=="restaurant",] |> head() |> 
  knitr::kable(digits = 3)

id	key	value	lon	lat	attrs
1948949391	amenity	restaurant	10.506	43.846	Lucca , IT , 9 , 55100 , Via Anfiteatro , regional , info@osteriabaralla.it , wlan , Osteria Baralla , +39 0583 440240 , no , 1860 , https://www.osteriabaralla.it/, limited
1984554026	amenity	restaurant	10.502	43.843	Lucca , 3 , 55100 , Via della Cervia , Buca di Sant’Antonio , +39 0583 55881 , http://www.bucadisantantonio.it, limited
1987005332	amenity	restaurant	10.506	43.845	Lucca , IT , 38 , 55100 , Piazza Anfiteatro , italian , Osteria del Tortellino
2241788287	amenity	restaurant	10.504	43.838	259 , 55100 , Viale Regina Margherita , Pizzeria La Tana dell’Orco, +39 389 0233234
2837898251	amenity	restaurant	10.507	43.849	42 , info@trattoriagostoemea.it , Trattoria Gosto e Mea , Tu-Sa 11:30-14:30,19:00-24:00; Su 19:00-24:00; Mo closed, Gosto e Mea s.r.l. , +39 0583 1805200 , trattoria , www.trattoriagostoemea.it , limited
2993594810	amenity	restaurant	10.503	43.842	italian , Antica locanda dell’angelo, no

# both: use `|` (or) to combine whole keys with specific values of other keys
data[key=="shop" | value=="restaurant",] |> head() |> 
  knitr::kable(digits = 3)

id	key	value	lon	lat	attrs
248908257	shop	supermarket	10.499	43.849	Lucca , IT , 565 , 55100 , Viale Carlo Del Prete , Esselunga , Esselunga di viale Del Prete , Mo-Sa 07:30-21:00, Su 09:00-14:00, yes , https://www.esselunga.it/ , yes
645006065	shop	coffee	10.505	43.845	NULL
1375297878	shop	bicycle	10.507	43.847	Lucca , IT , 42 , 55100 , Piazza Santa Maria , bicycle_rental , Noleggio bicicletta Antonio Poli, +39 0583 493787 , yes , no , yes
1463854336	shop	optician	10.504	43.844	Ottico Toni, limited
1463859737	shop	books	10.504	43.843	Lucca , 20 , 55100 , Via Roma , Mondadori , Q85355 , en:Arnoldo Mondadori Editore , Mondadori , no , Entrata Via Cenami con campanello assistenza. Entrance Via Cenami with assistance bell.
1531037312	shop	alcohol	10.507	43.846	188 , Via Fillungo, Vinarkía

Filtering 2nd level tags

A dedicated function efficiently filters the secondary tags and returns a data frame in which each searched-for key is added as a column, for all features that have a non-NA match. Under the hood, values are matched using regular expressions, which maximises the chance of finding the desired values in sometimes complicated OSM values.

data |> 
  cppRosm::filter_sec(keys=c("cuisine","takeaway")
                      ,cores=1) |> 
  head() |> 
  knitr::kable(digits = 3)

id	key	value	lon	lat	attrs	cuisine	takeaway
1463854159	amenity	fast_food	10.503	43.844	Lucca , 12 , Via Buia , pizza , Pizza da Felice, no	pizza	NA
1948679522	amenity	cafe	10.502	43.840	Lucca , 74 , 55100 , Via Vittorio Veneto , ice_cream , Gelateria Veneta , Gelateria Veneta SRL , +39 0583 467037 , https://www.gelateriaveneta.net/, limited	ice_cream	NA
1948949391	amenity	restaurant	10.506	43.846	Lucca , IT , 9 , 55100 , Via Anfiteatro , regional , info@osteriabaralla.it , wlan , Osteria Baralla , +39 0583 440240 , no , 1860 , https://www.osteriabaralla.it/, limited	regional	NA
1987005332	amenity	restaurant	10.506	43.845	Lucca , IT , 38 , 55100 , Piazza Anfiteatro , italian , Osteria del Tortellino	italian	NA
1987005335	amenity	cafe	10.506	43.845	ice_cream	ice_cream	NA
2969988634	amenity	bar	10.503	43.839	18 , Via Francesco Carrara , japanese;sushi;coffee_shop;italian, Lelemento , Origami , Origami , Origami	japanese;sushi;coffee_shop;italian	NA

data |> 
  cppRosm::filter_sec(keys=list("cuisine"=c("japanese","pizza")
                                ,"takeaway"=c("yes"))
                      ,cores = 1) |> 
  head() |> 
  knitr::kable(digits = 3)

## ℹ Searching for exact key~value matches.

id	key	value	lon	lat	attrs	cuisine	takeaway
1463854159	amenity	fast_food	10.503	43.844	Lucca , 12 , Via Buia , pizza , Pizza da Felice, no	pizza	NA
2969988634	amenity	bar	10.503	43.839	18 , Via Francesco Carrara , japanese;sushi;coffee_shop;italian, Lelemento , Origami , Origami , Origami	japanese;sushi;coffee_shop;italian	NA
2993594823	amenity	restaurant	10.506	43.845	Lucca , IT , 51 , 55100 , Piazza Anfiteatro, pizza , L’angolo tondo	pizza	NA
2993594832	amenity	restaurant	10.503	43.843	pizza , Pellegrini	pizza	NA
2993598933	amenity	restaurant	10.502	43.843	italian;pizza, yes , Piccolo Mondo, yes	italian;pizza	NA
2993609216	amenity	restaurant	10.503	43.842	Lucca , 16 , 55100 , Piazza Napoleone, pizza , Fuori di piazza , Mo-Sa , yes , limited	pizza	NA
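The reason regular expressions help with multi-valued tags such as `italian;pizza` can be shown with a small base R sketch. The toy table, its values and the helper `get_attr` are all invented for illustration; this mimics the idea behind filter_sec and is not its actual implementation.

```r
# Toy table with secondary tags stored as named lists in `attrs`
# (ids and tag values are invented).
pois <- data.frame(id = 1:4,
                   value = c("restaurant", "bar", "cafe", "restaurant"))
pois$attrs <- list(
  list(cuisine = "italian;pizza", takeaway = "yes"),
  list(cuisine = "japanese;sushi"),
  list(name = "Bar Centrale"),
  list(cuisine = "regional")
)

# Hypothetical helper: pull one secondary key out of `attrs` as a column,
# with NA where the feature does not carry that key.
get_attr <- function(attrs, key) {
  vapply(attrs,
         function(a) if (is.null(a[[key]])) NA_character_ else a[[key]],
         character(1))
}

pois$cuisine <- get_attr(pois$attrs, "cuisine")

# Keep every feature that carries the key at all.
has_cuisine <- pois[!is.na(pois$cuisine), ]

# Keep features whose value matches any of the requested values.
pattern <- paste(c("japanese", "pizza"), collapse = "|")
hits <- pois[!is.na(pois$cuisine) & grepl(pattern, pois$cuisine), ]
hits$id
```

An exact comparison such as `cuisine == "pizza"` would miss `italian;pizza`, whereas the pattern match finds it.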

Geometries

In a lot of cases, knowing the centroids of POIs is more than enough:

data |> 
  sf::st_as_sf(coords=c("lon","lat"),crs=4326) |> 
  sf::st_geometry() |> 
  plot(pch=19)

# or same result, keeping the original data as a `data.table`
# data |> 
#   cppRosm::construct_geom() |> 
#   sf::st_as_sf() |> 
#   sf::st_geometry() |> 
#   plot()

But there will be times when the original geometry is of interest. In this case, use the construct_geom function:

data_geom <- data |> 
  cppRosm::construct_geom(complete = TRUE,cores = 1) |> 
  sf::st_as_sf()

data_geom |> 
  sf::st_geometry() |> 
  plot(pch=19)

Buildings

As discussed earlier, the tags with the \(building\) key arguably fall into neither category, and in that sense they constitute a data layer, just like the road network. They can still be queried and extracted with the extract_data function.

buildings <- cppRosm::extract_data(test_file,main_keys = "building")

This function runs in a fraction of a second and extracts all the buildings from the file. Again, only the centroid is directly accessible:

buildings |> 
  cppRosm::construct_geom() |>
  sf::st_set_geometry("geometry") |>
  sf::st_geometry() |> 
  plot(pch=19)

And we can reconstruct the geometries from the data hidden in attrs as follows:

buildings |> 
  cppRosm::construct_geom(complete = TRUE,cores=1) |> 
  sf::st_as_sf() |> 
  sf::st_geometry() |> 
  plot(pch=19)

We can filter for specific tags:

buildings |> 
  cppRosm::filter_sec(keys=c("shop","amenity")) |> 
  dplyr::select(!attrs) |> 
  head() |> 
  knitr::kable(digits = 3)

id	key	value	lon	lat	amenity	shop
35164444	building	church	10.503	43.843	place_of_worship	NA
48981467	building	church	10.502	43.842	place_of_worship	NA
48981475	building	church	10.500	43.843	place_of_worship	NA
48981487	building	church	10.502	43.845	place_of_worship	NA
48981495	building	church	10.509	43.842	place_of_worship	NA
48981496	building	church	10.508	43.846	place_of_worship	NA

We observe the mutually exclusive nature of the \(amenity\) and \(shop\) keys, which supports the earlier discussion and the separate treatment of the \(building\) key. The only intersection of the two keys is the bar-tabac, well known in France, which is both a bar and a place to buy tobacco/cigarettes.

More specific filtering:

buildings |> 
  cppRosm::filter_sec(keys = list("tourism" = c("")
                         ,"abandoned" = c("yes"))) |> 
  dplyr::select(!attrs) |> 
  head() |> 
  knitr::kable(digits = 3)

## ℹ Searching for exact key~value matches.

id	key	value	lon	lat	tourism	abandoned
120076875	building	tower	10.507	43.844	viewpoint	NA
141364087	building	yes	10.512	43.845	museum	NA
180148954	building	yes	10.503	43.846	attraction	NA
180148963	building	yes	10.501	43.841	attraction	NA
180148964	building	public	10.503	43.840	attraction	NA
180148975	building	church	10.504	43.843	attraction	NA

If you are searching for a specific \(key=value\) in one tag and for all values of another tag, consider the trick above: pass an empty string as the value of the key for which all values are wanted.
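Why an empty string can stand for "any value" is easy to check in base R, assuming values are matched as regular expressions as described in the filtering section. The vector below is invented for illustration.

```r
# Invented values of a secondary key; NA marks features without that key.
tourism <- c("viewpoint", "museum", NA, "attraction")

# The empty pattern matches every non-missing string ...
any_value <- grepl("", tourism)

# ... whereas a concrete pattern keeps only the matching values.
only_museum <- grepl("museum", tourism)

any_value
only_museum
```

Features that do not carry the key at all stay excluded, since a missing value never matches.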

It’s generally not recommended to extract the building layer together with other main keys.

Alternatives

osmdata: while this is a great package that I have used a lot, I found it frustrating to have the different geometries gathered into separate tables, each containing a huge number of NA columns. The attrs column in a cppRosm table provides an alternative way to store all the secondary tags. Additionally, all the geometries are simplified to their centroid, but the original ones can be reconstructed with the construct_geom function.
osmextract: great for working with large OSM files. In my opinion, though, the filtering of features remains a bit obscure: it seems to provide great flexibility, but requires a good knowledge of OSM internals.

Conclusion

This vignette aimed to explain the approach taken for manipulating OSM data at scale and with flexibility, through a specific data table format and supporting functions. Please reach out with recommendations, feature requests, etc.

Another vignette will cover a recommended workflow, as this package was developed with a few others in mind (mainly rosmium and cppRouting), so that network and POI analysis can be done at scale in a local setup.