
Introduction

Anyone who has used OpenStreetMap (OSM) data in R or Python might agree that the tagging system, while it provides great flexibility and extensive classification options, can be challenging to work with. This vignette explains an approach to the tagging system that aims to make it easy to interact with features, their associated tags and their geometries. The main ambition is to provide easy access to large OSM data sets.

OSM tags

A tag is a pair of values: the first element is called the \(key\), and the second is the corresponding \(value\). Together they form a tag, written as \(key=value\). The key corresponds to a broad classification, while the value is specific. For example, in \(amenity=restaurant\) the \(key\) is \(amenity\) and the \(value\) is \(restaurant\). Generally, each key has a number of possible values.

Every OSM feature consists of a set of nodes and a set of tags. The nodes are geographically referenced, which means they have associated coordinates from which the geometry of a feature can be reconstructed. A detailed list of possible values, along with tagging recommendations, is available in the following article: OSM map features.
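As a minimal sketch in base R (not part of cppRosm), the tags of a single feature can be pictured as a named character vector, with the keys as names and the values as elements. The feature and the helper `parse_tag()` below are made up for illustration.

```r
# Hypothetical feature: keys are the names, values are the elements.
tags <- c(
  amenity       = "restaurant",
  cuisine       = "italian",
  `addr:street` = "Via Roma"
)

# Look up the value stored under a given key.
tags[["amenity"]]

# Write every tag in the usual key=value form.
tag_strings <- paste(names(tags), tags, sep = "=")
tag_strings

# Hypothetical helper: parse a key=value string back into its two parts.
# Only the first "=" is used as separator, since values may contain "=".
parse_tag <- function(x) {
  pos <- regexpr("=", x, fixed = TRUE)
  c(key = substr(x, 1, pos - 1), value = substr(x, pos + 1, nchar(x)))
}
parse_tag("amenity=restaurant")
```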

Different data types

To develop appropriate tools for working with OSM data, it is important to look at what kind of data is actually there and in what format it is most useful. Two main types of data can be distinguished:

Network

First, one main use of OSM data is to obtain connected and routable road networks. An easy-to-use function called extract_graph does that; see the corresponding vignette for details. The road network data is grouped under the \(highway\) key. All categories of roads (residential, motorway, pedestrian, etc.) are values of this key.

Non network

The rest of the data in OSM is usually represented by either a point or a polygon with its set of associated tags. A feature can have multiple tags, and one might be interested in the values of a specific one. There is, however, an intrinsic hierarchy in the tags, which is useful for extracting data in a user-friendly and exploitable way.

Tagging hierarchy

This package proposes a two-level hierarchy of tags. It helps extract the data into large data.tables or data.frames in which the high-level tags are added as column variables, while secondary tags are grouped into named lists and added in a separate column called attrs. The main consideration is that some tags add information to each other and are complementary for a feature; these are grouped in the 2nd level of this schema. Other tags are mutually exclusive: for example, a feature with an \(amenity\) key will generally not have a \(healthcare\) key, since these are different types of features. Both can nevertheless have the same 2nd level keys, such as the address, name, phone number or any other specific information.

1st level

The first level corresponds to the main tags. Those are taken from the following list:

 main_first_level <- c(
    "amenity"
    ,"craft"
    ,"healthcare"
    ,"historic"
    ,"sport"
    ,"natural"
    ,"shop"
    ,"tourism"
  )

2nd level

Everything else is left to the second level of tags, for example: \(addr:street\) is a specific key for the address of a feature. It is generally scarce. \(leisure\) is a key that overlaps a lot with the \(sport\) key. It generally contains more specific information on the type of sport.
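The two-level split can be sketched in base R as follows. This is only an illustration of the idea, not cppRosm's implementation: the helper `split_tags()` and the example feature are hypothetical, and taking the first main key when several are present is a simplifying assumption.

```r
# First-level keys, copied from the list above.
main_first_level <- c("amenity", "craft", "healthcare", "historic",
                      "sport", "natural", "shop", "tourism")

# Hypothetical feature with one main tag and three secondary tags.
tags <- c(amenity = "restaurant", cuisine = "italian",
          `addr:street` = "Via Roma", takeaway = "yes")

# Hypothetical helper: split the tags of one feature into a main
# (key, value) pair and a named list of secondary tags.
split_tags <- function(tags, main_keys = main_first_level) {
  is_main <- names(tags) %in% main_keys
  main <- tags[is_main]
  list(
    key   = if (length(main) > 0) names(main)[1] else NA_character_,
    value = if (length(main) > 0) unname(main[1]) else NA_character_,
    attrs = as.list(tags[!is_main])
  )
}

feature <- split_tags(tags)
feature$key           # "amenity"
feature$value         # "restaurant"
names(feature$attrs)  # "cuisine" "addr:street" "takeaway"
```

The main pair maps onto the key and value columns of the extracted table, and the named list onto one cell of the attrs column.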

Explanation

OSM data is crowdsourced, so there is an exception somewhere to any rule that can be imposed on the data. One can still observe certain patterns that are generally true, and this is what the two-level hierarchy of tags aims to capture. Some observations are:

1st level (amenity, shop, tourism etc):

  • Tend to be mutually exclusive

  • Provide a certain amount of information, but remain sufficiently broad.

2nd level (addr:name, addr:street, cuisine, takeaway):

  • a lot more specific

  • complement each other

Arguably, some tags don’t fall into either of these categories. You can still extract them with the extract_data function. One such tag is the one with the \(building\) key, which qualifies more as a layer of data, much like the road network.

Example

By default, only tags with keys \(amenity,shop,tourism\) will be extracted.

## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
test_file <- system.file(package = 'cppRosm','extdata','map.osm')

data <- cppRosm::extract_data(test_file)
## → Using main keys `amenity`,`shop`,`tourism`.
head(data) |> 
  knitr::kable(digits = 3)
| id | key | value | lon | lat | attrs |
|---|---|---|---|---|---|
| 227734022 | amenity | fuel | 10.513 | 43.850 | Api-Ip , Q646807 , en:Anonima Petroli Italiana , yes , yes , yes , IP , Netti Santo Nicola , 28234 , MISE - Ministero Sviluppo Economico |
| 227808627 | amenity | fuel | 10.495 | 43.841 | Esso , Q867662 , yes , yes , Esso , Ciervo Giovanni, 37277 |
| 227817156 | amenity | fuel | 10.495 | 43.845 | Api-Ip , yes , yes , Cobel - Commercio Benzina Lubrificanti - S.R.L. o Cobel - S.R.L., 31551 , no , yes |
| 227817516 | amenity | fuel | 10.516 | 43.848 | Api-Ip , 2022-09-04 , yes , yes , Giop , Cobel - Commercio Benzina Lubrificanti - S.R.L. o Cobel - S.R.L., yes , yes , yes , 22498 , no , yes |
| 227821623 | amenity | fuel | 10.516 | 43.846 | Agip , Q377915 , en:Agip , yes , yes , Agip , Fratelli Checchi - S.N.C., 39092 |
| 227821625 | amenity | fuel | 10.516 | 43.847 | Api-Ip , yes , yes , Minghi Marco, 8149 , no , yes |

The geometry is simplified to its centroid, whose coordinates are provided in the lon and lat columns of the data table. This is meant to save memory, especially for large data sets. A function to reconstruct the full geometry is provided, but since the full geometry is not always useful, it is omitted in a first extraction. The nodes that constitute a full geometry, if it is more complex than a point, are added to the attrs column as a data.frame. These geometries are always closed, meaning they are polygons.
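To make the idea of a closed geometry concrete, here is a small base R sketch on made-up nodes. The helper `close_ring()` is hypothetical, and the vertex average is used only as a simple stand-in for a centroid; how cppRosm computes its centroids is not covered here.

```r
# Made-up nodes of a small rectangular building (open ring: the first
# node is not repeated at the end).
nodes <- data.frame(
  lon = c(10.500, 10.502, 10.502, 10.500),
  lat = c(43.840, 43.840, 43.842, 43.842)
)

# Hypothetical helper: repeat the first node at the end if needed,
# so that the ring is closed and describes a polygon.
close_ring <- function(nodes) {
  n <- nrow(nodes)
  is_closed <- nodes$lon[1] == nodes$lon[n] && nodes$lat[1] == nodes$lat[n]
  if (!is_closed) nodes <- rbind(nodes, nodes[1, ])
  rownames(nodes) <- NULL
  nodes
}

ring <- close_ring(nodes)
nrow(ring)

# Simple point summary: the average of the distinct vertices.
# This is not an area-weighted centroid.
centre <- colMeans(nodes)
centre
```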

With this data format, we can now easily manipulate all the POIs, or select specific keys, or specific values, or both.

# filter by key
data[key=="amenity",] |> head() |> 
  knitr::kable(digits = 3)
| id | key | value | lon | lat | attrs |
|---|---|---|---|---|---|
| 227734022 | amenity | fuel | 10.513 | 43.850 | Api-Ip , Q646807 , en:Anonima Petroli Italiana , yes , yes , yes , IP , Netti Santo Nicola , 28234 , MISE - Ministero Sviluppo Economico |
| 227808627 | amenity | fuel | 10.495 | 43.841 | Esso , Q867662 , yes , yes , Esso , Ciervo Giovanni, 37277 |
| 227817156 | amenity | fuel | 10.495 | 43.845 | Api-Ip , yes , yes , Cobel - Commercio Benzina Lubrificanti - S.R.L. o Cobel - S.R.L., 31551 , no , yes |
| 227817516 | amenity | fuel | 10.516 | 43.848 | Api-Ip , 2022-09-04 , yes , yes , Giop , Cobel - Commercio Benzina Lubrificanti - S.R.L. o Cobel - S.R.L., yes , yes , yes , 22498 , no , yes |
| 227821623 | amenity | fuel | 10.516 | 43.846 | Agip , Q377915 , en:Agip , yes , yes , Agip , Fratelli Checchi - S.N.C., 39092 |
| 227821625 | amenity | fuel | 10.516 | 43.847 | Api-Ip , yes , yes , Minghi Marco, 8149 , no , yes |
# filter by value
data[value=="restaurant",] |> head() |> 
  knitr::kable(digits = 3)
| id | key | value | lon | lat | attrs |
|---|---|---|---|---|---|
| 1948949391 | amenity | restaurant | 10.506 | 43.846 | Lucca , IT , 9 , 55100 , Via Anfiteatro , regional , , wlan , Osteria Baralla , +39 0583 440240 , no , 1860 , https://www.osteriabaralla.it/, limited |
| 1984554026 | amenity | restaurant | 10.502 | 43.843 | Lucca , 3 , 55100 , Via della Cervia , Buca di Sant’Antonio , +39 0583 55881 , http://www.bucadisantantonio.it, limited |
| 1987005332 | amenity | restaurant | 10.506 | 43.845 | Lucca , IT , 38 , 55100 , Piazza Anfiteatro , italian , Osteria del Tortellino |
| 2241788287 | amenity | restaurant | 10.504 | 43.838 | 259 , 55100 , Viale Regina Margherita , Pizzeria La Tana dell’Orco, +39 389 0233234 |
| 2837898251 | amenity | restaurant | 10.507 | 43.849 | 42 , , Trattoria Gosto e Mea , Tu-Sa 11:30-14:30,19:00-24:00; Su 19:00-24:00; Mo closed, Gosto e Mea s.r.l. , +39 0583 1805200 , trattoria , www.trattoriagostoemea.it , limited |
| 2993594810 | amenity | restaurant | 10.503 | 43.842 | italian , Antica locanda dell’angelo, no |
# both: use `|` (or) to combine a whole key with specific values of other keys
data[key=="shop" | value=="restaurant",] |> head() |> 
  knitr::kable(digits = 3)
| id | key | value | lon | lat | attrs |
|---|---|---|---|---|---|
| 248908257 | shop | supermarket | 10.499 | 43.849 | Lucca , IT , 565 , 55100 , Viale Carlo Del Prete , Esselunga , Esselunga di viale Del Prete , Mo-Sa 07:30-21:00, Su 09:00-14:00, yes , https://www.esselunga.it/ , yes |
| 645006065 | shop | coffee | 10.505 | 43.845 | NULL |
| 1375297878 | shop | bicycle | 10.507 | 43.847 | Lucca , IT , 42 , 55100 , Piazza Santa Maria , bicycle_rental , Noleggio bicicletta Antonio Poli, +39 0583 493787 , yes , no , yes |
| 1463854336 | shop | optician | 10.504 | 43.844 | Ottico Toni, limited |
| 1463859737 | shop | books | 10.504 | 43.843 | Lucca , 20 , 55100 , Via Roma , Mondadori , Q85355 , en:Arnoldo Mondadori Editore , Mondadori , no , Entrata Via Cenami con campanello assistenza. Entrance Via Cenami with assistance bell. |
| 1531037312 | shop | alcohol | 10.507 | 43.846 | 188 , Via Fillungo, Vinarkía |

Filtering 2nd level tags

A function to efficiently filter the secondary tags returns a data frame in which each searched-for key is added as a column for every feature that has a non-NA match. Under the hood, values are matched using regular expressions, which maximises the chance of finding the desired values in sometimes complicated OSM values.

data |> 
  cppRosm::filter_sec(keys=c("cuisine","takeaway")
                      ,cores=1) |> 
  head() |> 
  knitr::kable(digits = 3)
| id | key | value | lon | lat | attrs | cuisine | takeaway |
|---|---|---|---|---|---|---|---|
| 1463854159 | amenity | fast_food | 10.503 | 43.844 | Lucca , 12 , Via Buia , pizza , Pizza da Felice, no | pizza | NA |
| 1948679522 | amenity | cafe | 10.502 | 43.840 | Lucca , 74 , 55100 , Via Vittorio Veneto , ice_cream , Gelateria Veneta , Gelateria Veneta SRL , +39 0583 467037 , https://www.gelateriaveneta.net/, limited | ice_cream | NA |
| 1948949391 | amenity | restaurant | 10.506 | 43.846 | Lucca , IT , 9 , 55100 , Via Anfiteatro , regional , , wlan , Osteria Baralla , +39 0583 440240 , no , 1860 , https://www.osteriabaralla.it/, limited | regional | NA |
| 1987005332 | amenity | restaurant | 10.506 | 43.845 | Lucca , IT , 38 , 55100 , Piazza Anfiteatro , italian , Osteria del Tortellino | italian | NA |
| 1987005335 | amenity | cafe | 10.506 | 43.845 | ice_cream | ice_cream | NA |
| 2969988634 | amenity | bar | 10.503 | 43.839 | 18 , Via Francesco Carrara , japanese;sushi;coffee_shop;italian, Lelemento , Origami , Origami , Origami | japanese;sushi;coffee_shop;italian | NA |
data |> 
  cppRosm::filter_sec(keys=list("cuisine"=c("japanese","pizza")
                                ,"takeaway"=c("yes"))
                      ,cores = 1) |> 
  head() |> 
  knitr::kable(digits = 3)
##  Searching for exact key~value matches.
| id | key | value | lon | lat | attrs | cuisine | takeaway |
|---|---|---|---|---|---|---|---|
| 1463854159 | amenity | fast_food | 10.503 | 43.844 | Lucca , 12 , Via Buia , pizza , Pizza da Felice, no | pizza | NA |
| 2969988634 | amenity | bar | 10.503 | 43.839 | 18 , Via Francesco Carrara , japanese;sushi;coffee_shop;italian, Lelemento , Origami , Origami , Origami | japanese;sushi;coffee_shop;italian | NA |
| 2993594823 | amenity | restaurant | 10.506 | 43.845 | Lucca , IT , 51 , 55100 , Piazza Anfiteatro, pizza , L’angolo tondo | pizza | NA |
| 2993594832 | amenity | restaurant | 10.503 | 43.843 | pizza , Pellegrini | pizza | NA |
| 2993598933 | amenity | restaurant | 10.502 | 43.843 | italian;pizza, yes , Piccolo Mondo, yes | italian;pizza | NA |
| 2993609216 | amenity | restaurant | 10.503 | 43.842 | Lucca , 16 , 55100 , Piazza Napoleone, pizza , Fuori di piazza , Mo-Sa , yes , limited | pizza | NA |
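The principle of turning a secondary key into a column and matching its values with a regular expression can be sketched in base R. This is an illustration under assumptions, not cppRosm's implementation: attrs is taken to be a list of named lists, the table is made up, and the helper `pull_key()` is hypothetical.

```r
# Mock table with a list column of secondary tags (made-up features).
pois <- data.frame(id = 1:3, value = c("restaurant", "bar", "cafe"))
pois$attrs <- list(
  list(cuisine = "italian;pizza", takeaway = "yes"),
  list(cuisine = "japanese;sushi"),
  list(name = "Bar Centrale")
)

# Hypothetical helper: pull one secondary key out of `attrs` as a column,
# returning NA where the key is absent.
pull_key <- function(attrs, key) {
  vapply(attrs, function(a) {
    if (is.null(a[[key]])) NA_character_ else as.character(a[[key]])
  }, character(1))
}

pois$cuisine <- pull_key(pois$attrs, "cuisine")

# Keep features whose cuisine matches any of the patterns. A regular
# expression finds "pizza" inside the multi-valued "italian;pizza".
pattern <- paste(c("japanese", "pizza"), collapse = "|")
matches <- pois[!is.na(pois$cuisine) & grepl(pattern, pois$cuisine), ]
matches$id
```

Matching by pattern rather than by strict equality is what lets semicolon-separated values such as "italian;pizza" be found when searching for "pizza".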

Geometries

In a lot of cases, knowing the centroids of POIs is more than enough:

data |> 
  sf::st_as_sf(coords=c("lon","lat"),crs=4326) |> 
  sf::st_geometry() |> 
  plot(pch=19)

# or, for the same result while keeping the original data as a `data.table`
# data |> 
#   cppRosm::construct_geom() |> 
#   sf::st_as_sf() |> 
#   sf::st_geometry() |> 
#   plot()

But there will be times when the original geometry is of interest. In that case, use the construct_geom function:

data_geom <- data |> 
  cppRosm::construct_geom(complete = TRUE,cores = 1) |> 
  sf::st_as_sf()

data_geom |> 
  sf::st_geometry() |> 
  plot(pch=19)

Buildings

As discussed earlier, the tags with the \(building\) key arguably fall into neither category, and in that sense they constitute a data layer, just like the road network. They can still be queried and extracted with the extract_data function.

buildings <- cppRosm::extract_data(test_file,main_keys = "building")

This function runs in a fraction of a second and extracts all the buildings from the file. Again, only the centroid is directly accessible:

buildings |> 
  cppRosm::construct_geom() |>
  sf::st_set_geometry("geometry") |>
  sf::st_geometry() |> 
  plot(pch=19)

And we can reconstruct the geometries from the data hidden in attrs as follows:

buildings |> 
  cppRosm::construct_geom(complete = TRUE,cores=1) |> 
  sf::st_as_sf() |> 
  sf::st_geometry() |> 
  plot(pch=19)

We can filter for specific tags:

buildings |> 
  cppRosm::filter_sec(keys=c("shop","amenity")) |> 
  dplyr::select(!attrs) |> 
  head() |> 
  knitr::kable(digits = 3)
| id | key | value | lon | lat | amenity | shop |
|---|---|---|---|---|---|---|
| 35164444 | building | church | 10.503 | 43.843 | place_of_worship | NA |
| 48981467 | building | church | 10.502 | 43.842 | place_of_worship | NA |
| 48981475 | building | church | 10.500 | 43.843 | place_of_worship | NA |
| 48981487 | building | church | 10.502 | 43.845 | place_of_worship | NA |
| 48981495 | building | church | 10.509 | 43.842 | place_of_worship | NA |
| 48981496 | building | church | 10.508 | 43.846 | place_of_worship | NA |

We observe the mutually exclusive nature of the \(amenity\) and \(shop\) keys, which justifies the earlier discussion and the separate treatment of the \(building\) key. The only intersection of the two keys is the bar-tabac, well known in France, which is both a bar and a place to buy tobacco or cigarettes.

More specific filtering:

buildings |> 
  cppRosm::filter_sec(keys = list("tourism" = c("")
                         ,"abandoned" = c("yes"))) |> 
  dplyr::select(!attrs) |> 
  head() |> 
  knitr::kable(digits = 3)
##  Searching for exact key~value matches.
| id | key | value | lon | lat | tourism | abandoned |
|---|---|---|---|---|---|---|
| 120076875 | building | tower | 10.507 | 43.844 | viewpoint | NA |
| 141364087 | building | yes | 10.512 | 43.845 | museum | NA |
| 180148954 | building | yes | 10.503 | 43.846 | attraction | NA |
| 180148963 | building | yes | 10.501 | 43.841 | attraction | NA |
| 180148964 | building | public | 10.503 | 43.840 | attraction | NA |
| 180148975 | building | church | 10.504 | 43.843 | attraction | NA |

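To search for a specific \(key=value\) in one tag and all values of another tag, consider the trick above: pass an empty string as the value for the key whose values should all be kept, as done for \(tourism\).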
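One way to see why an empty string can act as a wildcard is base R's own pattern matching, where an empty pattern matches every non-missing string. Whether filter_sec relies on exactly this mechanism is an assumption; the values below are made up.

```r
# Made-up values of a secondary key; NA means the key is absent.
tourism <- c("viewpoint", "museum", NA)

# An empty pattern matches every string, so only missing values drop out.
keep_any <- grepl("", tourism)
keep_any

# A non-empty pattern keeps only the values that contain it.
abandoned <- c("yes", "no", NA)
keep_yes <- grepl("yes", abandoned)
keep_yes
```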

  • It’s generally not recommended to extract the building layer together with other main keys.

Alternatives

  • osmdata: while this is a great package that I have used a lot, I found it frustrating to have all the different geometries gathered into separate tables, with each table containing huge numbers of NA columns. The attrs column in a cppRosm table provides an alternative way to store all the secondary tags. Additionally, all the geometries are simplified to their centroid, but the possibility to reconstruct the original ones is provided by the construct_geom function.
  • osmextract: great for working with large OSM files. In my opinion, though, the filtering of features remains a bit obscure: it seems to provide great flexibility but requires a good knowledge of OSM internals.

Conclusion

This vignette aimed to explain the approach taken for manipulating OSM data at scale and with flexibility through a specific data table format and supporting functions. Please reach out with recommendations, feature requests, etc.

In another vignette, a recommended workflow will be covered. This package was developed with a few others in mind (mainly rosmium and cppRouting), so that network and POI analysis can be done at scale in a local setup.