Querying OpenStreetMap with Amazon Athena
OpenStreetMap (OSM) is a free, editable map of the world, created and maintained by volunteers and available for use under an open license. Companies and non-profits like Mapbox, Foursquare, Mapzen, the World Bank, the American Red Cross and others use OSM to provide maps, directions, and geographic context to users around the world.
As an open content project, regular OSM data archives are made available to the public via planet.openstreetmap.org in a few different formats (XML, PBF). This includes both snapshots of the current state of data in OSM as well as historical archives.
Working with “the planet” (as the data archives are referred to) can be unwieldy. Because it contains data spanning the entire world, the size of a single archive is on the order of 50 GB. The format is bespoke and extremely specific to OSM. The data is incredibly rich, interesting, and useful, but the size, format, and tooling can often make it very difficult to even start the process of asking complex questions.
However, Amazon can help you to work with this massive database. AWS has the right “tool” an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL and it’s called Amazon Athena!
- Amazon Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
- Easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL.
- Fast: Most results are delivered within seconds. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.
The OSM Public Datasets consist of three tables: planet; planet_history and changesets. (In this blog post you can find the way how to register the data tables):
- Planet: Contains the current versions of all elements present in OSM.
- Planet_history: Contains a historical record of all versions of all elements (even those that have been deleted).
- Changesets: Contains information about changesets in which elements were modified (and which have a foreign key relationship to both the planet and planet_history tables).

Let’s see an example
Tanzania Development Trust has a mapping project called Crowd2map Tanzania. Working with the HOT team, Crowd2map Tanzania (hereafter C2M) volunteers added millions of buildings to OpenStreetMap. Each “changeset” was uploaded with the #tanzaniadevelopmenttrust hashtag, which can help us in data management and querying. Now, the organisation needs all the data tagged as “amenity” and sent in a changeset in which the #tanzaniadevelopmenttrust hashtag was also used. The query will be something like this:

The running time, in this case will be less than 23 seconds while the entire “OSM planet” data will be scanned – which is 78.83 GB in May 2021. The result contains all nodes that meet the request criteria and all available information about it: its spatial position, when it was created, who created it, the number of changeset in which it was uploaded, what “tag” was used, and so on.
Here are some queries that can be useful:
Number of changeset per user, where the comment is #tanzaniadevelopmenttrust
SELECT user,
count(*) AS changesets
FROM changesets
WHERE regexp_like(tags['comment'], '(?i)#tanzaniadevelopmenttrust')
GROUP BY user
ORDER BY count(*) DESC;
Number of user who added the at least 1 #tanzaniadevelopmenttrust related changeset
SELECT count(DISTINCT(uid)) AS "Number of volunteers"
FROM changesets
WHERE lower(changesets.tags['comment']) LIKE '%tanzaniadevelopmenttrust%'
Number of user who added the at least 1 #tanzaniadevelopmenttrust related changeset since 1st of January 2020
SELECT count(distinct(uid))
FROM changesets
WHERE lower(changesets.tags['comment']) LIKE '%tanzaniadevelopmenttrust%'
AND changesets.created_at > date '2020-01-01'
Note
If you need an intervall You can use in this way:
AND changesets.created_at >= date '2020-01-01' AND changesets.created_at < date '2020-10-01'
Number of user who added the more than 20 #tanzaniadevelopmenttrust related changeset
SELECT count(uid) FROM (
SELECT uid, count(id) AS num_changesets
FROM changesets
WHERE lower(changesets.tags['comment']) LIKE '%tanzaniadevelopmenttrust%'
GROUP BY uid
) WHERE num_changesets > 20
All user - name; first edit; last edit; calculated lifespan and total edits
SELECT
changesets.user,
min(created_at) AS first_edit,
max(created_at) AS last_edit,
date_diff('day', min(created_at), max(created_at)) AS lifespan,
date_diff('day', max(created_at), current_date) AS offline_since,
sum(num_changes) AS total_edits
FROM
changesets
WHERE
lower(changesets.tags['comment']) LIKE '%tanzaniadevelopmenttrust%' -- hotosm changesets only
GROUP BY
changesets.user
ORDER BY lifespan DESC
All user - name; first edit; last edit; calculated lifespan and total edits but searching the app was used aka “created_by”
SELECT
changesets.user,
min(created_at) AS first_edit,
max(created_at) AS last_edit,
date_diff('day', min(created_at), max(created_at)) AS lifespan,
sum(num_changes) AS total_edits
FROM
changesets
WHERE
lower(changesets.tags['created_by']) LIKE '%maps.me%'
GROUP BY
changesets.user
ORDER BY lifespan DESC
Number of village, city, etc. was added with “TNZ_missing_villages” hashtag
SELECT COUNT(planet.changeset)
FROM planet
JOIN changesets ON planet.changeset = changesets.id
WHERE regexp_like(changesets.tags['comment'], '(?i)#TNZ_missing_villages') AND planet.type = 'node' AND planet.tags['place'] IS NOT NULL;
Get all data from OSM in csv, where is a tag “place” and was added under #TNZ_missing_villages
SELECT planet.*, changesets.tags
FROM planet
JOIN changesets ON planet.changeset = changesets.id
WHERE regexp_like(changesets.tags['comment'], '(?i)#TNZ_missing_villages') AND planet.tags['place'] IS NOT NULL;
Number of building which was added under #tanzaniadevelopmenttrust
SELECT COUNT(planet.changeset)
FROM planet
JOIN changesets ON planet.changeset = changesets.id
WHERE regexp_like(changesets.tags['comment'], '(?i)#tanzaniadevelopmenttrust') AND planet.type = 'way' AND planet.tags['building'] IS NOT NULL;
Get all building as csv which was added under #tanzaniadevelopmenttrust
SELECT planet.*, changesets.tags
FROM planet
JOIN changesets ON planet.changeset = changesets.id
WHERE regexp_like(changesets.tags['comment'], '(?i)#tanzaniadevelopmenttrust') AND planet.type = 'way' AND planet.tags['building'] IS NOT NULL;
Get all amenity was added under #tanzaniadevelopmenttrust
SELECT planet.*,
changesets.tags
FROM planet
JOIN changesets
ON planet.changeset = changesets.id
WHERE regexp_like(changesets.tags['comment'], '(?i)#tanzaniadevelopmenttrust') AND planet.tags['amenity'] IS NOT NULL;
Get all amenity within the box enclosing Tanzania
SELECT * from planet
WHERE type = 'node'
AND tags['amenity'] IN ('hospital', 'clinic', 'doctors')
AND lon BETWEEN 29.3269 AND 40.6584
AND lat BETWEEN -11.7612 AND -0.9854;
MARA BOX
AND lon BETWEEN 33.1411 AND 35.28625
AND lat BETWEEN -0.99878 AND -2.50865;
Note: you can get spatial coordinates for a drawn box from here: http://bboxfinder.com/
Get all amenity within the box enclosing Tanzania box - used maps.me
SELECT planet.*,
changesets.tags
FROM planet
JOIN changesets
ON planet.changeset = changesets.id
WHERE lower(changesets.tags['created_by']) LIKE '%maps.me%'
AND planet.tags['amenity'] IS NOT NULL
AND lon BETWEEN 29.3269 AND 40.6584
AND lat BETWEEN -11.7612 AND -0.9854;
List all changeset tags which contains #tanzaniadevelopmenttrust
SELECT changesets.tags
FROM changesets
WHERE lower(changesets.tags['comment']) LIKE '%tanzaniadevelopmenttrust%'
Get all comments + timeset and user, wich contains #tanzaniadevelopmenttrust
SELECT changesets.tags,
created_at AS timestamp,
user AS user
FROM changesets
WHERE lower(changesets.tags['comment']) LIKE '%tanzaniadevelopmenttrust%'
Look for the app like maps.me was used for mapping
SELECT changesets.user,
min(created_at) AS first_edit,
max(created_at) AS last_edit,
date_diff('day', min(created_at), max(created_at)) AS lifespan, sum(num_changes) AS total_edits
FROM changesets
WHERE lower(changesets.tags['created_by']) LIKE '%maps.me%'
AND changesets.created_at > date '2020-01-01'
AND changesets.min_lat decimal(9,7) = -11.7612, changesets.max_lat decimal(9,7) = -0.9854, changesets.min_lon decimal(10,7) = 29.3269, changesets.max_lon decimal(10,7) = 40.6584
GROUP BY changesets.user
ORDER BY lifespan DESC
Guides
AWS Big Data Blog – Querying OpenStreetMap with Amazon Athena