Querying OpenStreetMap with Amazon Athena

Querying OpenStreetMap with Amazon Athena

OpenStreetMap (OSM) is a free, editable map of the world, created and maintained by volunteers and available for use under an open license. Companies and non-profits like Mapbox, Foursquare, Mapzen, the World Bank, the American Red Cross and others use OSM to provide maps, directions, and geographic context to users around the world.

As an open content project, regular OSM data archives are made available to the public via planet.openstreetmap.org in a few different formats (XML, PBF). This includes both snapshots of the current state of data in OSM as well as historical archives.

Working with “the planet” (as the data archives are referred to) can be unwieldy. Because it contains data spanning the entire world, the size of a single archive is on the order of 50 GB. The format is bespoke and extremely specific to OSM. The data is incredibly rich, interesting, and useful, but the size, format, and tooling can often make it very difficult to even start the process of asking complex questions.

However, Amazon can help you to work with this massive database. AWS has the right “tool” an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL and it’s called Amazon Athena!

  • Amazon Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
  • Easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL.
  • Fast: Most results are delivered within seconds. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.

 

The OSM Public Datasets consist of three tables: planet; planet_history and changesets. (In this blog post you can find the way how to register the data tables):

  • Planet: Contains the current versions of all elements present in OSM.
  • Planet_history: Contains a historical record of all versions of all elements (even those that have been deleted).
  • Changesets: Contains information about changesets in which elements were modified (and which have a foreign key relationship to both the planet and planet_history tables).


 

Let’s see an example

Tanzania Development Trust has a mapping project called Crowd2map Tanzania. Working with the HOT team, Crowd2map Tanzania (hereafter C2M) volunteers added millions of buildings to OpenStreetMap. Each “changeset” was uploaded with the #tanzaniadevelopmenttrust hashtag, which can help us in data management and querying. Now, the organisation needs all the data tagged as “amenity” and sent in a changeset in which the #tanzaniadevelopmenttrust hashtag was also used. The query will be something like this:

The running time, in this case will be less than 23 seconds while the entire “OSM planet” data will be scanned – which is 78.83 GB in May 2021. The result contains all nodes that meet the request criteria and all available information about it: its spatial position, when it was created, who created it, the number of changeset in which it was uploaded, what “tag” was used, and so on.


 

Here are some queries that can be useful:

Number of changeset per user, where the comment is #tanzaniadevelopmenttrust

SELECT user,
         count(*) AS changesets
FROM changesets
WHERE regexp_like(tags['comment'], '(?i)#tanzaniadevelopmenttrust')
GROUP BY  user
ORDER BY  count(*) DESC;


Number of user who added the at least 1 #tanzaniadevelopmenttrust related changeset

SELECT count(DISTINCT(uid)) AS "Number of volunteers"
FROM changesets
WHERE lower(changesets.tags['comment']) LIKE '%tanzaniadevelopmenttrust%'


Number of user who added the at least 1 #tanzaniadevelopmenttrust related changeset since 1st of January 2020

SELECT count(distinct(uid))
FROM changesets
WHERE lower(changesets.tags['comment']) LIKE '%tanzaniadevelopmenttrust%'
	AND changesets.created_at > date '2020-01-01'

Note
If you need an intervall You can use in this way:
AND changesets.created_at >= date '2020-01-01' AND changesets.created_at < date '2020-10-01'


Number of user who added the more than 20 #tanzaniadevelopmenttrust related changeset 

SELECT count(uid) FROM (
  SELECT uid, count(id) AS num_changesets
  FROM changesets
  WHERE lower(changesets.tags['comment']) LIKE '%tanzaniadevelopmenttrust%'
  GROUP BY uid
) WHERE num_changesets > 20


All user - name; first edit; last edit; calculated lifespan and total edits 

 SELECT
     changesets.user,
     min(created_at) AS first_edit,
     max(created_at) AS last_edit,
     date_diff('day', min(created_at), max(created_at)) AS lifespan,
     date_diff('day', max(created_at), current_date) AS offline_since,
     sum(num_changes) AS total_edits
 FROM
     changesets
 WHERE
     lower(changesets.tags['comment']) LIKE '%tanzaniadevelopmenttrust%' -- hotosm changesets only
 GROUP BY
     changesets.user
 ORDER BY lifespan DESC


All user - name; first edit; last edit; calculated lifespan and total edits but searching the app was used aka “created_by”

 SELECT
     changesets.user,
     min(created_at) AS first_edit,
     max(created_at) AS last_edit,
     date_diff('day', min(created_at), max(created_at)) AS lifespan,
     sum(num_changes) AS total_edits
 FROM
     changesets
 WHERE
     lower(changesets.tags['created_by']) LIKE '%maps.me%'
 GROUP BY
     changesets.user
 ORDER BY lifespan DESC
 

Number of village, city, etc. was added with “TNZ_missing_villages” hashtag

SELECT COUNT(planet.changeset)
FROM planet
JOIN changesets ON planet.changeset = changesets.id
WHERE regexp_like(changesets.tags['comment'], '(?i)#TNZ_missing_villages') AND planet.type = 'node' AND planet.tags['place'] IS NOT NULL;


Get all data from OSM in csv, where is a tag “place” and  was added under #TNZ_missing_villages 

SELECT planet.*, changesets.tags
FROM planet
JOIN changesets ON planet.changeset = changesets.id
WHERE regexp_like(changesets.tags['comment'], '(?i)#TNZ_missing_villages') AND planet.tags['place'] IS NOT NULL;


Number of building which was added under #tanzaniadevelopmenttrust

SELECT COUNT(planet.changeset)
FROM planet
JOIN changesets ON planet.changeset = changesets.id
WHERE regexp_like(changesets.tags['comment'], '(?i)#tanzaniadevelopmenttrust') AND planet.type = 'way' AND planet.tags['building'] IS NOT NULL;


Get all building as csv which was added under #tanzaniadevelopmenttrust

SELECT planet.*, changesets.tags
FROM planet
JOIN changesets ON planet.changeset = changesets.id
WHERE regexp_like(changesets.tags['comment'], '(?i)#tanzaniadevelopmenttrust') AND planet.type = 'way' AND planet.tags['building'] IS NOT NULL;


Get all amenity was added under #tanzaniadevelopmenttrust

SELECT  planet.*,
         changesets.tags
FROM planet
JOIN changesets
    ON planet.changeset = changesets.id
WHERE regexp_like(changesets.tags['comment'], '(?i)#tanzaniadevelopmenttrust') AND planet.tags['amenity'] IS NOT NULL;


Get all amenity within the box enclosing Tanzania

SELECT * from planet
WHERE type = 'node'
  AND tags['amenity'] IN ('hospital', 'clinic', 'doctors')
  AND lon BETWEEN 29.3269 AND 40.6584
  AND lat BETWEEN -11.7612 AND -0.9854;

MARA BOX
  AND lon BETWEEN 33.1411 AND 35.28625
  AND lat BETWEEN -0.99878 AND -2.50865;

Note: you can get spatial coordinates for a drawn box from here: http://bboxfinder.com/


Get all amenity within the box enclosing Tanzania box - used maps.me

SELECT  planet.*,
         changesets.tags
FROM planet
JOIN changesets
    ON planet.changeset = changesets.id
WHERE lower(changesets.tags['created_by']) LIKE '%maps.me%'
        AND planet.tags['amenity'] IS NOT NULL
        AND lon BETWEEN 29.3269 AND 40.6584
        AND lat BETWEEN -11.7612 AND -0.9854;

        
List all changeset tags which contains #tanzaniadevelopmenttrust

SELECT changesets.tags 
FROM changesets 
WHERE lower(changesets.tags['comment']) LIKE '%tanzaniadevelopmenttrust%'


Get all comments + timeset and user, wich contains #tanzaniadevelopmenttrust

SELECT changesets.tags,
     created_at AS timestamp,
     user AS user
FROM changesets 
WHERE lower(changesets.tags['comment']) LIKE '%tanzaniadevelopmenttrust%'


Look for the app like maps.me was used for mapping 

SELECT changesets.user,
         min(created_at) AS first_edit,
         max(created_at) AS last_edit,
         date_diff('day', min(created_at), max(created_at)) AS lifespan, sum(num_changes) AS total_edits
FROM changesets
WHERE lower(changesets.tags['created_by']) LIKE '%maps.me%'
    AND changesets.created_at > date '2020-01-01'
    AND changesets.min_lat decimal(9,7) = -11.7612, changesets.max_lat decimal(9,7) = -0.9854, changesets.min_lon decimal(10,7) = 29.3269, changesets.max_lon decimal(10,7) = 40.6584
GROUP BY  changesets.user
ORDER BY  lifespan DESC 

Guides

AWS Big Data Blog – Querying OpenStreetMap with Amazon Athena

Leave a Reply

Your email address will not be published. Required fields are marked *