Experiment on switching from SQL to Redis to query OpenStreetMap topological data

In short: a custom data layout and manually tuned query steps on a fully in-memory database should offer better performance than a SQL interpreter with an automatic planner for query sub-steps, running on an on-disk database with generic structures and indexes.

The goal is to explore how the tool Osmose-QA, Quality Assurance for OpenStreetMap data, can be sped up.

SQL vs Redis

With SQL databases, the query is sent to the server, the data is fetched and assembled, then the result is returned to the caller.

Redis is an in-memory key-value “database” (better called a “store”). The available commands are very low level, like “get”, “set” or “delete” on a key. Values can be strings (binary strings), arrays (aka lists), sets or dictionaries (aka hashes). Operations on values are available, like set intersection, set union, array pop or push… Results of operations can be returned to the caller or stored in a key. See Redis commands as some sort of assembler: set some values in registers, then run a command, and do it again. No secondary built-in index is available; the only index is the primary one on the Redis key. The caller program makes multiple calls to the Redis server to build the aimed result. Lua scripting is possible on the server side, but it is very limited. It can be used to avoid some round-trips between caller and server when no logic is needed. The logic (conditions, loops) stays on the caller side.
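To illustrate this “assembler” feel, a hypothetical session (keys and values are made up) could look like:

```
SADD highway_ways 1 2 3
SADD railway_ways 3 4
SINTERSTORE crossing_ways highway_ways railway_ways
SMEMBERS crossing_ways
```

Two “registers” are filled, a command combines them into a third key, and the result (here the single id 3) is read back by the caller.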

We aim to compare SQL on a tuned Osmosis-loaded database with Redis, for queries on the Paris area.

The Osmosis data in Postgres takes 6 GB for the area, 7 GB with the spatial PostGIS indices.

Load OpenStreetMap data in Redis

For each OpenStreetMap object (node, way, relation), store the tags as a hash in Redis, addressed by a Redis key built from the object type and id. In addition, for nodes store the latitude and longitude, for ways store the ids of the nodes composing them, and for relations store the members.

Add indices to retrieve objects by tag key (or by tag key plus value). The Redis key is the OpenStreetMap tag key; for each tag key, store the set of ids of the objects using it.

For the purpose of the test (and because of hardware limitations) we only index the tags “highway”, “railway” and “power”.
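A minimal loading sketch of this scheme (the method name and key patterns are my assumptions, following the key scheme used in the queries below; `redis` is any object answering `hset`/`sadd`, such as the redis-rb client):

```ruby
INDEXED_TAGS = ['highway', 'railway', 'power']

# Store one OpenStreetMap node and index it by its tags (a sketch).
def load_node(redis, id, lat, lon, tags)
  # Tags as a hash addressed by object type and id
  tags.each { |k, v| redis.hset("Node:#{id}->tags", k, v) }
  # For nodes, also store latitude and longitude
  redis.hset("Node:#{id}", 'lat', lat)
  redis.hset("Node:#{id}", 'lon', lon)
  # Indices: a set of node ids per tag key, and per tag key + value
  tags.each { |k, v|
    next unless INDEXED_TAGS.include?(k)
    redis.sadd("Node:Key:#{k}->nids", id)
    redis.sadd("Node:Key:#{k}:Value:#{v}->nids", id)
  }
end
```

Ways and relations would follow the same pattern, with a set of node ids per way and a list of members per relation.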

Loaded, Redis uses 4.3 GB of memory. There is no spatial index of any kind.

Port a very simple query from SQL to Redis

Goal: locate OpenStreetMap nodes tagged as level crossings where the railway or the highway way is missing.

Algorithm:
For each level crossing node:
  Fetch the ways using this node
  Check there is at least one railway way and one highway way

SQL

SELECT
  nodes.id,
  ST_AsText(nodes.geom),
  MIN(ways.id)
FROM
  nodes AS nodes
  JOIN way_nodes ON
    way_nodes.node_id = nodes.id
  JOIN ways AS ways ON
    ways.id = way_nodes.way_id AND
    ways.tags != ''::hstore AND
    ways.tags ?| ARRAY['highway', 'railway']
WHERE
  nodes.tags != ''::hstore AND
  nodes.tags ? 'railway' AND
  nodes.tags->'railway' IN ('level_crossing', 'crossing')
GROUP BY
  nodes.id,
  nodes.geom
HAVING
  NOT BOOL_OR(ways.tags ? 'highway') OR
  NOT BOOL_OR(ways.tags ? 'railway')

Explain

GroupAggregate  (cost=4028.54..4028.58 rows=1 width=244)
  Filter: ((NOT bool_or((ways.tags ? 'highway'::text))) OR (NOT bool_or((ways.tags ? 'railway'::text))))
  ->  Sort  (cost=4028.54..4028.55 rows=1 width=244)
        Sort Key: nodes.id, nodes.geom
        ->  Nested Loop  (cost=35.89..4028.53 rows=1 width=244)
              ->  Nested Loop  (cost=35.89..3926.27 rows=12 width=116)
                    ->  Bitmap Heap Scan on nodes  (cost=35.89..3808.93 rows=10 width=108)
                          Recheck Cond: ((tags ? 'railway'::text) AND (tags <> ''::hstore))
                          Filter: ((tags -> 'railway'::text) = ANY ('{level_crossing,crossing}'::text[]))
                          ->  Bitmap Index Scan on idx_nodes_tags  (cost=0.00..35.88 rows=983 width=0)
                                Index Cond: (tags ? 'railway'::text)
                    ->  Index Scan using idx_way_nodes_node_id on way_nodes  (cost=0.00..11.72 rows=1 width=16)
                          Index Cond: (node_id = nodes.id)
              ->  Index Scan using pk_ways on ways  (cost=0.00..8.51 rows=1 width=136)
                    Index Cond: (id = way_nodes.way_id)
                    Filter: ((tags <> ''::hstore) AND (tags ?| '{highway,railway}'::text[]))

Execution time: cold run: 15 s, hot run: 160 ms (second run, with the disk cache in memory).

Redis

The Redis client is in Ruby.

# Get all nodes with railway=level_crossing or railway=crossing
# and store them in register REG1
redis.sunionstore(REG1, 'Node:Key:railway:Value:level_crossing->nids', 'Node:Key:railway:Value:crossing->nids')
# For each node
redis.sscan_each(REG1) { |nid|
  r = redis.pipelined {
    # Intersect the list of way ids using this node with the lists of ways
    # using the highway and railway tags, store in registers REG2 and REG3
    redis.sinterstore(REG2, "Node:#{nid}->wids", 'Way:Key:highway->wids')
    redis.sinterstore(REG3, "Node:#{nid}->wids", 'Way:Key:railway->wids')
  }
  # Report the node if either intersection is empty
  puts nid if r[0] == 0 || r[1] == 0
}

Execution time: 250 ms.

The results for the in-memory data are of the same order of magnitude. Maybe the goal was “too simple”; let's try something just “simple”.

Note: I know that the SQL query and the Redis one do not return exactly the same results. A good point for the SQL version; finding the difference is left to the reader.

Port a more complex query, but still a simple one

Goal: Find ways tagged with “highway” having common nodes with ways tagged with “power”.

SQL

The SQL version was tuned in Osmose-QA. First, the normalized SQL form is not used: the ways carry an array of the node ids they use (in addition to the join table between ways and nodes). We use the intersection of the node id arrays to find the common nodes. Secondly, the query is sped up by avoiding the join on the way_nodes table, using a spatial index on the way geometry bounding box instead.

SELECT
  highway.id AS highway_id,
  powerline.id AS powerline_id
FROM
  ways AS highway
  JOIN ways AS powerline ON
    powerline.linestring && highway.linestring AND
    powerline.nodes && highway.nodes AND
    powerline.id != highway.id AND
    powerline.tags != ''::hstore AND
    powerline.tags ? 'power' AND
    powerline.tags->'power' IN ('line', 'minor_line', 'cable')
WHERE
  highway.tags != ''::hstore AND
  highway.tags ? 'highway'

Explain

Nested Loop  (cost=127.30..18570.93 rows=1 width=16)
  Join Filter: ((powerline.nodes && highway.nodes) AND (powerline.id <> highway.id))
  ->  Bitmap Heap Scan on ways powerline  (cost=127.30..13920.64 rows=57 width=736)
        Recheck Cond: ((tags <> ''::hstore) AND (tags ? 'power'::text))
        Filter: ((tags -> 'power'::text) = ANY ('{line,minor_line,cable}'::text[]))
        ->  Bitmap Index Scan on idx_ways_power  (cost=0.00..127.28 rows=3849 width=0)
  ->  Index Scan using idx_ways_linestring on ways highway  (cost=0.00..81.57 rows=1 width=736)
        Index Cond: (powerline.linestring && linestring)
        Filter: ((tags <> ''::hstore) AND (tags ? 'highway'::text))

Execution time: cold run: 23 min, hot run: 6 min (second run, with the disk cache in memory).

Redis

# Loop over ways having the tag "highway"
redis.sscan_each('Way:Key:highway->wids') { |wid|
  # Fetch the node ids of the way
  nids = redis.smembers("Way:#{wid}->nids")
  # Add the way's node ids to the register set collecting them
  redis.sadd(REG1, nids)
}
# Loop over ways having the tag "power"
redis.sscan_each('Way:Key:power->wids') { |wid|
  # Check if it is a powerline
  if [d_line, d_minor_line, d_cable].include?(redis.hget("Way:#{wid}->tags", d_power))
    # Fetch the node ids of the way
    nids = redis.smembers("Way:#{wid}->nids")
    # Then add the way's node ids to the register set collecting them
    redis.sadd(REG2, nids)
  end
}
# Intersect the two node id sets and store the result into register REG3
redis.sinterstore(REG3, REG1, REG2)

Execution time: 54 s.

It needs tuning to choose the order of operations, but the result looks promising.

Redis with Lua

One block of code is duplicated, and it is executed unconditionally.

# Fetch the node ids of the way
nids = redis.smembers("Way:#{wid}->nids")
# Add the way's node ids to the register set collecting them
redis.sadd(REG1, nids)

We can use a Lua micro-script to avoid sending the two commands one by one. Since the second command always follows the first, we can avoid sending the result of the first one back to the caller program, only to resend it to Redis.

We load a Lua script into Redis; it will be named, and callable, by a hash of its source.

-- Lua script to append the content of another set to a register set
redis.call('sadd', KEYS[1], unpack(redis.call('smembers', KEYS[2])))
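The script is registered with SCRIPT LOAD, which returns the SHA1 digest of its source; that digest is the “name” passed to evalsha. A minimal sketch (the constant name is mine; with a live server, `redis.script(:load, ...)` from redis-rb would return the same value we compute locally here):

```ruby
require 'digest'

# The Lua source registered with SCRIPT LOAD; Redis names the script
# by the SHA1 digest of this exact source text.
SADD_SMEMBERS_LUA =
  "redis.call('sadd', KEYS[1], unpack(redis.call('smembers', KEYS[2])))"

# Against a live server: sadd_smembers = redis.script(:load, SADD_SMEMBERS_LUA)
sadd_smembers = Digest::SHA1.hexdigest(SADD_SMEMBERS_LUA)
```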

Then we call it from Ruby

redis.evalsha(sadd_smembers, [REG1, "Way:#{wid}->nids"])

Execution time: 32 s.

So, a lot of time was lost in round-trips between the caller program and Redis.

Redis, Tile38 and spatial index

As in the original SQL, we can try to use a spatial index to speed up the process.

The idea: follow the SQL plan, but with Redis. Instead of building two very large sets of node ids and then intersecting them, we can narrow the sets of node ids if we are able to know which ways can potentially intersect one another. That is the point of using a spatial index.

Recently Redis added support for Geohash, to convert a latitude and longitude couple into a hash string. The hash can be inserted in a Redis sorted set, and some simple operations are available on top of this. Unfortunately it does not work for bounding boxes, and even less for bounding box intersections.
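For reference, the Geohash-based commands (the key and member names are made up) look like this:

```
GEOADD nodes 2.3488 48.8534 node:42
GEOHASH nodes node:42
GEORADIUS nodes 2.35 48.85 5 km
```

These work from a center point and a radius, not from a bounding box, hence the limitation above.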

But a Redis-like project dedicated to spatial objects is available: Tile38. It is so close to the Redis concept that you can use a Redis client to connect to a Tile38 database. It offers new operations and a new type of value: the spatial ones. In particular it can index geometries with an R-Tree, avoiding a flat scan and intersection on very large Redis sets of ids. So, now we have a second in-memory database!
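As a sketch of the Tile38 side (the key name, object id and coordinates are made up for illustration), loading a way geometry and querying for intersecting bounding boxes look like this:

```
SET ways way:123 OBJECT {"type":"LineString","coordinates":[[2.35,48.85],[2.36,48.86]]}
INTERSECTS ways BOUNDS 48.84 2.34 48.87 2.37
```

BOUNDS takes min lat, min lon, max lat, max lon; INTERSECTS returns the indexed objects whose geometry intersects it.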

Tile38 with the spatial data loaded and indexed: 300 MB.

Bad luck! Execution time: 59 s.

Economies of scale

Unfortunately, in the Osmose-QA software we run lots of SQL queries in batch. We can consider we are always in a hot run. Moreover, because we query the same kinds of objects across many queries, we pre-select and pre-index this data in temp tables.

In real use, the last SQL query runs on a temp table with the pre-selected “highway” ways and a spatial index on it. With that optimization, the true execution time is: cold: 44 s, hot: 4 s.

We can do the same kind of pre-selection (or indexing) with Redis. The common operation we can factorize across Redis queries is the conversion from the way ids carrying a tag to the node ids of the ways using that tag.
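A sketch of that factorization (the helper name and the key `Way:Key:highway->nids` are my assumptions): build once, per tag, the set of all node ids of the ways carrying that tag, so later queries intersect directly against that set instead of looping over way ids.

```ruby
# Collect once the node ids of all ways carrying a given tag key (a sketch).
# `redis` is any object answering sscan_each/smembers/sadd.
def precompute_tag_nids(redis, tag)
  redis.sscan_each("Way:Key:#{tag}->wids") { |wid|
    redis.sadd("Way:Key:#{tag}->nids", redis.smembers("Way:#{wid}->nids"))
  }
end
```

With a live server, the per-way round-trips here could again be folded into the Lua micro-script shown earlier.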

With this factorization we can improve the fastest Redis implementation we have from 32 s to 1 s, as the longest task is looping over the numerous way ids tagged with “highway”.

Memory improvement

We focused on execution time, but we can also reduce the memory footprint by de-duplicating the strings used by tags: replace each string with an integer, and store the string-to-integer mapping in a dictionary, itself stored as a hash in a Redis key.

I think a lot of other memory improvements can be made.

Is it worth it ?

I learned a lot of things doing this experiment. Osmose-QA needs 2 days of computing on 15 servers to analyze the whole OpenStreetMap dataset, and it runs in a loop, so even a small speed improvement has value. Nevertheless, using Redis requires a lot of trial and error to tune the strategy and the order of operations. The resulting code is far less readable than the SQL one, and we only tried simple queries.

You are still reading? You’re nice!
You understood what I’m talking about? We’re both lucky!
You have ideas for improvement? Amazing!

The source code: https://github.com/frodrigo/pbf2redis

Osmose-QA: a tool to analyze the quality of OpenStreetMap data and report issues to the contributors.

Osmosis: an OpenStreetMap data loader for Postgres/PostGIS databases.

Tile38: an in-memory spatial object store.