Skip to content
Andrew Byrd edited this page Feb 28, 2015 · 17 revisions

Rationale

The PBF format appeared at the end of 2010 as a compact binary alternative to the voluminous XML representation of OpenStreetMap data. It was a major step forward, reducing the size of files by half or more and making OSM processing much less resource-intensive. Despite my appreciation for the benefits of the PBF format, having now written several pieces of software that consume and produce it, I have come to the conclusion that PBF is significantly more complicated than necessary to realize its compactness and speed objectives.

While working on Vanilla Extract I created a very simple debug output format -- a concise textual representation of OSM data. On a hunch I applied a couple of the most effective tricks from PBF (delta compression and variable-byte zigzag encoding), piped it through gzip, and was pleasantly surprised to discover output that was consistently smaller than PBF, despite the simplicity of the code that produced it.

Difficulties with PBF

  1. PBF has a multi-level structure. Optionally compressed Blobs contain PrimitiveBlocks, which containing PrimitiveGroups, which finally contain lists of OSM entities. While breaking the file into blocks is reasonable, this layering strikes me as unnecessarily complex.

  2. There is heavy use of parallel arrays to avoid Protobuf adding integer "tags" to each stored field. This need to circumvent the natural mapping of one Protobuf message to each OSM entity (i.e. one Protobuf type to each OSM entity type) is a sign that Protobuf is not an ideal fit for OSM data when maximum compactness is desired.

  3. String tables are used to elide highly repetitive strings such as tag keys. But because PBF blocks are usually gzipped, these string tables are redundant: gzip compression amounts to building a very efficient lookup table for all data in the stream, strings or otherwise.

  4. Encoding and decoding the PBF format requires quite a bit of additional code beyond that generated from its Protobuf specifications, leaving some doubt as to whether Protobuf is a good match for OSM data at all.

The most effective elements of PBF

The aspect of Protobuf that PBF benefits from the most is the variable-byte zigzag encoding of integers, which PBF applies to delta-coded fixed-precision coordinates. This technique in isolation yields a better compression rate than PBF when coupled to a general purpose compression library like gzip. Interestingly, even storing the delta-coded values as fixed-width integers yields output roughly the same size as PBF; gzip is quite effective at eliding all the empty space and repeated small numbers.

In conclusion, we can achieve compression ratios and processing speeds significantly better than those of PBF with file formats that are much simpler to implement. The unnecessary complexity and resource intensiveness of producing and consuming the current OSM formats may have a negative effect on the development of OSM infrastructure components. Therefore I encourage the OSM community to consider the possibility of a much simpler binary format.

Vex Format Grammar

  1. A vexfile is a gzipped stream of bytes.
  2. It contains a header then a series of blocks, followed by the number of blocks as a vuint.
  3. The header consists of the six bytes VEXFMT.
  4. Each block begins with an element_type as a vuint, followed by a series of N elements, followed by a zero then the number of elements, both as vuints.
  5. The element_type is a numeric code from zero to three, meaning node, way, relation, or end respectively.
  6. All elements within a block are of the same type, which is either node or way or relation.
  7. The final block in the vexfile always contains zero elements and has the element_type end.
  8. All elements begin with a common segment, which consists of an id_delta and a list of tags.
  9. An id_delta is the difference between this element's OSM ID and that of the previous element in the same block as a vsint.
  10. An id_delta may not be zero; this allows us to spot the end of a series of elements.
  11. A list is a vuint giving the number of things, followed by those things (which are all of the same kind).
  12. A tag is one string for a key and one string for a value.
  13. A string is a list of bytes representing characters in the UTF-8 encoding.
  14. A node is a common segment followed by a coordinate_delta.
  15. A coordinate_delta is the difference in fixed_angle latitude as a vsint relative to any previous one appearing in this block, followed by the same construct for the longitude.
  16. A fixed_angle is the integer portion of an angle in decimal degrees multiplied by 10^6.
  17. A way is a common segment followed by a list of reference_deltas for its nodes.
  18. A reference_delta is the difference between an OSM ID being referenced and that of the previous such reference in the same block, expressed as a vsint.
  19. Reference_deltas and id_deltas are independent of one another.
  20. A relation is a common segment followed by a list of members.
  21. A member is an OSM ID reference as an absolute vuint, followed by its element_type as a vuint, followed by string for its role.
  22. A vuint is a variable-byte encoded unsigned 64-bit integer, as used in Protocol Buffers.
  23. a vsint is a variable-byte zigzag-encoded signed 64-bit integer, as used in Protocol Buffers.

Terms in bold are nonterminals and those in italics are terminals or variables (some liberty taken for readability).

We can write and read strings, vuints, vsints, and raw bytes using the functions provided in the Protobuf libraries. In Java this includes the CodedOutputStream, which can wrap a GzipOutputStream and produce VEX output with very simple code: the above set of production rules define a format that can be consumed entirely left-to-right by a recursive descent parser.

Size Experiments

NOTE that some size reduction is due to loss of metadata. Be careful to compare only files that do not have any metadata.

Conversion of Netherlands PBF file to VEX in Java:

  • Duration 5 minutes for round-trip through MapDB
  • 941M PBF / 620M VEX format
  • About 35% reduction in size relative to PBF
  • 3 minutes 20 seconds to convert directly without MapDB

Conversion of New York State PBF file to VEX in Java:

  • 130M PBF / 72M VEX
  • 30 sec to write VEX file out of MapDB (not including loading PBF into MapDB)
  • 45% reduction in size relative to PBF
  • 27 sec to convert directly via a PBF Parser with no MapDB

Conversion of Portland Oregon PBF file to VEX in Java:

  • 16MB PBF / 11MB VEX
  • 10 sec to read PBF into MapDB, 4 sec to write VEX out from MapDB
  • About 31% reduction in size relative to PBF
  • 4 sec to convert directly from PBF to VEX without using MapDB

Loading the entire planet.pbf on dev.opentripplanner.org to an SSD:

  • 26GB PBF loaded into MapDB in just under 2 hours
  • Wrote out VEX in 1 hour 17 minutes, size 17GB (but metadata was stripped)

New York size increased to 77MB when I output the nodes along ways attempting to assist delta compression. The order of the nodes and ways also matters, and by using a MapDB TreeMap we output them in order of increasing ID.

Speed Experiments

Using the crude implementation already in VEX, the native format writing speed appears to be about twice as fast as PBF.

With VEx "primed" (i.e. an extract was already performed so most releavent pages are in memory) we output the whole state of Oregon.

  • 14 sec to PBF (using protobuf-c generated encoding functions)
  • 2.2 sec to VEX + 4.1 sec to gzip the stream = 6.3 sec (using custom code calling low-level protobuf-c functions)
  • 55% speedup
  • 100MB for Oregon PBF / 58MB for Oregon VEX (42% reduction)

The gzip work could conceivably be pipelined using a second thread, or you could just let the web server do it (transfer-encoding). LZMAed VEX (using xz) is much slower and only very slightly smaller, not worth the slowdown. Write to the given OutputStream in another thread: http://stackoverflow.com/a/12532222

Even Simpler?

Encouraged by these results, I gave my original text OSM format a quick makeover and performed some similar tests. Here's an example of the format:

W 26201468 name=Larch Mountain Trail 441;highway=footway;sac_scale=yes
N 491901232 45.575881 -122.113032 
N 1540261796 45.575889 -122.112751 
W 35567365 oneway=yes;highway=service;service=driveway
N 416781319 45.578931 -122.116013 
N 416781320 45.578946 -122.116195 
N 806988988 45.579015 -122.117271 
N 806989064 45.579041 -122.117595 highway=crossing
N 697817499 45.579072 -122.118158 
N 697817504 45.579103 -122.119243 

It's extremely simple: each line is an OSM entity. The first field gives the element type (W=way, N=node, R=relation). Each way is followed by the nodes that make it up, in order. The full detail of each node is included any time it appears: a node line includes the fields [N id lat lon tags]. The whole thing is piped through gzip to compress out repetitive tokens and number fragments.

An XML or PBF OSM file would include the ID once in the node itself, then once each time the node is referenced in a way. Here, because all the node detail is included directly in the ways, we don't repeat the node IDs twice.

Originally the full details of a node were given only the first time it appeared, and only its ID was given thereafter. While appealing because it removes redundant information, this optimization has a minor effect on the total size of the file and adds complexity, so it was removed. It might also yield a slight speedup on the consumer by avoiding repeated insertion of the same node in a database, but again the increase in fragility and complexity are probably not worth the improvement.

In experiments with UK and US data, this format is only about 20-30 percent bigger than PBF, but as you can imagine the code to read and write it is extremely simple. It may turn out to be slower to read and write due to conversions back and forth from strings to numbers, but considering this simplicity maybe not. That needs to be tested.