Protomaps Blog

The Surprising Anatomy of a One-Man Map Tech Stack

Protomaps is an independent project to build a new map of the world. A mission of this scale demands a wide range of novel frontend and backend components, like an open source spatial database, a serverless tile archive format, and a vector map renderer. There’s also a web application (the one you’re looking at now) with subsystems to process background tasks, ingest metrics and manage objects on cloud storage. Finally, there’s the core map engine for cartographic generalization and tiling of OpenStreetMap data, which you can access at Protomaps Downloads.

Bringing a viable product to market, in addition to publishing open source software, creates some special pressures for a one-man shop. Chief among those is to make conservative technology choices. Building on stable tech means I can spend my risk budget on unique parts of the project instead of libraries; it means my open source components can be adopted by others with minimum friction and mature independently of my own use case. Rust and WebAssembly are exciting for maps, but a bootstrapped company can’t afford multi-year investments in every emerging low-level technology. In this context, some Protomaps choices raise eyebrows among web and GIS developers, so I’ll document a few of them here.

C++

C++ is the primary language at the heart of the Protomaps mapping system.

C++ isn’t valued here for any particular properties of the language itself. In aspects like consistent programming conventions or package management, it feels miserable compared to newer languages.

If you’re working with the nitty-gritty bits of geodata and computational geometry, C++ should be your first choice, because it’s what the vast majority of the ecosystem is already written in. You’ll have access to decades of battle-tested, robust libraries like glm for vector and matrix operations, GDAL/OGR for reading and writing geospatial data, S2 Geometry for spatial indexing on the sphere, and Clipper for computational geometry.

If you’re using a different language, you can write a from-scratch alternative to those libraries — which for each would be a multi-year investment without immediate payoff — or use wrappers, which might offer a mismatched API or lead you down a debugging hole that would not exist if you just interfaced with C or C++. In addition, a core goal of Protomaps is to make tooling that works at all cartographic scales, from a neighborhood to the entire planet. Using C++ and powerful libraries directly is the best way to efficiently process hundreds of millions of geographic features.

LMDB

All serious data storage and manipulation inside Protomaps uses Symas LMDB.

Most developers haven’t heard of LMDB, yet it’s the rock-solid foundation that contributes the most to making Protomaps work at scale. It’s an open source, multiprocess and fully ACID storage engine with some unique constraints, such as a single write transaction happening at a time. These constraints might limit it for general web applications but are perfect for Protomaps geodata, in which the only writes come from the OpenStreetMap replication stream.

Making LMDB Spatial

LMDB is an embedded storage engine for binary blobs, and is optimized for 64-bit integer keys. It has special functions for key-prefix cursor traversal. Protomaps uses spatial indexing schemes like S2 Cells where geographic features are bucketed into cells identified by a 64-bit integer, and cells share binary prefixes based on parent-child relationships.

Cells on the Hilbert curve, by @enjalot
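To make the prefix idea concrete, here is a minimal Go sketch of a quadtree key scheme in the style of S2 cell IDs. It is a simplification under stated assumptions, not the S2 library and not Protomaps code: there are no cube faces, and `FromPath`, `RangeMin`, `RangeMax` and `Descendants` are hypothetical names. A sorted slice stands in for an LMDB cursor, so the binary search plays the role of a range seek.

```go
package main

import (
	"fmt"
	"sort"
)

const maxLevel = 30 // deepest subdivision; 2 bits per level plus a marker bit

// CellID is a simplified quadtree key: the child indices from the root are
// packed into the high bits, followed by a single 1 bit marking the level.
type CellID uint64

// FromPath builds a cell from child indices (0-3), one per level.
// Assumes len(path) <= maxLevel.
func FromPath(path ...uint64) CellID {
	var bits uint64
	for _, p := range path {
		bits = bits<<2 | (p & 3)
	}
	shift := uint(2 * (maxLevel - len(path)))
	return CellID(bits<<(shift+1) | uint64(1)<<shift)
}

// lsb isolates the marker bit, which encodes the cell's level.
func (c CellID) lsb() uint64 { return uint64(c) & -uint64(c) }

// RangeMin and RangeMax bound every descendant of c in key order.
func (c CellID) RangeMin() CellID { return CellID(uint64(c) - (c.lsb() - 1)) }
func (c CellID) RangeMax() CellID { return CellID(uint64(c) + (c.lsb() - 1)) }

// Descendants returns the keys in a sorted slice that fall inside cell,
// emulating "seek to the first key >= min, then iterate forward".
func Descendants(sorted []CellID, cell CellID) []CellID {
	lo, hi := cell.RangeMin(), cell.RangeMax()
	i := sort.Search(len(sorted), func(i int) bool { return sorted[i] >= lo })
	var out []CellID
	for ; i < len(sorted) && sorted[i] <= hi; i++ {
		out = append(out, sorted[i])
	}
	return out
}

func main() {
	parent := FromPath(2, 1)
	keys := []CellID{
		FromPath(2, 0, 3),
		FromPath(2, 1, 0),
		FromPath(2, 1, 3, 2),
		FromPath(2, 2, 0),
	}
	sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })
	for _, k := range Descendants(keys, parent) {
		fmt.Printf("%016x is inside %016x\n", uint64(k), uint64(parent))
	}
}
```

Because every descendant of a cell falls inside one contiguous key range, fetching everything under a cell is a single seek followed by a forward scan.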

Memory-Mapping Matters

LMDB is memory-mapped as a fundamental fact of its design. Why this matters beyond an implementation detail is not immediately obvious, but it becomes clear when LMDB is paired with a zero-copy serialization format such as Cap’n Proto or FlatBuffers. Geographic features can be indexed by compact 64-bit keys, and their attributes or geometry accessed through a pointer into virtual memory (a zero-copy operation) instead of wasting memory and CPU cycles on a deserialization step.
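A rough Go sketch of what zero-copy access means in practice: fields are read in place from a byte buffer rather than unpacked into a separate struct. The layout here is hand-rolled for illustration and is an assumption; it is neither the Cap’n Proto nor the FlatBuffers wire format, and `Feature` and `encode` are hypothetical names. The byte slice stands in for a value handed back by a memory-mapped store.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Feature is a view over bytes with this assumed little-endian layout:
//   [0:8)   uint64 feature ID
//   [8:12)  uint32 point count n
//   [12:..) n pairs of int32 (lon, lat) in units of 1e-7 degrees
// Accessors read directly from the underlying bytes; nothing is copied.
type Feature []byte

func (f Feature) ID() uint64     { return binary.LittleEndian.Uint64(f[0:8]) }
func (f Feature) NumPoints() int { return int(binary.LittleEndian.Uint32(f[8:12])) }

// Point decodes only the i-th coordinate pair, on demand.
func (f Feature) Point(i int) (lon, lat int32) {
	off := 12 + 8*i
	lon = int32(binary.LittleEndian.Uint32(f[off : off+4]))
	lat = int32(binary.LittleEndian.Uint32(f[off+4 : off+8]))
	return lon, lat
}

// encode writes a feature in the layout above (used here to make sample data).
func encode(id uint64, pts [][2]int32) []byte {
	buf := make([]byte, 12+8*len(pts))
	binary.LittleEndian.PutUint64(buf[0:8], id)
	binary.LittleEndian.PutUint32(buf[8:12], uint32(len(pts)))
	for i, p := range pts {
		binary.LittleEndian.PutUint32(buf[12+8*i:], uint32(p[0]))
		binary.LittleEndian.PutUint32(buf[16+8*i:], uint32(p[1]))
	}
	return buf
}

func main() {
	// page stands in for bytes returned by a memory-mapped store.
	page := encode(42, [][2]int32{{-1224194000, 377749000}, {-1224180000, 377755000}})
	f := Feature(page) // no copy: f aliases page
	lon, lat := f.Point(1)
	fmt.Println(f.ID(), f.NumPoints(), lon, lat)
}
```

The conversion `Feature(page)` only reinterprets the slice header, so reading one vertex of a large geometry touches just the bytes for that vertex.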

Geodata access patterns exhibit locality in geographic space or ID-space; arranging IDs on space-filling curves means high cache hit rates with zero application code, since virtual memory paging is handled by kernels.
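As an illustration of how a space-filling curve turns locality in 2D into locality in key space, here is a sketch of the classic iterative mapping from grid coordinates to a position along the Hilbert curve. The name `HilbertIndex` and the small grid are illustrative assumptions; this is not necessarily the exact indexing Protomaps uses.

```go
package main

import "fmt"

// HilbertIndex maps cell (x, y) on an n-by-n grid to its distance along the
// Hilbert curve. n must be a power of two, and x, y must be less than n.
func HilbertIndex(n, x, y uint32) uint64 {
	var d uint64
	for s := n / 2; s > 0; s /= 2 {
		var rx, ry uint32
		if x&s != 0 {
			rx = 1
		}
		if y&s != 0 {
			ry = 1
		}
		// Each quadrant at this scale contributes s*s cells to the distance.
		d += uint64(s) * uint64(s) * uint64((3*rx)^ry)
		// Rotate or flip so the sub-curve in this quadrant is oriented correctly.
		if ry == 0 {
			if rx == 1 {
				x = n - 1 - x
				y = n - 1 - y
			}
			x, y = y, x
		}
	}
	return d
}

func main() {
	const n = 4
	// Print the grid with y increasing upward; neighbouring numbers trace the curve.
	for y := n - 1; y >= 0; y-- {
		for x := 0; x < n; x++ {
			fmt.Printf("%3d", HilbertIndex(n, uint32(x), uint32(y)))
		}
		fmt.Println()
	}
}
```

Cells with consecutive indexes are always grid neighbours, so features keyed this way tend to land on the same or adjacent pages when a map viewport is read.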

LMDB: Highly Underrated

LMDB has some mindshare overlap with storage engines like LevelDB and RocksDB, but does not have the implicit clout of being born within Google or Facebook, despite its superior B-Tree design for read-heavy applications. The API and documentation unapologetically assume you know what you are doing, and have a deep understanding of trade-offs for your problem space, such as mandatory sorted insertion order for fast write performance with MDB_APPEND. Its mmap-based, zero-copy design also precludes compression. For Protomaps that is an acceptable trade: random I/O latency is much more critical to the product than space savings, and disks are cheap relative to other resources.

Go

Protomaps uses Go extensively as a complement to C++.

This is surprising to many developers who perceive Go as an alternative to C++. Go doesn’t have many mature low-level libraries for computational geometry. Whether libraries could even be competitive with C++ is an open question, given Go’s memory model and approach to generic programming.

Networking, HTTP and concurrency are solved problems in C++ via libraries like libasio, but they are anything but easy to implement, especially as a solo developer. In Go these are all trifling matters, to the point where they are first-class parts of the standard library. Go programs inside Protomaps avoid external libraries whenever possible and average about 100 lines of code. A typical example is to listen for HTTP requests, unmarshal JSON in the request body, and fork/exec a C++ program in a goroutine, reporting progress through a channel.

  • go-pmtiles is a library to read, write and serve the PMTiles serverless tile archive format.
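The "typical example" described above (HTTP request, JSON body, external program in a goroutine, progress over a channel) might look roughly like the sketch below. Everything specific is an assumption for illustration: the `JobRequest` shape, the `runJob` and `handler` names, and the `./tiler` binary are hypothetical, and this is not code from Protomaps. To keep the sketch self-terminating, `main` exercises the handler once in-process with `httptest` rather than listening on a port.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"os/exec"
	"strings"
)

// JobRequest is a hypothetical request body.
type JobRequest struct {
	Area string `json:"area"`
}

// runJob starts an external program in a goroutine and streams each line of
// its stdout on the returned channel, followed by "done" or an error line.
func runJob(name string, args ...string) <-chan string {
	progress := make(chan string)
	go func() {
		defer close(progress)
		cmd := exec.Command(name, args...)
		stdout, err := cmd.StdoutPipe()
		if err != nil {
			progress <- "error: " + err.Error()
			return
		}
		if err := cmd.Start(); err != nil {
			progress <- "error: " + err.Error()
			return
		}
		scanner := bufio.NewScanner(stdout)
		for scanner.Scan() {
			progress <- scanner.Text()
		}
		if err := cmd.Wait(); err != nil {
			progress <- "error: " + err.Error()
			return
		}
		progress <- "done"
	}()
	return progress
}

// handler decodes the JSON body and relays job progress to the client.
func handler(program string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var req JobRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, "bad JSON: "+err.Error(), http.StatusBadRequest)
			return
		}
		for line := range runJob(program, req.Area) {
			fmt.Fprintln(w, line)
		}
	}
}

func main() {
	// A real service would call http.ListenAndServe(addr, handler("./tiler")).
	// "./tiler" is a hypothetical binary; if it is missing, the response is
	// a single error line.
	req := httptest.NewRequest("POST", "/jobs", strings.NewReader(`{"area":"monaco"}`))
	rec := httptest.NewRecorder()
	handler("./tiler").ServeHTTP(rec, req)
	fmt.Print(rec.Body.String())
}
```

The channel decouples the long-running process from the HTTP response, so the same `runJob` could just as easily feed a log file or a job table.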

Other Likes

The Protomaps web application uses Django, mostly for its batteries-included auth and administration modules, and good-enough server-rendered HTML. Scripting and lightweight geoprocessing are done in Python, via the excellent Shapely and Rasterio libraries. SQLite stores relational data like users and metrics. esbuild makes all TypeScript development a breeze.

Docker: Nope


Given the breadth of software needed to make maps happen, it is surprising to developers that I consciously avoid Docker and its ecosystem.

Docker is a container format and runtime. It is incredibly useful if you are building Heroku or dotCloud, since building a Platform-as-a-Service (PaaS) demands multi-tenant isolation. Protomaps is not building a PaaS, nor are most companies. Companies often use Docker as a glorified static linker, and in the process introduce redundant abstractions and unknown-unknown failure modes.

Docker for many in-house use cases solves a cultural problem rather than a technical one: the historic separation of software development and operations, where “code” is tossed over the wall for SREs to deploy and carry pagers for.

As a one-man shop doing both software development and operations, I can decree a priori that all programs target a specific Ubuntu, depend on a set of packages from the OS, and run with an unprivileged user and designated port via a systemd configuration. My day-to-day development is still on macOS, but this is treated as an environment secondary to the production one.

Revenge of Docker

It’s likely that a PaaS that Protomaps adopts is built on Docker, and might require some Docker incantations. It’s also possible that software Protomaps distributes to customers will be packaged via Docker, although statically linked, architecture-specific binaries are a better solution (goreleaser is my current choice for this).

PostgreSQL: Nope


Protomaps does not use PostgreSQL, despite it being the de-facto standard for storing and querying GIS data via the excellent PostGIS extension.

That is not to say PostgreSQL or PostGIS are flawed; they’re exemplars of what open source geospatial tech can be and have had a massive impact on GIS in government. They are the perfect solution for relational geodata and multi-writer transaction processing over a network, neither of which is relevant to Protomaps.

Many companies use PostGIS as a convenient SQL frontend to its underlying libraries, GDAL/OGR and GEOS, when the problem could be solved in a simpler fashion by using those directly. By introducing a client-server database, automation becomes much more complex, involving ports, user authorization and process managers, instead of being self-contained in a single file like SQLite or LMDB. Protomaps’ problem of scale again means that fetching data over the wire is just too much overhead, and efficient data structures for cartographic generalization, like planar straight-line graphs, cannot be hammered into the relational model.

The pervasiveness of relational databases like PostGIS as a solution to all geo-shaped problems is, again, a cultural artifact. In this case, it’s the status quo of GIS as an academic discipline and industry: one dominated by a single company (Esri) and its workflows and assumptions, where “doing GIS” means manipulating a relational database, and software development is abstracted away to a vendor.

Other Nopes

Despite years wrangling Ruby on Rails in my Pivotal Labs days, I usually reach for Django for its standardized auth and admin. Server-side JavaScript is avoided; this means re-implementing logic in both JS and Go/Python, which I don’t mind. Java has a mature geospatial ecosystem with libraries like JTS and Osmosis, but moving onto Java is all-or-nothing, given the packaging conventions and JVM cold start times.

What Next?

The emerging projects I’m most excited about are “cloud-optimized” formats for geodata, including FlatGeobuf, GeoPngDB, and COPC as complements to the format I’m developing, PMTiles.

If you’d like to keep up with Protomaps, or are interested in building and deploying custom, interactive maps of the entire world, you can reach me at brandon@protomaps.com or find me on Twitter.