As you may have guessed from my previous posts on my new company/project LocalStream and iterables-support, the two topics are related. A lot of the open source work I’ve been doing on GitHub lately actually relates to LocalStream. The post on iterables-support was intended as the first in a series of posts that highlight the different GitHub projects I’ve created while working on LocalStream. In the second post of this serie, I want to highlight GeoKV, a persistent key value store for geospatial data that is optimised for processing and iterating over this data.
LocalStream is all about local. That means it relies on a lot of data about local and processing such data has kept me busy lately. In the iterables-support post, I outlined what that GitHub project entails and why I created it: processing and iterating over (location) data. GeoKV is another important component in the batch data processing environment I have.
The problem with location data is that there is a lot of it (as in my poor laptop would run out of memory if I loaded all of it in memory). When processing data, you typically iterate over a file with the data, do a lot of nearby searches for related stuff, do some processing, and then write the results somewhere. The problem with that is that “somewhere” is typically another file. So, if you want to process the output of your processing, you end up with a complex pipeline of different processing steps and different files. Another problem with this approach is that it is hard to query the data this way. Finally, these files are big and you generally compress them to save disk space and reduce the amount of disk IO.
An alternative is using a geospatial database. The advantage of this is that you can do querying. However, if you’ve ever done any processing on a database, you probably realize that they are not particularly well suited for iterative processing, especially not if you want to write back results. Also, most databases don’t handle many small queries very well and mixed read/write traffic is just about the worst thing you can do to a database and tends to be magnitudes slower than the pipes and filter approach outlined above. On top of that, most databases don’t use compression in their storage layer; or if they do, it is generally done per object, which is a lot less effective than compressing a single file.
GeoKV is designed go make these problems go away and get the best of both worlds. It is optimised for mixed reading and writing of geo spatial data, while doing lots of queries on that data. Working with GeoKV is more like using a Java HashMap than a using a database or reading from a file. You can put things in it (id, latitude, longitude, and some object) and get things out of it by querying it with using the id, a polygon, a circle, or a bounding box.
Internally, GeoKV uses geohashes to partition the data over a large number of buckets. Geohashes have the nice property that coordinates that are close to each other have the same geohash prefix. Another nice property is that the area covered by a geohash prefix is rectangular. So, in GeoKV objects in the same bucket are close to each other in the same rectangular area identified by their geohash prefix.
Each bucket is persisted to a gzipped text file. To convert the objects to and from text, GeoKV simply delegates to an object that you provide at construction time that has a parse and serialize method. So GeoKV doesn’t really care what you store as long as you can tell it how to parse and serialize it.
When accessing the GeoKV, buckets are loaded on demand. Once in memory, the buckets are stored in an LRU cache (courtesy of Google’s Guava library). Buckets that drop out of the cache are persisted when that happens (if there were changes). What this means is that GeoKV reads and writes compressed data lazily. Instead of reading and writing individual objects, it reads and writes buckets, each of which may contain many objects. The resulting disk IO pattern is that of occasionally reading and writing buckets of a few MB (compressed). This is a lot more efficient than reading many objects one by one, which is what most databases do.
To keep track of the objects, an index of ids and geohashes is kept in memory and persisted on disk separately from the buckets. This allows GeoKV to identify the correct bucket for an id.
GeoKV allows you to specify the geohash prefix length for buckets at construction time. Generally a value of 6 or 7 is sensible (covers a few city blocks) in urban areas. The cache size is also configurable and with e.g. 5000 buckets and a hash size of 6, you can easily fit entire cities in a memory heap of a few GB.
When querying the GeoKV, you get back an iterator that returns the objects that match the query. Using an iterator means that GeoKV does not have to have the entire result set in memory. To implement the geo spatial queries, I use another of my GitHub projects: geotools that provides methods to calculate the geohashes that cover a polygon, circle, rectangle, or line. Using that list of geohashes, it is straightforward to then find the corresponding buckets and then iterate over the content of each of those buckets, filtering out those objects that are outside the polygon.
Because it works this way, working with GeoKV is just about as fast as using a HashMap provided your processing is local and the buckets you need at any time fit into the cache.
Some examples of things I have done using GeoKV:
- For each place, find the most important nearby places and add a reference to those places in the place object and write it back.
- Cluster places into neighborhoods and write a neighborhood object to the GeoKV and then modify each place in the neighborhood to have a reference to the newly created neighborhood object; annotate the neighborhood object with references to the most important places inside.
- Find duplicate places by iterating and comparing with nearby objects and decide for each cluster of duplicates which one is the one to keep.
A couple of properties set GeoKV apart from other solutions:
- It persists but not very often. Persisting is essential for preserving your data but it is kind of expensive. GeoKV does this infrequently to keep things fast. The price you pay is of course reliability. Because of this, GeoKV is not suitable for use as a production database (no ACID).
- It shards data based on geo coordinates, efficiently caches nearby data in to memory, and it provides you with geospatial querying to access the cached data. This makes it very suitable for processing large geospatial data sets using very simple algorithms that would normally be impractical.
- It’s designed with Java in mind. So it supports generics and Java’s foreach. This makes using it as straightforward as using a HashMap.
I hope GeoKV may be of use to others doing processing of geospatial data. I’d be interested in any kind of feedback people might have on this.