Using Elastic Search for geo-spatial search

Over the past few months we have been quietly working on the localstre.am platform. As I have mentioned in a previous post, we are using elastic search as a key value store and that’s working pretty nicely for us. In addition to that we are also using it as a geospatial search engine.

Localstrea.am is going to be all about local and that means geospatial data and lots of it. That’s not just POIs (points of interest) but also streets, cities, and areas of interest. In geospatial terms that means shapes: points, paths, and polygons. Doing geospatial search means searching through documents that have geospatial data associated with it using a query that also contains geospatial data. So given a shape, find every document with a shape that overlaps or intersects with it.

Since elastic search is still very new and rapidly evolving (especially the geospatial functionality), I had some worries about whether it would work as advertised. So, after months of coding it was about time to see if it could actually take a decent data set and work as advertised instead of falling over and dying in a horrible way.

I actually have a dataset of about 30M pois that is very suitable for this kind of test. So, I launched elastic search on my laptop. Then I pointed my bulk indexing script at the 3GB of compressed json and hit enter. The script calls the bulk index API in elastic search with 500 json documents in one request and uses six threads to keep on sending it data. This enables elastic search to use multiple threads to index the data and with the recent Lucene 4.x release this the recommended way to index large amounts of data. This took about one hour. That translates into a bit under 10K documents per second. That is pretty good. I might be able to tune the throughput a bit more by fiddling with the bulk size, number of threads and other parameters.

The end result was an index with a size of about 11GB. This too is pretty nice: the uncompressed size of the json is probably about the same size (maybe a bit larger). Elastic search is very IO intensive and having a nice compact index means that it can keep a lot of the data in memory and utilize the OS caching really well. 11GB also means that the compression that is now turned on by default in Lucene 4.x is working pretty nicely.

To understand this a bit better, I need to explain how spatial search actually works in elastic search.

Spatial search in Lucene (and thus Elastic Search, which uses this library) is implemented using so-called prefix trees. There are two varieties of this data structure in Lucene. One is geohash based and the other is quad tree based. The main difference between the two is the number of nodes per level in the tree. A quad tree has, as the name implies, nodes with four nodes below it. A geohash is a base32 encoded quad tree path of the interleaved bits of the latitude and longitude. Each level is a single character (with five bits) in that path and thus there are 32 nodes per level.

So, given a coordinate with a latitude and longitude you can calculate the path to a node in the prefix tree. In the case of the geohash implementation that is a Geohash string and in the case of the quad tree that is a bit set. In both cases, the path actually describes a rectangular area. All points in that area have the same prefix. If you increase the length of the prefix by 1 level, you descend a level in the tree and the rectangle breaks down into 32 smaller rectangles in the case of a geohash or 4 rectangles in the case of a quad tree.

Lucene is a text indexing library and doesn’t support any of the above natively. So, instead given a polygon, a prefix tree is used to calculate which tree nodes the polygon covers. The node paths become simple Sting terms and this is something that Lucene does know how to handle. At query time, whatever shape you are querying for is treated the same and you end up with a terms based query internally. From there on it is business as usual for Lucene. Documents have terms, queries have terms and Lucene is really good at figuring out which documents match which terms.

The problem with this is accuracy and index size. It took a bit of fiddling and asking around on the elastic search mailing list to figure out that spatial search has been a bit of a work in progress recently. To summarize: elastic search has spatial search. The version of elastic search that is about to be released as 0.90 very soon has a vastly improved version of that functionality that works a bit differently than currently documented for the old version. Accuracy depends on the prefix path length. The length of this path is controlled using the tree-levels setting. The more levels, the longer the path is, and the more of the coordinate’s bits are encoded and the more accurate things get. More accurate also means more terms and this is why configuring tree-leves is very important.

You need to trade off accuracy against index size. In my test, I’ve experimented with a few different settings. In the end I used the quad tree implementation with a tree_levels setting of 20 and ended up with an index size of only 11GB. Decent accuracy translates into about 50-100 meter, which for my use case is good enough. Increasing the tree_levels beyond this is possible and would get the accuracy down to a few meters or even better. But that would really bloat the index size. It only takes a few extra levels to get to a terabyte for the same amount of data. If you’d want millimeter accuracy, you’d probably have to bump the levels quite a bit more than that. Elastic search might still be able to do that with a sufficiently large cluster but indexing speed and query execution time would really suffer. In any case, my laptop doesn’t have a TB of disk and life is to short for this.

Finally, I ran some queries to validate accuracy and speed. I used a polygon representing a circular area of 50m radius and created a few queries in different places and ran them. They all came backin 20-30 ms with sensible results. That is pretty good. I actually ran a few queries while it was still indexing and the response times were similarly fast. Not bad for a cold index that is still ingesting data at a rate well beyond what the likes of e.g. mysql would be able to handle. While at Nokia, we spend a few months abusing mysql as a key value store and this is a big reason we are not going anywhere near it in LocalStre.am. It just doesn’t scale.

This little experiment made me pretty happy today. It provides some validation of our choice to use Elastic Search as a spatial search engine and key value store. Having one solution for this means I won’t lose any sleep trying to keep two systems in sync. I can simply add data and a second later it will be available for search and my cluster will handle read and write traffic at the same time. Those are non trivial things with other solutions. I was expecting serious query execution delays while it was indexing. In fact it made no measurable impact on the response times and it seems to handle large amounts of data without breaking a sweat. That is nice to know. Add to this the sharding and replication capabilities and you have a combination that is pretty damn good for what we are planning to do.

I love it when a plan comes together.

16 Replies to “Using Elastic Search for geo-spatial search”

    1. Yes that sort of works. The highway would be a linepath and your query would be a polygon. if the Linepath intersects the polygon, that’s a hit. The trick is coming up with a polygon that defines a 3 km box around the linepath.

      That’s pretty much what I’m planning to do with open street map streets. I’ve tested this on a small data set already and it works pretty good. The only issue is distance ranking, you have to do that manually on the result set currently.

  1. Very interesting post, I hope you don’t mind a few questions

    What implementation of QuadTree did you use? Why did you use it over GeoHash? How do you represent a linepath using a QuadTree index? A BBox? If a bbox of a shape crosses a boundary, is there any special algorithm that you use to decide which index to assign? Or do you assign multiple indexes to that shape?

    Also, unrelated question, Is there any particular advantage to ElasticSearch over SOLR? Solr seems to have a larger community support. Were you able to get higher performance from ES vs Solr or MongoDB?

    1. I used the quad tree. Functionally there is not much difference with the geohash implementation but the indexes ended up being a lot smaller for my usecase.

      Basically you can think of either implementation as providing a grid. So representing a shape like a path is very much like drawing a line using pixels. The higher the tree level, the smaller the pixels.

      Basically this is all (including index management) taken care off by Elastic Search. You simply provide it the shape in geo json format and decide on the size of the ‘pixels’ by setting the tree_levels parameter in the mapping.

      I’ve used Solr in the past and found that it was a bit lacking in a couple of respects. Mainly for clustering the options were very limited and even with the latest version it is far from ideal. We had severe issues with setting up replication and getting it working reliably for example. Apparently things have improved with 4.x but I no longer use it.

      Elastic Search was designed from the ground up to be a sharding and replicating cluster out of the box (it does this by default). Additionally, their REST API is designed by somebody who actually understands REST-ful APIs. I always found the Solr API to be very clumsy.

      Technically, both projects use Lucene and that means that raw performance and algorithms are pretty much the same. I’d say, Elastic Search probably is the definite leader when it comes to e.g. distributed search in a cluster, replication, and sharding. They are doing some very clever things at the cluster level that Solr has yet to adopt.

      Also, one of the Lucene core committers is a co-founder of Elasticsearch. So, I wouldn’t worry too much about the community size. There are lots of active users on the mailing lists and the company behind Elasticsearch is growing rapidly as well (they just got 20M in funding).

      MongoDB is a bit of a different beast entirely. I have very little experience with it and the querying options are a lot more limited. In terms of raw performance, you should be able to get similar performance from Solr and Elastic Search. However, I expect that it would be a bit more work to tweak Solr to get there.

  2. I have a doubt that how does Elastic Search calculates distance between two points on earth. ? Does it use some distance formula as in haversine distance or someother technique. Please elaborate on how does it work.

    Thanks.

  3. Hi Jilles, what settings are you using to achieve 10,000 index operations per second?

    I am importing the quattro-shapes project in to ES on my laptop (quad-core i7, 12GB RAM, SSD) but my CPUs are all maxing out at 160 index operations per sec.

    I’m also sending documents via the bulk index API in batches of 500.

    These are my settings:

    ES_HEAP_SIZE=4g (I think the max is actually 1g)

    settings”: {
    “index”: {
    “refresh_interval”: “-1”,
    “number_of_replicas”: “0”,
    “number_of_shards”: “8”,
    “index.index_concurrency”: “32”,
    “index.merge.policy.merge_factor”: “30”,
    “index.translog.flush_threshold_size”: “1000mb”,
    “index.translog.flush_threshold_ops”: “50000”,
    “index.translog.flush_threshold_period”: “120m”,
    “index.translog.interval”: “10m”,
    “indices.memory.index_buffer_size”: “50%”
    }
    }

    Could you suggest any way to speed up the import or reduce CPU usage?

    When I view ‘hot_threads’ at “http://localhost:9200/_nodes/hot_threads”; it appears that the bottleneck is: “org.apache.lucene.spatial.prefix.tree.SpatialPrefixTree.recursiveGetCells(SpatialPrefixTree.java:192)” which shows up constantly.

    1. I haven’t repeated this with a recent version of elastic search. So, not sure what has changed. The 10K per second was for POI data.

      If you are using the geo_shape (looks like it) type you’ll want to reduce precision a lot probably and take a critical look at the index size on disk as well. At high precision, you are generating tons of terms that need to be indexed. That would certainly explain recursiveGetCells being very busy. Especially if you are indexing large shapes like countries or continents, this is going to get pretty bad.

      It would be helpful if you could post your mapping.

      1. Thanks Jilles; the mapping looks like this:

        “properties”: {
        “boundaries”: {
        “type”: “geo_shape”,
        “tree”: “quadtree”,
        “tree_levels”: “20”
        }
        }

        The shapes are ‘neighborhoods’ so not huge, usually about 10 city blocks square.

        The geojson rows are either polygon or multiploygon.

        Here is an example of a single record:

        {
        “type”: “polygon”,
        “coordinates”: [
        [
        [
        -84.07042243327618,
        9.976736841175153
        ],
        [
        -84.06869356526974,
        9.975812741973272
        ],
        [
        -84.06825016225943,
        9.975448850526874
        ],
        [
        -84.0682279516667,
        9.975455588036624
        ],
        [
        -84.06658713367011,
        9.975953324734093
        ],
        [
        -84.06463623044999,
        9.976145471930025
        ],
        [
        -84.06268532722976,
        9.975953324734093
        ],
        [
        -84.06263300044054,
        9.975937451576016
        ],
        [
        -84.06188964842505,
        9.97442795721983
        ],
        [
        -84.06188964841448,
        9.96885060855
        ],
        [
        -84.06188964841448,
        9.963440335310096
        ],
        [
        -84.06188964842505,
        9.96318231585424
        ],
        [
        -84.06516406074644,
        9.961187586283486
        ],
        [
        -84.0653126108939,
        9.960131816815903
        ],
        [
        -84.06738279041016,
        9.958029972340015
        ],
        [
        -84.06738281249994,
        9.958029949912344
        ],
        [
        -84.06945301716462,
        9.955928079903373
        ],
        [
        -84.06953790950126,
        9.955324751605303
        ],
        [
        -84.07287597660007,
        9.955324753796262
        ],
        [
        -84.07562255859986,
        9.955324753940673
        ],
        [
        -84.07699584962495,
        9.956719036569226
        ],
        [
        -84.07836913934466,
        9.95532475457874
        ],
        [
        -84.07974243164978,
        9.953930466615617
        ],
        [
        -84.08111572268524,
        9.955324753940673
        ],
        [
        -84.08386230469995,
        9.955324753473576
        ],
        [
        -84.08787950065567,
        9.955324751605303
        ],
        [
        -84.087879583464,
        9.955324701160137
        ],
        [
        -84.09500122038651,
        9.955324701160137
        ],
        [
        -84.09555682206752,
        9.9563641587946
        ],
        [
        -84.09559202688365,
        9.95635347953045
        ],
        [
        -84.09058867941815,
        9.96651340285382
        ],
        [
        -84.08821035277674,
        9.967854940826967
        ],
        [
        -84.08565612878584,
        9.972523026534645
        ],
        [
        -84.08219289412182,
        9.976039379800099
        ],
        [
        -84.08111572265,
        9.976145471930025
        ],
        [
        -84.07948751774433,
        9.97598510774391
        ],
        [
        -84.07948327756793,
        9.975984690122957
        ],
        [
        -84.07807610192367,
        9.976736841175153
        ],
        [
        -84.07620017081996,
        9.97730589865398
        ],
        [
        -84.07424926759992,
        9.977498045850169
        ],
        [
        -84.07229836437972,
        9.97730589865398
        ],
        [
        -84.07042243327618,
        9.976736841175153
        ]
        ]
        ]
        }

        I’ve also tried switching from openjdk-7-jre to oracle-java7 and compiling the latest elasticsearch from source but performance is the same.

  4. Hello Jilles,

    i have started learning elasticsearch recently.

    i am want to define different polygons on map into elasticsearch.

    and after define different polygons , i will search one of lat long and i must get a polygon in which that lat long is present.

    is this possible in elasticsearch?

    if yes..
    then wht steps i need to do .

  5. Hi Jiles,

    We have a problem with elasticsear geoshape query. Its weird that the queries work for certain indexes and does not work for some.
    My query is something like this.
    GET ecua/project/_search
    {
    “query”:{
    “geo_shape” : {
    “location.shape” : {
    “shape” : {
    “type” : “circle”,
    “radius” : “5miles”,
    “coordinates” : [-80.1564, 40.385827]
    }
    }
    }
    }
    }

    We have several indexes per customer. What I noticed that when I run the query against some geospatial data in Canada or San Francisco it works. I am getting the desired results

    Whereas I have some indexes where the geospatial data are from Pittsburg and Florida. The same geo_Shape query does not work. (I change the coordinates and the radius)

    Our shape attribute index is of geohash tree.

    I am clueless why it would happen like that. Please let me know if you have ever faced this or know of any reason. Thanks.

  6. If it works for some indices and not for others, the answer is likely some difference in the mapping between the indices. I would double check that first by comparing the mappings returned by es for the indices. If it is not the mapping, you need to look at the data and look at whether the fields are actually correct. I could simply be that you don’t have coordinates for some data or that they are wrong. If you are not sure, create a fresh index and fill it with sample data from both indices and try again with your query. Finally, it might be helpful to plot things on a map just to confirm that your data is where you assume it to be.

    BTW. not really the best place for support requests. Try posting in the elasticsearch group on google groups.

  7. Hi Jilles,

    Nice article.

    I was wondering if you could give some insight into how you would go about structuring and query data for the following use-case:

    – A db of food locations is indexed.
    – Each location has a delivery area.
    – A delivery area is either a circle with mile radius from the location’s coordinates or a custom polygon drawn around city blocks.
    – The location search query will take a user location coordinates and optionally a map viewport bounds or a radius as filter.
    – The query results will be locations whose delivery area cover the user’s location and are within the viewport filter contraints.

    Thanks.

  8. The easiest is to use geo shape polygons for everything in this case. I actually have a github project that turns a point + radius into a polygon: https://github.com/jillesvangurp/geogeometry/blob/master/src/main/java/com/jillesvangurp/geo/GeoGeometry.java#L697. If you do it like that, you have no special cases. All your data has some polygon coverage area and all your queries are simple intersects style queries of either another polygon or point. The only thing to consider here is the precision with which you index these polygons. This depends on two factors: their number and their size (area). Basically if you tune this right, your index won’t explode with a ridiculous amount of terms while still retaining reasonable levels of performance and accuracy.

  9. Hi Jilles,

    Can you give a full example of how to index and query using geo_shape? the elasticsearch documentation is not good enough to easy understand of how to use..
    If you have an example of a class in any language that represent geo_shape be better.

    Thanks.

Leave a Reply