Using Elastic Search as a Key Value store

I have in the past used Solr as a key value store. Doing that provided me with some useful information:

  1. Using Solr as a key value store actually works reasonably well. I have run indexes on a two node Solr master/slave setup with up to 90M JSON documents of roughly 1-2KB each and a total (unoptimized) index size of well over 100GB. That setup handled queries returning tens to hundreds of documents at a throughput of 4 queries per second, with writes and replication going on at the same time. In a different project we had 70M gzip compressed XML values of roughly 5KB each stuffed in a Solr index that sustained dozens of reads per second in a prolonged load test with sub 10ms response times. Total index size was a bit over 100GB. This was competitive with (slightly better than, actually) a MySQL based solution that we load tested under identical conditions (same machine, data, and test). So, when I say Solr is usable as a key value store, I mean I have used it and would use it again in a production setting for data intensive applications.
  2. You need to be aware of some limitations with respect to eventual consistency, lack of transactionality, reading your own writes, and a few other things. In short, you need to make sure your writes don’t conflict, beware of the latency between the moment you write something and the moment that write becomes visible through queries, and thus not try to read your own writes immediately after writing them.
  3. You need to be aware of the cost of certain operations in Lucene (the framework that is used by Solr). Getting stuff by id is cheap. Running queries that require Lucene to look at thousands of documents is not, especially if those documents are big. Running queries that produce large result sets is not cheap either. Mixing lots of reads and writes is going to kill performance due to repeated internal cache validation.
  4. Newer versions of Lucene offer vastly improved performance due to more clever use of memory, massive optimizations related to concurrency, and a less disruptive commit phase. Particularly Lucene 4 is a huge improvement, apparently.
  5. My experience under #1 is based on Lucene 2.9.x and 3.x, prior to most of the aforementioned improvements. That means I should get better results doing the same things with newer versions.
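
The visibility caveat in point 2 is easier to see in a toy model. The sketch below is not Solr's or Lucene's API; `ToyIndex` and its methods are made-up names that only mimic the idea that a write is accepted first and becomes readable after a later commit.

```python
class ToyIndex:
    """Toy model of commit-based visibility; not Solr's or Lucene's actual API."""

    def __init__(self):
        self._visible = {}   # documents that readers can see
        self._pending = {}   # accepted writes that are not committed yet

    def put(self, doc_id, doc):
        # Last write wins: there is no transaction or conflict detection.
        self._pending[doc_id] = doc

    def commit(self):
        # Make all pending writes visible to readers in one step.
        self._visible.update(self._pending)
        self._pending.clear()

    def get(self, doc_id):
        # Readers only ever see the committed view.
        return self._visible.get(doc_id)


index = ToyIndex()
index.put("user:1", {"name": "alice"})
stale = index.get("user:1")   # write accepted, but not visible yet
index.commit()
fresh = index.get("user:1")   # visible once the commit has happened
```

Here `stale` is `None` and `fresh` is the stored document, which is why code that writes and then immediately reads the same key needs to tolerate a miss.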

Recently, I started using Elastic Search, which is an alternative Lucene based search product, and this makes the above even more interesting. Elastic Search is often depicted as simply a search engine similar to Solr. However, that undersells it: it is quite a bit more than that.

It is better described as a schema less, multi tenant, replicating & sharding **key value store** that implements extensible & advanced search features (geo spatial, faceting, filtering, etc.) as well.

In simpler terms: you launch it, throw data at it, find it again by querying it, and add more nodes to scale. It’s that simple. Few other products do this, and even fewer do it with as little ceremony as Elastic Search. This includes most common relational and NoSQL solutions on the market today. I’ve looked at quite a few. None come close to the out of the box utility and usability of Elastic Search.
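
As a rough sketch of what "throw data at it, find it again" looks like over the REST API, the snippet below builds (but does not send) a store, a fetch by id, and a search request. The `/_doc/` path layout follows recent Elasticsearch versions; older versions used a type name in that position. The index name `places`, the document fields, and the `build_request` helper are hypothetical, and a local node on the default port 9200 is assumed.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: a local node on the default port


def build_request(method, path, body=None):
    """Build (but do not send) a JSON request against the REST API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


# "places" and the document fields are hypothetical names for illustration.
put_req = build_request("PUT", "/places/_doc/42", {"name": "Cafe Example", "city": "Berlin"})
get_req = build_request("GET", "/places/_doc/42")
search_req = build_request("POST", "/places/_search", {"query": {"match": {"city": "Berlin"}}})

# To send one of these to a running node:
#   with urllib.request.urlopen(put_req) as resp:
#       print(json.load(resp))
```

The store and fetch calls are the key value part: the id in the URL is the key and the JSON body is the value. The search call is what a plain key value store would not give you.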

That’s a big claim. So, let’s go through some of the basics to substantiate this a little:

In summary: Elastic Search is a pretty damn good key value store with a lot of properties that make it highly desirable if you are looking for a scalable solution to store and query your JSON data without spending a lot of time and effort on configuration, monitoring, figuring out how to cluster, shard, and replicate, and getting it to do sensible things.

There are a few caveats of course:

So, we’re going to use Elastic Search at LocalStre.am. We’re a small setup with modest needs for the foreseeable future. Those needs are easily met with a generic Elastic Search setup (bearing in mind the caveats listed above). Most of our data is going to be fairly static and we like the idea of being able to scale our cluster without too much fuss from day 1.

It’s also a great match for our front end web application, which is built around the Backbone JavaScript framework. Backbone integrates well with REST APIs and Elastic Search is a natural fit in terms of API semantics. This means we can keep our middleware very simple. Mostly it just passes requests through to Elastic Search after doing a bit of authentication, authorization, and validation. All we have is CRUD and a few hand crafted queries for Elastic Search.
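
A pass-through layer of that kind could look roughly like the sketch below. This is not our actual middleware: the token table, the required fields, the `items` index, and the `handle_put` and `fake_forward` functions are all hypothetical, and `forward` stands in for whatever actually sends the request to the search backend.

```python
import json

# Hypothetical token table and required fields, for illustration only.
TOKENS = {"secret-token": "alice"}
REQUIRED_FIELDS = {"title", "url"}


def handle_put(token, owner, doc_id, body, forward):
    """Authenticate, authorize, validate, then pass the document through.

    Returns an (http_status, payload) tuple.
    """
    user = TOKENS.get(token)
    if user is None:
        return 401, {"error": "unknown token"}        # authentication failed
    if user != owner:
        return 403, {"error": "not your document"}    # authorization failed
    missing = REQUIRED_FIELDS - set(body)
    if missing:
        # Validation failed: report which fields are absent.
        return 400, {"error": "missing fields", "fields": sorted(missing)}
    return forward("PUT", f"/items/_doc/{doc_id}", json.dumps(body))


def fake_forward(method, path, payload):
    # Test double: echoes what would have been sent to the backend.
    return 200, {"method": method, "path": path, "payload": json.loads(payload)}


status, echoed = handle_put(
    "secret-token", "alice", "7",
    {"title": "Example", "url": "http://example.com"},
    fake_forward,
)
```

Passing `forward` in as a parameter keeps the checks testable without a running cluster; in a real deployment it would wrap an HTTP client call.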