The People Graph

Social networks have been all the rage for over a decade now. In recent years, Facebook has consolidated its leadership in personal networking and LinkedIn has become the undisputed leader in business networking.

A side effect of this consolidation is that business models have also consolidated, leaving out a lot of potential applications and use-cases that do not fit the current models.

Both Facebook and LinkedIn have business models that focus on monetizing their respective social graphs through advertising. LinkedIn also monetizes search, analytics, and recruiting.

To protect their data and business, they have largely cut off API access for third parties. Consequently, building third party applications that use these networks has become very hard, if not impossible.

At the same time, building a new social network has become harder because most people already have their social networking needs covered by the existing offerings. Most would-be new social networks face the ‘empty room problem’: they don’t become interesting until enough users are on them.

Inbot was founded on the idea that sales is fundamentally a social process where building relationships and trust between people is the key to success. Yet, Customer Relationship Management (CRM) systems used today by most salespeople are anything but social.

CRM is mostly used to manually keep track of conversations with customers. Relationship data is shared only within the sales team, and when sales people change jobs, all aggregated data in the CRM is left behind and they take their social network with them.

Marketing automation has emerged as a software-based mechanism to help companies generate leads in a more automated fashion. The problem today is that the spam and noise generated by these applications is deafening, making everyone harder to reach.

Initially, Inbot started out as a disruptive play to make CRMs find and provide links to new business opportunities. Over time, we realized that we should focus solely on social lead generation, and decouple it from teams and companies that CRM vendors target.

Since last August, we have been rolling out a community that we hope will one day rival LinkedIn’s, yet works very differently.

Eventual Consistency Now! using Elasticsearch and Redis

Elasticsearch promises real-time search and nearly delivers on that promise. The problem with ‘nearly’ is that in interactive systems it is unacceptable for user changes not to be reflected in query results. Eventual consistency is nice, but it also means occasionally being inconsistent, which is not so nice for users, or worse, product managers, who typically don’t understand these things and report them as bugs. At Inbot, this aspect of using Elasticsearch has been keeping us busy. It would be awfully convenient if it never returned stale data.

Mostly, things actually work fine, but when a user updates something and then within a second navigates back to a list that includes what he/she just updated, chances are that it still shows the old version because Elasticsearch has not yet committed the change to the index. In any interactive system this is going to be an issue and, one way or another, a solution is needed. The reality is that Elasticsearch is an eventually consistent cluster when it comes to search, not a transactional store that is immediately consistent after modifications. And while it is reasonably good at catching up within a second, that leaves plenty of room for inconsistencies to surface. You can immediately GET any changed document by id, but it takes a bit of time for search results to reflect the change. Out of the box, the refresh frequency is once every second, which is enough time for a user to click something, then something else, and see results that are inconsistent with the actions he/she just performed.
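For reference, that once-per-second refresh is just an index setting, and you can also trigger a refresh by hand. Here is a minimal sketch using plain JDK HTTP against a hypothetical local index called myindex (the _settings and _refresh endpoints are standard Elasticsearch APIs):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RefreshSettings {
  private static final String ES = "http://localhost:9200";

  public static void main(String[] args) throws Exception {
    // The default refresh interval is 1s and can be tuned per index; lower
    // values mean fresher search results at the cost of indexing throughput.
    send("PUT", ES + "/myindex/_settings",
        "{\"index\":{\"refresh_interval\":\"1s\"}}");

    // Or force a refresh explicitly so searches see all recent writes.
    send("POST", ES + "/myindex/_refresh", null);
  }

  private static void send(String method, String url, String body) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setRequestMethod(method);
    if (body != null) {
      conn.setDoOutput(true);
      conn.setRequestProperty("Content-Type", "application/json");
      try (OutputStream out = conn.getOutputStream()) {
        out.write(body.getBytes(StandardCharsets.UTF_8));
      }
    }
    System.out.println(method + " " + url + " -> " + conn.getResponseCode());
    conn.disconnect();
  }
}

Forcing a refresh after every write would keep search consistent, but it gets expensive quickly, so it is not something you want to do on every user action.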

We started addressing this with a few client side hacks, like simply replacing list results with what we just edited via the API, updating local caches, etc. Writing such code is error prone and tedious. So we came up with a better solution: use Redis. The same DAO I described in my recent article on optimistic locking with Elasticsearch also stores the ids of any modified documents in a short-lived data structure in Redis. Redis provides in-memory data structures such as lists, sets, and hash maps and comes with a ton of options. The nice thing about Redis is that it scales quite well for small things and has a very low latency API. So, it is quite cheap to use it for things like caching.

So, our idea was very simple: use Redis to keep track of recently changed documents and, on the fly, replace any results that include these objects with the latest version of the object. The bit of Java code that we use to talk to Redis uses something called JedisPool. However, this should work in pretty much the same way from other languages.

try(Jedis jedis = jedisPool.getResource()) {
  Transaction transaction = jedis.multi();
  transaction.lpush(key, value); // push the id of the modified document
  transaction.ltrim(key, 0, capacity - 1); // right away trim to 'capacity' entries so that we can pretend it is a circular list
  transaction.expire(key, expireInSeconds); // don't keep the data forever
  transaction.exec();
}

This creates a circular list with a fixed length that expires after a few seconds. We use it to store the ids of any documents we modify for a particular index or belonging to a particular user. Using this, we can easily find out, when returning results from our API, whether we should replace some of the results with newer versions. A few seconds is enough for Elasticsearch to catch up, so by the time the list expires it will either be short or gone entirely. Under continuous load, it is simply trimmed to the latest ids that were added (capacity). So, it stays fast as well.

Each of our DAOs exposes an additional function that tells you which document ids have been recently modified. When returning results, we loop over them, check each id against this list, and swap in the latest version where needed. Simple and easy to implement, it solves most of the problem and, more importantly, it solves it on the server and does not burden our API users or clients with it.
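As a rough sketch of what that can look like (the Document class, the key naming, and the fetchById callback are simplified stand-ins for illustration, not our actual DAO code):

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

public class FreshResults {

  // Stand-in for whatever result type your DAO returns.
  public static class Document {
    public final String id;
    public final String json;

    public Document(String id, String json) {
      this.id = id;
      this.json = json;
    }
  }

  private final JedisPool jedisPool;

  public FreshResults(JedisPool jedisPool) {
    this.jedisPool = jedisPool;
  }

  // The ids we pushed to the capped, expiring list in the snippet above. If
  // the key has expired, Redis returns an empty list, which means
  // Elasticsearch has had time to catch up.
  public Set<String> modifiedIds(String key) {
    try (Jedis jedis = jedisPool.getResource()) {
      return new HashSet<>(jedis.lrange(key, 0, -1));
    }
  }

  // Swap any stale hits for the latest version before returning them.
  public List<Document> freshen(String key, List<Document> results,
      Function<String, Document> fetchById) {
    Set<String> recentlyModified = modifiedIds(key);
    for (int i = 0; i < results.size(); i++) {
      if (recentlyModified.contains(results.get(i).id)) {
        results.set(i, fetchById.apply(results.get(i).id));
      }
    }
    return results;
  }
}

The fetchById callback would typically be a GET by id against Elasticsearch, which, unlike search, reflects writes immediately.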

However, it doesn’t fix the problem completely. Your query may match the old document but not the new one, and replacing the old document with the new one in the results will make it appear that the changed document still matches the query. But it is a lot better than showing stale data to the user. Also, we’re not handling deletes currently, but that is trivially supported with a similar solution.

Update 2016-02-10: I recently released our Elasticsearch client code on Github. It includes support for the strategy I outline above, plus loads more goodies. Simply create a dao using the CrudOperationsFactory, be sure to enable redis caching there, and use modifiedIds() on the dao to retrieve a list of recently modified ids. If you use the pagedSearch or iterableSearch methods on the dao, you can easily create a ProcessingSearchResponse that applies a lambda function which does the lookup when an id is contained in these modified ids.

Optimistic locking for updates in Elasticsearch

In a post in 2012, I expanded a bit on the virtues of using Elasticsearch as a document store, as opposed to using a separate database. To my surprise, I still get hits on that article on a daily basis, which indicates that there is some interest in using Elasticsearch as described there. So, I’m planning to start blogging a bit more again, after being more or less too busy building Inbot to do so since last February.

Jruby and Java at Localstream

Update. I’ve uploaded a skeleton project about the stuff in this post to Github. The presentation that I gave on this at Berlin Startup Culture on May 21st can be found here.

The server side code in the Localstream platform is a mix of Jruby and Java code. Over the past few months, I’ve gained a lot of experience using the two together and making the most of idioms and patterns in both worlds.

Ruby purists might wonder why you’d want to use Java at all. Likewise, Java purists might wonder why you’d waste your time on Jruby instead of more hipster-friendly languages such as Scala, Clojure, or Kotlin. In this article I want to steer clear of that particular topic and instead focus on more productive things such as what we use for deployment, dependency management, dependency injection, configuration, and logging. It is also an opportunity to introduce two of my new Github projects:

The Java ecosystem provides a lot of good, very well supported technology. This includes the JVM itself but also libraries such as Google’s Guava, various Apache frameworks such as httpclient, commons-lang, commons-io, and commons-compress, the Spring framework, icu4j, and many others. Equivalents exist for Ruby, but mostly those equivalents leave a lot to be desired in terms of features, performance, design, etc. It didn’t take me long to conclude that a lot of the Ruby stuff out there is sub-standard and not quite up to my expectations. That’s why I use Jruby instead of Ruby: it gives me the best of both worlds. The value of Ruby is in its simplicity and the language itself. The value of Java is access to an enormous amount of good software. Jruby allows me to have both.

Using Elastic Search for geo-spatial search

Over the past few months we have been quietly working on the localstre.am platform. As I mentioned in a previous post, we are using Elastic Search as a key value store and that’s working pretty nicely for us. In addition to that, we are also using it as a geospatial search engine.

Localstre.am is going to be all about local, and that means geospatial data and lots of it. That’s not just POIs (points of interest) but also streets, cities, and areas of interest. In geospatial terms that means shapes: points, paths, and polygons. Doing geospatial search means searching through documents that have geospatial data associated with them, using a query that also contains geospatial data. So given a shape, find every document with a shape that overlaps or intersects with it.
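To give an idea of what such a query looks like, here is a rough sketch of a geo_shape query that finds every document whose shape intersects a bounding box around Berlin (the index, type, and field names are made up, and the geometry field would have to be mapped as a geo_shape):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class GeoShapeSearch {
  public static void main(String[] args) throws Exception {
    // Match every document whose 'geometry' field intersects the given
    // envelope (top-left and bottom-right corner, in lon/lat order).
    String query = "{\"query\":{\"geo_shape\":{\"geometry\":{"
        + "\"shape\":{\"type\":\"envelope\",\"coordinates\":[[13.0,52.6],[13.8,52.3]]},"
        + "\"relation\":\"intersects\"}}}}";

    URL url = new URL("http://localhost:9200/places/poi/_search");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    conn.setRequestProperty("Content-Type", "application/json");
    try (OutputStream out = conn.getOutputStream()) {
      out.write(query.getBytes(StandardCharsets.UTF_8));
    }
    try (Scanner scanner = new Scanner(conn.getInputStream(), "UTF-8")) {
      System.out.println(scanner.useDelimiter("\\A").next()); // raw json with the hits
    }
  }
}

Intersects is the default relation, and the same query structure works with points, paths, and polygons as the query shape.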

Since Elastic Search is still very new and rapidly evolving (especially the geospatial functionality), I had some worries about whether it would work as advertised. So, after months of coding, it was about time to see whether it could actually handle a decent data set instead of falling over and dying in a horrible way.

Using Elastic Search as a Key Value store

I have in the past used Solr as a key value store. Doing that provided me with some useful information:

  1. Using Solr as a key value store actually works reasonably well. I have in the past used indexes on a Solr two node master/slave setup with up to 90M documents (json) of roughly 1-2KB each, with a total index size (unoptimized) of well over 100GB, handling queries that returned tens to hundreds of documents at a throughput of 4 queries per second, with writes & replication going on at the same time. In a different project we had 70M compressed (gzip) xml values of roughly 5KB each stuffed in a Solr index that managed to sustain dozens of reads per second in a prolonged load test with sub 10ms response times. Total index size was a bit over 100GB. This was competitive (slightly better actually) with a MySql based solution that we load tested under identical conditions (same machine, data, and test). So, when I say Solr is usable as a key value store, I mean I have used it and would use it again in a production setting for data intensive applications.
  2. You need to be aware of some limitations with respect to eventual consistency, lack of transactionality, reading your own writes, and a few other things. In short, you need to make sure your writes don’t conflict, be aware of the latency between the moment you write something and the moment this write becomes visible through queries, and thus try not to read your own writes immediately after writing them.
  3. You need to be aware of the cost of certain operations in Lucene (the framework that is used by Solr). Getting stuff by id is cheap. Running queries that require Lucene to look at thousands of documents is not, especially if those documents are big. Running queries that produce large result sets is not cheap either. Mixing lots of reads and writes is going to kill performance due to repeated internal cache invalidation.
  4. Newer versions of Lucene offer vastly improved performance due to more clever use of memory, massive optimizations related to concurrency, and a less disruptive commit phase. Particularly Lucene 4 is a huge improvement, apparently.
  5. My experience under #1 is based on Lucene 2.9.x and 3.x, prior to most of the aforementioned improvements. That means I should get better results doing the same things with newer versions.

Recently, I started using Elastic Search, which is an alternative Lucene-based search product, and this makes the above even more interesting. Elastic Search is often depicted as simply a search engine similar to Solr. However, that is a bit of an understatement: it is quite a bit more than that.

It is better described as a schema-less, multi-tenant, replicating & sharding key value store that also implements extensible & advanced search features (geospatial, faceting, filtering, etc.).

In simpler terms: you launch it, throw data at it, find it back by querying it, and add more nodes to scale. It’s that simple. Few other products do this, and even fewer do it with as little ceremony as Elastic Search. This includes most common relational and nosql solutions on the market today. I’ve looked at quite a few. None come close to the out of the box utility and usability of Elastic Search.

That’s a big claim. So, let’s go through some of the basics to substantiate it a little:

  • Elastic Search stores/retrieves objects via a REST API. Convenient PUT, POST, GET, and DELETE APIs are provided that implement version checks (optionally on PUT), generate ids (optionally on POST), and allow you to read your own writes (on GET). This is what makes it a key value store; see the sketch right after this list for what this looks like in practice.
  • Objects have a type and go in an index. So, from a REST point of view, the relative uri to any object is /{index}/{type}/{id}.
  • You create indices and types at runtime via a REST API. You can create, update, retrieve and delete indices. You can configure the sharding and replication on a per index basis. That means Elastic Search is multi-tenant and quite flexible.
  • Elastic Search indexes the documents you store using either a dynamic mapping or a mapping you provide (recommended). That means you can also find your documents back via the search API.
  • This is where Lucene comes in. Unlike GET, search does not allow you to read your own writes immediately, because it takes time for indices to update and replicate, and doing this in bulk is more efficient.
  • The search API is exposed as a _search resource that is available at the server level (/_search), the index level (/{index}/_search), or the type level (/{index}/{type}/_search). So you can search across multiple indices. And because Elastic Search is replicating and sharding, across multiple machines as well.
  • When returning search results, Elastic Search includes a _source field that by default contains the object associated with each result. This means that querying is like doing a multi-get, i.e. expensive if your documents are big, your queries heavy, and your result sets large. What this means is that you have to carefully manage how you query your dataset.
  • The search API supports the GET and POST methods. POST exists as a backup for clients that don’t allow a json body as part of a GET request. The reason you need a body is that Elastic Search provides a domain specific language (json based, of course) for specifying complex queries. You can also use the Lucene query language with a q= parameter in the GET request, but it’s a lot less powerful and only useful for simple stuff.
  • Elastic Search is clustered by default. That means that if you start two nodes in the same network, they will hook up and become a cluster, without any configuration. So, out of the box, Elastic Search shards and replicates across whatever nodes are available in the network. You actually have to configure it not to do that (should you want that). Typically, you configure it in different ways for running in different environments. It’s very flexible.
  • Elastic Search is built for big deployments in e.g. Amazon AWS, Heroku, or your own data center. That means it comes with built-in monitoring features, a pluggable architecture for adapting to different environments (e.g. discovery using AWS and on the fly backups & restores to/from S3), and a lot of other stuff you need to run in such environments. This is a nice contrast to Solr, which doesn’t do any of these things out of the box (or well, for that matter) without a lot of fiddling with poorly documented stuff. I speak from experience here.
  • Elastic Search is also designed to be developer friendly. Include it as a dependency in your pom file and start it programmatically, or simply take the tar ball and use the script to launch it. Spinning up an Elastic Search node as part of a unit test is fairly straightforward, and it works the same as a full blown cluster in e.g. AWS.
  • Configuration is mostly runtime. There are only a few settings you have to decide on before launching. The out of the box experience is sensible enough that you don’t have to worry about it during development.
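To make the key value side of this concrete, here is a minimal sketch of the round trip using plain JDK HTTP against a local node (the people index, person type, and document are made up for illustration):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class KeyValueSketch {
  private static final String ES = "http://localhost:9200";

  public static void main(String[] args) throws Exception {
    // PUT: store a json document under /{index}/{type}/{id}
    call("PUT", ES + "/people/person/42", "{\"name\":\"Jane Doe\",\"city\":\"Berlin\"}");

    // GET: read your own write back immediately by id
    call("GET", ES + "/people/person/42", null);

    // Search using the Lucene query string syntax via the q= parameter; very
    // recent writes may not show up here until the next refresh.
    call("GET", ES + "/people/person/_search?q=city:Berlin", null);
  }

  private static void call(String method, String url, String body) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setRequestMethod(method);
    if (body != null) {
      conn.setDoOutput(true);
      conn.setRequestProperty("Content-Type", "application/json");
      try (OutputStream out = conn.getOutputStream()) {
        out.write(body.getBytes(StandardCharsets.UTF_8));
      }
    }
    try (Scanner scanner = new Scanner(conn.getInputStream(), "UTF-8")) {
      System.out.println(scanner.useDelimiter("\\A").next());
    }
  }
}

The GET returns the document you just PUT right away, while the search may lag behind by up to the refresh interval.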

In summary: Elastic Search is a pretty damn good key value store with a lot of properties that make it highly desirable if you are looking for a scalable solution to store and query your json data without spending a lot of time and effort on such things as configuration, monitoring, figuring out how to cluster, shard, and replicate, and getting it to do sensible things, etc.

There are a few caveats of course:

  • It helps to understand the underlying data structures used by Lucene. Some things come cheap, some other things don’t. In the end it is computer science and not magic. That means certain laws of physics still apply.
  • Lucene is memory and IO intensive. That means things generally are a lot faster if everything fits into memory and if you have memory to spare for file caching. This is true for most storage solutions, btw. For example, with MySQL you hit a brick wall once your indexes no longer fit in memory: insert times go through the roof and mixed inserts/selects become a nightmare scenario.
  • Mixed key value reads/writes combined with lots of expensive queries is going to require some tuning. You might want to specialize some of the nodes in your cluster for reads, for writes, and for querying load. You might want to think a bit about how many shards you need and how many replicas. You might want to think a bit about how you allocate memory to your nodes, and you might want to think a lot about which parts of your documents actually need to be indexed.
  • Elastic Search is not a transactional data store. If you need a transactional database, you might want to consider using one.
  • Elastic Search is a distributed, non-transactional system. That means getting comfortable with the notion of eventual consistency.
  • Using Elastic Search like you would use a relational database is a very bad idea. In particular, you don’t want to do joins or update multiple objects in big transactions. Joins translate to doing multiple expensive queries and then doing some in-memory work to throw away most of the results, all while those queries invalidate your internal caches. Doing many small interdependent writes is not a great idea either, since that tends to be a bit unpredictable in terms of which writes go in first and when they get reflected in your query results. Besides, you want to write in bulk with Lucene and avoid the overhead of doing many small writes.
  • Key value stores are all about de-normalizing and getting clever about how you query. It’s better to have big documents in one index than to have many small documents spread over several indices, because having many different indices probably means you have some logic somewhere that fires off a shit load of queries to combine things from those indices. Exactly the kind of thing you shouldn’t be doing. If you start thinking about indices and types in database terms (i.e. tables), that is a good indicator that you are on the wrong track.

So, we’re going to use Elastic Search at LocalStre.am. We’re a small setup with modest needs for the foreseeable future. Those needs are easily met with a generic elastic search setup (bearing in mind the caveats listed above). Most of our data is going to be fairly static and we like the idea of being able to scale our cluster without too much fuss from day 1.

It’s also a great match for our front end web application, which is based around the backbone javascript framework. Backbone integrates well with REST APIs and elastic search is a natural fit in terms of API semantics. This means we can keep our middleware very simple. Mostly it just passes through to Elastic Search after doing a bit of authentication, authorization, and validation. All we have is CRUD and a few hand crafted queries for Elastic Search.

On Java, Json, and complexity

Time for a long overdue Saturday morning blog post.

Lately, I’ve been involved in some interesting new development at work that is all about key value stores, Json, Solr, and a few other technologies I might have mentioned a few times. In other words, I’m having loads of fun at work doing stuff I really like. We’ve effectively tossed out XML and SQL and are now all about Json blobs that live in KVs and are indexed in Solr. Nothing revolutionary, except it is if you have been exposed to the wonderfully bloated and complicated world of Java Enterprise Edition and associated technologies. I’ve found that there are a bunch of knee-jerk reflexes that bias Java developers towards doing all the wrong things here. This is symptomatic of the growing gap between the enterprise Java people on the one side and the cool hip kids wielding cool new languages and tools on the other.

N900 tweaking

I’ve been tweaking my N900 quite a bit (just because I can).

Power management. Sadly, there are issues with some wifi routers related to power management. If you find yourself with connections timing out, the solution is to go to settings, internet connections. Then edit the problematic connection and go to the last page, which features an advanced button. There, under ‘other’, set power management to intermediate or off.

With that sorted out, you’ll want to be offline most of the day. So don’t turn on sip/im/facebook unless you need it and switch it off right after you’re done. Push email is nice but with 15/30 minute polling your battery will last longer.

To gain insight, of course install battery-eye. This plots a graph of your battery’s power reserves. Finally, you may want to install a few applets to dim the screen, turn wifi on/off, and switch between 2G and 3G. You can find these in the extras repository that is enabled by default in the application manager.

Apt-get. The application manager is nice but a bit sluggish and it insists on refreshing catalogs after just about each tap. Use it to install openssh and make sure to pick a good password (or set up key authentication). Then ssh into your n900 and use apt-get update and apt-get install just like you would on any decent Debian box. This is why you got this device.

Finding stuff to install. Instead of listing all the crap I installed, I’ll provide something more useful: ways of finding crap to install.

  • Ovi store. Small selection of goodies. Check it out but don’t count on finding too much there. Included for completeness.
  • Misc sites with the latest cool stuff:
  • Advanced (i.e. don’t come crying when you mess up and have to reflash): enable the extras, extras-testing, extras-devel repositories from here. Many useful things are provided here. Some of them have the potential to seriously mess up your device. Extras-devel is where all the good stuff comes from but it’s very much like Debian unstable.

Browser extensions. The N900 browser supports extensions. Install the adblock and maemo geolocation extensions through the application manager.

Use the browser. Instead of applications, you can use the browser and rely on web applications instead:

  • Cotchin. A web based foursquare client. Relies on the geolocation API for positioning.
  • Google Reader for touch screen phones.
  • Google maps mobile. Includes latitude, routing and other cool features. Relies on the geolocation API for positioning.
  • Maemaps. Pretty cool N900 optimized unofficial frontend for Google Maps.
  • Hahlo. A nice twitter client in the browser.