Accessing Elasticsearch clusters via a localhost node

I’m a regular at the Elasticsearch meetup here in Berlin and there are always lots of recent converts that are trying to wrap their head around the ins and outs of what it means to run an elasticsearch cluster. One issue that seems to baffle a lot of new users is the question of which node in the cluster has the master role. The correct answer is that it depends on what you mean by master. Yes, there is a master node in elasticsearch but that does not mean what you think it means: it merely means that a single node is elected to be the node that holds the truth about which nodes have which data and crucially where the master copies of shards live. What it does NOT mean is that that node has the master copy of all the data in the cluster. It also does NOT mean that you have to talk to specifically this node when writing data. Data in elasticsearch is sharded and replicated and shards and their replicas are copied all over the cluster and out of the box clients can talk to any of the nodes for both read and write traffic. You can literally put a load balancer in front of your cluster and round robin all the requests across all the nodes.

When nodes go down or are added to the cluster, shards may be moved around. All nodes synchronize information about which nodes have which shards and replicas of those shards. The elasticsearch master merely is the ultimate authority on this information. Elasticsearch masters are elected at runtime by the nodes in the cluster. So, by default, any of the nodes in the cluster can become elected as the master. By default, all nodes know how to look up information about which shards live where and know how to route requests around in the cluster.

A common pattern in larger cluster is to reserve the master role for nodes that do not have any data. You can specialize what nodes do via configuration. Having three or more such nodes means that if one of them goes down, the remaining ones can elect a new master and the rest of the cluster can just continue spinning. Having an odd number of nodes is a good thing when you are holding elections since you always have an obvious majority of n/2 + 1. With an even number you can end up with two equally sized network partitions.

The advantage of not having data on a node is that it is far less likely for such nodes to get into trouble with e.g. OutOfMemoryExceptions, excessive IO that slows the machine, or excessive CPU usage due to expensive queries. If that happens, the availability of the node becomes an issue and the risk emerges for bad things to happen. This is a bad thing on a node that is supposed to hold the master data for your cluster configuration. It becoming unavailable will cause other nodes to elect a new node as the maste. There’s a fine line between being unavailable and slow to respond, which makes this a particularly hard problem. The now, infamous call me maybe article highlights several different cluster failure scenarios abd most of these involve some sort of network partioning due to temporary master node failures or unavailability. If you are worried about this, also be sure to read the Elasticsearch response to this article. The botton line is that most of the issues have by now been addressed and are now far less likely to become an issue. Also, if you have declined to update to Elasticsearch 1.4.x with your production setup, now might be a good time to read up on the many known ways in which things can go bad for you.

In any case, empty nodes still do useful work. They can for example be used to serve traffic to elasticsearch clients. Most things that happen in Elasticsearch involve internal node communication since the data can be anywhere in the cluster. So, there are typically two or more network hops involved one from the client to what is usually called a routing node and from there to any other nodes that hold shards needed to complete the request that perform the logic for either writing new data to the shard or retrieving data from the shard.

Another common pattern in the Elasticsearch world is to implements clients in Java and make the embedd a cluster node inside the process. This embedded node is typically configured to be a routing only node. The big advantage of this is that it saves you from having to do one network hop. The embedded node already knows where all the shards live so application servers with an embedded node already know where all the shards are in the cluster and can talk directly to the nodes with these shards using the more efficient network protocol that the Elasticsearch nodes use to communicate with each other.

A few months ago in one of the meetups I was discussing this topic with one of the organizers of the meetup, Felix Gilcher. He mentioned an interesting variant of this pattern. Embedding a node inside an application only works for Java nodes and this is not possible if you use something else. Besides, dealing with the Elasticsearch internal API can be quite a challenge as well. So it would be convenient if non Java applications could get similar benefits. Then he suggested the obvious solution that actually you get most of the same benefits of embedding a node simply running a standalone, routing only elasticsearch node on each application server. The advantage of this approach is that each of the application servers communicates with elasticsearch via localhost, which is a lot faster than sending REST requests over the network. You still have a bit of overhead related to serializing and deserializing requests and doing the REST requests. However, all of that happens on localhost and you avoid the network hop. So, effectively, you get most of the benefits of the embedded node approach.

We recently implemented this at Inbot. We now have a cluster of three elasticsearch nodes and two application servers that each run two additional nodes that talk to the three nodes. We use a mix of Java, Javascript and ruby components on our server and doing this allows us to keep things simple. The eleasticsearch nodes on the application server have a comparatively small heap of only 1GB and typically consume few resources. We could probably reduce the heap size a bit further to 512MB or even 256MB since all these nodes do is pass around requests and data from the cluster to the application server. However, we have plenty of memory and have so far had little need to tune this. Meanwhile, our elasticsearch cluster nodes run on three fast 32GB machines and we allocate half of the memory for heap and reserve the rest for file caching (as per the Elasticsearch recommendations). This works great and it also simplifies application configuration since you can simply configure all applications to talk to localhost and elasticsearch takes care of the cluster management.

Using Elastic Search as a Key Value store

I have in the past used Solr as a key value store. Doing that provided me with some useful information:

  1. Using Solr as a key value store actually works reasonably well. I have in the past used indexes on a Solr two node master/slave setup with up to 90M documents (json), of roughly 1-2KB each with a total index size (unoptimized) of well over 100GB handling queries that returned tens to hundreds of documents at a 4 queries / second throughput. With writes & replication going on at the same time. In a different project we had 70M compressed (gzip) xml values of roughly 5KB each stuffed in a Solr index that managed to sustain dozens of reads per second in a prolonged load test with sub 10ms response times. Total index size was a bit over 100GB. This was competitive (slightly better actually) with a MySql based solution that we load tested under identical conditions (same machine, data, and test). So, when I say Solr is usable as a key value store, I mean I have used it and would use it again in a production setting for data intensive applications.
  2. You need to be aware of some limitations with respect to eventual consistency, lack of transactionality, and reading your own writes, and a few other things. In short, you need to make sure your writes don’t conflict, beware of a latency between the moment you write something and the moment this write becomes visible through queries, and thus not try not to read your own writes immediately after writing them.
  3. You need to be aware of the cost of certain operations in Lucene (the framework that is used by Solr). Getting stuff by id is cheap. Running queries that require Lucene to look at thousands of documents is not, especially if those things are big. Running queries that produce large result sets is not cheap either. Mixing lots of reads and writes is going to kill performance due to repeated internal cache validation.
  4. Newer versions of Lucene offer vastly improved performance due to more clever use of memory, massive optimizations related to concurrency, and a less disruptive commit phase. Particularly Lucene 4 is a huge improvement, apparently.
  5. My experience under #1 is based on Lucene 2.9.x and 3.x prior to most of the before mentioned improvements. That means I should get better results doing the same things with newer versions.

Recently, I started using Elastic Search, which is an alternative Lucene base search product, and this makes the above even more interesting. Elastic search is often depicted as simply a search engine similar to Solr. However, this is a bit of an understatement and it is quite a bit more than that.

It is better described as a schema less, multi tenant, replicating & sharding key value store that implements extensible & advanced search features (geo spatial, faceting, filtering, etc.) as well.

In more simple terms: you launch it, throw data at it, find it back querying it, and add more nodes to scale. It’s that simple. Few other products do this. And even less do it with as little ceremony as Elastic Search. This includes most common relational and nosql solutions on the market today. I’ve  looked at quite a few. None come close to the out of the box utility and usability of Elastic Search.

That’s a big claim. So, lets go through some of the basics to substantiate this a little:

  • Elastic search stores/retrieves objects via a REST API. Convenient PUT, POST, GET, and DELETE APIs are provided that implement version checks (optionally on PUT), generate ids (optionally on POST), and allow you to read your own writes (on GET). This is what makes it a key value store.
  • Objects have a type and go in an index. So, from a REST point of view, the relative uri to any object is /{index}/{type}/{id}.
  • You create indices and types at runtime via a REST API. You can create, update, retrieve and delete indices. You can configure the sharding and replication on a per index basis. That means elastic search is multi tenant and quite flexible.
  • Elastic Search indexes documents you store using either a dynamic mapping, or a mapping you provide (recommended). That means you can find back your documents via the search API as well.
  • This is where Lucene comes in. Unlike the GET, search does not allow you to read your own writes immediately because it takes time for indices to update, and replicate and doing this in bulk is more efficient.
  • The search API is exposed as a _search resource that is available in at the server level (/_search), index level (/{index}/_search, or type level (/{index}/{type}/_search). So you can search across multiple indices. And because Elastic Search is replicating and sharding, across multiple machines as well.
  • When returning search results, Elastic Search includes a _source field in the result set that by default contains the object associated with the results. This means that querying is like doing a multi-get, i.e. expensive if your documents are big, your queries expensive, and your result sets large. What this means is that you have to carefully manage how you query your dataset.
  • The search API supports the GET and POST methods. Post exists as a backup for clients that don’t allow a json body as part of a GET request. The reason you need one is that Elastic Search provides a domain specific language (json based, of course) to specify complex queries. You can also use the Lucene query language with a q=query parameter in the GET request but it’s a lot less powerful and only useful for simple stuff.
  • Elastic Search is clustered by default. That means if you start two nodes in the same network, they will hook up and become a cluster. Without configuration. So, out of the box, elastic search shards and replicates across whatever nodes are available in the network. You actually have to configure it to not do that (if you would actually want that). Typically, you configure it in different ways for  running in different environments. It’s very flexible.
  • Elastic Search is built for big deployments in e.g. Amazaon AWS, Heroku, or your own data center. That means it comes with built in monitoring features, a plugable architecture for adapting to different environments (e.g. discovery using AWS & on the fly backups & restores to/from S3), and a lot of other stuff you need to run in such environments. This is a nice contrast to solr, which doesn’t do any of these things out of the box (or well for that matter) without a lot of fiddling with poorly documented stuff. I speak from experience here.
  • Elastic Search is also designed to be developer friendly. Include it as a dependency in your pom file and start it programmatically, or simply take the tar ball and use the script to launch it. Spinning up an Elastic Search as part of a unit test is fairly straightforward and it works the same as a full blown cluster in e.g. AWS.
  • Configuration is mostly runtime. There are only a few settings you have to decide before launching. The out the box experience is sensible enough that you don’t have to worry about it during development.

In summary: Elastic Search is a pretty damn good key value store with a lot of properties that make it highly desirable if you are looking for a scalable solution to store and query your json data without spending a lot of time and effort on such things as configuration, monitoring, figuring out how to cluster, shard, and replicate, and getting it to do sensible things, etc.

There are a few caveats of course:

  • It helps to understand the underlying data structures used by Lucene. Some things come cheap, some other things don’t. In the end it is computer science and not magic. That means certain laws of physics still apply.
  • Lucene is memory and IO intensive. That means things generally are a lot faster if everything fits into memory and if you have memory to spare for file caching. This is true for most storage solutions btw. For example with MySql you hit a brick wall once your indexes no longer fit in memory. Insert speeds go through the roof basically and mixed inserts/selects become a nightmare scenario.
  • Mixed key value reads/writes combined with lots of expensive queries is going to require some tuning. You might want to specialize some of the nodes in your cluster for reads, for writes, and for querying load. You might want to think a bit about how many shards you need and how many replicas. You might want to think a bit about how you allocate memory to your nodes, and you might want to think a lot about which parts of your documents actually need to be indexed.
  • Elastic Search is not a transactional data store. If you need a transactional database, you might want to consider using one.
  • Elastic Search is a distributed, non transactional system. That means getting comfortable with the notion of eventual consistency.
  • Using Elastic Search like you would use a relational databases is a very bad idea. Particularly, you don’t want to do joins or update multiple objects in big transactions. Joins translate to doing multiple expensive queries and then doing some in memory stuff to throw away most of the results that just invalidated your internal caching. Doing many small interdependent writes is not a great idea either since that tends to be a bit unpredictable in terms of which writes go in first and when they get reflected in your query results. Besides, you want to write in bulk with Lucene and avoid the overhead of doing many small writes.
  • Key value stores are all about de-normalizing and getting clever about how you query. It’s better to have big documents in one index than to have many small documents spread over several indices. Because having many different indices probably means you have some logic somewhere that fires of a shit load of queries to combine things from those indices. Exactly the kind of thing you shouldn’t be doing.  If you start thinking about indices and types in database terms (i.e. tables), that is a good indicator that you are on the wrong track here.

So, we’re going to use Elastic Search at LocalStre.am. We’re a small setup with modest needs for the foreseeable future. Those needs are easily met with a generic elastic search setup (bearing in mind the caveats listed above). Most of our data is going to be fairly static and we like the idea of being able to scale our cluster without too much fuss from day 1.

It’s also a great match for our front end web application, which is based around the backbone javascript framework. Backbone integrates well with REST APIs and elastic search is a natural fit in terms of API semantics. This means we can keep our middleware very simple. Mostly it just passes through to Elastic Search after doing a bit of authentication, authorization, and validation. All we have is CRUD and a few hand crafted queries for Elastic Search.

 

 

CouchDB

We did a little exercise at work to come up with a plan to scale to absolutely massive levels. Not an entirely academic problem where I work. One of the options I am (strongly) in favor of is using something like couchdb to scale out. I was aware of couchdb before this but over the past few days I have learned quite a bit more that and am now even more convinced that couchdb is a perfect match for our needs. For obvious reasons I can’t dive in what we want to do with it exactly. But of course itemizing what I like in couchdb should give you a strong indication that it involves shitloads (think hundreds of millions) of data items served up to shitloads of users (likewise). Not unheard of in this day and age (e.g. Facebook, Google). But also not something any off the shelf solution is going to handle just like that.

Or so I thought …

The couchdb wiki has a nice two line description:

Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API. Among other features, it provides robust, incremental replication with bi-directional conflict detection and resolution, and is queryable and indexable using a table-oriented view engine with JavaScript acting as the default view definition language.

This is not the whole story but it gives a strong indication that quite a lot is being claimed here. So, lets dig into the details a bit.

Document oriented and schema less storage. CouchDB stores json documents. So, a document is nothing more than a JSON data structure. Fair enough. No schemas to worry about, just data. A tree with nodes, attributes and values. Up to you to determine what goes in the tree.

Conflict resolution. It has special attributes for the identity and revision of a document and some other couchdb stuff. Both id and revision are globally unique uuids. UPDATE revision is not a uuid (thanks Matt).That means that any document stored in any instance of couchdb anywhere on this planet is uniquely identifiable and that any revision of such a document in any instance of couchdb is also uniquely identifiable. Any conflicts are easily identified by simply examining the id and revision attributes. A (simple) conflict resolution mechanism is part of couchdb. Simple but effective for simple day to day replication.

Robust incremental replication. Two couchdb nodes can replicate to each other. Since documents are globally unique, it is easy to figure out which document is on which node. Additionally, the revision id allows couchdb to figure out what the correct revision is. Should you be so unlucky to have conflicting changes on both nodes, there are ways of dealing with conflict resolution as well. What this means is that any node can replicate to any other node. All it takes is bandwidth and time. It’s bidirectional so you can have a master-master setup where both nodes consume writes and propagate changes to each other. The couchdb use the concept of “eventual consistency” to emphasize the fact that a network of couchdb nodes replicating to each other will eventually have the same data and be consistent with each other, regardless of the size of the network or how out of sync the nodes are at the beginning.

Fault tolerant.Couchdb uses a file as its datastore. Any write to a couchdb instance appends stuff to this file. Data in the file already is never overwritten. That’s why it is fault tolerant. The only part of the file that can possibly get corrupted is at the end of the file, which is easily detected (on startup). Aside from that, couchdb is rock solid and guaranteed to never touch your data once it has been committed to disk. New revisions don’t overwrite old ones, they are simply appended to the file (in full) to the end of the file with a new revision id. You. Never. Overwrite. Existing. Data. Ever. Fair enough, it doesn’t get more robust than that. Allegedly, kill -9 is a supported shutdown mechanism.

Cleanup by replicating. Because it is append only, a lot of cruft can accumulate in the bit of the file that is never touched again. Solution: add an empty node, tell the others to replicate to it. Once they are done replicating, you have a clean node and you can start cleaning up the old ones. Easy to automate. Data store cleanup is not an issue. Update. As Jan and Matt point out in the comments, you can use a compact function, which would be a bit more efficient.

Restful. CouchDBs native protocol is REST operations over HTTP. This means several things. First of all, there are no dedicated binary protocols, couchdb clients, drivers, etc. Instead you use normal REST and service related tooling to access couchdb. This is good because this is exactly what has made the internet work for all these years. Need caching? Pick your favorite caching proxy. Need load balancing? Same thing. Need access from language x on platform y? If it came with http support you are ready to roll.

Incremental map reduce. Map reduce is easy to explain if you understand functional programming. If you’re not familiar with that, it’s a divide and conquer type strategy to calculate stuff concurrently from lists of items. Very long lists with millions/billions of items. How it works is as follows: the list is chopped into chunks. The chunks are processed concurrently in a (large) cluster to calculate something. This is called the map phase. Then the results are combined by collecting the results from processing each of the chunks. This is called the reduce phase. Basically, this is what Google uses to calculate e.g. pagerank and many thousands of other things on their local copy of the web (which they populate by crawling) the web regularly. CouchDB uses the same strategy as a generic querying mechanism. You define map and reduce functions in Javascript and couchdb takes care of applying them to the documents in its store. Moreover, it is incremental. So if you have n documents and those have been map reduced and you add another document, it basically incrementally calculates the map reduce stuff. I.e. it catches up real quick. Using this feature you can define views and query simply by accessing the views. The views are calculated on write (Update. actually it’s on read), so accessing a view is cheap whereas writing involves the cost of storing and the background task of updating all the relevant views, which you control yourself by writing good map reduce functions. It’s concurrent, so you can simply add nodes to scale. You can use views to index specific attributes, run clustering algorithms, implement join like query views, etc. Anything goes here. MS at one point had an experimental query optimizer backend for ms sql that was implemented using map reduce. Think expensive datamining SQL queries running as map reduce jobs on a generic map reduce cluster.

It’s fast. It is implemented in erlang which is a language that is designed from the ground up to scale on massively parallel systems. It’s a bit of a weird language but one with a long and very solid track record in high performance, high throughput type systems. Additionally, couchdb’s append only and lock free files are wickedly fast. Basically, the primary bottleneck is the available IO to disk. Couchdb developers are actually claiming sustained write throughput that is above 80% of the IO bandwidth to disk. Add nodes to scale out.

So couchdb is an extremely scalable & fast storage system for documents that provides incremental map reduce for querying and mining the data; http based access and replication; and a robust append only, overwrite never, and lock free storage.

Is that all?

No.

Meebo decided that this was all nice and dandy but they needed to partition and shard their data instead of having all their data in every couchdb node. So they came up with CouchDB Lounge. Basically what couchdb lounge does is enabled by the REST like nature of couchdb. It’s a simple set of scripts on top of nginx (a popular http proxy) and the python twisted framework (a popular IO oriented framework for python) that dynamically routes HTTP messages to the right couchdb node. Each node hosts not one but several (configurable) couchdb shards. As the shards fill up, new nodes can be added and the existing shards are redistributed among them. Each shard calculates its map reduce views, the scripts in front of the loadbalancer take care of reducing these views across all nodes to a coherent ‘global’ view. I.e. from the outside world, a couchdb lounge cluster looks just like any other couchdb node. It’s sharded nature is completely transparent. Except it is effectively infinitely scalable both in the number of documents it can store as well in the read/write throughput. Couchdb looks just like any other couchdb instance in the sense that you can run the full test suite that comes with couchdb against and it will basically pass all tests. There’s no difference from a functional perspective.

So, couchdb with couchdb lounge provides an off the shelf solution for storing, accessing and querying shitloads of documents. Precisely what we need. If shitloads of users come that need access, we can give them all the throughput they could possibly need by throwing more hardware in the mix. If shitloads is redefined to mean billions instead of millions, same solution. I’m sold. I want to get my hands dirty now. I’m totally sick and tired of having to deal with retarded ORM solutions that are neither easy, scalable, fast, robust, or even remotely convenient. I have some smart colleagues who are specialized in this stuff and way more who are not. The net result is a data layer that requires constant fire fighting to stay operational. The non experts routinely do stuff they shouldn’t be doing that then requires lots of magic from our DB & ORM gurus. And to be fair, I’m not an expert. CouchDB is so close to being a silver bullet here that you’d have to be a fool to ignore the voices telling you that it is all too good to be true. But then again, I’ve been looking for flaws and so far have not come up with something substantial.

Sure, I have lots of unanswered questions and I’m hardly a couchdb expert since technically, any newby with more than an hour experience coding stuff for the thing outranks me here. But if you put it all together you have an easy to understand storage solution that is used successfully by others in rather large deployments that seem to be doing quite well. If there are any limits in terms of the number of nodes, the number of documents, or indeed the read/write throughput, I’ve yet to identify it. All the available documentation seems to suggest that there are no such limits, by design.

Some good links:

Server side development sucks

Warning: long rant 🙂

The past half year+ I’ve been ‘enjoying’ myself with lots of technical things related to working with Java, the spring framework, maven, unit testing and lots of command line stuff. And while I like my job, I have to say: it can be enormously tedious from time to time.

If you are coding Java in Eclipse, life is good. Eclipse is enormously helpful and gives you more or less real time feedback on how you are doing with respect to warnings, compilation errors, etc. This is great because it saves you from manual edit compile debug fix cycles, which take time and are frustrating. Frustrating however is what I’d call the current state of server side Java, which takes place mostly outside Eclipse. I spend a shit load of time on a day to day basis waiting for my application server to catch up with what I did in Eclipse, only to find that some minor typo is blocking my progress. So shut down server, edit, compile, package, deploy, wait a minute or so, get the server in the same state you had before and see if it works. Repeat, endlessly. That’s more or less my day.

There’s lots of tools to make this easier. Maven is nice, but it is also a huge time waster with its insistence on checking for updates of everything you have every time you try to use it (20 dependencies is 20 GET requests to your maven repository) and on top of that running the test suite as well. Useful, except if you’ve done this 20 times in the last hour already and you are pretty damn sure you are not interested in whether it will pass this time since you just want to know if that one typo you fixed was actually good enough. Then there are ways of hooking up application servers to eclipse and making them be somewhat more reasonable about requiring full shutdown, deploy and restart. Still it is tedious. And it doesn’t help that maven is completely oblivious about this feature, leaving you to set up things manually. Or not, since that’s not exactly trivial.

The maven way is “our way or the highway”. Eclipse and application servers are part of the highway and while there is the maven eclipse plugin and the eclipse maven plugin (yes, you read that right), the point of both is that there is a (huge) gap to bridge. They each have their own idea of where source code and binaries go and where dependencies come from, or indeed where stuff gets deployed to be debugged. Likewise, maven’s idea of application server integration is wrapping the start up process with some plugin. It does nothing to speed up the actual process of getting the application server up and running. It just saves you from having to start it manually. Eclipse plugins exist to do the same. And of course getting the maven and eclipse plugins to play nice with each other is kind of a challenge. Essentially, there are three worlds: maven, eclipse, and the application server and you’ll lose shit loads of development time watching one catch up with the other.

I have in the back of my mind still last year’s project which was based on python Django. Last year, my job was like this: edit, alt+tab to browser, F5, test, alt+tab back to eclipse, fix, etc. I had 1-3 second roundtrips between my browser and editor, apache was loading the python files straight from my svn work directory. We had a staging server that updated with a cron job that did nothing but “svn up”. Since then I’ve had the pleasure of debating the merits of using dynamic languages in a server side environment in a place where everybody is partying on the Java bandwagon like it’s 1999. Well here it is: you spend a shitload less time waiting for maven or application servers to finish whatever they think needs to be done. On top of that, quite a lot of server side Java is actually scripting. Don’t fool yourself into believing otherwise. We use spring web flow, which means my application logic is part Java, part xhtml with jsf (and a half dozen xml name spaces for that), part xml definitions for webflow, part definitions for spring beans and part Javascript. Guess where you can find bugs: all of them. Guess which ones Eclipse actually provides real time feedback on: Java only. So basically all the disadvantages of using a scripting environment (run-time bug discovery) without most of the advantages (like fast edit-test round trips).

So it is not surprising to me that scripting languages are winning over a lot of people lately. You write less code, in less languages, and on top you get more time to spend editing it because you are not waiting for tools and servers to catch up with your editor. This matters a lot and the performance and scalability benefits of Java are melting away rapidly lately.

On top of that, when I look at what we do, which is really straightforward web & REST stuff with a mysql db, and what a shitload of code, magic producing tools and frameworks, complexity, etc. we end up with something feels terribly wrong. We use hibernate for our database layer. Great stuff, until it doesn’t do what you want and starts basically throwing pretty random RunTimeExceptions at you because you forgot to add a column in your database schema (thus reducing Java to a scripting language because all of this happens run-time).

We have sort of a three way impedence mismatch going here straight out of the cookbook of Enterprise Architecture. Our services speak DTOs (data transfer objects), the database layer needs model classes and the database itself speaks SQL. So a typical REST call will go like this: xml/json comes in, gets translated to dtos, which are manipulated and get translated into model objects, which are manipulated, which results in sql being sent to the database, records coming back and translated into model objects, dto’s and back to xml/json. Basically stuff can go wrong in any of these transitions and a lot of development is basically babysitting your application through all these transitions instead of writing actual application logic.

To make this work, we need mappings from dto’s to model classes and from there to the database. So to add 1 field: I have to edit a model class, update the dto class, edit the mapping from models to dtos, the mapping from models to databases, the database schema itself. Then I have to write tests that verify all the mappings still work correctly and adapt any depending tests. 1 field, about a dozen of files touched. Re-fucking-diculous in my view. This is made ‘easier’ with Dozer that maps object hierarchies to each other, hibernate that takes care of the database and which uses annotations that make magic happen around the places where you use them, resteasy that does all of the incoming and outgoing xml/json magic, maven to download the world (the number of dependencies we have 30+). All to get one really straightforward REST service + 8 table mysql database going.

In short, I kind of miss the days where server side java development meant servlets+jdbc: read parameters from request, do some select/update query, write some stuff to the response. It might lack elegance but you get the same job done with a fraction of the number of components and without having to wait for Spring to figure out how to instantiate a bazillion little objects that go into your application context. I kind of miss the simplicity of edit, f5, edit.

Anyway, end of rant.