Functional tests and flakyness

I just stumbled on a nice article that Martin Fowler has had on his website for a few years about non deterministic tests. It’s a good read and it addresses something that I have encountered in multiple projects. Flaky test are indeed a problem in many places and I’ve had the ‘pleasure’ of dealing with such tests myself on a couple of occasions (often of my own making even).

Martin Fowler lists a few ways to mitigate this problem and his suggestions are excellent and well worth reading. But I have a few things to add that are not covered in that article.
Continue reading “Functional tests and flakyness”

GeoKV

As you may have guessed from my previous posts on my new company/project LocalStream and iterables-support, the two topics are related. A lot of the open source work I’ve been doing on GitHub lately actually relates to LocalStream. The post on iterables-support was intended as the first in a series of posts that highlight the different GitHub projects I’ve created while working on LocalStream. In the  second post of this serie, I want to highlight GeoKV, a persistent key value store for geospatial data that is optimised for processing and iterating over this data.

Continue reading “GeoKV”

CouchDB

We did a little exercise at work to come up with a plan to scale to absolutely massive levels. Not an entirely academic problem where I work. One of the options I am (strongly) in favor of is using something like couchdb to scale out. I was aware of couchdb before this but over the past few days I have learned quite a bit more that and am now even more convinced that couchdb is a perfect match for our needs. For obvious reasons I can’t dive in what we want to do with it exactly. But of course itemizing what I like in couchdb should give you a strong indication that it involves shitloads (think hundreds of millions) of data items served up to shitloads of users (likewise). Not unheard of in this day and age (e.g. Facebook, Google). But also not something any off the shelf solution is going to handle just like that.

Or so I thought …

The couchdb wiki has a nice two line description:

Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API. Among other features, it provides robust, incremental replication with bi-directional conflict detection and resolution, and is queryable and indexable using a table-oriented view engine with JavaScript acting as the default view definition language.

This is not the whole story but it gives a strong indication that quite a lot is being claimed here. So, lets dig into the details a bit.

Document oriented and schema less storage. CouchDB stores json documents. So, a document is nothing more than a JSON data structure. Fair enough. No schemas to worry about, just data. A tree with nodes, attributes and values. Up to you to determine what goes in the tree.

Conflict resolution. It has special attributes for the identity and revision of a document and some other couchdb stuff. Both id and revision are globally unique uuids. UPDATE revision is not a uuid (thanks Matt).That means that any document stored in any instance of couchdb anywhere on this planet is uniquely identifiable and that any revision of such a document in any instance of couchdb is also uniquely identifiable. Any conflicts are easily identified by simply examining the id and revision attributes. A (simple) conflict resolution mechanism is part of couchdb. Simple but effective for simple day to day replication.

Robust incremental replication. Two couchdb nodes can replicate to each other. Since documents are globally unique, it is easy to figure out which document is on which node. Additionally, the revision id allows couchdb to figure out what the correct revision is. Should you be so unlucky to have conflicting changes on both nodes, there are ways of dealing with conflict resolution as well. What this means is that any node can replicate to any other node. All it takes is bandwidth and time. It’s bidirectional so you can have a master-master setup where both nodes consume writes and propagate changes to each other. The couchdb use the concept of “eventual consistency” to emphasize the fact that a network of couchdb nodes replicating to each other will eventually have the same data and be consistent with each other, regardless of the size of the network or how out of sync the nodes are at the beginning.

Fault tolerant.Couchdb uses a file as its datastore. Any write to a couchdb instance appends stuff to this file. Data in the file already is never overwritten. That’s why it is fault tolerant. The only part of the file that can possibly get corrupted is at the end of the file, which is easily detected (on startup). Aside from that, couchdb is rock solid and guaranteed to never touch your data once it has been committed to disk. New revisions don’t overwrite old ones, they are simply appended to the file (in full) to the end of the file with a new revision id. You. Never. Overwrite. Existing. Data. Ever. Fair enough, it doesn’t get more robust than that. Allegedly, kill -9 is a supported shutdown mechanism.

Cleanup by replicating. Because it is append only, a lot of cruft can accumulate in the bit of the file that is never touched again. Solution: add an empty node, tell the others to replicate to it. Once they are done replicating, you have a clean node and you can start cleaning up the old ones. Easy to automate. Data store cleanup is not an issue. Update. As Jan and Matt point out in the comments, you can use a compact function, which would be a bit more efficient.

Restful. CouchDBs native protocol is REST operations over HTTP. This means several things. First of all, there are no dedicated binary protocols, couchdb clients, drivers, etc. Instead you use normal REST and service related tooling to access couchdb. This is good because this is exactly what has made the internet work for all these years. Need caching? Pick your favorite caching proxy. Need load balancing? Same thing. Need access from language x on platform y? If it came with http support you are ready to roll.

Incremental map reduce. Map reduce is easy to explain if you understand functional programming. If you’re not familiar with that, it’s a divide and conquer type strategy to calculate stuff concurrently from lists of items. Very long lists with millions/billions of items. How it works is as follows: the list is chopped into chunks. The chunks are processed concurrently in a (large) cluster to calculate something. This is called the map phase. Then the results are combined by collecting the results from processing each of the chunks. This is called the reduce phase. Basically, this is what Google uses to calculate e.g. pagerank and many thousands of other things on their local copy of the web (which they populate by crawling) the web regularly. CouchDB uses the same strategy as a generic querying mechanism. You define map and reduce functions in Javascript and couchdb takes care of applying them to the documents in its store. Moreover, it is incremental. So if you have n documents and those have been map reduced and you add another document, it basically incrementally calculates the map reduce stuff. I.e. it catches up real quick. Using this feature you can define views and query simply by accessing the views. The views are calculated on write (Update. actually it’s on read), so accessing a view is cheap whereas writing involves the cost of storing and the background task of updating all the relevant views, which you control yourself by writing good map reduce functions. It’s concurrent, so you can simply add nodes to scale. You can use views to index specific attributes, run clustering algorithms, implement join like query views, etc. Anything goes here. MS at one point had an experimental query optimizer backend for ms sql that was implemented using map reduce. Think expensive datamining SQL queries running as map reduce jobs on a generic map reduce cluster.

It’s fast. It is implemented in erlang which is a language that is designed from the ground up to scale on massively parallel systems. It’s a bit of a weird language but one with a long and very solid track record in high performance, high throughput type systems. Additionally, couchdb’s append only and lock free files are wickedly fast. Basically, the primary bottleneck is the available IO to disk. Couchdb developers are actually claiming sustained write throughput that is above 80% of the IO bandwidth to disk. Add nodes to scale out.

So couchdb is an extremely scalable & fast storage system for documents that provides incremental map reduce for querying and mining the data; http based access and replication; and a robust append only, overwrite never, and lock free storage.

Is that all?

No.

Meebo decided that this was all nice and dandy but they needed to partition and shard their data instead of having all their data in every couchdb node. So they came up with CouchDB Lounge. Basically what couchdb lounge does is enabled by the REST like nature of couchdb. It’s a simple set of scripts on top of nginx (a popular http proxy) and the python twisted framework (a popular IO oriented framework for python) that dynamically routes HTTP messages to the right couchdb node. Each node hosts not one but several (configurable) couchdb shards. As the shards fill up, new nodes can be added and the existing shards are redistributed among them. Each shard calculates its map reduce views, the scripts in front of the loadbalancer take care of reducing these views across all nodes to a coherent ‘global’ view. I.e. from the outside world, a couchdb lounge cluster looks just like any other couchdb node. It’s sharded nature is completely transparent. Except it is effectively infinitely scalable both in the number of documents it can store as well in the read/write throughput. Couchdb looks just like any other couchdb instance in the sense that you can run the full test suite that comes with couchdb against and it will basically pass all tests. There’s no difference from a functional perspective.

So, couchdb with couchdb lounge provides an off the shelf solution for storing, accessing and querying shitloads of documents. Precisely what we need. If shitloads of users come that need access, we can give them all the throughput they could possibly need by throwing more hardware in the mix. If shitloads is redefined to mean billions instead of millions, same solution. I’m sold. I want to get my hands dirty now. I’m totally sick and tired of having to deal with retarded ORM solutions that are neither easy, scalable, fast, robust, or even remotely convenient. I have some smart colleagues who are specialized in this stuff and way more who are not. The net result is a data layer that requires constant fire fighting to stay operational. The non experts routinely do stuff they shouldn’t be doing that then requires lots of magic from our DB & ORM gurus. And to be fair, I’m not an expert. CouchDB is so close to being a silver bullet here that you’d have to be a fool to ignore the voices telling you that it is all too good to be true. But then again, I’ve been looking for flaws and so far have not come up with something substantial.

Sure, I have lots of unanswered questions and I’m hardly a couchdb expert since technically, any newby with more than an hour experience coding stuff for the thing outranks me here. But if you put it all together you have an easy to understand storage solution that is used successfully by others in rather large deployments that seem to be doing quite well. If there are any limits in terms of the number of nodes, the number of documents, or indeed the read/write throughput, I’ve yet to identify it. All the available documentation seems to suggest that there are no such limits, by design.

Some good links:

Crypto Crap in Python

I’m looking into doing a little cryptographic stuff in python. Nothing fancy, just some standard stuff. Not for the first time I’m bumping into this brick wall of “batteries included”, the notion that the python library comes with a lot of stuff that should be good enough for whatever you need to do. Only problem is that it doesn’t. XML parsing stinks in Python; http IO stinks (need lots of third party stuff to make that usable); no UTF-8 by default; etc.

Out of the box python is bloody useless unless you want to do some very simplistic stuff. So basically my problem is very simple: I need to be able to sign stuff and verify signatures in a way that is compatible with how stuff like this stuff is commonly done on the internet ™. I.e. you’d expect some pretty mature, well tested libraries to be around for whatever programming language you’d like to use. I know exactly where to go to get this stuff for Java, for example.

So we’re looking at some very basic capability to do stuff with algorithms like RSA, SHA1, MD5 etc. Batteries not included with python at all so I Google a bit to find out what people commonly use for this in python and stumble upon what seems to be the most popular library pycrypto. It seems to have all the algorithms, great! Only one minor detail that has had me crawl all over Google for the entire afternoon:

Public keys usually come as base64 encoded thingies: how the hell do I get them in and out of the functions/classes and what not provided by pycrypto. Batteries not included. After a long search, I find this nice post.

Basically it’s telling me that various people have bothered to provide nice libraries with relevant code for python but somehow all of them have neglected to provide this very basic functionality that you will need 100% guaranteed. That just sucks. In the hypothetical case that you’d actually want to use this stuff to do hypothetically useful things like verifying a signature attached to some http request you will basically find yourself reverse engineering this poorly documented library and figuring out how to get from a base 64 encoded RSA key to a properly configured RSA class instance and back again. I had lots of fun (not) reading about the details of RSA, x.509, etc.

Eventually I found some sample code here that seems to half do what I need. But I’d just prefer to be able to reuse something that is hassle free instead of copy pasting somebody else’s code and debugging it until it works as expected and basically reinventing the wheel by making what would amount to Jilles private little python crypto library. I have better things to do.

jedit plugin manager

I tried to install some plugins in jEdit, my favourite programming editor (for things other than Java, for that I use eclipse of course). I got some IO errors trying to install some plugins in the plugin manager. Since this has happened before, I looked into it and found a solution to the problem:

  • go to Utilities->Global options
  • Select plugin manager
  • click ‘update mirror list’
  • select one of the alternatives

Apparently the problem is that the default repository url in jEdit is no longer ok. Changing it to another one fixes that problem. Since the whole point of jEdit is using the many plugins that are available this is a pretty critical thing to fix.

Anyway, I’m glad to see that development for jEdit seems to be picking up again. I noticed that the 4.3pre4 release is fairly recent. Also the sourceforge page shows that there is a healthy activity on the core jEdit source code. I’m glad because it started to feel like this project was more or less dead. Jedit is pretty unique, all the other editors have (much) less features.

Java theory and practice: Urban performance legends, revisited

I don’t comment very often on java performance myth busting articles anymore (I used to do this a lot at slashdot and the javalobby). However, this: Java theory and practice: Urban performance legends, revisited is an exceptionally good and informative article that focuses on memory management. The article explains how both memoray allocation and deallocation and stack usage vs heap usage are smarter in Java then with the default C/C++ mechanisms for memory allocation.

Of course the issue is that whereas there are a lot of performance myths, it is also undeniably true that many Java desktop applications are perceived as slow. On the server the debate is pretty much over. Not in the least because C/C++ is not very suitable for building secure network applications for reasons which have a lot to do with how memory is managed (or rather not managed) in C/C++. All the competing server languages are either interpreted or dynamically compiled.

There are a lot of myths about dynamic compilation too. A lot of people state that “because java is interpreted it will always be slower than a compiled language”. Such statements are usually based on a poor understanding of compiler technology.

In short, a compiler is a program that translates a program to running code. A dynamic compiler does this at the latest possible/convenient time (after starting the program, before executing the code). A static compiler on the other hand does this before the program is started.

So why is the above statement wrong: a dynamic compiler has the same bag of tricks to make code run fast that a static compiler has. Wait that’s not true either: it has a larger bag of tricks because it can observe the code while it is running and optimize for the runtime conditions. All this requires some overhead of course: a minor static effort to do simple compilation + whatever optimizations are worthwhile to do (this to can be determined dynamically).

On modern computer systems, there is plenty of time for this. Proportional to the total cpu time used, the overhead for compilation and optimization is neglegible. Only in constrained environments such as low end mobile phones this is not (yet) true. High end phones on the other hand are already much faster than the PCs I used to do swing development on in 1998!

So memory management and dynamic compilations are the source of many performance myths and yet are also arguably making Java a faster environment than plain old C/C++. So where does the slowness come from. It has a lot to do with program design and library design.

Java makes it easy to do things that used to be really difficult (IO, threading, databases). But you need to know what you are doing to make it perform well.

Especially when trying to make a GUI application, a lot of these things tend to come together. If the application freezes a lot (a common complaint), that is not because java is garbage collecting but because the developer did not understand threading. If the application uses a lot of memory, the reason is probably that it is allocating a lot of it (duh). If that’s a problem, good developers have a large bag of tricks at their disposal to detect and fix memory issues.