CouchDB

We did a little exercise at work to come up with a plan to scale to absolutely massive levels. Not an entirely academic problem where I work. One of the options I am (strongly) in favor of is using something like couchdb to scale out. I was aware of couchdb before this but over the past few days I have learned quite a bit more about it and am now even more convinced that couchdb is a perfect match for our needs. For obvious reasons I can’t dive into what we want to do with it exactly. But of course itemizing what I like in couchdb should give you a strong indication that it involves shitloads (think hundreds of millions) of data items served up to shitloads of users (likewise). Not unheard of in this day and age (e.g. Facebook, Google). But also not something any off the shelf solution is going to handle just like that.

Or so I thought …

The couchdb wiki has a nice two line description:

Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API. Among other features, it provides robust, incremental replication with bi-directional conflict detection and resolution, and is queryable and indexable using a table-oriented view engine with JavaScript acting as the default view definition language.

This is not the whole story but it gives a strong indication that quite a lot is being claimed here. So, let’s dig into the details a bit.

Document oriented and schema-free storage. CouchDB stores JSON documents. So, a document is nothing more than a JSON data structure. Fair enough. No schemas to worry about, just data. A tree with nodes, attributes and values. Up to you to determine what goes in the tree.
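To make that concrete, here’s a sketch of what a stored couchdb document looks like. The field names and values below are made up for illustration; only _id and _rev are couchdb’s own, and the _rev value shown is just a plausible placeholder:

```javascript
// A sketch of a couchdb document. _id and _rev are managed by
// couchdb; everything else is up to you -- there is no schema.
const doc = {
  _id: "0a1b2c3d4e5f67890a1b2c3d4e5f6789",   // assigned by couchdb (a uuid)
  _rev: "2-7051cbe5c8faecd085a3fa6190000000", // made-up value; changes on every write
  type: "blogpost",                           // application-level convention, not couchdb
  title: "CouchDB",
  tags: ["database", "replication"],
  author: { name: "Jilles", site: "example.com" } // nested structure is fine
};

// It is plain JSON, so it round-trips through the standard tools:
const roundTripped = JSON.parse(JSON.stringify(doc));
console.log(roundTripped.title); // "CouchDB"
```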

Conflict resolution. It has special attributes for the identity and revision of a document and some other couchdb stuff. The id is a globally unique uuid (UPDATE: the revision is not a uuid, thanks Matt). That means that any document stored in any instance of couchdb anywhere on this planet is uniquely identifiable, and that any revision of such a document in any instance of couchdb is also uniquely identifiable. Any conflicts are easily identified by simply examining the id and revision attributes. A (simple) conflict resolution mechanism is part of couchdb. Simple but effective for simple day to day replication.

Robust incremental replication. Two couchdb nodes can replicate to each other. Since documents are globally unique, it is easy to figure out which document is on which node. Additionally, the revision id allows couchdb to figure out what the correct revision is. Should you be so unlucky as to have conflicting changes on both nodes, there are ways of dealing with conflict resolution as well. What this means is that any node can replicate to any other node. All it takes is bandwidth and time. It’s bidirectional, so you can have a master-master setup where both nodes consume writes and propagate changes to each other. CouchDB uses the concept of “eventual consistency” to emphasize the fact that a network of couchdb nodes replicating to each other will eventually have the same data and be consistent with each other, regardless of the size of the network or how out of sync the nodes are at the beginning.
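Triggering replication is itself just an HTTP call to the _replicate resource. A sketch in javascript (the node names and database name are made up; this only builds the requests rather than sending them, since that needs running couchdb instances):

```javascript
// Replication is a POST to /_replicate with a JSON body naming a
// source and a target. Doing it in both directions gives you the
// master-master setup described above.
function replicationRequest(serverUrl, source, target) {
  return {
    method: "POST",
    url: `${serverUrl}/_replicate`,
    headers: { "Content-Type": "application/json" },
    // couchdb pulls from source and pushes the changes into target
    body: JSON.stringify({ source, target })
  };
}

// Node A pushes to node B, and the reverse for bidirectional sync.
const aToB = replicationRequest("http://node-a:5984", "mydb", "http://node-b:5984/mydb");
const bToA = replicationRequest("http://node-a:5984", "http://node-b:5984/mydb", "mydb");
console.log(aToB.url); // "http://node-a:5984/_replicate"
```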

Fault tolerant. Couchdb uses a file as its datastore. Any write to a couchdb instance appends to this file. Data already in the file is never overwritten. That’s why it is fault tolerant. The only part of the file that can possibly get corrupted is the end, which is easily detected (on startup). Aside from that, couchdb is rock solid and guaranteed to never touch your data once it has been committed to disk. New revisions don’t overwrite old ones; they are simply appended (in full) to the end of the file with a new revision id. You. Never. Overwrite. Existing. Data. Ever. Fair enough, it doesn’t get more robust than that. Allegedly, kill -9 is a supported shutdown mechanism.

Cleanup by replicating. Because it is append only, a lot of cruft can accumulate in the bits of the file that are never touched again. Solution: add an empty node and tell the others to replicate to it. Once they are done replicating, you have a clean node and you can start cleaning up the old ones. Easy to automate. Data store cleanup is not an issue. UPDATE. As Jan and Matt point out in the comments, you can use a compact function, which would be a bit more efficient.

Restful. CouchDB’s native protocol is REST operations over HTTP. This means several things. First of all, there are no dedicated binary protocols, couchdb clients, drivers, etc. Instead you use normal REST and service related tooling to access couchdb. This is good because this is exactly what has made the internet work for all these years. Need caching? Pick your favorite caching proxy. Need load balancing? Same thing. Need access from language x on platform y? If it comes with http support you are ready to roll.
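As a sketch of what that looks like in practice (the server, database and document names are made up), every operation is an ordinary HTTP verb on a predictable URL:

```javascript
// The whole API is verbs on URLs, which is why ordinary HTTP
// proxies, caches and load balancers work unmodified in front
// of couchdb. This just builds the requests for illustration.
const server = "http://localhost:5984"; // couchdb's default port

function request(method, path, body) {
  return { method, url: server + path, body: body && JSON.stringify(body) };
}

const createDb = request("PUT", "/mydb");                         // create a database
const putDoc   = request("PUT", "/mydb/some-doc-id", { x: 1 });   // create/update a document
const getDoc   = request("GET", "/mydb/some-doc-id");             // fetch it (cacheable!)
const delDoc   = request("DELETE", "/mydb/some-doc-id?rev=1-abc"); // delete a given revision

console.log(getDoc.method, getDoc.url);
```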

Incremental map reduce. Map reduce is easy to explain if you understand functional programming. If you’re not familiar with that, it’s a divide and conquer type strategy to calculate stuff concurrently from lists of items. Very long lists with millions/billions of items. How it works is as follows: the list is chopped into chunks. The chunks are processed concurrently in a (large) cluster to calculate something. This is called the map phase. Then the results are combined by collecting the results from processing each of the chunks. This is called the reduce phase. Basically, this is what Google uses to calculate e.g. pagerank and many thousands of other things on their local copy of the web (which they populate by crawling the web regularly). CouchDB uses the same strategy as a generic querying mechanism. You define map and reduce functions in Javascript and couchdb takes care of applying them to the documents in its store. Moreover, it is incremental: if you have n documents that have been map reduced and you add another document, couchdb only calculates the map reduce stuff for what changed. I.e. it catches up real quick. Using this feature you can define views and query simply by accessing the views. The views are calculated on write (UPDATE. actually it’s on read), so accessing a view is cheap whereas writing involves the cost of storing plus the background task of updating all the relevant views, which you control yourself by writing good map reduce functions. It’s concurrent, so you can simply add nodes to scale. You can use views to index specific attributes, run clustering algorithms, implement join like query views, etc. Anything goes here. MS at one point had an experimental query optimizer backend for ms sql that was implemented using map reduce. Think expensive datamining SQL queries running as map reduce jobs on a generic map reduce cluster.
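A view definition is nothing more than a pair of javascript functions. Here’s a minimal sketch, with a toy in-memory harness standing in for couchdb’s view engine (the sample documents and the tags field are made up; in real couchdb the functions live in a design document):

```javascript
// Map: called once per document; call emit(key, value) as often
// as you like. couchdb provides emit(); the harness fakes it.
function map(doc) {
  if (doc.tags) {
    doc.tags.forEach(tag => emit(tag, 1));
  }
}

// Reduce: combine the values emitted for a key -- here, a count.
function reduce(keys, values) {
  return values.reduce((a, b) => a + b, 0);
}

// Toy harness standing in for couchdb's view engine.
function runView(docs, map, reduce) {
  const rows = [];
  global.emit = (key, value) => rows.push({ key, value }); // fake emit
  docs.forEach(map);                                       // the map phase
  const byKey = {};
  rows.forEach(({ key, value }) => (byKey[key] = byKey[key] || []).push(value));
  const result = {};
  for (const key of Object.keys(byKey)) {
    result[key] = reduce([key], byKey[key]);               // the reduce phase
  }
  return result;
}

const docs = [
  { _id: "1", tags: ["db", "erlang"] },
  { _id: "2", tags: ["db"] }
];
console.log(runView(docs, map, reduce)); // { db: 2, erlang: 1 }
```

The real view engine stores the computed rows and only re-runs map over documents that changed since the last time the view was accessed, which is where the incremental bit comes from.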

It’s fast. It is implemented in erlang which is a language that is designed from the ground up to scale on massively parallel systems. It’s a bit of a weird language but one with a long and very solid track record in high performance, high throughput type systems. Additionally, couchdb’s append only and lock free files are wickedly fast. Basically, the primary bottleneck is the available IO to disk. Couchdb developers are actually claiming sustained write throughput that is above 80% of the IO bandwidth to disk. Add nodes to scale out.

So couchdb is an extremely scalable & fast storage system for documents that provides incremental map reduce for querying and mining the data; http based access and replication; and a robust append only, overwrite never, and lock free storage.

Is that all?

No.

Meebo decided that this was all nice and dandy but that they needed to partition and shard their data instead of having all of it on every couchdb node. So they came up with CouchDB Lounge. What couchdb lounge does is enabled by the REST-like nature of couchdb: it’s a simple set of scripts on top of nginx (a popular http proxy) and the python twisted framework (a popular IO oriented framework for python) that dynamically routes HTTP messages to the right couchdb node. Each node hosts not one but several (configurable) couchdb shards. As the shards fill up, new nodes can be added and the existing shards are redistributed among them. Each shard calculates its map reduce views; the scripts in front of the loadbalancer take care of reducing these views across all nodes into a coherent ‘global’ view. I.e. from the outside world, a couchdb lounge cluster looks just like any other couchdb node. Its sharded nature is completely transparent. Except it is effectively infinitely scalable, both in the number of documents it can store and in the read/write throughput. A lounge cluster looks just like any other couchdb instance in the sense that you can run the full test suite that comes with couchdb against it and it will basically pass all tests. There’s no difference from a functional perspective.

So, couchdb with couchdb lounge provides an off the shelf solution for storing, accessing and querying shitloads of documents. Precisely what we need. If shitloads of users come that need access, we can give them all the throughput they could possibly need by throwing more hardware in the mix. If shitloads is redefined to mean billions instead of millions, same solution. I’m sold. I want to get my hands dirty now. I’m totally sick and tired of having to deal with retarded ORM solutions that are not easy, scalable, fast, robust, or even remotely convenient. I have some smart colleagues who are specialized in this stuff and way more who are not. The net result is a data layer that requires constant fire fighting to stay operational. The non experts routinely do stuff they shouldn’t be doing that then requires lots of magic from our DB & ORM gurus. And to be fair, I’m not an expert. CouchDB is so close to being a silver bullet here that you’d have to be a fool to ignore the voices telling you that it is all too good to be true. But then again, I’ve been looking for flaws and so far have not come up with anything substantial.

Sure, I have lots of unanswered questions and I’m hardly a couchdb expert since, technically, any newbie with more than an hour of experience coding stuff for the thing outranks me here. But if you put it all together you have an easy to understand storage solution that is used successfully by others in rather large deployments that seem to be doing quite well. If there are any limits in terms of the number of nodes, the number of documents, or indeed the read/write throughput, I’ve yet to identify them. All the available documentation seems to suggest that there are no such limits, by design.

Some good links:

Running Tomcat on N800

I’ve been doing some N800 hacking at work recently and have been toying with getting various programming environments going, including the one I know best: server-side Java.

To get Java on the N800, you will need jamvm. Some packages for that are provided here: http://www.internettablettalk.com/forums/showthread.php?t=2896. The packages are unsupported and not available elsewhere. Also there are some bugs (which is why it’s not in the official maemo repositories yet). The bugs mainly concern AWT and swing stuff, which I don’t really care about anyway.

To install (as root):

$ dpkg -i jamvm_1.4.3-1_armel.deb
$ dpkg -i classpath_0.91-1_armel.deb
$ dpkg -i jikes_1.22-1_armel.deb
$ export PATH=/usr/local/bin:$PATH
$ export CLASSPATH=/usr/local/classpath/share/classpath/glibj.zip:.

The last two commands add jamvm and jikes to the path and add the gnu classpath to the classpath.

Now you can run tomcat on the N800:

  • download tomcat 5.0.30beta from tomcat.apache.org. This is the last version that does not require java 1.5.
  • upload it to N800
  • download this zip file with configuration for tomcat with jamvm
  • copy provided server.xml to the tomcat conf directory. This configures tomcat to be a bit more lightweight than out of the box. It also disables mbeans which seem to require classes not in the gnu classpath. You can reduce footprint further by stripping stuff from the webapps dir (e.g. documentation + examples)
  • use the provided jamvmtomcat.sh script to start tomcat (you’ll need to adjust the paths inside). I don’t have a stop script but it should be trivial to add.
  • On my N800, startup time after first launch is 3-4 minutes and most of the servlet and jsp examples in the default tomcat work correctly; the whole thing is quite responsive. The jamvm process takes about 26MB with tomcat running. That’s all quite impressive considering that jamvm does not include a JIT.

    UPDATE. I’ve continued experimenting and got the startup time down to 2 minutes now. Basically I removed all webapps except the balancer app and my own war file. Some more experimenting with jamvm and other java stuff has taught me that the main performance problem is IO while loading classes. Once that is done, the interpreter is quite fast and usable.

wireless hell

This wireless shit is absolute crap. This is the second time I’m setting up a wireless network. The first time was for my father. He’d bought a wireless adsl router (smc) and a usb stick. It didn’t work. So he brought back the stick and got one from another brand (my advice). It sort of worked but the connection was very poor. So far he’d been doing stuff by the book (he knows nothing about computers) and then he called me in. First action: remove the crappy software from the previous stick. That didn’t work (the uninstaller is broken) but at least the connection was a bit more consistent now since the old software no longer intervened. Second action: replace the driver for the new stick. No improvement, but it made me feel better. Third action: replace the firmware of the router. That worked! Suddenly the connection was fine and it has been ever since. Conclusion: the router software was crap and the first usb stick was crap. More importantly, the manufacturer of the router knew it was crap and silently fixed the problem and released an update. Thousands of its customers must still be getting a really lousy end user experience. Getting this setup to work required non trivial intervention that no normal consumer would ever figure out. BTW the troubleshooting material, website and documentation were like the software: poorly written crap.

Now fast forward to 2005. I’m buying a wireless router and a usb stick. I know I am in for trouble. But I’m an optimist so I RTFM, sit back and watch the thing install. First sign of trouble: the widgets of the wlan monitor look like the thing has been written in ancient pascal. These old borland applications had a distinctive, shitty look and feel that dates back to 16 bit windows. I’ve seen several such applications in my life and I now associate this look and feel with poor quality software that is extremely likely to crash, hard. Had a screenshot of this software been displayed on the box, I would have left it on the shelf. But the connection sort of worked so I left it alone. Second sign of trouble: I now have an Odyssey thingy on top of the tcp thingy in my network properties, from an obscure company by the name of funk.com. Two layers of crappy software is like asking for trouble. It claims to do stuff with security.

Then an hour later my mouse stops moving; the system is frozen. My pc has some stability issues (cpu overheats) so I do not directly suspect the usb stick, yet. I reset. Hey, the wireless network icon is missing and the monitor icon is greyed out. Hmmm. What could be wrong here? Power cycle, same thing happens again. No connection. It’s plug and play so I (un)plug and pray. Now the wireless icon reappears but this time with a nice red cross. Improvement! Right click, repair connection. And we’re back in business. That’s some nasty bugs down there. Just to be sure I downloaded the latest driver from siemens. The well hidden changelog mentions some ‘installation issues’ were fixed, yeah right. Confusing website too.

Half an hour later it happens again, exact same scenario. And then again. Also I notice that GAIM keeps reconnecting every few minutes. Now I’m getting suspicious and zoom in on the monitor application: it looks like crap so it probably is crap. So I uninstall the entire wireless software (monitor, funky odyssey + driver), plug and pray, and this time manually install only the driver when windows ‘detects the new hardware’. Let the windows wireless wizard thingy do its thing (that’s crappy too but at least it looks like it was written this century). I seem to be back in business, for now. At least long enough to write this text. I’ll keep you posted. If it doesn’t stay online until tonight I’m bringing this heap of smelly crap back to the store.

Lesson learned: when installing wireless lan, don’t touch any of the cds that come with it. Download the latest driver, install it through the software that comes with windows and minimize the number of external software components.

UPDATE. It appears I’ve not solved the issue yet. Uptime has improved somewhat (up to 2.5 hours) but it still disconnects at random, has trouble finding my router and hangs very often. I’ve also upgraded the chipset software. The router firmware seems to be the latest version (at least, motorola doesn’t offer any upgrades). I’ve now disabled wep encryption. In other words, I’m running out of options. :-(. If this doesn’t do the trick, nothing will.

UPDATE 2. I enabled the upnp service. Seems to be working, I’ve been online for 4.5 hours now. Could this be it?

UPDATE 3. No computer freezes today. I did lose the connection a couple of times. I understand that wifi cards can operate on different channels and that channels 1 (default), 6 and 11 use different frequency ranges. I’m in an apartment block with several wifi routers in range and who knows what other equipment. So I changed the channel to 11. I suspect interference is part of the problem.

iTunes review

After about a day of intensive use of iTunes (5.0.1.4, win32) I have decided to stick with it for a while. However, I’m not entirely happy with it yet and I’ll list my detailed criticism here.

1) It looks nice but it is not very responsive, especially not when managing the amounts of music the ipod is intended for. I am talking about the noticeable lag when switching from one view to another, and the lack of feedback about what it is doing when, apparently, it can’t respond right away to mouse clicks.

2) I am used to winamp, which has a nerdy interface but gets several things right in the media library that iTunes doesn’t. The most important thing, the notion of a currently playing list of songs, is missing. That means that if you are navigating your songs, you are also editing the list of songs that is currently playing (unless you are playing a playlist in a separate window). This is extremely annoying because it means you can’t play albums and browse your other music at the same time, which is the way I prefer to listen to my music.

Steps to reproduce: put the library in browse mode (so you can select artist and then album), select one of your albums, start playing the first song. Browse to some other album, click the next button. Instead of playing track 2 of the album you were listening to (IMHO the one and only desired behavior), the music stops because by now a different set of files is selected.

A solution (or rather workaround) would be to create playlists for each album and play those. This cannot be done automatically, and I have 300+ albums. You can drag m3u files (a common playlist format that simply lists the files in the order they should be played) to itunes (good) but if you drag more than one it merges them into one big playlist (bad).

3) So if you have m3u files for your albums or other playlists, you still need to import them one by one. That sucks.

An alternative solution would be to treat albums as playlists when clicked upon.

The best solution is of course to do it like winamp. Until you start to play something new, the player plays whatever is in its current playlist. If you click an album, that becomes the current playlist. So simple, intuitive and yet missing. Of course it conflicts with the misguided notion of putting checkboxes in a list of 5000 files. The browse mode sort of covers up for this design error by automatically unchecking everything hidden by the browser. That’s why your album is unchecked when you select another album.

I can guess why apple chooses not to fix this issue: it requires changing the user interface to add a list of currently selected songs, and this product is for novice users, so adding user interface elements makes it more complex. Incidentally, the ipod is much smarter! It doesn’t change the current selection until you select something new, and browsing is not the same as selecting!

4) Double clicking a playlist opens a new window! The idea of a playlist is to play one song after another (like I want to do with my albums). Effectively the playlist becomes the active list once you start playing it. However, as discussed above, iTunes does not have a concept of a current playlist, so they ‘fixed’ it by opening a new window. IMHO this is needlessly confusing (for windows users; I understand multiple application windows is something mac users are more used to).

5) Of course this conflicts with the minimize to traybar option, which only works for the main window. You can also play playlists like albums, but then you encounter issue number 2 again. Conclusion: Apple’s fix for issue number 2 is a direct cause of number 4 (a serious usability issue) and this issue.

6) A separate issue is album art. Many users have file based mp3 players like winamp, which look for album art as a separate folder.jpg file in the directory the album mp3s are in. iTunes has an album art feature but will ignore those files. Worse, the only way to add album art is to add the image to each individual music file (so if your album is fifteen tracks, the same image must be added to fifteen files). Aside from the waste of diskspace (or worse, flash drive space), this is just too cumbersome to manage. I found a neat tool that can automate fetching and adding album art for albums.

7) Finally, some issues with the help system. I normally do not refer to help files unless I need them. A day of using iTunes has forced me to do this several times because the user interface has a lot of obscure buttons and options that are not always self-explanatory. For example, the menu option “consolidate library” sounds rather scary and, as I found out by reading the help file, you probably don’t want to click it. Another beautiful option is “group compilations when browsing”. This is a bit harder to figure out because the help search feature returns one result for ‘compilation’, which is a huge list of tips.

The problem: the help information is not organized around the user interface like it should be. Task based documentation is nice to have, but not if you are looking for information on button X in dialog Y.

So why do I still continue to use it? It is integrated in a clever way with my ipod 🙂 and I hope to find some solutions to the problems above using 3rd party tools. Ipod integration seems to work rather nicely: just plug it in and it synchronizes. I have the big version with plenty of space so I just want everything I have to be synchronized to it, and this seems to work well. Except for one thing:

8) Apparently I have songs that itunes can play but the ipod can’t. The synchronization process warns of this by telling me it can’t play some songs, but fails to inform me which ones (so I can’t fix it)! The obvious solution would be to translate these songs into something it can play when copying them to the ipod (and keep the original in itunes). All the tools to do this are available, so it should just do this, no questions asked.

UPDATE

I’ve found some more serious issues with drag and drop:
9) You can drag albums to the sidebar to create a playlist, and you can drag playlists to a folder, but you cannot drag albums to folders to create a playlist there.

10) Dragging multiple albums sadly creates only one playlist, so this is no solution for problem 2 and probably shares the same cause as problem 3.