Jruby & UTF-8

I just spent a day trying to force jruby to do the right things with UTF-8 content. Throughout my career, I’ve dealt with UTF-8 issues on pretty much any project I’ve ever touched. It seems that world+dog just can’t be trusted to do the right things when it comes to this. If you ever see mangled content in your browser: somebody fucked it up.

Lately, I’ve been experiencing issues both with requests and responses in our sinatra based application on top of ruby rack. So despite, setting headers correctly, ensuring I hardcode everything to be UTF-8 end to end, it still mangles properly encoded utf-8 content by applying a system default encoding both on the way in and out of my system. Lets just say I’ve been cursing a lot in the past few days. In terms of WTF’s per second, I was basically not being very nice.

Anyway, Java has a notion of a default encoding that is set (inconsistently) from the environment that you start that you can only partially control with the file.encoding system property. On windows and OSX this is not going to be UTF-8. That combined with Ruby’s attitude to be generally sloppy about encodings and just hoping for the best is basically a recipe for trouble. Ruby strings just pickup whatever system encoding is default. Neither rack nor Sinatra try to align that in any way with the http headers it deals with.

When you start jruby using the provided jruby script it actually sets the file.encoding system property to utf-8. However, this does not affect java.nio.charset.Charset.defaultCharset(), which is used a lot by Jruby. IMHO that is a bug and a serious one. Let me spell this out: any code that relies on java.nio.charset.Charset.defaultCharset() is broken. Unless you explicitly want to support legacy content that is not in UTF-8, in which case you should be very explicit about exactly which of the gazillion encodings you want. The chances of that aligning with the defaults are going to be slim.

This broke in a particularly weird way for me: it was working fine on OS X (with its broken system defaults). However, it broke on our server, which is Ubuntu 12.04 LTS. If you’ve ever used ubuntu, you may have noticed that terminals use UTF-8 by default. This is great, all operating systems should do this. There’s one problem though: it’s not actually system wide and my mistake was assuming that it was. It’s fine when you log in and use a terminal. However, when starting jruby from an init.d script with a sudo, the terminal encoding reverts back to some shit ansi default from the nineteen sixties that is 100% guaranteed inappropriate for any modern web server. This default is then passed on to the jvm, which causes jruby with rack to the wrong things.

The fix, add this to your script:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8

This forces the encoding to be correct in the bash script and should hopefully trickle down the broken pile of software that just assumes the defaults make sense.

Crypto Crap in Python

I’m looking into doing a little cryptographic stuff in python. Nothing fancy, just some standard stuff. Not for the first time I’m bumping into this brick wall of “batteries included”, the notion that the python library comes with a lot of stuff that should be good enough for whatever you need to do. Only problem is that it doesn’t. XML parsing stinks in Python; http IO stinks (need lots of third party stuff to make that usable); no UTF-8 by default; etc.

Out of the box python is bloody useless unless you want to do some very simplistic stuff. So basically my problem is very simple: I need to be able to sign stuff and verify signatures in a way that is compatible with how stuff like this stuff is commonly done on the internet ™. I.e. you’d expect some pretty mature, well tested libraries to be around for whatever programming language you’d like to use. I know exactly where to go to get this stuff for Java, for example.

So we’re looking at some very basic capability to do stuff with algorithms like RSA, SHA1, MD5 etc. Batteries not included with python at all so I Google a bit to find out what people commonly use for this in python and stumble upon what seems to be the most popular library pycrypto. It seems to have all the algorithms, great! Only one minor detail that has had me crawl all over Google for the entire afternoon:

Public keys usually come as base64 encoded thingies: how the hell do I get them in and out of the functions/classes and what not provided by pycrypto. Batteries not included. After a long search, I find this nice post.

Basically it’s telling me that various people have bothered to provide nice libraries with relevant code for python but somehow all of them have neglected to provide this very basic functionality that you will need 100% guaranteed. That just sucks. In the hypothetical case that you’d actually want to use this stuff to do hypothetically useful things like verifying a signature attached to some http request you will basically find yourself reverse engineering this poorly documented library and figuring out how to get from a base 64 encoded RSA key to a properly configured RSA class instance and back again. I had lots of fun (not) reading about the details of RSA, x.509, etc.

Eventually I found some sample code here that seems to half do what I need. But I’d just prefer to be able to reuse something that is hassle free instead of copy pasting somebody else’s code and debugging it until it works as expected and basically reinventing the wheel by making what would amount to Jilles private little python crypto library. I have better things to do.