Jruby & UTF-8

2013-06-14

I just spent a day trying to force jruby to do the right things with UTF-8 content. Throughout my career, I’ve dealt with UTF-8 issues on pretty much any project I’ve ever touched. It seems that world+dog just can’t be trusted to do the right things when it comes to this. If you ever see mangled content in your browser: somebody fucked it up.

Lately, I’ve been experiencing issues both with requests and responses in our sinatra based application on top of ruby rack. So despite, setting headers correctly, ensuring I hardcode everything to be UTF-8 end to end, it still mangles properly encoded utf-8 content by applying a system default encoding both on the way in and out of my system. Lets just say I’ve been cursing a lot in the past few days. In terms of WTF’s per second, I was basically not being very nice.

Anyway, Java has a notion of a default encoding that is set (inconsistently) from the environment that you start that you can only partially control with the file.encoding system property. On windows and OSX this is not going to be UTF-8. That combined with Ruby’s attitude to be generally sloppy about encodings and just hoping for the best is basically a recipe for trouble. Ruby strings just pickup whatever system encoding is default. Neither rack nor Sinatra try to align that in any way with the http headers it deals with.

When you start jruby using the provided jruby script it actually sets the file.encoding system property to utf-8. However, this does not affect java.nio.charset.Charset.defaultCharset(), which is used a lot by Jruby. IMHO that is a bug and a serious one. Let me spell this out: any code that relies on java.nio.charset.Charset.defaultCharset() is broken. Unless you explicitly want to support legacy content that is not in UTF-8, in which case you should be very explicit about exactly which of the gazillion encodings you want. The chances of that aligning with the defaults are going to be slim.

This broke in a particularly weird way for me: it was working fine on OS X (with its broken system defaults). However, it broke on our server, which is Ubuntu 12.04 LTS. If you’ve ever used ubuntu, you may have noticed that terminals use UTF-8 by default. This is great, all operating systems should do this. There’s one problem though: it’s not actually system wide and my mistake was assuming that it was. It’s fine when you log in and use a terminal. However, when starting jruby from an init.d script with a sudo, the terminal encoding reverts back to some shit ansi default from the nineteen sixties that is 100% guaranteed inappropriate for any modern web server. This default is then passed on to the jvm, which causes jruby with rack to the wrong things.

The fix, add this to your script:


export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8

This forces the encoding to be correct in the bash script and should hopefully trickle down the broken pile of software that just assumes the defaults make sense.