Jruby & UTF-8

I just spent a day trying to force jruby to do the right things with UTF-8 content. Throughout my career, I’ve dealt with UTF-8 issues on pretty much any project I’ve ever touched. It seems that world+dog just can’t be trusted to do the right things when it comes to this. If you ever see mangled content in your browser: somebody fucked it up.

Lately, I’ve been experiencing issues both with requests and responses in our sinatra based application on top of ruby rack. So despite, setting headers correctly, ensuring I hardcode everything to be UTF-8 end to end, it still mangles properly encoded utf-8 content by applying a system default encoding both on the way in and out of my system. Lets just say I’ve been cursing a lot in the past few days. In terms of WTF’s per second, I was basically not being very nice.

Anyway, Java has a notion of a default encoding that is set (inconsistently) from the environment that you start that you can only partially control with the file.encoding system property. On windows and OSX this is not going to be UTF-8. That combined with Ruby’s attitude to be generally sloppy about encodings and just hoping for the best is basically a recipe for trouble. Ruby strings just pickup whatever system encoding is default. Neither rack nor Sinatra try to align that in any way with the http headers it deals with.

When you start jruby using the provided jruby script it actually sets the file.encoding system property to utf-8. However, this does not affect java.nio.charset.Charset.defaultCharset(), which is used a lot by Jruby. IMHO that is a bug and a serious one. Let me spell this out: any code that relies on java.nio.charset.Charset.defaultCharset() is broken. Unless you explicitly want to support legacy content that is not in UTF-8, in which case you should be very explicit about exactly which of the gazillion encodings you want. The chances of that aligning with the defaults are going to be slim.

This broke in a particularly weird way for me: it was working fine on OS X (with its broken system defaults). However, it broke on our server, which is Ubuntu 12.04 LTS. If you’ve ever used ubuntu, you may have noticed that terminals use UTF-8 by default. This is great, all operating systems should do this. There’s one problem though: it’s not actually system wide and my mistake was assuming that it was. It’s fine when you log in and use a terminal. However, when starting jruby from an init.d script with a sudo, the terminal encoding reverts back to some shit ansi default from the nineteen sixties that is 100% guaranteed inappropriate for any modern web server. This default is then passed on to the jvm, which causes jruby with rack to the wrong things.

The fix, add this to your script:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8

This forces the encoding to be correct in the bash script and should hopefully trickle down the broken pile of software that just assumes the defaults make sense.

Jruby and Java at Localstream

Update. I’ve uploaded a skeleton project about the stuff in this post to Github. The presentation that I gave on this at Berlin Startup Culture on May 21st can be found here.

The server side code in the Localstream platform is a mix of Jruby and Java code. Over the past few months, I’ve gained a lot of experience using the two together and making the most of idioms and patterns in both worlds.

Ruby purist might wonder why you’d want to use Java at all. Likewise, Java purists might wonder why you’d waste your time doing Jruby at all instead of more hipster friendly languages such as Scala, Clojure, or Kotlin. In this article I want to steer clear of that particular topic and instead focus on more productive things such as what we use for deployment, dependency management, dependency injection, configuration, and logging. Also it is an opportunity to introduce two of my new Github projects:

The Java ecosystem provides a lot of good, very well supported technology. This includes the jvm itself but also libraries such as Google’s guava, misc. Apache frameworks such as httpclient, commons-lang, commons-io, commons-compress, the Spring framework, icu4j, and many others. Equivalents exist for Ruby, but mostly those equivalents leave a lot to be desired in terms of features, performance, design, etc. It didn’t take me long to conclude that a lot of the ruby stuff out there is sub-standard and not quite up to my level of expectations. That’s why I use Jruby instead of Ruby: it allows me to get the best of both worlds. The value of ruby is in its simplicity and the language. The value of Java is access to an enormous amount of good software. Jruby allows me to have both.

Continue reading

jsonj revisited

It’s been nearly two years since I created the jsonj project on github and blogged about it. Inspired by my observation that dealing with json in Java kind of sucks, I set out to fix it.

As part of our ongoing work to build the LocalStream platform I’ve been eating my own dogfood and this has resulted in a lot of improvements to this project. So, I thought it was about time to write about jsonj again.

Continue reading

Puppet

I recently started preparing for deploying our localstre.am codebase to an actual server. So, that means I’m currently having a lot of fun picking up some new skills and playing system administrator (and being aware of my short comings there). World plus dog seems to be recommending using either Chef or Puppet for this  and since I know few good people that are into puppet in a big way and since I’ve seen it used in Nokia, I chose the latter.

After getting things to a usable state, I have a few observations that come from my background of having engineered systems for some time and having a general gut feeling about stuff being acceptable or not that I wanted to share.

  • The puppet syntax is weird. It’s a so-called ruby DSL that tries to look a bit like json and only partially succeeds. So you end up with a strange mix of : and => depending on the context. I might be nit picking here but I don’t think this a particularly well designed DSL. It feels more like a collection of evolved conventions for naming directories and files that happen to be backed by ruby. The conventions, naming, etc. are mostly non sensical. For example the puppet notion of a class is misleading. It’s not a class. It doesn’t have state and you don’t instantiate it. No objects are involved either. It’s more close to a ruby module. But in puppet, a module is actually a directory of puppet stuff (so more like a package). Ruby leaks through in lots of places anyway so why not stick with the conventions of that language? For example by using gems instead of coming up with your own convoluted notion of a package (aka module in puppet). It feels improvised and gobbled together. Almost like the people who built this had no clue what they were doing and changed their minds several times. Apologies for being harsh if that offended you BTW ;-).
  • The default setup of puppet (and chef) assumes a lot of middleware that doesn’t make any sense whatsoever for most smaller deployments (or big deployments IMNSHO). Seriously, I don’t want a message broker anywhere near my servers any time soon, especially not ActiveMQ. The so-called masterless (puppet) or solo (chef) setups are actually much more appropriate for most people.  They are more simple and have less moving parts. That’s a good thing when it comes to deployments.
  • It tries to be declarative. This is mostly a good thing but sometimes it is just nice to have an implicit order of things following from the order in which you specify things. Puppet forces you to be explicit about order and thus ends up being very verbose about this. Most of that verbosity is actually quite pointless. Sometimes A really comes before B if I specify it in that order in one file.
  • It’s actually quite verbose compared to the equivalent bash script when it comes to basic stuff like for example starting a service, copying a file from a to b, etc. Sometimes a “cp foo bar; chmod 644 bar” just kinda does the trick. It kind of stinks that you end up with these five line blobs for doing simple things like that. Why make that so tedious?
  • Like maven and ant in the Java world it, it tries to be platform agnostic but only partially succeeds. A lot of platform dependencies creep in any way and generally puppet templates are not very portable. Things like package names, file locations, service names, etc. end up being platform specific anyway.
  • Speaking of which, puppet is more like ant than like maven. Like ant, all puppet does is provide the means to do different things. It doesn’t actually provide a notion of a sensible default way that things are done that you then customize, which is what maven does instead. Not that I’m a big fan of maven but with puppet you basically have to baby sit the deployment and worry about all the silly little details that are (or should be) bog standard between deployments: creating users, setting up & managing ssh keys, ensuring processes run with the appropriate restrictions, etc. This is a lot of work and like a lot of IT stuff it feels repetitive and thus is a good candidate for automation. Wait … wasn’t puppet supposed to be that solution? The puppet module community provides some reusable stuff but its just bits and pieces really and not nearly enough for having a sensible production ready setup for even the simplest of applications. It doesn’t look like I could get much value out of that community.

So, I think puppet at this stage is a bit of a mixed bag and I still have to do a lot of work to actually produce a production ready system. Much more than I think is justified by the simplicity of real world setups that I’ve seen in the wild. Mostly running a ruby or java application is not exactly rocket science. So, why exactly does this stuff continue to be so hard & tedious despite a multi billion dollar industry trying to fix this for the last 20 years or so?

I don’t think puppet is the final solution in devops automation. It is simply too hard to do things with puppet and way too easy to get it wrong as well. There’s too much choice, a lack of sensible policy, and way too many pitfalls. It being an improvement at all merely indicates how shit things used to be.

Puppet feels more like a tool to finish the job that linux distributors apparently couldn’t be bothered to do in arbitrary ways than like a tool to produce reliable & reproducible production quality systems at this point and I could really use a tool that does the latter without the drama and attitude. What I need is sensible out of the box experience for the following use case: here’s a  war file, deploy that on those servers.

Anyway, I started puppetizing our system last week and have gotten it to the point where I can boot a bunch of vagrant virtual machines with the latest LTS ubuntu and have them run localstre.am in a clustered setup. Not bad for a week of tinkering but I’m pretty sure I could have gotten to that point without puppet as well (possibly sooner even). And, I still have a lot of work to do to setup a wide range of things that I would have hoped would be solved problems (logging, backups, firewalls, chroot, load balancing a bog standard, stateless http server, etc). Most of this falls in the category of non value adding stuff that somebody simply has to do. Given that we are a two person company and I’m the CTO/server guy, that would be me.

I of course have the benefit of hindsight from my career in Nokia where I watched Nokia waste/invest tens of millions on deploying simple bog standard Java applications (mostly) to servers for a few years. It seems simple things like “given a war file, deploy the damn thing to a bunch of machines” get very complicated when you grow the number of people involved. I really want to avoid needing a forty people ops team to do stupid shit like that.

So, I cut some corners. My time is limited and my bash skills are adequate enough that I basically only use puppet to get the OS in a usable enough state that I can hand off to to a bash script to do the actual work of downloading, untarring, chmodding, etc. needed to  get our application running. Not very puppet like but at least it gets the job done in 40 lines of code or so without intruding too much on my time. In those 40 lines, I install the latest sun jdk (tar ball), latest jruby (another tar ball), our code base, and the two scripts to start elastic search and our jetty/mizuno based app server.

What would be actually useful is reusable machine templates for bog standard things like php and ruby capable servers, java tomcat servers, apache load balancers, etc with sensible hardened configurations, logging, monitoring, etc. The key benefit would be inheriting from a sensible setup and only changing the bits that actually need changing. It seems that is too much to ask for at this point and consequently hundreds of thousands of system administrators (or the more hipster devops if you are so inclined) continue to be busy creating endless minor variations of the same systems.

Using Java constants in JRuby

I’ve been banging my head against the wall for a while before finding a solution that doesn’t suck for a seemingly simple problem.

I have a project with mixed jruby and java code. In my Java code I have a bunch of string constants. I use these in a few places for type safety (and typo safety). Nothing sucks more than spending a whole afternoon debugging a project only to find that you swapped two letters in a string literal somewhere. Just not worth my time. Having ran into this problem with ruby a couple of times now, I decided I needed a proper fix.

So, I want to use the same constants I have in Java in jruby. Additionally, I don’t want to litter my code with Fields:: prefixes. In java I would use static imports. In ruby you can use includes, except that doesn’t quite work for java constants. And also, I want my ruby code to complain loudly when I make a typo with any of these constants. So ruby constants don’t solve my problem. Ruby symbols don’t solve my problem either (not typo proof). Attribute accessors are kind of not compatible with modules (need a class for that). So my idea was to simply add methods to the module dynamically.

So, I came up with the following which allows me to define a CommonFields module and simply include that wherever I need it.

require 'jbundler'
require 'singleton'

java_import my.package.Fields

# hack to get some level of type safety
# simply include the Fields module where you need to use 
# field names and  then you can simply use the defined 
#fields as if they were methods that return a string
module CommonFields
    @rubySpecificFields=[
      :foooo,
      :bar
    ]
    @rubySpecificFields.each do |field|
      CommonFields.send(:define_method, field.to_s) do
        return field.to_s
      end  
    end
    
    Fields.constants.each do | field |
      CommonFields.send(:define_method, Fields.const_get(field)) do
        return Fields.const_get(field)
      end
    end
end

include CommonFields
  
puts foooo, email, display_name

Basically all this does is add the Java constants (email and display_name are two of the constants) to the module dynamically when you include it. After that, you just use the constant names and it just works ™. I also added a list of ruby symbols so, I can have fields that I don’t yet have on the java side. This works pretty OK and I’m hoping jruby does the right things with respect to inlining the method calls. Most importantly, any typo will lead to an interpreter error about the method not existing. This is good enough as it will cause tests to fail and be highly visible, which is what I wanted.

Pretty neat and I guess you could easily extend this approach to implement proper enums as well. I spotted a few enum like implementations but they all suffer from the prefix verbosity problem I mentioned. The only remaining (minor) issue is that ruby constants are upper case and my Java constant names are upper case as well. Given that I want to turn them into ruby functions, that is not going to work. Luckily in my case, the constant values are actually camel cased field names that I can simply use as the function name in ruby. So that’s what I did here. I guess I could have lower cased the name as well.

I’m relatively new to ruby so I was wondering what ruby purists think of this approach.

GeoTools

In the past few weeks I’ve been working on a little project on git hub called GeoTools that you might be interested in if you are into geo spatial stuff.

GeoTools is a set of tools for manipulating geo hashes and geometric shapes in the wgs 84 coordinate system. A geo hash is a representation of a coordinate that interleaves the bit representations of the latitude and longitude and base32 encodes the result. This string representation has a very useful property: nearby coordinates will have the same prefix. As is observed in this blog post: http://blog.notdot.net/2009/11/Damn-Cool-Algorithms-Spatial-indexing-with-Quadtrees-and-Hilbert-Curves, geo hashes effectively encode the path to a leaf in a quad tree.

The algorithms used for implementing the functionality in GeoTools are mostly well known and commonly used in 2D graphics. However, I have not found any comprehensive open sourced implementation for the wgs 84 coordinate system.

The key functionality that motivated the creation of this library was the ability to cover shapes on a map with geo hashes for indexing purposes. To support that, I needed some elementary geometric operations implemented.

Features

  • GeoGeometry class with methods that allow you to:
    • Calculate distance between two coordinates using the haversine algorithm.
    • check bounding box containment for a point
    • check polygon containment for a point
    • get the center for a polygon
    • get bounding box for a polygon
    • convert circle to a polygon
    • create a polygon from a point cloud
    • translate a wgs84 coordinate by x & y meters along the latitude and longitude
  • GeoHashUtils class with methods that allow you to:
    • check containment of a point in a geohash
    • find out the boundingbox of a geohash
    • find out neighboring geohashes east, west, south, or north of a geohash
    • get the 32 sub geo hashes for a geohash, or the north/south halves, or the NE, NW, SE, SW quarters.
    • convert geo hash to a BitSet and a BitSet to a geo hash
    • cover lines, paths, polygons, or circles with geo hashes

I’ve deliberately kept the design simple and non object oriented. These classes have no external dependencies and only use the java.util package and java.lang.Math class. Consequently it should be easy to port this functionality to whatever other language. Also, I’ve not attempted to implement Point, Polygon, Path, Circle, or other classes to support this library. The reason for this is very simple: these things are commonly implemented in other frameworks and any attempt from my side to impose my own implementation would conflict with the need of others to reuse their own classes. I think it is somewhat of a design smell that world + dog feels compelled to implement their own Point class.

So instead, a point is represented as a latitude, longitude pair of doubles or as an array of two doubles (like in geo json). Likewise, paths and polygons are represented as arrays of points. A circle is a point and a radius. This should enable anyone to integrate this functionality easily.

Jsonj: a new library for working with JSon in Java

Update 11 July 2011: I’ve done several commits on github since announcing the project here. New features have been added; bugs have been fixed; you can find jsonj on maven central now as well; etc. In short, head over to github for the latest on this.

I’ve just uploaded a weekend project to github. So, here it is, jsonj. Enjoy.

If you read my post on json a few weeks ago, you may have guessed that I’m not entirely happy with the state of json in Java relative to other languages that come with native support for json. I can’t fix that entirely but I felt I could do a better job than most other frameworks I’ve been exposed to.

So, I sat down on Sunday and started pushing this idea of just taking the Collections framework and combining that with the design of the Json classes in GSon, which I use at work currently, and throwing in some useful ideas that I’ve applied at work. The result is a nice, compact little framework that does most of what I need it to do. I will probably add a few more features to it and expand some of the existing ones. I use some static methods at work that I can probably do in a much nicer way in this framework.

Continue reading