Data processing in Java: iterables-support

I’ve been doing quite a bit of data processing lately. I work with geotagged data such as POIs, tweets, images, Wikipedia articles, etc. I’m interested in processing this data to explore it and identify relations.

It used to be that Java was a really inconvenient language for this kind of thing and thus generally frowned upon. The reasons people often cite relate to the lack of expressiveness of the language and the large amount of boilerplate code you need to do even the simplest stuff.

This makes a compelling argument for using alternative languages such as Python or Ruby, or for using Hadoop with a domain-specific language like Pig on top. And indeed, I’ve used Python for some data processing jobs and found that while it has its nice sides (e.g. expressive syntax), it also has some pretty strong arguments against it. These include generally less capable frameworks for e.g. HTTP connectivity (dozens of frameworks to choose from, none of them coming close to Apache’s HttpClient for Java). Other issues I ran into are poor performance, very limited concurrency options (compared to e.g. the java.util.concurrent package), a quite weak standard library, awkward handling of UTF-8, a JSON parser that is sloooooow, and an XML library that is both awkward and limited.

Hadoop is nice if you have a cluster to run it on, but it is also a very complex beast that is not widely known for being particularly usable from a coding point of view (at the Java level, that is) or a deployment point of view. In practice you have to use things like Pig on top or, as my old colleague @sthuebner prefers, something like Clojure and Cascalog.

So, there’s a tradeoff between convenience, performance, expressiveness, and other factors. Or is there? Having worked with Java extensively since 1996, I’m actually quite comfortable with the language, the standard APIs, and the various open source frameworks I’ve used over the years. I’ve dabbled with lots of other stuff, but I always seem to come back to Java when I just need to get stuff done.

Java as a language is of course a bit outdated and not quite as fashionable as it once was. But despite the emergence of more powerful languages, it can still be made to do quite useful things, and it compensates for its lack of hipster appeal with robustness, performance, tons of libraries and add-on features, and hands down the best IDE support of any language. It’s still got a lot going for it.
