Data processing in Java: iterables-support

I’ve been doing quite a bit of data processing lately. I work with geo-tagged data such as POIs, tweets, images, Wikipedia articles, etc. I’m interested in processing this data to explore it and identify relations.

It used to be that Java was a really inconvenient language for this kind of thing and was thus generally frowned upon. The reasons people often cite relate to the lack of expressiveness of the language and the large amount of boilerplate code you need for even the simplest tasks.

This makes a compelling argument for using alternative languages such as Python or Ruby, or for using Hadoop with some domain-specific language like Pig on top. And indeed, I’ve used Python for some data processing jobs and found that while it has its nice sides (e.g. expressive syntax), there are also some pretty strong arguments against it. These include generally less capable frameworks for things like HTTP connectivity (dozens of frameworks to choose from, none of them coming close to Apache’s HttpClient for Java). Other issues I ran into are poor performance, very limited concurrency options (compared to e.g. the Java concurrent package), a quite weak standard library, awkward handling of UTF-8, a JSON parser that is sloooooow, and an XML library that is both awkward and limited.

Hadoop is nice if you have a cluster to run it on, but it is also a very complex beast that is not widely known for being particularly usable, either from a coding point of view (at the Java level, that is) or from a deployment point of view. In practice you have to use things like Pig on top or, as my old colleague @sthuebner prefers, something like Clojure and Cascalog.

So, there’s a tradeoff between convenience, performance, expressiveness, and other factors. Or is there? Having worked with Java extensively since 1996, I’m actually quite comfortable with the language, the standard APIs, and the miscellaneous open source frameworks I’ve used over the years. I’ve dabbled with lots of other stuff, but I always seem to come back to Java when I just need to get things done.

Java as a language is of course a bit outdated and not quite as fashionable as it once was. But despite the emergence of more powerful languages, it can still be made to do quite useful things, and it can compensate for its lack of language hipster-ness with robustness, performance, tons of libraries and add-on features, and hands down the best IDE support of any language. It’s still got a lot going for it.
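The kind of Iterable-centric helper the post’s title hints at can be sketched in plain Java. Below is a minimal, hypothetical illustration (the `filter` helper and `Predicate` interface are my own sketch, not the actual iterables-support API) of how lazy filtering over Iterables keeps data processing code readable without leaving Java:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch of an iterables-style helper, not the real library API.
public class IterablesSketch {
    interface Predicate<T> { boolean test(T t); }

    // Lazily filter an Iterable: elements are only examined during iteration,
    // so the source can be arbitrarily large (e.g. a stream of parsed tweets).
    static <T> Iterable<T> filter(final Iterable<T> source, final Predicate<T> p) {
        return new Iterable<T>() {
            public Iterator<T> iterator() {
                final Iterator<T> it = source.iterator();
                return new Iterator<T>() {
                    T next;
                    boolean hasNext = advance();

                    // Scan forward to the next matching element, if any.
                    boolean advance() {
                        while (it.hasNext()) {
                            T candidate = it.next();
                            if (p.test(candidate)) { next = candidate; return true; }
                        }
                        return false;
                    }
                    public boolean hasNext() { return hasNext; }
                    public T next() {
                        if (!hasNext) throw new NoSuchElementException();
                        T result = next;
                        hasNext = advance();
                        return result;
                    }
                    public void remove() { throw new UnsupportedOperationException(); }
                };
            }
        };
    }

    public static void main(String[] args) {
        Iterable<String> lines = Arrays.asList("tweet", "", "poi", "", "article");
        List<String> nonEmpty = new ArrayList<String>();
        for (String line : filter(lines, new Predicate<String>() {
            public boolean test(String s) { return !s.isEmpty(); }
        })) {
            nonEmpty.add(line);
        }
        System.out.println(nonEmpty); // prints [tweet, poi, article]
    }
}
```

Because the filtering happens inside the returned Iterator, nothing is copied up front, and helpers like this compose: a `map` over a `filter` over a file-backed Iterable reads quite naturally, even in pre-lambda Java.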


Disabling swap file

I have an iMac from 2009. At the time, I maxed out the memory at a whopping 4 GB. That may not seem like much these days, but it is still plenty for most things.

Over time, my disk has gotten a bit fragmented, applications have gotten a bit heavier (e.g. Firefox commonly uses well over a GB), and things have gotten a little slower.

I had already half decided to ‘solve’ this problem by buying a new Mac: simply get the latest model and max out the RAM. The only problem is that Apple seems to be dragging its heels on refreshing the iMac line, which is now more than a year old (ancient, basically). So I’ll have to make do with this machine until they get around to fixing that.

The primary reason for slowness on Macs is swapping. You may think you have enough RAM, and you may in fact have plenty of available RAM. It doesn’t matter: your Mac will swap applications to disk even when it doesn’t need the memory. That takes time, especially on fragmented disks. On my machine it got to the point where the disk would churn for minutes every time I opened an application, and there was considerable lag when switching between applications I had already opened. All this despite still having a full GB of memory available. So even though I was nowhere close to running out of memory, the OS decided to spend time swapping stuff to and from disk almost constantly. The reason: it was designed well over a decade ago, at a time when 16 MB was a massive amount of memory. I have about 250 times that amount.

I found this nice post explaining how to turn swapping off (you have to reboot for this to take effect). Now, there are all sorts of reasons why swapping and virtual memory are good things in theory, and all sorts of advice warning that all sorts of evil things can and will happen if you mess with this.
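From what I gathered, on Mac OS X of this vintage the swap mechanism lives in the `dynamic_pager` launch daemon, so the steps from that post boil down to something like the following. This is a sketch from memory; the exact path and flags are assumptions for this OS version, so double-check against your system before running:

```shell
# Stop launchd from starting the pager, persistently (-w). Assumed plist path
# for ~10.5/10.6-era Mac OS X; verify it exists on your machine first.
sudo launchctl unload -w /System/Library/LaunchDaemons/com.apple.dynamic_pager.plist

# Optionally reclaim the disk space used by the old swap files.
sudo rm /private/var/vm/swapfile*

# Reboot for the change to take effect. To re-enable swapping later:
# sudo launchctl load -w /System/Library/LaunchDaemons/com.apple.dynamic_pager.plist
```

The `-w` flag makes the change survive reboots by marking the daemon as disabled, which is exactly what you want here: swapping stays off until you explicitly load it again.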

Consider yourself warned.

There are also people claiming that swapping “improves performance”. That is of course massive bullshit: how does using a massively slower disk make things faster compared to only using RAM, which has far better bandwidth and latency? There are also loads of people who confuse virtual memory with swap space and whine that applications won’t run without virtual memory. Virtual memory is the notion that applications have access to a memory space that is much larger than the available RAM. On a 64-bit machine, this is typically way more than the capacity of your disk, or indeed most storage solutions (think a couple of hundred TB).

Swap space, on the other hand, is used for temporarily writing memory to disk that has actually been allocated and is in use by some application, in order to free it up for another application. I.e. you don’t swap virtual memory, only memory that is actually used. So swapping is only useful when you need more RAM than you have, and when you somehow prefer not to close applications but instead have their memory written to disk and read back as you switch between them.

Provided you have enough RAM to run all the applications you need, that should be exactly never. On servers, this is why you generally turn swapping off: by the time you need it, your server is dead anyway. On desktops, the same argument applies, provided you can fit your applications into memory comfortably.

So, I turned off swapping on my mac a few days ago and was rewarded with a massive reduction in disk activity and nearly full restoration of the performance I enjoyed when this machine was still new and 4GB was four times what you actually needed. I call that a huge win.

Now, there is one caveat: modern operating systems run hundreds of processes. Those processes use memory. Nasty things happen if it cannot be allocated. The price of running out of memory is that random processes may crash and fail. If that happens to a critical process, your entire OS becomes unstable and you may have to reboot. Solution: make sure you don’t run out of memory. The good news is that normally you shouldn’t run out of memory except when running buggy software with memory leaks or when running a lot of software or software with extreme memory needs (e.g. Adobe Aperture or some games). A little awareness about what is running goes a long way to avoiding any trouble.

I routinely have Firefox open with a few dozen tabs, plus Spotify, Skype, a terminal, Eclipse, and maybe Preview and a few other small things. With all that open, I’m still below 3 GB. That leaves about 1 GB for file caches, miscellaneous other OS needs, and a little headroom to open extra applications. If I run low on memory, or suspect that I will, I simply close a few applications.

I used to do this with Windows XP as well, and I only recall a handful of cases where applications crashed because they ran out of memory. That’s well worth the price for a vast increase in performance. Turning off swapping on old PCs is pretty much a no-brainer, and especially on new ones with way more memory than you need (and considering memory prices, why would you run with less …), I cannot imagine any scenario where swapping adds value. I suppose it would be a nice fallback as a last resort when the only alternative is for applications to crash, but unfortunately neither Macs nor Windows PCs have a setting that enables swap only in that situation. So turning off swapping entirely is the next best thing.

wireless hell (3) & new hardware

After the zillionth crash, I’ve given up. I managed to get the freezing down to once every few hours, but that’s not good enough. I just got the good old Cat 5 cable out of the closet and connected the PC to the cable modem the old-fashioned way. That seems to work very nicely. I can still use the wireless connection with my laptop from work.

Naturally, I’ll take the Siemens Gigaset USB Stick 108 crap back to the store. See what happens. If you’ve googled for that piece of shit and read my blog posts: don’t buy it, or if you bought it, trade it in for a different brand. I’ve updated everything: drivers (KT400 chipset, WLAN) and BIOS. Nothing seems to work, and the PC consistently does not crash when the USB stick is unplugged. It might be a compatibility problem with my motherboard, it might be faulty hardware, it might be crappy Siemens drivers. My guess is the latter.

Now the only problem is the cable connection itself. That, too, is not working too well. I’ve had several disconnects already. The online LED just goes off and the modem tries to reconnect for a while, which, so far, it has managed to do after fifteen minutes or so. I suspect the cable signal isn’t too good. This is an old building, after all.

I almost ordered a new PC last Friday. Unfortunately, the video card I wanted had to be ordered, and the guy from bt-mikro (right across the street) couldn’t guarantee he’d have it within a reasonable time frame, so he recommended I not order right away. Otherwise the setup seems nice: a dual-core AMD 4400+ with an Octek motherboard, 2 GB of memory, a nice 300 GB disk, a Samsung 204TS 20″ LCD screen, and a Logitech mouse and keyboard (wired; I’ve had enough of this wireless shit). I’ll buy it when the video card is available again: an Nvidia 7800 GT. That card (and its slightly beefier brother, the GTX) seems to be the card to have at the moment.

That should keep me happy for a few years. My current GeForce 4200 still works very nicely despite its age. I recently played Doom 3 on it with reasonable frame rates and some of the eye candy turned off.

Update: hardware has been ordered now 🙂