log4j, maven, surefire, jetty and how to make it work

I spend some time yesterday on making log4j behave. Not for the first time (gave up on several occasions) and I was getting thoroughly frustrated with how my logs refuse to conform to my log4j configuration, or rather any type of configuration. This time, I believe I succeeded and since I know plenty of others must be facing the exact same misery and since most of the information out there is downright misleading in the sense of presenting all types of snake oil solutions that actually don’t change a thing, here’s a post that offers a proper analysis of the problem and a way out. That, and it’s a nice note to self as well. I just know that I’ll need to set this up again some day.

In a nutshell, the problem is that there are multiple ways of doing logging in Java and one in particular, Apache common-logging, is misbehaving. This trusty little library has not evolved significantly since about 2006 and is depended on by just about any dependency in the maven repository that does logging, mostly for historical reasons. Some others depend on log4j directly and yet some others depend on slf4j (Simple Logging Facade for Java). Basically, you are extremely likely to have a transitive dependency on all of these and even a few dependencies on JDK logging (introduced in Java 1.4).

The main goal of commons.logging is not having to choose log4j or JDK logging. It acts as a facade and picks one of them using some funky reflection. Nice but most sysadmins and developers I have worked with seem to favor log4j anyway and hate commons-logging with a passion. In our case, all our projects depend on log4j directly and that’s just the way it is.

One of the nasty things with commons-logging is that it behaves weirdly in complex class loading situations. Like in maven or a typical application server. As a result, it takes over orchestration of the logs for basically the whole application and wrongly assumes that you want to use jdk logging or some log4j configuration buried deep in one of your dependencies. WTF is up with that btw, don’t ship logging configuration with a library. Just don’t.

Symptoms: you configure logger foo in log4j.xml to STFU below ERROR level and when running your maven tests (or even from eclipse) or in your application server it barfs all over the place at INFO level, which is the unconfigured default. To top it off, it does this using an appender you sure as hell did not configure. Double check, yes log4j is on the classpath, it finds your configuration, it even creates the log file you defined an appender for but nothing shows up there. You can find out what log4j is up to with -Dlog4j.debug=true. So, log4j is there, configured and all but commons-logging is trying to be smart and is redirecting all logged stuff, including the stuff actually logged with log4j directly, to the wrong place. To add to your misery, you might have partly succeeded with your attempts to get log4j working so now you have stuff from different log libraries showing up in the console.

A decent workaround in that case is to define a file appender, which will be free from non log4j stuff. This more or less is the advice for production deployments: don’t use a console logger because dependencies are prone to hijacking the console for all sorts of purposes.

So, good advice, but less than satisfactory. To fix the problem properly, make sure you don’t have commons-logging on the classpath. At all. This will break all the stuff that depends on it being there. Fix that by using slf4j instead. Slf4j comes in several maven modules. I used the following ones:

  1. jcl-over-slf4j is a drop-in, API compatible replacement for commons logging. It writes messages logged through commons-logging using slf4j, which is similar to commons-logging but behaves much nicer (i.e. it actually works). It’s designed to fix the problem we are dealing with here. The only reason it exists is because commons-logging is hopelessly broken.
  2. slf4j-api is used by dependencies already depending on slf4j
  3. slf4j-log4j12 the backend for log4j. If this is on the classpath slf4j will use log4j for its output. You want this.

That’s it. Here’s what I had to do to get a properly working configuration:

  1. Use mvn dependency:tree to find out which dependencies are transitively/directly depending on commons-logging.
  2. fix all of these dependencies with a
    <exclusions>
        <exclusion>
            <groupId>commons-logging</groupId>
            <artifactId>commons-logging</artifactId>
        </exclusion>
    </exclusions>
  3. You might have to iterate fixing the dependencies and rerunning mvn dependency:tree since only the first instance of commons-logging found will used transitively.
  4. Now add these dependencies to your pom.xml:
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>1.5.10</version>
    </dependency>                
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>jcl-over-slf4j</artifactId>
        <version>1.5.10</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.5.10</version>
    </dependency>
  5. Maven plugins have their own dependencies, separately from your normal dependencies. Make that you add the three slf4j dependencies to surefire, jetty, and other relevant plugins. At least jetty seems to already depend on slf4j.
  6. Finally make sure that your plugins have system properties defining log4j.configuration=file:[log4j config location]. Most of the googled advice on this topic covers this (and not much else). Some plugins can be a bit hard to configure due to the fact that they fork off separate processes.

That should do the trick, assuming you have log4j on the classpath of course.

Migrating iTunes windows to Mac

As part of my recent imac purchase came the chore of moving stuff over from my old hard drive which thankfully survived my old PC’s premature demise. Part of the stuff that I naturally wanted to migrate is of course my iTunes library.

Last time I moved my iTunes library is 3.5 years ago, when I bought my then new and now broken PC. At the time I wrote up a little howto on iTunes migration that over the years has gotten me tens of thousands of hits (dozens per day). Some people were kind enough to leave a message and apparently I helped out several of them with those instructions.

So, here’s the follow up.

Circumstances are slightly different now. I “consolidated” my library ages ago, which means iTunes moves all files to its own directory. Additionally, I reorganized my library, fixed tagging issues, etc. In other words, I put a lot of time into organizing my music and keeping it organized and would hate to start from scratch. Secondly, we’re several major iTunes versions down the road and they added album art fetching, genius and loads of other features.

So, I started out by following the original instructions of three years ago. This nearly works, except the music is not playable from iTunes for some vague reason. So no longer up to date despite a comment of some person who successfully managed a win to mac migration the “old” way.

Next I tried something that I didn’t believe would work but which worked anyway:

  • Copy library from former C drive to where it is supposed to be on your mac (all files & dirs)
  • Don’t modify a single file, they’re fine as they are.
  • Start itunes

Right after it starts it “updates” the library and that’s it. That’s a lot faster & easier. Play counts, play lists, ratings, you name it. It’s there. Thanks Apple, this is a lot better than three years ago and exactly the way it is supposed to work.

Failing power supply

In April 2007, I replaced a broken power supply in my PC with a Antec SmartPower 2.0 500W Power Supply. Check my review here. A few days ago, my pc has started producing a high pitched noise. Really annoying. So, I Google a little and what do I find: Antec SmartPower 2.0 500W Power Supplies apparently have 21% failure rate. Tell tale signs include the damn thing making high pitched noise.

I have to investigate a little further but probably this means the power supply is failing after less than one and a half year. Out of warranty of course. Damn it, really annoying to have to open that case again to replace the same part. Basically, the PC is now nearly three years old and maybe I should just replace it altogether. Something quiet, fast and reliable would be nice.

In a few weeks my new Macbook Pro should arrive at work (ordered yesterday). I was planning to wait and see if I like that and if so, just upgrade to a nice Mac at home as well. Not fully convinced yet.

Feel free to recommend a decent PSU. Has to power a Nvidia 7800, 2 drives, lots of USB hardware and a amd 4400+ dualcore CPU.

Update. I ended up installing a ZALMAN ZM600-HP. Seems to have a few good reviews http://www.tweaknews.net/reviews/zm600hp/. It’s expensive, over qualified for the job and supposedly really good and quiet. Sadly the rest of my machine is still rather noisy.

Lucene Custom Analyzer

A second neat trick I did with Lucene this week was to wrap the StandardAnalyzer with my own analyzer (see here for the other post on Lucene I did a few days ago).

The problem I was trying to address is very simple. I have a nice web service API for my search engine. The incoming query is handled by Lucene using the bundled QueryParser which has a quite nice and elaborate query language that covers most of my needs. However, a problem is that it uses the StandardAnalyzer on everything which means that all the terms in the query are being tokenized. For text this is a good thing. However, I also have fields in my index that are not text.

The Lucene solution to this is to use Untokenized fields in the index. Only problem, using untokenized fields in combination with the QueryParser is not recommended and tends to not work well since everything in the query is being tokenized. So, you should not use the QueryParser but programmatically construct your own Query. Nice but not what I want since it complicates my search API and I need to make complicated queries on the other end of it.

What I wanted is to match a url field against either the whole or part of the url (using wildcards). On top of that, I want to do that as part of a normal QueryParser query e.g. keyword: foo and link: “http\://example.com/foo”. I’ve been doing this the wrong way for a while and let Lucene tokenize the url. So http://example.com/foo becomes [http] [example.com] [foo] for Lucene. The StandardAnalyzer is actually quite smart about hostnames as you can see since otherwise it would treat the . as a token separator as well.

This was working reasonably well for me. However, this week I ran into a nice borderline case where my url ended in …./s.a. Tokenization happens on characters like . and /. On top of that, the StandardAnalyzer that I use with the QueryParser also filters out stopwords like a, the, etc. Normally this is good (with text at least). But in my case it meant the last a was dropped and my query was getting full matches against entries with a similar link ending in e.g. s.b. Not good.

Of course what I really wanted is to be able to use untokenized fields with the QueryParser. Instead what I did this week was create a tokenizer that for selected fields skips tokenization and treats the entire field content as a single token. I won’t put the code for that here but it is quite easy:

  • extend Analyzer
  • override tokenStream(String field, Reader r)
  • if field matches any of your special fields, return a custom TokenStream that returns the entire content of the Reader as a single Token, else just delegate to a StandardAnalyzer instance.

This is a great way to influence the tokenization and also enables a few more interesting hacks that I might explore later on.

Boosting Lucene search results using timestamps

Since I spent quite a bit of time looking into how to do this properly so here’s a solution to a little problem that has been nagging me today: how make lucene take into account timestamps when returning search results. I don’t want to sort the results (that’s easy) but instead when two results match a query and get the same score from lucene, I want to see the newest first.

Basically in lucene this means influencing how it ‘scores’ entries against a query. So far I have been relying on the lucene QueryParser that implements a nice little query language with some cool features. However, the above requirement cannot be expressed as a query in that language. At best you might work with date ranges but that is not quite what I need.

So I had to dive into lucene architecture a bit more and after lots of digging came up with the following code:

String query="foo" 
QueryParser parser =new QueryParser("name", new StandardAnalyzer());
Query q = parser.parse(query);
Sort updatedSort = new Sort();
FieldScoreQuery dateBooster = new FieldScoreQuery("timestampscore", FieldScoreQuery.Type.FLOAT);
CustomScoreQuery customQuery = new CustomScoreQuery(q, dateBooster);				
Hits results = getSearcher().search(customQuery, updatedSort);

The FieldScoreQuery is a recent addition to lucene. I had to upgrade from 2.1 to 2.3 to get it. Essentially what it does is interpret a field as a float and deriving a score from it. Then the CustomScoreQuery combines the score with the score from my original query.

So far it is working beautifully. I basically added a float field to my index which is basically “0.” + timestamp where timestamp is formatted as a yyyyMMddhhmm string (lucene only has string fields). Consequently, later timestamps have a slightly higher score. I might have to tune the query a bit further by either using a weight or by manipulating the float a bit further.

If any Lucene gurus stumble upon this and have some useful advice, please use the comments.

captcha

It seems the captcha plugin (capcc) I was using with wordpress has been broken for some time. Probably this happened when I installed wp 2.5 a few weeks ago. My friend Christian del Rosso pointed this out. I installed a different plugin now (yacaptcha) which both looks nicer and hopefully works better too.

So if you couldn’t comment because of this, try again.

Totaled my Porsche

As you might recall I bought a pair of Lacie Porsche 500GB disks some time ago. Well, a few weeks ago the power supply of one of them began making funny noises and it was getting harder to start up the Porsche. The disk worked fine with the power supply that came with the other drive so the problem was that the power supply was broken.

This happened just outside the warranty of the disk. So the fix was to order a new power supply from Lacie. I looked up how to do that on the website and discovered that these things cost around 35€ which is quite a lot to invest in a disk that now sells under 100€. Besides, I’m kind of annoyed with all those bricks below my desk anyway.

So I went for option number two: pry open the casing and put the disk in the PC. Thanks to this post, I figured out how to do this. Unfortunately, I did end up making a few scratches on the plastic. It is indeed quite tricky to do this properly. But I’m not going to use it anyway.

A quick look in the windows disk properties confirmed that my disk indeed was a sata disk (a samsung). So I took apart my PC, plugged the disk in and put everything together (after a short trip to the store for a second SATA cable). For some reason the first drive was plugged into the second SATA port so after boot, it couldn’t find the OS. A quick trip to the bios allowed me to swap which drive is treated as the first disk and I was back in business.

Let me know if you have a use for a empty Lacie Porsche case with some minor scratches and without a power supply (you can take the broken one of course). You should be able to put in any SATA disk. For 5€ or a (nice) beer it is yours.

Bad behavior

My previous post was a bit short because I was basically calling the taxi to the airport at the same time. The reason for this was “Bad Behavior“, which is a wordpress plugin that got a little overzealous and locked me out of my own site stating that I had just been blacklisted. Consequently I spent most of the half hour before I left figuring out what the hell was wrong and cursing those damn bad behavior idiots.

I didn’t manage to figure that out entirely but did figure out that the problem was not on my side of the connection at least (lucky me). Figuring that out was easy since I just used my phone’s 3G connection only to discover that that ip address also had been blacklisted. The chances of both my phone, PC and/or WELHO (cable) and Elisa (3G) providers being hacked and blacklisted correctly were pretty small. So bad behavior was misbehaving big time and software that misbehaves doesn’t really last that long with me.

Only problem: I could not actually log in to my site via the wordpress admin UI to fix it (blacklisted, doh!). Since all the advice out there on how to fix this seems to assume you do that, here’s how to kill bad behavior properly:

  1. Figure out your IP address
  2. Edit whitelist.inc.php in the bad behavior plugin directory and add your IP address to the array.
  3. Upload this file to your remote site.
  4. Login to your wordpress admin UI (should work now)
  5. Disable the plugin and delete it (unless you like pissing off randomly blacklisted users)
  6. in phpmyadmin or whatever you use to admin your wordpress db: DROP TABLE `wp_bad_behavior`;

BTW. In analyzing the behavior of the misbehaving plugin, I also discovered that it does nasty things like putting user passwords in a database for every login attempt, plain text. Bad behavior indeed, hence the drop table.

Managing wordpress deployment

This little article is a summary of how I currently manage my wordpress blog. The wordpress.org site lists some advice on how to manage a wordpress installation using subversion. However, I have a slightly more sophisticated setup that preserves my modifications (as long as they don’t conflict) that I maintain in a private branch of wordpress.

I use rsync to push and pull changes remotely (using ssh, ftp should work as well). Since a good howto seems to be lacking online and since I spent a while figuring out how to do all this, I decided to share my little setup.
Continue reading “Managing wordpress deployment”

Spam users

The latest in blog spam seems to be users signing up with obscure addresses like filesearch@o2.pl (feel free to send all sorts of crap that way) and not posting any comments. This was sort of a minor annoyance until now but over the past 24 hours I’ve had 5 new user registrations like this.

I have no idea why they do this. It might be that they are trying to gain access to my blog software for hacking/spamming purposes. Hint: my provider automatically stops attempts to abuse its servers for mass mailing attempts and I would find out pretty fast if some asshole was doing this. Whatever the reason, I delete accounts that don’t follow up with an actual comment and have installed a solution to prevent automated attempts in the future: capcc. This plugin allows wordpress to separate real people from bots trying to abuse my web infrastructure and it seems to work pretty nicely. I now have several plugins conspiring against malicious users and hope that is enough for some time.

For real users coming here, I’m sorry about all the obstacles in between you and actually providing comments on my blog. Unfortunately this is the only way for me to keep spam off my blog.

Update.
For victims of the same problem, here’s a useful query:

DELETE FROM wp_users WHERE wp_users.id NOT IN (SELECT DISTINCT user_id from wp_comments)

This will delete any user without any comments. Of course backup your database before running this and if it blows up in your face don’t come complaining here.

Update.

It just occurred to me (damn that’s a full 24 hours, WTF is wrong with me) that having capcc removes the need for me to require users to register before commenting.