Lucene Custom Analyzer

A second neat trick I did with Lucene this week was to wrap the StandardAnalyzer with my own analyzer (see here for the other post on Lucene I did a few days ago).

The problem I was trying to address is very simple. I have a nice web service API for my search engine. The incoming query is handled by Lucene using the bundled QueryParser, which has a quite nice and elaborate query language that covers most of my needs. The problem is that it uses the StandardAnalyzer on everything, which means that all the terms in the query are tokenized. For text this is a good thing; however, I also have fields in my index that are not text.

The Lucene solution to this is to use untokenized fields in the index. The only problem is that using untokenized fields in combination with the QueryParser is not recommended and tends not to work well, since everything in the query is tokenized. So you are supposed to skip the QueryParser and construct your own Query programmatically. Nice, but not what I want, since it complicates my search API and I need to make complicated queries on the other end of it.

What I wanted was to match a url field against either the whole url or part of it (using wildcards). On top of that, I want to do that as part of a normal QueryParser query, e.g. keyword: foo and link: "http\://example.com/foo". I've been doing this the wrong way for a while, letting Lucene tokenize the url. So http://example.com/foo becomes [http] [example.com] [foo] for Lucene. The StandardAnalyzer is actually quite smart about hostnames, as you can see, since otherwise it would treat the . as a token separator as well.

This was working reasonably well for me. However, this week I ran into a nice borderline case where my url ended in …./s.a. Tokenization happens on characters like . and /. On top of that, the StandardAnalyzer that I use with the QueryParser also filters out stopwords like a, the, etc. Normally this is good (with text at least). But in my case it meant the last a was dropped and my query was getting full matches against entries with a similar link ending in e.g. s.b. Not good.

Of course what I really wanted was to be able to use untokenized fields with the QueryParser. Instead, what I did this week was create an analyzer that, for selected fields, skips tokenization and treats the entire field content as a single token. I won't reproduce my exact code here, but it is quite easy (a rough sketch follows the list below):

  • extend Analyzer
  • override tokenStream(String field, Reader r)
  • if field matches any of your special fields, return a custom TokenStream that returns the entire content of the Reader as a single Token, else just delegate to a StandardAnalyzer instance.
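
For illustration, a minimal sketch of this against the Lucene 2.3 API could look something like the following (the SelectiveAnalyzer name and the set of untokenized field names are just placeholders for this example):

import java.io.IOException;
import java.io.Reader;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class SelectiveAnalyzer extends Analyzer {
    // fields whose content should be kept as a single token, e.g. "link"
    private final Set<String> untokenizedFields;
    private final Analyzer fallback = new StandardAnalyzer();

    public SelectiveAnalyzer(Set<String> untokenizedFields) {
        this.untokenizedFields = untokenizedFields;
    }

    public TokenStream tokenStream(String field, final Reader reader) {
        if (!untokenizedFields.contains(field)) {
            // normal text fields get the usual StandardAnalyzer treatment
            return fallback.tokenStream(field, reader);
        }
        // special fields: return the whole field content as a single token
        return new TokenStream() {
            private boolean done = false;

            public Token next() throws IOException {
                if (done) {
                    return null;
                }
                done = true;
                StringBuilder sb = new StringBuilder();
                char[] buf = new char[256];
                int n;
                while ((n = reader.read(buf)) != -1) {
                    sb.append(buf, 0, n);
                }
                return new Token(sb.toString(), 0, sb.length());
            }
        };
    }
}

The important bit is to use the same analyzer both when indexing and when constructing the QueryParser, so that a query like link:"http://example.com/foo" ends up as a single term query against the untokenized field.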

This is a great way to influence the tokenization and also enables a few more interesting hacks that I might explore later on.

WP-OpenID

I've been enthusiastic about openid for a while but have so far not managed to openid enable my site. WP-OpenID, the main openid plugin for wordpress, is under quite active development. Unfortunately, until recently, every version of it I tried had some issues that prevented me from using it.

The author, Will Norris, got hired by Vidoop the other day to continue working on wp-openid in the context of the diso project. Diso is another thing I'm pretty enthusiastic about. So, things are improving on the openid front.

Tonight, I managed to get version 2.1.9 of wp-openid to install without any issues on my wordpress 2.5.1 blog. I've been testing it and it seems to at least accept my openid www.jillesvangurp.com (delegated to myopenid).

So finally, my blog is openid enabled.

The delegation bit is, BTW, courtesy of another wordpress plugin: openid delegation. I've been using the 0.1 version for more than a year and it just works. Delegation is an openid concept where any website can delegate openid authentication to an external openid provider. This allows you to use a URL you own as your identity and also to switch providers without losing control of your openid url.
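
For those not familiar with how delegation works: it boils down to adding two link tags to the head of your own page that point at your provider, which is roughly the kind of markup the delegation plugin adds for you. Something along these lines (the myopenid URLs and "yourname" are just example values):

<link rel="openid.server" href="http://www.myopenid.com/server" />
<link rel="openid.delegate" href="http://yourname.myopenid.com/" />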

Boosting Lucene search results using timestamps

I spent quite a bit of time looking into how to do this properly, so here's a solution to a little problem that has been nagging me today: how to make Lucene take timestamps into account when returning search results. I don't want to sort the results (that's easy); instead, when two results match a query and get the same score from Lucene, I want to see the newest first.

Basically, in Lucene this means influencing how it 'scores' entries against a query. So far I have been relying on the Lucene QueryParser, which implements a nice little query language with some cool features. However, the above requirement cannot be expressed as a query in that language. At best you might work with date ranges, but that is not quite what I need.

So I had to dive into the Lucene architecture a bit more and, after lots of digging, came up with the following code:

String query = "foo";
QueryParser parser = new QueryParser("name", new StandardAnalyzer());
Query q = parser.parse(query);
Sort updatedSort = new Sort();
// interpret the "timestampscore" field as a float and turn it into a score
FieldScoreQuery dateBooster = new FieldScoreQuery("timestampscore", FieldScoreQuery.Type.FLOAT);
// combine the score of the original query with the field based score
CustomScoreQuery customQuery = new CustomScoreQuery(q, dateBooster);
Hits results = getSearcher().search(customQuery, updatedSort);

The FieldScoreQuery is a recent addition to Lucene; I had to upgrade from 2.1 to 2.3 to get it. Essentially, it interprets a field as a float and derives a score from it. The CustomScoreQuery then combines that score with the score from my original query.

So far it is working beautifully. I added a float field to my index whose value is basically "0." + timestamp, where the timestamp is formatted as a yyyyMMddhhmm string (Lucene only has string fields). Consequently, later timestamps get a slightly higher score. I might have to tune the query a bit further, either by using a weight or by manipulating the float a bit more.
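
For reference, the indexing side of this is just a matter of adding the extra field to each document. A rough sketch (doc is the Document being indexed and timestamp the entry's date; note that a 24-hour HH pattern keeps the ordering correct across a whole day, and the field should end up as a single untokenized term so that FieldScoreQuery can read it from the field cache):

// derive a float-parseable score from the entry's timestamp
String formatted = new SimpleDateFormat("yyyyMMddHHmm").format(timestamp);
// "0." + timestamp yields a value between 0 and 1; later timestamps score slightly higher
doc.add(new Field("timestampscore", "0." + formatted, Field.Store.NO, Field.Index.UN_TOKENIZED));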

If any Lucene gurus stumble upon this and have some useful advice, please use the comments.

Cartoons

I like to read cartoons. I'm a regular reader of userfriendly.org, dilbert, the wizard of id, fokke en sukke and a few others. I can't say that I'm a regular reader of Gregorius Nekschot's cartoons, which cover multiculturalism, islam, and other rather controversial topics. Good satire can hurt, and his cartoons definitely hit a nerve with some deeply religious individuals. His website is entitled "Gregorius Nekschot – Sick Jokes". Let's say Nekschot is very blunt and to the point. Anyway, check here for an example.

Anyway, two weeks ago Nekschot put a rather visionary (in retrospect) post on his web site where he jokingly suggested that soon Ernst Hirsch Ballin (the Dutch minister of justice) and his uniformed party members would be arresting free-spirited people and deporting them for reeducation (in Dutch). The reason was Hirsch Ballin's apparent plans to broaden the scope of existing legislation against blasphemy, and the analogy with Guantanamo that the post suggested was very much to the point in my view. The reference to 1940s members of Hirsch Ballin's party who cooperated with the Nazi occupation or looked the other way is also to the point, since Hirsch Ballin's motivation seems to display similar spinelessness and an apparent desire to follow up on rather intimidating threats and complaints coming from e.g. Iran and various islamist groups living in the Netherlands. Sort of the same groups of people that cheered Theo van Gogh's murder a few years ago; van Gogh was, by the way, apparently a friend of Nekschot.

It seems Nekschot's analysis was more accurate than he must have realized. Gregorius Nekschot was arrested last week on the orders of a maverick Dutch attorney who seems to be more or less operating under direct orders from Hirsch Ballin. Nekschot was locked up for two days, without trial, based on vague accusations regarding the generally shocking, insulting and discriminatory nature of some of his cartoons. This is sort of a new low. Having a cartoonist's house searched by a ten-strong police force and the victim subsequently taken off to prison is not something you'd expect in a modern, democratic country. It happened last week in the Netherlands. Full thumbs up from the responsible minister, apparently.

Nekschot is of course a pseudonym, referring to the "shot to the neck" execution style that very much characterizes his approach to humour. His real name has so far not been revealed. A court case will change that and expose him to very real threats to his life. After all, one of his friends was already murdered for speaking freely.

I usually don’t do politics on my blog but this is a good reason to make an exception. It seems Hirsch Ballin lacks a sense of humor and an appreciation of free speech and the Dutch constitution.

captcha

It seems the captcha plugin (capcc) I was using with wordpress has been broken for some time. Probably this happened when I installed wp 2.5 a few weeks ago. My friend Christian del Rosso pointed this out. I've now installed a different plugin (yacaptcha), which both looks nicer and hopefully works better too.

So if you couldn’t comment because of this, try again.

Friend Connect

Google announced their friend connect yesterday. It's part of a pretty broad, and in my view really smart, strategy that they have been rolling out bit by bit over the past few months. It all started with open social, which is their social network API that allows gadget creators to target any social network able to act as an open social container. By now this includes most relevant social networks except Facebook.

An issue is that open social is still a bit immature and also that compatibility between sites is not that great due to sites introducing all sorts of extensions and cherry picking features to implement, which of course leads to a great variety of circumstances to test for. However, it’s a huge improvement over having just the Facebook API (which is not that old either, or that good).

Then came google app engine, which is an ultra-scalable, hassle-free environment for creating and hosting simple web applications. Like, for example, open social gadgets. App engine is a very interesting achievement, at least from an architecture and scalability point of view. Whether it will work as advertised remains to be seen of course; it is too early to tell. Also, it comes with lots of technical restrictions that are not going to be popular with people who have investments in existing, incompatible code.

On the other hand, there's no way around the fact that most of these limitations are more or less required for the type of scalability that Google wants to provide. So, Google App Engine lowers the barrier of entry for small parties to launch their own open social gadgets or full websites. That's good for Google, because inevitably Google ends up being a really attractive advertising partner for people who choose to sell their soul like that and host their products on Google App Engine. And of course, Google gets to monitor site activity and track users, which is all very valuable data from an advertising point of view. And of course all those nice Google APIs are really easy to access from inside Google's own platform.

Now yesterday they added friend connect to the mix. Friend connect does several things. First of all, it turns simple web sites into open social containers. Secondly, it comes with a few widgets that add some value to this. The most important of these is what appears to be a social network interconnect that allows users to be authenticated against several popular social networks and openid, thus relieving the simple website of that task. Basically, visitors of a site can sign in with one or more social network credentials. Google handles all the interaction with the backend social networks, which includes things such as publishing site activity to your event feed, access to your friend lists on all associated sites, and that type of feature.

Soon loads of blogs and other websites will start featuring nifty open social gadgets. Think wordpress sidebar widgets on steroids (check out my frontpage to see a few in action). This will lead to a mass migration of activity from inside social networks to external websites.

I mentioned this was a very smart strategy by Google. What's going on here? Well, Google, unlike most companies relying on advertisement revenue, doesn't care which websites you visit as long as they feature Google ads or as long as they can somehow track what you are doing. Friend connect vastly increases their ability to do so. It's effectively as good as users visiting a Google owned site: you sign in, all sorts of complex javascript executes, AJAX calls to Google take place, etc. They might even start pushing ads this way, although I suspect that they are not that stupid (it would basically alienate a lot of website maintainers). More logical is that they continue to push ads separately and instead make it more attractive for existing adsense users to also deploy friend connect.

So, Google ads + friend connect is worth billions. It basically turns all connected websites into one huge social network with Google right at the center. Facebook can't really deliver this value, because inevitably users browse to domains other than facebook.com, and because their third-party website advertising market share is pretty much non-existent: all their revenue is generated inside their walled garden. The same goes for myspace.com, linkedin.com and most other social networks. Google doesn't really have this problem. Most of their ads are served up by third-party websites anyway, and more eyeballs for those means more money for them. Any way you get to see a Google ad is a good one as far as they are concerned.

Google also managed to do some interesting things here. Note that Facebook is featured on friend connect. Apparently Google is simply using the public facebook APIs just like any other site. It should be interesting to learn what's in it for Facebook (revenue sharing?). Facebook and MySpace are also launching their connect APIs this week, BTW. However, as noted above, they currently lack the advertising solutions to make it work, so it is debatable what the added value of that is going to be. It could be that they have to do some website owner alienation by pushing ads. This is something Google can afford not to do.

Additionally, Google is actually bridging several social networks. Your myspace buddies showing up right next to your facebook buddies is somewhat of a novelty for the web (and for the involved social networks). Google doesn't care where you park your friends, as long as you expose them via friend connect and interact with them on sites showing nice Google ads.

Very clever.

I have a few worries though. To me, friend connect sounds like a rather exclusive club and a huge control point. It achieves some of the goals of dataportability.org by basically introducing one big fat central control point. So it's as open as Google wants/needs it to be. For now they seem to be doing the right thing, and friend connect being an openid relying party is a great example. But long term, I wonder what will happen to the non-Google-connected web.

Update. It seems Facebook is blocking their, apparently involuntary, inclusion in Google's friend connect, citing terms of use designed to lock users into their platform. If you are not part of the solution, you are part of the problem. Or, as Despair.com paraphrases it: "If you're not a part of the solution, there's good money to be made in prolonging the problem." I guess they are afraid of the walls of their garden being torn down and of their estimated value deflating before they can capitalize on it. Rumor has it Steve Ballmer is sitting on a sack of unused money due to a certain deal blowing up in his face recently. And we all know he likes to throw what he sits on.

Java & Toys

After a few months of doing python development, which to me still feels like a straitjacket, I had some Java coding to do last week and promptly wasted a few hours checking out the latest toys:

  • Eclipse 3.4 M7
  • Hudson
  • Findbugs for Hudson

Eclipse 3.4 M7 is the first milestone I've tried for the upcoming Eclipse release, which is due to me not coding much Java lately; nothing wrong with it otherwise. Normally I'd probably have switched around M4 already (at least I did so for the 3.2 and 3.3 cycles). In fact it is a great improvement and several nice productivity enhancements are included. My favorite one is the problem hover that now includes links to quick fixes. So instead of point, click, typing ctrl+0, arrow down (1..*), enter, you can now lean back and point & hover + click. Brilliant. I promptly used it to convert some 1.4, non-generics-based code into nice generics-based code, simply by tackling all the generics-related warnings one by one, essentially only touching the keyboard to suggest a few type parameters Eclipse couldn't figure out. The "Introduce generic type parameter" and "Infer generics" refactorings are very helpful here. The code of course compiled and executed as expected. No bugs introduced, and the test suite still runs fine. Around 5000 lines of code refactored in under 20 minutes. I still have some work to do to remove some redundant casts and to replace some while/for loops with foreach.

Other nice features are the new breadcrumbs bar (brilliant!) and a new refactoring to create parameter classes for overly long lists of parameters on methods. Also nice is a refactoring to convert String concatenation into StringBuffer.append calls, although StringBuilder is slightly faster for cases where you don't need thread-safe code (i.e. most of the time). The rest is the usual amount of major and minor refinements that I care less about but are nice to have around anyway. One I imagine I might end up using a lot is the quick fixes to sort out osgi bundle dependencies. You might recall me complaining about this some time ago. Also be sure to read Peter Kriens' reply to this, btw. Bnd is indeed nice, but tools don't solve what is in my view a kludge. Both the new eclipse feature and bnd are workarounds for the problem that what OSGI is trying to do does not exist at (and is somewhat at odds with) the Java type level.

Anyway, the second thing I looked into was Hudson, a nice server for continuous integration. It can check out anything from a wide range of version control systems (subversion is supported out of the box, several others through plugins) and run any script you like. It also understands maven and knows how to launch ant. With the right plugins you can then let it do quite useful things like compiling, running static code analyzers, deploying to a staging server, running test suites, etc. Unlike some stuff I evaluated a few years ago, this actually worked right out of the box and was so easy to set up that I promptly did so for the project I'm working on. Together with the loads of plugins that add all sorts of cool functionality, you have just run out of excuses not to do continuous integration.

One of the plugins I've installed so far is an old favorite, Findbugs, which promptly drew my attention to two potentially dangerous bugs and a minor performance bug in my code, reminding me that running it and making sure it doesn't complain is actually quite important. Of all the code checkers, Findbugs provides the best mix of finding loads of stuff without being obnoxious about it (unlike e.g. checkstyle and pmd, which require a lot of configuration before they shut up about stupid stuff I don't care about) while actually finding stuff that needs fixing.

While Hudson is of course Java-centric, you can teach it other tricks as well. So, next on my agenda is creating a job for our python code and hooking that up to pylint and possibly our django unit tests. There are plugins around for both tasks.

OOo 3.0 Beta & cross references

It still looks butt ugly, but at least this bug was partially addressed in the latest beta release of Open Office. The opening date for this one is "Dec 19 19:13:00 +0000 2001". That's more than six years ago! This showstopper has prevented me from writing my thesis, any scientific articles, or in fact anything serious in Open Office, since writing such things requires proper cross-reference functionality. But finally, they implemented the simple feature of actually being able to refer to the paragraph number of something elsewhere in the document using an actual cross reference. This is useful for referring to numbered references, figures, tables, formulas, theorems, sections, etc.

The process for this bug went something like this: "you don't need cross references" (imagine a star wars type gesture here). Really, for a bunch of people implementing a word processor, the mere length of the period during which they maintained this point of view was shocking, and to me it has always been a strong indication that they might not be that well suited for the job of creating an actual word processor. Then they went into an infinite loop of "hmm maybe we can hack something for open office 1.1 2.0 2.1 2.2 2.3 2.4 3.0" and "we need to fix this because imported word documents are breaking over this" (never mind that real authors might need this for perfectly valid reasons). This went on for a very, very long time, and frankly I have long since stopped considering open office a serious alternative for doing my word processing.

I just tried it in the 3.0 beta and it actually works now, sort of. Testing new OOo releases for this has become somewhat of a ritual for me. For years, the first thing I did after downloading OOo was try to insert a few cross references before shaking my head and closing the window. The UI is still horribly unusable, but at least the feature is there now if you know where to look for it.

Six years ago FrameMaker was the only alternative that met my technical requirements of being an actual word processor with a UI and features that support the authoring process (unlike latex, which is a compiler), the ability to use cross references, and flexible but very strictly applied formatting. Theoretically word can do all of this as well, but I don't recommend it for reasons of bugginess and the surprising ease with which you can lose hours of work due to word automatically rearranging & moving things for you when you e.g. insert a picture or paste a table (and yes, I've seen documents corrupt themselves just by doing these things).

The last few years, I've used open office only to be able to open the odd word/powerpoint file dropping into my inbox at home. I basically have close to no office application needs here at home. For my writing needs at work, I usually adapt to what my coauthors use (i.e. word and sometimes latex). FrameMaker has basically been dying since Adobe bought it. The last version I used was 6.0 and the last occasion I used it was when writing my PhD thesis.

Ubuntu at work

After my many not so positive reviews, you might be surprised to learn that I'm actually using Ubuntu at work now. Last week, a recent Mac convert dumped his 'old' laptop on my desk, which happened to be a Lenovo T60 with a nice core duo processor, ATI graphics and 2 GB of memory. One of the reasons for the mac was that the thing kept crashing. This can be either a hardware or a software problem. I suspect the latter, but I'll have to see.

It so happens that my own windows desktop is increasingly less compatible with the linux based python development going on in the team I’m in. So even before taking the laptop, I was playing around with a vmware image to run some server stuff. My idea was to do the development on my desktop (using eclipse + pydev) and deploy on a vmware server with ubuntu and the right dependencies. Slow, but it should work, mostly.

So instead, last Friday I installed Ubuntu 7.10 (the only CD lying around) on the T60 and then upgraded it to 8.04 over the network. The "scanning the mirror" error I described earlier struck again. This time because of a corporate http proxy (gee, only the entire fortune 500 list probably uses one: either add proxy settings to the installer or don't attempt to use the network during installation). Solution: unplug the network cable and let it time out.

Display detection actually worked this time. Anyway, I was only installing 7.10 to upgrade it to 8.04. Due to the "scanning the mirror" error, the installer had conveniently commented out all apt repositories. Of course there's no GUI to fix that (except gedit). After fixing that and configuring the proxy in various places, I installed some 150MB worth of upgrades and then tried to convince the update manager to show me the upgrade-to-8.04 dialog that various websites assure users should show up. It refused to in my case. So back to the commandline. Having had nasty experiences upgrading debian from the commandline inside X, I opted to do this in a virtual terminal (ctrl+alt+f2). Not sure if this is still needed, but it can't hurt. Anyway, this took more than an hour. In retrospect, downloading and burning an 8.04 image would have been faster.

So far so good. The thing booted and everything seemed to work. Except the wireless lan was nowhere to be seen (known issue with the driver apparently, haven’t managed to fix this yet). Compiz actually works and looks pretty cool. I have sound. I have network (wired).

Almost works as advertised one might say.

Until I plugged the laptop into its docking station and connected that with a dvi cable to the 1600×1200 external screen. Basically, I'm still struggling with this one. Out of the box, it seems impossible to scale beyond the native laptop screen size. What should happen is that the external screen either acts as a second screen or replaces the laptop screen at a much better resolution. Neither of these happens.

I finally edited xorg.conf to partially fix the resolution issue by adding 1600×1200 as an option. Only problem: compiz (the 3d accelerated GUI layer) doesn’t like this. I can only use this resolution with compiz disabled. If I enable it, basically it adds a black bar to the right and below. I wasted quite a bit of time trying to find a solution, so far without luck although I did manage to dig up a few links to compiz/ubuntu bugs (e.g. here) and forum posts suggesting I’m not alone. This seems to be mostly a combination of compiz immaturity and x.org autodetection having some cases where it just doesn’t work. With my home setup it didn’t get this far.
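
For my own future reference, the xorg.conf change amounts to adding the resolution to the Modes line of the Display subsection in /etc/X11/xorg.conf. Roughly like this (the identifiers and the list of fallback modes vary per setup; this is just a sketch):

Section "Screen"
    Identifier "Default Screen"
    SubSection "Display"
        Modes "1600x1200" "1280x1024" "1024x768"
    EndSubSection
EndSection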

My final gripe concerns the number of pixels Ubuntu/Gnome wastes. I ran into this when running eclipse and noticing that, compared to windows, it includes a lot of white space and ugly fonts that seem to use a lot of space. Screen real estate really matters with eclipse due to the enormous amount of information the GUI is trying to present. Check here for some tips on how to fix eclipse. This issue was emphasized even more when I tried to access my 1400×1050 windows laptop using Ubuntu's remote desktop vnc client and the realvnc server running on windows. Whoever designed the UI for that decided in all their wisdom to show the vnc session in an application window with a huge & useless toolbar, with a tab bar below that (!), and within that a single tab for my windows session. Add the Ubuntu menubar + task bar and there is no way that it can show a 1400×1050 desktop on a 1600×1200 screen without scrollbars (i.e. altogether I lose around 250-300 pixels of screen real estate). Pretty damn sad piece of UI design if you ask me. Luckily it has a full screen mode.

In case you are wondering why I bother to document this: the links are a great time saver next time I need to do this. Overall, despite all the hardware issues, I think I can agree with Mark Shuttleworth that under controlled hardware circumstances this is a pretty good OS. Windows 95 wasn't ideal either, and I managed to live with that for several years.

GX WebManager

Before joining Nokia, I worked for a small web startup in the Netherlands called <GX> Creative Online Development during 2004 and 2005. When I started there, I was employee number forty-something (I like to think it was 42, but I'm not sure anymore). When I left, they had grown to close to a hundred employees, and judging from what I've heard since, they've continued to grow roughly following Moore's law in terms of number of employees. They also seem to have executed the strategy that took shape while I was still their release manager.

When I joined GX, GX WebManager was a pretty advanced in-house-developed CMS that had already gone through several years of field use and evolution and enjoyed a rapidly growing number of deployments, including many big-name Dutch institutions such as KPN, Ajax, ABN AMRO, etc. At that time it was very much an in-house thing that nobody from outside the company ever touched, except through the provided UI of course, which was fully AJAX-based before the term became fashionable. By the time I left, we had upgraded the release process to push out regular technology releases, first internally and later also to a growing number of partners that implemented GX WebManager for their customers.

I regularly check the GX website to see what they have been up to and recently noticed that they pushed out a community edition of GX WebManager. They've spent the last few years rearchitecting what was already a pretty cool CMS to begin with, refitting it with a standardized content repository (JSR 170) based on Apache Jackrabbit and an OSGi container based on Apache Felix. This architecture has been designed to allow easy creation of extensions by third parties. Martijn van Berkum and Arthur Meyer (product manager and lead architect) were already musing about how to do this while I was still there and had gotten pretty far with initial designs and prototyping. Last year they pushed out GX WebManager 9.0, based on the new architecture, to their partners, and now 9.4 to the internet community. They seem to have pretty big ambitions to grow internationally, and in my experience they have the technology and know-how to do it.

So congratulations to them on completing this. If you are in the market for a CMS, go check out their products and portfolio.