Maven: the way forward

A somewhat longer post today. My previous blog post set me off thinking about a couple of things I have pondered before, and they fit together nicely into a potential way forward. In this previous post and also this post, I spent a lot of words criticizing Maven. People would be right to criticize me for blaming Maven. However, that would be the wrong way to take my criticism. There’s nothing wrong with Maven; it just annoys the hell out of me that it is needed and that I need to spend so much time waiting for it. In my view, Maven is a symptom of a much bigger underlying problem: the Java server side world (or rather the entire solution space for pretty much all forms of development) is bloated with tools, frameworks, application servers, and other stuff designed to patch small problems in each other. Together, they sort of work but it isn’t pretty. What if we wiped all of that away, very much like the Sun people did when they designed Java 20 years ago? What would be different? What would be the same? I cannot of course see this topic separately from my previous career as a software engineering researcher. In my view, a lot of developments from the past 20 years are now converging and morphing into something that could radically improve on the existing state of the art. However, I’m not aware of any specific project taking on this issue in full, even though a lot of people are working on parts of the solution. What follows is essentially my thoughts on a lot of topics centered around taking Java (the platform, not necessarily the language) as a base level and exploring how I would like to see the platform morph into something worthy of the past 40 years of research and practice.

Architecture

Let’s start with the architecture level. Java packages were a mistake, which is now widely acknowledged. .NET namespaces are arguably better, and OSGi bundles with explicit required and provided APIs as well as API versioning are better still. To scale software into the cloud where it must coexist with other software, including different (or identical) versions of itself, we need to get a grip on architecture.

The subject has been studied extensively (see here for a nice survey of some description languages) and I see OSGi as the most successful implementation to date that preserves important features that most other development platforms currently lack, omit, or half improvise. The main issue with OSGi is that it layers stuff on top of Java but is not really a part of it. Hence you end up with a mix of manifest files that go into jar files; annotations that go into your source code; and cruft in the form of framework extensions to hook everything up, complete with duplicate functionality for logging, publish/subscribe patterns, and even web service frameworks. The OSGi people are moving towards a more declarative approach. Take this to its ultimate conclusion and you end up with language level support for basically all that OSGi is trying to do: explicit provided and required APIs, API versioning, events, dynamic loading/unloading, and isolation.

A nice feature of Java that OSGi relies on is the class loader. When used properly, it allows you to create a class loader, let it load classes, execute the functionality, and then discard the class loader and everything it loaded, which is then garbage collected. This is nice both for dynamic loading and unloading of functionality and for isolating functionality (for security and stability reasons). OSGi heavily depends on this feature and many application servers try to use it. However, the mechanisms used are not exactly bulletproof and there are enormous problems with, for example, memory leaks, which cause engineers to be very conservative about relying on these mechanisms in a live environment.

More recently, people have started to use dependency injection, where the need for something is expressed in the code (e.g. with an annotation) or externally (in some configuration file). Then at run time a dependency injection container tries to fulfill the dependencies by creating the right objects and injecting them. Dependency injection improves testability and modularization enormously.
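As a rough illustration of the idea, here is a minimal sketch of annotation-driven field injection in plain Java. The @Needs annotation, TinyContainer, Clock, FixedClock, and ReportService are all hypothetical names made up for this example; this is not the API of Spring, Guice, or any other real container, which add scopes, constructor injection, life cycle handling, and much more.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical marker: "this field is a dependency the container must supply". */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Needs {}

/** The required API. */
interface Clock {
    long now();
}

/** One possible provider of that API; a test could bind a different one. */
class FixedClock implements Clock {
    public long now() { return 42L; }
}

/** The component only states what it needs, not where it comes from. */
class ReportService {
    @Needs Clock clock;

    String describe() { return "time is " + clock.now(); }
}

/** A deliberately tiny container: a map from API type to instance. */
class TinyContainer {
    private final Map<Class<?>, Object> bindings = new HashMap<>();

    <T> void bind(Class<T> api, T implementation) {
        bindings.put(api, implementation);
    }

    /** Creates an instance and fills every @Needs field from the bindings. */
    <T> T create(Class<T> type) {
        try {
            T instance = type.getDeclaredConstructor().newInstance();
            for (Field field : type.getDeclaredFields()) {
                if (!field.isAnnotationPresent(Needs.class)) continue;
                Object dependency = bindings.get(field.getType());
                if (dependency == null) {
                    throw new IllegalStateException("No binding for " + field.getType().getName());
                }
                field.setAccessible(true);
                field.set(instance, dependency);
            }
            return instance;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not create " + type.getName(), e);
        }
    }
}

class DiDemo {
    public static void main(String[] args) {
        TinyContainer container = new TinyContainer();
        container.bind(Clock.class, new FixedClock());
        System.out.println(container.create(ReportService.class).describe());
    }
}
```

The testability gain comes from the bind call: swapping FixedClock for another Clock changes what ReportService sees without touching ReportService itself.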

A feature in Maven that people seem to like is its way of dealing with dependencies. You express what you need in the pom file and Maven fetches the needed stuff from a repository. The Maven, OSGi, and Spring combo is about to happen. When it does, you’ll be specifying dependencies in four different places: Java imports, annotations, the pom file, and the OSGi manifest. But still, I think the combined feature set is worth having.

Language

Twenty years ago, Java was a pretty minimalistic language that took basically the best of the preceding 20 years of OO languages and kept a useful subset. Inevitably, lots got discarded or not considered at all. Some mistakes were made, and over time the language absorbed some less than perfect versions of the stuff that didn’t make it. So, Java has no language support for properties; this was sort of added on by the setter/getter convention introduced in JavaBeans. It has inner classes instead of closures and lambda functions. It has no pure generics (parametrizable types) but some complicated syntactic sugar that gets compiled to non generic code. The initial concurrent programming concepts in the language were complex, broken, and dangerous to use. Subsequent versions tweaked the semantics and added some useful things like the java.util.concurrent package. The language is overly verbose and 20 years after the fact there is now quite a bit of competition from languages that basically don’t suffer from all this. The good news is that most of those have implementations on top of the JVM. Let’s not let this degenerate into a language war, but clearly the language needs a proper upgrade. IMHO Scala could be a good direction, but it too already has some compromises embedded and lacks support for the architectural features discussed above. Message passing and functional programming concepts are now seen as important features for scalability. These are tedious at best in Java, whereas Scala supports them well while simultaneously providing a much more concise syntax. Let’s just say a replacement of the Java language is overdue. But on the other hand it would be wrong to pick any language as the language. Both .NET and the JVM are routinely used as generic runtimes for all sorts of languages. There’s also the LLVM project, which is a compiler tool chain that includes dynamic compilation in a VM as an option for basically anything GCC can compile.

Artifacts should be transient

So we now have a hypothetical language, with support for all of the above. Let’s not linger on the details and move on to deployment and run time. Basically the word compile comes from the early days of computing when people had to punch holes into cards and then compile those into stacks and hand feed them to big, noisy machines. In other words, compilation is a tedious and necessary evil. Java popularized the notion of just in time compilation and partial, dynamic compilation. The main difference here is that just in time compilation merely moves the compilation step to the moment the class is loaded, whereas dynamic compilation goes a few steps further and takes into account run-time context to decide if and how to compile. IDEs tend to compile on the fly while you edit. So why bother with compilation after you finish editing and before you need to load your classes? There is no real technical reason to compile ahead of time beyond saving a minor one time effort that might affect start up. You might want the option to do this, but it should not be the default.

So, for most applications, the notion of generating binary artifacts before they are needed is redundant. If nothing needs to be generated, nothing needs to be copied/moved either. This is true for both compiled and interpreted languages. A modern Java system basically uses some binary intermediate format that is generated before run-time. That too is redundant. If you have dynamic compilation, you can just take the source code and execute it (while generating any needed artifacts for that on the fly). You can still do in-IDE compilation for validation and static analysis purposes. The distinction between interpreted and statically compiled languages has become outdated and, as scripting languages show, not having to juggle binary artifacts simplifies life quite a bit. In other words, development artifacts (other than the source code) are transient, and with the transformation from code to running code automated and happening at run time, they should no longer be a consideration.
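To make "just take the source code and execute it" a bit more concrete, here is a minimal Java sketch, assuming a JDK (not just a JRE) is available at run time and Java 11 or later. It uses the standard javax.tools compiler API and a throwaway URLClassLoader; RunFromSource, run, and the Greeter source string are names invented for this illustration. Note that class files still get written, but only into a temporary directory that nobody has to name, copy, or package (the sketch does not bother to delete it).

```java
import java.io.IOException;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

class RunFromSource {

    /** Compiles the source on demand, calls a public static no-arg method, returns its result. */
    static Object run(String className, String source, String methodName) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No compiler found; this sketch assumes a full JDK");
        }
        try {
            // The only "artifacts" live in a temp directory nobody manages by hand.
            Path dir = Files.createTempDirectory("transient");
            Path file = dir.resolve(className + ".java");
            Files.writeString(file, source);

            int status = compiler.run(null, null, null, "-d", dir.toString(), file.toString());
            if (status != 0) {
                throw new IllegalArgumentException("Compilation failed for " + className);
            }

            // Load, execute, and close the loader again once we are done with it.
            try (URLClassLoader loader = new URLClassLoader(new URL[] {dir.toUri().toURL()})) {
                Class<?> type = loader.loadClass(className);
                Method method = type.getMethod(methodName);
                return method.invoke(null);
            }
        } catch (IOException | ReflectiveOperationException e) {
            throw new IllegalStateException("Could not run " + className, e);
        }
    }

    public static void main(String[] args) {
        String source =
            "public class Greeter { public static String greet() { return \"hello from source\"; } }";
        System.out.println(run("Greeter", source, "greet"));
    }
}
```

The class loader is created, used, and closed inside run, which is the same load, execute, discard cycle described in the architecture section.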

That means no more build tools.

Without the need to transform artifacts ahead of run-time, the need for tools doing and orchestrating this also changes. Much of what Maven does is basically generating, copying, packaging, and gathering artifacts. An artifact in Maven is just a euphemism for a file. Doing this is actually pretty stupid work. With all of those artifacts redundant, why keep Maven around at all? The answer to that is of course testing and continuous integration as well as application life cycle management and other good practices (like generating documentation). Except, lots of other tools are involved with that as well. Your IDE is where you’d ideally review problems and issues. Something like Hudson playing together with your version management tooling is where you’d expect continuous integration to take place, and application life cycle management is something that is part of your deployment environment. Architectural features of the language and run-time, combined with good built in application and component life cycle management, remove much of the need for external tooling to support all this and improve interoperability.

Source files need to go as well

VisualAge and Smalltalk pioneered the notion of non file based program storage where you modify the artifacts in some kind of DB. Intentional programming (IP) research is basically about the notion that programs are essentially just interpretations of more abstract things that get transformed (just in time) to executable code or into different views (editable in some cases). Martin Fowler has long been advocating IP and what he refers to as the language workbench. In a nutshell, if you stop thinking of development as editing a text file and start thinking of it as manipulating abstract syntax trees with a variety of tools (e.g. rename refactoring), you sort of get what IP and language workbenches are about. Incidentally, concepts such as APIs, API versions, and provided and required interfaces are quite easily implemented in a language workbench like environment.
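A toy version of "manipulating abstract syntax trees" might look like the following Java sketch (records require Java 16 or later). Node, Name, and Call are hypothetical types invented for this example, and the rename is deliberately naive: it ignores scoping entirely, which a real refactoring tool or language workbench would of course have to handle.

```java
import java.util.List;
import java.util.stream.Collectors;

/** A tree node that can be transformed and turned into one possible textual view. */
interface Node {
    Node rename(String from, String to);
    String render();
}

/** A reference to a named thing, e.g. a variable. */
record Name(String id) implements Node {
    public Node rename(String from, String to) {
        return id.equals(from) ? new Name(to) : this;
    }
    public String render() {
        return id;
    }
}

/** A function call with argument subtrees. */
record Call(String function, List<Node> args) implements Node {
    public Node rename(String from, String to) {
        String renamedFunction = function.equals(from) ? to : function;
        List<Node> renamedArgs = args.stream()
            .map(arg -> arg.rename(from, to))
            .collect(Collectors.toList());
        return new Call(renamedFunction, renamedArgs);
    }
    public String render() {
        String rendered = args.stream().map(Node::render).collect(Collectors.joining(", "));
        return function + "(" + rendered + ")";
    }
}

class RenameDemo {
    public static void main(String[] args) {
        Node tree = new Call("print",
            List.of(new Name("total"), new Call("sum", List.of(new Name("total")))));
        // The rename produces a new tree; text is only generated when a view is needed.
        System.out.println(tree.rename("total", "grandTotal").render());
    }
}
```

The point of the sketch is that the rename never touches text: the tree is the program, and render is just one view of it.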

Storage, versioning, access control, collaborative editing, etc.

Once you stop thinking in terms of files, you can start thinking about other useful features (beyond tree transformations), like versioning or collaborative editing. There have been some recent advances in software engineering that I see as key enablers here. Number one is that version management systems are becoming decentralized, replicated databases. You don’t check out from git; you clone the repository and push back any changes you make. What if your IDE were working straight into your (cloned) repository? Then deployment becomes just a controlled sequence of replicating your local changes somewhere else (either push based, pull based, or a combination of the two). A problem with this is of course that version management systems are still about manipulating text files. So they sort of require you to serialize your rich syntax trees to text, and you need tools to deserialize them in your IDE again. So, text files are just another artifact that needs to be discarded.

This brings me to another recent advance: CouchDB. CouchDB is one of the non relational databases currently receiving lots of (well deserved) attention. It doesn’t store tables, it stores structured documents. Trees, in other words. Just what we need. It has some nice properties built in, one of which is replication. It’s built from the ground up to replicate all over the globe. The grand vision behind CouchDB is a cloud of all sorts of data where stuff just replicates to the place it is needed. To accomplish this, it builds on REST, MapReduce, and a couple of other cool technologies. The point is, CouchDB already implements most of what we need. Building a git like revision control system for versioning arbitrary trees or collections of trees on top can’t be that challenging.

Imagine the following sequence of events. Developer A modifies his program. Developer B, working on the same part of the software, sees the changes (in real time of course) and adds some more. Once both are happy they mark the associated task as done. Somewhere on the other side of the planet a test server locally replicates the changes related to the task and finds everything is OK. Eventually the change and other changes are tagged off as a new stable release. A user accesses the application on his phone and at the first opportunity (i.e. when connected), the changes are replicated to his local database. End to end, the words file and artifact appear nowhere. Also note that the bare minimum of data is transmitted: this is as efficient as it is ever going to get.
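The scenario above can be sketched with a toy, in-memory revision store, written here in Java. To be clear about assumptions: TreeStore, commit, and replicateTo are names invented for this example and have nothing to do with the actual CouchDB or git APIs. Each revision is a serialized tree plus the id of its parent revision, keyed by a SHA-256 hash of its content, and replication only copies revisions the target does not have yet.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

class TreeStore {
    // Revision id (content hash) -> "parent:<id>\n<serialized tree>".
    private final Map<String, String> revisions = new HashMap<>();

    /** Records a new revision of a tree on top of parentId (null for the first one). */
    String commit(String parentId, String serializedTree) {
        String payload = "parent:" + (parentId == null ? "" : parentId) + "\n" + serializedTree;
        String id = sha256(payload);
        revisions.put(id, payload);
        return id;
    }

    /** Returns the serialized tree stored under the given revision id. */
    String tree(String id) {
        String payload = revisions.get(id);
        return payload.substring(payload.indexOf('\n') + 1);
    }

    /** Returns the parent revision id, or null for a root revision. */
    String parent(String id) {
        String payload = revisions.get(id);
        String parentId = payload.substring("parent:".length(), payload.indexOf('\n'));
        return parentId.isEmpty() ? null : parentId;
    }

    /** Copies only the revisions the target lacks; returns how many were transmitted. */
    int replicateTo(TreeStore target) {
        int copied = 0;
        for (Map.Entry<String, String> entry : revisions.entrySet()) {
            if (target.revisions.putIfAbsent(entry.getKey(), entry.getValue()) == null) {
                copied++;
            }
        }
        return copied;
    }

    private static String sha256(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }
}

class ReplicationDemo {
    public static void main(String[] args) {
        TreeStore developer = new TreeStore();
        TreeStore phone = new TreeStore();
        String first = developer.commit(null, "{\"call\":\"print\",\"args\":[\"total\"]}");
        developer.commit(first, "{\"call\":\"print\",\"args\":[\"grandTotal\"]}");
        System.out.println("replicated: " + developer.replicateTo(phone));
        System.out.println("replicated again: " + developer.replicateTo(phone));
    }
}
```

Because ids are derived from content, a second replication run finds nothing new to send, which is the "bare minimum of data" property in miniature. Conflict handling, access control, and real time collaboration are all left out.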

Conclusions

Anyway, just some reflections on where we are and where we need to go. Java did a lot of pioneering work in a lot of different domains but it is time to move on from the way our grandfathers operated computers (well, mine won’t touch a computer if he can avoid it but that’s a different story). Most people selling silver bullets in the form of Maven, Ruby, continuous integration, etc. are stuck in the current thinking. These are great tools but only in the context of what I see as a deeply flawed end to end system. A lot of additional cruft is under construction to support the latest cloud computing trends (which are essentially about managing a lot of files in a distributed environment). My point here is that taking a step back and rethinking things end to end might be worth the trouble. We’re so close to radically changing the way developers work here. Remove files and source code from the equation and what is left for Maven to do? The only right answer here is nothing.

Why do I think this needs to happen? Well, developers are currently wasting enormous amounts of time on what are essentially redundant things rather than developing software. The last few weeks were pretty bad for me: I was just handling deployment and build configuration stuff. Tedious, slow, and Maven is part of this problem.

Update 26 October 2009

Just around the time I was writing this, some people decided to come up with Play, a framework + server inspired by Python’s Django that preserves a couple of cool features. The best feature: no application server restarts required, just hit F5. Works for Java source changes as well. Clearly, I’m not alone in viewing the Java server side world as old and bloated. Obviously it lacks a bit in functionality. But that’s easily fixed. I wonder how this combines with a decent dependency injection framework. My guess is not well, because dependency injection frameworks require a context (i.e. state) to be maintained and Play is designed to be stateless (like Django). Basically, each save potentially invalidates the context, requiring a full reload of that as well (i.e. a server restart). Seems the Play guys have identified the pain point in Java: server side state comes at a price.