Git presentation

A few months ago, I posted a rather lengthy article about the use of git in combination with git-svn. I concluded that article with a vague promise to provide an update at some point when we had more experience with using git and git-svn.

We’re now a few months into adopting git as a front-end for our centralized svn setup, which, as I discussed in my previous article, is not something we can easily replace in the near term. I wouldn’t say it has been a smooth ride, but we are getting there. Several of my team members had no git experience whatsoever, had no idea what it was or why it mattered, and it took a lot of patience to gently nudge them towards using git. We are well on the way now, though, and people are starting to see the benefits of adopting a new way of working.

Mostly, what I wrote a few months ago is still valid. However, what we do in my team is actually slightly different. At Nokia, we have a really nice GitHub-like facility where people can create projects and repositories (of various types, including git). It also has forks and GitHub-style pull requests.

So what we ended up doing is this: I git-svn cloned our svn trunk with its full history (17000 commits spanning four years) but without the branches and tags. I then pushed it to a remote git repository, which is read-only for everyone except me. Then I cloned that repository so I can work in a pure git environment. I typically work on branches and push them to the remote, fetch them into my git-svn repository, and only there integrate them on master just before I dcommit them. Other team members simply fork my repository, clone their fork, hook it up to svn like I described recently, and can be ready to start working within a few minutes (downloading 600 MB of git repo just takes a bit of time).
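
For anyone who wants to replicate this setup, the shell commands below are a rough sketch of the flow described above; the URLs, directory names, and branch names are placeholders rather than our actual setup.

    # one-time setup: mirror svn trunk (full history, no branches or tags) into a git remote
    git svn clone https://svn.example.com/repo/trunk svn-bridge
    cd svn-bridge
    git remote add mirror git@git.example.com:team/project.git
    git push mirror master

    # day-to-day work happens in a plain git clone of that mirror
    git clone git@git.example.com:team/project.git work
    cd work
    git checkout -b my-feature
    # ...hack, commit...
    git push origin my-feature           # publish the branch on the mirror

    # when the feature is ready, pull it into the git-svn clone and send it to svn
    cd ../svn-bridge
    git fetch mirror
    git checkout master
    git svn rebase                       # make sure master matches the latest trunk
    git checkout -b my-feature mirror/my-feature
    git rebase master                    # replay the feature commits on top of trunk
    git checkout master
    git merge --ff-only my-feature       # fast-forward merge keeps the history linear
    git svn dcommit                      # push the commits to svn trunk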

We’re using various mechanisms to move commits around, including git format-patch & am, cherry-pick, and fast-forward merges. Given that several members of our team are new to git, I decided that learning to walk before we run is a good thing. So, I’m trying to keep things simple. Format-patch and am are easy to explain and use. Additionally, they have the least chance of ending in tears. People seem to have trouble wrapping their heads around rebasing and merging and the damage you can do with those commands.
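
As an illustration (not our exact commands), moving the last couple of commits from one clone to another with format-patch and am looks roughly like this:

    # in the clone that has the commits: write the last two commits out as patch files
    git format-patch -2 -o /tmp/patches

    # in the receiving clone: apply them, preserving author, date, and commit message
    git am /tmp/patches/*.patch
    # if something doesn't apply cleanly, "git am --abort" backs out of the whole series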

Yesterday, I presented some of this to several colleagues and I thought I’d share the slides. I’ve borrowed several pictures from various blogs and provided links where appropriate. In general, just the process of creating this presentation forced me to read up on a lot of things and actually improved my understanding of what git is. Also, the git community is awesome. There is a wide variety of tutorials, insightful blog posts, presentation material, etc. out there. Git is probably one of the best-documented version control systems at this point.

In the slides I provide a brief overview of what git is and why it matters before diving into workflows. Of course, the slides don’t show the stuff I demoed on the command line.

The single best tip I can give people is to read the last chapter of Pro Git, which describes the internals and plumbing of git. Understanding what git does under the hood explains a lot about how things work on the porcelain side of git (i.e. the user-facing commands you normally use). And it is actually a very elegant and straightforward design as well.

git-svn flow

In recent years, git has rapidly been adopted across the software industry. Initially it was a toy used mainly in open source projects, but it is now finding its way into the rest of the software development world. At this point, git is emerging as the tool of choice for version control for a large group of developers.

A reason for git’s popularity is that it enables a lot of new and different workflows that allow development teams to change the way they work and deliver better software faster.

In this blog post I present some ideas for implementing a git workflow that allows for something other than the centralized workflow common to projects with a central subversion repository, like the project I am currently working on, while still keeping that central subversion repository.

Continue reading “git-svn flow”

Using Git and feature branches effectively

A bit of a debate has been raging over the last few weeks, started when somebody commented on a few things Martin Fowler wrote about git and the use of feature branches and feature toggles. Martin Fowler makes some very good points about how feature branches can cause a lot of testing and integration headaches. A lot of people seem to have picked up on this, and a bit of an anti-feature-branch movement is emerging.

Continue reading “Using Git and feature branches effectively”

Git svn voodoo

As discussed earlier on this site, I recently started using git in our svn-centric organization. Since I’m trying to convince some co-workers to do the same, I would like to share a bit of git voodoo for working with multiple git repositories and a central svn repository. Most git tutorials don’t really show how to do this, even though it is quite easy. The approach below gives you all the flexibility you need with git while allowing you to interoperate seamlessly with your svn-using colleagues.

Continue reading “Git svn voodoo”

Some reflections on a future of software engineering

Update. I accidentally published an early draft of this long (for a blog post) article by clicking the wrong button in WordPress. If you read that early draft: the article is now in its finished form; I’ve edited the earlier text and added loads of links.

Over the past year I have been a bit less active blogging and posting in forums. I make an exception for things related to build processes, distributed version management, and Java, and you may find a couple of fine rants on my website and across various blogs on those topics.

This post is intended to pull all of that together into a somewhat grander vision and to reflect a bit on what a post-revolution software engineering world might look like. I don’t like or believe much in revolutions, but I do recognize that getting from where we are to where I think we need to be amounts to one. The more likely outcome is an evolutionary development in which the bits and pieces where I was not completely and utterly wrong might actually happen. At some point. Maybe.

Continue reading “Some reflections on a future of software engineering”

Re: bear shaving

I was going to submit the stuff below in shortened form as a comment on this fun little blog post on “bear shaving”, but it sort of grew into a full-blown article, again. To summarize the original article: there’s a nice analogy of shaving bears to help them cope with global warming, and how that doesn’t really address the core issue (not to mention being dangerous to attempt). The analogy is applied to integration builds and people patching things up. Then the author sort of goes off and comes up with a few arguments against git and decentralization.

While some of the criticism is valid, this of course ticked me off 🙂

I see Git as a way to increase the amount of change and to deal more effectively with people working in parallel. Yes, this puts a strain on integrating the resulting changes. But less change is the equivalent of bear shaving here. Change is good. Change is productivity. You want more productivity, not less. You want to move forward as fast as you possibly can. Integration builds breaking are a symptom of a larger problem. Bear shaving would be doing everything you can to make the integration builds work again, including forcing people to sit on their hands. The typical reflex to a crisis like this in the software industry is less change, complete with process to ensure that people do less. This is how waterfall was born. Iterative or spiral development is about the same thing, just done more frequently and in shorter cycles. This was generally seen as an improvement. But you are still going to sit on your hands for prolonged periods of time. The real deal these days is continuous deployment, and you can’t do that if you are sitting on your hands.

Breaking integration builds have a cause: the people making the changes are piling mistake on mistake and keep bear shaving (I love the metaphor) the problem because they are under pressure to release and deliver functionality. All a faster pace of development does is make this more obvious. Along with the increased amount of change per unit of time comes an increased number of mistakes per unit of time. Every quick fix and every misguided commit makes the system as a whole a little less stable. That’s why the waterfall model includes a feature freeze (a.k.a. integration) where no changes are allowed, because the system would never get finished otherwise.

A long time ago I wrote an article about design erosion. It was one of the cornerstones of my PhD thesis (check my publication page if you are interested). In a nutshell: changes are cumulative, and we take design decisions in the context of our expectations of the future. The only problem: nobody can predict the future accurately, so there will be mistakes from time to time. It is inevitable that you will get it wrong sometimes and won’t realize it right away. You can’t just rip out a single change you made months or years ago without affecting the subsequent changes that depend on it. In other words, change is cumulative: rip one piece out and the whole sand castle collapses. Some decisions will be wrong or will have to be reconsidered at some point, and because changes are interdependent, fixing design erosion can be painful and expensive. Consequently, all software designs erode over time, since such fixes tend to be delayed until the last possible moment. Design erosion is a serious problem. You can’t fix a badly eroded system that you have had for years overnight. Failing to address design erosion in time can actually kill your company, product, or project. But you can delay the inevitable by dealing with problems closer to where they originate instead of dealing with them later. Dealing with a problem close to where it originates means fewer subsequent changes are affected, which minimizes the cost of fixing it. Breaking integration builds are a symptom of an eroding design. Delaying the fix makes it worse.

So, the solution is to refactor and rethink the broken parts of the system to be more robust, easier to test, more flexible to meet the requirements, etc. Easier said than done, of course. However, Git is a revolutionary enabler here: you can do the more disruptive stuff on a git branch and merge it back in when it is ready, instead of when you go home and break the nightly build. This way you can do big changes without destabilizing your source tree. Of course you want continuous integration on your branches too. That way, you will push fewer mistakes between branches, solving problems closer to their origin and without the branches affecting each other. You will still have breaking builds, but they will be cheaper to fix. Decentralization is the solution here, not the problem, as is suggested in the blog post I linked above!

Here’s why decentralization works: testing effort grows much faster than linearly with the amount of change. Double the amount of change and you roughly quadruple the testing effort. So don’t do that; keep the testing effort low. In a centralized world you do this through a feature freeze. By stopping all change, you can actually find all the problems you introduced. In a decentralized world you do this by not pushing your changes until the changes you pull no longer break your local branch. Then you push your working code. Why is this better? 1) You integrate incoming changes with your changes instead of the other way around. 2) You do this continuously (every time you pull changes), so you fix problems when they happen. 3) Your changes only get pushed when they are stable, which means other people have less work with #1 and #2 on their side. 4) By keeping changes isolated from each other, you make them easier to test. Once tested, the changes are a lot easier to integrate.
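
As a minimal sketch of that loop, assuming a shared remote called origin and some command that runs your test suite, the decentralized discipline boils down to:

    git fetch origin
    git rebase origin/master      # integrate incoming changes with my changes, locally
    mvn test                      # or whatever runs your tests; only continue when green
    git push origin master        # push only code that works against the latest state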

Continuous integration can help here, but not if you only do it on the production branch: you need to do it all over the place. Serializing all change through one integration environment turns it into a bottleneck: your version system may be decentralized, but if your integration process is not, you are still going to be in trouble. A centralized build system works OK with a centralized version system, because a centralized version system serializes the changes anyway (which is a problem, not something to keep bear shaving). The whole point of decentralizing version management is decentralizing change. You need to decentralize the integration process as well.

In a nutshell, this is how the Linux kernel handles thousands of lines of changed code per day from hundreds of developers. And, yes, it is no coincidence that those guys came up with git. The Linux kernel deals with design erosion through continuous redevelopment. The change is not just additive; people are literally making changes all over the Linux source tree, all the time. There is no way in hell they could deal with this in a centralized version management environment. As far as I know, the Linux kernel has no automated continuous integration. But they do have thousands of developers running all sorts of developer builds and reporting bugs against them, which is really the next best thing. Nothing gets into the mainline kernel without this taking place.

Git: so far, so good

I started using git two months ago. Basically, colleagues around me fall into three categories:

  • Those that already use git or mercurial (a small minority).
  • Those that are considering starting to use it, like me a few months ago (a few).
  • Those that don’t get it (the majority).

To those that don’t get it: time to update your skill set. Not getting it is never good in IT, and keeping your skill set current is vital to long-term survival. Git is still new enough that you can get away with not getting it, but I don’t think that will last long.

The truth of the matter is that git mostly works as advertised; there are a few real benefits to using it and a few real problems with not using it. To start with the problems:

  • Not using git limits you to one branch: trunk. Don’t fool yourself into thinking otherwise. I’ve seen branching in svn a couple of times and it was not pretty.
  • Not using git forces you to either work in small, non-invasive increments or accept prolonged instability on trunk with lots of disruptive change coming in. Most teams tend to have a release heartbeat where trunk is close to useless except when a release is coming.
  • Not using git limits the size of the group of people that can work effectively on the same code base. Having too many people commit to the same code increases the number of conflicting changes.
  • Not using git regularly exposes you to merge problems and conflicts when you update your working copy from trunk.
  • Not using git forces a style of working that avoids the above problems: you don’t branch; people get angry when trunk breaks (which it does, often); you avoid making disruptive changes and, when you do, you work for prolonged periods of time without committing; when you finally commit, you find that some a**hole introduced conflicting changes on trunk while you weren’t committing; once you have committed, other people find that their uncommitted work now conflicts with trunk; etc.
  • Given the above problems, people avoid the types of changes that cause them to run into these problems. This is the real problem. Not refactoring because of potential conflicts is an anti-pattern. Not doing a change because it would take too long to stabilize means that necessary changes get delayed.

All of those problems are real, and the worst part is that people think they are normal. Git is hardly a silver bullet, but it does take away these specific problems. And that’s a real benefit. Because it is a real benefit, more and more people are starting to use git, which puts all those people not using it at a disadvantage. So, not getting it is causing you real problems now (of which you may not even be aware). Just because you don’t get it doesn’t stop the people who do get it from competing with you.

In the past few weeks, I’ve been gradually expanding my use of git. I started with the basics but I now find that my workflow is changing:

I’m no longer paranoid about updating from svn regularly, because the incoming changes tend not to conflict with my local work if I “git svn rebase”. Rebasing is a git-specific process where you pull in changes from remote and “reapply” your own local commits on top of them. Basically, before you push changes to remote, you rebase them on top of the latest and greatest available remote. This way your commits apply cleanly on top of the remote. So “git svn rebase” pulls changes from trunk and applies my local commits on top of them. Occasionally there are conflicts of course, but git tends to be pretty smart about resolving most of those. E.g. file renames tend to be no problem. In a few weeks of using git, I’ve only had to edit conflicts a couple of times, and in all of those cases it was straightforward. The frequency with which you rebase doesn’t really matter, since the process works on a per-commit basis and not on a merge basis like in svn.
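
In practice that update cycle is just a handful of commands; a sketch of what I run (the file path is a placeholder):

    git svn rebase               # fetch new svn revisions and replay my local commits on top
    # if a conflict shows up: fix the file, then
    git add path/to/file
    git rebase --continue
    # and when I want my commits on trunk:
    git svn dcommit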

I tend to isolate big chunks of work on their own git branch so I can switch between tasks. I have a few experimental things going on that change our production business logic in a pretty big way. Those changes live on their own git branch. Once in a while, I rebase those branches against master (which I rebase against svn trunk regularly) to get the latest changes from svn trunk onto the branch and to make sure that I can still push them back to trunk when the time comes. Simply being able to work on such changes without them disrupting trunk, or trunk changes disrupting them, is a great benefit. You tend not to experiment on svn trunk because this pisses people off. I can experiment all I want on a local branch, though. However, most of my branches are actually short-lived: just because I can sit on changes forever doesn’t mean I make a habit of doing so needlessly. The main thing for me is being able to isolate unrelated changes from each other and from trunk, and switching between those changes effortlessly.
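
Keeping such a long-lived branch in sync boils down to something like this (the branch name is made up):

    git checkout master
    git svn rebase               # bring master up to date with svn trunk
    git checkout big-refactor    # the experimental topic branch
    git rebase master            # replay the branch on top of the fresh master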

Branching and rebasing allow me to work on a group of related changes without committing back to trunk right away. I find that my svn commits tend to be bigger but less frequent now. I’ve heard people who don’t get it argue that this is a bad thing. And I agree: for svn users this would be a bad thing, because of the problems above. However, I don’t have those problems anymore because I use git. So why would I want to destabilize trunk with my incomplete work?

Whenever I get interrupted to fix some bug or address some issue, I just do the change on whatever branch I’m working on and commit it there. Then I do a git stash save to quickly store any uncommitted work in progress. I do a git checkout master followed by a git cherry-pick to get the commit with the fix onto master. Then I git svn rebase and git svn dcommit to get the change into trunk. Then I check out my branch again and do a git stash pop to pick up where I was before I was interrupted. This may sound complicated, but it means that I am never more than two commands away from a completely clean directory that exactly matches svn trunk, without losing any work in progress. So, no matter how disruptive the changes I am working on are, I can always switch to a clean replica of svn trunk, do a quick change, and then pick up the work on my disruptive changes. Even better, I can work on several independent sets of changes and switch between them in a few seconds.
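
Spelled out as commands, that interrupt workflow looks roughly like this (the file path, commit id, and branch name are placeholders):

    git add path/to/fixed/File.java         # stage just the fix
    git commit -m "Fix the urgent issue"    # commit it on the current topic branch
    git stash save                          # park the remaining uncommitted work in progress
    git checkout master
    git cherry-pick <sha-of-the-fix>        # copy just that commit onto master
    git svn rebase
    git svn dcommit                         # push the fix to svn trunk
    git checkout my-topic-branch            # back to where I was
    git stash pop                           # restore the work in progress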

So those are big improvements in my workflow that have been enabled by using git-svn to interface with svn. I’d love to be able to collaborate with colleagues on my experimental branches. Git would enable them to do that. This is why them not getting git is a problem.

By the way, you can replace git with mercurial in the text above. They are very similar in capabilities.

Git and agile

I’ve been working with Subversion since 2004 (we used a pre-1.0 version at GX). I started hearing about git around the 2006-2007 time frame, when Linus Torvalds’ replacement for BitKeeper started maturing enough for other people to use it. In 2008 I met people in Nokia working on Maemo (the Debian-based OS for the N770, N800, N810, and recently the N900) who were really enthusiastic about it. They had to use it to work with all the upstream projects Maemo depends on, and they loved it. When I moved to Berlin everybody there was using subversion, so I just conformed and ignored git, mercurial, and all those other cool versioning systems out there for an entire year. It turns out that was lost time; I should have switched around 2007/2008. I’m especially annoyed by this because I’ve been aware since 2006 that decentralized versioning is superior to centralized versioning. If you don’t believe me: I had a workshop paper at SPLC 2006 on version management and variability management that pointed out the emergence of DVCSes in that context. I’ve wasted at least three years. Ages for the early-adopter type of guy I still consider myself to be.

Anyway, after weighing the pros and cons for way too long, I switched from subversion to git last week. What triggered me to do this was, oddly, an excellent tutorial on Mercurial by Joel Spolsky. Nothing against Mercurial, but Git has the momentum in my view and it definitely appears to be the bandwagon to jump on right now. I don’t see any big technical argument for using Mercurial instead of Git. There’s GitHub and no comparable Mercurial hub as far as I know. So, I took Joel’s good advice on Mercurial as a hint that it was time to get off my ass and get more serious about switching to anything other than Subversion. I had already decided in favor of git based on what I’d been reading about both versioning systems.

My colleagues of course haven’t switched (yet, mostly), but that is not an issue with git-svn, which allows me to interface with svn repositories. I’d like to say making the switch was an easy ride, except it wasn’t. The reason is not git but me. Git is a powerful tool with quite a few more features than Subversion. Martin Fowler has a nice diagram on “recommendability” and “required skill”. Git is in the top right corner (highly recommended, but you’ll need to learn some new skills) and Subversion is lower right (recommended, not much skill needed). The good news is that you need only a small subset of commands to cover the feature set provided by svn, and you can gradually expand what you use from there. Even with this small subset git is worth the trouble IMHO, if only because world + dog are switching. The bad news is that you will just have to sit down and spend a few hours learning the basics. I spent a bit more time on this than I planned, but in the end I got there.

I should have switched around 2007/2008

The mistake I made that caused me to delay the switch for years was not realizing that git adds loads of value even when your colleagues are not using it: you will be able to collaborate more effectively even if you are the only one using git! There are two parts to my mistake.

The first part is that the whole point of git is branching. You don’t have a working copy, you have a branch. It’s exactly the same with git-svn: you don’t have an svn working copy, but a branch forked off svn trunk. So what, you might think. Git excels at merging between branches. With svn, branching and merging are painful, so instead of having branches and merging between them, you avoid conflicts by updating often and committing often. With git-svn, you don’t update from svn trunk, you merge its changes into your local branch. You are working on a branch by default, and creating more than one is really not something to be scared of. It is painless, even if you have a large amount of uncommitted work (which would get you in trouble with svn). Even if that work includes renaming the top-level directories in your project (I did this). Even if other people are doing big changes in svn trunk. That’s a really valuable feature to have around. It means I can work on big changes to the code without having to worry about upstream svn commits. The type of changes nobody dares take on because dealing with branching and merging would be too disruptive, and because there are “more important things” to do and we don’t want to “destabilize” trunk. Well, not any more. I can work on changes locally on a git branch for weeks if needed and push them back to trunk when they are ready, while at the same time my colleagues and I keep committing big changes on trunk. The reason I’m so annoyed right now is that the time I spent resolving svn conflicts over the past four years was essentially unnecessary. Not switching four years ago was a big mistake.

The second part of my mistake was assuming I needed IDE support for git to be able to deal with refactoring, and particularly class renames (which I do all the time in Eclipse). While there is EGit now, it is still pretty immature. It turns out that assuming I needed Eclipse support was a false assumption. If you rename a file in a git repository and commit it, git will automatically figure out that the file was renamed; you don’t need to tell it. A simple “mv foo.java bar.java” will work. On directories too. This is a really cool feature. So I can develop in Eclipse without it even being aware of any git specifics, refactor and rename as much as I like, and git will keep tracking the changes for me. Even better, certain types of refactoring that are quite tricky with Subclipse and Subversive just work in git. I’ve corrupted svn working directories on several occasions when trying to rename packages and move stuff around. Git handles this effortlessly. Merges work so well because git can handle the situation where a locally renamed file needs changes from upstream merged into it. It’s a core feature, not an argument against using it. My mistake. I probably spent even more time on corrupted svn directories than on conflict resolution in the last three years.
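
You can see this for yourself in any repository; git reconstructs the rename from the file contents, nothing is recorded at commit time:

    mv foo.java bar.java
    git add -A                     # stages the deletion of foo.java and the addition of bar.java
    git status                     # reports: renamed: foo.java -> bar.java
    git commit -m "Rename foo to bar"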

Git is an Agile enabler

We have plenty of pending big changes and refactorings that we have been delaying because they are disruptive. Git allows me to work on these changes whenever I feel like it without having to finish them before somebody else starts introducing conflicting changes.

This is not just a technical advantage. It is a process advantage as well. Subversion forces you to serialize change so that you minimize the interactions between changes. That’s another way of saying that subversion is all about waterfall. Git allows you to decouple changes instead and parallelize the work more effectively. Think multiple teams working on the same code base on unrelated changes. Don’t believe me? The Linux kernel community has thousands of developers from hundreds of companies working on the same code base, touching large portions of the entire source tree. Git is why that works at all and why they push out stable releases every couple of months. Linux kernel development speed is measured in thousands of lines of code modified or added per day. Evaluating the incoming changes every day is a full-time job for several people.

Subversion is causing us to delay necessary changes, i.e. changes that we would prefer to make if only they weren’t so disruptive. Delayed changes pile up to become technical debt. Think of git as a tool to manage your technical debt. You can work on business-value-adding changes (and keep the managers happy) and disruptive changes at the same time without the two interfering. In other words, you can be more agile. Agile has always been about technical enablers (refactoring tooling, unit testing frameworks, continuous integration infrastructure, version control, etc.) as much as about process. Having the infrastructure for rapid iterations and frequent releases is critical to the ability to release every sprint. You can’t do one without the other. Of course, tools don’t fix process problems. But then, process tends to be about working around lacking tools as well. Decentralized version management is another essential tool in this context. You can compensate for not using it with process, but IMHO life is too short to play bureaucrat.

Not an easy ride

But as I said, switching from svn to git wasn’t a smooth ride. Getting familiar with the various git commands and how they differ from what I am used to in svn has taken some time, despite the fact that I understand how git works and how I am supposed to use it. I’m a git newbie and I’ve been making lots of beginner mistakes (mainly using the wrong git commands for what I was trying to do). The good news is that I managed to get some pretty big changes committed back to the central svn repository without losing any work (which is the point of version management). The bad news is that I got stuck several times trying to figure out how to rebase properly, how to undo certain changes, and how to recover a messed-up checkout on top of my local working directory from the local git repository. In short, I learned a lot, and I still have some things to learn. On the other hand, I can track changes from svn trunk, have local topic branches, merge from those to the local git master, and dcommit back to trunk. That about covers all my basic needs.
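
For completeness, and as a sketch only (the branch name is made up), that basic cycle looks like this on the command line:

    git checkout -b some-topic     # a local topic branch for a chunk of work
    # ...edit, commit, repeat...
    git checkout master
    git svn rebase                 # track the latest changes from svn trunk
    git merge some-topic           # merge the topic branch into the local git master
    git svn dcommit                # push the result back to svn trunk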

Maven: the way forward

A bit longer post today. My previous blog post set me off pondering a couple of things I had been pondering before, which sort of fit nicely together into a potential way forward. In that previous post, and also this post, I spent a lot of words criticizing maven. People would be right to criticize me for blaming maven. However, that would be the wrong way to take my criticism. There’s nothing wrong with maven; it just annoys the hell out of me that it is needed and that I need to spend so much time waiting for it. In my view, maven is a symptom of a much bigger underlying problem: the Java server-side world (or rather the entire solution space for pretty much all forms of development) is bloated with tools, frameworks, application servers, and other stuff designed to patch over tiny problems with each other. Together, they sort of work, but it isn’t pretty. What if we wiped all of that away, very much like the Sun people did when they designed Java 20 years ago? What would be different? What would be the same? I cannot of course see this topic separately from my previous career as a software engineering researcher. In my view, a lot of ongoing developments of the past 20 years are now converging and morphing into something that could radically improve on the existing state of the art. However, I’m not aware of any specific projects taking on this issue in full, even though a lot of people are working on parts of the solution. What follows is essentially my thoughts on a lot of topics centered around taking Java (the platform, not necessarily the language) as a base level and exploring how I would like to see the platform morph into something worthy of the past 40 years of research and practice.

Architecture

Let’s start at the architecture level. Java packages were a mistake, which is now widely acknowledged. .Net namespaces are arguably better, and OSGi bundles with explicit required and provided APIs as well as API versioning are better still. To scale software into the cloud, where it must coexist with other software, including different (or identical) versions of itself, we need to get a grip on architecture.

The subject has been studied extensively (see here for a nice survey of some description languages) and I see OSGi as the most successful implementation to date that preserves important features most other development platforms currently lack, omit, or half improvise. The main issue with OSGi is that it layers stuff on top of Java but is not really a part of it. Hence you end up with a mix of manifest files that go into jar files; annotations that go into your source code; and cruft in the form of framework extensions to hook everything up, complete with duplicate functionality for logging, publish/subscribe patterns, and even web service frameworks. The OSGi people are moving towards a more declarative approach. Bring this to its ultimate conclusion and you end up with language-level support for basically all that OSGi is trying to do: explicit provided and required APIs, API versioning, events, dynamic loading/unloading, and isolation.

A nice feature of Java that OSGi relies on is the class loader. When used properly, it allows you to create a class loader, let it load classes, execute the functionality, and then destroy the class loader, after which everything it loaded can be garbage collected. This is nice for both dynamic loading and unloading of functionality and for isolating functionality (for security and stability reasons). OSGi depends heavily on this feature and many application servers try to use it. However, the mechanisms involved are not exactly bulletproof, and there are enormous problems with e.g. memory leaks, which make engineers very conservative about relying on these mechanisms in a live environment.

More recently, people have started to use dependency injection, where the need for something is expressed in the code (e.g. with an annotation) or externally in some configuration file. Then, at run time, a dependency injection container tries to fulfill the dependencies by creating the right objects and injecting them. Dependency injection improves testability and modularization enormously.

A feature of maven that people seem to like is its way of dealing with dependencies. You express what you need in the pom file and maven fetches the needed stuff from a repository. The maven, OSGi & Spring combo is about to happen. When it does, you’ll be specifying dependencies in four different places: Java imports, annotations, the pom file, and the OSGi manifest. But still, I think the combined feature set is worth having.

Language

Twenty years ago, Java was a pretty minimalistic language that took basically the best of the previous 20 years of OO languages and kept a useful subset. Inevitably, lots got discarded or not considered at all. Some mistakes were made, and over time the language absorbed some less than perfect versions of the stuff that didn’t make it. So, Java has no language support for properties; this was sort of bolted on by the setter/getter convention introduced in JavaBeans. It has inner classes instead of closures and lambda functions. It has no pure generics (parametrizable types), only complicated syntactic sugar that gets compiled to non-generic code. The initial concurrent programming concepts in the language were complex, broken, and dangerous to use. Subsequent versions tweaked the semantics and added some useful things like the java.util.concurrent package. The language is overly verbose, and 20 years after the fact there is now quite a bit of competition from languages that basically don’t suffer from all this. The good news is that most of those have implementations on top of the JVM. Let’s not let this degenerate into a language war, but clearly the language needs a proper upgrade. IMHO Scala could be a good direction, but it too already has some compromises embedded and lacks support for the architectural features discussed above. Message passing and functional programming concepts are now seen as important features for scalability. These are tedious at best in Java, whereas Scala supports them well while providing a much more concise syntax. Let’s just say a replacement of the Java language is overdue. On the other hand, it would be wrong to pick any one language as the language. Both .Net and the JVM are routinely used as generic runtimes for all sorts of languages. There’s also the LLVM project, a compiler toolchain that includes dynamic compilation in a VM as an option for basically anything GCC can compile.

Artifacts should be transient

So we now have a hypothetical language with support for all of the above. Let’s not linger on the details and move on to deployment and run time. The word compile comes from the early days of computing, when people had to punch holes into cards, then compile those into stacks and hand-feed them to big, noisy machines. In other words, compilation is a tedious & necessary evil. Java popularized the notion of just-in-time compilation and partial, dynamic compilation. The main difference is that just-in-time compilation merely moves the compilation step to the moment the class is loaded, whereas dynamic compilation goes a few steps further and takes run-time context into account to decide if and how to compile. IDEs tend to compile on the fly while you edit. So why bother with compilation after you finish editing and before you need to load your classes? There is no real technical reason to compile ahead of time beyond the minor one-time effort that might affect startup. You might want the option to do this, but it should not be the default.

So, for most applications, the notion of generating binary artifacts before they are needed is redundant. If nothing needs to be generated, nothing needs to be copied or moved either. This is true for both compiled and interpreted languages. A modern Java system basically uses a binary intermediate format that is generated before run time. That too is redundant. If you have dynamic compilation, you can just take the source code and execute it (generating any needed artifacts on the fly). You can still do in-IDE compilation for validation and static analysis purposes. The distinction between interpreted and statically compiled languages has become outdated and, as scripting languages show, not having to juggle binary artifacts simplifies life quite a bit. In other words, development artifacts (other than the source code) are transient, and with the transformation from code to running code automated and happening at run time, they should no longer be a consideration.

That means no more build tools.

Without the need to transform artifacts ahead of run time, the need for tools doing and orchestrating this also changes. Much of what maven does is basically generating, copying, packaging, and gathering artifacts. An artifact in maven is just a euphemism for a file. Doing this is actually pretty stupid work. With all of those artifacts redundant, why keep maven around at all? The answer to that is of course testing and continuous integration, as well as application life cycle management and other good practices (like generating documentation). Except that lots of other tools are involved with that as well. Your IDE is where you’d ideally review problems and issues. Something like Hudson playing together with your version management tooling is where you’d expect continuous integration to take place, and application life cycle management is part of your deployment environment. Architectural features of the language and run time, combined with good built-in application and component life cycles, remove much of the need for external tooling to support all this and improve interoperability.

Source files need to go as well

Visual Age and Smalltalk pioneered the notion of non-file-based program storage, where you modify the artifacts in some kind of database. Intentional programming research is basically about the notion that programs are essentially just interpretations of more abstract things that get transformed (just in time) into executable code or into different views (editable in some cases). Martin Fowler has long been advocating IP and what he refers to as the language workbench. In a nutshell, if you stop thinking of development as editing a text file and start thinking of it as manipulating abstract syntax trees with a variety of tools (e.g. rename refactoring), you sort of get what IP and language workbenches are about. Incidentally, concepts such as APIs, API versions, and provided & required interfaces are quite easily implemented in a language-workbench-like environment.

Storage, versioning, access control, collaborative editing, etc.

Once you stop thinking in terms of files, you can start thinking about other useful features (beyond tree transformations), like versioning or collaborative editing. There have been some recent advances in software engineering that I see as key enablers here. Number one is that version management systems are becoming decentralized, replicated databases. You don’t check out from git, you clone the repository and push back any changes you make. What if your IDE worked straight against your (cloned) repository? Then deployment becomes just a controlled sequence of replicating your local changes somewhere else (push-based, pull-based, or a combination of the two). A problem with this is of course that version management systems are still about manipulating text files. So they sort of require you to serialize your rich syntax trees to text, and you need tools to deserialize them in your IDE again. So, text files are just another artifact that needs to be discarded.

This brings me to another recent advance: CouchDB. CouchDB is one of the non-relational databases currently experiencing lots of (well-deserved) attention. It doesn’t store tables, it stores structured documents. Trees, in other words. Just what we need. It has some nice properties built in, one of which is replication. It’s built from the ground up to replicate all over the globe. The grand vision behind CouchDB is a cloud of all sorts of data where stuff just replicates to the place it is needed. To accomplish this, it builds on REST, map-reduce, and a couple of other cool technologies. The point is, CouchDB already implements most of what we need. Building a git-like revision control system for versioning arbitrary trees, or collections of trees, on top of it can’t be that challenging.

Imagine the following sequence of events. Developer A modifies his program. Developer B, working on the same part of the software, sees the changes (in real time, of course) and adds some more. Once both are happy, they mark the associated task as done. Somewhere on the other side of the planet a test server locally replicates the changes related to the task and finds everything is OK. Eventually the change, along with other changes, is tagged off as a new stable release. A user accesses the application on his phone and at the first opportunity (i.e. when connected), the changes are replicated to his local database. End to end, the words “file” and “artifact” appear nowhere. Also note that the bare minimum of data is transmitted: this is as efficient as it is ever going to get.

Conclusions

Anyway, just some reflections on where we are and where we need to go. Java did a lot of pioneering work in a lot of different domains, but it is time to move on from the way our grandfathers operated computers (well, mine won’t touch a computer if he can avoid it, but that’s a different story). Most people selling silver bullets in the form of maven, ruby, continuous integration, etc. are stuck in the current thinking. These are great tools, but only in the context of what I see as a deeply flawed end-to-end system. A lot of additional cruft is under construction to support the latest cloud computing trends (which are essentially about managing a lot of files in a distributed environment). My point here is that taking a step back and rethinking things end to end might be worth the trouble. We’re so close to radically changing the way developers work here. Remove files and source code from the equation and what is left for maven to do? The only right answer is: nothing.

Why do I think this needs to happen? Well, developers are currently wasting enormous amounts of time on what are essentially redundant things rather than developing software. The last few weeks were pretty bad for me: I was just handling deployment and build configuration stuff. Tedious and slow, and maven is part of the problem.

Update 26 October 2009

Just around the time I was writing this, some people came up with Play, a framework + server inspired by Python’s Django that brings a couple of cool features. The best one: no application server restarts required, just hit F5. This works for Java source changes as well. Clearly, I’m not alone in viewing the Java server-side world as old and bloated. Obviously it lacks a bit in functionality, but that’s easily fixed. I wonder how this combines with a decent dependency injection framework. My guess is not well, because dependency injection frameworks require a context (i.e. state) to be maintained, and Play is designed to be stateless (like Django). Basically, each save potentially invalidates the context, requiring a full reload of that as well (i.e. a server restart). It seems the Play guys have identified the pain point in Java: server-side state comes at a price.