git-svn flow

In recent years, git has rapidly been adopted across the software industry. Initially it was a toy used mainly in open source projects, but it is now finding its way into the rest of the software development world. At this point, git is emerging as the version control tool of choice for a large group of developers.

One reason for git’s popularity is that it enables many new workflows that let development teams change the way they deliver software, and deliver better software faster.

In this blog post I present some ideas for implementing a git workflow that goes beyond the centralized workflow common to projects with a central subversion repository, such as the project I am currently working on, while still keeping that central repository.

Git and agile

I’ve been working with Subversion since 2004 (we used a pre-1.0 version at GX). I started hearing about git around 2006-2007, when Linus Torvalds’ replacement for BitKeeper had matured enough for other people to use it. I met people at Nokia working on Maemo (the Debian-based OS for the N770, N800, N810, and recently the N900) who were really enthusiastic about it in 2008. They had to use it to work with all the upstream projects Maemo depends on, and they loved it. When I moved to Berlin everybody there was using subversion, so I just conformed and ignored git, mercurial, and all those other cool versioning systems out there for an entire year. It turns out that was lost time; I should have switched around 2007/2008. I’m especially annoyed by this because I’ve been aware that decentralized versioning is superior to centralized versioning since 2006. If you don’t believe me, I had a workshop paper at SPLC 2006 on version management and variability management that pointed out the emergence of DVCSes in that context. I’ve wasted at least three years. Ages for the early adopter type of guy I still consider myself to be.

Anyway, after weighing the pros and cons for way too long, I switched from subversion to git last week. What triggered this was, oddly, an excellent tutorial on Mercurial by Joel Spolsky. Nothing against Mercurial, but Git has the momentum in my view and it definitely appears to be the bandwagon to jump on right now. I don’t see any big technical argument for using Mercurial instead of Git. There’s github and no mercurial hub as far as I know. So, I took Joel’s good advice on Mercurial as a hint that it was time to get off my ass and get more serious about switching to anything other than Subversion. I had already decided in favor of git based on what I’ve been reading about both versioning systems.

My colleagues of course haven’t switched (yet, mostly) but that is not an issue with git-svn, which allows me to interface with svn repositories. I’d like to say making the switch was an easy ride, except it wasn’t. The reason is not git but me. Git is a powerful tool that has quite a few more features than Subversion. Martin Fowler has a nice diagram on “recommendability” and “required skill”. Git is in the top right corner (highly recommended but you’ll need to learn some new skills) and Subversion is lower right (recommended, not much skill needed). The good news is that you will need only a small subset of commands to cover the feature set provided by svn and you can gradually expand what you use from there. Even with this small subset git is worth the trouble IMHO, if only because world + dog are switching. The bad news is that you will just have to sit down and spend a few hours learning the basics. I spent a bit more time on this than I planned to, but in the end I got there.

I should have switched around 2007/2008

The mistake I made that caused me to delay the switch for years was not realizing that git adds loads of value even when your colleagues are not using it: you will be able to collaborate more effectively if you are the only one using git! There are two parts to my mistake.

The first part is that the whole point of git is branching. You don’t have a working copy, you have a branch. It’s exactly the same with git-svn: you don’t have a svn working copy but a branch forked off svn trunk. So what, you might think. Git excels at merging between branches. With svn, branching and merging are painful, so instead of having branches and merging between them, you avoid conflicts by updating often and committing often. With git-svn, you don’t update from svn trunk, you merge its changes into your local branch. You are working on a branch by default, and creating more than one is really not something to be scared of. It is painless, even if you have a large amount of uncommitted work (which would get you in trouble with svn). Even if that work includes renaming the top level directories in your project (I did this). Even if other people are making big changes in svn trunk. That’s a really valuable feature to have around. It means I can work on big changes to the code without having to worry about upstream svn commits: the type of changes nobody dares to take on because it would be too disruptive to deal with branching and merging, because there are “more important things” to do, and because we don’t want to “destabilize” trunk. Well, not any more. I can work on changes locally on a git branch for weeks if needed and push them back to trunk when they are ready, while at the same time my colleagues and I keep committing big changes on trunk. The reason I’m so annoyed right now is that the time I spent resolving svn conflicts in the past four years was essentially unnecessary. Not switching four years ago was a big mistake.

The second part of my mistake was assuming I needed IDE support for git to be able to deal with refactoring, and particularly class renames (which I do all the time in Eclipse). While there is egit now, it is still pretty immature. It turns out that assuming I needed Eclipse support was wrong. If you rename a file in a git repository and commit the change, git will automatically figure out that the file was renamed; you don’t need to tell it. A simple “mv” will work. On directories too. This is a really cool feature. So I can develop in Eclipse without it even being aware of any git specifics, refactor and rename as much as I like, and git will keep tracking the changes for me. Even better, certain types of refactorings that are quite tricky with subclipse and subversive just work in git. I’ve corrupted svn work directories on several occasions when trying to rename packages and move stuff around. Git handles this effortlessly. Merges work so well because git can handle the situation where a locally renamed file needs changes from upstream merged into it. It’s a core feature, not an argument against using it. My mistake. I probably spent even more time on corrupted svn directories than on conflict resolution in the last three years.
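To see the rename detection for yourself, here is a small sketch that runs in a throwaway repository. The file names and the demo identity are made up for illustration; the only assumption is that git is on your path.

```shell
#!/bin/sh
# Demonstrates that git recognizes a plain `mv` as a rename.
# Everything happens in a temporary directory; the names are hypothetical.
set -e

demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

# Commit a file under its original name.
printf 'line one\nline two\nline three\n' > OldName.java
git add OldName.java
git commit -q -m "Add OldName.java"

# Rename it with plain mv, without telling git anything about the move.
mv OldName.java NewName.java
git add -A
git commit -q -m "Rename OldName.java to NewName.java"

# Ask git to compare the two commits with rename detection (-M) turned on.
rename_status=$(git diff -M --name-status HEAD~1 HEAD)
echo "$rename_status"
```

The status line starts with an R followed by a similarity score when git has matched the old and new names by content, which is what lets it carry upstream changes into a locally renamed file.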

Git is an Agile enabler

We have plenty of pending big changes and refactorings that we have been delaying because they are disruptive. Git allows me to work on these changes whenever I feel like it without having to finish them before somebody else starts introducing conflicting changes.

This is not just a technical advantage. It is a process advantage as well. Subversion forces you to serialize change so that you minimize the interactions between the changes. That’s another way of saying that subversion is all about waterfall. Git allows you to decouple change instead and parallelize the work more effectively. Think multiple teams working on the same code base on unrelated changes. Don’t believe me? The linux kernel community has thousands of developers from hundreds of companies working on the same code base touching large portions of the entire source tree. Git is why that works at all and why they push out stable releases every 6 weeks. Linux kernel development speed is measured in thousands of lines of code modified or added per day. Evaluating the incoming changes every day is a full time job for several people.

Subversion is causing us to delay necessary changes, i.e. changes that we would prefer to make if only they weren’t so disruptive. Delayed changes pile up to become technical debt. Think of git as a tool to manage your technical debt. You can work on business value adding changes (and keep the managers happy) and disruptive changes at the same time without the two interfering. In other words, you can be more agile. Agile has always been about technical enablers (refactoring tooling, unit testing frameworks, continuous integration infrastructure, version control, etc.) as much as it was about process. Having the infrastructure to do rapid iterations and release frequently is critical to the ability to release every sprint. You can’t do one without the other. Of course, tools don’t fix process problems. But then, process tends to be about workarounds for lacking tools as well. Decentralized version management is another essential tool in this context. You can compensate for not using it with process. IMHO life is too short to play bureaucrat.

Not an easy ride

But as I said, switching from svn to git wasn’t a smooth ride. Getting familiar with the various git commands and how they differ from what I am used to in svn has taken some time, despite the fact that I understand how git works and how I am supposed to use it. I’m a git newbie and I’ve been making lots of beginner’s mistakes (mainly using the wrong git commands for the things I was trying to do). The good news is that I managed to get some pretty big changes committed back to the central svn repository without losing any work (which is the point of version management). The bad news is that I got stuck several times trying to figure out how to rebase properly, how to undo certain changes, and how to recover a messed up checkout on top of my local work directory from the local git repository. In short, I learned a lot and I still have some more things to learn. On the other hand, I can track changes from svn trunk, have local topic branches, merge from those to the local git master, and dcommit back to trunk. That about covers all my basic needs.
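The four basic needs listed above can be sketched as a handful of shell functions. This is a minimal sketch under assumptions, not a record of the exact commands used here: the repository URL and the function names are hypothetical, the svn repository is assumed to use the standard trunk/branches/tags layout (that is what -s tells git svn clone), and the local branch is assumed to be called master. Rebasing the topic branch and then doing a fast-forward merge is one way to keep history linear before dcommit; it is a choice, not a requirement stated in this post.

```shell
#!/bin/sh
# Sketch of a git-svn round trip. SVN_URL is a placeholder, not a real server.
SVN_URL="https://svn.example.com/repos/project"

# One-time setup: clone the svn repository into directory $1.
# -s assumes the standard trunk/branches/tags layout.
clone_from_svn() {
    git svn clone -s "$SVN_URL" "$1"
}

# Track changes from svn trunk: fetch new revisions and replay any
# local commits on master on top of them.
update_master() {
    git checkout master &&
    git svn rebase
}

# Start a local topic branch named $1 off master.
start_topic() {
    git checkout -b "$1" master
}

# Land topic branch $1: refresh master, rebase the topic on top of it,
# fast-forward master, then send the commits back to svn trunk.
land_topic() {
    update_master &&
    git checkout "$1" &&
    git rebase master &&
    git checkout master &&
    git merge --ff-only "$1" &&
    git svn dcommit
}
```

A typical session would be clone_from_svn project once, then start_topic big-refactoring, commit locally for as long as needed, and finally land_topic big-refactoring. Note that git svn rebase expects a clean working tree, so commit or stash first.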

Java & Toys

After a few months of doing python development, which to me still feels like a straitjacket, I had some Java coding to do last week and promptly wasted a few hours checking out the latest toys, being:

  • Eclipse 3.4 M7
  • Hudson
  • Findbugs for Hudson

Eclipse 3.4 M7 is the first milestone I’ve tried for the upcoming Eclipse release. This is due to me not coding Java much lately; nothing wrong with it otherwise. Normally, I’d probably have switched around M4 already (at least I did so for the 3.2 and 3.3 cycles). In fact it is a great improvement and several nice productivity enhancements are included. My favorite one is the problem hover that now includes links to quick fixes. So instead of point, click, typing ctrl+1, arrow down (1..*), enter, you can now lean back and point & hover + click. Brilliant. I promptly used it to convert some 1.4 non generics based code into nice generics based code, simply by tackling all the generics related warnings one by one, essentially only touching the keyboard to suggest a few type parameters Eclipse couldn’t figure out. The introduce generic type parameter and infer generics refactorings are very helpful here. The code of course compiled and executed as expected. No bugs introduced and the test suite still runs fine. Around 5000 lines of code refactored in under 20 minutes. I still have some work to do to remove some redundant casts and to replace some while/for loops with foreach.

Other nice features are the new breadcrumbs bar (brilliant!) and a new refactoring to create parameter classes for overly long lists of parameters on methods. Also nice is a refactoring to convert String concatenation into StringBuffer.append calls, although StringBuilder is slightly faster for cases where you don’t need thread safe code (i.e. most of the time). The rest is the usual amount of major and minor refinements that I care less about but are nice to have around anyway. One I imagine I might end up using a lot is quickfixes to sort out osgi bundle dependencies. You might recall me complaining about this some time ago. Also be sure to read Peter Kriens’ reply to this btw. Bnd is indeed nice but tools don’t solve what is in my view a kludge. Both the new eclipse feature and BND are workarounds for the problem that what OSGI is trying to do does not exist at (and is somewhat at odds with) the Java type level.

Anyway, the second thing I looked into was Hudson, a nice server for continuous integration. It can check out anything from a wide range of version control systems (subversion is supported out of the box, several others through plugins) and run any script you like. It also understands maven and knows how to launch ant. With the right plugins you can then let it do quite useful things like compiling, running static code analyzers, deploying to a staging server, running test suites, etc. Unlike some stuff I evaluated a few years ago, this actually worked right out of the box and was so easy to set up that I promptly did so for the project I’m working on. Together with loads of plugins that add all sorts of cool functionality, this means you have just run out of excuses not to do continuous integration.

One of the plugins I’ve installed so far is for an old favorite, Findbugs, which promptly drew my attention to two potentially dangerous bugs and a minor performance bug in my code, reminding me that running this and making sure it doesn’t complain is actually quite important. Of all code checkers, findbugs provides the best mix: it finds loads of stuff that actually needs fixing without being obnoxious about it, and without needing a lot of configuration (which e.g. checkstyle and pmd require before they shut the fuck up about stupid stuff I don’t care about).

While of course Java centric, Hudson can be taught other tricks as well. So, next on my agenda is creating a job for our python code and hooking that up to pylint and possibly our django unit tests. There are plugins around for both tasks.

Managing wordpress deployment

This little article is a summary of how I currently manage my wordpress blog. The site lists some advice on how to manage a wordpress installation using subversion. However, I have a slightly more sophisticated setup that preserves my modifications (as long as they don’t conflict) that I maintain in a private branch of wordpress.

I use rsync to push and pull changes remotely (over ssh). Since a good howto seems to be lacking online and since I spent a while figuring out how to do all this, I decided to share my little setup.

Using rsync for backup

As you may recall, I had a nice incident recently which made me really appreciate the fact that I was able to restore my data from a backup. Over the years I’ve sort of cobbled together my own backup solution using rsync (I use the cygwin port to windows).

First a little about hardware. Forget about using CDs or DVDs. They are just too unreliable. I’m currently recovering data from a whole bunch of CDs I had and am horrified to discover that approximately one third have CRC errors on them. Basically, the light sensitive layer has deteriorated to the point that the disc becomes unreadable, sometimes within as little as 2 years. I’ve used various brands of CDs over the years and some of them have higher failure rates than others, but no brand seems to be 100% OK. In other words, I’ve lost data stored on pretty much every CD brand I’ve ever tried. Fujifilm (1-48x) and unbranded CDs are particularly bad (well over 50% failure rate); on the other hand, most of my Imation CDs seem fine so far. Luckily I didn’t lose anything valuable or irreplaceable. But it has made it clear to me not to trust this medium for backups.

So, I’ve started putting money in external harddrives. External drives have several advantages: they are cheap; they are big and they are much more convenient. So far I have two usb external harddrives. I have a 300GB Maxtor drive and the 500GB Lacie Porsche drive I bought a few weeks back. Also I have a 300 GB drive in my PC. Yes that’s 1.1 TB altogether :-).

The goal of my backup procedures is to be ‘reasonably’ safe. Technically if my apartment burns down, I’ll probably lose all three drives and all data on them. Moving them offsite is the obvious solution but this also makes backups a bit harder. Reasonably safe in my view means that my backed up data survives total media failure on one of the drives and gives me an opportunity to get to the reasonably safe state again. When I say my data, I’m referring to the data that really matters to me: i.e. anything I create, movies, music, photos, bookmarks, etc.

This data is stored in specific directories on my C drive and also in a directory on my big Lacie drive. I use the Maxtor drive to back up that directory and use the remaining 200GB on the Lacie drive for backing up stuff from my C drive.

All this is done using commands like this:

rsync -i -v -a --delete ~/photos/ /cygdrive/e/backup/photos >> /cygdrive/e/backup/photos-rsync.txt

This probably looks a bit alien to a windows user. I use cygwin, a port of much of the gnu/linux tool chain that layers a more linux like filesystem on top of the windows filesystem. So /cygdrive/c is just the equivalent of good old c:\. One of the ported tools is ln, which I’ve used to make symbolic links in my cygwin home directory to stuff I want to backup. So ~/photos actually points to the familiar My Pictures directory.

Basically the command synchronizes the first directory to the second directory. The flags ensure that the content of the second directory is identical to that of the first directory after execution. The --delete flag allows it to remove stuff that isn’t in the first directory. Rsync is nice because it works incrementally, i.e. it doesn’t copy data that’s already there.

The bit after the >> just redirects the output of rsync to a text file so that afterwards, you can verify what has actually been backed up. I use the -v flag to let rsync tell me exactly what it is doing.

Of course typing this command is both error prone and tedious. For that reason I’ve collected all my backup related commands in a nice script which I execute frequently. I just turn on the drives; type ./ and go get some coffee. I also use rsync to back up my remote website, which is easy because rsync also works over ssh.
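Such a script could look roughly like the sketch below. It is an illustration rather than my actual script: the function name backup_dir and the BACKUP_ROOT variable are made up, and it simply wraps the rsync command shown earlier so that every directory gets the same flags and its own log file.

```shell
#!/bin/sh
# Hypothetical backup helper built around the rsync command shown earlier.
# BACKUP_ROOT defaults to the external drive path used in this post.
BACKUP_ROOT="${BACKUP_ROOT:-/cygdrive/e/backup}"

# backup_dir NAME SOURCE
# Mirrors SOURCE into $BACKUP_ROOT/NAME (deleting files that no longer
# exist in SOURCE) and appends rsync's itemized output to
# $BACKUP_ROOT/NAME-rsync.txt for later inspection.
backup_dir() {
    name=$1
    src=$2
    mkdir -p "$BACKUP_ROOT/$name" || return 1
    rsync -i -v -a --delete "$src/" "$BACKUP_ROOT/$name" \
        >> "$BACKUP_ROOT/$name-rsync.txt"
}
```

With that in place, each backup becomes one short line, for example backup_dir photos ~/photos. The trailing slash added to the source makes rsync copy the contents of the directory rather than the directory itself, matching the original command.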

Part of my backup script is also creating a dump from my subversion repository. I store a lot of stuff in a subversion repository these days: my google earth placemarks; photos; documents and also some source code. The subversion work directories are spread across my harddrive but the repository itself sits in a single directory on my C drive. Technically I could just back that up using rsync. However, using

svnadmin dump c:/svnrepo | gzip > /cygdrive/e/backup/svnrepo.gz

to dump the repository allows me to actually recreate the repository in any version of subversion from the dump. Also the dump file tends to be nicely compressed compared to either the work directory or the repository directory. Actually, the work directory is the largest because it contains 3 copies of each file. In the repository, everything is stored incrementally, and in the dump gzip squeezes it even further. The nice thing about a version repository is of course that you also preserve the version history.
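Going the other way, recreating a repository from such a dump, is roughly the sketch below. The function name and the target path in the usage note are hypothetical; svnadmin create and svnadmin load are the standard subversion commands for this, and load reads the dump from standard input.

```shell
#!/bin/sh
# restore_repo DUMP_GZ TARGET_DIR
# Creates a fresh, empty repository at TARGET_DIR and replays the
# gzipped dump file into it.
restore_repo() {
    dump_file=$1
    target=$2
    svnadmin create "$target" || return 1
    gzip -dc "$dump_file" | svnadmin load "$target"
}
```

For example, restore_repo /cygdrive/e/backup/svnrepo.gz /cygdrive/c/svnrepo-restored would rebuild the repository, history included, in a new directory. It is worth trying this once before you actually need it, since a backup you have never restored is only a hope.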

Subversion 1.4 windows binaries

For a few weeks I’ve been waiting for cygwin to update their subversion binaries. But for some reason they are not in a hurry. Subversion 1.4 was recently released and this time it includes some changes to both the repository format and the work directory format. If you use 1.4 binaries on your work directory, it will automatically be upgraded. Nice, except cygwin still has 1.3 binaries, which no longer recognize the work directory after you’ve used tortoise svn 1.4 on it. Similarly, the repository is upgraded if you use a 1.4 version on it. So, upgrading is not recommended unless you can upgrade all subversion related tools to 1.4.

Naturally I found this out after upgrading tortoise svn to 1.4 :-).

Luckily, you don’t need cygwin binaries. So here’s what you can do instead:

  • download the win32 commandline subversion tools from tigris and install them.
  • modify your path to add the bin directory
  • uninstall the obsolete cygwin subversion version

Of course the win32 version doesn’t handle cygwin paths too well. Luckily subversion handles moving of repositories pretty well. In my case my repositories were in /svnrepo, which in reality is the path c:\cygwin\svnrepo on windows. Since I use the svn+ssh protocol, the urls for all my work directories were svn+ssh://localhost/svnrepo/… These urls of course broke because the win32 binaries interpret the path /svnrepo differently than the cygwin version does. Solution: mv /svnrepo /cygdrive/c.

This allows me to continue to use the same subversion urls and all my tools now work. Also, in the future I won’t have to wait for cygwin to upgrade their subversion binaries and can get them straight from tigris.


If you use TortoiseSvn (a popular subversion frontend that integrates into explorer), you might be interested in xdocdiff. This tool can be plugged in as a diff viewer for several binary file formats, including .doc, .pdf, .ppt and .xls. It works by converting both revisions of a file to txt and then using the regular diff viewer to show what has changed.

I’ve been using subversion for document management for some time now. It is really easy to set up (there are several ways actually) and really useful when working on a rapidly evolving document. Now with the ability to examine changes between revisions, it is even more useful!

why cvs sucks

Having worked with subversion quite extensively, I know what to expect from a proper versioning system. However, I’m currently working in a project with a cvs server. Here are a few of my observations:

  • I can’t seem to get tortoisecvs and eclipse to work together on the same checked out repository. Reason: tortoisecvs does not support eclipse’s custom extssh stuff. Duh, ssh is only the most common way to access a cvs repository these days.
  • I’m unable to put things like directories under version management, only the contents.
  • I don’t seem to have an easy way of getting the version history for a directory of files with the commit messages ordered by date. I’m sure there are scripts to do this but it is definitely not supported natively (and likely to bog down the cvs server).
  • Each file has its own version number.
  • Commit messages do not specify what files were committed (unless you paste this information in the message of course)!
  • I managed on several occasions to do a partial commit. That is, some files were committed and some weren’t.
  • Each time I refactor, I lose version history because cvs considers the result of a copy, move or rename to be a new file.
  • My attic is full of files that now have a different name
  • Branching and tagging create full serverside copies of whatever is being branched or tagged

Subversion doesn’t have any of these problems. CVS seems to actively discourage refactoring, is dangerously unreliable for large commits and does an extremely poor job of keeping version history. The fact that I encounter all of these problems on a toy project says a lot about CVS.
This is why CVS should not be used for serious projects:

  • Refactoring now is a key part of development. Especially (good) java developers refactor all the time.
  • The versioning system should never ever end up in an inconsistent state due to partial commits. There is no excuse for this. A version management system without built in protection against this should be considered seriously broken.
  • Being able to keep track of version history is crucial to monitoring project progress. I’ve been a release manager. Only now am I starting to realize how much my job would have sucked if we had been CVS users. I feel sorry for all these open source people who still continue to use CVS on a voluntary basis.


I found this article rather insightful -Ofun

I agree with most of it. Many software projects (commercial, oss, big & small) have strict guidelines with respect to write access to source repositories and the usage of these rights. As the author observes, many of these restrictions find their roots in the limited ability of legacy revision control systems to roll back undesirable changes and to merge sets of coherent changes, and not in any inherent process advantages (like enforcing reviews or preventing malicious commits). Consequently, this practice restricts programmers in their creativity.

Inviting creative developers to commit on a source repository is a very desirable thing. It should be made as easy as possible for them to do their thing.

On more than one occasion I have spent some time looking at source code from some OSS project (to figure out what was going wrong in my own code). Very often my hands start to itch to make some trivial changes (refactor a bit, optimize a bit, add some functionality I need). In all of these cases I ended up not doing these changes because committing the change would have required a lengthy process involving:
– get on the mailing list
– figure out who to discuss the change with
– discuss the change to get permission to send the change to this person
– wait for the person to accept/reject the change

This can be a lengthy process, and upfront you already feel guilty about contacting the person about this trivial change with your limited knowledge of the system. In short, the size of the project and its members scare off any interested developers except the ones determined to get their change in.

What I’d like to do is this:
– Checkout tomcat (I work with tomcat a lot, fill in your favorite OSS project)
– Make some change I think is worthwhile having without worrying about consequences, opinions of others, etc.
– Commit it with a clear message why I changed it.
– Leave it to the people who run the project to laugh away my ignorance or accept the change as they see fit.

If the apache people don’t want the change, fine. Undo it, don’t merge, whatever. But don’t restrict people’s right to suggest changes/improvements in any kind of way. If you end up rejecting 50% of the commits, that means you still got 50% useful stuff. The reviewing and merging workload can be distributed among people.

In my current job (for GX, the company that I am about to leave), I am the release manager. I am the guy in charge of the source repositories of the entire GX product line. I’d like to work as outlined above but we don’t. Non product developers in the company need to contact me by mail if they want to get their changes in. Some of them do, most of them don’t. I’m convinced that I’d get a lot of useful changes. We use subversion, which is nice but not very suitable for the way of working outlined above and in the article I quoted. Apache also uses subversion, so I can understand why they don’t want to give people like me commit rights just like that.

So why is this post labelled as software engineering science? Well, I happen to believe that practice is ahead of the academic community (of which I am also a part) in some things. Practitioners have a fine nose for tools and techniques that work really well. Academic software engineering researchers don’t, for a variety of reasons:
– they don’t engineer that much software
– very few of them develop at all (I do, I’m an exception)
– they are not very familiar with the tools developers use

In the past two years in practice I have learned a number of things:
– version control is key to managing large software projects. Everything in a project revolves around putting stuff in and getting stuff out of the repository. If you didn’t commit it, it doesn’t exist. Committing it puts it on the radar of people who need to know about it.
– Using branches and tags is a sign the development process is getting more mature. It means you are separating development from maintenance activities.
– Doing branches and tags on the planned time and date is an even better sign: things are going according to some plan (i.e. this almost looks like engineering).
– Software design is something non software engineers (including managers and software engineering researchers) talk about, a lot. Software engineers are usually too busy to bother.
– Consequently, little software actually gets designed in the traditional sense of the word (create important looking sheets of paper with lots of models on them).
– Instead two or three developers get together for an afternoon and lock themselves up with a whiteboard and a requirements document to take the handful of important decisions that need to be taken.
– Sometimes these decisions get documented. This is called the architecture document.
– Sometimes a customer/manager (same thing really) asks for pretty pictures. Only in those cases is a design document created.
– Very little new software gets built from scratch.
– The version repository is the annotated history of the software you are trying to evolve. If important information about design decisions is not part of the annotated history, it is lost forever.
– Very few software engineers bother annotating their commits properly.
– Despite the benefits, version control systems are very primitive systems. I expect much of the progress in development practice in the next few years to come from major improvements in version control systems and the way they integrate into other tools such as bug tracking systems and document management systems.

Some additional observations on OSS projects:
– Open source projects have three important tools: the mailinglist, the bug tracking system and the version control system (and to a lesser extent wikis). These tools are primitive compared to what is used in the commercial software industry.
– Few oss projects have explicit requirements and design phases.
– In fact all of the processes used in OSS projects are about the use of the aforementioned tools.
– Indeed few oss projects have designs.
– Instead oss projects evolve and build a reputation after an initial commit of some prototype by a small group of people.
– Most of the life cycle of an oss project consists of evolving it more or less ad hoc. Even if there is a roadmap, that usually only serves as a common frame of reference for developers rather than as a specification of things to implement.

I’m impressed by how well some OSS projects (mozilla, kde, linux) are run and think that the key to improving commercial projects is to adopt some of the better practices in these projects.

Much commercial software actually evolves in a very similar fashion, despite manager types keeping up appearances by stimulating the creation of lengthy design and requirements documents, usually after the development has finished.