Re: bear shaving

I was going to submit the stuff below in shortened form as a comment to this fun little blog post on "bear shaving", but it sort of grew into a full-blown article, again. To summarize the original article: it draws a nice analogy between shaving bears to help them cope with global warming and fixing symptoms without addressing the core issue (not to mention that shaving bears is dangerous). The analogy is applied to integration builds and people patching things up. Then the author sort of goes off and comes up with a few arguments against git and decentralization.

While some of the criticism is valid, this of course ticked me off 🙂

I see Git as a way to increase the amount of change and to deal more effectively with people working in parallel. Yes, this puts a strain on integrating the resulting changes. But less change is the equivalent of bear shaving here. Change is good. Change is productivity. You want more productivity, not less. You want to move forward as fast as you possibly can. Integration builds breaking is a symptom of a larger problem. Bear shaving would be doing everything you can to make the integration builds work again, including forcing people to sit on their hands. The typical reflex to a crisis like this in the software industry is less change, complete with a process to ensure that people do less. This is how waterfall was born. Iterative or spiral development is about the same thing, done more frequently and for shorter periods. That was generally seen as an improvement, but you are still going to sit on your hands for prolonged periods of time. The real deal these days is continuous deployment, and you can't do that if you are sitting on your hands.

Breaking integration builds have a cause: the people making the changes are piling mistake on mistake and keep bear shaving the problem (I love the metaphor) because they are under pressure to release and deliver functionality. All a faster pace of development does is make this more obvious. Along with the increased amount of change per time unit comes an increased amount of mistakes per time unit. Every quick fix and every misguided commit makes the system as a whole a little less stable. That's why the waterfall model includes a feature freeze (a.k.a. integration) where no changes are allowed, because the system would never get finished otherwise.

A long time ago I wrote an article about design erosion. It was one of the cornerstones of my PhD thesis (check my publication page if you are interested). In a nutshell: changes are cumulative, and we take design decisions in the context of our expectations of the future. The only problem: nobody can predict the future accurately, so some of those decisions will turn out to be wrong, and you won't realize it right away. You can't just rip out a single change you made months or years ago without affecting the changes that were built on top of it. In other words, change is cumulative: rip one piece out and the whole sand castle collapses. Some decisions will be wrong or will have to be reconsidered at some point, and because changes are interdependent, fixing design erosion can be painful and expensive. Consequently, all software designs erode over time, because such fixes are inevitably delayed until the last possible moment. Design erosion is a serious problem. You can't fix a badly eroded system that you have had for years overnight. Failing to address design erosion in time can actually kill your company, product or project. But you can delay the inevitable by dealing with problems close to where they originate instead of dealing with them later. Dealing with a problem close to where it originates means fewer subsequent changes are affected, which minimizes the cost of fixing it. Breaking integration builds are a symptom of an eroding design. Delaying the fix makes it worse.

So, the solution is to refactor and rethink the broken parts of the system to make them more robust, easier to test, flexible enough to meet the requirements, etc. Easier said than done, of course. However, Git is a revolutionary enabler here: you can do the more disruptive work on a branch and merge it back in when it is ready, instead of when you go home and break the nightly build. This way you can make big changes without destabilizing your source tree. Of course you want continuous integration on your branches too. That way you push fewer mistakes between branches, solving problems closer to their origin and without the branches affecting each other. You will still have breaking builds, but they will be cheaper to fix. Decentralization is the solution here, not the problem, as is suggested in the blog post I linked above!
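
To make that concrete, here is a minimal sketch of such a branch workflow (the branch name and the test command are made up for illustration):

git checkout -b rework-storage               # do the disruptive work on its own branch
# ... hack away, committing early and often ...
git commit -a -m "rework the storage layer"
make test                                    # run your test suite before merging
git checkout master
git merge rework-storage                     # merge back only when the branch is ready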

Here's why decentralization works: testing effort grows much faster than linearly with the amount of change (double the amount of change and you roughly quadruple the testing effort). So don't do that; keep the testing effort low. In a centralized world you do this through a feature freeze: by stopping all change, you can actually find all the problems you introduced. In a decentralized world you do this by not pushing your changes until the changes you pull no longer break your local branch. Then you push your working code. Why is this better? 1) You integrate incoming changes with your changes instead of the other way around. 2) You do this continuously (every time you pull changes), so you fix problems when they happen. 3) Your changes only get pushed when they are stable, which means other people have less work with #1 and #2 on their side. 4) By keeping changes isolated from each other, you make it easier to test them. Once tested, the changes are a lot easier to integrate.
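
In git terms that loop might look something like this (the remote name and the test command are assumptions):

git pull origin master          # integrate incoming changes with your local work first
make test                       # fix anything that breaks locally, before it spreads
git push origin master          # push only once your local branch is stable again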

Continuous integration can help here, but not if you only do it on the production branch: you need to do it all over the place. Serializing all change through one integration environment turns it into a bottleneck: your version system may be decentralized, but if your integration process is not, you are still going to be in trouble. A centralized build system works OK with a centralized version system, because a centralized version system serializes the changes anyway (which is a problem, not something to keep bear shaving). The whole point of decentralizing version management is decentralizing change. You need to decentralize the integration process as well.

In a nutshell, this is how the Linux kernel handles thousands of kloc of changes per day with hundreds of developers. And yes, it is no coincidence that those guys came up with git. The Linux kernel deals with design erosion through continuous redevelopment. The change is not just additive; people are literally making changes all over the Linux source tree, all the time. There is no way in hell they could deal with this in a centralized version management environment. As far as I know, the Linux kernel has no automated continuous integration. But they do have thousands of developers running all sorts of developer builds and reporting bugs against them, which is really the next best thing. Nothing gets into the mainline kernel without this taking place.

Managing WordPress deployment

This little article is a summary of how I currently manage my WordPress blog. The wordpress.org site lists some advice on how to manage a WordPress installation using subversion. However, I have a slightly more sophisticated setup that preserves the modifications (as long as they don't conflict) that I maintain in a private branch of WordPress.

I use rsync to push and pull changes to and from the remote server (over ssh; rsync's own daemon protocol should work as well). Since a good howto seems to be lacking online and since I spent a while figuring out how to do all this, I decided to share my little setup.
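
As a rough sketch (the hostname, user and paths are made up), pushing the local copy of the blog to the server and pulling it back down look something like this:

rsync -avz -e ssh ~/blog/ user@example.com:public_html/blog/    # push local changes to the server
rsync -avz -e ssh user@example.com:public_html/blog/ ~/blog/    # pull the server's state back down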

KDiff3 to the rescue

I was struggling this evening with the default merge tool that ships with TortoiseSVN. It's not bad and quite user friendly. However, I ran into trouble with it when trying to review changes in a LaTeX file (don't ask, I still hate the concept of debugging and compiling stuff I would normally type in Word). The problem was that it doesn't support word wrapping, and the LaTeX file in question used one line per paragraph (which works great in combination with an editor that does soft word wrapping, like jEdit).

A little googling revealed that the problem had been discussed on the TortoiseSVN mailing list and dismissed by one of the developers (for reasons of complexity). Figuring that surely somebody must have scratched this itch, I looked on and struck gold in the form of this blog post about KDiff3: KDiff3 – a new favorite.

The name suggests that this is a Linux tool. Luckily there is a Windows port as well, so no problem there. I installed it and noticed that by default it replaces the diff editor in TortoiseSVN (good in this case, but I would have liked the opportunity to say no here). Anyway, problem solved :-). A new favorite indeed.

Update:

Nice little KDiff3 moment. I did an update from svn and it reported that a python file was in a conflicted state. So I dutifully right clicked and selected edit conflicts. This launched KDiff3, which reported: 4 conflicts found; 4 conflicts automatically resolved. It then opens a four pane view (mine, base, theirs and merged) that lets you easily see what the merged result looks like and what the conflicts were. OMFG! Where were you all this time, KDiff3?! Damn, that is useful. The resolutions look good too. I remember using TortoiseSVN to do merges on a very large source base in my previous job, and this is exactly what made them suck so much.

Using rsync for backup

As you may recall, I had a nice incident recently which made me really appreciate the fact that I was able to restore my data from a backup. Over the years I've sort of cobbled together my own backup solution using rsync (I use the cygwin port to Windows).

First a little about hardware. Forget about using CDs or DVDs. They are just too unreliable. I'm currently recovering data from a whole bunch of CDs I had, and I am horrified to discover that approximately one third have CRC errors on them. Basically, the light sensitive layer has deteriorated to the point that the disc becomes unreadable, sometimes within as little as two years. I've used various brands of CDs over the years and some of them have higher failure rates than others, but no brand seems to be 100% OK. In other words, I've lost data stored on pretty much every CD brand I've ever tried. Particularly Fujifilm (1-48x) and unbranded CDs are bad (well over 50% failure rate); on the other hand, most of my Imation CDs seem fine so far. Luckily I didn't lose anything valuable or irreplaceable. But it has made it clear to me not to trust this medium for backups.

So, I've started putting money into external hard drives. External drives have several advantages: they are cheap, they are big and they are much more convenient. So far I have two USB external hard drives: a 300 GB Maxtor drive and the 500 GB LaCie Porsche drive I bought a few weeks back. I also have a 300 GB drive in my PC. Yes, that's 1.1 TB altogether :-).

The goal of my backup procedure is to be 'reasonably' safe. Technically, if my apartment burns down, I'll probably lose all three drives and all data on them. Moving them offsite is the obvious solution, but that also makes backups a bit harder. Reasonably safe in my view means that my backed up data survives total media failure on one of the drives and gives me an opportunity to get back to a reasonably safe state again. When I say my data, I'm referring to the data that really matters to me: anything I create, movies, music, photos, bookmarks, etc.

This data is stored in specific directories on my C drive and in a directory on my big LaCie drive. I use the Maxtor drive to back up that directory and use the remaining 200 GB on the LaCie drive for backing up stuff from my C drive.

All this is done using commands like this:

rsync -i -v -a --delete ~/photos/ /cygdrive/e/backup/photos >> /cygdrive/e/backup/photos-rsync.txt

This probably looks a bit alien to a Windows user. I use cygwin, a port of much of the GNU/Linux toolchain that layers a more Linux-like filesystem on top of the Windows filesystem. So /cygdrive/c is just the equivalent of good old c:\. One of the ported tools is ln, which I've used to make symbolic links in my cygwin home directory to the stuff I want to back up. So ~/photos actually points to the familiar My Pictures directory.
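
For example, the ~/photos link can be created like this (the Windows path and user name are just an illustration; yours will differ):

ln -s "/cygdrive/c/Documents and Settings/yourname/My Documents/My Pictures" ~/photos    # make ~/photos point to My Pictures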

Basically, the rsync command above synchronizes the first directory to the second directory. The flags ensure that the content of the second directory is identical to that of the first directory after execution. The --delete flag allows rsync to remove stuff from the second directory that isn't in the first directory. Rsync is nice because it works incrementally, i.e. it doesn't copy data that's already there.

The bit after the >> just redirects the output of rsync to a text file so that you can verify afterwards what has actually been backed up. I use the -v flag to let rsync tell me exactly what it is doing.

Of course typing this command by hand is both error prone and tedious. For that reason I've collected all my backup related commands in a nice script which I execute frequently. I just turn on the drives, type ./backup.sh and go get some coffee. I also use rsync to back up my remote website, which is easy because rsync works over ssh as well.
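
For illustration, a minimal backup.sh along these lines might look like the sketch below (the directories and the remote host are made up; the commands mirror the rsync invocation shown above):

#!/bin/sh
# back up photos and documents to the external drive, logging what was copied
rsync -i -v -a --delete ~/photos/ /cygdrive/e/backup/photos >> /cygdrive/e/backup/photos-rsync.txt
rsync -i -v -a --delete ~/documents/ /cygdrive/e/backup/documents >> /cygdrive/e/backup/documents-rsync.txt
# pull a copy of the remote website over ssh
rsync -i -v -a --delete -e ssh user@example.com:public_html/ /cygdrive/e/backup/website >> /cygdrive/e/backup/website-rsync.txt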

Part of my backup script is also creating a dump of my subversion repository. I store a lot of stuff in a subversion repository these days: my Google Earth placemarks, photos, documents and also some source code. The subversion working directories are spread across my hard drive, but the repository itself sits in a single directory on my C drive. Technically I could just back that up using rsync. However, using

svnadmin dump c:/svnrepo | gzip > /cygdrive/e/backup/svnrepo.gz

to dump the repository allows me to recreate the repository from the dump in any version of subversion. Also, the dump file tends to be nicely compressed compared to either the working directory or the repository directory. Actually, the working directory is the largest because it keeps pristine copies of each file alongside the files themselves. In the repository everything is stored incrementally, and in the dump gzip squeezes it even further. The nice thing about a version repository is of course that you also preserve the version history.
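
Restoring from such a dump is equally straightforward; a rough sketch (the target path is made up):

svnadmin create c:/svnrepo-restored                 # create a fresh, empty repository
gunzip -c /cygdrive/e/backup/svnrepo.gz | svnadmin load c:/svnrepo-restored    # replay the dump into it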

Subversion 1.4 Windows binaries

For a few weeks I've been waiting for cygwin to update their subversion binaries, but for some reason they are not in a hurry. Subversion 1.4 was recently released, and this time it includes changes to both the repository format and the working directory format. If you use 1.4 binaries on your working directory, it will automatically be upgraded. Nice, except cygwin still ships 1.3 binaries, which no longer recognize the working directory after you've used TortoiseSVN 1.4 on it. Similarly, the repository is upgraded if you use a 1.4 version on it. So upgrading is not recommended unless you can upgrade all your subversion related tools to 1.4.

Naturally I found this out after upgrading tortoise svn to 1.4 :-).

Luckily, you don’t need cygwin binaries. So here’s what you can do instead:

  • download the win32 command line subversion tools from tigris and install them.
  • modify your path to add the bin directory (see the example below).
  • uninstall the obsolete cygwin subversion version.
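
For the path step, from a cygwin shell that could look something like this (the install location is an assumption):

export PATH="$PATH:/cygdrive/c/Program Files/Subversion/bin"    # make the win32 svn binaries available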

Of course the win32 version doesn't handle cygwin paths too well. Luckily, subversion handles moving repositories pretty well. In my case my repositories were in /svnrepo, which in reality is the Windows path c:\cygwin\svnrepo. Since I use the svn+ssh protocol, the URLs for all my working directories were svn+ssh://localhost/svnrepo/… These URLs broke because the win32 binaries interpret the path /svnrepo differently than the cygwin version does. Solution: mv /svnrepo /cygdrive/c.
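
A quick sanity check after the move (the working directory name is made up):

svn info ~/some-workdir      # the URL should still read svn+ssh://localhost/svnrepo/...
svn update ~/some-workdir    # confirms the relocated repository is reachable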

This allows me to keep using the same subversion URLs, and all my tools now work. Also, in the future I won't have to wait for cygwin to upgrade their subversion binaries; I can get them straight from tigris.

SPLC workshop papers

Next week I am attending the Software Product Line Conference 2006 in Baltimore. I'm presenting a paper on variability mechanisms in service grids (I posted about this a few months ago) and two workshop papers.

I submitted a paper to the workshop Managing Variability for Software Product Lines: Working With Variability Mechanisms. The paper, which I wrote with Christian Prehofer, discusses some ideas on using version management systems during product derivation:

Version management tools as a basis for integrating Product Derivation and Software Product Families.

This paper considers tool support for variability management, with a particular focus on product derivation, as common in industrial software development. We show that tools for software product lines and product derivation have quite different approaches and data models. We argue that combining both can be very valuable for integration and consistency of data. In our approach, we illustrate how product derivation and variability management can be based on existing version management tools.

An edited version of this paper will be published after the workshop comments have been incorporated.

The second paper is a position paper about applying open source development practices in the context of software product line development:

OSS Product Family Engineering.

Open source projects have a characteristic set of development practices that is, in many cases, very different from the way many Software Product Families are developed. Yet the problems these practices are tailored for are very similar. This paper examines what these practices are and how they might be integrated into Software Product Family development.

This paper was accepted for the First International Workshop on Open Source Software and Product Lines.

I will update my publications page after the conference, once it is clear how and where the papers will be published.

xdocdiff

If you use TortoiseSVN (a popular subversion frontend that integrates into Explorer), you might be interested in xdocdiff. This tool can be plugged in as a diff viewer for several binary file formats, including .doc, .pdf, .ppt and .xls. It works by converting both revisions of a file to plain text and then using the regular diff viewer to show what has changed.
I've been using subversion for document management for some time now. It is really easy to set up (there are several ways, actually) and really useful when working on a rapidly evolving document. Now, with the ability to examine changes between revisions of these documents as well, it is even more useful!
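
For the curious, a minimal local setup is little more than this (the repository location and directory names are made up):

svnadmin create ~/docrepo                                        # create an empty repository
svn import ~/documents file://$HOME/docrepo/documents -m "initial import"
svn checkout file://$HOME/docrepo/documents ~/documents-wc       # work from the checked out copy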

why cvs sucks

Having worked with subversion quite extensively, I know what to expect from a proper versioning system. However, I'm currently working on a project with a cvs server. Here are a few of my observations:

  • I can't seem to get TortoiseCVS and Eclipse to work together on the same checked out repository. Reason: TortoiseCVS does not support Eclipse's custom extssh stuff. Duh, ssh is only the most common way to access a cvs repository these days.
  • I’m unable to put things like directories under version management, only the contents.
  • I don't seem to have an easy way of getting the version history for a directory of files with the commit messages ordered by date. I'm sure there are scripts to do this, but it is definitely not supported natively (and likely to bog down the cvs server).
  • Each file has its own version number
  • Commit messages do not specify what files were committed (unless you paste this information in the message of course)!
  • I managed on several occasions to do a partial commit. That is, some files were committed and some weren't.
  • Each time I refactor, I lose version history because cvs considers the result of a copy, move or rename to be a new file.
  • My attic is full of files that now have a different name
  • Branching and tagging create full serverside copies of whatever is being branched or tagged

Subversion doesn’t have any of these problems. CVS seems to actively discourage refactoring, is dangerously unreliable for large commits and does an extremely poor job of keeping version history. The fact that I encounter all of these problems on a toy project says a lot about CVS.
This is why CVS should not be used for serious projects:

  • Refactoring is now a key part of development. Especially (good) Java developers refactor all the time.
  • The versioning system should never, ever end up in an inconsistent state due to partial commits. There is no excuse for this. A version management system without built-in protection against this should be considered seriously broken.
  • Being able to keep track of version history is crucial to monitoring project progress. I've been a release manager, and only now am I starting to realize how much my job would have sucked if we'd been CVS users. I feel sorry for all the open source people who still continue to use CVS on a voluntary basis.

-Ofun

I found this article rather insightful: -Ofun

I agree with most of it. Many software projects (commercial, OSS, big & small) have strict guidelines with respect to write access to source repositories and the use of those rights. As the author observes, many of these restrictions find their roots in the limited ability of legacy revision control systems to roll back undesirable changes and to merge sets of coherent changes, not in any inherent process advantages (like enforcing reviews or preventing malicious commits). Consequently, this practice restricts programmers in their creativity.

Inviting creative developers to commit on a source repository is a very desirable thing. It should be made as easy as possible for them to do their thing.

On more than one occasion I have spent some time looking at source code from some OSS project (to figure out what was going wrong in my own code). Very often my hands start to itch to make some trivial changes (refactor a bit, optimize a bit, add some functionality I need). In all of these cases I ended up not making the changes, because committing them would have required a lengthy process involving:
- get on the mailing list
- figure out who to discuss the change with
- discuss the change and get permission to send it to that person
- wait for the person to accept or reject the change

This can be a lengthy process, and up front you already feel guilty about contacting this person about a trivial change, given your limited knowledge of the system. In short, the size of the project and its membership scares off all interested developers except the ones determined to get their change in.

What I'd like to do is this:
- Check out tomcat (I work with tomcat a lot, fill in your favorite OSS project).
- Make some change I think is worthwhile having, without worrying about consequences, opinions of others, etc.
- Commit it with a clear message explaining why I changed it.
- Leave it to the people who run the project to laugh away my ignorance or accept the change as they see fit.

If the Apache people don't want the change, fine. Undo it, don't merge, whatever. But don't restrict people's right to suggest changes and improvements in any way. If you end up rejecting 50% of the commits, that still means you got 50% useful stuff. The reviewing and merging workload can be distributed among people.

In my current job (at GX, the company that I am about to leave), I am the release manager. I am the guy in charge of the source repositories for the entire GX product line. I'd like to work as outlined above, but we don't. Non product developers in the company need to contact me by mail if they want to get their changes in. Some of them do, most of them don't. I'm convinced that I'd get a lot of useful changes. We use subversion, which is nice but not very suitable for the way of working outlined above and in the article I quoted. Apache also uses subversion, so I can understand why they don't want to give people like me commit rights just like that.

So why is this post labelled software engineering science? Well, I happen to believe that practice is ahead of the academic community (of which I am also a part) in some things. Practitioners have a fine nose for tools and techniques that work really well. Academic software engineering researchers don't, for a variety of reasons:
– they don’t engineer that much software
– very few of them develop at all (I do, I’m an exception)
– they are not very familiar with the tools developers use

In the past two years in practice I have learned a number of things:
- Version control is key to managing large software projects. Everything in a project revolves around putting stuff into and getting stuff out of the repository. If you didn't commit it, it doesn't exist. Committing it puts it on the radar of the people who need to know about it.
- Using branches and tags is a sign the development process is getting more mature. It means you are separating development from maintenance activities.
- Doing branches and tags at the planned time and date is an even better sign: things are going according to some plan (i.e. this almost looks like engineering).
- Software design is something non software engineers (including managers and software engineering researchers) talk about, a lot. Software engineers are usually too busy to bother.
- Consequently, little software actually gets designed in the traditional sense of the word (creating important looking sheets of paper with lots of models on them).
- Instead, two or three developers get together for an afternoon and lock themselves up with a whiteboard and a requirements document to take the handful of important decisions that need to be taken.
- Sometimes these decisions get documented. This is called the architecture document.
- Sometimes a customer/manager (same thing really) asks for pretty pictures. Only in those cases is a design document created.
- Very little new software gets built from scratch.
- The version repository is the annotated history of the software you are trying to evolve. If important information about design decisions is not part of that annotated history, it is lost forever.
- Very few software engineers bother to annotate their commits properly.
- Despite the benefits, version control systems are still very primitive. I expect much of the progress in development practice in the next few years to come from major improvements in version control systems and the way they integrate with other tools such as bug tracking systems and document management systems.

Some additional observations on OSS projects:
- Open source projects have three important tools: the mailing list, the bug tracking system and the version control system (and, to a lesser extent, wikis). These tools are primitive compared to what is used in the commercial software industry.
- Few OSS projects have explicit requirements and design phases.
- In fact, all of the processes used in OSS projects revolve around the use of the aforementioned tools.
- Indeed, few OSS projects have designs.
- Instead, OSS projects evolve and build a reputation after an initial commit of some prototype by a small group of people.
- Most of the life cycle of an OSS project consists of evolving it more or less ad hoc. Even if there is a roadmap, it usually serves only as a common frame of reference for developers rather than as a specification of things to implement.

I'm impressed by how well some OSS projects (Mozilla, KDE, Linux) are run and think that the key to improving commercial projects is to adopt some of the better practices from these projects.

A lot of commercial software actually evolves in a very similar fashion, despite manager types keeping up appearances by stimulating the creation of lengthy design and requirements documents, usually after the development has finished.

more on subclipse

I had some more fun with subclipse today. The integration with Eclipse is much better than I anticipated. Eclipse already has extensive cvs functionality; subclipse acts as a backend for this functionality, which means you get a lot of nice features. I was also pleasantly surprised by how well subclipse performs. Overall I am not so happy with Eclipse in this respect (rebuilds suck and they happen a lot), but subclipse seems to do well (compared to TortoiseSVN and the command line svn client). Of course the bottleneck is network and disk IO, and doing this in Java doesn't seem to have much performance impact. Things like getting svn logs for directories actually seem to work faster. Also, I absolutely love the diff tools in Eclipse.