Some reflections on a future of software engineering

2010-09-09

Update: I accidentally published an early draft of this long (for a blog post) article by clicking the wrong button in WordPress. If you read that early draft: the article is now in its finished form; I've edited the earlier text and added loads of links.

Over the past year I have been a bit less active blogging and posting in forums. I make an exception for things related to build processes, distributed version management, and Java; you may find a couple of fine rants on these topics on my website and across various blogs.

This post is intended to pull all of that together into a somewhat grander vision and to reflect a bit on what a post-revolution software engineering world might look like. I don't like or believe much in revolutions, but I do recognize that getting from where we are to where I think we need to be amounts to one. The more likely outcome is an evolutionary development in which the bits and pieces where I was not completely and utterly wrong might actually happen. At some point. Maybe.

The Problem: Files

Software engineering practice is stuck in abstractions and concepts dating back decades. Because it is stuck, it is not moving forward as fast as it could towards dealing with scale: in the number of people, the amount of software, and the amount of evolutionary change per unit of time. This annoys me and has done so for years. It is the main motivation for writing this article.

The core abstraction I strongly believe is very wrong is the file. A file is a binary or text blob with a name that lives on a file system. Its content has some type that is generally associated with a suffix in the name (e.g. *.java).

Software engineers mistakenly believe their output consists of software artifacts that are stored as files. It doesn't. Their output consists of software that happens to be stored in files. If you were writing software in the fifties, you'd be serializing it from your head onto paper and manually flipping switches to de-serialize. This was tedious, but it worked. People like Ada Lovelace were writing software long before there were machines capable of even executing it (other than the human brain, which is not very good at running our own software). The sixties made I/O quite a bit easier with consoles, punch cards and tapes. People would transfer programs to and from RAM buffers by streaming them from tape or by feeding piles of cards to a computer. Somewhere in this era the notion of transforming these software data blobs using other software blobs also became popular: I'm talking about editors, compilers and interpreters. Finally, somewhere in the seventies, floppies and hard disks came into use and everybody started putting their data blobs in files. Sadly, this is where evolution stopped.

At least when it comes to storing computer programs. For other types of data we have gotten a lot smarter. We have graph databases, object databases, tuple stores, relational databases, document stores, etc. None of these are commonly used to store computer programs.

This is weird and in my view it is holding software engineering practice back. A big reason is that we have built a huge sand castle of technology around the notion of files as the primary means of storing computer programs. For example we have version control systems that store different versions of files. We have built tools that consume and produce files. We have source code analyzers that expect to be working on files. We have IDEs that offer us convenient ways to edit files. Once we stop putting software in files, that technology needs to start changing as well.

Smalltalk and Eclipse

Actually, it is not strictly true that evolution stopped with the file. Smalltalk and its descendants have a very interesting approach where the entire software system is stored in one big image file. Effectively, Smalltalk is using a database concept. This allowed for some interesting innovations, like meta-programming and software transformations (a.k.a. refactoring), which require a readily available, in-memory representation of the software that is rich and graph-like (i.e. an abstract syntax tree) and can be introspected. Smalltalk developers realized that it didn't make much sense to fragment bits and pieces of the software over multiple files; you might as well just use a single one.

This inspired IBM to create a nice IDE for Java at some point: VisualAge. VisualAge was very much a reborn Smalltalk-style IDE and, crucially, stored its Java code not in files but in a database that was effectively a serialization of the abstract syntax tree of the entire software system. It needed that tree in order to power important features like refactoring, code browsers, etc. Later it became Eclipse, which dropped the notion of storing the serialized AST, opting instead for a combination of on-the-fly construction of an in-memory AST, on-the-fly compilation and analysis, and reading from and writing to files the good old-fashioned way.
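
That in-memory AST is not hidden, by the way: Eclipse exposes it to tooling through its JDT APIs. Here is a minimal sketch of parsing a snippet of Java into an AST and walking it; it is just an illustration of the tree being the thing tools work on, not a complete program you'd ship:

    import org.eclipse.jdt.core.dom.AST;
    import org.eclipse.jdt.core.dom.ASTParser;
    import org.eclipse.jdt.core.dom.ASTVisitor;
    import org.eclipse.jdt.core.dom.CompilationUnit;
    import org.eclipse.jdt.core.dom.MethodDeclaration;

    public class AstPeek {
        public static void main(String[] args) {
            // Parse a snippet of Java source into an in-memory AST.
            ASTParser parser = ASTParser.newParser(AST.JLS3);
            parser.setKind(ASTParser.K_COMPILATION_UNIT);
            parser.setSource("class Greeter { void greet() {} }".toCharArray());
            CompilationUnit unit = (CompilationUnit) parser.createAST(null);

            // Walk the tree: this structure, not the character stream in a
            // file, is what refactorings and code browsers operate on.
            unit.accept(new ASTVisitor() {
                public boolean visit(MethodDeclaration node) {
                    System.out.println("found method: " + node.getName());
                    return true;
                }
            });
        }
    }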

Failed experiment: intentional programming?

Some late-nineties papers by Charles Simonyi on Intentional Programming, and persistent rumors about him actually being very close to launching related products, were about taking the whole Smalltalk/VisualAge thing to the next level. It may be too early to call this a failed experiment, because Simonyi never really delivered the goods: his company (Intentional Software) is still hyping intentional programming but has yet to ship a product. Seriously, this has been in the making longer than Duke Nukem Forever.

In a nutshell, Simonyi's very brilliant idea is that creating software is about coming up with abstractions that are represented in the form of abstract syntax trees, which can be translated into other, more general abstractions in multiple iterations until you end up with a syntax tree that can simply be serialized to executable code. His core idea was to treat the transformations, not the abstractions, as the first-class entities. In an intentional programming world you start with really simple abstractions that you can translate into executable code, and you build increasingly complex and specialized abstractions on top of them for specialist or domain-specific things.

The traditional notion of compiling is very similar, except it is limited in the number of transformations and the abstractness of the abstractions involved. Most languages go through roughly two or three transformations: source code to abstract syntax tree to assembly to executable bits and bytes. There are lots of variations here, but it is essentially a pipeline. Intentional Software's product works differently: the user manipulates the AST of whatever abstractions he's working with in some rich editing environment that presents appropriate views on the AST and tools to edit it. Applicable transformations and optimizations kick in on the fly to keep the executable code up to date. It probably does a bit of serialization here and there, but probably not in a format that is very notepad-friendly.
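
To make "transformations as first-class entities" a little more concrete, here is my own toy sketch of the idea in plain Java. All the names are invented for illustration; this is emphatically not how Intentional Software's product is implemented:

    // A transformation lowers one AST into another, more general one.
    interface Ast {}

    interface Transformation {
        Ast apply(Ast input);
    }

    // A pipeline is itself just a transformation, so stages compose freely:
    // domain-specific AST -> general-purpose AST -> ... -> executable form.
    class Pipeline implements Transformation {
        private final Transformation first;
        private final Transformation second;

        Pipeline(Transformation first, Transformation second) {
            this.first = first;
            this.second = second;
        }

        public Ast apply(Ast input) {
            return second.apply(first.apply(input));
        }
    }

    // Usage (with hypothetical stages):
    //   new Pipeline(workflowDslToJava, javaToBytecode).apply(tree);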

Another failed experiment: Model Driven Architecture (MDA)

Readers of the previous section may be tempted to throw this in the same bucket as Model Driven Architecture (MDA). MDA emerged out of the lovely world of UML. In MDA one defines systems using a standardized meta-language, generally inside a dedicated tool (e.g. Rational Rose). The idea is on the surface very similar to intentional programming, except it is more artifact-centric. The weak spot of MDA has always been that the transformation from models to software is a somewhat hacky process that is generally locked up in some proprietary tool. Basically there is the meta-language, the UML language (of course defined in the meta-language), and UML models. Where the whole thing becomes messy is the transformation to actual software. This bit is more or less completely proprietary, and your mileage may vary, though apparently some tools are quite good at it. Early MDA environments were monstrous tools with loads of J2EE, application server madness, and a rich sauce of UML and XML spread on top. It's a lot of not-so-great things stuck together, and people have wasted enormous amounts of cash on making sense of the whole thing, which continues to feed hordes of very expensive consultants. MDA never really dragged itself outside the domain of finance and banking applications.

There is currently a lot of hype around language workbenches and similar technologies, partly fueled by Martin Fowler, who is about to publish a book on this and other topics related to Domain-Specific Languages. He also has several articles on the subject worth reading on his website.

The point about ASTs

The whole point of bringing up intentional programming and MDA is that software is made out of abstractions. The serialization of abstractions to and from source code in, say, C or Java is lossy in the sense that some of these abstractions are no longer explicit once the serialization is complete. For example, you might have a nice program that implements the visitor pattern, an abstraction many OO programmers would understand. This might be obvious to a reader of the source code, and it might be documented in some comment, but it is no longer explicit. When the developer had the design in his head, on a whiteboard, or possibly even in a UML diagram, it was explicit. Intentional programming and MDA are about giving developers the option to keep those abstractions around and to define their systems in terms of richer abstractions than are found in typical programming languages.
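
For concreteness, here is roughly what that visitor example looks like once it has been serialized to Java (the names are of course just illustrative). The "pattern" survives only as a naming convention and a calling discipline; nothing in the source marks these interfaces as an instance of the visitor abstraction the designer had in mind:

    interface Shape {
        void accept(ShapeVisitor visitor);
    }

    interface ShapeVisitor {
        void visitCircle(Circle circle);
        void visitSquare(Square square);
    }

    class Circle implements Shape {
        public void accept(ShapeVisitor visitor) { visitor.visitCircle(this); }
    }

    class Square implements Shape {
        public void accept(ShapeVisitor visitor) { visitor.visitSquare(this); }
    }

    // One concrete visitor; the double dispatch that makes this "the visitor
    // pattern" is implicit in the accept/visit call structure.
    class ShapePrinter implements ShapeVisitor {
        public void visitCircle(Circle circle) { System.out.println("circle"); }
        public void visitSquare(Square square) { System.out.println("square"); }
    }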

The point about files

A useful representation of these abstractions is the abstract syntax tree, an internal data structure used by compilers, IDEs, source code analyzers, etc. There are of course many interesting ways to serialize tree-like data structures such as ASTs. For example, XML is pretty popular (though not for its readability or editability). And of course source code is nothing more than the serialization of an AST that conforms to a particular grammar. The point about files, however, is that they are used as containers of abstractions. Generally you need multiple such containers to get a working system, and they might even be of different types (your average J2EE architecture comes with dozens of them). Finally, the notion of a file as the boundary of a container of abstractions is rather arbitrary. In some languages you put a single class in a file. In others, files are called modules and can contain multiple classes, functions, procedures, etc. Additionally, files live in directories, which are containers of files, and in many languages directories have semantics as well. For example, Java uses nested directories that must match the package structure (which is also nested).
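
A toy sketch of what "source code as serialization" means in practice: the tree is the primary artifact, and the familiar text is just one way to render it (all names below are made up for illustration):

    import java.util.ArrayList;
    import java.util.List;

    // The tree is the real representation of the program ...
    abstract class Node {
        abstract String toSource();
    }

    class MethodNode extends Node {
        final String name;
        MethodNode(String name) { this.name = name; }
        // ... and Java-like source text is merely one serialization of it.
        String toSource() { return "  void " + name + "() {}\n"; }
    }

    class ClassNode extends Node {
        final String name;
        final List<MethodNode> methods = new ArrayList<MethodNode>();
        ClassNode(String name) { this.name = name; }

        String toSource() {
            StringBuilder sb = new StringBuilder("class " + name + " {\n");
            for (MethodNode m : methods) {
                sb.append(m.toSource());
            }
            return sb.append("}\n").toString();
        }
    }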

I’d argue that files and directories are poorly suited for modularizing your system. As a boundary of abstraction between modules, they are both too arbitrary and too coarse-grained. And when you consider versioning, you are stuck with files and directories even though many version control systems don’t even have a first-class representation for directories (Git) or have trouble keeping track of name changes to files (CVS).

Development is a collaborative process

Development is essentially a collaborative process. Except for very small systems, most software needs to be developed together with other people, who in some cases may be spread all over the world. It is no surprise that there is a huge amount of technology available to facilitate collaborative development. I already touched on version control systems; in addition there are bug management systems, collaborative editors, source code review systems, and tools to examine differences between files. If development is a collaborative process, why can’t we version, review, triage bugs, exchange, and communicate via the internet? Why do I have to bother serializing and de-serializing ASTs (a.k.a. saving and opening files) in order to do stuff?

To the point …

This brings me to the point, finally. Files don’t make sense in many ways, as I’ve argued so far. So let’s not use files then. Let’s look at a possible alternative instead:

(Disclaimer: very likely some people have implemented bits and pieces of what I’m about to outline. I’m aware of this but still want to provide a bigger picture here.)

  • Cloud storage. Software is a data structure, not a file. Data structures should of course live somewhere. These days the appropriate place for data to live would be the cloud. By cloud I mean that programmers should be accessing their ASTs via tools that are connected to the network and that automatically sync back and forth. The tool could be browser based or it could be some native UI. It might even resemble a code editor.
  • Versioning and other content management features. Versioning is not a feature that is limited to version control systems. For example, many content management systems do a decent job of versioning the objects they manage, which in the case of systems like Drupal are actually trees of objects. Versioning is just one of the many useful features you can find in content management systems. Others include auditing, security features, syndication of updates, workflows, etc. There is no reason why a cloud based software store should not have these features. In fact, they might prove to be highly useful. So, naturally our cloud based AST store should come with all these features.
  • Collaborative editing. Collaborative text editors have been around forever but somehow never really caught on in the software world. Browser-based office tools like Google Docs and the recently discontinued Google Wave also have some really interesting collaborative editing features. I still think Wave could have been a really nice Eclipse replacement. Naturally we want collaborative editing tools for our cloud based ASTs.
  • Offline & synchronization. Being connected to the cloud is of course nice, but sometimes people need to be offline (e.g. on a plane) or there are disruptions in the network. In a cloud based world that could end up being highly disruptive for developers. However, there exist several good technical solutions. One of the trendy NoSQL solutions out there, CouchDB, is all about storing stuff in a decentralized way and syncing over HTTP. Ubuntu One essentially turns its user base into one huge distributed data store, where each user has a private CouchDB instance that (optionally) syncs with a central CouchDB store provided by Canonical. Yesterday a company called Couchio announced a CouchDB-based product, CouchOne, that is specifically designed to address data synchronization on mobile phones. The point being that even though software should live in the cloud, it should be possible to go offline and sync back to the cloud later (a sketch of what triggering such a sync could look like follows after this list).
  • Integrated artifact generation and workflow management. Much of the development process follows pretty strict workflows that are commonly automated using continuous integration servers and build tools such as Ant, Rake, or Maven. Most of these build tools are just trying to hide excessive amounts of pushing files around, which is exactly what we are trying to get rid of here. But the notion of automatically generating artifacts at various points in well-defined workflows is of course worth keeping. Artifact generation could be plugin-based or even similar to intentional programming style transformations.
  • Distribution. When I say the cloud, I don’t mean that there should be some big monster repository of software. I’m thinking more of a REST-style architecture of resources on the internet that are linked to each other. I also like the notion of Git/Mercurial-style distributed repositories. The cloud should very much be a distributed thing, as opposed to some monolithic, monstrous content management system.
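
As a small illustration of the offline & synchronization point above: CouchDB replication can be kicked off with a plain HTTP POST to its _replicate endpoint. A minimal sketch in Java, with the host and database names made up purely for illustration:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ReplicateSketch {
        public static void main(String[] args) throws Exception {
            // Ask the local CouchDB instance to push a database to a remote
            // one. Host and database names here are purely illustrative.
            URL url = new URL("http://localhost:5984/_replicate");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);

            String body = "{\"source\": \"ast_store\", "
                        + "\"target\": \"http://example.com:5984/ast_store\"}";
            OutputStream out = conn.getOutputStream();
            out.write(body.getBytes("UTF-8"));
            out.close();

            System.out.println("replication response: " + conn.getResponseCode());
        }
    }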

What does all this mean in practical terms?

Having everything in the cloud as outlined above allows for the elimination of a lot of cruft and overhead in current software engineering practice. A few examples of activities that would work in a very different way:

  • Checking out code and committing changes. This is no longer needed. Since you are working with a cloud based system, all your edits go straight to the cloud. You might still want to annotate some milestones with a “commit message” of course.
  • Branching and tagging. Everybody working on the same software sounds nice of course, but sometimes you want to isolate changes on a branch or take a snapshot of a particular revision of the software. There’s no good reason why a cloud based system could not do this. Additionally, this removes the need for ‘dirty’ local working copies: instead you could just save to a private branch.
  • Changes. Like in Git, the cloud stores chains of incremental changes, only to the AST rather than to files (a sketch of what such a change record could look like follows after this list).
  • Merging. If you have branches, there will be a need to merge changes back and forth. Since we are now all cloud based, this no longer requires pushing around vast amounts of data. Instead, since all branches are just chains of changes, the process could be very similar to what Git does when merging and rebasing.
  • Compilation. Compiling code is the process of generating artifacts from ASTs. Since the ASTs live in the cloud, and compiling is just a special case of transforming one AST into another (see intentional programming), this should pretty much happen on the fly: the right artifacts are simply generated when they are needed. This could be an intentional programming style extensible system, which would be really nice of course, but a plain old compiler might do the trick for many as well.
  • Running tests. Running tests is an essential part of the workflow in any decent software engineering practice. Since we now have a cloud based system with explicit support for common CMS features like workflows, it is probably a good idea for the cloud to automatically run tests at the appropriate moments in a workflow (e.g. when merging from one branch to another). Some developers already use elaborate continuous integration setups involving build servers, SCM triggers, cron jobs, etc. This just takes it to the next level.
  • Dependencies and relations between different systems. Software does not exist in isolation; it depends on other software. In the conventional world, dependencies need to be managed and artifacts need to be downloaded locally. But a dependency is nothing more than a special case of a relation between two nodes in an AST, and the cloud is full of relations: you might know them as URLs. So, in the cloud, a dependency is just a link. Such relations may be dependencies, but there can be many other useful kinds as well, for example “this class node here contains tests for that class there”. Links are a much more powerful construct than textual references in some file.
  • Sharing software. This one is easy: you send somebody a link. That’s it, we all connect to the same cloud. The rest is just credentials, access rights, etc.
  • Releasing artifacts. Since the generation of artifacts (a.k.a. compiling) happens on the fly, a release is nothing but the act of publishing, which in turn is a step in a workflow. You will probably also want to release related artifacts such as change logs, documentation, and upgrade guides. These too live in the cloud.
  • Filing bugs. Bugs are just a special case of an artifact that can be represented as an AST. In other words, they can be part of the cloud and be accessed and managed in the same way. There is no need for integration with legacy bug tracking systems. Having both code and issue tracking in the same cloud of course allows for some nice workflow enhancements. Think Mylyn-style workflows.
  • Writing documentation. Documentation is just another example of an artifact. That too can live in the cloud and leverage the ability to have links to and from the documentation and the documented artifact(s).
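
As promised above, here is a toy sketch of what a single entry in a chain of AST changes could look like as data; the field names are invented purely for illustration:

    // A change is addressed to a node in the tree rather than to a line in a
    // file. A branch is just a chain of such records, which is what would make
    // Git-style merging and rebasing conceivable at the AST level.
    class AstChange {
        final String parentChangeId;  // previous change in the chain
        final String targetNodeId;    // which AST node this change touches
        final String operation;       // e.g. "rename", "add-child", "delete"
        final String payload;         // operation-specific data

        AstChange(String parentChangeId, String targetNodeId,
                  String operation, String payload) {
            this.parentChangeId = parentChangeId;
            this.targetNodeId = targetNodeId;
            this.operation = operation;
            this.payload = payload;
        }
    }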

Conclusion

This concludes what may seem like a pretty random collection of thoughts but is in fact something that has been brewing in my mind for about ten years in various forms.

In summary, what I’m trying to propose here is a cloud based system that offers many of the fancy features you would find in content management systems and collaborative editing tools, and that can be used both online and offline. For me this is not science fiction, since all the bits and pieces already exist. It’s just that in the software tools used today we are stuck with nineteen-seventies technology and elaborate workarounds to compensate for some of its bigger problems.

I believe that if properly implemented, this cloud based software development environment would be vastly superior to the loosely connected collection of tools, systems, and technologies that dominate the current practice. It would stimulate more effective collaboration, would vastly cut down on the tedious bureaucracy associated with software development (waiting for builds, checking out code, dealing with conflicts when synchronizing changes, etc.) and would lead to more productive development where developers are able to focus on the process of creation more than is the case now.