Some reflections on a future of software engineering

Update. I accidentally published an early draft of this long (for a blog post) article by clicking the wrong button in WordPress. If you read that early draft: the article is now in its finished form. I’ve edited the earlier text and added loads of links.

I have over the past year been a bit less active blogging and posting in forums. I make an exception on things related to build processes, distributed version management and Java and you may find a couple of fine rants on my website and across various blogs on this topic.

This post is intended to pull all of it together into a somewhat grander vision and to reflect a bit on what a post-revolution software engineering world might look like. I don’t like or believe much in revolutions but I do recognize that getting from where we are to where I think we need to be amounts to one. The more likely outcome is an evolutionary development where the bits and pieces where I was not completely and utterly wrong might actually happen. At some point. Maybe.

The Problem: Files

Software engineering practice is stuck in abstractions and concepts dating back decades. Because it is stuck, it is not really moving forward as fast as it could be towards being more capable of dealing with scale in numbers of people, amount of software, and amount of evolutionary change per time unit. This annoys me and has done so for years. It is the main motivation for writing this article.

The core abstraction I strongly believe is very wrong is the file. A file is a named binary or text blob that lives on a file system. It has some content of some type that is generally associated with a suffix in the name (e.g. *.java).

Software engineers mistakenly believe their output consists of software artifacts that are stored as files. It doesn’t. Their output consists of software that happens to be stored in files. If you had been programming in the fifties, you’d have been serializing your software from your head onto paper and manually flipping switches to de-serialize it. This was tedious but it worked. People like Ada Lovelace were writing software long before there were machines capable of even executing the software (other than the human brain, which is not very good at running our own software). The sixties made IO quite a bit easier with consoles, punch cards and tapes. People would transfer programs to and from RAM buffers by streaming them from tape or by feeding piles of cards to a computer. Somewhere in this era the notion of transforming these software data blobs using other software blobs also became popular: I’m talking about editors, compilers and interpreters. Finally somewhere in the seventies floppies and hard disks started to be used and everybody started putting their data blobs in files. Sadly this is where evolution stops.

At least when it comes to storing computer programs. For other types of data we have gotten a lot smarter. We have graph databases, object databases, tuple stores, relational databases, document stores, etc. None of these are commonly used to store computer programs.

This is weird and in my view it is holding software engineering practice back. A big reason is that we have built a huge sand castle of technology around the notion of files as the primary means of storing computer programs. For example we have version control systems that store different versions of files. We have built tools that consume and produce files. We have source code analyzers that expect to be working on files. We have IDEs that offer us convenient ways to edit files. Once we stop putting software in files, that technology needs to start changing as well.

Smalltalk and Eclipse

Actually it is not strictly true that evolution stopped with the file. Smalltalk and its descendants have a very interesting approach where the entire software system is stored in one big image file. Effectively Smalltalk is using a database concept. This allowed for some interesting innovations, like meta programming and software transformations (aka refactoring), that require a readily available in-memory representation of the software that is generally very rich and graph-like (i.e. an abstract syntax tree) and can be introspected. Smalltalk developers realized that it didn’t make much sense to fragment bits and pieces of the software over multiple files. You might as well just use a single file.

This inspired IBM to create a nice IDE for Java at some point: VisualAge. VisualAge was very much a reborn Smalltalk-style IDE and crucially stored its Java code not in files but in the form of a database that effectively was a serialization of the abstract syntax tree of the entire software system. It needed that tree in order to be able to power important features like refactoring, code browsers, etc. Later it became Eclipse and they dropped the notion of storing the serialized AST, opting instead for a combination of on the fly construction of an in-memory AST, on the fly compilation and analysis, and reading from and writing to files the good old fashioned way.

Failed experiment: intentional programming?

Some late nineties papers by Charles Simonyi on Intentional Programming, and persistent rumors about him being very close to launching related products, were about taking the whole Smalltalk/VisualAge thing to the next level. It’s too early to call this a failed experiment, but Simonyi never really delivered the goods. His company (Intentional Software) is still hyping intentional programming but has yet to ship a product. Seriously, this has been in the making longer than Duke Nukem Forever.

In a nutshell, Simonyi’s very brilliant idea is that creating software is about coming up with abstractions that are represented in the form of abstract syntax trees that can be translated into other, more general abstractions in multiple iterations until you end up with a syntax tree that can simply be serialized to executable code. His core idea was to treat the transformations and not the abstractions as the first class entities. In an intentional programming world you start with really simple abstractions that you can translate into executable code and you build increasingly more complex and specialized abstractions that can be used for specialist or domain specific things.

The traditional notion of compiling is very similar except it is a bit limited in the number of transformations and the abstractness of the abstractions involved. Basically most languages go through roughly 2 or 3 transformations: source code to abstract syntax tree to assembly to executable bits and bytes. There are lots of variations here but it is essentially a pipeline. Intentional Software’s product works differently. The user manipulates the AST of whatever abstractions he’s working with in some rich editing environment that presents appropriate views on the AST and tools to edit the AST. Applicable transformations and optimizations kick in on the fly to keep the executable code up to date. It probably does a bit of serialization here and there but probably not in a format that is very notepad friendly.
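To make the idea of a chain of transformations a bit more concrete, here is a minimal sketch in Python. It is my own toy illustration and not how Intentional Software’s product works: the node names (square, mul, add), the tuple representation and the little stack machine are all made up for the example. A specialized abstraction (square) is first translated into a more general one (mul), and the resulting tree is then serialized into something executable.

```python
# Toy sketch: trees are nested tuples of the form (node_type, *children).
# All node names and the stack machine below are hypothetical.

def lower_square(tree):
    """Transformation 1: rewrite ('square', x) into the more general ('mul', x, x)."""
    if not isinstance(tree, tuple):
        return tree  # a leaf, e.g. a number
    head, *args = tree
    args = [lower_square(arg) for arg in args]
    if head == "square":
        (x,) = args
        return ("mul", x, x)
    return (head, *args)


def lower_to_stack_code(tree):
    """Transformation 2: serialize an add/mul tree into toy stack machine instructions."""
    if not isinstance(tree, tuple):
        return [("push", tree)]
    head, left, right = tree
    return lower_to_stack_code(left) + lower_to_stack_code(right) + [(head,)]


def run(code):
    """Execute the toy stack machine instructions and return the result."""
    stack = []
    for instruction in code:
        if instruction[0] == "push":
            stack.append(instruction[1])
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right if instruction[0] == "add" else left * right)
    return stack.pop()


# The transformations are ordinary values that can be listed, reordered or extended.
PIPELINE = [lower_square, lower_to_stack_code]


def compile_tree(tree):
    for transform in PIPELINE:
        tree = transform(tree)
    return tree


program = ("add", ("square", 3), 4)
code = compile_tree(program)
result = run(code)
```

Adding a new domain specific abstraction in this sketch means adding one more function at the front of PIPELINE that translates it into nodes the later steps already understand.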

Another failed experiment: Model Driven Architecture (MDA)

Readers of the previous section may be tempted to sweep this into the same bucket as Model Driven Architecture (MDA). MDA was something that emerged out of the lovely world of UML. In MDA one defines systems using a standardized meta language, generally inside a dedicated tool (e.g. Rational Rose). The idea is on the surface very similar to intentional programming except it is more artifact centric. Basically there is the meta language and the UML language (of course defined in the meta language) and UML models. The weak spot of MDA has always been the transformation from models to actual software, which is kind of a hacky process that is generally locked up in some proprietary tool. Your mileage may vary, though apparently some tools are quite good at this. Early MDA environments were monstrous tools with loads of J2EE, application server madness and a rich sauce of UML and XML spread on top. It’s sort of a lot of not so great things stuck together and people have wasted enormous amounts of cash on making sense of the whole thing, which continues to feed hordes of very expensive consultants. MDA never really dragged itself outside the domain of finance and banking applications.

There is a lot of hype currently around language workbenches and similar technologies that is partly fueled by Martin Fowler, who is about to publish a book on this and other topics related to Domain Specific Languages. He also has several articles worth reading on his website on this topic.

The point about ASTs

The whole point about bringing up MDA and ASTs is that software is made out of abstractions. The serialization of abstractions to and from source code in e.g. C or Java is lossy in the sense that some of these abstractions are no longer explicit when the serialization is complete. For example you might have a nice program that implements the visitor pattern, which is an abstraction many OO programmers would understand. This might be obvious to the reader of the source code and it might have been documented in some comment but it is no longer explicit whereas when the developer had the design in his head, or on a white board, or possibly even in a UML diagram, it was explicit. Intentional programming and MDA are about giving developers the option to keep those abstractions around and to define their systems in terms of richer abstractions than are found in typical programming languages.
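A small illustration of this loss, using Python’s standard ast module (Python 3.9 or later for ast.unparse). The class and the comment are made up for the example. The only place where the design intent is stated is a comment, and that is exactly the part the parser throws away:

```python
import ast

# Hypothetical snippet: the intent (Visitor pattern) is only stated in a comment.
source = '''
# Design intent: this class is a visitor in the Visitor pattern.
class Printer:
    def visit_number(self, node):
        return str(node)
'''

tree = ast.parse(source)

# Turn the tree back into source code. The structure survives, the stated intent does not.
round_tripped = ast.unparse(tree)

class_names = [node.name for node in ast.walk(tree) if isinstance(node, ast.ClassDef)]
```

After the round trip the tools still know there is a class called Printer with a method called visit_number. That it is a visitor is something only a human reader can guess from the naming.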

The point about files

A useful representation of the abstractions is the abstract syntax tree, which is an internal data structure used by compilers, IDEs, source code analyzers, etc. There are of course many interesting ways to serialize tree like data structures like ASTs. For example, XML is pretty popular (though not for its readability or editability). And of course source code is nothing more than the serialization of an AST that conforms to a particular grammar. The point about files however is that they are used as containers of abstractions. Generally you need multiple such containers to get a working system. They might be of different types even (your average J2EE architecture comes with dozens of them). Finally, the notion of a file as the boundary on a container of abstractions is rather arbitrary. In some languages you put a single class in a file. In others files are called modules and can contain multiple classes, functions, procedures, etc. Additionally files live in directories, which are containers of files. In many languages directories have semantics as well. For example Java uses nested directories that must match the package structure (which is also nested).
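To illustrate that source code is just one of several possible serializations of the same tree, here is a small sketch using Python’s standard ast module. The two snippets are made up for the example. They are different text blobs, and a file based tool would report them as different, yet they deserialize to the same tree:

```python
import ast

# Two different text serializations of the same program.
layout_a = "total = 1+2\n"
layout_b = "total   =   (1 +\n    2)\n"

tree_a = ast.parse(layout_a)
tree_b = ast.parse(layout_b)

# ast.dump is yet another serialization of the tree. By default it leaves out
# line and column information, so layout differences disappear.
dump_a = ast.dump(tree_a)
dump_b = ast.dump(tree_b)

same_text = layout_a == layout_b
same_tree = dump_a == dump_b
```

Here same_text is False and same_tree is True: the differences between the two blobs are purely an artifact of how the tree happened to be written to text.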

I’d argue that files and directories are poorly suited for modularizing your system. Also, as a boundary of abstraction between modules, they are both too arbitrary and too coarse grained. For example, when you consider versioning, you are stuck with files and directories even though many version control systems don’t even have a first class representation for directories (git) or have trouble keeping track of name changes to files (cvs).
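As a rough sketch of what versioning at a finer granularity than files could look like, the following Python snippet fingerprints every top level function and class by hashing its syntax tree. This is my own illustration under simplifying assumptions (unique names, top level definitions only), and the function names fingerprints and changed_units are made up; it is not how any existing version control system works.

```python
import ast
import hashlib


def fingerprints(source):
    """Map each top level function or class name to a hash of its syntax tree."""
    result = {}
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            # ast.dump omits line numbers by default, so position in the file is irrelevant.
            digest = hashlib.sha256(ast.dump(node).encode("utf-8")).hexdigest()
            result[node.name] = digest
    return result


def changed_units(old_sources, new_sources):
    """Compare two snapshots, each a list of source texts, ignoring file boundaries."""
    old, new = {}, {}
    for text in old_sources:
        old.update(fingerprints(text))
    for text in new_sources:
        new.update(fingerprints(text))
    return sorted(name for name in old.keys() | new.keys() if old.get(name) != new.get(name))


# Snapshot 1: both functions live in one "file".
before = ["def a():\n    return 1\n\ndef b():\n    return 2\n"]
# Snapshot 2: a() was moved to its own "file" unchanged, b() was actually edited.
after = ["def a():\n    return 1\n", "def b():\n    return 3\n"]

changed = changed_units(before, after)
```

In this sketch only b shows up as changed. Moving a to another file is simply not an event, because the file was never part of the identity of a in the first place.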

Development is a collaborative process

Development is essentially a collaborative process. Except for very small systems, most software needs to be developed together with other people that may even be spread all over the world in some cases. It is no surprise that there is a huge amount of technology available to facilitate collaborative development. I already touched on version control systems. In addition to that there are bug management systems, collaborative editors, source code review systems, and tools to examine differences between files. If development is a collaborative process, why can’t we version, review, triage bugs, exchange, communicate, etc. via the internet? Why do I have to bother serializing and deserializing ASTs (aka saving and opening files) in order to do stuff?

To the point …

This brings me to the point, finally. Files don’t make sense at all in many ways, as I’ve argued so far. So, let’s not use files then. Let’s look at a possible alternative instead:

(Disclaimer: very likely some people have implemented bits and pieces of what I’m about to outline. I’m aware of this but still want to provide a bigger picture here.)

What does all this mean in practical terms?

Having everything in the cloud as outlined above allows for the elimination of a lot of cruft and overhead in the current software engineering practice. A few examples of activities that will work in a very different way:

Conclusion

This concludes what may seem like a pretty random collection of thoughts but is in fact something that has been brewing in my mind for about ten years in various forms.

In summary, what I’m trying to propose here is a cloud based system that has lots of fancy features like you would find in content management systems and collaborative editing tools, and that can be used online and offline. For me this is not science fiction since all the bits and pieces exist already. It’s just that in the software tools used today we are stuck with nineteen seventies technology and elaborate workarounds to compensate for some of the bigger problems.

I believe that if properly implemented, this cloud based software development environment would be vastly superior to the loosely connected collection of tools, systems, and technologies that dominate the current practice. It would stimulate more effective collaboration, would vastly cut down on the tedious bureaucracy associated with software development (waiting for builds, checking out code, dealing with conflicts when synchronizing changes, etc.) and would lead to more productive development where developers are able to focus on the process of creation more than is the case now.