Puppet

I recently started preparing to deploy our localstre.am codebase to an actual server. That means I’m currently having a lot of fun picking up some new skills and playing system administrator (while being aware of my shortcomings there). World plus dog seems to recommend either Chef or Puppet for this, and since I know a few good people who are into puppet in a big way and I’ve seen it used at Nokia, I chose the latter.

After getting things to a usable state, I have a few observations that I wanted to share. They come from my background of having engineered systems for some time and from a general gut feeling about what is acceptable and what is not.

  • The puppet syntax is weird. It’s a so-called ruby DSL that tries to look a bit like json and only partially succeeds. So you end up with a strange mix of : and => depending on the context. I might be nitpicking here but I don’t think this is a particularly well designed DSL. It feels more like a collection of evolved conventions for naming directories and files that happen to be backed by ruby. The conventions, naming, etc. are mostly nonsensical. For example, the puppet notion of a class is misleading. It’s not a class. It doesn’t have state and you don’t instantiate it. No objects are involved either. It’s closer to a ruby module. But in puppet, a module is actually a directory of puppet stuff (so more like a package). Ruby leaks through in lots of places anyway, so why not stick with the conventions of that language? For example, by using gems instead of coming up with your own convoluted notion of a package (aka module in puppet). It feels improvised and cobbled together. Almost like the people who built this had no clue what they were doing and changed their minds several times. Apologies for being harsh if that offended you BTW ;-).
  • The default setup of puppet (and chef) assumes a lot of middleware that doesn’t make any sense whatsoever for most smaller deployments (or big deployments IMNSHO). Seriously, I don’t want a message broker anywhere near my servers any time soon, especially not ActiveMQ. The so-called masterless (puppet) or solo (chef) setups are actually much more appropriate for most people. They are simpler and have fewer moving parts. That’s a good thing when it comes to deployments.
  • It tries to be declarative. This is mostly a good thing but sometimes it is just nice to have an implicit order of things following from the order in which you specify things. Puppet forces you to be explicit about order and thus ends up being very verbose about this. Most of that verbosity is actually quite pointless. Sometimes A really comes before B if I specify it in that order in one file.
  • It’s actually quite verbose compared to the equivalent bash script when it comes to basic stuff such as starting a service, copying a file from a to b, etc. Sometimes a “cp foo bar; chmod 644 bar” just kinda does the trick. It kind of stinks that you end up with these five line blobs for doing simple things like that. Why make that so tedious?
  • Like maven and ant in the Java world, it tries to be platform agnostic but only partially succeeds. A lot of platform dependencies creep in and generally puppet templates are not very portable. Things like package names, file locations, service names, etc. end up being platform specific anyway.
  • Speaking of which, puppet is more like ant than like maven. Like ant, all puppet does is provide the means to do different things. It doesn’t actually provide a notion of a sensible default way of doing things that you then customize, which is what maven does. Not that I’m a big fan of maven, but with puppet you basically have to babysit the deployment and worry about all the silly little details that are (or should be) bog standard between deployments: creating users, setting up & managing ssh keys, ensuring processes run with the appropriate restrictions, etc. This is a lot of work and, like a lot of IT stuff, it feels repetitive and thus is a good candidate for automation. Wait … wasn’t puppet supposed to be that solution? The puppet module community provides some reusable stuff but it’s just bits and pieces really, and not nearly enough for a sensible production-ready setup for even the simplest of applications. It doesn’t look like I could get much value out of that community.
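
To make the points about syntax, verbosity and explicit ordering a bit more concrete, here is a minimal sketch of what such resources tend to look like. It is not taken from the actual localstre.am manifests: the file paths and the nginx package/service are assumptions picked purely for illustration.

```puppet
# Rough puppet equivalent of: cp /tmp/foo /etc/bar; chmod 644 /etc/bar
# (both paths are hypothetical)
file { '/etc/bar':
  ensure => file,
  source => '/tmp/foo',
  mode   => '0644',
}

# Ordering has to be spelled out. Without the require, puppet makes no
# promise that the package is handled before the service.
# ('nginx' is just an example name; it may differ per platform.)
package { 'nginx':
  ensure => installed,
}

service { 'nginx':
  ensure  => running,
  require => Package['nginx'],
}
```

Note the mix of : after the resource title and => for the attributes, and that the one-line bash version turns into a five line resource.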

So, I think puppet at this stage is a bit of a mixed bag and I still have to do a lot of work to actually produce a production-ready system. Much more than I think is justified by the simplicity of the real world setups that I’ve seen in the wild. Running a ruby or java application is mostly not rocket science. So, why exactly does this stuff continue to be so hard & tedious despite a multi-billion dollar industry trying to fix this for the last 20 years or so?

I don’t think puppet is the last word in devops automation. It is simply too hard to do things with puppet and way too easy to get it wrong as well. There’s too much choice, a lack of sensible policy, and way too many pitfalls. It being an improvement at all merely indicates how shit things used to be.

At this point, puppet feels more like a tool for finishing, in arbitrary ways, the job that linux distributors apparently couldn’t be bothered to do than like a tool for producing reliable & reproducible production quality systems. I could really use a tool that does the latter without the drama and attitude. What I need is a sensible out of the box experience for the following use case: here’s a war file, deploy that on those servers.

Anyway, I started puppetizing our system last week and have gotten it to the point where I can boot a bunch of vagrant virtual machines with the latest LTS ubuntu and have them run localstre.am in a clustered setup. Not bad for a week of tinkering, but I’m pretty sure I could have gotten to that point without puppet as well (possibly even sooner). And I still have a lot of work to do to set up a wide range of things that I would have hoped would be solved problems (logging, backups, firewalls, chroot, load balancing a bog standard, stateless http server, etc). Most of this falls in the category of non-value-adding stuff that somebody simply has to do. Given that we are a two person company and I’m the CTO/server guy, that would be me.

I of course have the benefit of hindsight from my career at Nokia, where for a few years I watched Nokia waste/invest tens of millions on deploying (mostly) simple, bog standard Java applications to servers. It seems simple things like “given a war file, deploy the damn thing to a bunch of machines” get very complicated when you grow the number of people involved. I really want to avoid needing a forty person ops team to do stupid shit like that.

So, I cut some corners. My time is limited and my bash skills are adequate, so I basically only use puppet to get the OS into a usable enough state that I can hand off to a bash script to do the actual work of downloading, untarring, chmodding, etc. needed to get our application running. Not very puppet like, but at least it gets the job done in 40 lines of code or so without intruding too much on my time. In those 40 lines, I install the latest sun jdk (tar ball), the latest jruby (another tar ball), our code base, and the two scripts to start elasticsearch and our jetty/mizuno based app server.
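
For illustration, the hand-off from puppet to a bash script could look roughly like the sketch below. This is not the real manifest: the /opt/localstream directory, the install.sh script name, the localstream module name and the .installed marker file are all hypothetical, and the sketch assumes the script creates the marker file itself once it finishes successfully.

```puppet
# Hypothetical install location for the application
file { '/opt/localstream':
  ensure => directory,
}

# Ship the bash script that does the actual downloading and untarring
# (assumes a module named 'localstream' with files/install.sh in it)
file { '/opt/localstream/install.sh':
  ensure  => file,
  source  => 'puppet:///modules/localstream/install.sh',
  mode    => '0755',
  require => File['/opt/localstream'],
}

# Run the script once; 'creates' makes puppet skip it on later runs
# if the marker file already exists.
exec { 'install-localstream':
  command   => '/opt/localstream/install.sh',
  path      => ['/usr/bin', '/bin'],
  creates   => '/opt/localstream/.installed',
  logoutput => on_failure,
  require   => File['/opt/localstream/install.sh'],
}
```

The trade-off is that puppet only knows whether the script has run, not what it did, so everything inside the script is invisible to it.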

What would actually be useful are reusable machine templates for bog standard things like php and ruby capable servers, java tomcat servers, apache load balancers, etc. with sensible hardened configurations, logging, monitoring, etc. The key benefit would be inheriting from a sensible setup and only changing the bits that actually need changing. It seems that is too much to ask for at this point, and consequently hundreds of thousands of system administrators (or the more hipster devops, if you are so inclined) continue to be busy creating endless minor variations of the same systems.

Back online

Yesterday I came home, switched on my pc, started Firefox and got a connection timeout on the start page. From there I followed my usual debug & escalate scheme: try to ping a domain, then an IP address, restart the modem (no connection, no line sync), run diagnostics on the modem (indeed unable to connect). I then rechecked all the modem and network settings and waited a bit to see if the problem would fix itself (this actually works sometimes). Finally, I concluded that the problem was not on my side.

So I picked up the phone to call KPN. Doh! The telephone was dead as well! IMHO that was pretty conclusive: things were broken and definitely not on my side of the connection. It took four phone calls (on my mobile phone) to get this knowledge through to KPN, who eventually sent over an engineer who indeed concluded I was right (hey, the phone line seems to be dead!) and then figured out it was probably a problem on the other side of the cable. So he went there, found the problem, fixed the problem and all is well.

Except that I wasted a lot of time, which IMHO could have been avoided.

What really happened: regular maintenance in my area accidentally broke some cable (the engineer fixed this). I phoned the KPN helpdesk and there they went through their usual error checking procedure: Q: what’s your phone number? A: the same one I just typed in while in the fucking voice operated menu. Q: did you mess in any way with your DSL connection (this is a dangerous question: if you say yes here they will try to convince you that you caused the problem)? A: no, and it has worked fine for two years. This led to the inevitable conclusion that they had to do a line check (the only technical thing they can do), which they said was fine (it was broken!). They then decided to put an analyst on it who would call me back. This didn’t happen, so I called them (tip: keep calling them, it’s the only way to get things done). The next day I called them again. They then claimed that they’d tried to call me several times. To which I enquired how that was possible since my phone was dead (duh!). They explained to me that they’d been trying the mobile number I gave them (they even read it to me!). My phone is one of those modern thingies which shows you an overview of missed calls: none whatsoever. After I explained that to them, things went relatively fast. I had to go home early to meet with the aforementioned engineer.

So what went wrong here:

  • KPN was the cause of the problem.
  • When I called them they did not detect that the line was broken (WTF!).
  • Neither did they notice that there had been some maintenance on my line that same day (at this point, dots should have been connected but weren’t).
  • They then failed to call me back and even lied to me about that when I called them!
  • I went home early needlessly and wasted a lot of time on a problem that could have been diagnosed and fixed without me coming home early from work.

Why am I posting this? I’ve been here several times. I’ve had multiple incidents over the past few years, all of which took way too many phone calls to get fixed. In 2000, when my phone and then my ADSL connection were installed, it took no fewer than five visits from an engineer to get things working. The whole process took weeks. Then, when a router in my area broke down and slowed things down to mere bytes per minute, it took three weeks to convince them the problem was on their side (only when I organized myself with some other disgruntled people in my area did the problem mysteriously fix itself). Then, when I moved to Nijmegen, they informed me my connection had been moved when in fact it had not been. That took about two weeks to get acknowledged and fixed.

In all these cases I did everything by the book and had to endure clueless, uninformed helpdesk employees, lies, the inevitable voice menu, etc.

I’m looking forward to getting rid of them permanently when I move to Finland in a few weeks.