Functional tests and flakiness

I just stumbled on a nice article about non-deterministic tests that Martin Fowler has had on his website for a few years. It’s a good read and it addresses something that I have encountered in multiple projects. Flaky tests are indeed a problem in many places and I’ve had the ‘pleasure’ of dealing with such tests myself on a couple of occasions (often of my own making, even).

Martin Fowler lists a few ways to mitigate this problem and his suggestions are excellent and well worth reading. But I have a few things to add that are not covered in that article.

Basic income

Prompted by a tweet on Y Combinator’s study on a basic income, I started pondering the notion of a basic income again. This has been on my mind lately since it seems like a cool idea and a pragmatic way to cut costs and boost the economy at the same time. One of the reasons is that I’m actively working to automate some of the more soul-crushing jobs many people currently have. If we take those away from people, what will happen to society?

The notion of a basic income has been floating around for a while. It sounds like a wild idea but actually makes total sense if you reflect on it a bit. If you accept the premise that we don’t let people starve, freeze to death, or die of treatable diseases, the step to a basic income is not that much of a leap, because we are effectively already providing it to most in the form of food, shelter and healthcare. Even those who receive nothing are generally not starving, can find shelter, and are typically able to get some amount of healthcare. We all pay for that through taxes, charity, insurance, etc. It’s just that we have a lot of hassle, begging, bureaucracy and stigma associated with depending on that. The idea of basic income is simply acknowledging the reality that the cost is there already and that a system that takes that as a starting point can be cheaper and more fair.

So, I did a bit of googling and stumbled on a baffling statistic provided by the Dutch government that totally backs up my hunch that this kind of thing might actually work:

Het totaal van de Nederlandse uitgaven aan sociale bescherming tegen ziekte, ouderdom en werkloosheid komt neer op 190 miljard euro in 2012 (het meest recente jaar waarover cijfers bekend zijn). Dat is ruim 11.200 euro per inwoner.

For you non-Dutch speakers:

Total Dutch spending on social protection against sickness, old age, and unemployment amounted to 190 billion euro in 2012 (the most recent year for which figures are known). That’s over 11,200 euro per inhabitant.

That includes unemployment benefits, the state pension, healthcare cost, benefits for people who are not able to work due to sickness, and social welfare for pretty much anyone else unable, unwilling, or too old to work. In other words, the Dutch government spends an amount on helping a few million people in Dutch society that per inhabitant actually amounts to a fairly decent basic income. Aside from a few homeless people, basically everyone is covered by this system already. People starving to death is essentially unheard of in the Netherlands (other than by choice).

I checked the math. 17M * 11,200 is indeed about 190 billion euro (NL has about 17M inhabitants). Where does all that money go? It’s obviously mostly not going to the people it was supposed to support. The word ‘overhead’ does not begin to describe how inefficient this sounds. Last time I checked, state pensions and social welfare were much less than 11,200/year, and unemployment benefits are in any case time capped and also have a hard upper limit. What am I missing here? 933 euro/month is a very decent income and would be a considerable upgrade for most.

I’d say cut that by one third, call it a basic income and lay off whatever bureaucrats we currently have overseeing the giving out of far less to some of our citizens. 620/month is still pretty good and the laid off bureaucrats would of course be covered by this as well. Maybe they could do something more productive/worthwhile that actually contributes to the economy instead of just moving pennies around in some government office.

While we are at it, we can abandon minimum wages (because basic income) and cut all corporate salary expenses by about 620 net + whatever other benefits are being paid by the employer (typically about 2x the net income). Think about that for a second: even the lowest paying job in the Netherlands sets the employer back by more than double what goes to the employed person, and layoffs are still hard in the Netherlands, so you are stuck with them forever. I bet a lot of corporations wouldn’t mind paying a little more tax on profits in exchange for decimating labor cost and more flexibility around hiring and firing. If we keep the tax free income limit that exists today, you can double your income with a job that pays about 4 euro an hour, 40 hours/week, before you even start paying taxes. I bet there is a lot of work that doesn’t get done today because it is not worth paying minimum wage for, and that could suddenly become an attractive way for people to boost their incomes a little. Also, why cap this at age 67?

That could do wonderful things for employment and industry in NL. My guess is most people wouldn’t quit their jobs or stop being active. However, they would become more critical about the type of work they do (less of the soul-crushing variety, I imagine). Self employment becomes a no-brainer in this new type of economy and a perfectly safe economic choice instead of a huge financial risk. Also, people currently doing worthwhile things for ‘free’ would now suddenly enjoy an income as well. I’m talking about volunteer work, parenting, taking care of the elderly, art, etc. Most of these people are currently dependent on welfare or some form of economic relationship with e.g. a ‘breadwinner’, ‘sugar daddy’, or worse.

Total cost for this would be about 130bn/year in the Netherlands. That covers literally everyone with the Dutch nationality. With a GDP close to 700bn that sounds doable. I’d say the same if it was double the cost, but it seems 11,200×0.66×17,000,000 really is 125,664,000,000. Some more statistics on revenue. This document suggests a few interesting things: we pay more in social insurance (96bn) than we pay for income and profit tax combined. Also, VAT is about the same as both of those taxes combined (around 70bn). Apparently most of the revenue funds the 190bn we are spending on social welfare. So we conveniently just wiped out about 1/3rd of that expense while simultaneously raising lower incomes (i.e. more VAT income), cutting labor cost, and increasing corporate profits (more profit tax). This is where my back of the envelope calculations have to stop, but you can see where I’m going with this: this seems more than merely doable; it’s actually a net gain for everyone.
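The back of the envelope numbers above are easy to re-check. The following is only a sketch of that arithmetic: the 11,200 euro and 17M figures come from the text, "cut by one third" is taken as keeping 66% (as in the multiplication above), and the function names are made up for illustration.

```python
# Back-of-the-envelope check of the figures quoted in the post.
# Integer arithmetic keeps the results exact.

SPENDING_PER_INHABITANT = 11_200   # euro per year, from the quoted statistic
POPULATION = 17_000_000            # rough population figure used in the post


def yearly_total(per_inhabitant: int, population: int) -> int:
    """Total yearly cost in euro of paying everyone the same amount."""
    return per_inhabitant * population


def keep_percentage(amount: int, percent: int) -> int:
    """Keep `percent` percent of `amount`, rounded down to whole euros."""
    return amount * percent // 100


current_total = yearly_total(SPENDING_PER_INHABITANT, POPULATION)
basic_income_per_year = keep_percentage(SPENDING_PER_INHABITANT, 66)
basic_income_total = yearly_total(basic_income_per_year, POPULATION)

print(f"current spending:    {current_total:,} euro/year")
print(f"basic income total:  {basic_income_total:,} euro/year")
print(f"basic income/person: {basic_income_per_year // 12} euro/month")
```

The monthly figure comes out slightly below the rounded 620/month used in the text, which is the kind of rounding you would expect from envelope math.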

Am I being naive or are we just paying an insanely huge price for the illusion of a fair system today? If I look at my own situation, I’m pretty sure that health would be the only reason for me to retire from active life permanently. Though I could imagine taking a sabbatical and relaxing a bit more once in a while.

Mobile Coverage according to Deutsche Bahn

Yesterday I was traveling by train and it struck me how poor connectivity is in Germany. When traveling from Berlin to Hengelo (first stop across the border in NL), I typically plan to have no coverage whatsoever for what I guesstimate is at least 80% of the trip. Apparently in places like Bad Bentheim, Rheine, and Osnabrück it is normal to have little or no coverage, even when the train stops at the damn railway station.
I found this nice tweet in my Twitter feed this morning mentioning that Deutsche Bahn is providing some nice open data files. One of these files maps coverage for the different mobile providers in Germany along the rail tracks. I downloaded the file and did some very low tech analysis on it, basically taking their stability metric and counting the number of non-zero values for each provider using a bit of old school command line voodoo.

# metrics with non 0 value data points (higher is better)

ip-10-0-1-28:~ $ cat connectivity_2015_09.geojson | grep o2_stability | grep  -E -v '0,$' | wc -l
ip-10-0-1-28:~ $ cat connectivity_2015_09.geojson | grep t-mobile_stability | grep  -E -v '0,$' | wc -l
ip-10-0-1-28:~ $ cat connectivity_2015_09.geojson | grep e-plus_stability | grep  -E -v '0,$' | wc -l
ip-10-0-1-28:~ $ cat connectivity_2015_09.geojson | grep vodafone_stability | grep  -E -v '0,$' | wc -l

# metrics with 0 value data points (lower is better)
ip-10-0-1-28:~ $ cat connectivity_2015_09.geojson | grep o2_stability | grep  -E  '0,$' | wc -l
ip-10-0-1-28:~ $ cat connectivity_2015_09.geojson | grep t-mobile_stability | grep  -E  '0,$' | wc -l
ip-10-0-1-28:~ $ cat connectivity_2015_09.geojson | grep e-plus_stability | grep  -E  '0,$' | wc -l
ip-10-0-1-28:~ $ cat connectivity_2015_09.geojson | grep vodafone_stability | grep  -E  '0,$' | wc -l

As I suspected, O2 is the worst and T-Mobile has more than twice the coverage. However, that still amounts to pretty shit coverage since the vast majority of all metrics for all providers are 0. In fact my guesstimate was quite accurate: even T-Mobile has no connection stability for a whopping 70% of the metric points, which I assume are evenly distributed along the tracks (if not, it could be worse). For O2, it is more like 85%. The total number of metrics for all providers appears to be roughly the same, which suggests that the numbers should be comparable.
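For anyone who prefers a script over grep, here is a rough Python sketch of the same counting. It assumes, like the grep commands do, that every `<provider>_stability` value sits on its own line as `"name": number`; the function names and the sample lines are made up for illustration. Unlike `grep '0,$'`, it parses the number, so a value such as `10,` is not mistaken for zero.

```python
import re
from typing import Iterable, Tuple


def count_stability(lines: Iterable[str], provider: str) -> Tuple[int, int]:
    """Return (non_zero, zero) counts of `<provider>_stability` values."""
    pattern = re.compile(
        r'"?' + re.escape(provider) + r'_stability"?\s*:\s*(-?\d+(?:\.\d+)?)'
    )
    non_zero = zero = 0
    for line in lines:
        match = pattern.search(line)
        if match is None:
            continue  # line belongs to another provider or another field
        if float(match.group(1)) == 0:
            zero += 1
        else:
            non_zero += 1
    return non_zero, zero


def zero_share(non_zero: int, zero: int) -> float:
    """Fraction of data points without any connection stability."""
    total = non_zero + zero
    return zero / total if total else 0.0


# Hypothetical sample lines, only to show the expected input shape.
sample = [
    '"o2_stability": 0,',
    '"o2_stability": 0.5,',
    '"t-mobile_stability": 10,',
    '"o2_stability": 0.0,',
]
print(count_stability(sample, "o2"))
print(count_stability(sample, "t-mobile"))
```

To run it on the real data, pass an open file instead of the sample list, for example `count_stability(open("connectivity_2015_09.geojson"), "o2")`.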

Wtf Germany? Please fix your infrastructure and stop being a digital backwater.

Asana – killer issue tracker

I recently discovered Asana through @larsfronius. I have had a rocky history with issue trackers and productivity tools in general. Whether it is Jira, Trac, Bugzilla, Trello, the Github, Bitbucket, and Gitlab issue trackers, text files, Excel sheets, or Post-its, it just doesn’t work for me; it gets in the way and stuff just starts happening outside it. I’ve devolved to the point where I can’t read my own handwriting, so anything involving paper, pens, crayons and what not is a complete non-starter for me. Besides, it doesn’t work if you have people working remotely. The combination of too much bureaucracy, bad UX, and a constant avalanche of things landing in my lap means I have a tendency to mostly not document what I’m doing, have done, or am planning to do. This is bad; I know.

Fundamentally I don’t plan my working weeks waterfall style in terms of tickets which I then pick up and do. In many ways, writing a ticket is often half doing the work since it triggers my reflex to solve any problem in near sight. It’s what engineers do. If you have ever tried to have a conversation with an engineer you know what I am talking about. You talk challenges; they talk solutions. Why don’t you just do X or Y? What about Z? It’s hard to separate planning from the execution. So, there’s a bit of history with me starting to create a ticket for something, realizing halfway that actually just solving the problem takes less time, is more fun, and is probably a better use of my time, and then doing that instead.

I work in a startup company where I’ve more or less labeled myself as chief plumber. This means I’m dealing with a wide variety of topics, all the time. That means I’m often dealing with three things already and somebody comes along with a fourth and a fifth. All of them urgent. All of them unplanned. We’ve tried dealing with it the traditional ways of imposing process, tools, bureaucracy, etc. But it always boils down to answering one question: what is the single most productive thing I can do that moves us forward? And to acknowledging that the answer is not set in stone for whatever sprint length is fashionable at the time, but subject to change. Me hopping from task to task continuously means I don’t get anything done. Me only doing what seems nice means I get the wrong things done. In a nutshell, this doesn’t scale and I need a decent issue tracking tool to solve it properly.

Since my memory is flaky and tends to hold only a handful of things, I tend to write down things that seem important but not urgent so that I can focus on what I was doing and then come back to them later. This process is highly fluid. Something comes along; I write it down. Then later I look at what I’ve written and edit a bit, and once in a while I actually get around to doing stuff that I wrote down, but mostly the list just grows and I pick off the things that seem the most urgent. The best tool for this process is necessarily something brutally simple. The main goal is to be minimally disruptive to the actually productive thing I was doing when I got interrupted, while still taking note of whatever I was interrupted for so that I don’t forget about it. So, for a long time a simple text editor was my tool of choice here. Alt tab, type, edit, type, ctrl+s, alt tab back to whatever I was doing. This is minimally intrusive. My planning process consists of moving lines around inside the file and editing them. This sounds as primitive as it is and it has many drawbacks, especially in teams. But it beats having to deal with Jira’s convoluted UI or hunting for the right button in a web UI to find stuff across the dozen or so Github and Gitlab projects I work on. However, using a text editor doesn’t scale and I need a decent issue tracking tool to solve it properly.

Enter Asana. As you can probably imagine, I came to this tool with a healthy bias: basically none of the tools that I’ve tried over the past decades came close to my preferred but imperfect tool, the text file. My first impression of this tool was wrong. The design and my bias led me to believe that this was another convoluted, over-engineered issue tracker. It took me five minutes of using it before I got how wrong I was.

The biggest hurdle was actually migrating the hundred or so issues I was tracking. Or so I thought. I was not looking forward to clicking new, edit, ok etc. a hundred times, which I assumed would be the case because that is how basically all issue trackers I’ve worked with so far work. So, I had been putting off that job. It turns out Asana does not work that way: copy 100 lines of text, paste, job done. So, one minute into using it I had already migrated everything I had in my text editor. I was impressed by that.

Asana is a list of stuff where you can do all the things that you would expect to do in a decent UI for that. You can paste lines of text and each line becomes an issue. You can drag lines around to change the order. You can organize them using sections, tags, and projects. You can multi select lines using similar mouse and keyboard commands to what you would use in, say, a spreadsheet and manipulate issues that way. Unlike every other issue tracker, the check box in the UI actually is there to allow you to mark things as done and not for selecting stuff. Instead CMD+a, SHIFT+click, or CMD+click selects issues and then clicking e.g. the tag field does what you’d expect. Typing @ triggers the autocomplete and you can easily refer to things (people, issues, projects, etc.) by name. There are no ticket numbers in the UI but each line has a unique url of course. Editing the line updates all the @ references to that issue. There are no modal dialogues or editing screens that hijack the screen. Instead Asana has a list and a detail pane that sit side by side. Click any line and the pane updates and you do your edits there. Multi select some lines and anything you do in the pane happens to the selected issues. There are no save, OK, submit, or other buttons that add unnecessary levels of indirection. Just clicking in the field and typing is enough.

Asana is the first actually usable issue tracker that I’ve come across. I’ve had multiple occasions where I found that Asana actually works as I would want it to. As in, I wonder what happens if I press CMD+z. It actually undid what I just did. I wonder what happens if I do that again. WTF, that works as well! Multi level undo; in a web app. OK, let’s cut (CMD+X) or copy (CMD+C) some issues and paste them into another Asana project. Boom, 100 issues just moved. Of course you can also CMD+A and drag selected issues to another Asana project. I wonder if I can assign them to multiple projects. Yes you can, just hit the big + button. This thing just completely fixed the UX around issue tracking for me. All the advantages of a text file combined with all the advantages of a proper issue tracker. Creating multiple issues is as simple as type, enter, type another one, enter, etc. Organizing them is a breeze. It’s like a text editor but backed by a proper issue tracker. This UI wipes out 20 years of forms based web UX madness and it is refreshing. We’ve been using it for nearly two months at Inbot and are loving it.

So, if you are stuck using something more primitive and are hating it: give Asana a try and you might like it as well.

How to rename an index in Elasticsearch

I’ve found that Elasticsearch on startup fixes index names to reflect the directory name, which is nice.

This is useful if, for example, you want to change the logstash index mapping template and don’t want to lose all the data indexed so far, go through a lengthy reindex process, or wait until midnight for the index to roll over.

So, this actually works:

  • configure the new index template in logstash
  • shut down cluster
  • rename today’s logstash index directory to logstash-2015.03.03_beforenoon
  • restart cluster and elasticsearch figures out that logstash-2015.03.03_beforenoon probably should be opened as logstash-2015.03.03_beforenoon; logstash will notice the missing index for today and fix it with the new template

Nice & almost what I want, but I was wondering if I can do the same without shutting down my cluster and restarting it, which is kind of a disruptive thing to do in most real environments. After a bit of experimenting, I found that the following works:

PUT /_cluster/settings
{
    "transient" : {
        "discovery.zen.minimum_master_nodes" : 1
    }
}

The actual settings don’t matter as long as you have something there; any PUT to the settings will basically cause elasticsearch to reload the cluster.

Update. You may not want to do this on an index that is being updated (typically an active logstash index) since this duplicates lock files that elasticsearch uses. I ended up removing these lock files in my index copy, after which it stopped barfing errors about the duplicated lock files. But that is probably not nice. So it is probably better to:

  • mv logstash-2015.03.03 logstash-2015.03.03_moved
  • clear out any write.lock files inside the new 2015.03.03_moved dir
  • do the PUT to /_cluster/settings
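The three steps above could be scripted roughly as follows. This is only a sketch under assumptions: the function names are made up, the indices directory depends on your install, and localhost:9200 is assumed as the HTTP endpoint. The request is built but deliberately not sent, so you can inspect it first.

```python
import json
import urllib.request
from pathlib import Path
from typing import List


def move_index_dir(indices_dir: Path, old_name: str, new_name: str) -> List[Path]:
    """Rename an index directory and delete write.lock files inside it.

    Returns the lock files that were removed.
    """
    target = indices_dir / new_name
    (indices_dir / old_name).rename(target)
    removed = []
    for lock in sorted(target.rglob("write.lock")):
        lock.unlink()
        removed.append(lock)
    return removed


def build_settings_request(
    base_url: str = "http://localhost:9200",
) -> urllib.request.Request:
    """Build the PUT to /_cluster/settings (same body as in the post)."""
    body = {"transient": {"discovery.zen.minimum_master_nodes": 1}}
    return urllib.request.Request(
        url=base_url + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Sending it is then a matter of `urllib.request.urlopen(build_settings_request())` once the directory has been moved.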

Elasticsearch failed shard recovery

We have a single node test server with some useful data in there. After an unplanned reboot of the server, elasticsearch failed to recover one shard in our cluster and as a consequence the cluster went red, which means it doesn’t work until you fix it. Kind of not nice. If this was production, I’d be planning an extensive post mortem (how did it happen) and probably doing some kind of restore from a backup. However, this was a test environment, which meant an opportunity to figure out if the problem can actually be fixed somehow.

I spent nearly two hours figuring out how to recover from this in a way that does not involve going “ahhh whatever” and deleting the index in question. Been there, done that. I suspect I’m not the only one to get stuck in the maze of half truths, well intentioned but incorrect advice, etc. So, I decided to document the fix I pieced together since I have a hunch this won’t be the last time I have to do this.

This is one topic where the elasticsearch documentation is of little help. It vaguely suggests that this shouldn’t happen and that red is a bad color to see in your cluster status. It also provides you with plenty of ways to figure out, in excruciating levels of detail, that yes, your cluster isn’t working and why. However, very few ways of actually recovering beyond a simple delete and restore from backup are documented.

However, you can actually fix things sometimes and I was able to piece together something that works with a few hours of googling.

Step 0 – diagnose the problem

This mainly involves figuring out which shard(s) are the problem. So:

# check cluster status
curl localhost:9200/_cluster/health
# figure out which indices are in trouble
curl 'localhost:9200/_cluster/health?level=indices&pretty'
# figure out what shard is the problem
curl localhost:9200/_cat/shards

I can never remember these curl incantations, so it is nice to have them in one place. Also, poke around in the log. Look for any errors when elasticsearch restarts.
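If the shard list is long, a few lines of Python can pick out the unhappy shards from the `_cat/shards` output. This is a sketch that assumes the usual whitespace separated columns (index, shard number, p/r, state, ...); the type and function names are made up and the sample text is hypothetical, not output from my cluster.

```python
from typing import List, NamedTuple


class Shard(NamedTuple):
    index: str
    number: int
    kind: str    # "p" for primary, "r" for replica
    state: str   # e.g. STARTED, INITIALIZING, UNASSIGNED


def parse_cat_shards(text: str) -> List[Shard]:
    """Parse the first four columns of each line of _cat/shards output."""
    shards = []
    for line in text.splitlines():
        columns = line.split()
        # Skip blank lines and a possible header row (when ?v is used).
        if len(columns) < 4 or not columns[1].isdigit():
            continue
        index, number, kind, state = columns[:4]
        shards.append(Shard(index, int(number), kind, state))
    return shards


def problem_shards(text: str) -> List[Shard]:
    """Every shard that is not in the STARTED state."""
    return [shard for shard in parse_cat_shards(text) if shard.state != "STARTED"]


# Hypothetical output, for illustration only.
sample = """\
index shard prirep state docs store ip node
inbot_activities_v29 0 p STARTED 1200 1.1mb 127.0.0.1 node-1
inbot_activities_v29 2 p UNASSIGNED
"""
for shard in problem_shards(sample):
    print(f"{shard.index} shard {shard.number} is {shard.state}")
```

Feed it the text returned by the last curl command to get a short list of shards to look at.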

In my case it was pretty clear about the fact that due to some obscure exception involving a “type not found [0]” it couldn’t start shard 2 in my inbot_activities_v29 index. I vaguely recall from a previous episode where I unceremoniously deleted the index and moved on with my life that the problem is probably related to some index format change in between elasticsearch updates some time ago. Doesn’t really matter: we know that somehow that shard is not happy.

Diagnosis: Elasticsearch is not starting because there is some kind of corruption with shard 2 in index inbot_activities_v29. Because of that the whole cluster is marked as red and nothing works. This is annoying and I want this problem to go away fast.

Btw. I also tried the _recovery API but it seems to lack an option to actually recover anything. Also, it seems to not list any information for those shards that failed to recover. In my case it listed the four other shards in the index that were indeed fine.

Step 1 – org.apache.lucene.index.CheckIndex to the rescue

We diagnosed the problem. Red index. Corrupted shard. No backups. Now what?

Ok, technically you are looking at data loss at this point. The question is how much data you are going to lose. Your last resort is deleting the affected index. Not great, but it at least gets the rest of the cluster green.

Say you don’t actually care about the 1 or 2 documents in the index that are blocking the shard from loading? Is there a way to recover the shard and nurse the broken cluster back to a working state minus those apparently corrupted documents? That might be a preferable approach to simply deleting the whole index.

The answer is yes. Lucene comes with a tool to fix corrupted indices. It’s not well integrated into elasticsearch. There’s an open ticket in elasticsearch that may involve addressing this. In any case, you can run this tool manually.

Assuming a centos based rpm install:

# OK last warning: you will probably lose data. Don't do this if you can't risk that.

# this is where the rpm dumped all the lucene jars
cd /usr/share/elasticsearch/lib

# run the tool. You may want to adapt the shard path 
java -cp lucene-core*.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /opt/elasticsearch-data/linko_elasticsearch/nodes/0/indices/inbot_activities_v29/2/index/ -fix

The tool displays some warnings about what it is about to do and, if you are lucky, reports that it fixed some issues and wrote a new segments file. Run the tool again and it reports that everything is fine. Excellent.

Step 2 – Convincing elasticsearch everything is fine

Except, elasticsearch is still red. Restarting it doesn’t help. It stays red. This one took me a bit longer to figure out. It turns out that all those well intentioned blog posts that mention the lucene CheckIndex tool sort of leave the rest of the process as an exercise to the reader. There’s a bit more to it:

# go to wherever the translog of your problem shard is
cd /opt/elasticsearch-data/linko_elasticsearch/nodes/0/indices/inbot_activities_v29/2/translog
# note the recovery file; now would be a good time to make a backup of this file because we will remove it
sudo service elasticsearch stop
rm *recovery
sudo service elasticsearch start

After this, elasticsearch came back green for me (see step 0 for checking that). I lost a single document in the process. Very acceptable given the alternative of having to delete the entire index.
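Since step 2 removes the recovery file, it can be safer to move it aside instead of deleting it. A minimal sketch, assuming elasticsearch is stopped while it runs; the function name and the backup location are made up, and the translog path is whatever applies to your problem shard.

```python
import shutil
from pathlib import Path
from typing import List


def set_aside_recovery_files(translog_dir: Path, backup_dir: Path) -> List[Path]:
    """Move every `*recovery` file out of the translog directory.

    Moving instead of deleting keeps a copy around in case it is needed later.
    Returns the new locations of the moved files.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(translog_dir.glob("*recovery")):
        destination = backup_dir / path.name
        shutil.move(str(path), str(destination))
        moved.append(destination)
    return moved
```

Calling it with the shard’s translog directory and some backup directory replaces the `rm *recovery` line; the stop and start of the service stay the same.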

Enforcing code conventions in Java

After many years of working with Java, I finally got around to enforcing code conventions in our project. The problem with code conventions is not agreeing on them (actually this is hard since everybody seems to have their own preferences, but that’s beside the point) but enforcing them. For the purpose of enforcing conventions you can choose from a wide variety of code checkers such as checkstyle, pmd, and others. My problem with this approach is that checkers usually end up being a combination of too strict, too verbose, and too annoying. In any case nobody ever checks their output and you need to have the discipline to fix any detected issues yourself. On most projects where I’ve tried checkstyle, it finds thousands of stupid issues using the out of the box configuration. Pretty much every Java project I’ve ever been involved with had somewhat vague guidelines on code conventions and a very loose attitude to enforcing these. So, you end up with loads of variation in whitespace, bracket placement, etc. Eventually people stop caring. It’s not a problem worthy of a lot of brain cycles and we are all busy.

Anyway, I finally found a solution to this problem that is completely unintrusive: format source code as part of your build. Simply add the following blurb to your maven build section and save some formatter settings in XML format in your source tree. It won’t fix all your issues but formatting related diffs should be a thing of the past. Either your code is fine, in which case it will pass the formatter unmodified or you messed up, in which case the formatter will fix it for you.

<plugin><!-- mvn java-formatter:format -->


This plugin formats the code using the specified formatting settings XML file and it executes on every build before compilation. You can create the settings file by exporting the Eclipse code formatter settings. IntelliJ users can use these settings as well since recent versions support the Eclipse formatter settings file format. The only thing you need to take care of is the organize imports settings in both IDEs. Eclipse comes with a default configuration that is very different from what IntelliJ does and it is a bit of a pain to fix on the IntelliJ side. Eclipse has a notion of import groups that are each sorted alphabetically. It comes with four of these groups that represent imports with different prefixes, so javax.*, java.*, etc. are different groups. This behavior is very tedious to emulate in IntelliJ and out of the scope of the exported formatter settings. For that reason, you may want to consider modifying things on the Eclipse side: remove all groups and simply sort all imports alphabetically. This behavior is easy to emulate in IntelliJ and you can configure both IDEs to organize imports on save, which is good practice. Also, make sure to not allow .* imports and only import what you actually use (it keeps explicit which classes a file depends on). If everybody does this, the only people causing problems will be those with poorly configured IDEs and their code will get fixed automatically over time.

Anyone doing a mvn clean install to build the project will automatically fix any formatting issues that they or others introduced. Also, the formatter can be configured conservatively and if you set it up right, it won’t mess up things like manually added new lines and other manual formatting that you typically want to keep. But it will fix the small issues like using the right number of spaces (or tabs, depending on your preferences), having whitespace around brackets, braces, etc. The best part: it only adds about 1 second to your build time. So, you can set this up and it basically just works in a way that is completely unintrusive.

Compliance problems introduced by people with poor IDE configuration skills/a relaxed attitude to code conventions (you know who you are) will automatically get fixed this way. Win win. There’s always the odd developer out there who insists on using vi, emacs, notepad, or something similarly archaic that most IDE users would consider cruel and unusual punishment. Not a problem anymore, let them. These masochists will notice that whatever they think is correctly formatted Java might cause the build to create a few diffs on their edits. Ideally, this happens before they commit. And if not, you can yell at them for committing untested code: no excuses for not building your project before a commit.