The People Graph

Social networks have been all the rage for over a decade now. In recent years, Facebook has consolidated its leadership in personal networking and LinkedIn has become the undisputed leader in business networking.

A side effect of this consolidation is that business models have also consolidated, leaving out a lot of potential applications and use-cases that do not fit the current models.

Both Facebook and Linkedin have business models that focus on monetizing their respective social graphs through advertising. LinkedIn also monetizes search, analytics, and recruiting.

To protect their data and business, they have largely cut off API access for third parties. Consequently, building third party applications that use these networks has become very hard, if not impossible.

At the same time, building a new social network has become harder because most people already have their social networking needs covered by the existing offerings. Most wannabe social networks face the ‘empty room problem’: they don’t become interesting until enough people are using them.

Inbot was founded on the idea that sales is fundamentally a social process where building relationships and trust between people is the key to success. Yet, Customer Relationship Management (CRM) systems used today by most salespeople are anything but social.

CRM is mostly used to manually keep track of conversations with customers. Relationship data is shared only within the sales team, and when salespeople change jobs, all the data aggregated in the CRM stays behind while they take their social network with them.

Marketing automation has emerged as a software-based way to help companies generate leads more automatically. The problem today is that the spam and noise these applications generate is deafening, making everyone harder to reach.

Initially, Inbot started out as a disruptive play to make CRMs find and provide links to new business opportunities. Over time, we realized that we should focus solely on social lead generation, and decouple it from teams and companies that CRM vendors target.

Since last August, we have been rolling out a community that we hope will one day rival LinkedIn’s, yet works very differently.

Continue reading “The People Graph”

Running Elasticsearch in a Docker 1.12 Swarm

My last blog post was on running consul in Docker Swarm. The reason I wanted to do that is that I want to run Elasticsearch in the swarm so that I can use swarm service discovery to let other containers find and use Elasticsearch. However, I’ve been having a hard time getting that up and running because of various issues and limitations in both Elasticsearch and Docker. While consul is nice, it feels kind of wrong to have two bits of infrastructure doing service discovery. Thanks to Christian Kniep’s article, I know it can be done that way.

However, I actually managed to do it without consul eventually. Since it is completely non-trivial to do this, I decided to write up the process for this as well.

Assuming you have your swarm up and running, this is how you do it:

docker network create es -d overlay

docker service create --name esm1 --network es \
  -p 9201:9200 -p 9301:9301 \
  elasticsearch \
  -Des.network.host=_eth0_ \
  -Des.transport.tcp.port=9301 \
  -Des.discovery.zen.ping.unicast.hosts=esm1:9301 \
  -Des.discovery.zen.minimum_master_nodes=2

docker service create --name esm2 --network es \
  -p 9202:9200 -p 9302:9302 \
  elasticsearch \
  -Des.network.host=_eth0_ \
  -Des.transport.tcp.port=9302 \
  -Des.discovery.zen.ping.unicast.hosts=esm1:9301 \
  -Des.discovery.zen.minimum_master_nodes=2

docker service create --name esm3 --network es \
  -p 9203:9200 -p 9303:9303 \
  elasticsearch \
  -Des.network.host=_eth0_ \
  -Des.transport.tcp.port=9303 \
  -Des.discovery.zen.ping.unicast.hosts=esm1:9301,esm2:9302 \
  -Des.discovery.zen.minimum_master_nodes=2

There is a lot of stuff going on here. So, let’s look at the approach in a bit more detail. First, we want to be able to talk to the cluster using the swarm-registered name rather than an ip address. Secondly, there needs to be a way for each of the cluster nodes to talk to any of the other nodes. The key problem with both elasticsearch and consul is that we have no way of knowing up front what the ip addresses of the swarm containers are going to be. Furthermore, Docker swarm does not currently support host networking, so we cannot use the external ips of the docker hosts either.

With Consul we fired up two clusters that used each other and via its gossip protocol, all nodes eventually find each other’s ip addresses. Unfortunately, the same strategy does not work for Elasticsearch. There are several issues that make this hard:

  • The main problem with running elasticsearch is that, like other clustered software, it needs to know where some of the other nodes in the cluster are. This means we need a way of addressing the individual Elasticsearch containers in the swarm. We can do this using the ip address that Docker assigns to the containers, which we can’t know until the container is running. Alternatively, we can use the container DNS entry in the swarm, which we also can’t know until the container is running because it includes the container id. This is the root cause of the chicken-and-egg problem we face when bootstrapping the Elasticsearch cluster on top of Swarm: we have no way of configuring it with the right list of nodes to talk to.

  • Elasticsearch really does not like having to deal with round-robined service DNS entries for its internal nodes. You get a log full of errors since every time Elasticsearch pings a node, it ends up talking to a different node. This rules out what we did with consul earlier, where we solved the problem by running two consul services (each with multiple nodes) that talk to each other using their swarm DNS names. Consul is smart enough to figure out the ip addresses of all the containers since its gossip protocol ensures that the information replicates to all the nodes. This does not work with Elasticsearch.

  • DNS entries of other Elasticsearch nodes that do not resolve when Elasticsearch starts up cause it to crash and exit. Swarm won’t create the DNS entry for a service until after it has started.

The solution to these problems is simple but ugly: an Elasticsearch service can only have one node in Swarm. Since we want multiple nodes in our Elasticsearch cluster, we’ll need to run multiple services: one for each Elasticsearch node. This is why in the example above we start three services, each with only one replica (the default). Each of them binds on eth0, which is where the Docker overlay network ends up. Furthermore, Elasticsearch nodes rely on the ip address and port that other nodes advertise in order to talk to each other. So, the port that a node advertises needs to match the service port. It took me some time to figure out, but simply doing a -p 9301:9300 is not good enough: it really needs to be -p 9301:9301. Therefore, each of the Elasticsearch services is configured with a different transport port. For the HTTP port we don’t need to do this, so we can simply map port 9200 to a different external port. Finally, the services can only talk to other services that already exist. So, what won’t work is specifying the full esm1:9301,esm2:9302,esm3:9303 list of unicast hosts on each of the services. Instead, the first service only has itself to talk to. The second one can talk to the first one, and the third one can talk to the first and second one. This also means the services have to start in the right order.

To be clear, I don’t think that this is a particularly good way of running Elasticsearch. Also, several of the problems I outlined are being worked on and I expect that future versions of Docker may make this a little easier.

Running consul in a docker swarm with docker 1.12

Recently, Docker released version 1.12, which includes swarm functionality. When I went to a meetup about this last week, Christian Kniep demoed his solution for running consul and elasticsearch on it. Unfortunately, his solution relies on some custom docker images that he created, and I spent quite a bit of time replicating what he did without relying on his images.

In this article, I show how you can run consul using docker swarm mode using the official consul docker image. The advantage of this is that other services in the swarm can rely on the dns name that swarm associates with the consul service. This way you can integrate consul for service discovery and configuration and containers can simply ask for what they need without having to worry about where to find consul.

Note, this is a minimalistic example and probably not the best way to run things in a production environment but it proves that it is possible. In any case, docker 1.12 is rather new and they are still ironing out bugs and issues.

Before you continue, you may want to read up on how to get a docker swarm going. In my test setup, I’m using a simple vagrant cluster with three vms, each running docker 1.12.1, with the docker swarm already up and running. I strongly recommend configuring a logging driver so you can see what is going on. I used the syslog driver so I can simply tail the syslog on each vm.

Briefly, this approach is based on the idea of running two docker services for consul that can find each other via their round-robined service names in the swarm.

First, we create an overlay network for the consul cluster. In swarm mode, host networking is disabled. Most of the consul documentation assumes host networking, so those instructions won’t work here. Instead, we use an overlay network.

docker network create consul-net -d overlay

First we need to bootstrap the consul cluster with a single node service:

docker service create --name consul-seed \
  -p 8301:8300 \
  --network consul-net \
  -e 'CONSUL_BIND_INTERFACE=eth0' \
  consul agent -server -bootstrap-expect=3 -retry-join=consul-seed:8301 -retry-join=consul-cluster:8300

Since we are going to run two services for consul, we need to run them on different ports. So for the first one we simply map the exposed port to 8301. Secondly, we need to tell consul which network interface to bind on. It seems our overlay network ends up on eth0, so we’ll use that. The CONSUL_BIND_INTERFACE environment variable is used by the consul start script in the official container to figure out the ip of the container on that interface.

Swarm will now launch the service and you should be able to find a running container on one of your vms after a few seconds. Wait for it to initialize before proceeding.

After the consul seed service is up, we can fire up the second service.

docker service create --name consul-cluster \
  -p 8300:8300 \
  --network consul-net \
  --replicas 3 \
  -e 'CONSUL_BIND_INTERFACE=eth0' \
  consul agent -server -retry-join=consul-seed:8301 -retry-join=consul-cluster:8300

This will take half a minute or so to fire up. After it fires up, you can run this blurb to figure out the cluster status:

docker exec `docker ps | grep consul-cluster | cut -f 1 -d ' '` consul members
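That backticked blurb is fragile because it depends on the column layout of `docker ps`. To make the extraction step explicit, here is the same `grep | cut` logic run against a canned sample; the container ids and image names below are made up, not from a real swarm:

```shell
# Simulated `docker ps` output; the ids and images are invented.
docker_ps_output='CONTAINER ID   IMAGE           COMMAND
5abf4c4e7d30   consul:latest   "docker-entrypoint.s"
682dd5bbf0e0   nginx:latest    "nginx -g daemon off"'

# Keep the consul line, then take the first space-delimited
# field: the container id.
echo "$docker_ps_output" | grep consul | cut -f 1 -d ' '
# prints 5abf4c4e7d30
```

On recent docker versions you can avoid the parsing entirely with `docker ps -q --filter name=consul-cluster`.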

Now we can remove the consul-seed service and replace it with a 3-node service, giving us six healthy nodes in total. This will allow us to do rolling restarts without downtime.

docker service rm consul-seed

docker service create --name consul-seed \
  -p 8301:8300 \
  --network consul-net \
  --replicas 3 \
  -e 'CONSUL_BIND_INTERFACE=eth0' \
  consul agent -server -retry-join=consul-cluster:8300 -retry-join=consul-seed:8301

This will take another few seconds before the cluster becomes stable. At this point you should get something like this.

root@m1:/home/ubuntu# docker exec `docker ps | grep consul-cluster | cut -f 1 -d ' '` consul members
Node          Address        Status  Type    Build  Protocol  DC
0f218b44311a  alive   server  0.6.4  2         dc1
5abf4c4e7d30  alive   server  0.6.4  2         dc1
682dd5bbf0e0  alive   server  0.6.4  2         dc1
8a911956a8ef  alive   server  0.6.4  2         dc1
e15cde65d645  alive   server  0.6.4  2         dc1
e3b1ce398302  failed  server  0.6.4  2         dc1
e9054b9e590b  alive   server  0.6.4  2         dc1

Going forward, you could run docker containers that register themselves with this consul cluster. There are a few loose ends here though:

  • Ensuring the containers end up on separate hosts.
  • Getting rid of the no longer existing node that is stuck in a perpetually failed status.
  • Figuring out if and how to persist the consul data, and how to do rolling restarts and upgrades of the cluster.
  • The ports in the reported members all seem to be 8301, even though three of the consul nodes should be running on 8300. That looks wrong to me.
  • I’ve hardcoded eth0 as the interface, and that may prove to be something you can’t rely on. What if you have multiple overlay networks, for example? I’d like a more reliable way to figure this out. It would be nice if you could specify the interface name as part of the docker service call.
  • Having six nodes introduces the risk of a split brain if two or more nodes lose their connectivity for some reason, so it would be better to run with an odd number of nodes. Also, during a rolling restart half the nodes disappear, so you can’t actually set the quorum to n/2+1 (4), which is what you would normally do.

Functional tests and flakyness

I just stumbled on a nice article that Martin Fowler has had on his website for a few years about non-deterministic tests. It’s a good read and it addresses something that I have encountered in multiple projects. Flaky tests are indeed a problem in many places and I’ve had the ‘pleasure’ of dealing with such tests myself on a couple of occasions (often of my own making, even).

Martin Fowler lists a few ways to mitigate this problem and his suggestions are excellent and well worth reading. But I have a few things to add that are not covered in that article.
Continue reading “Functional tests and flakyness”

Basic income

Prompted by a tweet on Y-combinator’s study on a basic income, I started pondering the notion of a basic income again. This has been on my mind lately since it seems like a cool idea and a pragmatic way to cut cost and boost the economy at the same time. One of the reasons this is on my mind is that I’m actively working to automate some of the more soul-crushing jobs many people currently have. If we take those away from people, what will happen to society?

The notion of a basic income has been floating around for a while. It sounds like a wild idea but actually makes total sense if you reflect on it a bit. If you accept the premise that we don’t let people starve, freeze to death, or die of treatable diseases, the leap to a basic income is not that much of a leap because we are effectively already providing it to most in the form of food, shelter and healthcare. Even the ones that receive nothing are generally not starving, can find shelter, and are typically able to get some amount of healthcare. We all pay for that through taxes, charity, insurances, etc. It’s just that we have a lot of hassle, begging, bureaucracy and stigma associated with depending on that. The idea of basic income is simply acknowledging the reality that the cost is there already and that a system that takes that as a starting point can be cheaper and more fair.

So, I did a bit of googling and stumbled on a baffling statistic provided by the Dutch government that totally backs up my hunch that this kind of thing might actually work:

Het totaal van de Nederlandse uitgaven aan sociale bescherming tegen ziekte, ouderdom en werkloosheid komt neer op 190 miljard euro in 2012 (het meest recente jaar waarover cijfers bekend zijn). Dat is ruim 11.200 euro per inwoner.

For the non-Dutch speakers:

The total of dutch expenses for social protection against sickness, getting old, and unemployment amounts to 190 billion euro in 2012 (the most recent year for which figures are known). That’s over 11200 euro per inhabitant.

That includes unemployment benefits, the state pension, healthcare costs, benefits for people who are unable to work due to sickness, and social welfare for pretty much anyone else unable, unwilling, or too old to work. In other words, the Dutch government spends an amount on helping a few million people in Dutch society that, per inhabitant, actually amounts to a fairly decent basic income. Aside from a few homeless people, basically everyone is covered by this system already. People starving to death is essentially unheard of in the Netherlands (other than by choice).

I checked the math. 17M × 11,200 is indeed roughly 190 billion euro (NL has about 17M inhabitants). Where does all that money go? It’s obviously mostly not going to the people it was supposed to support. The word ‘overhead’ does not begin to describe how inefficient this sounds. Last time I checked, state pensions and social welfare were much less than 11,200/year, and unemployment benefits are in any case time-capped and also have a hard upper limit. What am I missing here? 933 euro/month is a very decent income and would be a considerable upgrade for most.
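The arithmetic is easy to verify; here is a quick sanity check using shell arithmetic (integer division, so it rounds down):

```shell
# 17M inhabitants times 11,200 euro/year: the ~190bn from the statistic
echo $((17000000 * 11200))
# prints 190400000000

# The same per-inhabitant amount per month
echo $((11200 / 12))
# prints 933
```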

I’d say cut that by one third, call it a basic income, and lay off whatever bureaucrats we currently have overseeing the handing out of far less to some of our citizens. 620/month is still pretty good, and the laid-off bureaucrats would of course be covered by this as well. Maybe they could do something more productive/worthwhile that actually contributes to the economy instead of just moving pennies around in some government office.

While we are at it, we can abandon minimum wages (because basic income) and cut all corporate salary expenses by about 620 netto plus whatever other benefits are being paid by the employer (typically about 2x the netto income). Think about that for a second: even the lowest paying job in the Netherlands sets the employer back by more than double what goes to the employed person, and layoffs are still hard in the Netherlands, so you are stuck with them forever. I bet a lot of corporations wouldn’t mind paying a little more tax on profits in exchange for decimating labor cost and more flexibility around hiring and firing. If we keep the tax-free income limit that exists today, you can double your income with a job that pays about 4 euro an hour, 40 hours/week, before you even start paying taxes. I bet there is a lot of work that doesn’t get done today because it is not worth paying minimum wage for, that could suddenly become an attractive way for people to boost their incomes a little. Also, why cap this at age 67?

That could do wonderful things for employment and industry in NL. My guess is most people wouldn’t quit their jobs or stop being active. However, they would become more critical about the type of work they do (less of the soul-crushing variety, I imagine). Self-employment becomes a no-brainer in this new type of economy and a perfectly safe economic choice instead of a huge financial risk. Also, people currently doing worthwhile things for ‘free’ would suddenly enjoy an income as well. I’m talking about volunteer work, parenting, taking care of the elderly, art, etc. Most of these people are currently dependent on welfare or some form of economic relationship with e.g. a breadwinner, ‘sugar daddy’, or worse.

Total cost for this would be about 130bn/year in the Netherlands. That’s literally everyone with Dutch nationality. With a GDP close to 700bn, that sounds doable. I’d say the same if it was double the cost, but it seems 11,200 × 0.66 × 17,000,000 really is 125,664,000,000. Some more statistics on revenue. This document suggests a few interesting things: we pay more in social insurance (96bn) than we pay in income and profit tax combined. Also, VAT is about the same as both of those taxes combined (around 70bn). Apparently most of the revenue funds the 190bn we are spending on social welfare. So we conveniently just wiped out about 1/3rd of that expense while simultaneously raising lower incomes (i.e. more VAT income), cutting labor cost, and increasing corporate profits (more profit tax). This is where my back-of-the-envelope calculations have to stop, but you can see where I’m going with this: this seems more than merely doable; it’s actually a net gain for everyone.

Am I being naive or are we just paying an insanely huge price for the illusion of a fair system today? If I look at my own situation, I’m pretty sure that health would be the only reason for me to retire from active life permanently. Though I could imagine taking a sabbatical and relaxing a bit more once in a while.

Mobile Coverage according to Deutsche Bahn

Yesterday I was traveling by train and it struck me how poor connectivity is in Germany. When traveling from Berlin to Hengelo (the first stop across the border in NL), I typically plan on having no coverage whatsoever for what I guesstimate is at least 80% of the trip. Apparently in places like Bad Bentheim, Rheine, and Osnabrück it is normal to have little or no coverage, even when the train stops at the damn railway station.

I found a nice tweet in my twitter feed this morning mentioning that Deutsche Bahn is providing some nice open data files. One of these files maps coverage for the different mobile providers in Germany along the rail tracks. I downloaded the file and did some very low-tech analysis on it, basically taking their stability metric and counting the number of non-zero values for each provider using a bit of old-school command line voodoo.

# metrics with non 0 value data points (higher is better)

ip-10-0-1-28:~ $ cat connectivity_2015_09.geojson | grep o2_stability | grep  -E -v '0,$' | wc -l
ip-10-0-1-28:~ $ cat connectivity_2015_09.geojson | grep t-mobile_stability | grep  -E -v '0,$' | wc -l
ip-10-0-1-28:~ $ cat connectivity_2015_09.geojson | grep e-plus_stability | grep  -E -v '0,$' | wc -l
ip-10-0-1-28:~ $ cat connectivity_2015_09.geojson | grep vodafone_stability | grep  -E -v '0,$' | wc -l

# metrics with 0 value data points (lower is better)
ip-10-0-1-28:~ $ cat connectivity_2015_09.geojson | grep o2_stability | grep  -E  '0,$' | wc -l
ip-10-0-1-28:~ $ cat connectivity_2015_09.geojson | grep t-mobile_stability | grep  -E  '0,$' | wc -l
ip-10-0-1-28:~ $ cat connectivity_2015_09.geojson | grep e-plus_stability | grep  -E  '0,$' | wc -l
ip-10-0-1-28:~ $ cat connectivity_2015_09.geojson | grep vodafone_stability | grep  -E  '0,$' | wc -l

As I suspected, O2 is the worst and T-Mobile has more than twice the coverage. However, that still amounts to pretty shit coverage, since the vast majority of all metrics for all providers is 0. In fact, my guesstimate was quite accurate: even T-Mobile has no connection stability for a whopping 70% of the metric points, which I assume are roughly evenly distributed along the tracks (if not, it could be worse). For O2, it is more like 85%. The total number of metrics for all providers appears to be roughly the same, which suggests that the numbers are comparable.
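To show what the counting actually does, here is the same logic run against a tiny stand-in file; the line shape mimics the real geojson but the values below are invented. Note that the `'0,$'` regex would also misclassify values ending in 0 (like 10) as zero, a limitation the original one-liners share:

```shell
# Build a miniature stand-in for the coverage file (values invented).
cat > sample.geojson <<'EOF'
"o2_stability": 0,
"o2_stability": 4,
"t-mobile_stability": 2,
"t-mobile_stability": 3,
"t-mobile_stability": 0,
EOF

# Count non-zero stability points per provider, as in the analysis above.
for p in o2 t-mobile; do
  echo "$p: $(grep "${p}_stability" sample.geojson | grep -cEv '0,$') non-zero points"
done
# prints:
# o2: 1 non-zero points
# t-mobile: 2 non-zero points
```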

Wtf Germany? Please fix your infrastructure and stop being a digital backwater.

Asana – killer issue tracker

I recently discovered Asana through @larsfronius. I have had a rocky history with issue trackers and productivity tools in general. Whether it is Jira, Trac, Bugzilla, Trello, the Github, Bitbucket, and Gitlab issue trackers, text files, excel sheets, or post-its: it just doesn’t work for me; it gets in the way and stuff just starts happening outside it. I’ve devolved to the point where I can’t read my own handwriting, so anything involving paper, pens, crayons and whatnot is a complete non-starter for me. Besides, paper doesn’t work if you have people working remotely. The combination of too much bureaucracy, bad UX, and a constant avalanche of things landing in my lap means I have a tendency to mostly not document what I’m doing, have done, or am planning to do. This is bad; I know.

Fundamentally I don’t plan my working weeks waterfall style in terms of tickets which I then pick up and do. In many ways, writing a ticket is often half doing the work since it triggers my reflex to solve any problem in near sight. It’s what engineers do. If you have ever tried to have a conversation with an engineer you know what I am talking about. You talk challenges; they talk solutions. Why don’t you just do X or Y? What about Z? It’s hard to separate planning from the execution. So, there’s a bit of history with me starting to create a ticket for something and realizing half way that actually just solving the problem takes less time, is more fun, and probably, a better use of my time and then doing that instead.

I work in a startup company where I’ve more or less labeled myself chief plumber. This means I’m dealing with a wide variety of topics, all the time. It means I’m often dealing with three things already when somebody comes along with a fourth and a fifth. All of them urgent. All of them unplanned. We’ve tried dealing with it in the traditional ways of imposing process, tools, bureaucracy, etc. But it always boils down to answering this question: what is the single most productive thing I can do that moves us forward? And acknowledging that the answer is not a fixed thing that we set in stone for whatever sprint length is fashionable at the time, but subject to change. Me hopping from task to task continuously means I don’t get anything done. Me only doing what seems nice means I get the wrong things done. In a nutshell, this doesn’t scale and I need a decent issue tracking tool to solve it properly.

Since my memory is flaky and tends to hold only a handful of things, I tend to write down things that seem important but not urgent, so that I can focus on what I was doing and come back to them later. This process is highly fluid. Something comes along; I write it down. Later I look at what I’ve written and edit a bit, and once in a while I actually get around to doing stuff that I wrote down, but mostly the list just grows and I pick off things that seem the most urgent. The best tool for this process is necessarily something brutally simple. The main goal is to be minimally disruptive to the actually productive thing I was doing when I got interrupted, while still taking note of whatever I was interrupted for so that I don’t forget about it. So, for a long time a simple text editor was my tool of choice here. Alt-tab, type, edit, type, ctrl+s, alt-tab back to whatever I was doing. This is minimally intrusive. My planning process consists of moving lines around inside the file and editing them. This sounds as primitive as it is and it has many drawbacks, especially in teams. But it beats having to deal with Jira’s convoluted UI or hunting for the right button in a web UI to find stuff across the dozen or so Github and Gitlab projects I work on. However, using a text editor doesn’t scale and I need a decent issue tracking tool to solve it properly.

Enter Asana. As you can probably imagine, I came to this tool with the healthy bias of someone for whom basically none of the tools tried over the past decades came close to my preferred but imperfect tool: the text file. My first impression of this tool was wrong. The design and my bias led me to believe that this was another convoluted, over-engineered issue tracker. It took me five minutes of using it before I got how wrong I was.

The biggest hurdle was actually migrating the hundred or so issues I was tracking. Or so I thought. I was not looking forward to clicking new, edit, ok etc. a hundred times, which I assumed would be the case because that is how basically all issue trackers I’ve worked with so far work. So, I had been putting off that job. It turns out Asana does not work that way: copy 100 lines of text, paste, job done. So, one minute into using it I had already migrated everything I had in my text editor. I was impressed by that.

Asana is a list of stuff where you can do all the things that you would expect to do in a decent UI for that. You can paste lines of text and each line becomes an issue. You can drag lines around to change the order. You can organize them using sections, tags, and projects. You can multi-select lines using similar mouse and keyboard commands to what you would use in, say, a spreadsheet, and manipulate issues that way. Unlike every other issue tracker, the check box in the UI is actually there to allow you to mark things as done, not for selecting stuff. Instead, CMD+A, SHIFT+click, or CMD+click selects issues, and then clicking e.g. the tag field does what you’d expect. Typing @ triggers the autocomplete and you can easily refer to things (people, issues, projects, etc.) by name. There are no ticket numbers in the UI, but each line has a unique url of course. Editing the line updates all the @ references to that issue. There are no modal dialogues or editing screens that hijack the screen. Instead, Asana has a list and a detail pane that sit side by side. Click any line and the pane updates, and you do your edits there. Multi-select some lines and anything you do in the pane happens to the selected issues. There are no save, OK, submit, or other buttons that add unnecessary levels of indirection. Just clicking in the field and typing is enough.

Asana is the first actually usable issue tracker that I’ve come across. I’ve had multiple occasions where I found that Asana actually works as I would want it to. As in, I wonder what happens if I press CMD+Z. It actually undid what I just did. I wonder what happens if I do that again. WTF, that works as well! Multi-level undo; in a web app. OK, let’s cut and paste (CMD+X, CMD+V) some issues between Asana projects. Boom, 100 issues just moved. Of course you can also CMD+A and drag selected issues to another Asana project. I wonder if I can assign them to multiple projects. Yes you can, just hit the big + button. This thing just completely fixed the UX around issue tracking for me. All the advantages of a text file combined with all the advantages of a proper issue tracker. Creating multiple issues is as simple as type, enter, type another one, enter, etc. Organizing them is a breeze. It’s like a text editor but backed by a proper issue tracker. This UI wipes out 20 years of forms-based web UX madness and it is refreshing. We’ve been using it for nearly two months at Inbot and are loving it.

So, if you are stuck using something more primitive and are hating it: give Asana a try and you might like it as well.

How to rename an index in Elasticsearch

I’ve found that Elasticsearch on startup fixes index names to reflect the directory name, which is nice.

This is useful if you, for example, want to change the logstash index mapping template and don’t want to lose all the data indexed so far, go through a lengthy reindex process, or wait until midnight for the index to roll over.

So, this actually works:

  • configure the new index template in logstash
  • shut down cluster
  • rename todays logstash index directory to logstash-2015.03.03_beforenoon
  • restart the cluster; elasticsearch figures out that the renamed directory should be opened as the index logstash-2015.03.03_beforenoon, and logstash will notice the missing index for today and recreate it with the new template

Nice & almost what I want, but I was wondering if I can do the same without shutting down my cluster and restarting it, which is kind of a disruptive thing to do in most real environments. After a bit of experimenting, I found that the following works:

PUT /_cluster/settings
{
    "transient" : {
        "discovery.zen.minimum_master_nodes" : 1
    }
}
The actual settings don’t matter; as long as you have something there, any PUT to the settings will basically cause elasticsearch to reload the cluster state.

Update. You may want to not do this on an index that is being updated (like, typically, an active logstash index) since the rename leaves you with duplicates of the lock files that elasticsearch uses. I ended up removing these lock files in my index copy, after which elasticsearch stopped barfing errors about the duplicated lock files. But probably not nice. So it is probably better to:

  • mv logstash-2015.03.03 logstash-2015.03.03_moved
  • clear out any write.lock files inside the new logstash-2015.03.03_moved dir
  • do the PUT to /_cluster/settings

Elasticsearch failed shard recovery

We have a single node test server with some useful data on it. After an unplanned reboot of the server, elasticsearch failed to recover one shard in our cluster, and as a consequence the cluster went red, which means it doesn’t work until you fix it. Kind of not nice. If this were production, I’d be planning an extensive post mortem (how did it happen?) and probably doing some kind of restore from a backup. However, this was a test environment, which meant an opportunity to figure out whether the problem could actually be fixed.

I spent nearly two hours figuring out how to recover from this in a way that does not involve going “ahhh whatever” and deleting the index in question. Been there, done that. I suspect I’m not the only one to get stuck in the maze of half truths, well intentioned but incorrect advice, etc. So I decided to document the fix I pieced together, since I have a hunch this won’t be the last time I have to do this.

This is one topic where the elasticsearch documentation is of little help. It vaguely suggests that this shouldn’t happen and that red is a bad color to see in your cluster status. It also provides you plenty of ways to figure out that, yes, your cluster isn’t working, and why, in excruciating levels of detail. However, very few ways of actually recovering beyond a simple delete and restore from backup are documented.

However, you can sometimes actually fix things, and with a few hours of googling I was able to piece together something that works.

Step 0 – diagnose the problem

This mainly involves figuring out which shard(s) are the problem. So:

# check cluster status
curl localhost:9200/_cluster/health
# figure out which indices are in trouble
curl 'localhost:9200/_cluster/health?level=indices&pretty'
# figure out what shard is the problem
curl localhost:9200/_cat/shards

I can never remember these curl incantations so nice to have them in one place. Also, poke around in the log. Look for any errors when elasticsearch restarts.

In my case, the log was pretty clear about the fact that, due to some obscure exception involving “type not found [0]”, it couldn’t start shard 2 in my inbot_activities_v29 index. I vaguely recall from a previous episode, where I unceremoniously deleted the index and moved on with my life, that the problem is probably related to some index format change between elasticsearch updates some time ago. It doesn’t really matter: we know that somehow that shard is not happy.

Diagnosis: Elasticsearch is not starting because there is some kind of corruption with shard 2 in index inbot_activities_v29. Because of that the whole cluster is marked as red and nothing works. This is annoying and I want this problem to go away fast.

Btw, I also tried the _recovery API, but it seems to lack an option to actually recover anything. Also, it seems not to list any information for the shards that failed to recover. In my case it listed the four other shards in the index, which were indeed fine.

Step 1 – org.apache.lucene.index.CheckIndex to the rescue

We diagnosed the problem. Red index. Corrupted shard. No backups. Now what?

Ok, technically you are looking at data loss at this point. The question is how much data you are going to lose. Your last resort is deleting the affected index. Not great, but it at least gets the rest of the cluster green.
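For completeness, the last-resort delete is a one-liner (the index name is from my example; needless to say, this is irreversible):

```shell
# delete the broken index entirely -- all its documents are gone after this
curl -XDELETE 'localhost:9200/inbot_activities_v29'
```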

Say you don’t actually care about the 1 or 2 documents in the index that are blocking the shard from loading? Is there a way to recover the shard and nurse the broken cluster back to a working state minus those apparently corrupted documents? That might be a preferable approach to simply deleting the whole index.

The answer is yes. Lucene comes with a tool to fix corrupted indices. It’s not well integrated into elasticsearch; there’s an open ticket in elasticsearch that may eventually address this. In any case, you can run the tool manually.

Assuming a centos based rpm install:

# OK last warning: you will probably lose data. Don't do this if you can't risk that.

# this is where the rpm dumped all the lucene jars
cd /usr/share/elasticsearch/lib

# run the tool. You may want to adapt the shard path 
java -cp lucene-core*.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /opt/elasticsearch-data/linko_elasticsearch/nodes/0/indices/inbot_activities_v29/2/index/ -fix

The tool displays some warnings about what it is about to do and, if you are lucky, reports that it fixed some issues and wrote a new segment. Run the tool again and it reports that everything is fine. Excellent.

Step 2 – Convincing elasticsearch everything is fine

Except, elasticsearch is still red. Restarting it doesn’t help; it stays red. This one took me a bit longer to figure out. It turns out that all those well intentioned blog posts that mention the lucene CheckIndex tool sort of leave the rest of the process as an exercise to the reader. There’s a bit more to it:

# go to wherever the translog of your problem shard is
cd /opt/elasticsearch-data/linko_elasticsearch/nodes/0/indices/inbot_activities_v29/2/translog
# note the recovery file; now would be a good time to make a backup of this file because we will remove it
sudo service elasticsearch stop
rm *recovery
sudo service elasticsearch start

After this, elasticsearch came back green for me (see step 0 for how to check that). I lost a single document in the process. Very acceptable, given the alternative of deleting the entire index.