Running Elasticsearch in a Docker 1.12 Swarm

My last blog post was on running consul in Docker Swarm. The reason I wanted to do that is because I want to run Elasticsearch in swarm so that I can use swarm service discovery to enable other containers to use Elasticsearch. However, I’ve been having a hard time getting that up and running because of various issues and limitations in both Elasticsearch and Docker. While consul is nice, it feels kind of wrong to have two bits of infrastructure doing service discovery. Thanks to Christian Kniep’s article, I know it can be done that way.

However, I eventually managed to do it without consul. Since doing this is completely non-trivial, I decided to write up the process as well.

Assuming you have your swarm up and running, this is how you do it:

docker network create es -d overlay

docker service create --name esm1 --network es \
  -p 9201:9200 -p 9301:9301 \
  elasticsearch \
  -Des.network.host=_eth0_ \
  -Des.transport.tcp.port=9301 \
  -Des.discovery.zen.ping.unicast.hosts=esm1:9301 \
  -Des.discovery.zen.minimum_master_nodes=2

docker service create --name esm2 --network es \
  -p 9202:9200 -p 9302:9302 \
  elasticsearch \
  -Des.network.host=_eth0_ \
  -Des.transport.tcp.port=9302 \
  -Des.discovery.zen.ping.unicast.hosts=esm1:9301 \
  -Des.discovery.zen.minimum_master_nodes=2

docker service create --name esm3 --network es \
  -p 9203:9200 -p 9303:9303 \
  elasticsearch \
  -Des.network.host=_eth0_ \
  -Des.transport.tcp.port=9303 \
  -Des.discovery.zen.ping.unicast.hosts=esm1:9301,esm2:9302 \
  -Des.discovery.zen.minimum_master_nodes=2

There is a lot of stuff going on here, so let’s look at the approach in a bit more detail. First, we want to be able to talk to the cluster using the swarm-registered name rather than an ip address. Secondly, there needs to be a way for each of the cluster nodes to talk to any of the other nodes. The key problem with both Elasticsearch and consul is that we have no way to know up front what the ip addresses of the swarm containers are going to be. Furthermore, Docker swarm does not currently support host networking, so we cannot use the external ip addresses of the docker hosts either.

With Consul we fired up two services that pointed at each other and, via its gossip protocol, all nodes eventually found each other’s ip addresses. Unfortunately, the same strategy does not work for Elasticsearch. There are several issues that make this hard:

  • The main problem with running Elasticsearch is that, like other clustered software, it needs to know where some of the other nodes in the cluster are. This means we need a way of addressing the individual Elasticsearch containers in the swarm. We can do this using the ip address that Docker assigns to the containers, which we can’t know until the container is running. Alternatively, we can use the container DNS entry in the swarm, which we also can’t know until the container is running because it includes the container id. This is the root cause of the chicken-and-egg problem we face when bootstrapping the Elasticsearch cluster on top of Swarm: we have no way of configuring it with the right list of nodes to talk to.

  • Elasticsearch really does not like having to deal with round-robined service DNS entries for its internal nodes. You get a log full of errors since every time Elasticsearch pings a node, it ends up talking to a different node. This rules out what we did with consul earlier, where we solved the problem by running two consul services (each with multiple nodes) that talk to each other using their swarm DNS names. Consul is smart enough to figure out the ip addresses of all the containers since its gossip protocol ensures that the information replicates to all the nodes. This does not work with Elasticsearch.

  • DNS entries of other Elasticsearch nodes that do not resolve when Elasticsearch starts up cause it to crash and exit. Swarm won’t create the DNS entry for a service until after it has started.

The solution to these problems is simple but ugly: an Elasticsearch service can only have one node in Swarm. Since we want multiple nodes in our Elasticsearch cluster, we’ll need to run multiple services: one for each Elasticsearch node. This is why in the example above we start three services, each with only one replica (the default). Each of them binds on eth0, which is where the Docker overlay network ends up.

Furthermore, Elasticsearch nodes rely on the ip address and port that nodes advertise to talk to each other, so the port that a node advertises needs to match the service port. It took me some time to figure out that simply doing a -p 9301:9300 is not good enough: it really needs to be -p 9301:9301. Therefore each of the Elasticsearch services is configured with a different transport port. For the HTTP port we don’t need to do this, so we can simply map port 9200 to a different external port.

Finally, the services can only talk to other services that already exist. So, what won’t work is listing all three nodes (esm1:9301,esm2:9302,esm3:9303) in the unicast hosts of each of the services. Instead, the first service only has itself to talk to, the second one can talk to the first one, and the third one can talk to the first and second one. This also means the services have to start in the right order.
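Once the three services are running, you can verify that the nodes actually found each other by querying any of the mapped HTTP ports on one of the swarm hosts. A quick sketch, using the port mappings from the example above:

```shell
# a correctly bootstrapped cluster reports "number_of_nodes" : 3
curl -s 'localhost:9201/_cluster/health?pretty'

# list the individual nodes and see which one is the elected master
curl -s 'localhost:9201/_cat/nodes?v'
```

The same queries against ports 9202 and 9203 should report the identical cluster state, since all three nodes are part of one cluster.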

To be clear, I don’t think that this is a particularly good way of running Elasticsearch. Also, several of the problems I outlined are being worked on and I expect that future versions of Docker may make this a little easier.

Running consul in a docker swarm with docker 1.12

Recently, Docker released version 1.12 which includes swarm functionality. When I went to a meetup about this last week, Christian Kniep demoed his solution for running consul and elasticsearch using this. Unfortunately, his solution relies on some custom docker images that he created and I spent quite a bit of time replicating what he did without relying on his docker images.

In this article, I show how you can run consul using docker swarm mode using the official consul docker image. The advantage of this is that other services in the swarm can rely on the dns name that swarm associates with the consul service. This way you can integrate consul for service discovery and configuration and containers can simply ask for what they need without having to worry about where to find consul.

Note, this is a minimalistic example and probably not the best way to run things in a production environment but it proves that it is possible. In any case, docker 1.12 is rather new and they are still ironing out bugs and issues.

Before you continue, you may want to read up on how to get a docker swarm going. In my test setup, I’m using a simple vagrant cluster with three vms, each running docker 1.12.1, with the docker swarm already up and running. I strongly recommend configuring a logging driver so you can see what is going on. I used the syslog driver so I can simply tail the syslog on each vm.
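For reference, this is roughly how to make syslog the default logging driver. A sketch; the file location and restart command assume a systemd based distro like the Ubuntu vms I used:

```shell
# on each vm: set the daemon-wide default logging driver
cat > /etc/docker/daemon.json <<'EOF'
{
  "log-driver": "syslog"
}
EOF
systemctl restart docker

# afterwards, container output shows up in the vm's syslog
tail -f /var/log/syslog
```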

Briefly, this approach is based on the idea of running two docker services for consul that can find each other via their round-robined service names in the swarm.

First, we create an overlay network for the consul cluster. In swarm mode, host networking is disabled. Most of the consul documentation assumes that you use host networking, and that won’t work here. So, instead we use an overlay network.

docker network create consul-net -d overlay

First we need to bootstrap the consul cluster with a single node service:

docker service create --name consul-seed \
  -p 8301:8300 \
  --network consul-net \
  -e 'CONSUL_BIND_INTERFACE=eth0' \
  consul agent -server -bootstrap-expect=3 -retry-join=consul-seed:8301 -retry-join=consul-cluster:8300

Since we are going to run two services for consul, we need to run them on different ports. So for the first one we simply map the exposed port 8300 to 8301. Secondly, we need to tell consul which network interface to bind on. It seems our overlay network ends up on eth0, so we’ll use that. The CONSUL_BIND_INTERFACE environment variable is used by the consul start script in the official container to figure out the ip of the container on that interface.
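The official image’s start script resolves that environment variable to the interface’s first ipv4 address and passes it to consul as the bind address. The lookup is roughly the following; it is shown here against a sample line of `ip -o -4 addr show eth0` output so it runs anywhere, but against a live container you would run the same pipeline via docker exec:

```shell
# sample output line of `ip -o -4 addr show eth0` inside a container
sample='2: eth0    inet 10.0.0.5/24 brd 10.0.0.255 scope global eth0'

# field 4 is the address with prefix length; strip the /24 suffix
echo "$sample" | awk '{print $4}' | cut -d/ -f1   # -> 10.0.0.5

# against a live container (substitute the actual container id):
#   docker exec <container-id> ip -o -4 addr show eth0 | awk '{print $4}' | cut -d/ -f1
```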

Swarm will now launch the service and you should be able to find a running container on one of your vms after a few seconds. Wait for it to initialize before proceeding.
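Rather than guessing how long to wait, you can poll until the container shows up and the agent reports members. A sketch; run this on the vm where swarm scheduled the container:

```shell
# wait until a consul-seed container is running on this host
until docker ps --format '{{.Names}}' | grep -q consul-seed; do
  sleep 1
done

# then ask the agent for its member list
docker exec `docker ps | grep consul-seed | cut -f 1 -d ' '` consul members
```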

After the consul seed service is up, we can fire up the second service.

docker service create --name consul-cluster \
  -p 8300:8300 \
  --network consul-net \
  --replicas 3 \
  -e 'CONSUL_BIND_INTERFACE=eth0' \
  consul agent -server -retry-join=consul-seed:8301 -retry-join=consul-cluster:8300

This will take half a minute or so to fire up. After it fires up, you can run this blurb to figure out the cluster status:

docker exec `docker ps | grep consul-cluster | cut -f 1 -d ' '` consul members

Now we can remove the consul-seed service and replace it with a three node service, so that we have six healthy nodes in total. This will allow us to do rolling restarts without downtime.

docker service rm consul-seed

docker service create --name consul-seed \
  -p 8301:8300 \
  --network consul-net \
  --replicas 3 \
  -e 'CONSUL_BIND_INTERFACE=eth0' \
  consul agent -server -retry-join=consul-cluster:8300 -retry-join=consul-seed:8301

This will take another few seconds before the cluster becomes stable. At this point you should get something like this.

root@m1:/home/ubuntu# docker exec `docker ps | grep consul-cluster | cut -f 1 -d ' '` consul members
Node          Address        Status  Type    Build  Protocol  DC
0f218b44311a                 alive   server  0.6.4  2         dc1
5abf4c4e7d30                 alive   server  0.6.4  2         dc1
682dd5bbf0e0                 alive   server  0.6.4  2         dc1
8a911956a8ef                 alive   server  0.6.4  2         dc1
e15cde65d645                 alive   server  0.6.4  2         dc1
e3b1ce398302                 failed  server  0.6.4  2         dc1
e9054b9e590b                 alive   server  0.6.4  2         dc1
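With both services at three replicas, a rolling restart boils down to a docker service update that replaces containers one at a time. A sketch; the image tag and delay here are illustrative, the point is to give each node enough time to rejoin before the next one goes down:

```shell
# replace one container at a time, pausing between replacements so the
# cluster regains consensus before the next node restarts
docker service update \
  --update-parallelism 1 \
  --update-delay 60s \
  --image consul:latest \
  consul-cluster
```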

Going forward, you could run docker containers that register themselves with this consul cluster. There are a few loose ends here:

  • Ensuring the containers end up on separate hosts, figuring out how to get rid of the no longer existing node that is in a perpetually failed status, and figuring out if/how to persist the consul data. Another thing to figure out would be rolling restarts and upgrading the cluster.

  • The ports in the reported members seem to all be 8301, even though three of the consul nodes should be running on 8300. That looks wrong to me.

  • I’ve hardcoded eth0 as the interface, and this may prove to be something that you can’t rely on. What if, for example, you have multiple overlay networks? I’d like a more reliable way to figure this out. It would be nice if you could specify the interface name as part of the docker service call.

  • Having six nodes introduces the risk of a split brain if two or more nodes lose their connectivity for some reason, so it would be better to run with an odd number of nodes. Also, during a rolling restart, half the nodes disappear, so you can’t actually set the quorum to n/2+1 (4), which is what you would normally do.
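As a starting point for the service registration loose end: any container attached to consul-net can talk to the agents through the swarm DNS names. A sketch, assuming the agents’ HTTP API has been made reachable on the overlay network (the official image can bind it via its CONSUL_CLIENT_INTERFACE variable); the web service name and port are made up for illustration:

```shell
# register a service named "web" on port 80 with an agent behind the
# round-robined consul-cluster name
curl -s -X PUT http://consul-cluster:8500/v1/agent/service/register \
  -d '{"Name": "web", "Port": 80}'

# query the catalog to see where it ended up
curl -s http://consul-cluster:8500/v1/catalog/service/web
```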